USING TWITTER (Southwest v/s Delta Airlines)
Siddharth Marathe Gourang Amrujkar Sanyuja Desai Sujan Srinivas Pavan
664491998 669570815 667526586 667828323 654641988

Abstract– We use sentiment analysis to extract opinions and emotions from text, that is, to find the sentiment a person expresses toward given content. Here the content is in the form of tweets. Twitter is one of the major platforms where people voice their opinions on any subject. Sentiment analysis on Twitter is made more difficult by slang and misspellings. Using TextBlob and the IBM Watson Natural Language Understanding API, we compare sentiment toward Southwest and Delta Airlines.

I. PROBLEM DEFINITION
Since the dawn of the internet age, it has become essential for airline companies to keep an eye on how customers react to their services online.
Since the Internet makes it easy for anyone to go online and share their views and opinions, it has become important for businesses to keep track of this valuable information. Of all the platforms on the web, Facebook and Twitter have been the standout portals for people to share their thoughts and opinions. Given Twitter's limit of 140 characters per tweet and its more than 320 million users, it has become one of the key portals for tracking the popularity of companies and businesses alike. To address this need to understand consumer opinions through social media monitoring, data mining techniques have found increased application in this field.
Sentiment analysis is the data mining technique that has been developed to better understand people's opinions on a wide range of topics across the social web. Sentiment analysis can help a company understand its customers better and discern their opinions about its products and services. It can also help companies keep track of how their products and services are performing in comparison to the competition. Although technology review publications have allowed companies to track the quality of their products on the web, it has also become essential to consider what consumers themselves say about a company's products and services.
This problem of tracking and deriving meaningful insights from a plethora of unstructured data on the web is effectively addressed by sentiment analysis.

II. OBJECTIVES
The primary aim of this project was to extract tweets about Southwest and Delta Airlines from Twitter, based on the hashtags that customers use on the network. The analysis required data from a social media website such as Twitter, from which the relevant attributes were extracted. This dataset of tweets was acquired through the Twitter API, which gives developers a streamlined way to extract content from the network. Once the tweets about the two airlines, Southwest and Delta, and people's comments and opinions about them were gathered, the next step was to perform sentiment analysis to obtain a percentage value of the positivity or negativity of each tweet. Airline companies can study this data to improve their marketing strategies and fare better than their competitors. The keywords associated with these positive and negative tweets were also considered, to better understand the customers.
Airline companies can use this analysis to detect patterns in customer behavior and to collect early feedback about their services or products. Another part of the analysis is to predict the positivity or negativity of unseen tweets using data mining algorithms. A data analyst working for one airline can compare the real-time trends of that airline with those of its competitors to help it stay ahead.

III. SOURCES OF DATA
The primary source of data for the sentiment analysis is Twitter, because the analysis is carried out on users' comments and opinions on the network. Twitter allows tweets to be extracted through its public API (Application Programming Interface). The API is accessed by creating a developer account, after which calls can be made to extract tweets or data matching a specific keyword.
After the Twitter application is created, the API key and API secret are obtained along with the access token and access secret. The data can then be handled more easily with Python, which can be connected to the Twitter API through a library called Tweepy. Tweepy authenticates with the API key, API secret, access token and access secret obtained previously and downloads the data returned by the Twitter API. The data about a specific airline can be extracted from Twitter by using keywords that indicate the subject of the comment. We extracted 1000 tweets for each of the two airlines, Southwest and Delta, for our experiment.
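The extraction step can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: it assumes Tweepy 4.x and an account whose access level still permits the v1.1 search endpoint, and the hashtag, the credentials dictionary and the helper names build_query and fetch_tweets are hypothetical.

```python
def build_query(keyword, exclude_retweets=True):
    """Build a search query for one airline keyword or hashtag.

    The "-filter:retweets" operator asks the search endpoint to omit retweets.
    """
    query = keyword
    if exclude_retweets:
        query += " -filter:retweets"
    return query


def fetch_tweets(query, credentials, limit=1000):
    """Return up to `limit` (created_at, text) pairs matching `query`.

    `credentials` is a hypothetical dict holding the four values obtained
    from the developer account.
    """
    import tweepy  # third-party library, imported lazily so the helper above works without it

    auth = tweepy.OAuth1UserHandler(
        credentials["api_key"],
        credentials["api_secret"],
        credentials["access_token"],
        credentials["access_secret"],
    )
    api = tweepy.API(auth, wait_on_rate_limit=True)
    cursor = tweepy.Cursor(
        api.search_tweets, q=query, lang="en", tweet_mode="extended"
    )
    return [(status.created_at, status.full_text) for status in cursor.items(limit)]
```

Calling fetch_tweets(build_query("#Delta"), credentials) would then return the timestamp and text of each matching tweet; keeping the timestamp is what later makes a time-of-day analysis possible.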
IV. DATA CLEANING
A. Data Preprocessing
The data extracted from Twitter is a set of 2000 tweets containing numerous elements that make the data noisy. The dataset had to be cleaned thoroughly to obtain precise results in our experiment. The tweets contained emojis, web URLs, retweets, repeated unimportant words and many stopwords, none of which contribute meaningfully to the accuracy of the result. These elements are removed using the techniques described below.
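A first cleaning pass of this kind might look like the sketch below. It is an assumption-laden illustration rather than the project's code: the regular expressions, the rule of dropping any tweet that starts with "RT @", and the function names clean_tweet and preprocess are all hypothetical choices.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Anything that is not an ASCII letter, digit, whitespace, '#' or apostrophe.
# This removes emojis and punctuation, but also accented letters.
NON_TEXT_RE = re.compile(r"[^A-Za-z0-9\s#']")


def clean_tweet(text):
    """Strip URLs, @mentions, emojis and punctuation from one tweet."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_TEXT_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def preprocess(tweets):
    """Drop retweets and exact duplicates, then clean what remains."""
    cleaned, seen = [], set()
    for raw in tweets:
        if raw.startswith("RT @"):  # assumed retweet marker
            continue
        text = clean_tweet(raw)
        key = text.lower()
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned
```

Hashtags are kept here because the tweets were collected by hashtag; whether to keep them for sentiment scoring is a design choice.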
B. Normalizing and Tokenizing
We used the Natural Language Toolkit (NLTK), an open source Python library for natural language processing, to normalize and tokenize the text. Its Tweet Tokenizer module was used in the program to clean the tweets, which reduced the size of the data, although it does not delete emojis. The tokenizer split each tweet into pieces called tokens, and punctuation marks were discarded. The text was then normalized, which let our code group occurrences of the same word that were typed differently in different tweets. That is, normalization maps variants of the same word to the same token. For example, USA and U.S.A are the same word. Normalizing the tokens also takes care of accents and diacritics (e.g. Cliché = Cliche) and case folding (e.g. CAT = cat).

C. Stemming and Lemmatization
We use the PorterStemmer module of the NLTK library to reduce the tokens to their stems.
After the tokens are made, PorterStemmer is used to further refine each token by giving us its root (stem). For example, 'Organization', 'Organized' and 'Organizes' all reduce to the same stem (the Porter algorithm maps them to 'organ', which need not be a dictionary word). Every token is reduced to its stem.

D. Removal of StopWords
Stopwords are words that are common in the English language and are repeated very frequently in day-to-day conversation. We can define our own stopword collection using collection frequency, or we can use the predefined collection of words from the NLTK corpus. This step removes the most frequent stopwords, which do not help in identifying the sentiment of the tweet text.
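The normalization and stopword steps can be illustrated with the short sketch below. It uses only the Python standard library as a stand-in for NLTK's tokenizer and stopword corpus, so it is not the project's actual pipeline; the tiny STOPWORDS set and the function names are hypothetical, and stemming is left out.

```python
import string
import unicodedata

# Illustrative subset only; NLTK's English stopword list is much longer.
STOPWORDS = {"i", "am", "you", "are", "the", "a", "an", "is", "to", "and"}


def normalize_token(token):
    """Fold case, strip accents and drop periods so variants share one form."""
    token = token.strip(string.punctuation)
    decomposed = unicodedata.normalize("NFKD", token)
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_accents.replace(".", "").casefold()


def tokens_without_stopwords(text):
    """Split on whitespace, normalize each token and drop stopwords."""
    tokens = (normalize_token(raw) for raw in text.split())
    return [tok for tok in tokens if tok and tok not in STOPWORDS]
```

With this sketch, "U.S.A." and "USA" both become "usa", "Cliché" becomes "cliche" and "CAT" becomes "cat", matching the examples given for normalization.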
Hence, our data is now cleaned for further analysis.

V. SENTIMENT ANALYSIS
We make use of two major Python packages in our analysis:

1. TextBlob. TextBlob is a Python library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification and translation. Every clean tweet is passed through TextBlob, which uses its NaiveBayesAnalyzer to analyze the sentiment of the tweet.
The output for every tweet consists of the percent positivity and percent negativity present in the tweet, together with the overall resulting sentiment (Positive or Negative). These percentages and the class of each tweet are stored in a CSV file or Excel sheet, and are later used for visualization.

2. IBM Watson Natural Language Understanding API. Natural Language Understanding uses natural language processing to analyze semantic features of any text. Given plain text, HTML, or a public URL, Natural Language Understanding returns results for the features you specify. The service cleans HTML before analysis by default, which removes most advertisements and other unwanted content.
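As a rough illustration of how the TextBlob output described above (percent positivity, percent negativity and overall class) might be stored, the sketch below writes one CSV row per tweet. The probability values in the usage note are made-up placeholders rather than results from this study, the column names are assumptions, and label_from_probabilities and write_results are hypothetical helpers.

```python
import csv
import io


def label_from_probabilities(p_pos, p_neg):
    """Assign the overall class from the two probabilities."""
    return "Positive" if p_pos >= p_neg else "Negative"


def write_results(rows, handle):
    """Write (tweet, p_pos, p_neg) tuples to an open text file as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["tweet", "p_pos", "p_neg", "sentiment"])
    for tweet, p_pos, p_neg in rows:
        writer.writerow(
            [
                tweet,
                f"{p_pos:.3f}",
                f"{p_neg:.3f}",
                label_from_probabilities(p_pos, p_neg),
            ]
        )
```

For example, write_results([("great flight", 0.8, 0.2)], buffer) with buffer = io.StringIO() produces a header row followed by one data row; passing a file opened with newline="" would write the same rows to disk.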
IBM Watson Natural Language Understanding works in much the same way as TextBlob, but it is better suited to large amounts of data. Since our dataset is small (2000 tweets), we use the results obtained from TextBlob for our conclusions.

VI. GENERAL TREND ANALYSIS
TextBlob is a high-level library built on top of the NLTK library. First, we call a clean-tweet method to remove links, special characters, etc. from the tweet using simple regular expressions. Then, when the tweet is passed to create a TextBlob object, the TextBlob library processes the text as follows:
1. Tokenize the tweet, i.e., split the words from the body of text.
2. Remove stopwords from the tokens. (Stopwords are commonly used words that are irrelevant in text analysis, such as I, am, you, are, etc.)
3. Do POS (part of speech) tagging of the tokens and select only significant features/tokens such as adjectives, adverbs, etc.
4. Pass the tokens to a sentiment classifier, which classifies the tweet sentiment as positive, negative or neutral by assigning it a polarity between -1.0 and 1.0.

After analyzing all the tweets for both Southwest Airlines and Delta Airlines, we obtained a p_pos value for each tweet. We grouped the p_pos values of both airlines by time of day and then plotted them as line graphs.

The graph for Delta Airlines on a random day shows that most of the positivity values are above 0.5. This is a good sign: customers may have been pleased with the service, on-time arrival, comfort, etc. on the flight. Most of the positive tweets appear in the early mornings and late evenings.

The graph for Southwest Airlines on a random day shows an average positivity below 0.5. Customers may have had complaints regarding service, delays, lack of luggage space, etc. on the flight. Although the trend improves in the late evenings, the airline still has fewer positive tweets.

VII. CONCLUSION
When we observe the trend, we see that the tweets for Delta are normally distributed over the day, whereas the same cannot be said about Southwest.
The positive tweets for Southwest are seen towards the end of the day. Several factors could be affecting this, such as the number of flights each airline operates during the day, or when people tweet, whether right after the flight or towards the end of the day. The positivity rate is very low for Southwest during the majority of the given sample day. Hence, we can conclude that a large number of negative tweets were registered, which resulted in the erratic nature of the trend described above.

VIII. FUTURE SCOPE
This analysis can be improved by implementing sentiment analysis at a deeper level, incorporating the following points:
• Comparing the number of flights with the time of the tweets. This would show how an airline is doing as the day progresses and whether the time of a flight affects the service.
• Analyzing the total sample size, that is, the number of positive, negative and neutral tweets. For example, with three negative tweets and one positive tweet on a given sample day, the ratio of negative tweets turns out to be 0.75. The result is therefore negative even if the number of neutral tweets increases, although on closer inspection it rests on just a 3:1 ratio.
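The sample-size point can be made concrete with a small calculation. The sketch below is illustrative only: the label strings and the function name sentiment_shares are assumptions, and the counts in the usage note are invented to match the 3:1 example rather than taken from the collected data.

```python
from collections import Counter


def sentiment_shares(labels):
    """Report the negative share with and without neutral tweets counted."""
    counts = Counter(labels)
    total = sum(counts.values())
    polar = counts["positive"] + counts["negative"]
    return {
        "n_total": total,
        "n_polar": polar,
        # Share among positive and negative tweets only (neutrals ignored).
        "negative_share_of_polar": counts["negative"] / polar if polar else None,
        # Share among all tweets, neutrals included.
        "negative_share_of_all": counts["negative"] / total if total else None,
    }
```

With three negative, one positive and sixteen neutral tweets, the negative share among polar tweets is 0.75 while the share among all twenty tweets is 0.15, and only four tweets carry any polarity. Reporting the counts alongside the ratio makes that visible.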