(Southwest v/s Delta Airlines)
Siddharth Marathe Gourang Amrujkar
Sanyuja Desai Sujan Srinivas Pavan Rajshekhar
664491998 669570815 667526586 667828323 654641988
Abstract: We use sentiment analysis to extract opinions and emotions from text,
that is, to determine the writer's sentiment toward the content in question.
Here the content takes the form of tweets. Twitter is one of the major platforms
where people voice their opinions on any subject. Sentiment analysis on Twitter
is more difficult than on edited text because of slang and misspellings. Using
TextBlob and the IBM Watson Natural Language Understanding API, we compare
sentiment toward Southwest and Delta Airlines.
I. PROBLEM DEFINITION
Since the dawn of the internet age, it has become essential for airline
companies to keep an eye on how customers react to their services online.
Because the internet makes it easy for anyone to share their views and
opinions, it has become important for businesses to keep track of this valuable
information. Of all the platforms on the web, Facebook and Twitter have been the
standout portals for people to share their thoughts and opinions. With
Twitter's limit of 140 characters per tweet and over 320 million users, it has
become one of the key portals for tracking the popularity of companies and
businesses alike. To address this need to understand consumer opinions through
social media monitoring, data mining techniques have found increased
application in this field. Sentiment analysis is the data mining technique that
has been developed to better understand people's opinions on a wide range of
topics across the social web.
Sentiment analysis could help a company understand its customers better and
discern their opinions about its products and services. It could also help
companies track how their products and services perform against the competition
on the market. Although technology review publications have allowed companies
to track the quality of their products on the web, it has also become essential
to consider consumer opinions about a company's products and services. This
problem of tracking and deriving meaningful insights from a plethora of
unstructured data on the web is effectively addressed by sentiment analysis.
The primary aim of this
project was to extract tweets on Southwest and Delta Airlines from Twitter,
based on the hashtags that customers use on the network.
The analysis required data
from a social media website such as Twitter, from which the relevant attributes
were extracted. This dataset of tweets was acquired through the Twitter API,
which gives developers a streamlined way to extract content from the network.
Once this information on
the two airlines, Southwest and Delta, and people's comments and opinions about
them was gathered, the next step was to perform sentiment analysis to obtain a
percentage value for each tweet in terms of its positivity or negativity.
Airline companies can study this data to improve their marketing strategies and
fare better than their competitors.
The keywords that are
associated with these positive and negative tweets were considered to better
understand the customers. Airline companies can use this analysis to detect
patterns in customer behavior and to collect early feedback about their
services or products. Another part of the analysis is to predict the sentiment
(positive or negative) of unseen tweets using data mining algorithms. A data
analyst working for one airline can compare the real-time trends of that
airline with those of its competitors to stay ahead of the competition.
III. SOURCES OF DATA
The primary source of data for
the sentiment analysis is Twitter, because the analysis relies on users'
comments and opinions on the network. Twitter allows public access to tweets
through its public API (Application Programming Interface). The API is accessed
by creating a developer account, from which calls can be made to extract tweets
or data matching a specific keyword. After the Twitter application is created,
the API key and API secret are obtained along with the access token and access
secret. The returned data can then be processed with Python.
Python can be
connected to the Twitter API with the help of a library called Tweepy. Tweepy
downloads the data from the Twitter API, authenticating with the API key, API
secret, access token, and access secret obtained previously. The data regarding
a specific airline can be extracted from Twitter by searching for keywords that
indicate the subject of the comment. We extracted 1000 tweets for each of the
two airlines, Southwest and Delta, for our experiment.
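
The extraction step can be sketched as follows. This is a minimal sketch, not the project's actual script: it assumes Tweepy 4.x names (`OAuth1UserHandler`, `API.search_tweets`; older releases used `OAuthHandler` and `API.search`), and the hashtag format, the retweet filter, and the function names `build_query` and `fetch_tweets` are illustrative assumptions.

```python
def build_query(airline: str) -> str:
    """Build a hashtag search query that skips retweets (assumed query shape)."""
    # "-filter:retweets" is a standard Twitter search operator.
    return f"#{airline} -filter:retweets"


def fetch_tweets(airline, api_key, api_secret, access_token, access_secret,
                 limit=1000):
    """Return up to `limit` (created_at, text) pairs for one airline."""
    # Third-party package, imported here so build_query works without it.
    import tweepy

    auth = tweepy.OAuth1UserHandler(
        api_key, api_secret, access_token, access_secret
    )
    api = tweepy.API(auth, wait_on_rate_limit=True)
    cursor = tweepy.Cursor(
        api.search_tweets,
        q=build_query(airline),
        lang="en",
        tweet_mode="extended",  # return the untruncated tweet text
    )
    return [(status.created_at, status.full_text)
            for status in cursor.items(limit)]
```

Keeping the timestamp next to the text matters later, because the trend analysis groups sentiment scores by time of day.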
IV. DATA CLEANING
The data extracted from
Twitter is a set of 2000 tweets containing numerous entities that made the data
unclean. The dataset had to be cleaned thoroughly to obtain precise results for
our experiment. The tweets contained unnecessary emojis, web URLs, retweets,
repeated unimportant words, and many stopwords, none of which contribute
meaningfully to the accuracy of our result. These elements were removed using
the techniques described below.
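
A first cleaning pass over the raw text might look like the sketch below. It uses only the standard library; the regular expressions are assumptions chosen for illustration (the emoji pattern covers common pictograph blocks, not every emoji), and `clean_tweet` and `drop_retweets` are hypothetical helper names.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage: common pictograph blocks only, not every emoji.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F]")


def is_retweet(text: str) -> bool:
    """Classic retweets start with the 'RT @user' prefix."""
    return text.startswith("RT @")


def clean_tweet(text: str) -> str:
    """Strip URLs and emojis, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    return " ".join(text.split())


def drop_retweets(tweets):
    """Keep only original tweets, cleaned."""
    return [clean_tweet(t) for t in tweets if not is_retweet(t)]
```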
Normalizing and Tokenizing
We used the Natural Language
Toolkit (NLTK), an open source Python library for natural language processing,
to normalize and tokenize the text. Its TweetTokenizer module was used in the
program to clean the tweets, which helped downsize the data except for the
deletion of emojis. In this step each tweet was chopped into pieces called
tokens, and punctuation marks were thrown away at the same time. The text was
then normalized, which let our code group words that were typed differently in
different tweets; that is, normalization maps variants of the same word to a
single token. For example, USA and U.S.A are treated as the same token.
Normalization also handles accents and diacritics (e.g. Cliché = Cliche) and
case folding (e.g. CAT = cat).
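
The three normalization cases above can be illustrated with a small standard-library sketch. It is not NLTK's implementation; `normalize_token` is a hypothetical helper, and stripping every period is a simplifying assumption that suits acronyms such as U.S.A.

```python
import unicodedata


def normalize_token(token: str) -> str:
    """Map spelling variants of a word to one canonical token."""
    # Drop periods so "U.S.A" and "USA" collapse to one form.
    token = token.replace(".", "")
    # Decompose accented characters, then discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case folding: "CAT" and "cat" become the same token.
    return stripped.casefold()
```

Applied to every token before counting or classification, this ensures that differently typed variants contribute to the same feature.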
Stemming and Lemmatization
We use the PorterStemmer module of the NLTK
library to reduce the tokens to a common base form. After the tokens are made,
PorterStemmer further refines each token by giving us its root (stem); e.g.
'organization', 'organized', and 'organizes' all reduce to the same stem,
'organ'. Every token is reduced to its stem.
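
A minimal sketch of the stemming step, assuming NLTK is installed (the Porter stemmer needs no corpus download). `stem_tokens` is a hypothetical wrapper name. Note that a Porter stem is a truncated form and not necessarily a dictionary word.

```python
def stem_tokens(tokens):
    """Reduce each token to its Porter stem."""
    # Third-party import kept local so this module loads without NLTK.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # Related word forms collapse to one shared stem.
    return [stemmer.stem(token) for token in tokens]
```

For example, `stem_tokens(["organization", "organized", "organizes"])` yields three identical stems, so the three forms are counted as one feature.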
D. Removal of StopWords
Stopwords are words that are common in the English language and are repeated
very frequently in day-to-day conversation. We can define our own stopword
collection using collection frequency, or we can use the predefined collection
of words from the NLTK corpus. Removing them eliminates the most frequent
words, which do not help identify the sentiment of the tweet text. After this
step, our data is clean and ready for further analysis.
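
The collection-frequency option can be sketched with the standard library alone. The function names and the choice of `top_n` are assumptions for illustration; a frequency-derived list should be reviewed by hand, since it can also capture sentiment-bearing words or the airline names themselves.

```python
from collections import Counter


def frequent_words(tokenized_tweets, top_n):
    """Return the `top_n` most frequent tokens across the whole collection."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return {word for word, _ in counts.most_common(top_n)}


def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the stopword set."""
    # The predefined alternative is nltk.corpus.stopwords.words("english"),
    # which requires downloading the NLTK stopwords corpus first.
    return [tok for tok in tokens if tok not in stopwords]
```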
V. SENTIMENT ANALYSIS
We make use of two major Python
packages in our analysis:
1. TextBlob
TextBlob is a Python
library for processing textual data. It provides a simple API for diving into
common natural language processing (NLP) tasks such as part-of-speech tagging,
noun phrase extraction, sentiment analysis, classification, and translation.
Every cleaned tweet is passed through TextBlob, which uses its Naive Bayes
analyzer to analyze the sentiment of the tweet. The output for every tweet
consists of the positivity and negativity scores (p_pos and p_neg) and the
overall resulting sentiment (positive or negative) of the tweet. These scores
and the class of each tweet are stored in a CSV file or Excel sheet and are
later used for visualization.
2. IBM Watson Natural Language Understanding API
Natural Language Understanding uses natural language processing to analyze
semantic features of any text. Provide plain text, HTML, or a public URL, and
Natural Language Understanding returns results for the features you specify.
The service cleans HTML before analysis by default, which removes most
advertisements and other unwanted content. It works much like TextBlob, but the
IBM Watson Natural Language Understanding package is better suited to very
large amounts of data. Since our dataset of 2000 tweets is comparatively small,
we use the results obtained from TextBlob for our conclusions.
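
The TextBlob scoring and CSV export described in this section might be sketched as follows. It assumes TextBlob is installed together with the NLTK movie_reviews corpus that its Naive Bayes analyzer trains on (for example via `python -m textblob.download_corpora`); `score_tweets`, `write_scores`, and the column names are hypothetical.

```python
import csv


def score_tweets(tweets):
    """Yield (text, classification, p_pos, p_neg) for each cleaned tweet."""
    # Third-party imports kept local so the CSV helper works without them.
    from textblob import Blobber
    from textblob.sentiments import NaiveBayesAnalyzer

    # A shared Blobber trains the classifier once instead of once per tweet.
    blobber = Blobber(analyzer=NaiveBayesAnalyzer())
    for text in tweets:
        result = blobber(text).sentiment
        yield text, result.classification, result.p_pos, result.p_neg


def write_scores(rows, handle):
    """Write scored tweets to an open text file handle as CSV."""
    writer = csv.writer(handle)
    writer.writerow(["text", "classification", "p_pos", "p_neg"])
    writer.writerows(rows)
```

A typical call would open the output file with `newline=""` and pass `score_tweets(cleaned_tweets)` as `rows`.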
VI. GENERAL TREND
TextBlob is a high-level library built on top of the NLTK library. First, we
call a clean-tweet method to remove links, special characters, etc. from the
tweet using some simple regex. Then, when we pass the tweet to create a
TextBlob object, the following processing is done on the text by the TextBlob
library:
- Tokenize the tweet, i.e., split words from the body of text.
- Remove stopwords from the tokens. (Stopwords are commonly used words that
  are irrelevant in text analysis, like I, am, you, are, etc.)
- Do POS (part of speech) tagging of the tokens and select only significant
  features/tokens like adjectives, adverbs, etc.
- Pass the tokens to a sentiment classifier, which classifies the tweet
  sentiment as positive, negative, or neutral by assigning it a polarity
  between -1.0 and 1.0.
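
The final classification step can be sketched as a mapping from polarity to a label. The zero thresholds and the names `label_from_polarity` and `tweet_sentiment` are assumptions; `TextBlob(text).sentiment.polarity` is TextBlob's default polarity score in the range -1.0 to 1.0.

```python
def label_from_polarity(polarity: float) -> str:
    """Map a polarity in [-1.0, 1.0] to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def tweet_sentiment(text: str) -> str:
    """Classify one cleaned tweet with TextBlob's default analyzer."""
    from textblob import TextBlob  # third-party, imported on first use

    return label_from_polarity(TextBlob(text).sentiment.polarity)
```

A stricter variant could treat a small band around zero as neutral, which reduces the number of weakly scored tweets labeled as positive or negative.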
After analyzing all the tweets for both Southwest Airlines and Delta Airlines,
we obtained a p_pos value for each tweet. We grouped the p_pos values of both
airlines by time of day and then analyzed them with line graphs, which are as
follows:
This is an example of Delta Airlines
on a random day, where most of the positivity values are above 0.5. This is a
good sign: customers may have been pleased with the service, on-time arrival,
comfort, etc. on the flight. Most of the positive tweets occur in the early
mornings and late evenings.
This is an example of
Southwest Airlines on a random day, where the average positivity value is below
0.5. Customers may have had complaints about service, delays, lack of boot
space, etc. on the flight. Although the trend improves in the late evenings,
the airline still has fewer positive tweets.
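
The time-of-day grouping behind these line graphs can be sketched as an hourly average of p_pos. The record layout (timestamp, p_pos) and the function name `mean_p_pos_by_hour` are assumptions; the resulting mapping can be handed to any plotting library.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def mean_p_pos_by_hour(
    records: Iterable[Tuple[datetime, float]]
) -> Dict[int, float]:
    """Average the p_pos scores of all tweets posted in the same hour."""
    buckets = defaultdict(list)
    for created_at, p_pos in records:
        buckets[created_at.hour].append(p_pos)
    # Sort by hour so the result plots left to right across the day.
    return {hour: sum(vals) / len(vals)
            for hour, vals in sorted(buckets.items())}
```

Hours with very few tweets produce noisy averages, so it helps to report the tweet count per hour alongside the mean.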
When we observe the trend, we see that the tweets for Delta are normally
distributed over the day, whereas the same cannot be said for Southwest.
The positive tweets for
Southwest are seen towards the end of the day. Several factors could be
affecting this, such as the number of flights each airline operates during the
day, or when people tweet (right after the flight or towards the end of the
day). The positivity rate is very low for Southwest during the majority of the
given sample day. Hence, we can conclude that a lot of negative tweets were
registered, which resulted in the erratic nature of the trend.
VIII. FUTURE SCOPE
This analysis can be improved by implementing sentiment analysis at a deeper
level, incorporating the following points:
- Comparing the number of flights with the time of tweets. This shows how an
  airline is doing as the day progresses and whether the time of flight affects
  the service.
- Analyzing the total sample size of positive, negative, and neutral tweets.
  For example, if we have three negative tweets and one positive tweet on a
  given sample day, the ratio of negative tweets will turn out to be 0.75.
  Thus, even if the number of neutral tweets increases, the result will be
  negative, although on closer inspection it is just a 3:1 ratio.
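
The arithmetic in the last point can be made concrete with a short sketch. `negative_share` is a hypothetical helper, and the neutral count in the usage note is an illustrative number, not a result from the dataset.

```python
def negative_share(negative: int, positive: int, neutral: int = 0,
                   include_neutral: bool = False) -> float:
    """Fraction of negative tweets, with or without neutral tweets counted."""
    total = negative + positive + (neutral if include_neutral else 0)
    if total == 0:
        raise ValueError("no tweets to compare")
    return negative / total
```

With three negative tweets and one positive tweet, `negative_share(3, 1)` is 0.75 regardless of how many neutral tweets exist. Counting, say, four neutral tweets in the denominator with `include_neutral=True` lowers the share to 0.375, which is why the total sample size should be reported along with the ratio.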