Twitter (X) Sentiment Analysis

Introduction

The Olympic Games represent not just an athletic spectacle, but a global convergence of culture, politics, and media. Public sentiment about the event is a key element in understanding how the Games, their organization, and their socio-political implications are perceived worldwide.

This analysis was carried out to gauge how Twitter users perceived the Paris 2024 Olympic Games.

Tweets Scraping

I scraped the tweets with the Twikit Python library. This was a bit difficult because Twitter restricts the mining of tweets, but I found my way around it with help from Mehran Shakarami's GitHub repository.

The search parameters included terms such as 'olympics', 'paris2024', 'parisolympics2024', and 'olympics2024', so that the search returned tweets containing them. I also limited the results to tweets written in English and posted between the 16th of July and the 12th of August 2024.

I was able to obtain 9027 rows and 6 columns of data.
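As a rough illustration of how those parameters could be combined, here is a minimal sketch that builds a single search string. It is not the scraper's actual code: the function name `build_search_query` is hypothetical, and it assumes Twitter's advanced-search operators (`OR`, `lang:`, `since:`, `until:`) and that `until:` excludes the day it names.

```python
from datetime import date, timedelta


def build_search_query(terms, start, end, lang="en"):
    """Combine search terms, a language filter, and date bounds into one query string."""
    keywords = " OR ".join(terms)
    # Assumption: `until:` is exclusive, so add one day to keep `end` inside the window.
    until = end + timedelta(days=1)
    return f"({keywords}) lang:{lang} since:{start.isoformat()} until:{until.isoformat()}"


query = build_search_query(
    ["olympics", "paris2024", "parisolympics2024", "olympics2024"],
    start=date(2024, 7, 16),
    end=date(2024, 8, 12),
)
print(query)
```

The resulting string could then be handed to whichever search call the scraper uses.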

Dataframe showing first 5 rows (Usernames blurred for Data Privacy)


Data Cleaning

I cleaned the data by checking for duplicate rows and null values, and by dropping columns irrelevant to the analysis.

819 duplicate rows were found, as well as 1 row with a null value. The columns 'Tweet_count', 'Username', and 'Created At' were also dropped.

This left 8207 rows (9027 - 819 - 1) and 3 columns.
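The cleaning steps can be sketched with pandas on a toy data frame. This is only an illustration under assumptions: the three dropped column names come from the text above, while 'Text', 'Retweets', and 'Likes' are assumed names for the remaining columns, and the rows are made up.

```python
import pandas as pd

# Toy stand-in for the scraped data (made-up rows; 'Text', 'Retweets', 'Likes' are assumed names).
raw = pd.DataFrame({
    "Tweet_count": [1, 2, 2, 3],
    "Username": ["a", "b", "b", "c"],
    "Created At": ["2024-07-26", "2024-07-27", "2024-07-27", "2024-07-28"],
    "Text": ["Great opening!", "So proud", "So proud", None],
    "Retweets": [10, 3, 3, 0],
    "Likes": [50, 7, 7, 1],
})

n_duplicates = int(raw.duplicated().sum())     # rows identical to an earlier row
n_null_rows = int(raw.isna().any(axis=1).sum())  # rows with at least one missing value

clean = (
    raw.drop_duplicates()                       # remove repeated rows
       .dropna()                                # remove rows with missing values
       .drop(columns=["Tweet_count", "Username", "Created At"])
       .reset_index(drop=True)
)
print(n_duplicates, n_null_rows, clean.shape)
```

Counting duplicates and nulls before dropping them makes it easy to check that the final row count adds up.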

Tweets Processing

Due to the nature of the data, I had to remove links, punctuation, emojis, and stop words, as well as usernames and hashtag symbols, using the 'processTweets' function to prepare the data for analysis.

I also used tokenization to split sentences into smaller units (tokens) so that unnecessary elements could be removed.

I then used the 'getAdjectives' function to extract the adjectives from the processed tweets (using the POS tagging module in the nltk library), and the 'preprocessTweetsSentiments' function to lemmatize words, i.e. return them to their base form.

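import re
import string

from emoji import UNICODE_EMOJI  # older releases of the emoji package only; removed in emoji 2.0
from nltk import pos_tag
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Function to remove links, mentions, digits, stop words, emojis, and punctuation.
# It relies on the globals stop_words, emojis, and word_list defined further down,
# which must exist before the function is first called.
def processTweets(tweet):
    tweet = tweet.lower()  # lowercase first so the stop-word and dictionary lookups match
    # Remove urls
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)
    # Remove user @ references, '#' symbols, and digits from tweet
    tweet = re.sub(r'\@\w+|\#|\d+', '', tweet)
    # Remove stopwords, emojis, and tokens that are not English dictionary words
    tweet_tokens = word_tokenize(tweet)  # convert string to tokens
    filtered_words = [w for w in tweet_tokens if w not in stop_words]
    filtered_words = [w for w in filtered_words if w not in emojis]
    filtered_words = [w for w in filtered_words if w in word_list]

    # Remove punctuation tokens
    unpunctuated_words = [w for w in filtered_words if w not in string.punctuation]

    return " ".join(unpunctuated_words)  # join words with a space in between them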


# function to obtain adjectives from tweets
def getAdjectives(tweet):
    tweet = word_tokenize(tweet)  # convert string to tokens
    tweet = [word for (word, tag) in pos_tag(tweet)
             if tag in ["JJ", "JJR", "JJS"]]  # pos_tag module in NLTK library
    return " ".join(tweet)  # join words with a space in between them

# Defining my NLTK stop words and my user-defined stop words.
# Tokens are matched one word at a time, so multi-word entries such as
# 'opening ceremony' never match, and '100m' is reduced to 'm' (digits are
# stripped in processTweets) before the lookup happens.
stop_words = list(stopwords.words('english'))
user_stop_words = ['usa', 'china', 'japan', 'canada', 'nigeria', 'paris', '100m', 'marathon', 'swimming', 'basketball', 'gymnastics',
                   'gold', 'silver', 'bronze', 'olympics', 'games', 'paralympics', 'winter', 'summer', 'ioc',
                   'opening ceremony', 'closing ceremony', 'medal tally', 'world record', 'fifa', 'iaaf', 'go team',
                   'good luck', 'congratulations', 'well done']


alphabets = list(string.ascii_lowercase)
# Sets make the membership tests in processTweets fast
stop_words = set(stop_words + user_stop_words + alphabets)
word_list = set(words.words())  # all words in NLTK's English word list
emojis = set(UNICODE_EMOJI.keys())  # full list of emojis

# function to return words to their base form using Lemmatizer
def preprocessTweetsSentiments(tweet):
    tweet_tokens = word_tokenize(tweet)
    lemmatizer = WordNetLemmatizer()  # instantiate an object of the WordNetLemmatizer class
    lemma_words = [lemmatizer.lemmatize(w) for w in tweet_tokens]
    return " ".join(lemma_words)
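To see what the two regular expressions in processTweets do on their own, here is a small standalone sketch using only the standard library. The helper name `strip_links_mentions_digits` and the sample tweet are made up for illustration; the patterns are copied from the function above.

```python
import re


def strip_links_mentions_digits(tweet):
    """Apply the two substitutions from processTweets in isolation."""
    tweet = tweet.lower()
    # Drop anything that looks like a URL
    tweet = re.sub(r"http\S+|www\S+|https\S+", "", tweet, flags=re.MULTILINE)
    # Drop @mentions, '#' symbols (the hashtag word itself stays), and digits
    tweet = re.sub(r"\@\w+|\#|\d+", "", tweet)
    # Collapse the leftover runs of whitespace
    return " ".join(tweet.split())


sample = "Amazing race by @runner123 in the 100m final! #Paris2024 https://example.com/clip"
cleaned = strip_links_mentions_digits(sample)
print(cleaned)  # amazing race by in the m final! paris
```

Note how '100m' is reduced to a stray 'm' and '#Paris2024' to 'paris'; this is presumably why single letters and 'paris' appear in the stop-word list.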

After applying the above functions to the dataset, this was the resulting data frame:

NB: Empty (null) cells mean that none of the tweet's words made it through the processTweets or the getAdjectives function.

Data Exploration

I extracted the top words (adjectives) associated with the Olympics and their frequency:

# Plotting the top 10 words
word_count.head(10).plot(kind='barh', x='Words', y='Count', legend=False, color='skyblue')
plt.xlabel('Frequency')
plt.ylabel('Words')
plt.title('Top 10 Words in Word Cloud')
plt.gca().invert_yaxis()  # Invert y-axis to show the highest at the top
plt.show()
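The plotting snippet assumes a `word_count` table that is not shown here. One possible way to build such counts, as a sketch using only the standard library (the helper name `count_words` and the sample strings are hypothetical):

```python
from collections import Counter


def count_words(texts):
    """Count whitespace-separated words across an iterable of strings, skipping empty cells."""
    counts = Counter()
    for text in texts:
        if isinstance(text, str):  # ignore None/NaN cells
            counts.update(text.split())
    return counts.most_common()  # list of (word, count), most frequent first


adjectives = ["great proud", "great", None, "sad great proud"]
top = count_words(adjectives)
print(top[:10])  # [('great', 3), ('proud', 2), ('sad', 1)]
```

Passing the result to `pd.DataFrame(top, columns=['Words', 'Count'])` would give a frame with the column names the plot expects.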

I also visualized this using a word cloud:


# Combine all text from the 'Tweets_Adjectives' column into a single string
# (drop null cells first, since str.join fails on NaN values)
text = " ".join(tweets['Tweets_Adjectives'].dropna().astype(str))

# Create and generate a word cloud image
wordcloud = WordCloud(width=800, 
    height=400, 
    max_words=1000, 
    background_color='black', 
    colormap='Blues').generate(text)

# Display the word cloud image
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  # Remove axes
plt.show()

Sentiment Analysis

Now to the main aspect of this project. Using the TextBlob library, I extracted the sentiment of each tweet. TextBlob assigns each piece of text a polarity score and a subjectivity score.

# Create a function to get the subjectivity score
def getsubjectivity(tweet):
    return TextBlob(tweet).sentiment.subjectivity

# Create a function to return the polarity score
def getpolarity(tweet):
    return TextBlob(tweet).sentiment.polarity

# Create a function to obtain the sentiment category
def getsentimentTextBlob(polarity):
    if polarity < 0:
        return "negative"
    elif polarity == 0:
        return "neutral"
    else:
        return "positive"

The polarity score decides which category a tweet falls into. Polarity ranges from -1 to 1: a score below 0 is negative, exactly 0 is neutral, and above 0 is positive. Subjectivity ranges from 0 (objective) to 1 (subjective).

The distribution of the sentiments is shown below:
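To make the thresholds concrete, the sketch below applies the same sign-based rule to a few hand-picked polarity values and computes the share of each label. The values are invented for illustration and are not taken from the dataset; `label_polarity` is a stand-in name that mirrors getsentimentTextBlob.

```python
from collections import Counter


def label_polarity(polarity):
    """Same rule as getsentimentTextBlob: the sign of the polarity picks the label."""
    if polarity < 0:
        return "negative"
    if polarity == 0:
        return "neutral"
    return "positive"


# Hand-picked example scores, not real results
sample_polarities = [-0.6, 0.0, 0.35, 0.0, 0.8]
labels = [label_polarity(p) for p in sample_polarities]

# Percentage of tweets in each category
counts = Counter(labels)
shares = {label: 100 * n / len(labels) for label, n in counts.items()}
print(shares)  # {'negative': 20.0, 'neutral': 40.0, 'positive': 40.0}
```

Because only an exact 0 counts as neutral, even a very faint positive or negative score moves a tweet out of the neutral group.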

Insights

The largest share of tweets about the Olympics was neutral (45%); 40% were positive and 15% were negative.
Tweets on the Olympics garnered a total of 2,856,131 retweets and 21,371,721 likes, an indication of the event's popularity.

NB. These insights are not representative of the Twitter community as a whole, as I only mined a subset of Tweets.
