A Guide to Mining and Analysing Tweets with R

Simple Steps to Writing an Insightful Twitter Analytics Report


Published in Towards Data Science · 10 min read · Sep 30, 2019

Twitter provides vast amounts of user-generated language data, a dream for anyone wanting to conduct textual analysis. More than that, tweets give us insights into online public behaviour. Analysing Twitter has therefore become a crucial source of information for brands and agencies.

Several factors give Twitter considerable advantages over other social media platforms for analysis. First, the limited character length of tweets provides a relatively homogeneous corpus. Second, the millions of tweets published every day give access to large data samples. Third, tweets are publicly available, easily accessible and retrievable via APIs.

Nonetheless, extracting these insights still requires some coding and programming knowledge. This is why brands and agencies most often rely on easy-to-use analytics tools such as SproutSocial and Talkwalker, which provide these insights in just one click, at a cost.

In this article, I help you break down these barriers and provide a simple guide on how to extract and analyse tweets with the programming language R.

Here are 3 reasons why you might choose to do so:

  • R is free, so you can produce a Twitter Analytics Report at no cost and learn how to code at the same time!
  • R gives you infinite opportunities for analysis. Using it to analyse Twitter lets you conduct tailor-made analyses depending on what you wish to study, instead of relying on a one-size-fits-all report.
  • R lets you analyse any Twitter account you want, even if you don't have the log-in details. This is a huge advantage over many analytics tools that require log-in details before you can analyse anything.

Convinced? Let’s get started, then!

In order to get started, you first need Twitter API access. This will allow you to retrieve the tweets; without it, you cannot do anything. Getting access is easy. First make sure you have a Twitter account, otherwise create one. Then, apply for a developer account via the following website: https://developer.twitter.com/en/apply-for-access.html. You'll need to fill in an application form, which includes explaining a little more about what you wish to analyse.

Once your application has been accepted by Twitter (which doesn't take too long), you'll receive the following credentials, which you need to keep safe:

  • Consumer key
  • Consumer Secret
  • Access Token
  • Access Secret

Once you have the information above, start R and download the package “rtweet”, which I will use to extract the tweets.

install.packages("rtweet")
library(rtweet)

Then, set up the authentication to connect to Twitter. You do this by entering the name of your app, your consumer key and your consumer secret: all information you received when applying for Twitter API access. You will be redirected to a Twitter page and asked to accept the authentication. Once this is done, you can return to R and start analysing your tweets!

twitter_token <- create_token(
  app = "****",
  consumer_key = "****",
  consumer_secret = "****",
  set_renv = TRUE)

Searching for tweets

Depending on the analysis you wish to perform, you may want to search for tweets that contain a specific word or hashtag. Note that you can only extract tweets from the past 6 to 9 days, so keep this in mind for your analysis.

To do this, simply use the search_tweets function followed by a few specifications: the number of tweets to extract (n), whether or not to include retweets and the language of the tweets. As an example, see the line of code below.

climate <- search_tweets("climate", n = 1000, include_rts = FALSE, lang = "en")
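search_tweets also accepts hashtags as queries. As a sketch (assuming the same rtweet setup as above), you can search for a hashtag and ask rtweet to wait out rate limits for larger samples:

```r
# Search for a hashtag rather than a plain word; retryonratelimit
# tells rtweet to pause and resume when the API rate limit is hit
climate_tags <- search_tweets("#climatechange", n = 5000,
                              include_rts = FALSE, lang = "en",
                              retryonratelimit = TRUE)
```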

Search for a specific user account

Alternatively, you may want to analyse a specific user account. In this case, use the get_timeline function followed by the twitter handle and number of tweets you wish to extract. Note that here you can only extract the last 3200 tweets.

In this example, I chose to extract the tweets of Bill Gates. The advantage here is that Bill Gates' account has 3,169 tweets overall, which is under the 3,200 threshold.

Gates_tweets <- get_timeline("@BillGates", n = 3200)

In this part, I show you 8 key insights you should include in every Twitter Analytics Report. To do this, let’s delve into the Twitter account of Bill Gates a bit more!

1. SHOW WHAT WORKS BEST AND WHAT DOESN’T

The first part of any report should deliver clear information as to what worked best and what didn’t. Finding out the best and least performing tweets gives a quick and clear overall picture of the account.

In order to do this, you first need to distinguish between organic tweets, retweets and replies. The following lines of code show you how to remove the retweets and replies from your sample to keep only the organic tweets; content-wise, these are the ones you want to analyse!

# Remove retweets
Gates_tweets_organic <- Gates_tweets[Gates_tweets$is_retweet==FALSE, ]
# Remove replies
Gates_tweets_organic <- subset(Gates_tweets_organic, is.na(Gates_tweets_organic$reply_to_status_id))

Then, analyse engagement by looking at two variables: favorite_count (the number of likes) and retweet_count (the number of retweets). Arrange in descending order (with a minus sign before the variable) to find the tweet with the most likes or retweets, or in ascending order (without the minus) to find the least engaging one.

# The pipe (%>%) and arrange() come from dplyr
library(dplyr)
# Tweet with the most likes
Gates_tweets_organic <- Gates_tweets_organic %>% arrange(-favorite_count)
Gates_tweets_organic[1,5]  # column 5 holds the tweet text
# Tweet with the most retweets
Gates_tweets_organic <- Gates_tweets_organic %>% arrange(-retweet_count)
Gates_tweets_organic[1,5]
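If you prefer not to rely on column positions, a more explicit sketch using dplyr (slice_max is available in dplyr 1.0 and later):

```r
library(dplyr)

# Top 5 organic tweets by likes, keeping only the readable columns
Gates_tweets_organic %>%
  slice_max(favorite_count, n = 5) %>%
  select(created_at, text, favorite_count, retweet_count)
```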

2. SHOW THE RATIO OF REPLIES/RETWEETS/ORGANIC TWEETS

Analysing the ratio of replies, retweets and organic tweets can tell you a great deal about the type of account you're analysing. No one likes a Twitter account that exclusively retweets, for instance, without any individual content. A good ratio of replies, retweets and organic tweets is therefore a key metric to monitor if you wish to improve the performance of your account.

As a first step, make sure to create three different datasets. As you've already created a dataset containing only the organic tweets in the previous step, now simply create one containing only the retweets and one containing only the replies.

# Keeping only the retweets
Gates_retweets <- Gates_tweets[Gates_tweets$is_retweet==TRUE,]
# Keeping only the replies
Gates_replies <- subset(Gates_tweets, !is.na(Gates_tweets$reply_to_status_id))

Then, create a separate data frame containing the number of organic tweets, retweets, and replies. These numbers are easy to find: they are the number of observations for your three respective datasets.

# Creating a data frame
data <- data.frame(
  category = c("Organic", "Retweets", "Replies"),
  count = c(2856, 192, 120)
)
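Rather than hard-coding the counts, you can derive them directly from the three datasets you have just created (a sketch using the object names from above):

```r
# Each count is simply the number of rows in the corresponding dataset
data <- data.frame(
  category = c("Organic", "Retweets", "Replies"),
  count = c(nrow(Gates_tweets_organic), nrow(Gates_retweets), nrow(Gates_replies))
)
```

This keeps the chart accurate if you re-extract the tweets later.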

Once you've done that, you can start preparing your data frame for a donut chart as shown below. This includes adding columns that calculate the ratios and percentages, plus some visualisation tweaks such as specifying the legend and rounding your data.
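Note that the round_df() function used in the next code chunk is not part of base R; it is a helper that circulates online. A minimal version you can define yourself:

```r
# Round every numeric column of a data frame to 'digits' decimal places
round_df <- function(df, digits = 2) {
  numeric_cols <- vapply(df, is.numeric, logical(1))
  df[numeric_cols] <- round(df[numeric_cols], digits)
  df
}
```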

# Adding columns
data$fraction = data$count / sum(data$count)
data$percentage = data$count / sum(data$count) * 100
data$ymax = cumsum(data$fraction)
data$ymin = c(0, head(data$ymax, n = -1))
# Rounding the data to two decimal points
data <- round_df(data, 2)
# Specify what the legend should say
Type_of_Tweet <- paste(data$category, data$percentage, "%")
# ggplot2 is needed for the donut chart
library(ggplot2)
ggplot(data, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 3, fill = Type_of_Tweet)) +
  geom_rect() +
  coord_polar(theta = "y") +
  xlim(c(2, 4)) +
  theme_void() +
  theme(legend.position = "right")

3. SHOW WHEN THE TWEETS ARE PUBLISHED

Thanks to the date and time extracted with each tweet, it is easy to see when Bill Gates tweets most. This gives an overall picture of the account's activity and can be a useful metric to set against the most and least performing tweets.

In this example, I analyse the frequency of tweets by year. Note that you can also do so by month by simply changing “year” to “month” in the following line of code. Alternatively, you can also analyse the publishing behaviour by hour with the R packages hms and scales.

colnames(Gates_tweets)[colnames(Gates_tweets) == "screen_name"] <- "Twitter_Account"
ts_plot(dplyr::group_by(Gates_tweets, Twitter_Account), "year") +
  ggplot2::theme_minimal() +
  ggplot2::theme(plot.title = ggplot2::element_text(face = "bold")) +
  ggplot2::labs(
    x = NULL, y = NULL,
    title = "Frequency of Tweets from Bill Gates",
    subtitle = "Tweet counts aggregated by year",
    caption = "\nSource: Data collected from Twitter's REST API via rtweet"
  )
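For the by-hour analysis mentioned above, one simple sketch counts tweets by the hour of their created_at timestamp (using base format() here rather than the hms/scales route; timestamps from rtweet are in UTC):

```r
library(dplyr)
library(ggplot2)

# Count tweets by the hour of day they were posted
Gates_tweets %>%
  mutate(hour = as.integer(format(created_at, "%H"))) %>%
  count(hour) %>%
  ggplot(aes(x = hour, y = n)) +
  geom_col() +
  scale_x_continuous(breaks = seq(0, 23, 2)) +
  labs(x = "Hour of day (UTC)", y = "Number of tweets",
       title = "Tweets from Bill Gates by hour of day")
```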

4. SHOW FROM WHERE THE TWEETS ARE PUBLISHED

Analysing the source platform from which tweets are published is another useful insight. One reason is that we can, to a certain extent, deduce whether or not Bill Gates is the one tweeting. This helps us define the personality of the tweets.

In this step, you're interested in the source variable collected by the rtweet package. The following lines of code show you how to aggregate this data by type of source and count the frequency of tweets for each type. Note that I have only kept the sources with more than 11 tweets to simplify the visualisation.

Gates_app <- Gates_tweets %>%
  select(source) %>%
  group_by(source) %>%
  summarize(count = n())
Gates_app <- subset(Gates_app, count > 11)

Once this is done, the process is similar to the donut chart already created previously!

data <- data.frame(
  category = Gates_app$source,
  count = Gates_app$count
)
data$fraction = data$count / sum(data$count)
data$percentage = data$count / sum(data$count) * 100
data$ymax = cumsum(data$fraction)
data$ymin = c(0, head(data$ymax, n = -1))
data <- round_df(data, 2)
Source <- paste(data$category, data$percentage, "%")
ggplot(data, aes(ymax = ymax, ymin = ymin, xmax = 4, xmin = 3, fill = Source)) +
  geom_rect() +
  coord_polar(theta = "y") + # Try removing this to see how the chart is built initially
  xlim(c(2, 4)) +
  theme_void() +
  theme(legend.position = "right")

Note that most of the tweets from Bill Gates originate from Twitter Web Client, Sprinklr and Hootsuite — an indication that Bill Gates is most likely not the one tweeting himself!

5. SHOW THE MOST FREQUENT WORDS FOUND IN THE TWEETS

A Twitter Analytics Report should of course include an analysis of the content of the tweets and this includes finding out which words are used most.

Because you're analysing textual data, clean it first by removing any characters you don't want in your analysis, such as hyperlinks, @-mentions and punctuation. The lines of code below provide basic cleaning steps for tweets.

# Remove hyperlinks
Gates_tweets_organic$text <- gsub("https\\S*", "", Gates_tweets_organic$text)
# Remove @ mentions
Gates_tweets_organic$text <- gsub("@\\S*", "", Gates_tweets_organic$text)
# Remove the "amp" left over from &amp;
Gates_tweets_organic$text <- gsub("amp", "", Gates_tweets_organic$text)
# Remove line breaks
Gates_tweets_organic$text <- gsub("[\r\n]", "", Gates_tweets_organic$text)
# Remove punctuation
Gates_tweets_organic$text <- gsub("[[:punct:]]", "", Gates_tweets_organic$text)

As a second step, make sure to remove stop words from the text. This is important for your analysis of the most frequent words, as you don't want the most commonly used words such as "to" or "and" to appear; these don't carry much meaning for your analysis.

# unnest_tokens() and the stop_words dataset come from tidytext
library(tidytext)
tweets <- Gates_tweets_organic %>%
  select(text) %>%
  unnest_tokens(word, text)
tweets <- tweets %>%
  anti_join(stop_words)

You can then plot the most frequent words found in the tweets by following the simple steps below.

# Bar chart of the most frequent words found in the tweets
tweets %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(y = "Count",
       x = "Unique words",
       title = "Most frequent words found in the tweets of Bill Gates",
       subtitle = "Stop words removed from the list")

6. SHOW THE MOST FREQUENTLY USED HASHTAGS

You can do the same analysis with the hashtags. In this case, you’ll want to use the hashtags variable from the rtweet package. A nice way to visualise these is using a word cloud as shown below.

# wordcloud() comes from the wordcloud package, which also
# provides brewer.pal() via RColorBrewer
library(wordcloud)
Gates_tweets_organic$hashtags <- as.character(Gates_tweets_organic$hashtags)
Gates_tweets_organic$hashtags <- gsub("c\\(", "", Gates_tweets_organic$hashtags)
set.seed(1234)
wordcloud(Gates_tweets_organic$hashtags, min.freq = 5, scale = c(3.5, .5),
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))

7. SHOW THE ACCOUNTS FROM WHICH MOST RETWEETS ORIGINATE

Retweeting extensively from a single account is usually not what you want to see in a Twitter account. A helpful insight is therefore to monitor from which accounts most retweets originate. The variable to analyse here is retweet_screen_name, and the visualisation process is similar to the word cloud described above.

set.seed(1234)
wordcloud(Gates_retweets$retweet_screen_name, min.freq = 3, scale = c(2, .5),
          random.order = FALSE, rot.per = 0.25,
          colors = brewer.pal(8, "Dark2"))

8. PERFORM A SENTIMENT ANALYSIS OF THE TWEETS

Finally, you may want to add a sentiment analysis at the end of your Twitter Analytics Report. This is easy to do with the package “syuzhet” and allows you to further deepen your analysis by grasping the tone of the tweets. No one likes a Twitter account that only spreads angry or sad tweets. Capturing the tone of your tweets and how they balance out is a good indication of your account’s performance.

library(syuzhet)
# Start again from the cleaned tweet text (the 'tweets' object from
# section 5 holds single words, not whole tweets)
# Converting tweets to ASCII to tackle strange characters
tweets <- iconv(Gates_tweets_organic$text, from = "UTF-8", to = "ASCII", sub = "")
# Removing retweets, in case needed
tweets <- gsub("(RT|via)((?:\\b\\w*@\\w+)+)", "", tweets)
# Removing mentions, in case needed
tweets <- gsub("@\\w+", "", tweets)
ew_sentiment <- get_nrc_sentiment(tweets)
sentimentscores <- data.frame(colSums(ew_sentiment[,]))
names(sentimentscores) <- "Score"
sentimentscores <- cbind("sentiment" = rownames(sentimentscores), sentimentscores)
rownames(sentimentscores) <- NULL
ggplot(data = sentimentscores, aes(x = sentiment, y = Score)) +
  geom_bar(aes(fill = sentiment), stat = "identity") +
  theme_minimal() +
  theme(legend.position = "none") +
  xlab("Sentiments") + ylab("Scores") +
  ggtitle("Total sentiment based on scores")

In this article, I aimed to show how to extract and analyse tweets using the free programming language R. I hope this guide helps you build your own Twitter Analytics Report, which includes:

  • Showing which tweets worked best and which didn’t
  • The ratio of organic tweets/replies/retweets, the time of tweet publication and the platforms from which tweets are published. These are all insights regarding the tweeting behaviour.
  • The most frequent words used in the tweets, hashtags, from which accounts most retweets originate and a sentiment analysis capturing the tone of the tweets. These are all insights on the content of the tweets.

I regularly write articles about Data Science and Natural Language Processing. Follow me on Twitter or Medium to check out more articles like these or simply to keep updated about the next ones!
