While we might not be Twitter fans, we have to admit that it has a huge influence on the world (who doesn't know about Trump's tweets?). Twitter data is not only gold in terms of insights; Twitter storms are also available for analysis in near real time. This means we can learn about the big waves of thoughts and moods around the world as they arise.
Like any place filled with riches, Twitter has security guards blocking us from laying our hands on the data right away ⛔️ Some (really straightforward) authentication steps are needed to call its APIs for data collection. Since our goal today is learning to extract insights from data, we have already gotten a green pass from security ✅ Our data is ready for use in the datasets folder, so we can concentrate on the fun part! 🕵️♀️🌎
Note: Here is the documentation for this call, and here is a full overview of Twitter's APIs.
# Loading json module
import json
# Loading WW_trends and US_trends data
WW_trends = json.loads(open('datasets/WWTrends.json').read())
US_trends = json.loads(open('datasets/USTrends.json').read())
# Inspecting data by printing out WW_trends and US_trends variables
print(WW_trends)
print(US_trends)
%%nose
# One or more tests of the student's code
# The @solution should pass the tests
# The purpose of the tests is to try to catch common errors and
# to give the student a hint on how to resolve these errors
def test_json_loaded():
    assert 'json' in globals(), \
        'Did you import the json module?'
def test_WW_trends_correctly_loaded():
    correct_ww_trends = json.loads(open('datasets/WWTrends.json').read())
    assert correct_ww_trends == WW_trends, "The variable WW_trends should contain the data in WWTrends.json."
def test_US_trends_correctly_loaded():
    correct_us_trends = json.loads(open('datasets/USTrends.json').read())
    assert correct_us_trends == US_trends, "The variable US_trends should contain the data in USTrends.json."
Our data was hard to read! Luckily, we can resort to the json.dumps() method to have it formatted as a pretty JSON string.
# Pretty-printing the results. First WW and then US trends.
print("WW trends:")
print (json.dumps(WW_trends, indent=1))
print("\n", "US trends:")
print (json.dumps(US_trends, indent=1))
%%nose
# %%nose needs to be included at the beginning of every @tests cell
# Not sure what to check here
# One or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
def strip_comment_lines(cell_input):
    """Returns cell input string with comment lines removed."""
    return '\n'.join(line for line in cell_input.splitlines() if not line.startswith('#'))
last_input = strip_comment_lines(In[-2])
def test_info_command():
    assert 'json.dumps(WW_trends' in last_input, \
        "We expected the json.dumps method with the correct input object."
    assert 'json.dumps(US_trends' in last_input, \
        "We expected the json.dumps method with the correct input object."
🕵️♀️ From the pretty-printed results (output of the previous task), we can observe that:
We have an array of trend objects, each with: the name of the trending topic, the query parameter that can be used to search for the topic on Twitter-Search, the search URL, and the volume of tweets for the last 24 hours, if available. (The trends get updated every 5 minutes.)
At query time, #BeratKandili, #GoodFriday and #WeLoveTheEarth were trending worldwide.
"tweet_volume" tells us that #WeLoveTheEarth was the most popular among the three.
Results are not sorted by "tweet_volume".
There are some trends which are unique to the US.
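Since the results are not sorted by "tweet_volume", one way to rank them is to sort the trend objects ourselves. The following is a minimal sketch on made-up trend objects: the names reuse the hashtags above, but the volumes are invented for illustration. It also assumes that an unavailable volume shows up as None in the loaded data, and treats it as 0 for ranking purposes.

```python
# Made-up trend objects mirroring the fields described above;
# the volumes are illustrative, not taken from the real dataset.
sample_trends = [
    {"name": "#GoodFriday", "tweet_volume": 90000},
    {"name": "#BeratKandili", "tweet_volume": None},  # volume not available
    {"name": "#WeLoveTheEarth", "tweet_volume": 150000},
]

def sort_by_volume(trends):
    """Return trends sorted by tweet_volume, highest first; None counts as 0."""
    return sorted(trends, key=lambda t: t["tweet_volume"] or 0, reverse=True)

ranked_names = [t["name"] for t in sort_by_volume(sample_trends)]
print(ranked_names)
```

With the real data, the same helper could be applied to WW_trends[0]['trends'] or US_trends[0]['trends'].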
It’s easy to skim through the two sets of trends and spot common trends, but let's not do "manual" work. We can use Python’s set data structure to find common trends — we can iterate through the two trends objects, cast the lists of names to sets, and call the intersection method to get the common names between the two sets.
# Extracting all the WW trend names from WW_trends
world_trends = set([trend['name']
for trend in WW_trends[0]['trends']])
# Extracting all the US trend names from US_trends
us_trends = set([trend['name']
for trend in US_trends[0]['trends']])
# Let's get the intersection of the two sets of trends
common_trends = world_trends.intersection(us_trends)
# Inspecting the data
print(world_trends, "\n")
print(us_trends, "\n")
print (len(common_trends), "common trends:", common_trends)
%%nose
# One or more tests of the students code.
# The @solution should pass the tests.
# The purpose of the tests is to try to catch common errors and to
# give the student a hint on how to resolve these errors.
def test_ww_trends():
    correct_world_trends = set([trend['name'] for trend in WW_trends[0]['trends']])
    assert world_trends == correct_world_trends, \
        'The variable world_trends does not have the expected trend names.'
def test_us_trends():
    correct_us_trends = set([trend['name'] for trend in US_trends[0]['trends']])
    assert us_trends == correct_us_trends, \
        'The variable us_trends does not have the expected trend names.'
def test_common_trends():
    correct_common_trends = world_trends.intersection(us_trends)
    assert common_trends == correct_common_trends, \
        'The variable common_trends does not have the expected common trend names.'
🕵️♀️ From the intersection (last output) we can see that, out of the two sets of trends (each of size 50), we have 11 overlapping topics. In particular, there is one common trend that sounds very interesting: #WeLoveTheEarth — so good to see that Twitteratis are unanimously talking about loving Mother Earth! 💚
Note: We could have had no overlap or a much higher overlap; when we ran the query to get the trends, people in the US could have been on fire about topics only relevant to them.
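Besides the intersection, Python sets can also tell us which topics trend in only one place, and how large the overlap is relative to one of the sets. A minimal sketch on two small, made-up sets of trend names (sample_world and sample_us are hypothetical values, not the real datasets):

```python
# Hypothetical trend names, for illustration only
sample_world = {"#WeLoveTheEarth", "#GoodFriday", "#BeratKandili"}
sample_us = {"#WeLoveTheEarth", "#GoodFriday", "#NBAPlayoffs"}

common = sample_world & sample_us      # same result as .intersection()
only_us = sample_us - sample_world     # same result as .difference()
share_common = len(common) / len(sample_us)

print(sorted(common))
print(sorted(only_us))
print(round(share_common, 2))
```

Applied to the real sets, us_trends - world_trends would list the US-only topics.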
We have found a hot-trend, #WeLoveTheEarth. Now let's see what story it is screaming to tell us!
If we query Twitter's search API with this hashtag as the query parameter, we get back actual tweets related to it. We have the response from the search API stored in the datasets folder as 'WeLoveTheEarth.json'. So let's load this dataset and do a deep dive into this trend.
# Loading the data
tweets = json.loads(open('datasets/WeLoveTheEarth.json').read())
# Inspecting some tweets
tweets[0:2]
%%nose
# %%nose needs to be included at the beginning of every @tests cell
# One or more tests of the student's code
# The @solution should pass the tests
# The purpose of the tests is to try to catch common errors and
# to give the student a hint on how to resolve these errors
def test_tweets_loaded_correctly():
    correct_tweets_data = json.loads(open('datasets/WeLoveTheEarth.json').read())
    assert correct_tweets_data == tweets, "The variable tweets should contain the data in WeLoveTheEarth.json."
🕵️♀️ Printing the first two tweet items makes us realize that there’s a lot more to a tweet than what we normally think of as a tweet — there is a lot more than just a short text!
But hey, let's not get overwhelmed by all the information in a tweet object! Let's focus on a few interesting fields and see if we can find any hidden insights there.
# Extracting the text of all the tweets from the tweet object
texts = [tweet['text']
for tweet in tweets ]
# Extracting screen names of users mentioned in tweets about #WeLoveTheEarth
names = [user_mention['screen_name']
for tweet in tweets
for user_mention in tweet['entities']['user_mentions']]
# Extracting all the hashtags being used when talking about this topic
hashtags = [hashtag['text']
for tweet in tweets
for hashtag in tweet['entities']['hashtags']]
# Inspecting the first 10 results
print (json.dumps(texts[0:10], indent=1),"\n")
print (json.dumps(names[0:10], indent=1),"\n")
print (json.dumps(hashtags[0:10], indent=1),"\n")
%%nose
# %%nose needs to be included at the beginning of every @tests cell
# One or more tests of the student's code
# The @solution should pass the tests
# The purpose of the tests is to try to catch common errors and
# to give the student a hint on how to resolve these errors
def test_extracted_texts():
    correct_text = [tweet['text'] for tweet in tweets]
    assert texts == correct_text, \
        'The variable texts does not have the expected text data.'
def test_extracted_names():
    correct_names = [user_mention['screen_name']
                     for tweet in tweets
                     for user_mention in tweet['entities']['user_mentions']]
    assert correct_names == names, \
        'The variable names does not have the expected user names.'
def test_extracted_hashtags():
    correct_hashtags = [hashtag['text']
                        for tweet in tweets
                        for hashtag in tweet['entities']['hashtags']]
    assert correct_hashtags == hashtags, \
        'The variable hashtags does not have the expected hashtag data.'
🕵️♀️ Just from the first few results of the last extraction, we can deduce that:
Observing the first 10 items of the interesting fields gave us a sense of the data. We can now take a closer look by doing a simple, but very useful, exercise — computing frequency distributions. Starting simple with frequencies is generally a good approach; it helps in getting ideas about how to proceed further.
# Importing modules
from collections import Counter
# Counting occurrences / getting the frequency distribution of all names and hashtags
for item in [names, hashtags]:
    c = Counter(item)
    # Inspecting the 10 most common items in c
    print(c.most_common(10), "\n")
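Note that the loop above reuses the variable c, so once it finishes only the counts for the last list remain. If we want to keep both distributions around for later use, one option is to store one Counter per field in a dictionary. A minimal sketch with made-up sample lists (sample_names and sample_hashtags are hypothetical stand-ins for names and hashtags):

```python
from collections import Counter

# Made-up values standing in for the extracted names and hashtags
sample_names = ["user_a", "user_b", "user_a", "user_c"]
sample_hashtags = ["WeLoveTheEarth", "EarthDay", "WeLoveTheEarth"]

# One Counter per field, so neither distribution is overwritten
counters = {
    "names": Counter(sample_names),
    "hashtags": Counter(sample_hashtags),
}

for field, counter in counters.items():
    print(field, counter.most_common(2))
```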
%%nose
# %%nose needs to be included at the beginning of every @tests cell
# One or more tests of the student's code
# The @solution should pass the tests
# The purpose of the tests is to try to catch common errors and
# to give the student a hint on how to resolve these errors
def test_counter():
    for item in [names, hashtags]:
        correct_counter = Counter(item)
    assert c == correct_counter, \
        "The variable c does not have the expected values."
🕵️♀️ Based on the last frequency distributions we can further build-up on our deductions:
We have been able to extract so many insights. Quite powerful, isn't it?!
Let's further analyze the data to find patterns in the activity around the tweets — did all retweets occur around a particular tweet?
If a tweet has been retweeted, the 'retweeted_status' field gives many interesting details about the original tweet itself and its author.
We can measure a tweet's popularity by analyzing the retweet_count and favorite_count fields. But let's also extract the number of followers of the tweeter. We have a lot of celebs in the picture, so can we tell if their advocating for #WeLoveTheEarth influenced a significant proportion of their followers?
Note: The retweet_count gives us the total number of times the original tweet was retweeted. It should be the same in both the original tweet and all subsequent retweets. Tinkering with some sample tweets and reading the official documentation are the way to get your head around the many fields.
# Extracting useful information from retweets
retweets = [
(tweet['retweet_count'],
tweet['retweeted_status']['favorite_count'],
tweet['retweeted_status']['user']['followers_count'],
tweet['retweeted_status']['user']['screen_name'],
tweet['text'])
for tweet in tweets
if 'retweeted_status' in tweet
]
%%nose
# %%nose needs to be included at the beginning of every @tests cell
# One or more tests of the student's code
# The @solution should pass the tests
# The purpose of the tests is to try to catch common errors and
# to give the student a hint on how to resolve these errors
def test_retweets():
    correct_retweets = [
        (tweet['retweet_count'],
         tweet['retweeted_status']['favorite_count'],
         tweet['retweeted_status']['user']['followers_count'],
         tweet['retweeted_status']['user']['screen_name'],
         tweet['text'])
        for tweet in tweets
        if 'retweeted_status' in tweet
    ]
    assert correct_retweets == retweets, \
        "The retweets variable does not have the expected values. Check the names of the extracted field and their order."
Let's manipulate the data further and visualize it in a better and richer way — "looks matter!"
# Importing modules
import matplotlib.pyplot as plt
import pandas as pd
# Visualizing the data in a pretty and insightful format
df = pd.DataFrame(
retweets,
columns=['Retweets','Favorites','Followers','ScreenName','Text']).groupby(
['ScreenName','Text','Followers']).sum().sort_values(by=['Followers'], ascending=False)
df.style.background_gradient()
%%nose
# %%nose needs to be included at the beginning of every @tests cell
# One or more tests of the student's code
# The @solution should pass the tests
# The purpose of the tests is to try to catch common errors and
# to give the student a hint on how to resolve these errors
def strip_comment_lines(cell_input):
    """Returns cell input string with comment lines removed."""
    return '\n'.join(line for line in cell_input.splitlines() if not line.startswith('#'))
last_input = strip_comment_lines(In[-2])
def test_df_creation_command():
    assert 'retweets' in last_input, \
        "The input data for DataFrame creation is not as expected."
    assert 'columns=' in last_input, \
        "The columns parameter is missing."
    assert 'groupby' in last_input, \
        "The groupby method is missing."
    assert 'sum()' in last_input, \
        "The sum method is missing."
    assert 'sort_values' in last_input, \
        "The sort_values method is missing."
    assert 'ascending' in last_input, \
        "The ascending parameter is missing."
def test_dataframe():
    correct_dataframe = pd.DataFrame(retweets, columns=['Retweets','Favorites', 'Followers', 'ScreenName', 'Text']).groupby(
        ['ScreenName','Text','Followers']).sum().sort_values(by=['Followers'], ascending=False)
    assert correct_dataframe.equals(df), \
        "The created dataframe does not match the expected dataframe."
🕵️♀️ Our table tells us that:
The large differences in reactions could be explained by the fact that this was Lil Dicky's music video. Leo still got more traction than Katy or Ellen because he played some major role in this initiative.
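To relate reactions to audience size, as we asked earlier, we could divide the retweets by the number of followers. A minimal sketch, assuming the same column names as in our table and using invented numbers (the screen names, texts and counts below are hypothetical):

```python
import pandas as pd

# Invented rows in the same (retweets, favorites, followers, name, text) order
sample_retweets = [
    (1000, 5000, 200000, "user_a", "RT @user_a: sample text"),
    (300, 900, 3000000, "user_b", "RT @user_b: sample text"),
]
sample_df = pd.DataFrame(
    sample_retweets,
    columns=["Retweets", "Favorites", "Followers", "ScreenName", "Text"],
)

# Retweets per follower: a rough proxy for the share of followers who reacted
sample_df["RetweetsPerFollower"] = sample_df["Retweets"] / sample_df["Followers"]
ranked = sample_df.sort_values(by="RetweetsPerFollower", ascending=False)
print(ranked[["ScreenName", "RetweetsPerFollower"]])
```

This ratio is only a rough indicator, since retweets can also come from people who do not follow the author.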
Can we find some more interesting patterns in the data? From the text of the tweets, we could spot different languages, so let's create a frequency distribution for the languages.
# Extracting language for each tweet and appending it to the list of languages
tweets_languages = []
for tweet in tweets:
    tweets_languages.append(tweet['lang'])
# Plotting the distribution of languages
%matplotlib inline
plt.hist(tweets_languages)
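An alternative to the histogram call above is to count the language codes explicitly and draw a bar chart, which lets us control the order of the bars. A minimal sketch, using made-up language codes as a stand-in for tweets_languages (sample_languages is hypothetical):

```python
from collections import Counter
import matplotlib.pyplot as plt

# Made-up language codes standing in for tweets_languages
sample_languages = ["en", "en", "es", "pt", "en", "es", "und"]

# most_common() returns (language, count) pairs, most frequent first
language_counts = Counter(sample_languages).most_common()
labels = [lang for lang, count in language_counts]
counts = [count for lang, count in language_counts]

# Bar chart with one bar per language, ordered by frequency
fig, ax = plt.subplots()
ax.bar(labels, counts)
ax.set_xlabel("Language code")
ax.set_ylabel("Number of tweets")
```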
%%nose
# %%nose needs to be included at the beginning of every @tests cell
# One or more tests of the student's code
# The @solution should pass the tests
# The purpose of the tests is to try to catch common errors and
# to give the student a hint on how to resolve these errors
def test_tweet_languages():
    correct_tweet_languages = []
    for tweet in tweets:
        correct_tweet_languages.append(tweet['lang'])
    assert correct_tweet_languages == tweets_languages, \
        "The tweets_languages variable does not have the expected values."
last_value = _
def test_plot_exists():
    assert type(last_value) == type(plt.hist(tweets_languages)), \
        'A plot was not the last output of the code cell.'
def strip_comment_lines(cell_input):
    """Returns cell input string with comment lines removed."""
    return '\n'.join(line for line in cell_input.splitlines() if not line.startswith('#'))
last_input = strip_comment_lines(In[-2])
def test_plot_command():
    assert 'plt.hist(tweets_languages)' in last_input, \
        "We expected the plt.hist() method in your input."
🕵️♀️ The last histogram tells us that:
Why is this sort of information useful? Because it can help us understand the "category" of people interested in this topic (clustering). We could also analyze the device type used by the Twitterati, tweet['source'], to answer questions like, "Does owning an Apple device rather than an Android one influence people's propensity towards this trend?". I will leave that as a further exercise for you!
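As a starting point for that exercise, here is a minimal sketch. It assumes that tweet['source'] holds a small HTML anchor whose link text names the client (for example "Twitter for iPhone"); the sample strings below are made up, and the helper name client_name is hypothetical.

```python
import re
from collections import Counter

# Made-up 'source' values in the assumed HTML-anchor format
sample_sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
]

def client_name(source):
    """Strip HTML tags from a source string, leaving only the link text."""
    return re.sub(r"<[^>]+>", "", source).strip()

# Frequency distribution of clients across the sample tweets
device_counts = Counter(client_name(s) for s in sample_sources)
print(device_counts.most_common())
```

On the real data, the generator could iterate over tweet['source'] for each tweet in tweets instead of sample_sources.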
What an exciting journey it has been! We started almost clueless, and here we are, rich in insights.
From location-based comparisons to analyzing the activity around a tweet to finding patterns in languages and devices, we have covered a lot today. Let's give ourselves a well-deserved pat on the back! ✋
# Congratulations!
print("High Five!!!")
%%nose
# %%nose needs to be included at the beginning of every @tests cell
def test_nothing_task_10():
    assert True, "Nothing to test."