Monday, 2 March 2015

Data Analysis using Twitter API and Python

As the title suggests, I'll be working with the Twitter Search API to fetch some tweets based on a search parameter and try to extract some insights from the data received.

To work with the Twitter API, I'll be using the "tweepy" library in Python (http://www.tweepy.org/), which makes it a lot easier to access and use the Twitter API.

I have added comments throughout the code (wherever required) to make it easier to understand. I'll be using the "pandas" library in Python (http://pandas.pydata.org/) to convert the returned JSON into a DataFrame that we can work on and analyze.

I am using the IPython Notebook server on AWS, which I created in my last post.

So let's begin...


Process

1. The first step is to import the required libraries. We'll need three of them: "tweepy", "pandas" and "matplotlib.pyplot".
"tweepy" is used to access the Twitter API; "pandas" is used to analyze the data received; and "pyplot" is used to plot the analyzed data, making it easier to understand.


# Import the required libraries.
import tweepy
import pandas as pd
import matplotlib.pyplot as plt


2. I'll also use a short line of code to make the graphs/plots look a bit prettier than normal.


# Make the graphs prettier
pd.set_option('display.mpl_style', 'default')
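
Note: the 'display.mpl_style' option was deprecated and later removed from pandas, so the line above may raise an error on newer versions. A rough equivalent, assuming matplotlib 1.4 or newer, is matplotlib's built-in style support:

# Alternative for newer pandas/matplotlib versions
plt.style.use('ggplot')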


3. Now it's time to go to apps.twitter.com, create a new app and get the Consumer Key and Consumer Secret from that app. You can search online if you have any difficulties doing so, but it is a very straightforward process.
We will assign these keys to their respective variables and connect to the Twitter API using them along with "tweepy".


consumerKey = '<Your Consumer Key>'
consumerSecret = '<Your Consumer Secret Key>'

# Use tweepy.OAuthHandler to create an authentication using the given key and secret
auth = tweepy.OAuthHandler(consumer_key=consumerKey, consumer_secret=consumerSecret)

#Connect to the Twitter API using the authentication
api = tweepy.API(auth)
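
Note: depending on how your app is configured, Twitter may also require a user Access Token for API requests. If the search calls below fail with an authentication error, generate an Access Token and Access Token Secret on the same apps.twitter.com page and register them before creating the API object (the variable names here are placeholders):

# Optional: register the access token if your app requires user authentication
accessToken = '<Your Access Token>'
accessTokenSecret = '<Your Access Token Secret>'
auth.set_access_token(accessToken, accessTokenSecret)
api = tweepy.API(auth)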


4. Now we will perform a search operation on the Twitter feed to get some data. For the purpose of this tutorial, I am going to search for "#Oscars2015", as it was a recent and very popular event that many people tweeted about throughout the world.


# Perform a basic search query, looking for '#Oscars2015' in the tweets
result = api.search(q='%23Oscars2015') # %23 is the URL-encoded form of '#'

# Check the number of items returned by the search query to verify our query ran. It's 15 by default
len(result)
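
To sanity-check what came back, we can also print the text of the first few tweets (just an illustration):

# Peek at the text of the first three tweets returned
for tweet in result[:3]:
    print tweet.text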



5. The next step is to study the data in one tweet, to see its structure and find out what information can be used for analysis.


tweet = result[0]  # Get the first tweet in the result

# Analyze the data in one tweet to see what we require
for param in dir(tweet):
    # Attribute names beginning with '_' are internal ones and usually not required, so we'll skip them
    if not param.startswith("_"):
        print "%s : %s\n" % (param, getattr(tweet, param))



6. I have not displayed the output above, as a lot of information is returned about a single tweet and I want to save space. It's pretty easy to understand, though.
Each tweet is in JSON format, and there is a lot of information about the author of the tweet, the text of the tweet, the location, etc.
It's always a good idea to go through the data once to understand it, and then try to utilize it for analysis. At least that is what I tend to do.
Now that we know what a tweet's data looks like, we can fetch a good amount of it for the purpose of analysis. For this I'll download the first 5000 tweets based on the same search query (#Oscars2015), as it was one of the trending topics on Twitter. I'll make use of the tweepy "Cursor" for this.


results = []

#Get the first 5000 items based on the search query
for tweet in tweepy.Cursor(api.search, q='%23Oscars2015').items(5000):
    results.append(tweet)

# Verify the number of items returned
print len(results)
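
One thing to keep in mind: the Search API is rate-limited, so pulling 5000 tweets in one go can fail with a "429 Too Many Requests" error. Newer versions of tweepy can pause automatically until the limit resets; a minimal sketch, assuming such a version, is to recreate the API object like this:

# Optional: make tweepy sleep when the rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True)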



7. Now I have my data, but as it is in JSON format it will be somewhat hard to use for data analysis. We have to separate out only the data which is required for the analysis.
To do so, we'll use the pandas library of Python. This is one of the handiest libraries for a Data Scientist/Analyst, as it helps represent and analyze data in a sophisticated and user-friendly manner.


# Create a function to convert a given list of tweets into a Pandas DataFrame.
# The DataFrame will consist of only the values which I think might be useful for analysis.


def toDataFrame(tweets):

    DataSet = pd.DataFrame()

    DataSet['tweetID'] = [tweet.id for tweet in tweets]
    DataSet['tweetText'] = [tweet.text for tweet in tweets]
    DataSet['tweetRetweetCt'] = [tweet.retweet_count for tweet in tweets]
    DataSet['tweetFavoriteCt'] = [tweet.favorite_count for tweet in tweets]
    DataSet['tweetSource'] = [tweet.source for tweet in tweets]
    DataSet['tweetCreated'] = [tweet.created_at for tweet in tweets]

    DataSet['userID'] = [tweet.user.id for tweet in tweets]
    DataSet['userScreen'] = [tweet.user.screen_name for tweet in tweets]
    DataSet['userName'] = [tweet.user.name for tweet in tweets]
    DataSet['userCreateDt'] = [tweet.user.created_at for tweet in tweets]
    DataSet['userDesc'] = [tweet.user.description for tweet in tweets]
    DataSet['userFollowerCt'] = [tweet.user.followers_count for tweet in tweets]
    DataSet['userFriendsCt'] = [tweet.user.friends_count for tweet in tweets]
    DataSet['userLocation'] = [tweet.user.location for tweet in tweets]
    DataSet['userTimezone'] = [tweet.user.time_zone for tweet in tweets]

    return DataSet

# Pass the tweets list to the above function to create a DataFrame
DataSet = toDataFrame(results)



8. Let's verify our DataFrame by looking at the first 5 and last 2 records in it.


# Let's check the top 5 records in the Data Set
DataSet.head(5)






# Similarly let's check the last 2 records in the Data Set
DataSet.tail(2)





9. OK, now we have our data in an organized and easy-to-understand form. It's time to look at it and do some cleaning.
This step is required so that we only use valid values/records in our data and discard the invalid ones, as they have no significance and could have a negative impact on our results or analysis.
While going through this DataFrame, I saw that some values in the 'userTimezone' column are 'None'. I plan to use this column to see in which time zones people were most active in tweeting about the 2015 Oscars. For this, I first have to remove the records with 'None' values, as they are of no use to me.



# 'None' is treated as null here, so I'll remove all the records having 'None' in their 'userTimezone' column
DataSet = DataSet[DataSet.userTimezone.notnull()]

# Let's also check how many records are we left with now
len(DataSet)
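
An equivalent way to do the same cleaning, if you prefer pandas' dropna(), is:

# Drop all records whose 'userTimezone' value is null
DataSet = DataSet.dropna(subset=['userTimezone'])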


10. We can see that after removing the records with 'None' values, I am left with 3376 records out of my initial 5000.
These records are valid ones and we can use them for our analysis.
The next step is to count the number of tweets in each time zone, which can be done with the following single line of code (that is the beauty of "pandas").



# Count the number of tweets in each time zone and get the first 10
tzs = DataSet['userTimezone'].value_counts()[:10]
print tzs
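
If you'd rather see this distribution as percentages (purely optional), value_counts() can normalize the counts:

# Same top 10, expressed as a percentage of the remaining tweets
print DataSet['userTimezone'].value_counts(normalize=True)[:10] * 100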



11. Here I am only using the top 10 time zones, which is why the result was sliced with [:10] above.
Now I have the top 10 time zones from which the most tweets about Oscars 2015 were generated (based on the 3376 records remaining from the initial 5000 after cleaning). These top 10 are shown in table format above. But a table is not always the best way of representing data, so we'll plot it as a bar graph to make it more presentable and understandable.



# Create a bar-graph figure of the specified size
plt.rcParams['figure.figsize'] = (15, 5)

# Plot the Time Zone data as a bar-graph
tzs.plot(kind='bar')


# Assign labels and title to the graph to make it more presentable
plt.xlabel('Timezones')
plt.ylabel('Tweet Count')
plt.title('Top 10 Timezones tweeting about #Oscars2015')
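
In the IPython Notebook the plot renders inline (assuming '%matplotlib inline' or a similar mode is enabled); if you run this as a plain script instead, display or save the figure explicitly:

# Show the plot in a window, or save it to a file instead
plt.show()
# plt.savefig('top10_timezones.png')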




Analysis

From the graph above, we can see that the most tweets about the 2015 Oscars came from the Rome time zone (Central European Time), followed by Pacific Time (US & Canada) and then Eastern Time (US & Canada).

This is good to know, as it shows how active people in Europe are about the Oscars. Even more active than the time zone in which the Oscars ceremony itself took place, i.e., Pacific Time.

It is also kind of surprising to me, because I thought the Hollywood stars would still be posting tweets while attending the Oscars ceremony, to keep the media and their fans updated. But it seems the people in Rome/Europe are even more excited about the Oscars than Hollywood itself... ;)


Remarks

This was just an attempt at explaining how to use the Twitter Search API with Python and perform some data analysis. I just took "#Oscars2015" as an example, but you can try something different and more useful.

Feel free to drop any comments or suggestions, or any other topic you might want me to write a post on, and I'll try to answer as soon as possible.

Hope you guys enjoyed this short tutorial. Enjoy and Happy Tweeting!!

Comments:

  1. I'm trying to use Tweepy in IPython Notebook but it won't import: https://stackoverflow.com/questions/35006515/installed-tweepy-but-unable-to-import-into-ipython-notebook. Suggestions?

    Replies
    1. Can you try installing tweepy from the IPython Notebook itself? You can run shell commands in the notebook using '!'. Just prepend any command that you would run from the terminal with '!'. Example:
      !sudo pip install tweepy
  2. Thanks for the good article.
    I can't get 5000 tweets; I get error 429. Is there a way to get 5000 tweets? Thanks.
  3. Boss, I do not have any basic knowledge about APIs. Can you suggest sites I can refer to for very basic knowledge (starting from level 0), and if possible videos too?
  4. Hi,
    excellent work.
    But I have one doubt regarding the query.

    In most cases, people search with a single word. But if I search with two or more words as a single unit, how can I perform that?

    api.search(q='%23Oscars2015')

    In my case: if I use a combined search parameter like q="Delta Airlines", I get tweets, but they do not contain "Delta Airlines" as a single search key.

    My result: @Delta free idea. For valued customers how about a text message for when the flight is about to board.

    Expected search results if I search on the Twitter site (I need these results):

    Apparently someone's phone battery (not sure which kind yet) just exploded on my Delta flight 2557 o.O Seems like everyone's alright, though

    That's a new one: A Delta flight attendant just implored everyone not to use their Samsung Galaxy Note 7s in flight

    How can I achieve this?