I have had a project in mind where I would download all the tweets sent to celebrities over the last year, run a sentiment analysis on them, and evaluate who has the most positive fans.
Then I discovered that you can retrieve Twitter mentions for at most the last 7 days using tweepy/the Twitter API. I scoured the net but couldn't find any way to download tweets going back a full year.
Anyway, I decided to do the project on the last 7 days' data only and wrote the following code:
try:
    while True:
        for results in tweepy.Cursor(twitter_api.search, q="@celebrity_handle").items(9999999):
            item = results.text.encode('utf-8').strip()
            wr.writerow([item, results.created_at])  # write to a csv (tweet, date)
except tweepy.TweepError as e:
    print(e)
I am using the Cursor with the search API because the other way to get mentions (the more accurate one) is limited to retrieving only the last 800 tweets.
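That 800-tweet route looks roughly like the sketch below (I'm assuming the mentions timeline endpoint here, reusing the same twitter_api and csv writer wr as above; note it only returns mentions of the account you authenticate as):

# Rough sketch: the mentions timeline reaches back ~800 tweets at most and
# only covers mentions of the authenticated account.
for status in tweepy.Cursor(twitter_api.mentions_timeline, count=200).items(800):
    wr.writerow([status.text.encode('utf-8').strip(), status.created_at])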
Anyway, after running the code overnight, I was able to download only 32K tweets, and around 90% of them were retweets.
Is there a better, more efficient way to get mentions data?
Do keep in mind that any suggestions would be welcome; at the current moment, I am out of ideas.
I would use the search API. I did something similar with the following code, and it appears to have worked exactly as expected. I used it on a specific movie star and pulled 15,568 tweets, all of which, on a quick scan, appear to be @mentions of them. (I pulled from their entire timeline.)
In your case, for a search you'd want to run, say, daily, I'd store the id of the last mention you pulled for each user and set that value as "sinceId" each time you rerun the search (see the sketch after the setup code below).
As an aside, AppAuthHandler is much faster than OAuthHandler and you won't need user authentication for these kinds of data pulls.
import tweepy

# App-only auth: no user login needed for these read-only data pulls.
auth = tweepy.AppAuthHandler(consumer_token, consumer_secret)
auth.secure = True
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
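Here's a rough sketch of that sinceId bookkeeping, reusing the api object above; the state-file name and handle are just placeholders:

import json
import os

STATE_FILE = 'since_ids.json'  # placeholder file that remembers progress per handle

def load_since_ids():
    # Return the saved {handle: newest_id} map, or an empty one on the first run.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {}

def save_since_ids(since_ids):
    with open(STATE_FILE, 'w') as f:
        json.dump(since_ids, f)

since_ids = load_since_ids()
handle = '@celebrity_handle'  # placeholder

newest_id = since_ids.get(handle)
for tweet in tweepy.Cursor(api.search, q=handle + ' -filter:retweets',
                           since_id=newest_id, count=100).items():
    newest_id = max(newest_id or 0, tweet.id)
    # ...store the tweet however you like...

if newest_id:
    since_ids[handle] = newest_id
    save_since_ids(since_ids)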
searchQuery = '@username'
This is what we're searching for. In your case, I would make a list and iterate through all of the usernames in each pass of the search run.
retweet_filter = '-filter:retweets'
This filters out retweets.
Inside each api.search call below, I would pass the following in as the query parameter:
q=searchQuery+retweet_filter
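For example, the list idea could look something like this (the handles here are made up):

# Sketch: run the same search for several handles, filtering out retweets each time.
handles = ['@celebrity_one', '@celebrity_two', '@celebrity_three']

for handle in handles:
    query = handle + ' ' + retweet_filter
    for tweet in tweepy.Cursor(api.search, q=query, count=100).items(1000):
        print(tweet.id, tweet.created_at, tweet.text)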
The following code (and the API setup above) is from this link:
import jsonpickle

tweetsPerQry = 100  # this is the max the API permits
fName = 'tweets.txt'  # we'll store the tweets in a text file

# If results from a specific ID onwards are required, set sinceId to that ID.
# Otherwise default to no lower limit and go back as far as the API allows.
sinceId = None

# If results only below a specific ID are required, set max_id to that ID.
# Otherwise default to no upper limit and start from the most recent tweet
# matching the search query.
max_id = -1

# However many you want to limit your collection to. How much storage space do you have?
maxTweets = 10000000

tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
    while tweetCount < maxTweets:
        try:
            if (max_id <= 0):
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("No more tweets found")
                break
            for tweet in new_tweets:
                f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
                        '\n')
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error occurs
            print("some error : " + str(e))
            break

print("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))