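Someone wants to collect a large dataset of tweets containing the keyword "IAmsterdam" (Amsterdam's city marketing campaign). They tried both Twitter's Streaming API and its REST API, but neither returned a dataset large enough for sentiment classification. The source code is as follows:
REST API code: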
from TwitterSearch import *
import time
import sys
import codecs
#change to more convenient output type (utf-8)
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
try:
    tso = TwitterSearchOrder()
    tso.set_keywords(['IAmsterdam'])
    tso.set_language('en')
    ts = TwitterSearch(
        consumer_key = '6ZnWpt6HZ1kOVSEjfFwUnLia6',
        consumer_secret = .... ,
        access_token = '2815625730-qSoq6TWyfzqpPJvY71DNAQwGUAfoQu23KgLcPg1',
        access_token_secret = ...
    )
    sleep_for = 60 # sleep for 60 seconds
    last_amount_of_queries = 0 # used to detect when new queries are done
    for tweet in ts.search_tweets_iterable(tso):
        print( '@%s tweeted: %s' % ( tweet['user']['screen_name'], tweet['text'] ) )
        current_amount_of_queries = ts.get_statistics()[0]
        if not last_amount_of_queries == current_amount_of_queries:
            last_amount_of_queries = current_amount_of_queries
            time.sleep(sleep_for)
except TwitterSearchException as e:
    print(e)
Streaming API code:
import time, sys, codecs
#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
#Variables that contain the user credentials to access the Twitter API
access_token = "2815625730-qSoq6TWyfzqpPJvY71DNAQwGUAfoQu23KgLcPg1"
access_token_secret = ....
consumer_key = "6ZnWpt6HZ1kOVSEjfFwUnLia6"
consumer_secret = ....
#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        try:
            #print data
            # crude extraction: take the raw JSON between the "text" and "source" fields
            tweet = data.split(',"text":"')[1].split('","source')[0]
            print(tweet)
            saveThis = str('') + tweet # saves only the tweet text; no timestamp is prepended
            saveFile = open('amsiams6415.txt','a')
            saveFile.write(saveThis)
            saveFile.write('\n')
            saveFile.close()
            return True
        except BaseException as e:
            print('failed ondata, ' + str(e))
            time.sleep(5)
    def on_error(self, status):
        print(status)
if __name__ == '__main__':
    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    #This line filters the Twitter stream to capture data by the keyword: 'IAmsterdam'
    stream.filter(track=['IAmsterdam'], languages=['en'])
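A side note on the listener above: splitting the raw payload on ',"text":"' fails whenever a message has no text field, and it leaves JSON escapes such as \" or \n in the saved text. Below is a minimal sketch of a more robust extraction, assuming each payload is one JSON object as delivered to on_data. The helper name extract_tweet_text is hypothetical and not part of tweepy.

```python
import json


def extract_tweet_text(raw):
    """Return the tweet text from one raw streaming message, or None.

    Hypothetical helper: it only assumes that `raw` is a single JSON
    object, which is what the listener's on_data receives.
    """
    raw = raw.strip()
    if not raw:
        # Blank keep-alive lines carry no payload.
        return None
    try:
        message = json.loads(raw)
    except ValueError:
        # Truncated or otherwise malformed payload.
        return None
    if not isinstance(message, dict):
        return None
    # Non-tweet messages (for example limit or delete notices) have no "text".
    return message.get("text")
```

Inside on_data this could be used as tweet = extract_tweet_text(data), skipping the write when the result is None. Under Python 2 the returned value is a unicode string, so the output file would need to be opened with an explicit encoding, for example codecs.open('amsiams6415.txt', 'a', 'utf-8').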
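Solution:
The standard Search (REST) API only returns tweets from roughly the last week, and the Streaming API only delivers tweets posted while the connection is open, so a rarely used keyword yields few results from either. Options for obtaining a larger dataset: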
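- Use a search service such as Topsy or Gnip, which gives access to historical tweet data. These services are usually paid, but they can supply large volumes of data.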
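- Use Twitter's archive service. It is free, but note that the archive download covers only the tweets of the account that requests it rather than all public tweets, and it does not provide real-time data.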
- Use Twitter's Firehose API, which provides a real-time stream of all public tweets. Access is paid.
The approaches above can be used to build a large dataset of tweets on a rare topic. Which one to choose depends on your requirements and budget.
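Whichever source is chosen, collecting a rare keyword may involve several runs or more than one source, so the same tweet can end up in the data more than once. This matters for sentiment classification because duplicates skew the class counts. Below is a minimal sketch of merging such batches, assuming each tweet is a dict parsed from Twitter's JSON and carries an id_str field. The helper merge_tweets is hypothetical and is a complementary step, not part of any of the services listed above.

```python
def merge_tweets(*batches):
    """Merge batches of tweet dicts, keeping the first copy of each tweet ID.

    Hypothetical helper: tweets without an "id_str" field are dropped
    because they cannot be deduplicated reliably.
    """
    seen = set()
    merged = []
    for batch in batches:
        for tweet in batch:
            tweet_id = tweet.get("id_str")
            if tweet_id is None or tweet_id in seen:
                continue
            seen.add(tweet_id)
            merged.append(tweet)
    return merged
```

For example, merge_tweets(rest_results, stream_results) would return one list in which a tweet seen by both the REST search and the stream appears only once; rest_results and stream_results are placeholder names for lists of parsed tweets.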