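I have been writing a Python script that connects to the Twitter Firehose and sends the data downstream for processing. It had been working fine, but now I want to pull out only the text of each message. (This is not a question about how to extract data from Twitter or how to encode/decode ASCII characters.)
When I launch the script directly like this: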
python -u fetch_script.py
it works fine and I can see the messages appear on screen. For example:
Cuz I'm checking you out >on Facebook<
RT @SearchlightNV: #BarryLies has crapped on all honest patriotic hard-working citizens in the USA but his abuse of WWII Vets is sick #2A…
"Why do men chase after women? Because they fear death."~Moonstruck
RT @SearchlightNV: #BarryLies has crapped on all honest patriotic hard-working citizens in the USA but his abuse of WWII Vets is sick #2A…
Never let anyone tell you not to chase your dreams. My sister came home crying today, because someone told her she's not good enough.
"I can't even ask anyone out on a date because if it doesn't end up in a high speed chase, I get bored."
RT @ColIegeStudent: Double-checking the attendance policy while still in bed
Well I just handed my life savings to ya.. #trustingyou #abouttomakebankkkkk
Zillow $Z and Redfin useless to Wells Fargo Home Mortgage, $WFC, and FannieMae $FNM. Sale history LTV now 48%, $360 appraisal fee 4 no PMI.
The latest Dump and Chase Podcast http://somedomain.com/viaRSA9W3i check it out and subscribe on iTunes, or your favorite android app #Isles
However, if I try to redirect the output to a file, like this:
python -u fetch_script.py >fetch_output.txt
it immediately throws an error:
ERROR:tornado.application:Uncaught exception, closing connection.
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tornado/iostream.py", line 341, in wrapper
    callback(*args)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 331, in wrapped
    raise_exc_info(exc)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/streaming/twitter-stream.py", line 203, in parse_json
    self.parse_response(response)
  File "/usr/local/streaming/twitter-stream.py", line 226, in parse_response
    self._callback(response)
  File "fetch_script.py", line 57, in callback
    print msg['text']
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 139: ordinal not in range(128)
ERROR:tornado.application:Exception in callback <functools.partial object at 0x187c2b8>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tornado/ioloop.py", line 458, in _run_callback
    callback()
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 331, in wrapped
    raise_exc_info(exc)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tornado/iostream.py", line 341, in wrapper
    callback(*args)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 331, in wrapped
    raise_exc_info(exc)
  File "/usr/local/lib/python2.7/dist-packages/tornado/stack_context.py", line 302, in wrapped
    ret = fn(*args, **kwargs)
  File "/usr/local/streaming/twitter-stream.py", line 203, in parse_json
    self.parse_response(response)
  File "/usr/local/streaming/twitter-stream.py", line 226, in parse_response
    self._callback(response)
  File "fetch_script.py", line 57, in callback
    print msg['text']
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 139: ordinal not in range(128)
P.S. A little more context: the error happens in the callback function:
def callback(self, message):
    if message:
        msg = message
        msg_props = pika.BasicProperties()
        msg_props.content_type = 'application/text'
        msg_props.delivery_mode = 2
        #print self.count
        print msg['text']
        #self.count += 1
        ...
However, if I remove ['text'] and leave only print msg, both cases work like a charm.
Solution
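In Python 2, sys.stdout is given an encoding when output goes to a console, but not when it is redirected to a file. In that case sys.stdout.encoding is None and print falls back to the ascii codec. The following script reproduces the problem: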
import sys

msg = {'text': u'\u2026'}  # U+2026, the ellipsis character from the traceback above
sys.stderr.write('default encoding: %s\n' % sys.stdout.encoding)
print msg['text']
Running the script above with its output redirected shows the error:
$ python bad.py >/tmp/xxx
default encoding: None
Traceback (most recent call last):
  File "bad.py", line 5, in <module>
    print msg['text']
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 0: ordinal not in range(128)
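The failure does not actually need a redirect to show up: it is the implicit ASCII encode that raises. The sketch below isolates that step. The sample string and the helper name encode_or_error are made up for illustration, and the code is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: reproduce the implicit ASCII encode without any redirection.
text = u'#2A\u2026'  # hypothetical sample ending in U+2026


def encode_or_error(value, encoding):
    """Return (encoded_bytes, None) on success or (None, error_message) on failure."""
    try:
        return value.encode(encoding), None
    except UnicodeEncodeError as exc:
        return None, str(exc)


ascii_bytes, ascii_error = encode_or_error(text, 'ascii')
utf8_bytes, utf8_error = encode_or_error(text, 'utf-8')

print(ascii_error)       # the "ordinal not in range(128)" message
print(repr(utf8_bytes))  # UTF-8 has a byte sequence for U+2026
```

This also suggests why print msg on the whole dict succeeds: in Python 2 the dict's repr writes non-ASCII characters as backslash escapes such as \u2026, so nothing outside ASCII ever reaches stdout.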
Adding an explicit encoding to the script above:
import sys

msg = {'text': u'\u2026'}
sys.stderr.write('default encoding: %s\n' % sys.stdout.encoding)
# Fall back to UTF-8 when stdout has no encoding (e.g. when redirected).
encoding = sys.stdout.encoding or 'utf-8'
print msg['text'].encode(encoding)
solves the problem:
$ python good.py >/tmp/xxx
default encoding: None
$ cat /tmp/xxx
…
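The "stream encoding or UTF-8" fallback can be pulled into a small helper so the callback does not repeat it. This is only a sketch: encode_for and RedirectedStream are hypothetical names, and RedirectedStream merely imitates a redirected Python 2 stdout whose encoding is None.

```python
# -*- coding: utf-8 -*-


def encode_for(stream, text, fallback='utf-8'):
    """Encode text with the stream's encoding, or with fallback if it has none."""
    encoding = getattr(stream, 'encoding', None) or fallback
    return text.encode(encoding)


class RedirectedStream(object):
    """Hypothetical stand-in for a redirected Python 2 stdout (encoding is None)."""
    encoding = None


class Utf16Stream(object):
    """Hypothetical stream that reports its own encoding."""
    encoding = 'utf-16-le'


redirected = encode_for(RedirectedStream(), u'\u2026')
explicit = encode_for(Utf16Stream(), u'\u2026')
print(repr(redirected))
print(repr(explicit))
```

In the Python 2 callback this would presumably be used as print encode_for(sys.stdout, msg['text']). Note that it still fails if the stream reports an encoding that cannot represent the character, such as ASCII.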
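So the problem can be solved by explicitly giving stdout an encoding in the script. In Python 2, sys.stdout.encoding is a read-only attribute, so assigning to it raises a TypeError; wrap the stream in a UTF-8 writer instead.
Here is a modified version of the script:
import sys
import codecs

msg = {'text': u'\u2026'}
# sys.stdout.encoding is read-only in Python 2, so wrap the stream instead.
sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
print msg['text']
Running this script no longer raises an encoding error, and the output is redirected to the file correctly.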
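An alternative that does not depend on stdout at all is to have the script open the output file itself with an explicit encoding. The following is a minimal sketch under that assumption. The helper name append_text and the temporary output path are hypothetical. io.open is used because it accepts an encoding argument on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def append_text(path, text):
    """Append one message per line to path, always encoded as UTF-8."""
    with io.open(path, 'a', encoding='utf-8', newline='\n') as handle:
        handle.write(text + u'\n')


# Hypothetical output location, used only for this demonstration.
output_path = os.path.join(tempfile.mkdtemp(), 'fetch_output.txt')
append_text(output_path, u'check it out\u2026')

# Read the file back as raw bytes to confirm what was written.
with io.open(output_path, 'rb') as handle:
    raw = handle.read()
print(repr(raw))
```

If changing the script is not an option, Python also honors the PYTHONIOENCODING environment variable, so running PYTHONIOENCODING=utf-8 python -u fetch_script.py >fetch_output.txt should give stdout a UTF-8 encoding even when redirected.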