NLTK Essentials Study Notes (Part 2)
night李 2018-01-31 23:51:04
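Python basics:
The dictionary is another of the most commonly used data structures. Other languages call it an associative array or associative memory. A dictionary is indexed by key, and the key can be any immutable type; strings and numbers are the usual choices.
Python's dictionary is one implementation of a hash table. It is very easy to work with, and its strength is that a few lines of code are enough to build quite complex data structures.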
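To make the two points above concrete, here is a minimal sketch of my own (the keys and the document names are made up for illustration). It shows that immutable values such as strings, numbers and tuples work as keys, that a mutable list does not, and that nesting dictionaries gives a richer structure in very little code.

```python
# Keys must be hashable: strings, numbers and tuples of hashables qualify.
counts = {"python": 2, 42: "answer", ("holy", "grail"): 1}

# A mutable list cannot serve as a key; Python raises TypeError.
try:
    counts[["holy", "grail"]] = 1
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

# Nested dictionaries: a per-document word count (hypothetical documents).
index = {
    "doc1": {"monty": 1, "python": 2},
    "doc2": {"holy": 1, "grail": 1},
}
python_in_doc1 = index["doc1"]["python"]

print(list_key_allowed)
print(python_in_doc1)
```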
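Example: use a dictionary to count how often each word appears in a text:
mystring="Monty Python! And the holy Grail !\n "
word_freq={}
# split() with no arguments splits on any run of whitespace
for tok in mystring.split():
    if tok in word_freq:
        # token seen before: increment its count
        word_freq[tok]+=1
    else:
        # first occurrence: start the count at 1
        word_freq[tok]=1
print(word_freq)
Output: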
{'holy': 1, 'the': 1, 'Python!': 1, '!': 1, 'Grail': 1, 'And': 1, 'Monty': 1}
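Notice that the output above counts 'Python!' with its punctuation attached and treats the lone '!' as a word. The following is only a sketch of one way to tidy that up, assuming we want case-insensitive counts with surrounding punctuation removed; the `normalise` helper is my own name, and `collections.Counter` stands in for the hand-written loop.

```python
from collections import Counter
import string

mystring = "Monty Python! And the holy Grail !\n "

def normalise(token):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return token.strip(string.punctuation).lower()

# Normalise every token, then drop the ones that end up empty (the lone "!").
words = [normalise(tok) for tok in mystring.split()]
words = [w for w in words if w]

# Counter does the same bookkeeping as the if/else loop.
word_freq = Counter(words)
print(word_freq)
```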
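Getting started with NLTK:
The book starts with a simple crawler example that downloads the text of the Python home page:
import urllib.request
# Open the URL and read the whole response body (bytes in Python 3)
response=urllib.request.urlopen('http://python.org/')
html=response.read()
print(len(html))
This differs from the book: I am on Python 3.5, where the urllib2 package is no longer available, so urllib.request is used instead.
Output: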
48907
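A slightly more defensive version of the download step might look like the sketch below. The helper name `fetch_text`, the 10-second timeout and the UTF-8 fallback are my own assumptions, not something from the book. The demo uses a `data:` URL so the snippet runs without network access; swapping in 'http://python.org/' should fetch the real page.

```python
import urllib.request

def fetch_text(url, timeout=10.0):
    """Fetch a URL and return its body decoded to str.

    The helper name and the default timeout are illustrative choices.
    """
    # The context manager closes the response when the block exits.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # bytes in Python 3
        # Use the charset from the Content-Type header, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")

# Offline stand-in for the real page.
sample_url = "data:text/html;charset=utf-8,%3Ch1%3EWelcome%20to%20Python.org%3C%2Fh1%3E"
page = fetch_text(sample_url)
print(len(page))
```

Returning str rather than bytes means the later tokenising steps can use ordinary string patterns directly.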
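Next comes a round of exploratory data analysis (EDA). For text, EDA can mean many things; here it covers only one simple case: finding the dominant terms of the document, that is, which words make up the text and how often they appear.
For the text crawled from the Python home page, the first step is to get rid of the HTML tags. One approach is to use a regular expression to pick out only the tokens made of letters and digits and collect them in a list.
Version 1: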
import urllib.request
response=urllib.request.urlopen('http://python.org/')
html=response.read()
#print(len(html))
# Split the raw bytes on whitespace; the tags are still present at this point
tokens=html.split()
print("Total no of tokens: "+str(len(tokens)))
print(tokens[0:100])
Output:
Total no of tokens: 2932
[b'<!doctype', b'html>', b'<!--[if', b'lt', b'IE', b'7]>', b'<html', b'class="no-js', b'ie6', b'lt-ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'7]>', b'<html', b'class="no-js', b'ie7', b'lt-ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'IE', b'8]>', b'<html', b'class="no-js', b'ie8', b'lt-ie9">', b'<![endif]-->', b'<!--[if', b'gt', b'IE', b'8]><!--><html', b'class="no-js"', b'lang="en"', b'dir="ltr">', b'<!--<![endif]-->', b'<head>', b'<meta', b'charset="utf-8">', b'<meta', b'http-equiv="X-UA-Compatible"', b'content="IE=edge">', b'<link', b'rel="prefetch"', b'href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">', b'<meta', b'name="application-name"', b'content="Python.org">', b'<meta', b'name="msapplication-tooltip"', b'content="The', b'official', b'home', b'of', b'the', b'Python', b'Programming', b'Language">', b'<meta', b'name="apple-mobile-web-app-title"', b'content="Python.org">', b'<meta', b'name="apple-mobile-web-app-capable"', b'content="yes">', b'<meta', b'name="apple-mobile-web-app-status-bar-style"', b'content="black">', b'<meta', b'name="viewport"', b'content="width=device-width,', b'initial-scale=1.0">', b'<meta', b'name="HandheldFriendly"', b'content="True">', b'<meta', b'name="format-detection"', b'content="telephone=no">', b'<meta', b'http-equiv="cleartype"', b'content="on">', b'<meta', b'http-equiv="imagetoolbar"', b'content="false">', b'<script', b'src="/static/js/libs/modernizr.js"></script>', b'<link', b'href="/static/stylesheets/style.css"', b'rel="stylesheet"', b'type="text/css"', b'title="default"', b'/>', b'<link', b'href="/static/stylesheets/mq.css"', b'rel="stylesheet"', b'type="text/css"', b'media="not', b'print,', b'braille,']
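Version 2:
import urllib.request
import re
response=urllib.request.urlopen('http://python.org/')
html=response.read()
# Decode the bytes to str so that a str regex pattern can be applied
html=html.decode('utf-8')
# Split on runs of non-word characters; only letters, digits and underscores survive
tokens=re.split(r'\W+',html)
print(len(tokens))
print(tokens[0:100])
Output: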
6221
['', 'doctype', 'html', 'if', 'lt', 'IE', '7', 'html', 'class', 'no', 'js', 'ie6', 'lt', 'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '7', 'html', 'class', 'no', 'js',
'ie7', 'lt', 'ie8', 'lt', 'ie9', 'endif', 'if', 'IE', '8', 'html', 'class', 'no', 'js', 'ie8', 'lt', 'ie9', 'endif', 'if', 'gt', 'IE', '8',
'html', 'class', 'no', 'js', 'lang', 'en', 'dir', 'ltr', 'endif', 'head', 'meta', 'charset', 'utf', '8', 'meta', 'http', 'equiv', 'X', 'UA', 'Compatible', 'content',
'IE', 'edge', 'link', 'rel', 'prefetch', 'href', 'ajax', 'googleapis', 'com', 'ajax', 'libs', 'jquery', '1', '8', '2',
'jquery', 'min', 'js', 'meta', 'name', 'application', 'name', 'content', 'Python', 'org', 'meta', 'name', 'msapplication', 'tooltip', 'content', 'The', 'official']
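One detail of the regular-expression approach is that `re.split` produces an empty string at the start (and end) whenever the text begins (or ends) with a non-word character such as `<` or `>`. As a small offline sketch, with a short made-up HTML snippet standing in for the downloaded page, `re.findall(r'\w+', ...)` gives the same tokens without the empty strings:

```python
import re

# A short offline sample standing in for the downloaded page.
html = ('<!doctype html>\n'
        '<html class="no-js" lang="en">\n'
        '<head><meta charset="utf-8"></head>')

# Splitting on non-word runs leaves '' at both ends of the list.
split_tokens = re.split(r'\W+', html)

# Matching word runs directly avoids the empty strings.
word_tokens = re.findall(r'\w+', html)

print(split_tokens[:4])
print(word_tokens[:4])
```

Either way the tag names and attribute names (html, class, meta, ...) still end up in the token list, so this alone does not separate markup from content.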
Note: on Python 3 the line
html=html.decode('utf-8')
is required. Without it, re.split fails with:
TypeError: cannot use a string pattern on a bytes-like object
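The error comes from mixing a str pattern with bytes data. The sketch below reproduces it on a tiny hard-coded bytes value (my own stand-in for `response.read()`), then shows the two ways around it: decode first, or keep everything as bytes and use a bytes pattern.

```python
import re

raw = b'<meta charset="utf-8">'  # response.read() returns bytes like this

# A str pattern applied to bytes raises TypeError.
try:
    re.split(r'\W+', raw)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Option 1: decode to str first, then use a str pattern (as in version 2).
text_tokens = re.split(r'\W+', raw.decode('utf-8'))

# Option 2: stay in bytes and use a bytes pattern; the tokens stay bytes.
byte_tokens = re.split(rb'\W+', raw)

print(error_message)
print(text_tokens)
print(byte_tokens)
```

Decoding is usually the more convenient choice here, since the later steps (BeautifulSoup, FreqDist) work with ordinary strings.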
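Next, strip the tags as part of the NLTK workflow (the code below relies on BeautifulSoup's get_text() to do the actual tag removal):
import nltk
import urllib.request
from bs4 import BeautifulSoup
response=urllib.request.urlopen('http://python.org/')
html=response.read()
html=html.decode('utf-8')
# Parse the page with the lxml parser and keep only the text nodes
soup=BeautifulSoup(html,'lxml')
clean=soup.get_text()
tokens=clean.split()
print(tokens[:100])
Output: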
['Welcome', 'to', 'Python.org', '{', '"@context":', '"http://schema.org",', '"@type":', '"WebSite",', '"url":', '"https://www.python.org/",', '"potentialAction":', '{', '"@type":', '"SearchAction",', '"target":', '"https://www.python.org/search/?q={search_term_string}",', '"query-input":', '"required', 'name=search_term_string"', '}', '}', 'var', '_gaq', '=', '_gaq', '||', '[];', "_gaq.push(['_setAccount',", "'UA-39055973-1']);", "_gaq.push(['_trackPageview']);", '(function()', '{', 'var', 'ga', '=', "document.createElement('script');", 'ga.type', '=', "'text/javascript';", 'ga.async', '=', 'true;', 'ga.src', '=', "('https:'", '==', 'document.location.protocol', '?', "'https://ssl'", ':', "'http://www')", '+', "'.google-analytics.com/ga.js';", 'var', 's', '=', "document.getElementsByTagName('script')[0];", 's.parentNode.insertBefore(ga,', 's);', '})();', 'Notice:', 'While', 'Javascript', 'is', 'not', 'essential', 'for', 'this', 'website,', 'your', 'interaction', 'with', 'the', 'content', 'will', 'be', 'limited.', 'Please', 'turn', 'Javascript', 'on', 'for', 'the', 'full', 'experience.', 'Skip', 'to', 'content', '▼', 'Close', 'Python', 'PSF', 'Docs', 'PyPI', 'Jobs', 'Community', '▲', 'The', 'Python', 'Network']
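The tags are gone, but the output above still contains the contents of the page's script blocks (the `_gaq` analytics code, for example). As a rough sketch of how those could be filtered out, here is a version built on the standard library's `html.parser` instead of BeautifulSoup, purely so that it runs without extra packages; the class and function names are my own, and the sample HTML is made up.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)

def visible_tokens(html):
    """Return whitespace-separated tokens of the visible text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()

sample = (
    "<html><head><title>Welcome to Python.org</title>"
    "<script>var _gaq = _gaq || [];</script></head>"
    "<body><p>Skip to content</p></body></html>"
)
tokens = visible_tokens(sample)
print(tokens)
```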
Next, count word frequencies with nltk:
import nltk
import urllib.request
from bs4 import BeautifulSoup
response=urllib.request.urlopen('http://python.org/')
html=response.read()
html=html.decode('utf-8')
soup=BeautifulSoup(html,'lxml')
clean=soup.get_text()
tokens=clean.split()
#print(tokens[:100])
# FreqDist maps each token to the number of times it occurs
Freq_dist_nltk=nltk.FreqDist(tokens)
print(Freq_dist_nltk)
for k,v in Freq_dist_nltk.items():
    print(str(k)+':'+str(v))
Output:
<FreqDist with 614 samples and 1117 outcomes>
up:2
==:1
now:3
document.getElementsByTagName('script')[0];:1
Best:1
"url"::1
-:1
core:1
[];:1
Statements:1
ga.src:1
s.parentNode.insertBefore(ga,:1
tkInter,:1
international:1
Trac,:1
Legal:1
Beginner’s:1
While:1
'Apple'),:1
here.:2
FOSDEM:2
Brochure:2
2018-01-09:1
Windows:3
programmers:1
Stories:3
Essays:2
_gaq.push(['_setAccount',:1
Interpretation:1
Up:1
Chat:1
discussed:1
Logo:2
window.jQuery:1
Search:1
List:2
comprehensions:1
processing:1
?:1
b,:1
PyCon:2
'Banana'),:1
<:1
Easy:1
Diversity:3
1.:1
Contributing:1
Top:2
Learn:3
go.:1
User:4
even:1
Notice::1
for?:1
Arts:2
sure:1
programs:1
functions:2
it's:1
control:3
document.location.protocol:1
web2py:1
Government:2
Event:2
-,:1
tools:1
/:4
our:2
Industrial:1
A:2
Smaller:1
Scientific:3
compound:1
knows:1
About:2
Tru64,:1
rendering:1
Security:1
classic:1
Sign:3
have:1
Check:1
Solaris,:1
Copyright:1
quickly,:1
Javascript:2
=:14
Launch:1
Runs:1
987:1
Events:11
an:3
Whet:1
Getting:2
hire:1
Unicode):1
join:1
Mailing:2
Foundation:3
will:1
Roundup:1
Web:1
Hi,:1
learn.:1
Developer's:3
Submit:3
Django,:1
built-in:1
(PEPs)::1
GO:1
Types:1
I'm:2
lists.:1
Girls:1
alpha:1
used:2
Not:1
name):1
math:1
Started:3
Latest:1
Proposals:1
Python!"):1
21:1
to:17
55:1
expected;:1
syntax:2
Documentation:3
Latest::1
other:4
6,:1
}:2
programmers.:1
Engineering:2
limited.:1
Website:1
Talks:2
PSF:4
faster:1
can:3
lists:1
Numeric::1
3.6.4,:1
Implementations:2
377:1
Django:1
n::1
list:2
Data:1
pipeline.:1
—:2
Larger:1
relaunched:1
programming:4
PyGObject,:1
appetite:1
(1,:1
website,:1
Intuitive:1
structure:1
Python.:2
Linux,:1
©2001-2018.:1
argument:1
IPython:1
Forums:2
What:1
a,:2
Kivy,:1
SciPy,:1
Pyramid,:1
for,:1
output:1
For:1
available.:2
own:1
one:1
**:1
"WebSite",:1
//:1
Meetup:1
You’d:1
Skip:1
(with:1
144:1
arithmetic:1
running:1
Conduct:2
turn:1
motion:1
Archive:4
allows:1
['Banana',:1
is:',:1
Light:1
users:1
community-run:1
Contact:1
Compaq:1
speak:1
indentation:1
Issue:1
fruits:1
0:1
course.:1
operators:1
Upcoming:1
straightforward::1
re-code):1
Tracker:1
3.6.4:5
The:5
General:1
fruit:1
systems:1
Welcome:1
fib(n)::1
8:2
all:1
Become:1
versions!:1
release:1
job:1
you:1
'.google-analytics.com/ga.js';:1
daily.:1
languages:1
610:1
document.write('<script:1
manipulated:1
together:1
Quick:1
new:1
Powered:1
product):1
≡:1
print(a,:1
of:17
Calculations:1
Tim:1
language,:1
...:7
2018:2
Defined:1
input('What:1
essential:1
frames:1
2018-02-02:2
end=':1
Fibonacci:1
Flask,:1
use?:1
Lists:4
%s.':1
Please:1
functions.:2
'Lime']:1
development:1
Initiatives:1
standard:1
"potentialAction"::1
types:1
s:1
languages):1
[2,:1
enumerate:1
Donate:1
picture:1
Looking:1
"@type"::2
():1
fib(1000):1
protect,:1
Enhancement:1
in:8
more:2
Jobs:2
+:1
Register:1
Menu:1
Code:2
Hello,:1
thousands:1
experienced:1
document.createElement('script');:1
2018-01-23:1
Practices:1
(and:1
future:1
grouping.:1
by:3
[(0,:1
News:11
0,:1
['BANANA',:1
for…:1
true;:1
day.:1
library,:1
Flow:1
with:7
pick:1
number:2
Ansible,:1
4,:1
Software:6
(function():1
as:2
ga:1
testing.:1
OpenStack:1
easy:2
5.666666666666667:1
machines:1
last:1
find:1
Mac:2
environment:1
production:2
Source:2
print("Hello,:1
Special:2
3:8
'http://www'):1
'):1
fourth:2
keyword:1
docs.python.org:1
▲:3
'LIME']:1
ga.type:1
per:1
System:1
3.:1
Fortenberry:1
Bug:1
Success:3
Awards:2
Input,:1
Development::3
Pandas,:1
3.7.0a4:1
print('Hi,:1
statements:1
arrays:1
name?:1
the:19
"http://schema.org",:1
twists,:1
parentheses:1
Platforms:2
"@context"::1
community:1
This:1
usual:1
growth:1
lets:1
are:5
Books:2
Alternative:2
~800:1
Audio/Visual:2
effectively.:1
way:1
Back:2
89:1
Development:2
'Apple',:1
ILM:1
})();:1
arguments,:2
is::1
Our:1
some:1
Reset:1
"query-input"::1
*:2
Functions:1
four:1
"target"::1
Chelyabinsk:1
Rackspace:1
sliced:1
2:3
place:1
source:1
list(enumerate(fruits)):1
content:2
Privacy:1
def:1
License:2
key,:1
not:1
Simple:2
FAQ:2
very:1
PyPI:1
34:1
interaction:1
n:1
installers:1
Python:60
Conferences:2
float:1
numbers:1
Wiki:2
Guide:6
'Lime')]:1
Facebook:1
Buildbot,:1
||:2
extensible:1
PyQt,:1
::1
>>>:24
experience.:1
assignment:1
Community:7
Pythology:1
numbers::1
#:9
compositing:1
facilitate:1
quickly:1
series:1
IRC:3
advance:1
a+b:1
ga.async:1
arbitrary:1
Administration::1
var:3
2017-12-06:1
_gaq:2
for:11
Salt,:1
or:2
Core:1
PEP:2
Magic:1
Education:2
board:1
and:22
Legon:1
your:4
tens:1
language:2
mission:1
support:1
was:1
>_:1
Python,:1
while:2
Site:1
Applications:2
name?\n'):1
17:2
about:3
2017-12-19:2
that:5
version:1
Thousands:1
available:3
returns:1
jobs.python.org:1
position:1
'APPLE',:1
Interactive:1
_gaq.push(['_trackPageview']);:1
Groups:2
modeling,:1
promote,:1
X,:1
233:1
Merchandise:2
understands.:1
Non-English:2
Network:1
batch:1
3.5.5rc1:1
optional:1
Mentorship:1
product:5
floor:1
Expect:1
on:4
8]:1
Interest:2
%:1
which:1
'text/javascript';:1
print(loud_fruits):1
Socialize:1
0.5:1
Member:1
{:3
capable:1
One-Day:1
candidate:1
Whether:1
full:1
('https:':1
Group:4
Close:1
"https://www.python.org/search/?q={search_term_string}",:1
In:2
developer,:1
fruits]:1
function:1
data:1
loud_fruits:1
Experienced:1
Downloads:2
Status:1
you're:2
expression:1
flow:2
python-dev:1
online.:1
indexed,:1
3::4
Google+:1
(known:1
is:16
src="/static/js/libs/jquery-1.8.2.min.js"><\/script>'):1
Python's:1
Use:1
its:1
3.4.8rc1:1
trying:1
clean:1
use:1
13:1
2018-02-03:3
Policy:1
3.6.4rc1:1
download:1
"SearchAction",:1
Beginner's:2
(2,:1
Conference::1
work:3
+,:1
planned:1
code:4
s);:1
range:1
this:2
overview.:1
loop:1
RSS:1
be:3
All:3
Bottle,:1
Business:2
tutorials:1
Twitter:1
runs:1
beginners:1
defining:2
GUI:1
Speed:1
OS:3
provide:1
learn:1
division:2
IRIX,:1
'https://ssl':1
[fruit.upper():1
if,:1
▼:1
Start:1
diverse:1
Python.org:1
Shell:1
1:5
first:1
name:1
X:2
Download:1
releases:3
384:1
More:9
wxPython:1
Compound:1
Other:2
any:1
b:2
pipeline:1
along:1
related:1
integrate:1
PySide,:1
print('The:1
5:2
print():1
"required:1
Docs:6
&:3
Tornado,:1
Python!:1
name=search_term_string":1
"https://www.python.org/",:1
'UA-39055973-1']);:1
Get:1
simple:2
Help:3
guides,:1
Quotes:2
a:10
Index:2
mandatory:1
2.7.14:1
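The listing above comes out in no particular order, which makes it hard to see the dominant terms. A small sketch of ranking by count follows. It uses `collections.Counter` with a handful of counts copied from the listing so that it runs without NLTK; as far as I know, `FreqDist` in NLTK 3 is a `Counter` subclass, so `most_common` should be available on `Freq_dist_nltk` in the same way. The `isalpha` filter is just one simple assumption about what counts as a word.

```python
from collections import Counter

# A few token counts copied from the listing above.
freq = Counter({"Python": 60, "and": 22, "the": 19, "to": 17, "of": 17,
                "is": 16, "=": 14, ">>>": 24, "#": 9})

# most_common sorts by count, highest first.
top5 = freq.most_common(5)

# Keep only alphabetic tokens to drop symbols such as '>>>' and '='.
words_only = Counter({tok: n for tok, n in freq.items() if tok.isalpha()})

for token, count in words_only.most_common(3):
    print(token + ':' + str(count))
```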
Plot: [image not preserved]