Challenge Tasks
- Task 1: Implement wordcount in Python
- Task 2: Connect VS Code to InternStudio and take debugging notes
If you are interested in the hands-on camp, see: oss.lingkongstudy.com.cn/blog/202406…
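Task Analysis
Task 1: Use Python to count word frequencies in a piece of English text. English is the simplest case for tokenization because words are separated by spaces. Punctuation must be removed before the words are extracted, and counting should ignore case, so every word is converted to lowercase. For example: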
Input:
"""Hello world!
This is an example.
Word count is fun.
Is it fun to count words?
Yes, it is fun!"""
Output:
{'hello': 1, 'world': 1, 'this': 1, 'is': 4, 'an': 1, 'example': 1, 'word': 1,
'count': 2, 'fun': 3, 'it': 2, 'to': 1, 'words': 1, 'yes': 1}
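Task 2: Connect VS Code to an InternStudio dev machine and record the process of debugging a program. Debugging itself is straightforward: once connected to the dev machine, install the Python extensions and you can start debugging.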
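Connecting VS Code to the Dev Machine
First, log in to InternStudio (studio.intern-ai.org.cn/console/ins…) and copy the SSH connection command. If you do not have a dev machine yet, create one first.
Open VS Code and make sure the remote-ssh extension is installed.
Once it is installed, press F1 in VS Code, type remote-ssh in the prompt, and choose the option to connect to an SSH host.
Paste the SSH connection command you copied.
If you are prompted for a password, copy it from the SSH connection panel of the dev machine in InternStudio and paste it in. For any other prompts, choose the default first option.
After the connection succeeds, open the Explorer and click Open Folder. For convenience, simply open the /root/ folder.
Create a file named wordcount.py and paste in the starter code from the task, shown below: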
text = """
Got this panda plush toy for my daughter's birthday,
who loves it and takes it everywhere. It's soft and
super cute, and its face has a friendly look. It's
a bit small for what I paid though. I think there
might be other options that are bigger for the
same price. It arrived a day earlier than expected,
so I got to play with it myself before I gave it
to her.
"""
def wordcount(text):
    pass
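Tokenizing and Debugging
The suggested first step is to convert the text to lowercase. The second step is to split directly on spaces to extract each word.
The code is as follows: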
text = """
Got this panda plush toy for my daughter's birthday,
who loves it and takes it everywhere. It's soft and
super cute, and its face has a friendly look. It's
a bit small for what I paid though. I think there
might be other options that are bigger for the
same price. It arrived a day earlier than expected,
so I got to play with it myself before I gave it
to her.
"""
def wordcount(text):
    lower = text.lower()
    print(lower)
    word_arr = lower.split(' ')
    print(word_arr)

wordcount(text)
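Run the command python wordcount.py. The result is shown in the figure:
The extracted tokens still contain punctuation and newline characters. Scattering many print() calls through the code also makes it cluttered, so a debugger is a better way to inspect variables.
Open the Extensions view, search for python, and install the Python and Python Debugger extensions.
Once they are installed, click in the gutter to the left of a line number to place a small red dot, known as a breakpoint. Then open Run and Debug and choose Python Debugger as the debugger.
Choose the option to debug the Python file.
When debugging starts, the program runs until it reaches the breakpoint and pauses before executing that line. The panel on the left shows the current values of the variables, and a debug toolbar appears on the right.
Only the first three toolbar buttons matter here: Continue (run to the next breakpoint), Step Over (run the next line), and Step Into (take the next step).
The first button runs the program until it pauses just before the next breakpoint line. The second executes the current line and moves to the next one. The third also advances one step, but if the line calls a function, the debugger jumps into that function.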
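To get a feel for the difference between Step Over and Step Into, a small practice script can help. The one below is only a sketch and is not part of the task code; normalize and demo are hypothetical names chosen for illustration. Set a breakpoint on the marked line, start the debugger, and try each button in turn.

```python
def normalize(word):
    # Step Into enters this function; Step Over runs it in a single step.
    return word.lower().strip('.,!?')


def demo(text):
    words = text.split()
    cleaned = []
    for w in words:
        # Set a breakpoint on the next line, then try the toolbar buttons.
        cleaned.append(normalize(w))
    return cleaned


print(demo("Hello world! This is FUN."))
```

While paused inside the loop, the variables panel should show w, words, and cleaned changing on each iteration.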
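Removing Punctuation and Tokenizing
Punctuation can be removed right after the text is lowercased. The code for removing it is as follows: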
def wordcount(text):
    lower = text.lower()
    # Remove punctuation (periods and commas only)
    rm_dot = lower.replace('.', '')
    rm_comma = rm_dot.replace(',', '')
    # Replace newlines with spaces
    rm_line_sep = rm_comma.replace('\n', ' ')
    word_arr = rm_line_sep.split(' ')
    print(word_arr)
The output now looks like this:
The list still contains empty strings, but they can be skipped during counting.
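The empty strings appear because splitting on a single space yields an empty token wherever separators are adjacent or sit at the start or end of the string. As an alternative sketch (not the approach used in this post), calling split() with no argument splits on any run of whitespace, including newlines, and never produces empty strings. The sample string is made up for illustration.

```python
sample = "\nit's soft and\nsuper  cute \n"

# Splitting on a single space keeps newlines and produces empty strings.
by_space = sample.split(' ')

# Splitting with no argument treats any run of whitespace as one separator.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

With this variant, the separate step that replaces newlines with spaces would no longer be needed.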
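Counting Word Frequencies
The simplest implementation is to loop over the word list and store each word with its count in a dictionary, adding 1 whenever the word is already present.
Here is the code: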
def wordcount(text):
    lower = text.lower()
    # Remove punctuation (periods and commas only)
    rm_dot = lower.replace('.', '')
    rm_comma = rm_dot.replace(',', '')
    # Replace newlines with spaces
    rm_line_sep = rm_comma.replace('\n', ' ')
    word_arr = rm_line_sep.split(' ')
    word_count = dict()
    for w in word_arr:
        # Skip empty strings when counting
        if w == '':
            continue
        if word_count.get(w) is None:
            word_count[w] = 1
        else:
            word_count[w] = word_count.get(w) + 1
    print(word_count)
The output is shown in the figure below:
At this point, the task is complete.
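The counting loop can be written a little more compactly. The sketch below is a variation rather than the code used above: it relies on dict.get(key, default) returning the default when the key is missing, and count_words is a hypothetical helper name.

```python
def count_words(words):
    """Count occurrences of each non-empty string in a list of words."""
    counts = {}
    for w in words:
        if not w:  # skip empty strings left over from splitting
            continue
        # get() returns 0 for unseen words, so no separate None check is needed
        counts[w] = counts.get(w, 0) + 1
    return counts


print(count_words(['it', 'is', '', 'fun', 'it']))
```

Returning the dictionary instead of printing it inside the function also makes the result easier to inspect in the debugger or reuse elsewhere.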
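Optimization
In practice, the punctuation-removal step is far from perfect. It covers only periods and commas, and even if every punctuation mark were listed, each replace() call scans and copies the whole string again, which is wasteful.
There are two ways to improve it: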
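1. Use a regular expression:
To keep contractions such as it's as a single word, use the pattern r"\w+(?:'\w+)*"; if they do not need to be kept, r'\w+' is enough. (The pattern r'\w+\S*\w' also keeps it's, but it requires at least two word characters and therefore drops single-letter words such as a and I.)
import re

def wordcount(text: str):
    # Word characters, optionally followed by apostrophe parts (e.g. it's)
    pattern = r"\w+(?:'\w+)*"
    matcher = re.compile(pattern)
    # Lowercase first so that counting ignores case
    word_arr = matcher.findall(text.lower())
    word_count = dict()
    for w in word_arr:
        if word_count.get(w) is None:
            word_count[w] = 1
        else:
            word_count[w] = word_count.get(w) + 1
    print(word_count)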
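To see how the choice of pattern affects tokens such as it's and single-letter words, the comparison below runs three candidate patterns over the same sentence. It is a small sketch; the sample sentence and the dictionary keys are made up for illustration.

```python
import re

sample = "It's a bit small for what I paid."

patterns = {
    "plain": r"\w+",                # splits it's into it and s
    "two_plus": r"\w+\S*\w",        # needs at least two word characters
    "apostrophe": r"\w+(?:'\w+)*",  # keeps it's and single letters
}

# Lowercase once, then tokenize with each pattern.
results = {name: re.findall(p, sample.lower()) for name, p in patterns.items()}
for name, tokens in results.items():
    print(name, tokens)
```

The two-character pattern silently skips a and i, which lowers the totals for short words, so it is worth checking which behavior the task expects.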
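2. Iterate over every character, that is, each letter and symbol, and copy everything except punctuation into a new string. This approach is not demonstrated here.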
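For readers who want to try the character-by-character approach, here is one possible sketch. It assumes that "punctuation" means the ASCII set in string.punctuation, which includes the apostrophe, so it's becomes its. Both function names are hypothetical. The second function does the same filtering in a single pass with str.translate.

```python
import string


def strip_punctuation_loop(text):
    """Copy every character except ASCII punctuation into a new string."""
    kept = []
    for ch in text:
        if ch not in string.punctuation:
            kept.append(ch)
    return ''.join(kept)


def strip_punctuation_translate(text):
    """Same result in one pass using a translation table."""
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)


sample = "Yes, it is fun!"
print(strip_punctuation_loop(sample))
print(strip_punctuation_translate(sample))
```

Either function could replace the chain of replace() calls, since all punctuation marks are handled in one traversal of the text.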
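The Library Shortcut
Python has so many libraries that simply calling one gets the job done. Here is a version that leans on pandas.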
import re
import pandas as pd
text = """
Got this panda plush toy for my daughter's birthday,
who loves it and takes it everywhere. It's soft and
super cute, and its face has a friendly look. It's
a bit small for what I paid though. I think there
might be other options that are bigger for the
same price. It arrived a day earlier than expected,
so I got to play with it myself before I gave it
to her.
"""
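def wordcount(text: str):
    # Match words, keeping internal apostrophes (it's) and single-letter words (a, I)
    pattern = r"\w+(?:'\w+)*"
    matcher = re.compile(pattern)
    mt = matcher.findall(text)
    # Lowercase through pandas, then count how often each word occurs
    dtf = pd.Series(mt)
    dtf = dtf.str.lower()
    wc = dtf.value_counts()
    print(wc.to_json())

wordcount(text)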
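If installing pandas on the dev machine is not desirable, the standard library offers a similar shortcut. The sketch below is an alternative, not the version used in this post: it uses collections.Counter, and wordcount_counter is a hypothetical name.

```python
import re
from collections import Counter


def wordcount_counter(text):
    """Return a Counter mapping each lowercase word to its frequency."""
    words = re.findall(r"\w+(?:'\w+)*", text.lower())
    return Counter(words)


counts = wordcount_counter("Yes, it is fun! Is it fun?")
print(counts.most_common(2))
```

A Counter behaves like a dictionary, so dict(counts) gives a plain dict in the same shape as the example output in the task description.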