InternLM Practical Camp (Season 3), Level 1: Python Basics

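Challenge tasks

Task 1: implement wordcount in Python
Task 2: notes on debugging with VS Code connected to InternStudio

If you are interested in the camp, take a look here: oss.lingkongstudy.com.cn/blog/202406…

Task analysis

Task 1: use Python to count word frequencies in a piece of English text. English is the easiest case for tokenizing, because words are separated by spaces. Punctuation has to be removed before the words are extracted, and counting should ignore case, so every word is converted to lower case. For example: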

Input:

"""Hello world!  
This is an example.  
Word count is fun.  
Is it fun to count words?  
Yes, it is fun!"""

Output:

{'hello': 1, 'world': 1, 'this': 1, 'is': 4, 'an': 1, 'example': 1, 'word': 1,
'count': 2, 'fun': 3, 'it': 2, 'to': 1, 'words': 1, 'yes': 1}

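Task 2: connect VS Code to an InternStudio dev machine and record the process of debugging a program. Debugging itself is simple: once connected to the dev machine, install the Python extensions and you can start debugging.

Connecting VS Code to the dev machine

First log in to InternStudio at studio.intern-ai.org.cn/console/ins… and copy the SSH connection command. If you do not have a dev machine yet, create one.

Open VS Code and make sure the remote-ssh extension is installed.

With the extension installed, press F1 in VS Code, type remote-ssh in the prompt that appears, and choose the option to connect to an SSH host.

Paste the SSH connection command you copied.

If you are asked for a password, copy it from the SSH connection panel of the InternStudio dev machine and paste it in. For any other prompt, pick the default first option.

Once connected, open the Explorer and click Open Folder. For convenience, simply open the /root/ folder.

Create a file named wordcount.py and paste the starter code into it.

The starter code is as follows: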

text = """
Got this panda plush toy for my daughter's birthday,
who loves it and takes it everywhere. It's soft and
super cute, and its face has a friendly look. It's
a bit small for what I paid though. I think there
might be other options that are bigger for the
same price. It arrived a day earlier than expected,
so I got to play with it myself before I gave it
to her.
"""

def wordcount(text):
    pass

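Tokenizing and debugging

The suggested first step is to convert the text to lower case. The second step is to extract the words by splitting on spaces.

The code is as follows: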

text = """
Got this panda plush toy for my daughter's birthday,
who loves it and takes it everywhere. It's soft and
super cute, and its face has a friendly look. It's
a bit small for what I paid though. I think there
might be other options that are bigger for the
same price. It arrived a day earlier than expected,
so I got to play with it myself before I gave it
to her.
"""

def wordcount(text):
    lower = text.lower()
    print(lower)
    word_arr = lower.split(' ')
    print(word_arr)

wordcount(text)

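Run python wordcount.py; the result is shown in the figure:

The extracted tokens still contain punctuation and newline characters. Scattering print() calls through the code also makes it messy.

Open the Extensions view, search for python, and install the Python and Python Debugger extensions.

Once they are installed, click to the left of a line number to add a small red dot, known as a breakpoint. Then click Run and Debug and choose Python Debugger as the debugger.

Choose the option to debug the current Python file.

Once debugging starts, the program runs up to the breakpoint and pauses before executing that line. The current values of the variables are shown on the left, and a toolbar appears on the right.

Only the first three toolbar buttons matter here: Continue (run to the next breakpoint), Step Over (next line), and Step Into (next step).

The first button runs the program until it pauses at the next breakpoint. The second executes the next line of code. The third advances one step, and if that line calls a function, execution jumps inside the function.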
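To practice the three buttons, a small script with one function calling another is handy. The sketch below is only an illustration: normalize and first_words are hypothetical names that do not appear in the task code. Set a breakpoint on the for line, then compare what Step Over and Step Into do when the highlighted line calls normalize().

```python
def normalize(word):
    # Hypothetical helper: lower-case a token and trim surrounding punctuation.
    return word.lower().strip(".,!?")


def first_words(text, n):
    words = []
    for raw in text.split():          # set a breakpoint on this line
        # Step Over runs normalize() in one go; Step Into jumps inside it.
        words.append(normalize(raw))
        if len(words) == n:
            break
    return words


print(first_words("Got this panda plush toy, for my daughter.", 3))
```

Continue resumes the run until the loop hits the breakpoint again, which makes the difference between the three buttons easy to see.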

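Removing punctuation and tokenizing

Punctuation can be removed right after the text is converted to lower case. The code for that is as follows: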

def wordcount(text):
    lower = text.lower()

    # Remove punctuation (only periods and commas are handled here)
    rm_dot = lower.replace('.','')
    rm_comma = rm_dot.replace(',','')

    # Replace newlines with spaces
    rm_line_sep = rm_comma.replace('\n',' ')

    word_arr = rm_line_sep.split(' ')

    print(word_arr)

The output at this point is as follows:

There are still empty strings in the list, but they can be skipped when counting.
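The empty strings appear because split(' ') cuts at every single space, so a leading, trailing, or doubled space produces ''. As a side note rather than part of the original solution, calling split() with no argument cuts at any run of whitespace (newlines included) and never returns empty strings. A minimal comparison on a made-up sample string:

```python
line = "soft and\nsuper cute,  and"

# split(' ') cuts only at single spaces: the newline stays inside a token
# and the doubled space yields an empty string.
print(line.split(' '))

# split() with no argument cuts at any run of whitespace and drops empties.
print(line.split())
```

With split() the separate step that replaces '\n' with ' ' would no longer be needed, although punctuation still has to be handled.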

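Counting word frequencies

The simplest implementation is to loop over the word list and store each word with its count in a dictionary, adding 1 whenever the word is already there.

Here is the code: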

def wordcount(text):
    lower = text.lower()

    # Remove punctuation (only periods and commas are handled here)
    rm_dot = lower.replace('.','')
    rm_comma = rm_dot.replace(',','')

    # Replace newlines with spaces
    rm_line_sep = rm_comma.replace('\n',' ')

    word_arr = rm_line_sep.split(' ')

    word_count = dict()

    for w in word_arr:
        # Skip the empty strings left over from splitting
        if w == '':
            continue
        if word_count.get(w) is None:
            word_count[w] = 1
        else:
            word_count[w] = word_count.get(w) + 1

    print(word_count)

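The output is shown in the figure:

At this point the task is complete.

Optimization

The punctuation-removal step is actually far from ideal. First, it does not cover every punctuation mark. Second, even if every mark were listed, calling replace() once per mark scans the text over and over and wastes performance.

There are two ways to improve it. 1. Use a regular expression. To keep words such as it's intact, use r'\w+\S*\w'; if they do not need to stay intact, r'\w+' is enough. Note that r'\w+\S*\w' needs at least two word characters, so it silently drops one-letter words such as a and I; a pattern like r"\w+(?:'\w+)*" keeps both.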

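import re

def wordcount(text: str):
    # Match runs of word characters, optionally joined by apostrophes, so
    # "it's" stays whole. Unlike r'\w+\S*\w', this also keeps one-letter
    # words such as "a" and "i".
    pattern = r"\w+(?:'\w+)*"
    matcher = re.compile(pattern)

    # Lower-case first so that "It" and "it" are counted together
    word_arr = matcher.findall(text.lower())

    word_count = dict()
    for w in word_arr:
        if word_count.get(w) is None:
            word_count[w] = 1
        else:
            word_count[w] = word_count.get(w) + 1
    print(word_count)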
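To see how the choice of pattern changes the tokens, it helps to run the candidates on one short phrase from the review text. This is only a quick comparison; the third pattern is a suggested alternative, not one taken from the task.

```python
import re

sample = "It's a bit small"

# Plain word characters: the apostrophe splits "It's" in two.
print(re.findall(r"\w+", sample))

# Keeps "It's" whole but needs two word characters, so "a" is lost.
print(re.findall(r"\w+\S*\w", sample))

# Word characters optionally joined by apostrophes: keeps both.
print(re.findall(r"\w+(?:'\w+)*", sample))
```

Because \S* matches any non-space character, r'\w+\S*\w' would also glue together text such as "hello,world" when no space follows the comma.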

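2. Iterate over every character, letters and symbols alike, and copy everything except punctuation into a new string. This approach is not worked through in detail here.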
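For reference, a minimal sketch of that character-by-character idea might look like the following. It assumes that "punctuation" means the ASCII marks in string.punctuation and that apostrophes should be kept; strip_punctuation is a hypothetical name.

```python
import string


def strip_punctuation(text, keep="'"):
    # Characters to drop: ASCII punctuation minus anything listed in keep.
    drop = set(string.punctuation) - set(keep)
    # Copy every character that is not in the drop set, in a single pass.
    return "".join(ch for ch in text if ch not in drop)


cleaned = strip_punctuation("Yes, it's fun! Is it?")
print(cleaned)
print(cleaned.lower().split())
```

The built-in text.translate(str.maketrans('', '', string.punctuation)) does the same kind of single-pass removal, though with that exact table the apostrophe is removed as well.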

Just call a library

Python has so many libraries that calling one is enough to pass, so here is a library-based version.


import re
import pandas as pd

text = """
Got this panda plush toy for my daughter's birthday,
who loves it and takes it everywhere. It's soft and
super cute, and its face has a friendly look. It's
a bit small for what I paid though. I think there
might be other options that are bigger for the
same price. It arrived a day earlier than expected,
so I got to play with it myself before I gave it
to her.
"""

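def wordcount(text: str):
    # Keeps apostrophes inside words ("it's") and one-letter words ("a", "I"),
    # which r'\w+\S*\w' would skip
    pattern = r"\w+(?:'\w+)*"
    matcher = re.compile(pattern)
    mt = matcher.findall(text)

    # Lower-case the tokens, then let pandas count them
    words = pd.Series(mt)
    words = words.str.lower()
    wc = words.value_counts()
    print(wc.to_json())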



wordcount(text)
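If pandas feels heavy for this job, the standard library's collections.Counter does the same counting. The version below is a sketch under the same assumptions as before (apostrophes kept inside words, everything lower-cased); wordcount_counter is a hypothetical name, not part of the task.

```python
import re
from collections import Counter


def wordcount_counter(text):
    # Lower-case, then pull out words, keeping apostrophes inside them.
    words = re.findall(r"\w+(?:'\w+)*", text.lower())
    # Counter tallies each word; dict() turns it into a plain dictionary.
    return dict(Counter(words))


print(wordcount_counter("Hello world! This is an example. Is it fun? Yes, it is fun!"))
```

Counter also offers most_common(n) when only the top few words are wanted.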
