Third InternLM (书生) LLM Practical Camp, Level 1: Python Basics


Level tasks

  • Task 1: Implement wordcount in Python
  • Task 2: Connect VS Code to InternStudio and take debugging notes

If you are interested in the camp, see: oss.lingkongstudy.com.cn/blog/202406…

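Task analysis

Task 1: Use Python to count word frequencies in an English text. English is the simplest case for tokenization, because words are separated by whitespace. Punctuation must be removed before the words are extracted, and counting should ignore case, so every word is converted to lowercase. For example: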

Input:

"""Hello world!  
This is an example.  
Word count is fun.  
Is it fun to count words?  
Yes, it is fun!"""

Output:

{'hello': 1, 'world': 1, 'this': 1, 'is': 4, 'an': 1, 'example': 1, 'word': 1,
'count': 2, 'fun': 3, 'it': 2, 'to': 1, 'words': 1, 'yes': 1}

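Task 2: Connect VS Code to an InternStudio development machine and record the process of debugging a program. Debugging itself is fairly simple: once connected to the development machine, install the Python extensions and you can start debugging.

Connecting VS Code to the development machine

First log in to InternStudio (studio.intern-ai.org.cn/console/ins…) and copy the SSH connection command. If you do not have a development machine yet, create one first.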

image.png

Open VS Code and make sure the remote-ssh extension is installed.

image.png

Once it is installed, press F1 in VS Code, type remote-ssh in the prompt that appears, and choose the option to connect to an SSH host.

image.png

Paste the SSH connection command you copied.

image.png

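If you are prompted for a password, copy it from the SSH connection panel of the development machine in InternStudio and paste it in. For any other prompt, choose the default first option.

Once connected, open the Explorer and click Open Folder. For convenience, simply open the /root/ folder.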

image.png

Create a file named wordcount.py and paste in the task code:

image.png

The task code is as follows:

text = """
Got this panda plush toy for my daughter's birthday,
who loves it and takes it everywhere. It's soft and
super cute, and its face has a friendly look. It's
a bit small for what I paid though. I think there
might be other options that are bigger for the
same price. It arrived a day earlier than expected,
so I got to play with it myself before I gave it
to her.
"""

def wordcount(text):
    pass

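Tokenizing and debugging

The first step is to convert the text to lowercase. The second step is to split it on spaces to extract each word.

The code is as follows: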

text = """
Got this panda plush toy for my daughter's birthday,
who loves it and takes it everywhere. It's soft and
super cute, and its face has a friendly look. It's
a bit small for what I paid though. I think there
might be other options that are bigger for the
same price. It arrived a day earlier than expected,
so I got to play with it myself before I gave it
to her.
"""

def wordcount(text):
    lower = text.lower()
    print(lower)
    word_arr = lower.split(' ')
    print(word_arr)

wordcount(text)

Run python wordcount.py; the result is shown below:

image.png

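The extracted tokens still contain punctuation and newline characters. Scattering many print() calls through the code also makes it cluttered, which is where the debugger helps.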
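To see why the tokens still carry punctuation and newlines, here is a minimal sketch on a two-line sample. The `sample` string is made up for illustration and is not part of the task text.

```python
# Hypothetical two-line sample, used only to illustrate the behaviour.
sample = "Hello world!\nThis is fun."

# split(' ') cuts on single space characters only, so the newline and
# the punctuation stay inside the tokens.
tokens = sample.lower().split(' ')
print(tokens)  # ['hello', 'world!\nthis', 'is', 'fun.']
```

The second token shows both problems at once: a trailing `!` and an embedded newline joining two words.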

Open the Extensions view, search for python, and install the Python and Python Debugger extensions.

image.png

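After installation, click to the left of a line number to place a small red dot, known as a breakpoint. Then open Run and Debug and select Python Debugger as the debugger.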

image.png

Choose the option to debug the Python file.

image.png

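Once debugging starts, the program runs until it reaches the breakpoint and pauses before executing that line. The current variable values are shown on the left, and a toolbar appears on the right.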

image.png

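Only the first three toolbar buttons matter here: Continue (run to the next breakpoint), Step Over (run the next line), and Step Into (take the next step).

Continue runs the program until it pauses at the next breakpoint. Step Over executes the current line and moves to the next one. Step Into also advances one step, but if the line calls a function, it jumps into that function.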
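A small script can help when trying the three buttons. The one below is only a sketch for practice; the names `square` and `main` are illustrative and not part of the task.

```python
def square(n):
    result = n * n           # Step Into on the call below pauses here
    return result


def main():
    value = 3                # set a breakpoint on this line
    squared = square(value)  # Step Over runs square() as a whole; Step Into enters it
    print(squared)           # prints 9
    return squared


if __name__ == "__main__":
    main()
```

With the breakpoint on the first line of `main`, pressing Step Over twice leaves `squared` visible in the variables panel, while pressing Step Into on the `square(value)` line moves the cursor inside `square`.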

Removing punctuation and tokenizing

Punctuation can be removed right after lowercasing. The code is as follows:

def wordcount(text):
    lower = text.lower()

    # Remove punctuation
    rm_dot = lower.replace('.','')
    rm_comma = rm_dot.replace(',','')

    # Replace newlines with spaces
    rm_line_sep = rm_comma.replace('\n',' ')

    word_arr = rm_line_sep.split(' ')

    print(word_arr)

The output is now:

image.png

Some empty strings remain, but they can be skipped when counting.
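The empty strings come from leading, trailing, or repeated spaces. As a hedged aside, calling `split()` with no argument splits on any run of whitespace and never produces empty strings. The `cleaned` string below is a made-up sample, not the task text.

```python
# Hypothetical sample with a leading space, a double space, and a trailing space.
cleaned = " got this panda  plush toy "

by_space = cleaned.split(' ')   # splits on every single space
by_default = cleaned.split()    # splits on runs of whitespace

print(by_space)    # ['', 'got', 'this', 'panda', '', 'plush', 'toy', '']
print(by_default)  # ['got', 'this', 'panda', 'plush', 'toy']
```

Because `split()` also treats `\n` as whitespace, it would make the separate newline replacement unnecessary as well.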

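Counting word frequencies

The simplest implementation iterates over the word list and stores each word with its count in a dictionary; when a word is already in the dictionary, its count is incremented by 1.

Here is the code: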

def wordcount(text):
    lower = text.lower()

    # Remove punctuation
    rm_dot = lower.replace('.','')
    rm_comma = rm_dot.replace(',','')

    # Replace newlines with spaces
    rm_line_sep = rm_comma.replace('\n',' ')

    word_arr = rm_line_sep.split(' ')

    word_count = dict()

    for w in word_arr:
        # Skip empty strings when counting
        if w == '':
            continue
        # get(w, 0) returns 0 for a word that has not been seen yet
        word_count[w] = word_count.get(w, 0) + 1

    print(word_count)

The output is shown below:

image.png

At this point, the task is complete.
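As an aside, the counting loop can also be expressed with `collections.Counter` from the standard library. This is a minimal sketch rather than the version used in this write-up; the function name `wordcount_counter` is hypothetical, and it removes only `.` and `,`, like the code above.

```python
from collections import Counter


def wordcount_counter(text: str) -> dict:
    # Same cleaning as above: lowercase, then drop '.' and ','.
    cleaned = text.lower().replace('.', '').replace(',', '')
    # split() with no argument handles newlines and repeated spaces,
    # and Counter tallies each word.
    return dict(Counter(cleaned.split()))


print(wordcount_counter("It is fun. Is it fun, too."))
# {'it': 2, 'is': 2, 'fun': 2, 'too': 1}
```

Returning the dictionary instead of printing it inside the function makes the result easier to inspect in the debugger or reuse elsewhere.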

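Optimization

In practice the punctuation-removal step is far from ideal. The list of punctuation marks is incomplete, and even if every mark were listed, each replace() call would scan and copy the whole string again, which is wasteful.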

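There are two ways to improve this. 1. Use a regular expression: to keep words such as it's intact, use the pattern r'\w+(?:\S*\w)?'; if they do not need to be kept, r'\w+' is enough.

import re

def wordcount(text: str):
    # Keep inner punctuation (it's) and still match one-letter words (a, i)
    pattern = r'\w+(?:\S*\w)?'
    matcher = re.compile(pattern)

    # Lowercase first so that counting ignores case
    word_arr = matcher.findall(text.lower())

    word_count = dict()

    for w in word_arr:
        # get(w, 0) returns 0 for a word that has not been seen yet
        word_count[w] = word_count.get(w, 0) + 1
    print(word_count)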
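To compare the two kinds of pattern, the sketch below runs both on one short sentence. The `sample` string is made up, and the second pattern, `\w+(?:\S*\w)?`, is an assumed variant whose optional tail lets one-letter words such as `a` match as well.

```python
import re

sample = "It's a toy I like."  # hypothetical sample sentence

# Plain \w+ treats the apostrophe as a separator.
plain = re.findall(r"\w+", sample.lower())
print(plain)  # ['it', 's', 'a', 'toy', 'i', 'like']

# The optional tail keeps punctuation inside a word but not at its end.
keep_inner = re.findall(r"\w+(?:\S*\w)?", sample.lower())
print(keep_inner)  # ["it's", 'a', 'toy', 'i', 'like']
```

Which result is preferable depends on whether contractions should count as single words.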

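2. Iterate over every character, that is, each letter and symbol, and copy everything except punctuation into a new string. It is not demonstrated in detail here.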
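For reference, a minimal sketch of this character-by-character approach might look like the following. It assumes that "punctuation" means the ASCII marks in `string.punctuation`, and the function name `strip_punctuation` is hypothetical.

```python
import string


def strip_punctuation(text: str) -> str:
    """Copy every character except ASCII punctuation into a new string."""
    kept = []
    for ch in text:
        if ch not in string.punctuation:
            kept.append(ch)
    # Joining once at the end avoids building many intermediate strings.
    return "".join(kept)


print(strip_punctuation("Yes, it's fun! Is it?"))  # Yes its fun Is it
```

Note that the apostrophe is punctuation too, so `it's` becomes `its`. The standard library can do the same single pass with `text.translate(str.maketrans('', '', string.punctuation))`.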

Just use a library

Python has so many libraries that calling one gets the job done. Here is a library-based version.


import re
import pandas as pd

text = """
Got this panda plush toy for my daughter's birthday,
who loves it and takes it everywhere. It's soft and
super cute, and its face has a friendly look. It's
a bit small for what I paid though. I think there
might be other options that are bigger for the
same price. It arrived a day earlier than expected,
so I got to play with it myself before I gave it
to her.
"""

def wordcount(text: str):
    # Keep inner punctuation (it's) and still match one-letter words (a, I)
    pattern = r'\w+(?:\S*\w)?'
    matcher = re.compile(pattern)
    mt = matcher.findall(text)

    # Lowercase with pandas, then count the occurrences of each word
    words = pd.Series(mt).str.lower()
    wc = words.value_counts()
    print(wc.to_json())



wordcount(text)