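When processing text data, we often need to split a large string into multiple substrings, each containing a fixed number of words. For example, we may want to split the text of the United States Declaration of Independence into substrings of 5 words each.
2. Solution
2.1. Using the split() method
The split() method splits a string into a list on a given separator. Called with no arguments, as in the code below, it splits on any run of whitespace (spaces, tabs, and newlines), which gives us a list of words. We can then use a loop or a list comprehension to group the word list into substrings of n words each.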
text = """
United States Declaration of Independence
When in the course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the powers of the earth, the separate and equal station to which the laws of nature and of nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation.
We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable rights, that among these are life, liberty, and the pursuit of happiness. That to secure these rights, governments are instituted among men, deriving their just powers from the consent of the governed, that whenever any form of government becomes destructive of these ends, it is the right of the people to alter or abolish it, and to institute a new government, laying its foundation on such principles, and organizing its powers in such form, as to them shall seem most likely to effect their safety and happiness. Prudence, indeed, will dictate that governments long established should not be changed for light and transient causes; and accordingly all experience hath shewn, that mankind are more disposed to suffer, while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, pursuing invariably the same object, evinces a design to reduce them under absolute despotism, it is their right, it is their duty, to throw off such government, and to provide new guards for their future security. Such has been the patient sufferance of these colonies; and such is now the necessity which constrains them to alter their former systems of government. The history of the present king of Great-Britain is a history of repeated injuries and usurpations, all having in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world.
"""
n = 5 # number of words in each substring
# split the text into words
words = text.split()
# create a list of substrings, each containing n words
substrings = []
for i in range(0, len(words), n):
    substrings.append(" ".join(words[i:i+n]))
# print the first 10 substrings
print(substrings[:10])
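The same grouping can also be written as a list comprehension wrapped in a small function. This is only a minimal sketch: the name `chunk_words` and the sample sentence are illustrative assumptions, not part of the original example.

```python
def chunk_words(text, n):
    """Split text into substrings of at most n whitespace-separated words."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    words = text.split()
    # step through the word list n items at a time; the last chunk may be shorter
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]


sample = "When in the course of human events, it becomes necessary"
chunks = chunk_words(sample, 5)
print(chunks)
# ['When in the course of', 'human events, it becomes necessary']
```

If the number of words is not a multiple of n, the final substring simply holds the remaining words.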
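2.2. Using the NLTK library
We can also use NLTK, a natural language processing library, to split a string into substrings of n words. NLTK provides several tokenizers; we can use them to split the string into a list of tokens, then use a list comprehension or a loop to group the tokens into substrings of n each. Note that the example below chunks each sentence separately, so the last substring of a sentence may contain fewer than n tokens, and that word_tokenize() treats punctuation marks as separate tokens.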
import nltk
# requires the Punkt sentence tokenizer data, e.g. nltk.download('punkt')
# (recent NLTK releases use 'punkt_tab' instead)
# split the text into sentences
sentences = nltk.sent_tokenize(text)
# create a list of substrings, each containing n words
substrings = []
for sentence in sentences:
    words = nltk.word_tokenize(sentence)
    for i in range(0, len(words), n):
        substrings.append(" ".join(words[i:i+n]))
# print the first 10 substrings
print(substrings[:10])
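Both approaches so far share the same grouping step, so it can be factored into a helper that accepts tokens from any tokenizer. The sketch below is an illustration under assumptions: the name `chunk_tokens` is hypothetical, and `str.split()` stands in for a tokenizer such as `nltk.word_tokenize` so that the example runs without NLTK data. Passing all tokens of the text at once, rather than one sentence at a time, avoids the short substrings at sentence ends.

```python
from itertools import islice


def chunk_tokens(tokens, n):
    """Yield space-joined groups of at most n tokens from any iterable."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    iterator = iter(tokens)
    while True:
        # take the next n tokens; an empty list means the input is exhausted
        group = list(islice(iterator, n))
        if not group:
            return
        yield " ".join(group)


# str.split() is a stand-in here; any tokenizer that returns strings would work
tokens = "We hold these truths to be self-evident , that all men".split()
groups = list(chunk_tokens(tokens, 4))
print(groups)
# ['We hold these truths', 'to be self-evident ,', 'that all men']
```

Because the helper is a generator, it never needs the full list of substrings in memory at once, which may matter for very large inputs.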
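2.3. Using regular expressions
We can also use a regular expression to split a string into substrings of n words. We write a pattern that matches a run of up to n whitespace-separated words, then use the finditer() method to find every matching substring.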
import re
# create a regular expression to match up to n whitespace-separated words
pattern = r'\S+(?:\s+\S+){0,%d}' % (n - 1)
# find all matches of the regular expression in the text
matches = re.finditer(pattern, text)
# create a list of substrings, each containing n words
substrings = []
for match in matches:
    substrings.append(match.group())
# print the first 10 substrings
print(substrings[:10])
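As a sanity check, a regex-based chunker should agree with the split()-based one once whitespace inside each match is normalized. The following is a sketch under assumptions: the function name `chunk_with_regex` and the sample string are hypothetical, and a "word" is taken to be any run of non-whitespace characters (`\S+`), so that tokens such as "self-evident," stay intact.

```python
import re


def chunk_with_regex(text, n):
    """Return substrings of at most n words, matched with a regular expression."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # one word, followed by up to n - 1 further whitespace-separated words
    pattern = re.compile(r"\S+(?:\s+\S+){0,%d}" % (n - 1))
    # collapse newlines and repeated spaces inside each match
    return [" ".join(m.group().split()) for m in pattern.finditer(text)]


sample = "We hold these truths\nto be self-evident, that all men are created equal"
regex_chunks = chunk_with_regex(sample, 5)

# reference result built with split(), for comparison
words = sample.split()
split_chunks = [" ".join(words[i:i + 5]) for i in range(0, len(words), 5)]

print(regex_chunks == split_chunks)
```

Without the normalization step, a match that spans a line break would keep the newline character inside the substring.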