Selectively including the text around a substring in a string


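We need to search for a target string inside a larger string. Our current code finds the target and displays the 40 characters before and after it. What we want instead is to display the two sentences before and the two sentences after the target text.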

import re

sentence = "In addition, participation in life situations can be somewhat impaired because of communicative disabilities associated with the disorder and parents’ lack of resources for overcoming this aspect of the disability (i.e. communication devices). The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder. The program will use the ICF framework to identify those factors determined to be most beneficial and cost effective in optimising health, function and quality of life for the affected child and her family."

sub = "biopsychosocial model"

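def find_all_substrings(string, sub):
    # Case-insensitive search on the original string, so the returned
    # offsets index directly into it.
    return [match.start() for match in re.finditer(re.escape(sub), string, flags=re.IGNORECASE)]

substrings = find_all_substrings(sentence, sub)
for pos in substrings:
    # Clamp the start at 0: a negative start would wrap around to the end of the string.
    # Count the trailing 40 characters from the end of the match, not from its start.
    print(sentence[max(0, pos - 40):pos + len(sub) + 40])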

Now we want to know how to display the two sentences before and after the target text instead.

2. Solution

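One approach is to split the text into sentences first, then find every sentence that contains the substring, along with its index. After that, it is just a matter of slicing out the sentences around each matching one.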

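# sent_tokenize needs NLTK's Punkt data: run nltk.download('punkt') once
# (recent NLTK releases ask for 'punkt_tab' instead).
from nltk.tokenize import sent_tokenize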

text = "In addition, participation in life situations can be somewhat impaired because of communicative disabilities associated with the disorder and parents’ lack of resources for overcoming this aspect of the disability (i.e. communication devices). The attitudes of service providers are also important. The Australian Rett syndrome research program is based on a biopsychosocial model which integrates aspects of both medical and social models of disability and functioning. The investigation of environmental factors such as equipment and support available to individuals and families and the social capital of the communities in which they live is likely to be integral to understanding the burden of this disorder. The program will use the ICF framework to identify those factors determined to be most beneficial and cost effective in optimising health, function and quality of life for the affected child and her family."
sentences = sent_tokenize(text)

sub = "biopsychosocial model"
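# Indices of the sentences that contain the substring (this check is case-sensitive).
matching_indices = [i for i, sentence in enumerate(sentences) if sub in sentence]

n_sent_padding = 2
# max(0, ...) keeps the slice start from going negative when the match
# is in one of the first sentences; a negative start would wrap around.
displayed_sentences = [
    ' '.join(sentences[max(0, i - n_sent_padding):i + n_sent_padding + 1])
    for i in matching_indices
]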

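This finds the index of every sentence that contains the substring (stored in matching_indices). displayed_sentences then holds each matching sentence together with up to n_sent_padding sentences before and after it.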
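For reuse, the same steps can be wrapped in a function. The following is only a minimal sketch under a few assumptions: naive_sentences and sentences_around are hypothetical names, not part of NLTK, and the regex splitter is a crude stand-in for sent_tokenize so that the example runs with the standard library alone. It also matches case-insensitively, as the original character-window code was meant to.

```python
import re

def naive_sentences(text):
    # Hypothetical helper (assumption, not NLTK): split on whitespace that
    # follows ., ! or ? and precedes an uppercase letter. Much cruder than
    # sent_tokenize; "Dr. Smith" would be split in the wrong place.
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())

def sentences_around(text, sub, n_sent_padding=2):
    """Return one window of sentences for each sentence containing sub."""
    sentences = naive_sentences(text)
    needle = sub.lower()
    windows = []
    for i, sentence in enumerate(sentences):
        if needle in sentence.lower():
            # Clamp the start so a match near the beginning does not
            # produce a negative (wrap-around) slice index.
            start = max(0, i - n_sent_padding)
            windows.append(' '.join(sentences[start:i + n_sent_padding + 1]))
    return windows

demo = "One. Two has a cat. Three. Four. Five. Six has a Cat too."
for window in sentences_around(demo, "cat", n_sent_padding=1):
    print(window)
```

For real prose, swapping naive_sentences for sent_tokenize should give better sentence boundaries. Note that when two matches are close together their windows overlap, so the same sentence can appear in more than one result.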