Extracting specific patterns from multi-line text with regular expressions


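We are given a block of text, data, that contains multiple records, each with fields that begin with PMID, TI, and AB. We want to extract these fields from the text for further processing.

We tried to do this with regular expressions but ran into difficulties.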

2. Solutions

Approach 1: Combining re.compile with finditer

import re

reg4 = re.compile(r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)', re.MULTILINE | re.DOTALL)
for i in reg4.finditer(data):
    print(i.groupdict())
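Because the pattern above is an alternation, each match fills only one of the three named groups and leaves the other two as None. The following is a minimal sketch of how those per-field matches might be merged back into one dictionary per record. The sample text, the name collect_records, and the rule that a PMID line starts a new record are assumptions made for illustration, not part of the original data.

```python
import re

# Hypothetical sample in the tagged layout the pattern expects.
SAMPLE = """\
PMID- 11111111
TI  - A title that wraps
      onto a second line.
PG  - 1-10
AB  - The first abstract.
AD  - Example University
PMID- 22222222
TI  - Another title.
PG  - 11-20
AB  - The second abstract,
      also wrapped.
AD  - Example Institute
"""

FIELD_RE = re.compile(
    r'^(?:PMID- (?P<pmid>[0-9]+)|TI  - (?P<title>.*?)^PG|AB  - (?P<abstract>.*?)^AD)',
    re.MULTILINE | re.DOTALL,
)


def collect_records(text):
    """Merge single-field matches into one dict per record."""
    records = []
    for match in FIELD_RE.finditer(text):
        name = match.lastgroup  # the one named group that matched
        # Collapse line breaks and indentation inside wrapped fields.
        value = " ".join(match.group(name).split())
        if name == "pmid":
            records.append({"pmid": value})  # assumption: PMID opens a record
        elif records:
            records[-1][name] = value
    return records


for record in collect_records(SAMPLE):
    print(record)
```

Fields that appear before the first PMID line are ignored in this sketch, which may or may not suit the real data.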

Approach 2: Verbose regular expression

reg4 = re.compile(r'''
        ^                     # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        (?:                   # Non capturing group with multiple options, first option:
            PMID-\s           # Literal "PMID-" followed by a space
            (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        |                     # Next option:
            TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
            (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
            ^PG               # The characters PG at the start of a line
        |                     # Next option
            AB\s{2}-\s        # "AB  - "
            (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
            ^AD               # "AD" at the start of a line
        )
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())

Approach 3: A single regular expression for the whole record

reg4 = re.compile(r'''
        ^                 # Start of a line (due to re.MULTILINE, this may match at the start of any line)
        PMID-\s           # Literal "PMID-" followed by a space
        (?P<pmid>[0-9]+)  # Then a string of one or more digits, group as 'pmid'
        .*?               # Lazily skip ahead to the title field
        TI\s{2}-\s        # "TI", two spaces, a hyphen and a space
        (?P<title>.*?)    # The title, a non greedy match that will capture everything up to...
        ^PG               # The characters PG at the start of a line
        .*?               # Lazily skip ahead to the abstract field
        AB\s{2}-\s        # "AB  - "
        (?P<abstract>.*?) # The abstract, a non greedy match that will capture everything up to...
        ^AD               # "AD" at the start of a line
        ''', re.MULTILINE | re.DOTALL | re.VERBOSE)
for i in reg4.finditer(data):
    print(i.groupdict())
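With the whole-record pattern, every match carries all three groups at once, so each groupdict() is already a complete record. The sketch below shows one possible way to use it and to tidy up wrapped fields. The sample text and the name extract_records are hypothetical, and the pattern is a condensed copy of the one above.

```python
import re

# Hypothetical sample; the second line of the title is a wrapped continuation.
SAMPLE = """\
PMID- 11111111
TI  - A title that wraps
      onto a second line.
PG  - 1-10
AB  - The first abstract.
AD  - Example University
PMID- 22222222
TI  - Another title.
PG  - 11-20
AB  - The second abstract.
AD  - Example Institute
"""

RECORD_RE = re.compile(r'''
    ^PMID-\s(?P<pmid>[0-9]+)          # PMID line
    .*?                               # skip to the title field
    TI\s{2}-\s(?P<title>.*?)^PG       # title, up to the PG line
    .*?                               # skip to the abstract field
    AB\s{2}-\s(?P<abstract>.*?)^AD    # abstract, up to the AD line
    ''', re.MULTILINE | re.DOTALL | re.VERBOSE)


def extract_records(text):
    """Return one dict per match, with internal whitespace collapsed."""
    return [
        {name: " ".join(value.split()) for name, value in m.groupdict().items()}
        for m in RECORD_RE.finditer(text)
    ]


for record in extract_records(SAMPLE):
    print(record)
```

One caveat of this design: the lazy `.*?` parts can cross record boundaries. If a record has no AB field, the match keeps scanning and pairs that record's PMID and title with the abstract of the following record, so the pattern is best suited to data where every record is complete.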

Approach 4: Using Python's built-in string methods

def parse_data(data):
    """
    Parses the data and extracts the PMID, title, and abstract.

    Args:
        data (str): The data to parse.

    Returns:
        list of dicts: A list of dictionaries, where each dictionary contains the PMID, title, and abstract.
    """

    # Split the data by newlines
    lines = data.split('\n')

    # Create a list to store the results
    results = []

    # Fields of the record being read; None until the field has been seen
    pmid = title = abstract = None

    # Iterate over the lines
    for line in lines:
        # If the line starts with "PMID-", start a new record and extract the PMID
        if line.startswith("PMID-"):
            # maxsplit=1 keeps any hyphens inside the value intact
            pmid = line.split('-', 1)[1].strip()
            title = abstract = None

        # If the line starts with "TI  -", extract the title
        elif line.startswith("TI  -"):
            title = line.split('-', 1)[1].strip()

        # If the line starts with "AB  -", extract the abstract
        elif line.startswith("AB  -"):
            abstract = line.split('-', 1)[1].strip()

        # Once all three fields have been extracted, add them to the results list
        if pmid and title and abstract:
            results.append({
                "pmid": pmid,
                "title": title,
                "abstract": abstract
            })
            # Reset the fields so the same record is not appended again
            pmid = title = abstract = None

    # Return the results
    return results

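The parse_data function splits the input text into lines and parses them one by one, using the startswith method to check whether a line begins with one of the expected prefixes ("PMID-", "TI  -", or "AB  -") and extracting the corresponding content. Once all three fields (PMID, title, and abstract) have been extracted, they are added to the result list. Finally, the function returns the list of parsed results. Because each line is examined on its own, only the first line of a title or abstract that wraps across several lines is captured.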
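If titles and abstracts wrap across several lines, a line-based parser needs to fold continuation lines back into the field they belong to. The following is a minimal sketch of that idea, assuming continuation lines start with whitespace and that a PMID line opens each record. The names parse_records and WANTED_TAGS and the sample text are hypothetical.

```python
# Map the tags of interest to the keys used in the result dicts.
WANTED_TAGS = {"PMID": "pmid", "TI": "title", "AB": "abstract"}


def parse_records(data):
    """Line-based parser that joins indented continuation lines."""
    records = []
    current_key = None  # field that a continuation line would extend
    for line in data.splitlines():
        if line[:1].isspace():
            # Assumption: an indented line continues the previous field.
            if current_key is not None:
                records[-1][current_key] += " " + line.strip()
            continue
        # Split on the first hyphen only, so hyphens in the value survive.
        tag, sep, value = line.partition("-")
        key = WANTED_TAGS.get(tag.strip()) if sep else None
        if key == "pmid":
            records.append({})  # a PMID line starts a new record
        if key is not None and records:
            records[-1][key] = value.strip()
            current_key = key
        else:
            current_key = None  # a tag we do not collect
    return records


sample = (
    "PMID- 11111111\n"
    "TI  - A well-known title that wraps\n"
    "      onto a second line.\n"
    "PG  - 1-10\n"
    "AB  - The abstract.\n"
    "AD  - Example University\n"
)
print(parse_records(sample))
```

Unlike parse_data, this sketch keeps records that are missing a title or abstract; the missing key is simply absent from the dictionary.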

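Approach 5: Using third-party libraries

Third-party libraries such as re2 and regex can also be used in place of the standard re module. The regex module is largely compatible with re and adds further pattern features, while re2 (Google's RE2 engine, available through Python bindings) guarantees linear-time matching at the cost of features such as backreferences and lookaround. Depending on the task, these libraries can make patterns easier to write and understand.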
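The third-party regex module mirrors the re API closely, so trying it out can be as small as changing the import. The sketch below assumes the regex package may or may not be installed and falls back to the standard library otherwise; the alias rx and the helper find_pmids are illustrative names.

```python
try:
    import regex as rx  # third-party package, installed separately
except ImportError:
    import re as rx     # fall back to the standard library

# The same pattern syntax and flags work with either module.
PMID_RE = rx.compile(r'^PMID- (?P<pmid>[0-9]+)', rx.MULTILINE)


def find_pmids(text):
    """Return every PMID that appears at the start of a line."""
    return [m.group("pmid") for m in PMID_RE.finditer(text)]


print(find_pmids("PMID- 111\nTI  - Some title\nPMID- 222\n"))
```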

**Note:** Keep performance in mind when using regular expressions. An overly complex pattern can slow matching down, so keep patterns as simple as the task allows.
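One way to keep a pattern simple is to describe the field itself instead of scanning lazily for a terminator with re.DOTALL. The sketch below is one such alternative, assuming wrapped lines are indented with spaces or tabs; the names FIELD_RE and extract_fields are hypothetical. It compiles the pattern once at module level and does not depend on PG or AD lines being present.

```python
import re

# Without re.DOTALL, "." stops at the end of a line, so the value is the
# first line plus any directly following indented continuation lines.
FIELD_RE = re.compile(
    r'^(?P<tag>TI|AB)  - (?P<value>.*(?:\n[ \t]+.*)*)',
    re.MULTILINE,
)


def extract_fields(text):
    """Return (tag, value) pairs for TI and AB fields, whitespace collapsed."""
    return [
        (m.group("tag"), " ".join(m.group("value").split()))
        for m in FIELD_RE.finditer(text)
    ]


sample = (
    "PMID- 11111111\n"
    "TI  - A title that wraps\n"
    "      onto a second line.\n"
    "PG  - 1-10\n"
    "AB  - The abstract.\n"
)
print(extract_fields(sample))
```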