Finding Matches Between Two Files with Python

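When analyzing sequencing data, you sometimes need to look up the functions of candidate genes. By adapting an existing specialized database, you can compare the candidate genes against it and output the function of each one. Even with only basic Python skills, a short script can speed up this lookup.

For example, file1 contains the candidate genes: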

Gene
AQP7
RLIM
SMCO3
COASY
HSPA6

The database, file2.csv, lists genes together with their functions:

Gene   function 
PDCD6  Programmed cell death protein 6 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a 
CDC2   Cell division cycle 2, G1 to S and G2 to M, isoform CRA_a

The desired output is:

Gene (from file1), function (matching entry from file2)

The following code was tried for the matching, but it produced only empty output:

file1 = 'file1.csv'
file2 = 'file2.csv'
output = 'file3.txt'

with open(file1) as inf:
    match = set(line.strip() for line in inf)

with open(file2) as inf, open(output, 'w') as outf:
    for line in inf:
        if line.split(' ',1)[0] in match:
            outf.write(line)

Using set intersection did not work either:

with open('file1.csv', 'r') as ref:
    with open('file2.csv', 'r') as com:
        with open('common_genes_function', 'w') as output:
            same = set(ref).intersection(com)
            print(same)
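One plausible reason both attempts come up empty, assuming file2.csv is tab-separated: the first attempt splits each line on a single space, so the first field is never the bare gene name, and the second compares whole lines, which are never identical across the two files. The sketch below reproduces this with small in-memory samples; the rows are illustrative stand-ins, not the real files.

```python
import io

# Hypothetical in-memory stand-ins for file1.csv and file2.csv (tab-separated).
file1_text = "Gene\nAQP7\nCDC2\n"
file2_text = (
    "Gene\tfunction\n"
    "CDC2\tCell division cycle 2\n"
    "PDCD6\tProgrammed cell death protein 6\n"
)

candidates = set(line.strip() for line in io.StringIO(file1_text))

# Attempt 1: splitting on a single space leaves the tab-joined text intact,
# so the first piece is e.g. "CDC2\tCell" rather than "CDC2".
space_hits = [line for line in io.StringIO(file2_text)
              if line.split(' ', 1)[0] in candidates]

# Attempt 2: whole lines are compared, and no line is identical in both files.
line_hits = set(io.StringIO(file1_text)).intersection(io.StringIO(file2_text))

# Splitting on any whitespace isolates the gene name, so the lookup succeeds.
# The header line also matches because "Gene" is in the candidate set.
fixed_hits = [line.rstrip('\n') for line in io.StringIO(file2_text)
              if line.split(None, 1)[0] in candidates]

print(space_hits)
print(line_hits)
print(fixed_hits)
```

If the real database uses a different delimiter, the same idea applies: split on whatever actually separates the gene name from its function.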

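2. Solutions

Option 1: Using Pandas

The merge function in the Pandas library can join the two files. Assuming genes and functions are separated by tabs, the code looks like this: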

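import pandas as pd

# Input and output paths (names taken from the attempts above)
filepath1 = 'file1.csv'
filepath2 = 'file2.csv'
filepath3 = 'file3.txt'

# Load both files as Pandas DataFrames
file1 = pd.read_csv(filepath1, sep='\t')
file2 = pd.read_csv(filepath2, sep='\t')

# Merge on the 'Gene' column; 'inner' keeps only genes present in both files
file3 = pd.merge(file1, file2, how='inner', on=['Gene'], suffixes=('1', '2'))

# Save the merged data as comma-separated text, without the row index
file3.to_csv(filepath3, sep=',', index=False)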
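Because the sample database lists CDC2 more than once, an inner merge would repeat that gene in the output. Here is a minimal sketch of dropping duplicate database rows before merging, using made-up in-memory tables in place of the real files (the variable names and sample rows are assumptions for illustration):

```python
import io

import pandas as pd

# Hypothetical tab-separated stand-ins for the two files.
candidates = pd.read_csv(io.StringIO("Gene\nCDC2\nAQP7\n"), sep="\t")
database = pd.read_csv(io.StringIO(
    "Gene\tfunction\n"
    "PDCD6\tProgrammed cell death protein 6\n"
    "CDC2\tCell division cycle 2\n"
    "CDC2\tCell division cycle 2\n"
), sep="\t")

# Remove repeated database rows so each gene/function pair appears once.
database = database.drop_duplicates()

# The inner join keeps only genes present in both tables.
matched = pd.merge(candidates, database, how="inner", on="Gene")

print(matched.to_csv(index=False))
```

Passing how="left" instead would keep every candidate gene and leave the function empty where the database has no entry, which can be useful for spotting unannotated candidates.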

Option 2: Using Regular Expressions

Regular expressions can also be used to pair up the genes and functions from the two files:

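import re

# Dictionary mapping each gene to its function
gene_function = {}

# Open the database file and read its contents
with open('file2.csv', 'r') as db_file:
    # Skip the header line
    lines = [line.strip() for line in db_file.readlines()[1:]]

    # Extract the gene and function from each line
    for line in lines:
        match = re.search(r"(\w+)\s+(.*)", line)
        if match is None:
            # Skip blank or malformed lines
            continue
        gene = match.group(1)
        function = match.group(2)

        # Add the gene and its function to the dictionary
        gene_function[gene] = function

# Open the candidate gene file and read its contents
with open('file1.csv', 'r') as gene_file:
    # Skip the header line
    genes = [i.strip() for i in gene_file.readlines()[1:]]

    # Check each candidate gene
    for gene in genes:
        # If the gene is in the dictionary, print the gene and its function
        if gene in gene_function:
            print("{}, {}".format(gene, gene_function[gene]))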
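The regular-expression version prints its matches, whereas the stated goal is a file with one Gene,function row per match. Below is a minimal sketch of that last step using the standard csv module. The function name write_matches and its parameters are hypothetical, and the database is assumed to be tab-separated with a header row.

```python
import csv


def write_matches(candidates_path, database_path, output_path):
    """Write Gene,function rows for candidates found in the database."""
    # Map each gene to its function; the first occurrence of a gene wins.
    gene_function = {}
    with open(database_path, newline="") as db_file:
        reader = csv.reader(db_file, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) >= 2:
                gene_function.setdefault(row[0].strip(), row[1].strip())

    matched = 0
    with open(candidates_path) as gene_file, \
            open(output_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(["Gene", "function"])
        next(gene_file, None)  # skip the header row
        for line in gene_file:
            gene = line.strip()
            if gene in gene_function:
                writer.writerow([gene, gene_function[gene]])
                matched += 1
    return matched
```

A call such as write_matches('file1.csv', 'file2.csv', 'file3.txt') returns the number of matched genes. The csv writer quotes any function text that itself contains commas, so the output stays a valid two-column file.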

Either approach will find the entries that the two files have in common.