生信1--从fasta文件中根据bed文件位置信息大量提取指定区域的片段作为生信工程师，终于掌握了python面对对象编

作为生信工程师，终于掌握了python面对对象编程的思路，开始磨刀霍霍使用屠龙之术。但是在此之前还是好好从简单的代码出发

好记性不如烂笔头，所以尝试记录一下成长历程同时发布一些简单生信脚本，从简单到难，基于python开发

这个脚本首先思路就是把fasta格式转换成python字典，key储存的是以“>”字符开头的标签，value储存的是测序的read（如果是标准fasta，会分成多行，要注意）

#!/usr/bin/env python
usages: python extract_reads_by_location.py input_fasta_file bed_file > output_fasta_file

import re
import sys

def structure_fasta_dict(fasta_file): #读取fasta文件，返回字典结构
    with open(fasta_file,"r") as input_fasta:
        fasta_dict = {}
        for line in input_fasta:
            line = line.strip() #strip()方法移除字符串头尾指定的字符（默认为空格或换行符）,不会删除中间部分字符
            if line == '':
                continue
            if line.startswith('>'):
                read_name = line.lstrip('>')
                read_name = re.sub('\..*', '', read_name)
                '''把>标签行整理成bed文件中read标签的格式(比如fasta文件中>标签行：'gi|41406098|ref|NC_002944.2| Mycobacterium avium subsp. paratuberculosis K-10, complete sequence'（ncbi标准格式）；bed文件中第一列read标签：如NC_002944.2（只需其中一部分）)，运用自己构建的正则表达提取。例句只提取第一个空格前的'''
                fasta_dict[read_name] = ''
            else:
                fasta_dict[read_name] += line
        return fasta_dict

def main(fasta_file, bedfile):
    fasta_dict = structure_fasta_dict(fasta_file)
    with open(bedfile,"r") as input_bed:
        for line in input_bed:
            line = line.strip().split('\t')
            if line[0] in fasta_dict.keys(): #如果bed文件和fasta不匹配，添加这个判断，可以不报错
                outname = line[0] + ":" +line[1] + '-' + line[2]
                print('>' + outname)
                s_pos = int(line[1])-1
                e_pos = int(line[2])-1
                print(fasta_dict[line[0]][s_pos:e_pos])

if __name__ == '__main__':
    fasta_file = sys.argv[1]
    bed_file = sys.argv[2]
    main(fasta_file, bed_file)

代码就在上面，可以下载后根据需求自己改写，两个参数，第一个fasta文件，第二个bed文件。可以再添加一个输出代码块，我这里直接linux里面 > 输出文件即可。

不足之处（当然可以用线程的生信软件） 1.如果fasta数据量太大，速度过慢（字典太大，后续加入拆分成多个fasta+并行的功能） 2.为了更美观，可以添加命令行解析参数（click模块，getopt模块，fire模块......）

有空再优化，安利一款生信软件seqkit，非常好用

如果有机会讲到pysam，到时候可以用pysam包实现同样的功能

structure_fasta_dict()另一种写法，不用正则表达需要对fasta格式>标签行有深刻的认识，这也是生信人应该融会贯通的知识。

def structure_fasta_dict(fasta_file): #读取fasta文件，返回字典结构
    with open(fasta_file, 'r') as input_fasta:
        fasta_dict = {}
        for line in input_fasta:
            if line[0] == '>':
                header = line[1:].rstrip()
                id = header.split()[0]
                fasta_dict[id] = []
            else:
                fasta_dict[id].append(line.rstrip())
        for id, read in fasta_dict.items():
                fasta_dict[id] = ''.join(read)
    return fasta_dict

对于编程来说，条条大路通罗马