生物信息学的数据格式有多种,不同的软件支持的格式不尽相同,因此我们有时需要将一种格式转换为另一种格式,如把Phylip格式转换成Fasta格式,把GenPept格式转换成Fasta格式等。实现格式转换可以使用BioPerl、BioPython或EMBOSS软件包中的seqret工具,也可以用shell编程实现。
下面是登录号为BAH04252.1的GenPept格式(以前称为GenBank格式)蛋白质序列:
LOCUS BAH04252 173 aa linear PLN 05-AUG-2014 DEFINITION LEAFY, partial [Cardamine alpina]. ACCESSION BAH04252 VERSION BAH04252.1 GI:219565447 DBSOURCE accession AB378290.1 KEYWORDS . SOURCE Cardamine alpina ORGANISM Cardamine alpina Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae; Pentapetalae; rosids; malvids; Brassicales; Brassicaceae; Cardamineae; Cardamine. REFERENCE 1 AUTHORS Ikeda,H., Fujii,N. and Setoguchi,H. TITLE Application of the isolation with migration model demonstrates the pleistocene origin of geographic differentiation in Cardamine nipponica (Brassicaceae), an endemic Japanese alpine plant JOURNAL Mol. Biol. Evol. 26 (10), 2207-2216 (2009) PUBMED 19567916 REMARK DOI:10.1093/molbev/msp128 REFERENCE 2 (residues 1 to 173) AUTHORS Ikeda,H., Senni,K., Fujii,N. and Setoguchi,H. TITLE Direct Submission JOURNAL Submitted (01-FEB-2008) Contact:Hajime Ikeda Okayama University, Institute of Plant Science and Resources; 2-20-1 Chuo, Kurashiki, Okayama 710-0046, Japan FEATURES Location/Qualifiers source 1..173 /organism="Cardamine alpina" /sub_species="alpina" /db_xref="taxon:352360" /haplotype="2" /note="amplified by TFL1 Marker" Protein <1..>173 /name="LEAFY" Region 34..>168 /region_name="FLO_LFY" /note="Floricaula / Leafy protein; pfam01698" /db_xref="CDD:250804" CDS 1..173 /gene="LEAFY" /coded_by="join(AB378290.1:<1..420,AB378290.1:907..>1009)" /note="similar to LEAFY protein" /codon_start=3 ORIGIN 1 wnptratvqa lppvppppqq qpattqtaaf gmrlgglegl fgaygirfyt aakiaelgft 61 astlvgmrde eleemmnsls hifrwellvg erygikaavr aerrrlqeee eesskrrhll 121 lsaagdsgth haldalsqed dwtglseepv qqqdnqtdaa gnnggyweag kgk //
将GenBank格式转换成Fasta格式的shell脚本示例如下:
[xiezy@ibi98 seq_convert]$ cat seq_convert.sh #!/bin/bash # Usage: ./seq_convert.shFastaFile if [ "$1" == "-h" -o "$1" == "--help" ] then echo "Usage: ./seq_convert.shFastaFile" exit 0 elif [ $# -eq 0 ] then seq_flag=0 while read line do # find accession and save its value if echo $line |grep "^ACCESSION" >/dev/null then acc=`echo "$line" |cut -d' ' -f4-` continue # find definition and save its value elif echo $line |grep "^DEFINITION" >/dev/null then def=`echo "$line" |cut -d' ' -f3-` continue # determine the end of a sequence elif echo "$line" |grep "^\/\/" >/dev/null then seq_flag=0 continue # output the processed sequence elif [ $seq_flag -eq 1 ] then echo "$line" |perl -pe "s/[ \d]//g" |tr a-z A-Z continue # determine the begining of a sequence and output the annotation line elif echo "$line" |grep "^ORIGIN" >/dev/null then seq_flag=1 echo ">${acc} ${def}" fi done else echo "Wrong arguments!" exit 1 fi
该shell脚本从标准输入读取数据,读一行处理一行。如果运行时发现参数-h或--help,则显示用法,参数为其他内容时提示错误的参数。运行结果:
[xiezy@ibi98 seq_convert]$ cat seq.gb |./seq_convert.sh
>BAH04252 LEAFY, partial [Cardamine alpina].
WNPTRATVQALPPVPPPPQQQPATTQTAAFGMRLGGLEGLFGAYGIRFYTAAKIAELGFT
ASTLVGMRDEELEEMMNSLSHIFRWELLVGERYGIKAAVRAERRRLQEEEEESSKRRHLL
LSAAGDSGTHHALDALSQEDDWTGLSEEPVQQQDNQTDAAGNNGGYWEAGKGK
直接用EMBOSS里的seqret工具,也可以实现这种格式转换:
[xiezy@ibi98 seq_convert]$ seqret seq.gb -out fasta::seq.fasta Read and write (return) sequences [xiezy@ibi98 seq_convert]$ cat seq.fasta >BAH04252 BAH04252.1 LEAFY, partial [Cardamine alpina]. wnptratvqalppvppppqqqpattqtaafgmrlggleglfgaygirfytaakiaelgft astlvgmrdeeleemmnslshifrwellvgerygikaavraerrrlqeeeeesskrrhll lsaagdsgthhaldalsqeddwtglseepvqqqdnqtdaagnnggyweagkgk
seqret支持的格式很多,但有些格式之间是没法转换的。另外从信息丰富的格式转成信息较少的格式(如上面的GenPept转换成Fasta)没有问题,但反过来,即使能转换,也会缺失很多信息。