Biopython

Biopython
Original author(s)	Chapman B, Chang J
Initial release	17 December 2002
Stable release	1.81 (12 February 2023); 1.83 (10 January 2024)
Repository	https://github.com/biopython/biopython
Written in	Python and C
Platform	Cross-platform
Type	Bioinformatics
License	Biopython License
Website	biopython.org

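The Biopython project is an open-source collection of non-commercial Python tools for computational biology and bioinformatics, created by an international association of developers.^[1]^[4]^[5] It contains classes to represent biological sequences and sequence annotations, and it can read and write a variety of file formats. It also allows programmatic access to online biological databases, such as those of the National Center for Biotechnology Information (NCBI). Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs and machine learning. The project aims to reduce code duplication in computational biology and, like similar projects, takes a name beginning with Bio.^[6]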

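History

Biopython development began in 1999, and it was first released in July 2000.^[7] Comparable projects developed in the same period include BioPerl, BioRuby and BioJava, each named after the programming language it is written in. Early developers on the project included Jeff Chang, Andrew Dalke and Brad Chapman, and more than 100 people have contributed to date.^[8] A similar Python project, PyCogent, was established in 2007.^[9]

Biopython's initial scope, and still its primary focus, was accessing, indexing and processing biological sequence files. In the following years, added modules extended its functionality to other areas of biology (see Key features and examples).

As of version 1.77, Biopython no longer supports Python 2.^[10]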

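Design

Wherever possible, Biopython follows the conventions of the Python language so that Python users can pick it up easily. For example, Seq and SeqRecord objects can be manipulated via slicing, much like Python's strings and lists. It is also designed to be functionally similar to other Bio* projects, such as BioPerl.^[7]

Biopython can read and write the most common file formats for each of its functional areas, and its permissive license is compatible with most other software licenses, which allows Biopython to be used in a wide range of software projects.^[5]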

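Key features and examples

Sequences

A core concept in Biopython is the biological sequence, represented by the Seq class.^[11] A Seq object is similar to a Python string in many respects: it supports Python slice notation, can be concatenated with other sequences, and is immutable. In addition, it provides sequence-specific methods and can specify the particular biological alphabet used. (Alphabet support, provided by Bio.Alphabet, was removed in Biopython 1.78, so the alphabet argument applies only to earlier releases.)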

>>> # This script creates a DNA sequence and performs some typical manipulations.
>>> # Note: it needs Biopython 1.77 or earlier, because Bio.Alphabet was removed in 1.78.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_sequence = Seq("AGGCTTCTCGTA", IUPAC.unambiguous_dna)
>>> dna_sequence
Seq('AGGCTTCTCGTA', IUPACUnambiguousDNA())
>>> dna_sequence[2:7]
Seq('GCTTC', IUPACUnambiguousDNA())
>>> dna_sequence.reverse_complement()
Seq('TACGAGAAGCCT', IUPACUnambiguousDNA())
>>> rna_sequence = dna_sequence.transcribe()
>>> rna_sequence
Seq('AGGCUUCUCGUA', IUPACUnambiguousRNA())
>>> rna_sequence.translate()
Seq('RLLV', IUPACProtein())
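To make the operations in the session above concrete, the following plain-Python sketch performs slicing, reverse complement, transcription and translation on the same sequence. It does not use Biopython and is not how the Seq class is implemented; the helper names (`reverse_complement`, `transcribe`, `translate`) and the compact construction of the standard codon table are illustrative assumptions, and only the unambiguous letters A, C, G, T and U are handled.

```python
from itertools import product

# Standard genetic code with codons enumerated in T, C, A, G order.
# This compact construction is an assumption made for the sketch.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]


def transcribe(dna: str) -> str:
    """Treat the input as the coding strand and replace T with U."""
    return dna.replace("T", "U")


def translate(rna: str) -> str:
    """Translate complete codons; trailing bases that do not fill a codon are ignored."""
    dna = rna.replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


dna_sequence = "AGGCTTCTCGTA"
sliced = dna_sequence[2:7]
reverse_complemented = reverse_complement(dna_sequence)
rna_sequence = transcribe(dna_sequence)
protein = translate(rna_sequence)
print(sliced, reverse_complemented, rna_sequence, protein)
```

The four values printed correspond to the four results shown in the Biopython session, without the alphabet annotations.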

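Sequence annotation

The SeqRecord class describes a sequence together with information such as its name, its description and its features, the last in the form of SeqFeature objects. Each SeqFeature object specifies the type of the feature and its location. Feature types can be "gene", "CDS" (coding sequence), "repeat_region", "mobile_element" or others, and the position of a feature in the sequence can be exact or approximate.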

>>> # This script loads an annotated sequence from a file and views some of its contents.
>>> from Bio import SeqIO
>>> seq_record = SeqIO.read("pTC2.gb", "genbank")
>>> seq_record.name
'NC_019375'
>>> seq_record.description
'Providencia stuartii plasmid pTC2, complete sequence.'
>>> seq_record.features[14]
SeqFeature(FeatureLocation(ExactPosition(4516), ExactPosition(5336), strand=1), type='mobile_element')
>>> seq_record.seq
Seq("GGATTGAATATAACCGACGTGACTGTTACATTTAGGTGGCTAAACCCGTCAAGC...GCC", IUPACAmbiguousDNA())
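The relationship between a record, its features and their locations can be sketched with plain dataclasses. This is only an illustration under stated assumptions: the class and function names (`Feature`, `Record`, `extract`) are hypothetical stand-ins rather than Biopython's SeqRecord and SeqFeature, only exact positions are modelled, and the short sequence is made up for the example.

```python
from dataclasses import dataclass, field
from typing import List

COMPLEMENT = str.maketrans("ACGT", "TGCA")


@dataclass
class Feature:
    """A feature with an exact, zero-based, half-open location."""
    start: int
    end: int
    strand: int  # +1 for the forward strand, -1 for the reverse strand
    type: str

    def __len__(self) -> int:
        return self.end - self.start


@dataclass
class Record:
    """A sequence plus descriptive information and a list of features."""
    name: str
    description: str
    seq: str
    features: List[Feature] = field(default_factory=list)


def extract(record: Record, feature: Feature) -> str:
    """Return the feature's sequence, reverse-complemented on the -1 strand."""
    fragment = record.seq[feature.start:feature.end]
    if feature.strand == -1:
        fragment = fragment.translate(COMPLEMENT)[::-1]
    return fragment


# Made-up record used only for illustration.
record = Record(
    name="example",
    description="Hypothetical record for illustration",
    seq="ATGAAACCCGGGTTTTAG",
    features=[
        Feature(0, 18, 1, "CDS"),
        Feature(3, 9, -1, "repeat_region"),
    ],
)

forward_fragment = extract(record, record.features[0])
reverse_fragment = extract(record, record.features[1])

# The mobile_element location printed in the session above spans 820 bases.
mobile_element = Feature(4516, 5336, 1, "mobile_element")
print(forward_fragment, reverse_fragment, len(mobile_element))
```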

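Input and output

Biopython can read and write a number of common sequence formats, including FASTA, FASTQ, GenBank, Clustal, PHYLIP and NEXUS. When a file is read, the descriptive information in it is used to populate the members of Biopython classes such as SeqRecord, which allows records in one file format to be converted into others.

Very large sequence files can exceed a computer's memory, so Biopython provides several options for accessing records in large files. A file can be loaded entirely into memory in Python data structures, such as lists or dictionaries, which gives fast access at the cost of memory usage. Alternatively, records can be read from disk as needed, which is slower but uses less memory.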

>>> # This script loads a file containing multiple sequences and saves each one in a different format.
>>> from Bio import SeqIO
>>> genomes = SeqIO.parse("salmonella.gb", "genbank")
>>> for genome in genomes:
...     SeqIO.write(genome, genome.id + ".fasta", "fasta")
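The trade-off between loading everything into memory and reading records on demand can be illustrated without Biopython. The sketch below is a minimal FASTA reader written as a generator; the function name `parse_fasta` and the in-memory example text are assumptions made for the illustration, and the parser ignores many details that a real one handles. In Biopython itself, `SeqIO.parse` plays the iterator role, while `SeqIO.to_dict` and `SeqIO.index` provide dictionary-style access.

```python
import io
from typing import Iterator, List, Optional, TextIO, Tuple


def parse_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs, holding one record in memory at a time."""
    identifier: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header ends the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            header = line[1:].split()
            identifier = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Made-up FASTA text standing in for a file on disk.
FASTA_TEXT = ">seq1 first example\nAGGCTT\nCTCGTA\n>seq2 second example\nGGATTG\n"

# On-demand access: records are consumed one by one and then discarded.
total_length = sum(len(seq) for _, seq in parse_fasta(io.StringIO(FASTA_TEXT)))

# In-memory access: every record is kept in a dictionary for fast lookup.
records = dict(parse_fasta(io.StringIO(FASTA_TEXT)))

print(total_length, sorted(records))
```

The generator keeps memory use roughly constant regardless of file size, whereas the dictionary grows with the number of records but allows repeated random access.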

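Accessing online databases

Through the Bio.Entrez module, Biopython users can download biological data from NCBI databases. Each of the functions provided by the Entrez search engine, including searching for and downloading records, is available through functions in this module.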

>>> # This script downloads genomes from the NCBI Nucleotide database and saves them in a FASTA file.
>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> Entrez.email = "my_email@example.com"  # contact address sent with each request
>>> records_to_download = ["FO834906.1", "FO203501.1"]
>>> # The with block closes the output file even if a download fails.
>>> with open("all_records.fasta", "w") as output_file:
...     for record_id in records_to_download:
...         handle = Entrez.efetch(db="nucleotide", id=record_id, rettype="gb", retmode="text")
...         seq_record = SeqIO.read(handle, format="gb")
...         handle.close()
...         output_file.write(seq_record.format("fasta"))

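Phylogeny

Figure 1: A rooted phylogenetic tree created by Bio.Phylo, showing the relationships between Apaf-1 homologs in different organisms^[12]

The Bio.Phylo module provides tools for working with and visualising phylogenetic trees. It supports reading and writing a variety of file formats, including Newick, NEXUS and phyloXML. Common tree manipulations and traversals are supported via the Tree and Clade objects. Examples include converting and collating tree files, extracting subsets from a tree, changing a tree's root, and analysing branch features such as length or score.^[13]

Rooted trees can be drawn in ASCII or using matplotlib (see Figure 1), and the Graphviz library can be used to create unrooted layouts (see Figure 2).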
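The kind of tree traversal and branch analysis described above can be sketched with a small recursive data structure. The `Clade` class below is a hypothetical stand-in written for this illustration, not Bio.Phylo's own class, and its method names are assumptions; the branch lengths are made up.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Clade:
    """A node in a rooted tree; a clade without children is a terminal (leaf)."""
    name: Optional[str] = None
    branch_length: float = 0.0
    children: List["Clade"] = field(default_factory=list)

    def is_terminal(self) -> bool:
        return not self.children

    def find_clades(self) -> Iterator["Clade"]:
        """Depth-first, pre-order traversal of this clade and its descendants."""
        yield self
        for child in self.children:
            yield from child.find_clades()

    def get_terminals(self) -> List["Clade"]:
        return [clade for clade in self.find_clades() if clade.is_terminal()]

    def total_branch_length(self) -> float:
        return sum(clade.branch_length for clade in self.find_clades())

    def to_newick(self) -> str:
        """Serialise the clade in Newick notation, without the closing semicolon."""
        label = self.name or ""
        if self.children:
            inner = ",".join(child.to_newick() for child in self.children)
            label = f"({inner}){label}"
        return f"{label}:{self.branch_length:g}"


# Made-up three-leaf tree: A is the sister group of (B, C).
root = Clade(children=[
    Clade("A", 1.0),
    Clade(branch_length=0.5, children=[Clade("B", 2.0), Clade("C", 3.0)]),
])

leaf_names = [clade.name for clade in root.get_terminals()]
tree_length = root.total_branch_length()
newick = root.to_newick() + ";"
print(leaf_names, tree_length, newick)
```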

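Genome diagrams

Figure 3: A diagram of the genes on the pKPS77 plasmid,^[14] visualised using the GenomeDiagram module in Biopython

The GenomeDiagram module provides methods of visualising sequences within Biopython.^[15] Sequences can be drawn in a linear or circular form (see Figure 3), and many output formats are supported, including PDF and PNG. A diagram is created by making tracks and then adding sequence features to those tracks. By looping over a sequence's features and using their attributes to decide whether and how each one is added to the diagram's tracks, one can exercise fine control over the appearance of the final diagram. Cross-links can be drawn between different tracks, so that multiple sequences can be compared in a single diagram.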

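Macromolecular structure

The Bio.PDB module, added to Biopython in 2003,^[16] can load molecular structures from PDB and mmCIF files. The Structure object is central to this module and organises macromolecular structure hierarchically: Structure objects contain Model objects, which contain Chain objects, which contain Residue objects, which contain Atom objects. Disordered residues and atoms have their own classes, DisorderedResidue and DisorderedAtom, which describe their uncertain positions.

Using Bio.PDB, one can navigate through the individual components of a macromolecular structure file, for example to examine each atom in a protein. Common analyses can be carried out, such as measuring distances or angles, comparing residues and calculating residue depth.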
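The Structure, Model, Chain, Residue, Atom hierarchy and a simple distance measurement can be sketched in plain Python. The dataclasses below reuse the level names from the text but are hypothetical stand-ins, not the Bio.PDB classes; their attribute names, the `get_atoms` and `distance` helpers, and the coordinates are all assumptions made for the illustration.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Iterator, Tuple


@dataclass
class Atom:
    name: str
    coord: Tuple[float, float, float]


@dataclass
class Residue:
    resname: str
    atoms: Dict[str, Atom] = field(default_factory=dict)


@dataclass
class Chain:
    chain_id: str
    residues: Dict[int, Residue] = field(default_factory=dict)


@dataclass
class Model:
    model_id: int
    chains: Dict[str, Chain] = field(default_factory=dict)


@dataclass
class Structure:
    structure_id: str
    models: Dict[int, Model] = field(default_factory=dict)

    def get_atoms(self) -> Iterator[Atom]:
        """Walk the hierarchy from models down to individual atoms."""
        for model in self.models.values():
            for chain in model.chains.values():
                for residue in chain.residues.values():
                    yield from residue.atoms.values()


def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in the units of their coordinates."""
    return math.dist(a.coord, b.coord)


# Made-up two-residue structure; the coordinates are not from a real PDB entry.
gly = Residue("GLY", {"CA": Atom("CA", (0.0, 0.0, 0.0))})
ala = Residue("ALA", {"CA": Atom("CA", (3.0, 4.0, 0.0))})
structure = Structure("demo", {0: Model(0, {"A": Chain("A", {1: gly, 2: ala})})})

# Navigate explicitly: structure -> model 0 -> chain A -> residue -> atom.
chain_a = structure.models[0].chains["A"]
ca_distance = distance(chain_a.residues[1].atoms["CA"], chain_a.residues[2].atoms["CA"])
atom_count = sum(1 for _ in structure.get_atoms())
print(ca_distance, atom_count)
```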

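Population genetics

The Bio.PopGen module adds support to Biopython for Genepop, a software package for the statistical analysis of population genetics.^[17] This allows analyses of Hardy–Weinberg equilibrium, linkage disequilibrium and other features of a population's allele frequencies.

The module can also carry out population genetic simulations based on coalescent theory, using the fastsimcoal2 program.^[18]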
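As a rough illustration of what a Hardy–Weinberg analysis involves, the sketch below computes the genotype counts expected under equilibrium for one locus with two alleles and compares them with observed counts using a chi-square statistic. It uses neither Biopython nor Genepop and should not be read as the method either of them applies; the function names and the genotype counts are made up for the example.

```python
from typing import Dict


def allele_frequency(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Frequency of allele A from counts of the AA, AB and BB genotypes."""
    total = n_aa + n_ab + n_bb
    if total == 0:
        raise ValueError("at least one individual is required")
    return (2 * n_aa + n_ab) / (2 * total)


def hardy_weinberg_expected(n_aa: int, n_ab: int, n_bb: int) -> Dict[str, float]:
    """Expected genotype counts p^2 N, 2pq N and q^2 N under equilibrium."""
    total = n_aa + n_ab + n_bb
    p = allele_frequency(n_aa, n_ab, n_bb)
    q = 1.0 - p
    return {"AA": p * p * total, "AB": 2 * p * q * total, "BB": q * q * total}


def chi_square(observed: Dict[str, float], expected: Dict[str, float]) -> float:
    """Sum of (observed - expected)^2 / expected over the genotype classes."""
    return sum((observed[g] - expected[g]) ** 2 / expected[g] for g in expected)


# Made-up genotype counts for 1000 individuals.
observed = {"AA": 298, "AB": 489, "BB": 213}
p = allele_frequency(observed["AA"], observed["AB"], observed["BB"])
expected = hardy_weinberg_expected(observed["AA"], observed["AB"], observed["BB"])
statistic = chi_square(observed, expected)
print(p, expected, statistic)
```

A small statistic indicates that the observed counts are close to the equilibrium expectation; judging significance would additionally require a reference distribution or an exact test.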

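Wrappers for command-line tools

Many of Biopython's modules contain command-line wrappers for commonly used tools, allowing those tools to be used from within Biopython. The wrapped tools include BLAST, Clustal, PhyML, EMBOSS and SAMtools. Users can subclass a generic wrapper class to add support for other command-line tools.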

See also

Open Bioinformatics Foundation
BioPerl
BioRuby
BioJS
BioJava

References

[1] Chapman, Brad; Chang, Jeff. Biopython: Python tools for computational biology. ACM SIGBIO Newsletter. August 2000, 20 (2): 15–19. S2CID 9417766. doi:10.1145/360262.360268.
[2] Release biopython-181: Commit Release 1.81 (#4233). [22 April 2023].
[3] Release 183. 10 January 2024 [19 January 2024].
[4] Cock, Peter JA; Antao, Tiago; Chang, Jeffery T; Chapman, Brad A; Cox, Cymon J; Dalke, Andrew; Friedberg, Iddo; Hamelryck, Thomas; Kauff, Frank; Wilczynski, Bartek; de Hoon, Michiel JL. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 20 March 2009, 25 (11): 1422–3. PMC 2682512. PMID 19304878. doi:10.1093/bioinformatics/btp163.
[5] Refer to the Biopython website for other papers describing Biopython (archived copy at the Internet Archive), and a list of over one hundred publications using/citing Biopython (archived copy at the Internet Archive).
[6] Mangalam, Harry. The Bio* toolkits - a brief overview. Briefings in Bioinformatics. September 2002, 3 (3): 296–302. PMID 12230038. doi:10.1093/bib/3.3.296.
[7] Chapman, Brad, The Biopython Project: Philosophy, functionality and facts (PDF), 11 March 2004 [11 September 2014], (archived (PDF) from the original on 2023-06-03)
[8] List of Biopython contributors, [11 September 2014], (archived from the original on 11 September 2014)
[9] Knight, R; Maxwell, P; Birmingham, A; Carnes, J; Caporaso, J. G.; Easton, B. C.; Eaton, M; Hamady, M; Lindsay, H; Liu, Z; Lozupone, C. PyCogent: A toolkit for making sense from sequence. Genome Biology. 2007, 8 (8): R171. PMC 2375001. PMID 17708774. doi:10.1186/gb-2007-8-8-r171.
[10] Daley, Chris, Biopython 1.77 released, [6 October 2021], (archived from the original on 2023-10-29)
[11] Chang, Jeff; Chapman, Brad; Friedberg, Iddo; Hamelryck, Thomas; de Hoon, Michiel; Cock, Peter; Antao, Tiago; Talevich, Eric; Wilczynski, Bartek, Biopython Tutorial and Cookbook, 29 May 2014 [28 August 2014], (archived from the original on 2015-01-01)
[12] Zmasek, Christian M; Zhang, Qing; Ye, Yuzhen; Godzik, Adam. Surprising complexity of the ancestral apoptosis network. Genome Biology. 24 October 2007, 8 (10): R226. PMC 2246300. PMID 17958905. doi:10.1186/gb-2007-8-10-r226.
[13] Talevich, Eric; Invergo, Brandon M; Cock, Peter JA; Chapman, Brad A. Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython. BMC Bioinformatics. 21 August 2012, 13 (209): 209. PMC 3468381. PMID 22909249. doi:10.1186/1471-2105-13-209.
[14] Klebsiella pneumoniae strain KPS77 plasmid pKPS77, complete sequence. NCBI. [10 September 2014].
[15] Pritchard, Leighton; White, Jennifer A; Birch, Paul RJ; Toth, Ian K. GenomeDiagram: a python package for the visualization of large-scale genomic data. Bioinformatics. March 2006, 22 (5): 616–617. PMID 16377612. doi:10.1093/bioinformatics/btk021.
[16] Hamelryck, Thomas; Manderick, Bernard. PDB file parser and structure class implemented in Python. Bioinformatics. 10 May 2003, 19 (17): 2308–2310. PMID 14630660. doi:10.1093/bioinformatics/btg299.
[17] Rousset, François. GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources. January 2008, 8 (1): 103–106. PMID 21585727. S2CID 25776992. doi:10.1111/j.1471-8286.2007.01931.x.
[18] Excoffier, Laurent; Foll, Matthieu. fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios. Bioinformatics. 1 March 2011, 27 (9): 1332–1334. PMID 21398675. doi:10.1093/bioinformatics/btr124.

External links

Official website
Biopython tutorial (PDF)
Biopython source code on GitHub
