GenBank

GenBank is the genetic sequence database of the United States National Institutes of Health (NIH), a publicly available collection of DNA sequences. A new release is published every two months.

GenBank is part of the International Nucleotide Sequence Database Collaboration, which is made up of the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the National Center for Biotechnology Information (NCBI). These organizations exchange data daily. GenBank and its collaborators receive genetic sequences produced in laboratories around the world, from more than 500,000 formally described species. GenBank continues to grow exponentially, with the amount of data it holds doubling roughly every 18 months. According to the documentation for GenBank release 250.0 (June 2022), the database contains more than 2.45 billion sequences, comprising more than 17 trillion nucleotide bases.
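The 18-month doubling time can be turned into a yearly growth factor with a little arithmetic. The sketch below assumes idealized, constant exponential growth, which real release sizes only approximate; the function name `projected_size` is illustrative, and the 17 trillion figure is simply the release 250.0 total quoted above.

```python
def projected_size(current: float, months: float, doubling_months: float = 18.0) -> float:
    """Project a quantity forward, assuming it doubles every `doubling_months`."""
    return current * 2 ** (months / doubling_months)


# Growth factor over one year when the size doubles every 18 months: 2^(12/18).
annual_factor = projected_size(1.0, 12)
print(f"annual growth factor: {annual_factor:.3f}")  # about 1.587

# Illustration only: 36 months is two doubling periods, so the size quadruples.
bases_2022 = 17e12
projected = projected_size(bases_2022, 36)
print(f"projected bases after 36 months: {projected:.2e}")
```

Under this assumption the database grows by roughly 59% per year. The projection is an extrapolation, not a reported figure.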

Direct submissions to GenBank are made using BankIt, a web-based form, or the stand-alone program Sequin. Upon receipt of a sequence, GenBank staff assign it an accession number and perform quality checks. The submissions are then released to the public database, where the entries can be retrieved through Entrez or downloaded via FTP. Most Expressed Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) submissions come from large sequencing centers. GenBank's direct submissions group also processes complete microbial genome sequences.
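As a rough illustration of retrieval by accession number, the sketch below builds a request to NCBI's E-utilities `efetch` endpoint, the programmatic interface to Entrez. The text above does not describe this interface, so treat it as one possible approach. The helper names `build_efetch_url` and `fetch_record` are hypothetical, and `U49845` is used only as an example accession.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities endpoint for fetching full records from Entrez databases.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "gb") -> str:
    """Build an efetch URL for a nucleotide record ("gb" = GenBank flat file)."""
    params = {
        "db": "nuccore",      # Entrez nucleotide database
        "id": accession,      # accession number assigned by GenBank staff
        "rettype": rettype,   # "gb" for flat file, "fasta" for sequence only
        "retmode": "text",
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"


def fetch_record(accession: str, rettype: str = "gb", timeout: float = 30.0) -> str:
    """Download one record as text. Requires network access."""
    with urlopen(build_efetch_url(accession, rettype), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network; calling fetch_record("U49845") would.
url = build_efetch_url("U49845")
print(url)
```

For bulk downloads, the FTP distribution of complete releases is usually a better fit than fetching records one at a time.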

History

Walter Goad of the Theoretical Biology and Biophysics group at Los Alamos National Laboratory (LANL) and others founded the Los Alamos Sequence Database in 1979. This work culminated in 1982 with the creation of GenBank by the National Institutes of Health (NIH), the National Science Foundation, and the US Departments of Energy and Defense. LANL collaborated on GenBank with the firm Bolt, Beranek and Newman. By the end of 1983, more than 2,000 sequences were stored in it.

In the mid-1980s, Intelligenetics, a bioinformatics company at Stanford University, managed the GenBank project in collaboration with LANL. The GenBank project launched the BIOSCI/Bionet newsgroup, one of the first bioinformatics community projects on the Internet, whose purpose was to promote open communication among bioscientists. From 1989 to 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information (NCBI).

Growth

According to GenBank release 250.0 (June 2022), the traditional GenBank records comprise more than 239 million loci (sequences) and 1.39 trillion nucleotide bases. The database also includes additional data sets that are processed automatically from the traditional sequences. These come from unfinished sequencing projects of the Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA) and Targeted Locus Study (TLS) types.

The 20 organisms with the highest number of base pairs in the database are:

Organism                             Base pairs
Triticum aestivum                    2.15443744183 × 10^11
SARS-CoV-2                           1.65771825746 × 10^11
Hordeum vulgare subsp. vulgare       1.01344340096 × 10^11
Mus musculus                         3.0614386913 × 10^10
Homo sapiens                         2.7834633853 × 10^10
Avena sativa                         2.1127939362 × 10^10
Escherichia coli                     1.5517830491 × 10^10
Klebsiella pneumoniae                1.1144687122 × 10^10
Danio rerio                          1.0890148966 × 10^10
Bos taurus                           1.0650671156 × 10^10
Triticum turgidum subsp. durum       9.981529154 × 10^9
Zea mays                             7.412263902 × 10^9
Avena insularis                      6.924307246 × 10^9
Secale cereale                       6.749247504 × 10^9
Rattus norvegicus                    6.548854408 × 10^9
Aegilops longissima                  5.920483689 × 10^9
Canis lupus familiaris               5.776499164 × 10^9
Aegilops sharonensis                 5.272476906 × 10^9
Sus scrofa                           5.179074907 × 10^9
Rhinatrema bivittatum                5.178626132 × 10^9