GenBank
GenBank is the genetic sequence database of the United States National Institutes of Health (NIH), a publicly available collection of DNA sequences. It is updated every two months.
GenBank is part of the International Nucleotide Sequence Database Collaboration, which is made up of the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the National Center for Biotechnology Information (NCBI). These organizations exchange data on a daily basis. GenBank and its collaborators receive genetic sequences produced in laboratories around the world, from more than 500,000 formally described species. GenBank continues to grow at an exponential rate, doubling in size every 18 months. According to the documentation for GenBank release 250.0 (June 2022), the database contains more than 2.45 billion sequences, comprising more than 17 trillion nucleotide bases.
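The 18-month doubling time quoted above corresponds to the growth law N(t) = N0 · 2^(t/18), with t in months. The short Python sketch below applies that formula to the June 2022 figure of roughly 17 trillion bases. The function name `projected_bases` is introduced here for illustration only, and a constant doubling time is an assumption, so the results are an extrapolation rather than reported release data.

```python
def projected_bases(current_bases: float, months_ahead: float,
                    doubling_months: float = 18.0) -> float:
    """Extrapolate database size, assuming a constant doubling time."""
    return current_bases * 2 ** (months_ahead / doubling_months)


# Starting point: the "more than 17 trillion" bases cited for June 2022.
START_BASES = 17e12

for months in (18, 36, 54):
    size = projected_bases(START_BASES, months)
    print(f"+{months} months: {size:.2e} bases")
```

Under this assumption the database would hold about 34 trillion bases after 18 months and about 68 trillion after 36 months.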
Direct submissions to GenBank are made using BankIt, a web-based form, or the stand-alone program Sequin. Upon receipt of a sequence, GenBank staff assign it an accession number and perform quality checks. The submissions are then released to the public database, where the entries can be retrieved through Entrez or downloaded via FTP. Most Expressed Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) submissions come from large sequencing centers. GenBank's direct submissions group also processes complete microbial genome sequences.
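As an illustration of retrieval through Entrez, the following Python sketch builds a request to EFetch, one of NCBI's E-utilities, for a record in GenBank flat-file format. The helper names `genbank_record_url` and `fetch_genbank_record` are hypothetical, and the accession U49845 is used only as an example; this is one possible way to query the service, not an official client.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str) -> str:
    """Build an EFetch URL for one nucleotide record in GenBank text format."""
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # accession number assigned by GenBank
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain text rather than XML
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_genbank_record(accession: str, timeout: float = 30.0) -> str:
    """Download the record over HTTPS (requires network access)."""
    with urlopen(genbank_record_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call fetch_genbank_record() to download the record.
    print(genbank_record_url("U49845"))
```

For bulk downloads the FTP distribution mentioned above is usually the better fit, since EFetch is intended for individual records or small batches.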
History
Walter Goad of the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory (LANL), together with others, established the Los Alamos Sequence Database in 1979. This work culminated in 1982 with the creation of GenBank, funded by the National Institutes of Health (NIH), the National Science Foundation, and the US Departments of Energy and Defense. LANL collaborated on GenBank with the firm Bolt, Beranek and Newman. By the end of 1983, more than 2,000 sequences were stored in it.
In the mid-1980s, the bioinformatics company Intelligenetics at Stanford University managed the GenBank project in collaboration with LANL. The GenBank project launched the BIOSCI/Bionet newsgroups, one of the earliest bioinformatics community projects on the Internet, whose purpose was to promote open communication among bioscientists. From 1989 to 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information (NCBI).
Growth
According to GenBank release 250.0 (June 2022), the traditional GenBank records comprise more than 239 million sequences (loci) and 1.39 trillion nucleotide bases. The database also includes additional, automatically processed data sets beyond the traditional records. These come from unfinished sequencing projects submitted as Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA) and Targeted Locus Study (TLS) data.
The 20 organisms with the highest number of base pairs in the database are:
Organism | Base pairs |
---|---|
Triticum aestivum | 2.15443744183 × 10¹¹ |
SARS-CoV-2 | 1.65771825746 × 10¹¹ |
Hordeum vulgare subsp. vulgare | 1.01344340096 × 10¹¹ |
Mus musculus | 3.0614386913 × 10¹⁰ |
Homo sapiens | 2.7834633853 × 10¹⁰ |
Avena sativa | 2.1127939362 × 10¹⁰ |
Escherichia coli | 1.5517830491 × 10¹⁰ |
Klebsiella pneumoniae | 1.1144687122 × 10¹⁰ |
Danio rerio | 1.0890148966 × 10¹⁰ |
Bos taurus | 1.0650671156 × 10¹⁰ |
Triticum turgidum subsp. durum | 9.981529154 × 10⁹ |
Zea mays | 7.412263902 × 10⁹ |
Avena insularis | 6.924307246 × 10⁹ |
Secale cereale | 6.749247504 × 10⁹ |
Rattus norvegicus | 6.548854408 × 10⁹ |
Aegilops longissima | 5.920483689 × 10⁹ |
Canis lupus familiaris | 5.776499164 × 10⁹ |
Aegilops sharonensis | 5.272476906 × 10⁹ |
Sus scrofa | 5.179074907 × 10⁹ |
Rhinatrema bivittatum | 5.178626132 × 10⁹ |