GenBank

GenBank is the genetic sequence database of the United States National Institutes of Health (NIH), a publicly available collection of DNA sequences. It is updated every two months.

GenBank is part of the International Nucleotide Sequence Database Collaboration, which is made up of the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at the National Center for Biotechnology Information (NCBI). These organizations exchange data on a daily basis. GenBank and its collaborators receive genetic sequences produced in laboratories around the world, from more than 500,000 formally described species. GenBank continues to grow at an exponential rate, doubling in size every 18 months. According to the documentation for GenBank release 250.0, as of June 2022 the database contained more than 2.45 billion sequences, comprising more than 17 trillion nucleotide bases.
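The 18-month doubling time quoted above can be turned into simple arithmetic. The following is a minimal sketch, assuming idealized exponential growth with a constant doubling time; the function names are hypothetical and chosen only for illustration, and real release sizes will not follow the curve exactly.

```python
def projected_size(current_size: float, months_ahead: float,
                   doubling_months: float = 18.0) -> float:
    """Size after `months_ahead` months, assuming one doubling every `doubling_months`."""
    return current_size * 2 ** (months_ahead / doubling_months)


def annual_growth_factor(doubling_months: float = 18.0) -> float:
    """Multiplicative growth over 12 months implied by the doubling time."""
    return 2 ** (12 / doubling_months)


if __name__ == "__main__":
    # Illustrative starting point: the 17 trillion bases cited for release 250.0.
    bases_now = 17e12
    print(f"Projected bases in 36 months: {projected_size(bases_now, 36):.3e}")
    print(f"Implied yearly growth factor: {annual_growth_factor():.4f}")
```

Under this assumption, a doubling time of 18 months corresponds to growth of roughly 59% per year.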

Direct submissions to GenBank are made using BankIt, which is a web-based form, or the stand-alone program Sequin. Upon receipt of a sequence, GenBank staff assign it an accession number and perform quality checks. The submissions are then released to the public database, where the entries can be retrieved through Entrez or downloaded via FTP. Most Expressed Sequence Tag (EST), Sequence Tagged Site (STS), Genome Survey Sequence (GSS), and High-Throughput Genome Sequence (HTGS) submissions come from large sequencing centers. GenBank's direct submissions group also processes complete microbial genome sequences.
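As an illustration of retrieval through Entrez, the sketch below builds a request to the EFetch endpoint of NCBI's E-utilities and downloads one record as a GenBank flat file. It uses only the Python standard library. The helper names are hypothetical, and the accession in the usage note is just an example; for heavier use, NCBI's usage guidelines (rate limits, API keys) would also need to be taken into account.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the EFetch utility in NCBI's E-utilities.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str) -> str:
    """Return an EFetch URL requesting one nucleotide record in GenBank flat-file format."""
    params = {
        "db": "nuccore",     # nucleotide database
        "id": accession,     # accession number assigned by GenBank
        "rettype": "gb",     # GenBank flat-file format
        "retmode": "text",   # plain-text response
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_genbank_record(accession: str, timeout: float = 30.0) -> str:
    """Download the record text; requires network access."""
    with urlopen(build_efetch_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_genbank_record("U49845")` would request the record with that accession; the returned text can then be parsed or saved locally.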

History

Walter Goad of the Theoretical Biology and Biophysics group at Los Alamos National Laboratory (LANL) and others founded the Los Alamos Sequence Database in 1979. This effort culminated in 1982 with the creation of GenBank by the National Institutes of Health (NIH), the National Science Foundation, and the US Departments of Energy and Defense. LANL collaborated on GenBank with the firm Bolt, Beranek and Newman, and by the end of 1983 more than 2,000 sequences were stored in it.

In the mid-1980s, the bioinformatics company Intelligenetics at Stanford University managed the GenBank project in collaboration with LANL. The GenBank project launched the BIOSCI/Bionet newsgroups, one of the first projects of the bioinformatics community on the Internet, whose purpose was to promote open communication among bioscientists. From 1989 to 1992, the GenBank project transitioned to the newly created National Center for Biotechnology Information (NCBI).

Growth

According to GenBank release 250.0 (June 2022), the traditional GenBank records comprise more than 239 million loci (sequences) and 1.39 trillion nucleotide bases. The database also includes additional, automatically processed data sets beyond the traditional sequence records. These come from unfinished sequencing projects: Whole Genome Shotgun (WGS), Transcriptome Shotgun Assembly (TSA) and Targeted Locus Study (TLS).

The 20 organisms with the highest number of base pairs in the database are:

Organism	Base pairs
Triticum aestivum	2.15443744183 × 10¹¹
SARS-CoV-2	1.65771825746 × 10¹¹
Hordeum vulgare subsp. vulgare	1.01344340096 × 10¹¹
Mus musculus	3.0614386913 × 10¹⁰
Homo sapiens	2.7834633853 × 10¹⁰
Avena sativa	2.1127939362 × 10¹⁰
Escherichia coli	1.5517830491 × 10¹⁰
Klebsiella pneumoniae	1.1144687122 × 10¹⁰
Danio rerio	1.0890148966 × 10¹⁰
Bos taurus	1.0650671156 × 10¹⁰
Triticum turgidum subsp. durum	9.981529154 × 10⁹
Zea mays	7.412263902 × 10⁹
Avena insularis	6.924307246 × 10⁹
Secale cereale	6.749247504 × 10⁹
Rattus norvegicus	6.548854408 × 10⁹
Aegilops longissima	5.920483689 × 10⁹
Canis lupus familiaris	5.776499164 × 10⁹
Aegilops sharonensis	5.272476906 × 10⁹
Sus scrofa	5.179074907 × 10⁹
Rhinatrema bivittatum	5.178626132 × 10⁹