Bioinformatics
Bioinformatics can be broadly defined as the application of computational techniques and statistics to the management and analysis of biological data. The terms bioinformatics, computational biology, biological informatics and, sometimes, biocomputing are used in many contexts as synonyms, and refer to closely linked interdisciplinary fields of study that require the use or development of techniques from computer engineering, applied mathematics, statistics, computer science, artificial intelligence, chemistry and biochemistry, with which problems are solved and data analyzed, or systems and mechanisms simulated, all of a biological nature and usually (but not exclusively) at the molecular level. The core of these fields is the use of computational resources to solve or investigate problems on scales of such magnitude that they exceed human discernment. Research in computational biology often overlaps with systems biology.
Major research efforts in these fields include sequence alignment, gene prediction, genome assembly, protein structural alignment, protein structure prediction, prediction of gene expression, protein-protein interactions, and the modeling of evolution.
A constant in bioinformatics and computational biology projects is the use of mathematical tools to extract useful information from data produced by high-throughput biological techniques, such as genome sequencing. In particular, the assembly of high-quality genomic sequences from the fragments obtained by large-scale DNA sequencing is an area of great interest. Other objectives include the study of gene regulation in order to interpret gene expression profiles obtained from DNA microarrays or mass spectrometry.
Concepts and scope
As stated in the introduction, the terms bioinformatics, computational biology and biocomputing are often used synonymously, frequently appearing undifferentiated in the basic literature. However, each term has its own specific areas of application. The NIH (National Institutes of Health), for example, while acknowledging that no definition can completely eliminate the overlap between activities, explicitly defines the terms bioinformatics and computational biology:
- Bioinformatics is the research, development or application of computational tools and approaches to expand the use of biological, medical, behavioral or health data, including the tools used to acquire, store, organize, analyze or visualize such data.
- Computational biology is the development and application of theoretical methods, data analysis methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral and social systems.
Bioinformatics, in this view, would be concerned mainly with information, while computational biology would be concerned with hypotheses. The term biocomputing, on the other hand, is usually associated with current research on biocomputers; T. Kaminuma, for example, defines it as follows:
- Biocomputation is the construction and use of computers that contain biological components or function as living organisms.
Apart from the formal definitions of reference organizations or institutions, the manuals on this subject provide their own operational definitions, logically linked to a greater or lesser extent with those already seen. As an example, David W. Mount, in his widely disseminated text on bioinformatics, states that:
... bioinformatics focuses more on the development of practical tools for data management and analysis (for example, the presentation of genomic information and sequence analysis), but with less emphasis on efficiency and accuracy.
On the other hand, and according to the same author:
... computational biology generally relates to the development of new and efficient algorithms that can be shown to work on a difficult problem, such as the multiple alignment of sequences or the assembly of genome fragments.
Finally, there is sometimes an explicit categorization of these concepts according to which bioinformatics is a subcategory of computational biology. For example, biologist Cynthia Gibas notes that:
Bioinformatics is the science of using information to understand biology. (...) Strictly speaking, bioinformatics is a subset of the larger field of computational biology, (the latter being) the application of quantitative analytical techniques to the modeling of biological systems.
However, referring to her own text (Developing Bioinformatics Computer Skills), she immediately goes on to clarify that:
... we will move back and forth between bioinformatics and computational biology. The distinctions between the two are not important for our purposes here.
On many occasions, therefore, the terms are interchangeable and, except in certain specialized contexts, the intended meaning remains clear whichever of them is used.
History
In what follows, and in addition to the relevant facts directly related to the development of bioinformatics, some scientific and technological milestones will be mentioned that will serve to put such development in an appropriate context.
We will begin this brief history in the 1950s, the decade in which Watson and Crick proposed the double-helix structure of DNA (1953), the first protein (bovine insulin) was sequenced by F. Sanger (1955), and the first integrated circuit was built by Jack Kilby at the Texas Instruments laboratories (1958).
The first decades: the 1960s and 1970s
In the 1960s, L. Pauling elaborated his theory of molecular evolution (1962), and Margaret Dayhoff, one of the pioneers of bioinformatics, published the first Atlas of Protein Sequences (1965). Continued in later years, the Atlas would become a basic work for the statistical development, some years later, of the PAM substitution matrices, and a precursor of today's protein databases. In computer technology, protocols for switching data packets over computer networks were presented at ARPA (Advanced Research Projects Agency) in 1968, which shortly afterwards made it possible to link computers at several US universities: thus was born ARPANET (1969), the embryo of what would later become the Internet.
The Needleman-Wunsch algorithm for sequence alignment was published in 1970, the Brookhaven Protein Data Bank was established in 1971, the first recombinant DNA molecule was created (Paul Berg, 1972), E. M. Southern developed the Southern blot technique for locating specific DNA sequences (1976), DNA sequencing and the development of software to analyze it began (F. Sanger, software by R. Staden, 1977), and in 1978 the first complete gene sequence of an organism was published: that of the phage Φ-X174 (5,386 base pairs encoding 9 proteins). On the technological side, this decade saw the development of Ethernet, a communications protocol that would facilitate the interconnection of computers, mainly in local networks, by Robert Metcalfe (1973), and of TCP (Transmission Control Protocol) by Vinton Cerf and Robert Kahn (1974), one of the basic protocols of the Internet.
The 1980s
In the 1980s, important advances were made in various areas:
- Scientific: after the sequencing of phage Φ-X174 at the end of the 1970s, in 1982 F. Sanger obtained the genome sequence of phage λ (lambda) using a new technique he himself had developed, shotgun sequencing; between 1981 and 1982 K. Wüthrich published the method of using NMR (nuclear magnetic resonance) to determine protein structures; Ford Doolittle worked with the concept of the sequence motif ("surviving similarities", as described in the abstract of his article) in 1981; the discovery in 1983 of PCR (polymerase chain reaction) led to the multiplication of DNA samples, which would make their analysis possible; in 1987, D. T. Burke et al. described the use of yeast artificial chromosomes (YAC, Yeast Artificial Chromosome), and Kulesh et al. laid the foundations of DNA chips.
- Bioinformatic: as regards the development of algorithms, methods and programs, these years saw the Smith-Waterman algorithm (1981), the Wilbur-Lipman algorithm for searching sequence databases (1983), FASTP/FASTN (fast similarity searches between sequences, 1985), and the FASTA algorithm for sequence comparison (Pearson and Lipman, 1988); hidden Markov models began to be used to analyze the patterns and composition of sequences (Churchill, 1989), which would later make it possible to locate genes and predict protein structures; and important databases appeared (GenBank, 1986). On the institutional side, the first Santa Fe Conference (1985) discussed the sequencing of the human genome, announced a year later by the U.S. Department of Energy, which launched pilot projects to develop critical resources and technologies; in 1987 the NIH (National Institutes of Health) began to contribute funds to genome projects, while 1988 saw the start of the Human Genome Initiative, finally better known as the Human Genome Project.
- Technological: 1983 saw the appearance of the Compact Disc (CD) standard in its computer-readable version (the Yellow Book); Jon Postel and Paul Mockapetris developed the DNS domain name system in 1984, necessary for correct and agile addressing on the Internet; in 1987 Larry Wall developed the Perl programming language, which would later be widely used in bioinformatics; and the end of the decade saw the first major private companies with activities related to genomes, proteins, biochemistry, etc. (Genetics Computer Group - GCG; Oxford Molecular Group, Ltd.), which would generally undergo major transformations years later.
1990s
In the 1990s we witnessed the following events:
- Scientific: in 1991 sequencing with ESTs (Expressed Sequence Tags) began; the following year the (low-resolution) genetic linkage map of the entire human genome was published; in 1995 the first bacterial genomes were completely sequenced (Haemophilus influenzae and Mycoplasma genitalium, with 1.8 million base pairs (Mbp) and 0.58 Mbp respectively); in 1996, in several steps (chromosome by chromosome), the same was done with the first eukaryotic genome, that of the yeast Saccharomyces cerevisiae (12 Mbp), followed in 1997 by the genome of Escherichia coli (4.7 Mbp), in 1998 by the first genome of a multicellular organism (the 97 Mbp of Caenorhabditis elegans), and, to close the decade, by the first completely sequenced human chromosome, chromosome 22 (33.4 Mbp), in 1999.
- Bioinformatic: fast similarity searches between sequences with BLAST (1990); the PRINTS database of protein fingerprints, by Attwood and Beck (1994); ClustalW, oriented to the multiple alignment of sequences, in 1994, and PSI-BLAST in 1997; at the end of the decade T-Coffee was developed, to be published in 2000. As regards institutional activities and new organizations: the DoE and the NIH presented to the US Congress, in 1990, a five-year joint plan for the Human Genome Project; the Sanger Centre (Hinxton, UK, 1993; now the Sanger Institute) and the European Bioinformatics Institute (EBI, Hinxton, UK, 1992-1995) were founded.
- Technological: Tim Berners-Lee invented the World Wide Web (1990) by applying network protocols that exploit the characteristics of hypertext; 1991 brought the definitive Internet protocols (CERN) and the first version of the Linux operating system, later widely used in scientific applications; in 1998 Craig Venter founded Celera, a company that would refine F. Sanger's shotgun sequencing approach and analyze the results with its own software.
Early 21st century
It should be noted that in the 2000s multiple genome sequencing projects came to fruition: in 2000, the genomes of Arabidopsis thaliana (100 Mbp) and Drosophila melanogaster (180 Mbp). After a working draft of the human genome DNA sequence in 2000, the human genome (3 Gbp) was published in 2001. Shortly afterwards, in 2003, two years ahead of schedule, the Human Genome Project was completed. To mention some of the genomes analyzed in the following years: in 2004 the draft genome of Rattus norvegicus (the rat) appeared; in 2005, that of the chimpanzee; in 2006, that of the rhesus macaque; in 2007, that of the domestic cat; and in 2008 a woman's genome was sequenced for the first time. Thanks to the development of appropriate techniques, we are currently witnessing a flood of genome sequencing for all kinds of organisms.
In 2003, the National Institute of Bioinformatics was founded in Spain, supported by the Fundación Genoma España (founded, in turn, a year earlier, with the aim of becoming a state instrument for promoting research in this field). In 2004, the US FDA (Food and Drug Administration) authorized for the first time the use of a DNA chip (for the detection of genetic variations in humans). In 2008, UniProt presented the first draft of the complete human proteome, with more than twenty thousand entries.
Little by little, the first bioinformatics programs have been refined, with more complete versions appearing, such as ClustalW 2.0 (rewritten in C++ in 2007).
Main research areas
Sequence analysis
Since phage Φ-X174 was sequenced in 1977 (a provisional sequence: the final complete sequence was published a year later), the DNA sequences of hundreds of organisms have been decoded and stored in databases. These data are analyzed to determine the genes that encode proteins, as well as regulatory sequences. Comparing genes within a species or between species can reveal similarities between protein functions, or relationships between species (the use of molecular phylogenetics to construct phylogenetic trees).
With the increasing amount of data, it has long been impractical to analyze DNA sequences manually. Today, computer programs are used to study the genomes of thousands of organisms, containing billions of nucleotides. These programs can compensate for mutations (exchanged, deleted or inserted bases) in the DNA sequence in order to identify sequences that are related but not identical. A variant of this sequence alignment is used in the sequencing process itself.
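Such mutation-tolerant comparison is classically done by dynamic programming, as in the Needleman-Wunsch algorithm mentioned earlier. A minimal sketch in Python (the scoring values are illustrative, not a standard substitution matrix):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score aligning a[:i] against b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a aligned against leading gaps
    for j in range(1, cols):
        score[0][j] = j * gap          # b aligned against leading gaps
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1]
                                              else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]

print(needleman_wunsch("GATTACA", "GATTTCA"))  # -> 5: six matches, one mismatch
```

The same table, with traceback added, yields the alignment itself; local alignment (Smith-Waterman) differs mainly in clamping scores at zero.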
So-called "shotgun" sequencing (used, for example, by The Institute for Genomic Research (TIGR, today the J. Craig Venter Institute) to sequence the first bacterial genome, that of Haemophilus influenzae) does not yield a sequential list of nucleotides, but instead the sequences of thousands of small DNA fragments (each about 600 to 800 nucleotides long). The ends of these fragments overlap and, when aligned correctly, reconstitute the complete genome of the organism in question.
Shotgun sequencing provides sequence data quickly, but the task of assembling the fragments can be quite complicated for very large genomes. In the case of the Human Genome Project, it took several months of processor time (on a circa-2000 DEC Alpha machine) to assemble the fragments. Shotgun sequencing is the method of choice for virtually all genomes sequenced today, and genome assembly algorithms are a critical area of bioinformatics research.
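The assembly step can be illustrated with a deliberately naive greedy overlap merger; real assemblers must also cope with sequencing errors, repeats, and enormous data volumes. The fragments below are invented, error-free reads of a short made-up sequence:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (>= min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_i is None:          # no overlaps left: fragments stay apart
            break
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]
    return reads

# Invented error-free fragments of the sequence "ATTAGACCTGCCGGAATAC":
fragments = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
print(greedy_assemble(fragments))  # -> ['ATTAGACCTGCCGGAATAC']
```

The quadratic all-pairs search is what makes this approach impractical at genome scale; production assemblers use indexed overlap detection or de Bruijn graphs instead.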
Another aspect of bioinformatics in sequence analysis is the automatic search for genes and regulatory sequences within a genome. Not all nucleotides within a genome are genes. Within the genome of more advanced organisms, large parts of the DNA serve no obvious purpose. This DNA, known as 'junk DNA', may, however, contain as yet unrecognized functional elements. Bioinformatics serves to bridge the gap between genome and proteome projects (for example, in the use of DNA sequences for protein identification).
Annotating genomes
In the context of genomics, annotation is the process of marking genes and other biological features in a DNA sequence. The first genome annotation software system was designed in 1995 by Owen White, a member of the team that sequenced and analyzed the first genome of a free-living organism to be decoded, the bacterium Haemophilus influenzae. White built software to locate genes (the places in the DNA sequence that encode proteins), transfer RNAs, and other features, and to make initial assignments of function to those genes. Most current genome annotation systems work in a similar way, but the programs available for genome analysis are continually changing and improving.
Computational evolutionary biology
Evolutionary biology is the study of the ancestral origin of species as well as their change over time. Computer science has supported evolutionary biologists in several key fields. It has allowed researchers to:
- Follow the evolution of a large number of organisms by measuring changes in their DNA, rather than relying exclusively on physical taxonomy or physiological observations.
- More recently, compare complete genomes, allowing the study of more complex evolutionary events, such as gene duplication, horizontal gene transfer, or the prediction of factors significant in bacterial speciation.
- Build complex computational models of populations to predict the behavior of the system over time.
- Follow and share information on a wide and growing number of species and organisms.
Future efforts will focus on reconstructing the increasingly complex phylogenetic tree of life. The area of computer science called evolutionary computation is occasionally confused with computational evolutionary biology, but the two areas are unrelated: evolutionary computation focuses on the development of genetic algorithms and other problem-solving strategies with a marked evolutionary and genetic inspiration.
Measuring biodiversity
The biodiversity of an ecosystem can be defined as the complete genomic set of all the species present in a particular environment, be it a biofilm in an abandoned mine, a drop of seawater, a handful of soil, or the entire biosphere of planet Earth. Databases are used to collect species names, descriptions, distributions, genetic information, population status and sizes, habitat needs, and the ways each organism interacts with other species. Specialized software is used to find, visualize and analyze this information and, most importantly, to share it with other interested parties. Computer simulation can model such things as population dynamics, or estimate the improvement of a variety's gene pool (in agriculture) or of a threatened population (in conservation biology). A very exciting potential of this field is the possibility of preserving the complete DNA sequences, or genomes, of species threatened with extinction, allowing the results of Nature's genetic experimentation to be recorded in silico for possible future reuse, even if those species were ultimately lost.
Significant examples include the Species 2000 or uBio projects.
Analysis of gene expression
The expression of many genes can be determined by measuring mRNA levels with multiple techniques, including DNA microarrays, EST (Expressed Sequence Tag) sequencing, SAGE (Serial Analysis of Gene Expression), MPSS (Massively Parallel Signature Sequencing), and various in situ hybridization applications. All of these techniques are extremely noise-prone and/or subject to biases in the biological measurement, and one of the main research areas in computational biology is the development of statistical tools to separate signal from noise in high-throughput gene expression studies. Such studies are often used to determine the genes involved in a disorder: one might, for example, compare microarray data from cancerous epithelial cells with data from non-cancerous cells to determine which transcripts are activated or repressed in a particular population of cancer cells.
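A first, deliberately simple pass at the signal-versus-noise question can be sketched with a log fold-change and Welch's t statistic over replicate measurements. The numbers below are invented for illustration; real analyses use moderated statistics and multiple-testing correction:

```python
from math import log2, sqrt
from statistics import mean, stdev

def t_statistic(group_a, group_b):
    """Welch's t: difference in group means scaled by the estimated noise."""
    se = sqrt(stdev(group_a) ** 2 / len(group_a) +
              stdev(group_b) ** 2 / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

# Hypothetical expression levels for one transcript, measured in
# replicate tumour and normal samples (arbitrary units).
tumour = [8.1, 7.9, 8.4, 8.0]
normal = [2.1, 1.9, 2.2, 2.0]

fold_change = log2(mean(tumour) / mean(normal))  # log2 ratio: 1.0 = two-fold
print(round(fold_change, 2), round(t_statistic(tumour, normal), 1))  # -> 1.98 48.1
```

A large fold-change with a small t statistic would indicate a difference that the replicate-to-replicate noise cannot support; requiring both is the essence of separating signal from noise here.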
Regulation analysis
Gene regulation is the complex orchestration of events that begins with an extracellular signal such as a hormone and leads to an increase or decrease in the activity of one or more proteins. Bioinformatics techniques have been applied to explore various steps in this process. For example, the analysis of a gene's promoter involves identifying and studying the sequence motifs in the DNA surrounding the coding region of a gene. These motifs influence the extent to which that region is transcribed into mRNA. Expression data can be used to infer gene regulation: microarray data from a wide variety of states of an organism can be compared to formulate hypotheses about the genes involved in each state. In a unicellular organism, stages of the cell cycle might be compared across various stress conditions (heat shock, starvation, etc.). Clustering algorithms (cluster analysis) can then be applied to this expression information to determine which genes are expressed simultaneously. The promoters of these genes can then be examined for over-represented regulatory sequences or elements.
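The clustering step can be illustrated with a bare-bones version of Lloyd's k-means algorithm. The expression profiles below are invented (one value per condition); real analyses normalize the data and use established libraries:

```python
def dist2(p, q):
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Coordinate-wise mean of a group of profiles."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm: assign each profile to its nearest centroid,
    then move each centroid to the mean of its cluster."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [centroid(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# Invented expression profiles (one value per condition) for six genes;
# two co-expressed groups should emerge.
profiles = [(0.1, 0.2, 5.1), (0.2, 0.1, 4.9), (0.0, 0.3, 5.0),
            (4.8, 5.2, 0.1), (5.1, 4.9, 0.2), (5.0, 5.0, 0.0)]
clusters = kmeans(profiles, centroids=[profiles[0], profiles[-1]])
print(sorted(len(c) for c in clusters))  # -> [3, 3]
```

Genes landing in the same cluster are candidates for co-regulation, and their promoters are the natural place to look for shared motifs.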
Analysis of protein expression
Protein microarrays and high-throughput mass spectrometry can provide a snapshot of the proteins present in a biological sample. Bioinformatics is heavily involved in supporting both procedures. The protein microarray approach faces problems similar to those of mRNA microarrays, while for mass spectrometry the problem is matching large amounts of mass data against masses predicted from protein sequence databases, in addition to the complicated statistical analysis of samples in which multiple, but incomplete, peptides from each protein are detected.
Analysis of mutations in cancer
In cancer, the genomes of affected cells are rearranged in complex and even unpredictable ways. Massive sequencing efforts are underway to identify as yet unknown single-base substitutions (point mutations) in a variety of genes in cancer. Bioinformaticians continue to produce automated systems to manage the significant volume of sequence data obtained, and create new algorithms and software to compare the sequencing results against the growing collection of human genome sequences and germline polymorphisms. New physical detection technologies are being used, such as oligonucleotide microarrays to identify chromosomal gains and losses (a technique called comparative genomic hybridization), and single-nucleotide polymorphism arrays to detect known mutation points. These detection methods simultaneously measure many hundreds of thousands of positions along the genome and, when used at high throughput to analyze thousands of samples, generate terabytes of data per experiment. The massive amounts and new types of data thus provide new opportunities for bioinformaticians. Considerable variability, or noise, is often found in the data, so methods such as hidden Markov models and change-point analysis are being developed to infer actual changes in copy number (the number of copies of a particular gene in an individual's genotype, which may be high in cancer cells).
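A minimal form of the change-point analysis mentioned above can be sketched as choosing the split that minimizes the within-segment squared error. The log-ratio values below are simulated:

```python
def change_point(values):
    """Index of the single split minimising within-segment squared error."""
    def sse(segment):
        m = sum(segment) / len(segment)
        return sum((x - m) ** 2 for x in segment)
    return min(range(1, len(values)),
               key=lambda i: sse(values[:i]) + sse(values[i:]))

# Simulated log2 copy-number ratios along one chromosome: a jump from
# about 0 (two copies) to about 1 (four copies) at probe index 6.
ratios = [0.05, -0.02, 0.01, 0.03, -0.04, 0.02,
          1.02, 0.97, 1.05, 0.99, 1.01]
print(change_point(ratios))  # -> 6
```

Real segmentation methods apply this idea recursively, test whether each split is statistically significant against the noise level, and scale to hundreds of thousands of probes.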
Another type of data that requires novel computational developments is the analysis of lesions found recurrently in large numbers of tumors, mainly through the automated analysis of clinical images.
Protein structure prediction
The prediction of protein structure is another important application of bioinformatics. The amino acid sequence of a protein, also called its primary structure, can easily be determined from the nucleotide sequence of the gene that encodes it. In the vast majority of cases, this primary structure uniquely determines the structure the protein adopts in its native environment. (There are, of course, exceptions, such as bovine spongiform encephalopathy, or 'mad cow disease'; see also prion.) Knowledge of this structure is vital to understanding the protein's function. For lack of better terms, structural information about proteins is usually classified as secondary, tertiary and quaternary structure. A viable general solution for the prediction of such structures remains an open problem; so far, most efforts have been directed toward heuristics that work most of the time.
One of the key ideas in bioinformatics is the notion of homology. In the genomic branch of bioinformatics, homology is used to predict the function of a gene: if the sequence of gene A, whose function is known, is homologous to the sequence of gene B, whose function is unknown, it can be inferred that B might share the function of A. In the structural branch of bioinformatics, homology is used to determine which parts of a protein are important in forming its structure and in interacting with other proteins. In the technique called homology modeling, this information is used to predict the structure of a protein once the structure of a homologous protein is known. This currently remains the only way to predict protein structures reliably.
An example of this is the homology between human hemoglobin and the hemoglobin of legumes (leghemoglobin). Both serve the same purpose of transporting oxygen. Although the two have completely different amino acid sequences, their structures are virtually identical, reflecting their nearly identical purposes.
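Returning to the genomic use of homology: percent identity over a pre-aligned pair of sequences is often the first number examined when deciding whether two genes are candidate homologs. The sequences below are invented, and the identity thresholds quoted are only rough rules of thumb:

```python
def percent_identity(aligned_a, aligned_b):
    """Identical positions as a percentage of gap-free aligned positions."""
    pairs = list(zip(aligned_a, aligned_b))
    matches = sum(x == y and x != '-' for x, y in pairs)
    aligned = sum(x != '-' and y != '-' for x, y in pairs)
    return 100 * matches / aligned

# Hypothetical pre-aligned protein fragments ('-' marks a gap):
gene_a = "MKT-AYIAKQR"   # function known
gene_b = "MKTQAYLAKQR"   # function unknown
print(round(percent_identity(gene_a, gene_b)))  # -> 90
```

At identities this high over a sufficiently long alignment, homology (and, tentatively, shared function) would usually be inferred; below roughly 20-30% identity, the so-called twilight zone, such inference becomes unreliable.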
Other techniques for predicting protein structure include protein threading and de novo (from scratch) modeling, based on physical and chemical features.
In this regard, see also structural motif and structural domain.
Comparative genomics
The core of comparative genome analysis is the establishment of correspondence between genes (orthologous analysis) or between other genomic characteristics of different organisms. These intergenomic maps are what make it possible to trace the evolutionary processes responsible for the divergence between two genomes. A multitude of evolutionary events acting at different organizational levels make up the evolution of the genome. At the lowest level, point mutations affect individual nucleotides. At the highest level, large chromosome segments undergo duplication, horizontal transfer, inversion, transposition, deletion, and insertion. Finally, entire genomes are involved in processes of hybridization, polyploidy, and endosymbiosis, often leading to sudden speciation.
The complexity of genome evolution poses many exciting challenges to developers of mathematical models and algorithms, who must resort to a spectrum of algorithmic, statistical and mathematical techniques, ranging from exact, heuristic, fixed-parameter and approximation algorithms for problems based on parsimony models, to Markov chain Monte Carlo methods for the Bayesian analysis of problems based on probabilistic models.
Many of these studies are based on homology detection and protein family computation.
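A common heuristic for establishing such gene correspondences is the reciprocal best hit: two genes, one per genome, that are each other's highest-scoring match. A minimal sketch with invented gene names and similarity scores:

```python
def best_hit(gene, scores):
    """The highest-scoring match of `gene` in the other genome."""
    return max(scores[gene], key=scores[gene].get)

def reciprocal_best_hits(scores_ab, scores_ba):
    """Gene pairs that are each other's best match: candidate orthologs."""
    pairs = []
    for a in scores_ab:
        b = best_hit(a, scores_ab)
        if best_hit(b, scores_ba) == a:
            pairs.append((a, b))
    return pairs

# Invented similarity scores between genes of genome A and genome B
# (in practice these would come from an all-against-all sequence search):
scores_ab = {"a1": {"b1": 95, "b2": 40}, "a2": {"b1": 38, "b2": 90}}
scores_ba = {"b1": {"a1": 94, "a2": 35}, "b2": {"a1": 42, "a2": 91}}
print(reciprocal_best_hits(scores_ab, scores_ba))  # -> [('a1', 'b1'), ('a2', 'b2')]
```

The heuristic breaks down in the presence of gene duplications, which is why ortholog databases combine it with clustering and phylogenetic evidence.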
Modeling of biological systems
Systems biology involves the use of computer simulations of cellular subsystems (such as networks of metabolites and enzymes that comprise metabolism, signal transduction pathways, and genetic regulatory networks), both to analyze and visualize the complex connections of these cellular processes. Artificial life or virtual evolution tries to understand evolutionary processes through computer simulation of simple (artificial) life forms.
High performance image analysis
Computer technologies are being used to speed up or fully automate the processing, quantification, and analysis of large amounts of information-rich biomedical images. Modern image analysis systems increase the observer's ability to perform analysis on a large or complex set of images, improving precision, objectivity (independence of the results according to the observer), or speed. A fully developed analysis system could completely replace the observer. Although these systems are not exclusive to the field of biomedical imaging, they are becoming increasingly important for both diagnostics and research. Some examples:
- Quantification and subcellular localization with high throughput and precision (high-content screening, cytohistopathology).
- Morphometry.
- Analysis and display of clinical images.
- Determination of real-time airflow patterns in the breathing lungs of living animals.
- Quantification of occlusion size in real-time images of the development of, and recovery from, arterial lesions.
- Conducting behavioral observations based on prolonged video recordings of laboratory animals.
- Infrared observations (infrared spectroscopy) for the determination of metabolic activity.
Protein-protein coupling
Over the past two decades, tens of thousands of three-dimensional structures of proteins have been determined by X-ray crystallography and protein nuclear magnetic resonance (protein NMR) spectroscopy. A central question for scientists is whether it is feasible to predict possible protein-protein interactions based only on these 3D shapes, without performing experiments identifying these interactions. A variety of methods have been developed to deal with the problem of protein-protein coupling, although it appears that much work remains in this field.
Ontologies and data integration
Biological ontologies are directed acyclic graphs of controlled vocabularies/indexing languages. They are designed to capture biological concepts and descriptions in a way that can be easily categorized and analyzed by computers. When categorized in this way, it is possible to derive added value from holistic and integrated analysis.
The OBO Foundry consortium is an effort to standardize certain ontologies. One of the most widespread is the Gene Ontology, which describes gene function. There are also ontologies that describe phenotypes.
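The DAG structure is what computation exploits: a gene annotated with a specific term implicitly carries all of that term's ancestors. A toy sketch with invented term names (not real Gene Ontology identifiers):

```python
# Toy directed acyclic graph of ontology terms: child -> list of parents.
PARENTS = {
    "glucose metabolic process": ["carbohydrate metabolic process"],
    "carbohydrate metabolic process": ["metabolic process"],
    "metabolic process": ["biological process"],
    "biological process": [],
}

def ancestors(term, parents=PARENTS):
    """All terms reachable by following is-a links upward."""
    found = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(sorted(ancestors("glucose metabolic process")))
```

Operations like this underlie enrichment analysis, where annotations are propagated up the graph before comparing gene sets.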
Software tools
Software tools for bioinformatics range from simple command-line tools to much more complex graphical programs and standalone web services hosted by bioinformatics companies or public institutions. The computational biology tool best known among biologists is probably BLAST, an algorithm for determining the similarity of arbitrary sequences to other sequences, typically held in protein or DNA sequence databases. The NCBI (National Center for Biotechnology Information, USA), for example, provides a widely used web-based implementation that works on top of its databases.
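The speed of heuristic search tools of this kind comes largely from indexing the database by short words ("seeds") and extending only the exact word matches. A toy sketch of the seeding idea, not of BLAST itself:

```python
def kmer_index(sequence, k=3):
    """Map every k-mer to the positions where it occurs in the database."""
    index = {}
    for i in range(len(sequence) - k + 1):
        index.setdefault(sequence[i:i + k], []).append(i)
    return index

def seed_hits(query, index, k=3):
    """Exact k-mer matches between query and database: the 'seeds' that a
    heuristic search would then extend into full alignments."""
    hits = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits.append((i, j))  # (position in query, position in database)
    return hits

database = "ATGCGTACGTTAGC"   # invented miniature "database" sequence
index = kmer_index(database)
print(seed_hits("CGTAC", index))  # -> [(0, 3), (0, 7), (1, 4), (2, 5)]
```

Seeds that line up along a diagonal (here, the hits offset by a constant 3) indicate a candidate alignment worth extending with full dynamic programming.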
For multiple sequence alignments, the classic ClustalW, currently in version 2, is the reference software. An implementation of it can be used at the EBI (European Bioinformatics Institute).
BLAST and ClustalW are just two examples of the many sequence alignment programs available. On the other hand, there is a multitude of bioinformatics software with other objectives: structural alignment of proteins, prediction of genes and other motifs, prediction of protein structure, prediction of protein-protein coupling, or modeling of biological systems, among others. In Annex:Software for sequence alignment and Annex:Software for structural alignment, you can find lists of programs or web services suitable for each of these two objectives in particular.
Free software in bioinformatics
Many free software tools have appeared, and continue to appear, since the 1980s. The need for new algorithms to analyze new kinds of biological data, together with the potential for innovative in silico experiments and the availability of free repositories for open code, has helped create opportunities for research groups to contribute both to bioinformatics and to the body of available free software, regardless of their funding sources. Open source tools often act as incubators for ideas, or as plug-ins for commercial applications. They can also provide de facto standards and models or frameworks that address the challenge of integration in bioinformatics.
The list of free software in bioinformatics includes titles such as Bioconductor, BioPerl, Biopython, BioJava, BioJS, BioRuby, Bioclipse, EMBOSS, .NET Bio, Orange with its bioinformatics plugins, Apache Taverna, UGENE and GenoCAD. To maintain this tradition and create new opportunities, the non-profit organization Open Bioinformatics Foundation has sponsored the Bioinformatics Open Source Conference (BOSC) annually since 2000.
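A task these libraries all handle is reading standard sequence formats such as FASTA. The sketch below is a minimal pure-Python FASTA parser for illustration only; free libraries such as Biopython (via its `Bio.SeqIO` module) provide robust, full-featured equivalents.

```python
def parse_fasta(text):
    """Minimal FASTA parser: returns {record_id: sequence}.
    Illustrative only; use a library such as Biopython for real work."""
    records, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):                 # header line starts a record
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []  # id is first header token
        elif line:                               # sequence lines accumulate
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records

fasta = """>seq1 example record
ATGCGT
ACGT
>seq2
TTTT"""
print(parse_fasta(fasta))  # → {'seq1': 'ATGCGTACGT', 'seq2': 'TTTT'}
```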
An alternative method of building public databases is to use the MediaWiki wiki software with the WikiOpener extension. This system allows all experts in the field to access and update the database.
Web services in bioinformatics
Interfaces based on SOAP and Representational State Transfer (REST) have been developed for a wide variety of bioinformatics applications, allowing an application running on a computer anywhere in the world to use algorithms, data and computing resources hosted on servers in any other part of the planet. The main advantage is that the end user does not have to worry about updating or maintaining the software or the databases. Basic bioinformatics services, according to the implicit classification of the EBI, used to be classified as:
- Online information-retrieval services (access to databases, for example).
- Analysis tools (e.g., services that give access to EMBOSS).
- Searches for similarities between sequences (access services to FASTA or BLAST, for example).
- Multiple sequence alignments (access to ClustalW or T-Coffee).
- Structural analysis (access to protein structural alignment services, for example).
- Access services to specialized literature and ontologies.
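A REST client typically interacts with such services by composing a URL with query parameters. The sketch below only shows how such a request URL might be built; the service root and parameter names are hypothetical placeholders, not a documented EBI endpoint, and no network call is performed.

```python
from urllib.parse import urlencode

# Hedged sketch: the base URL, tool name and parameters below are
# illustrative placeholders modeled on the general shape of REST-style
# bioinformatics services, not a real documented API.
def build_query_url(base, tool, params):
    """Compose a REST-style request URL from a service root, a tool
    name and a dict of query parameters."""
    return f"{base}/{tool}?{urlencode(params)}"

url = build_query_url(
    "https://example.org/services/rest",  # placeholder service root
    "sss",                                # e.g. a sequence-similarity search
    {"sequence": "ATGCCGTA", "database": "uniprot"},
)
print(url)
# → https://example.org/services/rest/sss?sequence=ATGCCGTA&database=uniprot
```

A real client would then issue an HTTP GET or POST to this URL and parse the response, freeing the user from installing the underlying tools or databases locally.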
Since 2009, basic bioinformatics services have been classified by the EBI into three categories:
- Sequence similarity searches (SSS)
- Multiple sequence alignments (MSA)
- Biological sequence analysis (BSA)
The availability of these SOAP-based web services through systems such as registry services (data distribution and discovery services built on web services) demonstrates the applicability of web-based bioinformatics solutions. These tools range from collections of standalone programs sharing a common data format under a single standalone or web-based interface, to integrative and extensible systems for bioinformatics workflow management.
Bioinformatics Workflow Management Systems
A bioinformatics workflow management system is a specialized form of workflow management system designed specifically to compose and execute a series of computational or data-manipulation steps (a workflow) in a bioinformatics application. Such systems are designed to:
- provide an easy-to-use environment for application scientists to create their own workflows,
- provide interactive tools that enable scientists to run their workflows and see their results in real time,
- simplify the process of sharing and reusing workflows among scientists, and
- allow scientists to trace the provenance of workflow execution results and of the steps that created the workflow.
Some of the platforms that offer this service are Galaxy, Kepler, Taverna, UGENE, Anduril and HIVE.
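The core idea shared by these platforms can be sketched very simply: steps declare their dependencies and are executed in dependency order, with each step receiving the outputs of its upstream steps. The toy runner below illustrates only that core; real systems such as Galaxy or Taverna add graphical interfaces, distributed execution, and provenance tracking.

```python
# Minimal sketch of a workflow runner: steps declare dependencies and
# are executed once all their upstream steps have finished. Purely
# illustrative; real workflow systems are far more capable.

def run_workflow(steps, deps):
    """steps: name -> function(inputs dict) -> result
    deps:  name -> list of upstream step names."""
    done, results = set(), {}
    while len(done) < len(steps):
        progressed = False
        for name, func in steps.items():
            if name in done or any(d not in done for d in deps.get(name, [])):
                continue
            results[name] = func({d: results[d] for d in deps.get(name, [])})
            done.add(name)
            progressed = True
        if not progressed:
            raise ValueError("cyclic or unsatisfiable dependencies")
    return results

# Hypothetical two-step pipeline: load a sequence, then compute GC content.
steps = {
    "load": lambda ins: "ATGCCGTA",
    "gc":   lambda ins: sum(b in "GC" for b in ins["load"]) / len(ins["load"]),
}
print(run_workflow(steps, {"gc": ["load"]}))  # → {'load': 'ATGCCGTA', 'gc': 0.5}
```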
"BioCompute" and "BioCompute Objects(BCO)"
In 2014, the US Food and Drug Administration sponsored a conference, held at the National Institutes of Health Bethesda Campus, to discuss reproducibility in bioinformatics. Over the next three years (2014-2017), a consortium of stakeholders met regularly to discuss what would become the BioCompute paradigm. These stakeholders included representatives from government, industry, and academia. Session leaders represented numerous branches of the FDA and NIH Institutes and Centers, non-profit entities including the Human Variome Project and the European Federation for Medical Informatics, and research institutions including Stanford, the New York Genome Center, and George Washington University.
It was decided that the BioCompute paradigm would be in the form of "digital lab notebooks" that allow reproducibility, replication, revision, and reuse of bioinformatics protocols. This was proposed to allow more continuity within a research group in the course of the normal staff flow while encouraging the exchange of ideas between groups. The US FDA funded this work to make pipeline information more transparent and accessible to its regulatory staff.
In 2016, the group met again at the NIH in Bethesda and discussed the potential of a BioCompute Object, an instance of the BioCompute paradigm. This work was released as a "standard for trial use", and a preprint manuscript was uploaded to bioRxiv.
BioCompute objects allow records to be shared among employees, partners, and regulators.
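Conceptually, a BioCompute Object is a structured, shareable record of a computational pipeline. The sketch below builds a minimal provenance record of this flavor; the field names are illustrative placeholders and do not follow the official BioCompute Object schema.

```python
import json
import hashlib

# Hedged sketch of a BioCompute-style provenance record. The field
# names below are illustrative, NOT the official BCO schema.
def make_record(name, version, steps):
    """Bundle pipeline metadata with a checksum so that recipients
    (employees, partners, regulators) can verify the record."""
    record = {
        "name": name,
        "version": version,
        "description_domain": {"pipeline_steps": steps},
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["checksum"] = hashlib.sha256(payload).hexdigest()
    return record

# Hypothetical single-step pipeline description.
bco = make_record("variant-calling-demo", "1.0",
                  [{"step": 1, "tool": "aligner", "input": "reads.fq"}])
print(bco["name"], bco["checksum"][:8])
```

Because the checksum is computed over the serialized metadata, any later edit to the recorded pipeline changes the checksum, which is one simple way a shared record can support verification across groups.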