Human genome
The human genome is the DNA sequence contained in 23 pairs of chromosomes in the nucleus of each diploid human cell. Of the 23 pairs, 22 are autosomes and one is the sex-determining pair (two X chromosomes in females, and one X and one Y in males). The haploid genome (i.e., a single representative of each pair) has a total length of approximately 3.2 billion DNA base pairs (3,200 Mb) containing about 20,000–25,000 genes. Of the 3,200 Mb, 2,950 Mb correspond to euchromatin and about 250 Mb to heterochromatin. The Human Genome Project produced a reference sequence of the euchromatic human genome, used throughout the world in the biomedical sciences.
The DNA sequence that makes up the human genome contains the encoded information necessary for the highly coordinated and environment-adaptable expression of the human proteome, that is, of the set of human proteins. Proteins, not DNA, are the main effector biomolecules; they have structural, enzymatic, metabolic, regulatory and signaling functions, organizing themselves into enormous functional networks of interactions. In short, the proteome is the basis for the particular morphology and functionality of each cell. Likewise, the structural and functional organization of the different cells shapes each tissue and each organ, and, finally, the living organism as a whole. Thus, the human genome contains the basic information necessary for the physical development of a complete human being.
The human genome has a much lower gene density than initially predicted, with only 1.5% of its length made up of protein-coding exons. About 70% of the genome is extragenic DNA and 30% consists of gene-related sequences. Of the total extragenic DNA, approximately 70% corresponds to interspersed repeats, so that roughly half of the human genome consists of repetitive DNA sequences. Of the total gene-related DNA, an estimated 95% is non-coding: pseudogenes, gene fragments, introns and UTR sequences, among others.
More than 280,000 regulatory elements that originated through insertions of mobile elements, totaling approximately 7 Mb of sequence, have been detected in the human genome. These regulatory regions are conserved non-exonic elements (CNEEs), and the mobile elements from which they derive are classified as SINE, LINE and LTR. At least 11% to 20% of these gene regulatory sequences, which are conserved between species, are known to have been formed by mobile elements.
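The percentages in this paragraph are nested (a share of a share), so it can help to multiply them out. The short sketch below does only that arithmetic with the figures quoted above; the variable names are illustrative.

```python
# Figures quoted in the paragraph above.
EXTRAGENIC = 0.70                  # extragenic DNA, as a share of the genome
GENE_RELATED = 0.30                # gene-related sequences, as a share of the genome
REPEATS_OF_EXTRAGENIC = 0.70       # share of extragenic DNA made of dispersed repeats
NONCODING_OF_GENE_RELATED = 0.95   # share of gene-related DNA that is non-coding

# Convert the nested shares into shares of the whole genome.
dispersed_repeats = EXTRAGENIC * REPEATS_OF_EXTRAGENIC
noncoding_gene_related = GENE_RELATED * NONCODING_OF_GENE_RELATED
coding = GENE_RELATED - noncoding_gene_related

print(f"dispersed repeats:       {dispersed_repeats:.1%} of the genome")
print(f"non-coding gene-related: {noncoding_gene_related:.1%} of the genome")
print(f"coding:                  {coding:.1%} of the genome")
```

The products come to about 49% for the repeats (the "roughly half" of the text) and 1.5% for coding sequence, which agrees with the exon figure given at the start of the paragraph.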
The Human Genome Project, which began in 1990, aimed to decipher the entire genetic code contained in the 23 pairs of chromosomes. This study was completed in 2005, with approximately 28,000 genes sequenced. On June 2, 2016, scientists formally announced the Human Genome-Write Project (HGP-Write), a plan to synthesize the human genome.
The function of the vast majority of the bases in the human genome is unknown. The ENCODE Project (acronym for ENCyclopedia Of DNA Elements) has mapped regions of transcription, transcription factor association, chromatin structure, and histone modification. These data have made it possible to assign biochemical functions to 80% of the genome, mainly outside the protein-coding exons. The ENCODE project provides new insights into the organization and regulation of genes and the genome, and an important resource for the study of human biology and disease.
Species | Size of genome (Mb) | Number of genes |
---|---|---|
Candidatus Carsonella ruddii | 0.15 | 182 |
Streptococcus pneumoniae | 2.2 | 2300 |
Escherichia coli | 4.6 | 4400 |
Saccharomyces cerevisiae | 12 | 5800 |
Caenorhabditis elegans | 97 | 19000 |
Arabidopsis thaliana | 125 | 25500 |
Drosophila melanogaster (fruit fly) | 180 | 13700 |
Oryza sativa (rice) | 466 | 45 000-55 000 |
Mus musculus (mouse) | 2500 | 29 000 |
Homo sapiens (human) | 2900 | 27 000 |
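One way to read the table is as gene density, that is, genes per megabase. The sketch below computes it from the table's own values; the rice row is left out because the table gives a range rather than a single gene count, and the dictionary layout is only an illustrative choice.

```python
# (genome size in Mb, number of genes), copied from the table above.
genomes = {
    "Candidatus Carsonella ruddii": (0.15, 182),
    "Streptococcus pneumoniae": (2.2, 2300),
    "Escherichia coli": (4.6, 4400),
    "Saccharomyces cerevisiae": (12, 5800),
    "Caenorhabditis elegans": (97, 19000),
    "Arabidopsis thaliana": (125, 25500),
    "Drosophila melanogaster": (180, 13700),
    "Mus musculus": (2500, 29000),
    "Homo sapiens": (2900, 27000),
}

# Gene density in genes per Mb for each species.
density = {species: genes / mb for species, (mb, genes) in genomes.items()}

# Print from the densest genome to the sparsest.
for species, d in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{species:30s} {d:8.1f} genes/Mb")
```

With these figures the bacterial genomes come out at hundreds to more than a thousand genes per Mb, while mouse and human are close to ten.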
Components
Chromosomes
The human genome (like that of any eukaryotic organism) is made up of chromosomes, which are long continuous DNA sequences that are highly organized in space (with the help of histone and non-histone proteins) so as to adopt an ultra-condensed form in metaphase. They are observable with conventional optical or fluorescence microscopy using cytogenetic techniques and are arranged to form a karyotype.
The normal human karyotype contains a total of 23 distinct pairs of chromosomes: 22 pairs of autosomes plus 1 pair of sex chromosomes that determine the sex of the individual. Chromosomes 1-22 were numbered in decreasing order of size based on karyotype. However, it was later found that chromosome 22 is actually larger than 21.
The somatic cells of an organism have a total of 46 chromosomes (23 pairs) in their nucleus: a set of 22 autosomes from each parent and a pair of sex chromosomes, an X chromosome from the mother and an X or a Y from the father (see image 1). Gametes (eggs and spermatozoa) have a haploid complement of 23 chromosomes.
Chromosome | Genes | Number of base pairs | Sequenced base pairs |
---|---|---|---|
1 | 4220 | 247 199 719 | 224 999 719 |
2 | 1491 | 242 751 149 | 237 712 649 |
3 | 1550 | 199 446 827 | 194 704 827 |
4 | 446 | 191 263 063 | 187 297 063 |
5 | 609 | 180 837 866 | 177 702 766 |
6 | 2281 | 170 896 993 | 167 273 993 |
7 | 2135 | 158 821 424 | 154 952 424 |
8 | 1106 | 146 274 826 | 142 612 826 |
9 | 1920 | 140 442 298 | 120 312 298 |
10 | 1793 | 135 374 737 | 131 624 737 |
11 | 379 | 134 452 384 | 131 130 853 |
12 | 1430 | 132 289 534 | 130 303 534 |
13 | 924 | 114 127 980 | 95 559 980 |
14 | 1347 | 106 360 585 | 88 290 585 |
15 | 921 | 100 338 915 | 81 341 915 |
16 | 909 | 88 822 254 | 78 884 754 |
17 | 1672 | 78 654 742 | 77 800 220 |
18 | 519 | 76 117 153 | 74 656 155 |
19 | 1555 | 63 806 651 | 55 785 651 |
20 | 1008 | 62 435 965 | 59 505 254 |
21 | 578 | 46 944 323 | 34 171 998 |
22 | 1092 | 49 528 953 | 34 893 953 |
X (sex chromosome) | 1846 | 154 913 754 | 151 058 754 |
Y (sex chromosome) | 454 | 57 741 652 | 25 121 652 |
Total | 32 185 | 3 079 843 747 | 2 857 698 560 |
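The last two columns of the table can be compared directly to see how much of each chromosome had been sequenced. The sketch below does this for a handful of rows and for the totals; the selection of rows is arbitrary and the values are copied from the table.

```python
# (total base pairs, sequenced base pairs), copied from the table above.
rows = {
    "1": (247_199_719, 224_999_719),
    "21": (46_944_323, 34_171_998),
    "X": (154_913_754, 151_058_754),
    "Y": (57_741_652, 25_121_652),
    "Total": (3_079_843_747, 2_857_698_560),
}

# Fraction of each entry that is covered by sequence.
coverage = {name: sequenced / total for name, (total, sequenced) in rows.items()}

for name, fraction in coverage.items():
    print(f"chromosome {name:>5}: {fraction:.1%} sequenced")
```

In this table the Y chromosome stands out: less than half of its length appears in the sequenced column.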
Intragenic DNA
Genes
A gene is the basic unit of heredity and carries the genetic information necessary for the synthesis of a protein (coding genes) or a non-coding RNA (RNA genes). It is made up of a promoter sequence, which regulates its expression, and a transcribed sequence. The transcribed sequence consists of UTR sequences (untranslated flanking regions), necessary for translation and mRNA stability; exons (coding); and introns, which are untranslated DNA sequences located between two exons that are removed during mRNA processing (splicing).
The human genome is currently estimated to contain between 20,000 and 25,000 protein-coding genes, a much lower estimate than earlier predictions of 100,000 or more genes. This implies that the human genome has less than twice as many genes as much simpler eukaryotic organisms, such as the fruit fly or the nematode Caenorhabditis elegans. However, human cells make extensive use of alternative splicing to produce several different proteins from the same gene, as a consequence of which the human proteome is much larger than that of simpler organisms. In practice, the genome only carries the information necessary for a perfectly coordinated and regulated expression of the set of proteins that make up the proteome, which is responsible for executing most cellular functions.
Based on initial results from the ENCODE project (acronym for ENCyclopedia Of DNA Elements), some authors have proposed redefining the current concept of the gene. The most recent observations make it difficult to sustain the traditional view of a gene as a sequence made up of UTRs, exons and introns. Detailed studies have found a number of transcription initiation sequences per gene much higher than the initial estimates, and some of these sequences are located in regions far removed from the translated region, so that the 5' UTRs can span long sequences, making it difficult to delineate the gene. On the other hand, the same transcript can give rise to completely different mature RNAs (with no overlap at all), due to the extensive use of alternative splicing. In this way, the same primary transcript can give rise to proteins of very different sequence and functionality. Consequently, some authors have proposed a new definition of a gene: the union of genomic sequences that encode a coherent set of potentially overlapping functional products. Under this definition, RNA genes and partially overlapping sets of translated sequences are identified as genes (UTR sequences and introns are thus excluded and are now considered "gene-associated regions", together with the promoters). According to this definition, a single primary transcript that gives rise to two non-overlapping secondary transcripts (and two proteins) should actually be considered two different genes, regardless of whether they overlap totally or partially with their primary transcripts.
The new evidence provided by ENCODE, according to which the UTR regions are not easily delimited and extend over long distances, would make it necessary to reidentify the genes that actually make up the human genome. According to the traditional definition (currently in force), it would be necessary to identify as the same gene all those that show a partial overlap (including UTR regions and introns), so that in light of the new observations, the genes would include multiple proteins of highly diverse sequence and functionality. Collaterally, the number of genes that make up the human genome would be reduced. The proposed definition, on the other hand, is based on the functional product of the gene, thus maintaining a more coherent relationship between a gene and a biological function. As a consequence, with the adoption of this new definition, the number of genes in the human genome will increase significantly.
RNA genes
In addition to protein-coding genes, the human genome contains several thousand RNA genes, whose transcription produces transfer RNA (tRNA), ribosomal RNA (rRNA), microRNA (miRNA), or other non-coding RNAs. Ribosomal and transfer RNAs are essential to the constitution of ribosomes and to the translation of proteins. MicroRNAs, for their part, are of great importance in the regulation of gene expression: it is estimated that up to 20-30% of the genes in the human genome may be regulated by the miRNA interference mechanism. So far, more than 300 miRNA genes have been identified and it is estimated that there may be as many as 500.
Gene Distribution
Here are some average values from the human genome. It should be noted, however, that the enormous heterogeneity presented by these variables makes the average values unrepresentative, although they have indicative value.
The average gene density is 1 gene per 100 kb, with an average size of 20-30 kb, and an average number of exons of 7-8 per gene, with an average size of 150 nucleotides. The average size of an mRNA is 1.8-2.2 kb, including UTR regions (flanking untranslated regions), with the average length of the coding region being 1.4 kb.
The human genome is characterized by great sequence heterogeneity. In particular, the richness in guanine (G) and cytosine (C) bases versus adenine (A) and thymine (T) is heterogeneously distributed, with very G+C-rich regions flanked by very G+C-poor regions. The average G+C content is 41%, lower than the theoretically expected 50%. This heterogeneity is correlated with gene richness, such that genes tend to concentrate in the regions richest in G+C. This has been known for years thanks to the separation, by density gradient centrifugation, of regions rich in G+C (named H isochores, for High) and regions rich in A+T (L isochores, for Low).
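The regional G+C heterogeneity described above is usually examined by computing G+C content in windows along a sequence. The following is a minimal sketch of that idea, not a description of how isochores were originally defined; the function names, the window size and the toy sequence are all assumptions made for illustration.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """G+C fraction of each full window, as (start, fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence (illustrative only): an A+T-only stretch followed by a
# G+C-only stretch, standing in for an L region next to an H region.
toy = "AT" * 10 + "GC" * 10
profile = windowed_gc(toy, window=10, step=10)
print(profile)
```

On real data the windows would be tens or hundreds of kilobases long, and the profile would vary gradually rather than jumping between the two extremes.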
Regulatory Sequences
The genome has various gene expression regulation systems, based on the regulation of the binding of transcription factors to promoter sequences, on epigenetic modification mechanisms (DNA methylation or histone methylation-acetylation) or on the control of accessibility to promoters determined by the degree of chromatin condensation; all of them very interrelated. In addition, there are other regulation systems at the level of mRNA processing, stability and translation, among others. Therefore, gene expression is tightly regulated, which allows the development of the multiple phenotypes that characterize the different cell types of a multicellular eukaryotic organism, while at the same time providing the cell with the necessary plasticity to adapt to a changing environment. However, all the information necessary for the regulation of gene expression, depending on the cellular environment, is encoded in the DNA sequence in the same way that genes are.
Regulatory sequences are typically short sequences located near genes or within them (often in introns). Currently, systematic knowledge of these sequences and of how they act in complex gene regulation networks, sensitive to exogenous signals, is very scarce and is beginning to be developed through studies of comparative genomics, bioinformatics and systems biology. Identification of regulatory sequences is based in part on the search for evolutionarily conserved non-coding regions. For example, the evolutionary divergence between mouse and human occurred 70 to 90 million years ago. Through comparative genomics studies that align the sequences of both genomes, regions with a high degree of similarity can be identified. Many correspond to genes, and others to sequences that do not code for proteins but are of great functional importance, since they have been subjected to selective pressure.
Ultra-conserved elements
This name is given to regions that comparative genomics studies have shown to be almost completely constant through evolution, even more so than protein-coding sequences. These sequences generally overlap with introns of genes involved in the regulation of transcription or in embryonic development and with exons of genes related to RNA processing. Their function is generally poorly understood, but it is probably extremely important given their level of evolutionary conservation, as stated in the previous section.
About 500 fully conserved (100% match) segments longer than 200 base pairs have now been found among the human, mouse, and rat genomes; these segments are also nearly fully conserved in dog (99%) and chicken (95%).
Genes acquired by horizontal transfer
By some estimates, 145 genes of the human genome were acquired by horizontal gene transfer from other organisms, possibly bacteria, fungi and protists, among others.
Pseudogenes
Some 19,000 pseudogenes have also been found in the human genome, which are complete or partial versions of genes that have accumulated various mutations and are generally not transcribed. They are classified into unprocessed pseudogenes (~30%) and processed pseudogenes (~70%).
- Unprocessed pseudogenes are copies of genes, usually originated by duplication, that are not transcribed for lack of a promoter sequence and have accumulated multiple mutations, some of them nonsense mutations (which give rise to premature stop codons). They contain both exons and introns.
- Processed pseudogenes, by contrast, are copies of messenger RNA that have been reverse transcribed and inserted into the genome. Consequently, they lack introns and a promoter sequence.
Intergenic DNA
Intergenic or extragenic regions comprise most of the human genome sequence, and their function is generally unknown. A good part of these regions is composed of repetitive elements, classifiable as tandem repeats or interspersed repeats, although the rest of the sequence does not follow a defined and classifiable pattern. Much of the intergenic DNA may be an evolutionary artifact with no determined function in the current genome, which is why these regions have traditionally been called "junk DNA", a name that also includes intronic sequences and pseudogenes. However, this name is not the most accurate given the known regulatory role of many of these sequences. In addition, the remarkable degree of evolutionary conservation of some of these sequences seems to indicate that they have other essential functions that are still unknown or little known. Therefore, some prefer to call it "non-coding DNA" (although so-called "junk DNA" also includes coding transposons) or "repetitive DNA". Some of these regions actually constitute precursor genes for microRNA synthesis (regulators of gene expression and gene silencing).
Recent studies within the framework of the ENCODE project have obtained surprising results, which require a reformulation of our view of the organization and dynamics of the human genome. According to these studies, 15% of the human genome sequence is transcribed into mature RNAs, and up to 90% is transcribed at least into immature transcripts in some tissue. Thus, a large part of the human genome encodes functional RNA genes. This is consistent with the trend in recent scientific literature to assign increasing importance to RNA in gene regulation. Also, detailed studies have identified a much larger number of transcription initiation sequences per gene, some far removed from the proximal translated region. As a consequence, it is now more difficult to define a region of the genome as genic or intergenic, since genes and gene-related sequences span regions commonly considered intergenic.
Tandem repeat DNA
Tandem repeats are arranged consecutively, so that identical or nearly identical sequences follow one another.
Satellites
The set of satellite-like tandem repeats comprises a total of 250 Mb of the human genome. They are sequences between 5 and several hundred nucleotides that are repeated in tandem thousands of times, generating repeat regions with sizes ranging from 100 kb (100,000 nucleotides) to several megabases.
They get their name from the initial observations of density gradient centrifugations of fragmented genomic DNA, which reported a main band corresponding to most of the genome and three satellite bands of lower density. This is due to the fact that the satellite sequences have a richness in A+T nucleotides higher than the average for the genome and are consequently less dense.
There are six main types of satellite DNA repeats:
- Satellite 1: basic sequence of 42 nucleotides. Located at the centromeres of chromosomes 3 and 4 and on the short arm of acrocentric chromosomes (distal to the rRNA-coding cluster).
- Satellite 2: the basic sequence is ATTCCATTCG. Present in the vicinity of the centromeres of chromosomes 2 and 10, and in the secondary constriction of chromosomes 1 and 16.
- Satellite 3: the basic sequence is ATTCC. Present in the secondary constriction of chromosomes 9 and Y, and proximal to the rDNA cluster on the short arm of acrocentric chromosomes.
- Alpha satellite: basic sequence of 171 nucleotides. It forms part of the DNA of chromosomal centromeres.
- Beta satellite: basic sequence of 68 nucleotides. It appears around the centromere of acrocentric chromosomes and in the secondary constriction of chromosome 1.
- Gamma satellite: basic sequence of 220 nucleotides. Located next to the centromere of chromosomes 8 and X.
Mini-satellites
They are composed of a basic sequence unit of 6-25 nucleotides that is repeated in tandem, generating sequences of between 100 and 20,000 base pairs. It is estimated that the human genome contains about 30,000 minisatellites.
Several studies have related minisatellites to gene expression regulation processes, such as control of transcription level, alternative splicing or imprinting. Likewise, they have been associated with chromosomal fragile sites, since they are located close to preferential sites of chromosomal breakage, genetic translocation and meiotic recombination. Finally, some human minisatellites (~10%) are hypermutable, exhibiting an average mutation rate between 0.5% and 20% in germline cells, which makes them the most unstable regions of the human genome known to date.
In the human genome, approximately 90% of the minisatellites are located in the telomeres of the chromosomes. The basic six-nucleotide sequence TTAGGG is repeated thousands of times in tandem, generating the 5-20 kb regions that make up telomeres.
Some minisatellites, due to their great instability, show notable variability between different individuals. They are considered multiallelic polymorphisms, since they can occur in a highly variable number of repeats, and are called VNTRs (acronym for Variable number tandem repeat). They are widely used markers in forensic genetics, since they make it possible to establish a characteristic genetic fingerprint of each individual, and they are identifiable by Southern blot and hybridization.
Microsatellites
They are composed of basic sequences of 2-4 nucleotides, whose tandem repetition often leads to sequences of less than 150 nucleotides. Some important examples are CA dinucleotide and CAG trinucleotide.
Microsatellites are also multiallelic polymorphisms, called STRs (acronym for Short Tandem Repeats) and can be quickly and easily identified by PCR. The human genome is estimated to contain about 200,000 microsatellites, which are distributed more or less evenly, unlike minisatellites, making them more informative as markers.
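To make the idea of a short tandem repeat concrete, the sketch below scans a sequence for units of 2-4 nucleotides repeated back to back. It is only an illustration using a regular expression; the function name, the minimum of four repeats and the toy sequence are assumptions, and real STR genotyping relies on PCR and dedicated software rather than on this kind of search.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=4, min_repeats=4):
    """Return (start, unit, repeat_count) for each tandem repeat found.

    The unit is matched lazily so the shortest repeating unit is preferred,
    and the backreference \\1 requires exact copies of that unit.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Toy sequence (illustrative): a CA dinucleotide repeat and a CAG
# trinucleotide repeat separated by non-repetitive bases.
toy = "GGT" + "CA" * 6 + "TTT" + "CAG" * 5 + "TT"
print(find_microsatellites(toy))
```

Because the repeat count at a given locus differs between individuals, comparing the counts found at several loci is what makes these markers informative.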
Interspersed repetitive DNA
These are DNA sequences repeated in a dispersed manner throughout the genome, constituting 45% of the human genome. The quantitatively most important elements are the LINEs and SINEs, which are distinguished by the size of the repeating unit.
These sequences have the potential to self-propagate by being transcribed into an intermediate mRNA, reverse transcribed, and inserted at another point in the genome. This occurs at low frequency: an estimated 1 in 100-200 neonates carries a new Alu or L1 insertion. Such insertions can be pathogenic through insertional mutagenesis, through deregulation of the expression of nearby genes (by the SINE and LINE promoters themselves), or through illegitimate recombination between two identical copies at different chromosomal locations (intra- or interchromosomal recombination), especially between Alu elements.
Repeat type | Homo sapiens | Drosophila melanogaster | Caenorhabditis elegans | Arabidopsis thaliana |
---|---|---|---|---|
LINE, SINE | 33.4 % | 0.7 % | 0.4 % | 0.5 % |
LTR/HERV | 8.1 % | 1.5 % | 0 % | 4.8 % |
DNA transposons | 2.8 % | 0.7 % | 5.3 % | 5.1 % |
Total | 44.4 % | 3.1 % | 6.5 % | 10.4 % |
SINE
SINE is the acronym for Short Interspersed Nuclear Elements. They are short sequences, generally of a few hundred bases, that appear repeated thousands of times in the human genome. They make up 13% of the human genome, with 10% of the genome due to the Alu family of elements alone (characteristic of primates).
Alu elements are sequences of 250-280 nucleotides present in 1,500,000 copies scattered throughout the genome. Structurally they are dimers of two almost identical units, except that the second unit contains a 32-nucleotide insert and is therefore larger than the first. As for their sequence, they are considerably rich in G+C (56%), which is why they predominate in the R bands, and both monomers have a polyA tail (a run of adenines), a vestige of their mRNA origin. They also carry an RNA polymerase III promoter, from which they are transcribed. They are considered non-autonomous retrotransposons, since to propagate they depend on the reverse transcription of their mRNA by a reverse transcriptase present in the medium.
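The copy number and length quoted for Alu elements can be checked against the share of the genome attributed to them. The sketch below multiplies the two out, assuming the 3,200 Mb haploid genome size given in the introduction; the variable names are illustrative.

```python
GENOME_MB = 3200             # haploid genome size quoted in the introduction
ALU_COPIES = 1_500_000       # copies quoted in the paragraph above
ALU_LENGTH_NT = (250, 280)   # length range quoted in the paragraph above

# Total Alu sequence in Mb at the short and long ends of the length range.
alu_mb = tuple(ALU_COPIES * length / 1_000_000 for length in ALU_LENGTH_NT)

# The same totals expressed as a fraction of the genome.
alu_fraction = tuple(mb / GENOME_MB for mb in alu_mb)

print(f"Alu sequence: {alu_mb[0]:.0f}-{alu_mb[1]:.0f} Mb")
print(f"share of genome: {alu_fraction[0]:.1%}-{alu_fraction[1]:.1%}")
```

The result lands slightly above the 10% quoted for Alu. A plausible reading is that the quoted figures are rounded or that not every copy is full length, but that interpretation is an assumption and is not stated in the text.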
LINE
LINE is the acronym for Long Interspersed Nuclear Elements. LINEs constitute 20% of the human genome. The genome contains about 100,000-500,000 copies of L1 retrotransposons, the family of greatest quantitative importance; L1 is a 6-kb sequence repeated, by other counts, about 800,000 times throughout the genome, although the great majority of copies are incomplete, with the 5' end truncated by incomplete reverse transcription. Thus, it is estimated that there are about 5,000 complete copies of L1, only 90 of which are active, the rest being inhibited by methylation of their promoter.
Their richness in G+C is 42%, close to the average for the genome (41%) and they are located preferentially in the G bands of the chromosomes. They also have an RNA polymerase II promoter.
Full-length LINE elements are protein-coding. Specifically, LINE-1 encodes two proteins:
- An RNA-binding protein, encoded by open reading frame 1 (ORF1).
- An enzyme with reverse transcriptase and endonuclease activity, encoded by ORF2.
Both proteins are necessary for retrotransposition. These mobile elements are flanked by two non-coding regions, called the 5' UTR and the 3' UTR.
Therefore, they are considered autonomous retrotransposons, since they encode the proteins they need to propagate. The RNA polymerase II present in the medium transcribes the LINE, and this mRNA is translated in both reading frames, producing a reverse transcriptase that acts on the mRNA to generate a DNA copy of the LINE, potentially capable of inserting itself into the genome. Likewise, these proteins can be used by processed pseudogenes or SINE elements for their propagation.
Transcription starts at an internal promoter at the 5'UTR end. The L1 endonuclease generates a single-strand nick in genomic DNA, at a consensus sequence 5'TTTTT/A3'.
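As an illustration of what a consensus nick site means in practice, the sketch below looks for the motif quoted above on one strand of a sequence and reports where the nick would fall. It treats the consensus as an exact string, which is a simplification: the real site preference is degenerate, and the function name and toy sequence are assumptions.

```python
import re

L1_CONSENSUS = "TTTTTA"   # 5'-TTTTT/A-3', as quoted in the text above
NICK_OFFSET = 5           # the "/" falls after the five T's


def find_l1_nick_sites(seq: str) -> list[int]:
    """Return the 0-based index of the base just 3' of each candidate nick.

    Only exact matches on the given strand are reported; the lookahead
    allows overlapping occurrences of the motif to be found as well.
    """
    seq = seq.upper()
    return [
        match.start() + NICK_OFFSET
        for match in re.finditer(f"(?={L1_CONSENSUS})", seq)
    ]


# Toy sequence (illustrative) containing two occurrences of the motif.
print(find_l1_nick_sites("GGTTTTTACCGTTTTTTAC"))
```

A fuller model would also scan the reverse complement and allow mismatches, but the exact-match version is enough to show why A+T-rich regions offer more candidate sites.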
Several studies have shown that LINE sequences may be important in the regulation of gene expression, having verified that genes close to LINE present a lower expression level. This is especially relevant because approximately 80% of the genes in the human genome contain some L1 element in their introns.
It has been seen that the random insertion of active L1s into the human genome has given rise to genetic diseases, since it interferes with normal expression. A predilection of L1 for AT-rich regions is also observed.
HERV
HERV is the acronym for human endogenous retrovirus. Retroviruses are viruses whose genome is made up of RNA and that are capable of reverse transcription and of integrating their genome into that of the infected cell. Thus, HERVs are partial copies of retroviral genomes integrated into the human genome throughout the evolution of vertebrates, vestiges of ancient retroviral infections that affected germline cells. Some estimates put the number of HERV sequences at about 98,000, while others state that there are more than 400,000. In any case, it is accepted that around 5-8% of the human genome is made up of formerly viral genomes. The size of a complete retroviral genome is around 6-11 kb, but most HERVs are incomplete copies.
Throughout evolution, these sequences of no interest to the host genome have accumulated nonsense mutations and deletions that have inactivated them. Although most HERVs are millions of years old, at least one family of retroviruses integrated during the evolutionary divergence of humans and chimpanzees, the HERV-K(HML2) family, which accounts for about 1% of HERVs.
DNA Transposons
Retrotransposons, such as processed pseudogenes, SINEs and LINEs, are sometimes included under the name of transposons. In this case, retrotransposons are referred to as class I transposons and DNA transposons, to which this section is dedicated, as class II.
Full-length DNA transposons have the potential to self-propagate without an mRNA intermediate followed by reverse transcription. A transposon contains the gene for a transposase enzyme, flanked by inverted repeats. Its transposition mechanism is based on cutting and pasting, moving its sequence to a different location in the genome. Different types of transposases act differently, with some capable of binding to any part of the genome while others bind to specific target sequences. The transposase encoded by the transposon itself extracts it by making two flanking cuts in the DNA strand, generating cohesive ends, and inserts it into the target sequence at another point in the genome. A DNA polymerase fills in the gaps generated by the sticky ends and a DNA ligase restores the phosphodiester bonds, restoring the continuity of the DNA sequence. This entails a duplication of the target sequence around the transposon, in its new location.
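The cut-and-paste mechanism and the resulting target site duplication can be mimicked with plain strings. The following toy model is a sketch under strong simplifications: DNA is treated as a single string, the staggered cut is represented only by its length, and the donor site is rejoined cleanly. The function name and all sequences are made up for illustration.

```python
def cut_and_paste(donor, transposon, target, insert_at, tsd_len):
    """Toy model of cut-and-paste transposition.

    The transposon is excised from donor and inserted into target. The
    staggered cut spans tsd_len bases starting at insert_at; after gap
    filling, those bases appear on both sides of the insertion (the
    target site duplication).
    """
    if insert_at + tsd_len > len(target):
        raise ValueError("staggered cut runs past the end of the target")
    start = donor.index(transposon)  # raises ValueError if absent
    donor_after = donor[:start] + donor[start + len(transposon):]
    tsd = target[insert_at:insert_at + tsd_len]
    target_after = (
        target[:insert_at + tsd_len] + transposon + tsd
        + target[insert_at + tsd_len:]
    )
    return donor_after, target_after


# Lower case marks the transposon: inverted repeats ("cagt" and its reverse
# complement "actg") flanking a stand-in for the transposase gene.
transposon = "cagt" + "atgaaataa" + "actg"
donor = "GGG" + transposon + "CCC"
target = "AAACCGGTTT"

donor_after, target_after = cut_and_paste(donor, transposon, target, 3, 4)
print(donor_after)
print(target_after)
```

In the output the four target bases covered by the staggered cut appear on both sides of the inserted element, which is the duplication described at the end of the paragraph.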
It is estimated that the human genome contains about 300,000 copies of interspersed repeat elements originating from DNA transposons, constituting 3% of the genome. There are multiple families, among which the mariner elements and the MER1 and MER2 families are notable for their pathogenic importance, since they generate chromosomal rearrangements.
Variability
Although two human beings of the same sex share a very high percentage (around 99.9%) of their DNA sequence, which allows us to work with a single reference sequence, small genomic variations underlie a large part of the phenotypic variability between individuals. A variation in the genome, by substitution, deletion, or insertion, is called a polymorphism or genetic allele. It can be located in both coding and non-coding regions. Not all genetic polymorphisms alter the sequence of a protein or its level of expression; that is, many are silent and have no phenotypic expression.
SNPs
The main source of variability between the genomes of two humans comes from variations in a single nucleotide, known as SNPs (Single Nucleotide Polymorphisms), on which most studies have focused. Given their importance, there is currently an international project (the International HapMap Project) to catalog the SNPs of the human genome on a large scale. In this context, the term SNP is often restricted to those single nucleotide polymorphisms in which the less frequent allele occurs in at least 1% of the population.
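The 1% criterion can be expressed as a simple check on allele counts. The sketch below is illustrative only: the function names and the counts are made up, and for sites with more than two alleles it takes the second most common allele as "the less frequent allele", which is an assumption rather than something the text specifies.

```python
def minor_allele_frequency(allele_counts: dict[str, int]) -> float:
    """Frequency of the second most common allele (0.0 if there is none)."""
    total = sum(allele_counts.values())
    if total == 0 or len(allele_counts) < 2:
        return 0.0
    frequencies = sorted(
        (count / total for count in allele_counts.values()), reverse=True
    )
    return frequencies[1]


def is_common_snp(allele_counts: dict[str, int], threshold: float = 0.01) -> bool:
    """True when the minor allele reaches the 1% threshold quoted above."""
    return minor_allele_frequency(allele_counts) >= threshold


# Hypothetical allele counts observed at two variable positions.
print(is_common_snp({"A": 980, "G": 20}))   # minor allele at 2%
print(is_common_snp({"A": 9995, "G": 5}))   # minor allele at 0.05%
```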
SNPs are tetra-allelic markers, since in theory there can be four different nucleotides at one position, each of which would identify an allele; in practice, however, they usually present only two alleles in the population. The frequency of SNPs in the human genome is estimated at one SNP every 100-500 base pairs, a relevant part of which are coding polymorphisms that cause the substitution of one amino acid for another in a protein.
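The quoted spacing of one SNP every 100 to 500 base pairs translates into a total count once a genome size is fixed. The sketch below uses the 3.2 billion base pairs given in the introduction; the variable names are illustrative.

```python
GENOME_BP = 3_200_000_000        # haploid genome size from the introduction
SPACING_BP = (100, 500)          # one SNP every 100-500 bp, as quoted above

# Denser spacing gives the upper bound, sparser spacing the lower bound.
snps_high = GENOME_BP // SPACING_BP[0]
snps_low = GENOME_BP // SPACING_BP[1]

print(f"expected SNPs: {snps_low:,} to {snps_high:,}")
```

The estimate therefore spans from a few million to a few tens of millions of positions.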
Thanks to their abundance and the fact that they present an approximately uniform distribution in the genome, they have been very useful as markers for linkage maps, a fundamental tool of the Human Genome Project. They are also easily detectable on a large scale by using DNA chips (commonly known as microarrays).
Their study using next-generation sequencing (NGS) techniques is gradually gaining prominence in the clinical setting, because many of them have been shown to be associated with diseases and can serve as susceptibility markers.
Single nucleotide variants newly identified by this method are called SNVs (Single Nucleotide Variants) and are not subject to any frequency threshold. Although they are known to be widely distributed, some regions show a higher degree of conservation, that is, a lower tendency to vary, because of their close association with function and cellular essentiality. Thus, protein-coding regions are more conserved than intergenic regions; likewise, within genes, exons and especially splice donor and acceptor sites (which have very low tolerance to change) are more conserved than introns, since changes at these positions could truncate the protein in question. Within exons, variants are differentially enriched across the positions that make up each codon: the third nucleotide is more tolerant of variation, as a consequence of the degeneracy of the genetic code. In regions encoding RNAs that do not give rise to proteins, snoRNAs show greater variability than lncRNAs. Among non-transcribed regulatory sequences, variability is concentrated in transcription factor binding sites and promoter regions, the latter being the most variable elements of the genome.
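The link between codon position and the degeneracy of the genetic code can be checked directly. The sketch below builds the standard genetic code and, for each codon position, counts what fraction of single-nucleotide substitutions in sense codons leave the amino acid unchanged. It is an illustration of the principle only and says nothing about observed variant frequencies; the function name is an assumption.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second and
# third codon base ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_fraction_by_position() -> list[float]:
    """Fraction of single-base changes that are synonymous, per codon position."""
    synonymous = [0, 0, 0]
    total = [0, 0, 0]
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue  # only substitutions starting from sense codons
        for position in range(3):
            for base in BASES:
                if base == codon[position]:
                    continue
                mutant = codon[:position] + base + codon[position + 1:]
                total[position] += 1
                if CODON_TABLE[mutant] == amino_acid:
                    synonymous[position] += 1
    return [s / t for s, t in zip(synonymous, total)]


fractions = synonymous_fraction_by_position()
for position, fraction in enumerate(fractions, start=1):
    print(f"codon position {position}: {fraction:.1%} of substitutions synonymous")
```

Most third-position substitutions turn out to be synonymous, while none at the second position are, which matches the greater tolerance to variation at the third nucleotide described above.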
Structural variation
This type of variation refers to duplications, inversions, insertions, or copy number variants of large segments of the genome (usually 1000 nucleotides or more). These variants involve a large proportion of the genome, so they are thought to be at least as important as SNPs.
Structural variation is the general term for a group of genomic alterations involving DNA segments larger than 1 kb. Structural variation can be quantitative (copy number variants, including deletions, insertions, and duplications), positional (translocations) or orientational (inversions).
Although this field of study is relatively new (the first large-scale studies were published in 2004 and 2005), it has been booming, to the point that a new project has been created to study these types of variants in the same individuals on which the HapMap Project was based.
Although there are still doubts about the causes of this type of variant, there is increasing evidence that it is a recurring phenomenon that continues to shape the genome and create new variants.
This type of variation has promoted the idea that the human genome is not a static entity, but rather is constantly changing and evolving.
Genetic diseases
Alteration of the DNA sequence that constitutes the human genome can cause the abnormal expression of one or more genes, giving rise to a pathological phenotype. Genetic diseases can be caused by mutation of the DNA sequence, affecting the coding sequence (producing incorrect proteins) or regulatory sequences (altering the expression level of a gene), or by chromosomal alterations, either numerical or structural. Alteration of the genome of an individual's germ cells is frequently transmitted to their offspring. Currently the number of known genetic diseases is approximately 4,000, the most common being cystic fibrosis.
The study of genetic diseases has often been subsumed under population genetics. The results of the Human Genome Project are of great importance for the identification of new genetic diseases and for the development of new and better genetic diagnostic systems, as well as for research into new treatments, including gene therapy.
Mutations
Gene mutations can be:
- Substitutions (change of one nucleotide for another): substitutions are called transitions if they involve a change between bases of the same chemical type, or transversions if they change a purine (A, G) into a pyrimidine (C, T) or a pyrimidine into a purine.
- Deletions or insertions: respectively, the removal or addition of a nucleotide sequence of variable length. Large deletions can affect several genes, to the point of being visible at the chromosomal level with cytogenetic techniques. Insertions or deletions of a few base pairs in a coding sequence can shift the reading frame (frameshift), so that the nucleotide sequence of the mRNA is read incorrectly.
Gene mutations can affect:
- Coding DNA: if the nucleotide change causes an amino acid change in the protein, the mutation is called non-synonymous. Otherwise it is called synonymous or silent (possible because the genetic code is degenerate). Non-synonymous mutations are further classified as missense mutations if they replace one amino acid with another, nonsense mutations if they change an amino acid codon into a stop codon (TAA, TAG, TGA), or sense-gain mutations in the reverse case.
- Non-coding DNA: they can affect regulatory sequences, promoters or sequences involved in splicing. The latter can cause erroneous processing of the mRNA, with various consequences for the expression of the protein encoded by that gene.
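The categories above can be expressed as a few simple rules. The sketch below is only illustrative: the function names are hypothetical, sequences are assumed to be written as DNA (A, C, G, T) on the coding strand, and the frameshift rule relies on the fact that codons are three nucleotides long, so only indels whose length is not a multiple of three shift the reading frame.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base substitution as a transition or a transversion."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def indel_effect(length_bp: int) -> str:
    """An indel in a coding sequence keeps the frame only if its length is a multiple of 3."""
    return "in-frame" if length_bp % 3 == 0 else "frameshift"


def stop_codon_effect(ref_codon: str, alt_codon: str) -> str:
    """Detect nonsense (sense -> stop) and sense-gain (stop -> sense) changes."""
    ref_stop = ref_codon in STOP_CODONS
    alt_stop = alt_codon in STOP_CODONS
    if alt_stop and not ref_stop:
        return "nonsense"
    if ref_stop and not alt_stop:
        return "sense gain"
    return "stop status unchanged"


print(substitution_type("A", "G"))       # purine to purine
print(substitution_type("A", "C"))       # purine to pyrimidine
print(indel_effect(2), indel_effect(3))
print(stop_codon_effect("TGG", "TGA"))   # Trp codon becomes a stop codon
```

Distinguishing synonymous from missense changes would additionally require a full codon table, which is omitted here for brevity.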
Monogenic disorders
They are genetic diseases caused by a mutation in a single gene, which present an easily predictable Mendelian-type inheritance. The table summarizes the main inheritance patterns that they can show, their characteristics and some examples.
Hereditary pattern | Description | Examples |
---|---|---|
Autosomal dominant | Diseases that manifest in heterozygous individuals. A mutation in one of the two copies of a gene (each individual has a pair of each chromosome) is enough to cause the disease. Affected individuals usually have one affected parent. An affected heterozygous parent has a 50% probability of having affected offspring, since each parent provides one chromosome of each pair. These diseases often correspond to gain-of-function mutations (the mutated allele is not inactive but has a new function that causes the disease) or to loss-of-function mutations with a gene-dosage effect, known as haploinsufficiency. They often show low penetrance, that is, only some of the individuals carrying the mutation develop the disease. | Huntington disease, neurofibromatosis 1, Marfan syndrome, hereditary colorectal cancer |
Autosomal recessive | The disease manifests only in homozygous recessive individuals, that is, those in whom both copies of a gene are mutated. These are loss-of-function mutations, so the disease is caused by the absence of the gene's activity. A mutation in only one of the two copies is compensated by the other copy (when a single copy is not sufficient, haploinsufficiency results, with autosomal dominant inheritance). An affected individual usually has two healthy parents who are carriers of the mutation (heterozygous genotype: Aa). In this case, each child has a 25% probability of being affected. | Cystic fibrosis, sickle cell anemia, Tay-Sachs disease, spinal muscular atrophy |
X-linked dominant | X-linked dominant diseases are caused by mutations on the X chromosome and show a distinctive inheritance pattern. Only a few hereditary diseases follow this pattern. Women have a higher prevalence than men, since they receive one X chromosome from their mother and another from their father, either of which can carry the mutation. Males, in contrast, always receive the Y chromosome from their father. Thus, an affected man (xY) will have only healthy sons (XY) and only affected daughters (Xx), while an affected woman (Xx) will have 50% affected offspring, regardless of sex. Some of these diseases are lethal in males (xY), so they occur only in women (and in males with Klinefelter syndrome, XxY). | Hypophosphatemia, Aicardi syndrome |
X-linked recessive | X-linked recessive diseases are also caused by mutations on the X chromosome. Men are most frequently affected. A male carrying the mutation will always be affected (xY) because his only X chromosome is mutated. His offspring will be healthy sons (XY) and carrier daughters (Xx). A carrier woman will have offspring in which 50% of daughters are carriers and 50% of sons are affected. | Hemophilia A, Duchenne muscular dystrophy, color blindness (daltonism), muscular dystrophy, androgenic alopecia |
Y-linked | Diseases caused by mutation on the Y chromosome. Consequently, they manifest only in males, whose daughters will all be healthy and whose sons will all be affected. Given the functions of the Y chromosome, these diseases often cause only infertility, which can often be overcome therapeutically. | Hereditary male infertility |
Mitochondrial | Diseases caused by mutation in genes of the mitochondrial genome. Given the particularities of this genome, transmission is matrilineal (the mitochondrial genome is passed from mothers to their children). The severity of a mutation depends on the percentage of affected genomes in the mitochondrial population, a phenomenon called heteroplasmy (in contrast to heterozygosity), which varies through asymmetric mitotic segregation. | Leber's hereditary optic neuropathy (LHON) |
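The 50% and 25% figures quoted for the autosomal patterns follow from enumerating the alleles each parent can transmit. The sketch below is a minimal Punnett-square calculation for a single autosomal locus, assuming full penetrance and that each parent passes on either allele with equal probability; the function name and the genotype notation (`A`, `a`) are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1: str, parent2: str) -> dict[str, Fraction]:
    """Genotype probabilities in the offspring of two parents at one autosomal locus."""
    # Each parent transmits one of its two alleles; sorting makes "aA" and "Aa" the same genotype.
    counts = Counter("".join(sorted(pair)) for pair in product(parent1, parent2))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


# Recessive disease: two healthy carriers (Aa x Aa); affected children are "aa".
print(cross("Aa", "Aa"))

# Dominant disease with mutant allele "A": affected heterozygote x healthy parent (Aa x aa).
print(cross("Aa", "aa"))
```

With two carrier parents the `aa` genotype has probability 1/4, and with one affected heterozygous parent the `Aa` genotype has probability 1/2, matching the table.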
Polygenic and multifactorial disorders
Other genetic alterations can be much more complex in their association with a pathological phenotype. They are multifactorial or polygenic diseases, that is, those that are caused by the combination of multiple genotypic alleles and exogenous factors, such as the environment or lifestyle. Consequently, they do not present a clear hereditary pattern, and the diversity of etiological and risk factors makes risk estimation, diagnosis, and treatment difficult.
Some examples of multifactorial diseases with partially genetic etiology are:
- autism
- cardiovascular disease
- hypertension
- diabetes
- obesity
- cancer
Chromosomal abnormalities
Genetic alterations can also occur at the chromosomal level (chromosomopathies), causing severe disorders that affect multiple genes and are often lethal, resulting in early spontaneous abortions. They are frequently caused by an error during cell division that nevertheless does not prevent the division from completing. Chromosomal alterations reflect an abnormality in the number or structure of chromosomes, which is why they are classified as numerical or structural. They cause very diverse phenotypes, but frequently share some common features:
- Intellectual disability and developmental delay.
- Facial alterations and anomalies in the head and neck.
- Congenital malformations, with preferential involvement of extremities, heart, etc.
Numerical
Aneuploidy | Frequency (/1000) | Syndrome |
---|---|---|
Trisomy 21 | 1.5 | Down |
Trisomy 18 | 0.12 | Edwards |
Trisomy 13 | 0.07 | Patau |
Monosomy X | 0.4 | Turner |
XXY | 1.5 | Klinefelter |
XYY | 1.5 | XYY |
A numerical abnormality is an alteration of the normal number of chromosomes of an individual, who normally has 23 pairs of chromosomes (46 in total), with one chromosome set coming from each parent (diploidy). If the alteration affects a single pair of chromosomes, it is called aneuploidy: there may be only one chromosome (monosomy) or more than two (trisomy, tetrasomy...). A highly prevalent example is trisomy 21, responsible for Down syndrome. If, on the contrary, the alteration affects all the chromosomes, we speak of euploidy alterations (changes in the number of complete chromosome sets), so that in theory the individual has a single chromosome set (haploidy, 23 chromosomes in total) or more than two sets (triploidy: 69 chromosomes; tetraploidy: 92 chromosomes...). In practice, these alterations are lethal to the embryo (causing miscarriage), with very few live births and very early death. Aneuploidies are mostly lethal, except for trisomies of chromosomes 13, 18 and 21, sex-chromosome trisomies (XXY, XYY), and monosomy of the X chromosome. The table shows the frequencies of live births with these alterations.
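The chromosome counts mentioned above can be told apart with simple arithmetic on the haploid set of 23. The sketch below uses a hypothetical helper that labels a total count as a whole number of sets or as an aneuploidy. It is a simplification: a total count alone cannot reveal which chromosome is affected, nor detect compensating gains and losses.

```python
HAPLOID_SET = 23  # chromosomes in one human chromosome set


def describe_chromosome_count(total: int) -> str:
    """Label a chromosome count as a whole number of sets or as an aneuploidy."""
    sets, extra = divmod(total, HAPLOID_SET)
    if extra == 0 and sets >= 1:
        names = {1: "haploid", 2: "diploid", 3: "triploid", 4: "tetraploid"}
        return names.get(sets, f"{sets} sets")
    # Not a whole number of sets: report the difference from the diploid count of 46.
    diff = total - 2 * HAPLOID_SET
    return f"aneuploid ({diff:+d} relative to 46)"


for n in (23, 45, 46, 47, 69, 92):
    print(n, describe_chromosome_count(n))
```

For example, 47 chromosomes (as in trisomy 21) is reported as one extra chromosome relative to 46, while 69 is reported as triploid.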
Structural
This is the name given to alterations in the structure of chromosomes, such as large deletions or insertions and rearrangements of genetic material between chromosomes, that are detectable by cytogenetic techniques.
- Deletions: elimination of a portion of the genome. Some known disorders are Wolf-Hirschhorn syndrome by partial deletion of the short arm of chromosome 4 (4p), and Jacobsen syndrome or 11q terminal deletion.
- Duplications: a considerable region of a chromosome is duplicated. An example is Charcot-Marie-Tooth disease type 1A, which can be caused by duplication of the gene encoding peripheral myelin protein 22 (PMP22) on chromosome 17.
- Translocations: a portion of a chromosome is transferred to another chromosome. There are two main types: reciprocal translocations, in which segments of two different chromosomes are exchanged, and Robertsonian translocations, in which two acrocentric chromosomes (13, 14, 15, 21, 22) fuse at their centromeres (centric fusion).
- Inversions: a part of the genome breaks and is reoriented in the opposite direction before rejoining, so that the sequence appears inverted. They can be paracentric (if they affect only one arm) or pericentric (if the inverted sequence includes the centromere).
- Ring chromosomes: a portion of the genome breaks and forms a ring by circularization. This may occur with loss of material or without loss of material.
- Isochromosomes: symmetrical chromosomes whose two arms are identical, through deletion of one arm and duplication of the other. The most common is isochromosome X, in which the short arm of the X chromosome is lost, giving rise to Turner syndrome phenotypes.
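The distinction between paracentric and pericentric inversions can be illustrated on a toy sequence. The sketch below makes several simplifying assumptions that are not part of the article: a chromosome is represented as a DNA string, coordinates are 0-based and half-open, and the centromere is reduced to a single index. On one strand, an inverted segment reads as its reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert_segment(seq: str, start: int, end: int) -> str:
    """Return `seq` with seq[start:end] replaced by its reverse complement."""
    inverted = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + inverted + seq[end:]


def inversion_type(start: int, end: int, centromere: int) -> str:
    """Pericentric if the inverted interval contains the centromere, else paracentric."""
    return "pericentric" if start <= centromere < end else "paracentric"


print(invert_segment("AAACCGTTT", 3, 6))
print(inversion_type(3, 6, centromere=4))
print(inversion_type(3, 6, centromere=7))
```

The total length of the sequence is unchanged by an inversion; only the orientation of the segment differs.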
Chromosomal instability syndromes are a group of disorders characterized by marked instability of the chromosomes, which frequently undergo structural alterations. They are associated with an increased risk of malignant neoplasms.
Evolution
Comparative genomics studies are based on the large-scale comparison of genomic sequences, generally using bioinformatics tools. These studies shed light on evolutionary questions across very different temporal and spatial scales, from the evolution of the first living beings billions of years ago or the phylogenetic radiation of mammals, to human migrations over the last 100,000 years, which explain the current distribution of human populations.
Comparative genomics between different species
Comparative genomics studies of mammalian genomes suggest that approximately 5% of the human genome has been evolutionarily conserved over the past 200 million years, a fraction that includes the vast majority of genes and regulatory sequences. However, currently known genes and regulatory sequences account for only 2% of the genome, suggesting that most of the genomic sequence with high functional significance is still unknown. A significant percentage of human genes have a high degree of evolutionary conservation. The similarity between the human genome and that of the chimpanzee (Pan troglodytes) is 98.77%. On average, a human protein differs from its chimpanzee ortholog by only two amino acids, and almost a third of the genes have the same sequence. An important difference between the two genomes is human chromosome 2, which is the product of a fusion of two ancestral chromosomes that remain separate in the chimpanzee (chimpanzee chromosomes 12 and 13 in the older numbering, now designated 2A and 2B).
Another finding from the comparison of the genomes of different primates is the remarkable loss of olfactory receptor genes that has occurred in parallel with the development of trichromatic color vision during primate evolution.
Comparative genomics between human genomes
For decades the only evidence available on the origin and expansion of Homo sapiens was a small number of archaeological finds. Today, however, comparative genomics studies based on the genomes of present-day individuals from around the world are providing highly relevant information. Their basic principle consists of identifying a polymorphism, a mutation, that is assumed to have originated in an individual from an ancestral population and to have been inherited by all of that individual's descendants up to the present. Furthermore, since mutations appear to occur at a constant rate, the age of a given mutation can be estimated from the size of the haplotype in which it lies, i.e., the size of the conserved sequence that flanks the mutation. This methodology is complicated by recombination between the two chromosomes of each pair in an individual, which come from its two parents. However, there are two regions in which this problem does not exist because they have uniparental inheritance: the mitochondrial genome (matrilineal inheritance) and the Y chromosome (patrilineal inheritance).
In recent decades, comparative genomics studies based on the mitochondrial genome, and to a lesser extent on the Y chromosome, have reported very interesting conclusions. Various studies have traced the phylogeny of these sequences, estimating that all living humans share a common female ancestor who lived in Africa around 150,000 years ago. For reasons still poorly understood, Y-chromosome DNA converges more recently, placing the most recent common male ancestor about 60,000 years ago. These individuals have been named mitochondrial Eve and Y-chromosome Adam.
The greatest diversity of genetic markers, and consequently the shortest haplotypes, have been found in Africa. The rest of the world population has only a small part of these markers, so the genomic composition of the rest of the current human population is only a subset of what can be seen in Africa. This suggests that a small group of human beings (perhaps around a thousand) migrated from the African continent to the coasts of western Asia some 50,000 to 70,000 years ago, according to studies based on the mitochondrial genome. About 50,000 years ago they reached Australia, and between 40,000 and 30,000 years ago other subpopulations colonized western Europe and central Asia. Likewise, it is estimated that 20,000 to 15,000 years ago they reached the American continent through the Bering Strait (sea level was lower during the last glaciation, known as the Würm or Wisconsin glaciation), populating South America about 15,000-12,000 years ago. However, these data are only estimates, and the methodology has certain limitations. Currently, the trend is to combine comparative genomics studies based on mitochondrial DNA with analysis of the Y chromosome sequence.
Characterization of genetic diversity in Africa is a crucial step for most analyses and for reconstructing evolutionary history. A study published in the journal Science on November 13, 2015 reported the first ancient genome found on the African continent. Until then, no study had succeeded in sequencing an ancient genome obtained from fossils on this continent, because of the instability of the DNA molecule itself, which is affected by temperature and humidity conditions. This finding was therefore a breakthrough.
The remains of "Mota" were dated to around 4,500 years ago and therefore predate both the Bantu expansion and, even more importantly, what is known as the West Eurasian backflow, a migratory event that occurred about 3,000 years ago, when populations from western Eurasian regions such as the Near East and Anatolia migrated back into the Horn of Africa.
By comparing 250,000 base pairs of the Mota genome with those of 40 African populations and 81 contemporary European and Asian populations, it was found that Mota was most closely related to the Ari, an ethnic group living near the highlands of Ethiopia. Mota is also quite similar to the Sandawe of southern Tanzania. These similarities are very important, among other reasons, for deciphering the ancient demographic landscape of Africa.
Aside from the Y chromosome and the mitochondrial genome, much data has been obtained from the autosomal chromosomes. Genomic studies of various human populations have identified the genomic variations that help trace human migrations. The most complete and complex is the 1000 Genomes Project, although other projects such as the Simons Genome Diversity Project and the International HapMap Project have also provided a great deal of information. All of them have provided information on SNPs, STRs, VNTRs and other markers that help to complete the genetic trees of human populations, which are still incomplete.
Mitochondrial genome
It is the genome of the mitochondria of eukaryotic cells. The mitochondrion is an essential subcellular organelle in the aerobic or oxidative metabolism of eukaryotic cells. Its origin is endosymbiotic, that is, mitochondria were formerly independent prokaryotic organisms captured by an ancestral eukaryotic cell, with which they developed a symbiotic relationship. The characteristics of its genome are therefore very similar to those of a present-day prokaryotic organism, and its genetic code is slightly different from the one considered universal. To adapt to the intracellular niche and increase its replication rate, the mitochondrial genome has been substantially reduced throughout its coevolution, and currently has a size of 16,569 base pairs. Thus, the vast majority of the proteins located in the mitochondria (~1500 in mammals) are encoded by the nuclear genome (to which all the previous sections refer), since many of these genes were transferred from the mitochondria to the cell nucleus during the coevolution of the eukaryotic cell. In most mammals, only the female transmits her mitochondria to the zygote, which is why they show, as already mentioned, a matrilineal inheritance pattern. In general, an average human cell contains 100-10,000 copies of the mitochondrial genome, at about 2-10 DNA molecules per mitochondrion.
The mitochondrial genome contains 37 genes:
- 13 protein-coding genes: they encode 13 polypeptides that form part of the multienzyme complexes of oxidative phosphorylation (OXPHOS system): 7 subunits of Complex I (NADH dehydrogenase), 1 subunit of Complex III (cytochrome b), 3 subunits of Complex IV (cytochrome oxidase) and 2 subunits of Complex V (ATP synthase).
- 2 rRNA genes, which encode the two ribosomal RNAs of the mitochondrial ribosome.
- 22 tRNA genes, which encode the 22 transfer RNAs necessary for protein synthesis in the mitochondrial matrix.
In contrast to the nuclear genome, where only 1.5% is coding, 97% of the mitochondrial genome corresponds to coding sequences. It is a single circular double-stranded DNA molecule. One of the strands is called the heavy or H strand and contains 28 of the 37 genes (2 rRNA, 14 tRNA and 12 polypeptide genes). The complementary strand (light or L strand) encodes the remaining 9 genes. In both strands, tRNA genes are interspersed between rRNA or protein-coding genes, which is of great importance for mitochondrial RNA processing.
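The gene counts given in this section can be cross-checked with a short tally. The sketch below only restates the figures quoted above; the per-category composition of the L strand is derived by subtraction and is therefore an inference from those figures, not a value stated in the text.

```python
# Figures quoted in this section.
GENES = {"protein-coding": 13, "rRNA": 2, "tRNA": 22}
OXPHOS_SUBUNITS = {"Complex I": 7, "Complex III": 1, "Complex IV": 3, "Complex V": 2}
H_STRAND = {"protein-coding": 12, "rRNA": 2, "tRNA": 14}

# Derived by subtraction (an inference, not stated explicitly in the text).
L_STRAND = {kind: GENES[kind] - H_STRAND[kind] for kind in GENES}

print("total genes:", sum(GENES.values()))
print("OXPHOS polypeptides:", sum(OXPHOS_SUBUNITS.values()))
print("H strand genes:", sum(H_STRAND.values()))
print("L strand genes:", sum(L_STRAND.values()), L_STRAND)
```

The totals agree with the text: 37 genes overall, 13 OXPHOS polypeptides, 28 genes on the H strand and 9 on the L strand.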