Wednesday, December 31, 2014

Tracing Human Migrations from Africa with Mitochondrial DNA and Y Chromosomal Variants - A Basic Introduction

Note on this post: I wrote this as a way of getting myself through a number of research articles I had collected, thinking that Biologos.org might want it for their site. They thought that it overlapped with things that they were already doing, but because of my original intention, it contains at the end a few thoughts of mine on one way that Christians might think about these things. If that section interests you, fine; if it doesn't, ignore it. It isn't meant to be any kind of definitive statement - only a few of my thoughts on the subject at the moment.

kya - thousand years ago
AMH - anatomically modern human

Many people are aware that in the last few decades scientists have presented "mitochondrial Eve" and "Y chromosome Adam" as part of an account of the spread of AMHs from Africa to the rest of the world. What I want to do here is give a basic introduction to the reasoning that underlies the "out of Africa" story of humankind that geneticists have produced in collaboration with archaeologists and linguists.


The human maternal line is followed with mitochondrial DNA (mtDNA), which is present in all our cells as a small circle of DNA that exists and replicates in the little energy-producing organelles called mitochondria (which are separate from the nucleus where the other chromosomes live.) MtDNA is passed from mothers only, to sons and daughters, but sons do not pass it on, so it traces the matrilineal line. The patrilineal line is followed by the Y chromosome, a small nuclear chromosome, present normally in only one copy in a man and passed from father to son to grandson and so on. One gene on the Y is the trigger for male development, and most of the few genes on the Y contribute to male fertility. The variants we will discuss on the Y chromosome almost always have no known functional effect. MtDNA variants do have functional effects more often, but we can ignore that here.


In what follows, I will describe the reasoning from the present perspective in which it is possible to sequence a large part of the Y chromosome and the whole mtDNA. Historically, in early studies only restricted parts of each were sequenced because of the limits of the methods that were available at the time. It's easier to explain if we assume complete sequences can be obtained, which has been true now for some years. Complete sequences have generally confirmed earlier conclusions and removed some ambiguities in the results of earlier studies.


In order to understand the genetic evidence for the "out of Africa" hypothesis, the first thing we have to do is understand a few things about the Y chromosome and the mitochondrial DNA. The Y chromosome is linear (like other human chromosomes). Over 20 million base pairs have been sequenced and assembled (out of 60 million total,) but the parts that are easiest to analyze (uniquely mappable) add up to about 10 million base pairs. The Y is unusual in two respects. First it is only passed through the male line, father to son to grandson and so on. Second, for 95% of the length of the Y there is no recombination with the X or any other chromosome during the development of sperm, so when mutations happen, they occur on a stable background of previous mutations that gets passed on intact along with any new mutations. (The small regions at each end of the Y that recombine with the X chromosome are not used in the methods discussed here.) 


The mitochondrial DNA is similar, although it is circular, and only 16,569 base pairs long. It is passed from mother to both sons and daughters, but only daughters pass it on. Like the Y chromosome it does not undergo recombination. Because the Y chromosome and the mitochondrial DNA are passed on in sex-specific ways, they can be used to trace differences in the past populations of men and women, but here we will just be concerned with the fact that for both, a tree of descent can be determined.


In both cases a reference sequence was first determined and numbered. In the Y chromosome one end of the linear molecule is numbered as 1 (leaving off the telomere, or end sequence, which consists of a large, variable array of short repeats, and skipping  an internal section called the "centromere," for the same reason.) It has been possible to sequence over 20 million base pairs of the human Y out of a total of about 60 million. The rest has so many highly repetitive sequences that it has not been possible to "assemble" (figure out how the small sequenced fragments fit together.) The mtDNA is much smaller, at 16,569 base pairs, and its complete sequence has been determined for tens of thousands of people now. Its origin for numbering was chosen in relation to functional features that we can ignore. 


When these sequences from a number of people were compared, it was found that certain positions in the sequences have a different nucleotide in some people than the most common nucleotide observed at that position. The sequence which contains the most common nucleotide at each position is referred to as the "consensus" or "ancestral" sequence. The variant nucleotides are often referred to as single nucleotide variants or polymorphisms (SNPs.) Small insertions or deletions are sometimes observed and they can be used in the same way as SNPs, but for simplicity, I'll just refer to the variants as SNPs.


In what follows I will speak of just the Y chromosome for simplicity, but the reasoning is identical for the mitochondrial DNA. 


If you looked at the sequences of the Y from two people from distant parts of the world, you would see that each one of them has a number of SNPs (nucleotides that are different from the consensus sequence.) Some of these SNPs would be shared between the two sequences, and some would be unique to one sequence or the other. When the sequences were obtained from a number of individuals from around the world, it was possible to group them by putting together sequences that share a lot of variants with each other. After that the groups could be connected to each other according to what SNPs one group shared with another group. 


To see how this works, let's say we have sequenced the Y chromosomes from two men, A and B. We find that they share SNPs 1-7. One has unique SNPs 8-12 and the other has unique SNPs 13-18. (I've used sequential numbers here for shared and unique SNPs, but the SNPs can be in any order on the chromosome.) So we can draw the following diagram for the history of these two Y chromosomes. Please note, this is not a diagram of one Y chromosome - this traces the history of the occurrence of mutations (SNPs) in a lineage of Y chromosomes over time.


                                        |
                         |           1-7|
                  Time   |              |
                         |             / \
                         V       8-12 /   \ 13-18
                                     A     B

Mutations 1-7 happened before a branch point where there was one man who didn't get a mutation and another who did. They might have been brothers or cousins. The shared mutations 1-7 happened earlier and so are shared by both, and the unique mutations happened in each lineage after the branch. The branch point could be marked by any of the SNPs that are unique to one or the other. We don't at this point know which SNP happened at the branch point. So now we sequence the Y from a third man (C), and we find that it contains SNPs 1-7, 13, 14 and new SNPs 19 and 20 which are not present in either A or B. So we can now draw this tree:

                                        |
                        |           1-7 |
                  Time  |               |
                        |              / \
                        V         8-12/   \13-14
                                     A     \
                                           /\
                                     19,20/  \15-18
                                         C    B

By sequencing the Y chromosomes from more men, we will get to see different branch points, and we can partition the SNPs into the relative time segments in which they occurred, thus gradually determining which SNPs are older and which are younger and determining a set of branching lineages in the history of the chromosome.
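
The partitioning step can be sketched in a few lines of Python. This is only an illustration using the made-up SNP numbers from the A, B and C example above, and it assumes that each SNP arose exactly once and was never lost. The function name partition_snps is my own and does not come from any real analysis package.

```python
# Hypothetical SNP sets for the three men in the example above.
# SNP numbers are only labels; they say nothing about position on the chromosome.
men = {
    "A": set(range(1, 8)) | set(range(8, 13)),   # SNPs 1-7 and 8-12
    "B": set(range(1, 8)) | set(range(13, 19)),  # SNPs 1-7 and 13-18
    "C": set(range(1, 8)) | {13, 14, 19, 20},    # SNPs 1-7, 13, 14, 19, 20
}

def partition_snps(samples):
    """Group SNPs by the exact set of men who carry them.

    Each distinct carrier set corresponds to one branch segment of the tree,
    assuming every SNP arose only once and was never lost.
    """
    branches = {}
    all_snps = set().union(*samples.values())
    for snp in sorted(all_snps):
        carriers = frozenset(name for name, snps in samples.items() if snp in snps)
        branches.setdefault(carriers, []).append(snp)
    return branches

branches = partition_snps(men)
# Print the most widely shared (oldest) segment first.
for carriers, snps in sorted(branches.items(), key=lambda kv: -len(kv[0])):
    print(sorted(carriers), snps)
```

SNPs carried by all three men sit on the trunk, SNPs carried by B and C sit on the segment between the two branch points, and SNPs carried by only one man sit on the tips, which is the same arrangement as the second diagram.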

The result of this process was the discovery that certain SNPs could be identified as the branch points between large subgroups. This became the basis for what geneticists call haplogroups. All the SNPs that an individual has are his haplotype. Large groups of people that share a large number of SNPs, including the SNP that defines the group, are haplogroups. (Haplo- just means one copy, as opposed to diplo- meaning two. There is normally only one Y chromosome, unlike the autosomes which we have in pairs. Each cell has hundreds of copies of the mitochondrial DNA, but they are almost always all the same.)


Based on this simple process of grouping haplotypes into haplogroups and partitioning SNPs into earlier and later, standard haplogroup trees have been developed for the Y chromosome and the mitochondrial DNA, and major branch points have been labeled with letters or letter combinations to designate the haplogroups. The SNPs themselves are usually named with a letter-number combination, and sometimes a branch point is labeled with the name of the SNP that is associated with it. (These diagrams are usually referred to as "trees," although "root system" might make more sense, since all the groups converge at the top or one side of the diagram to the ancestral sequence type, the sequence of mitochondrial Eve or Y chromosome Adam.)
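
As a rough sketch of how a haplotype is placed into a haplogroup, the Python below walks a tiny, entirely hypothetical tree. The group names (H1, H2, H2a) and SNP names (snp_a, snp_b, snp_c) are invented for illustration and are not real haplogroup or SNP designations. The sketch assumes one defining SNP per branch point, which is a simplification of the real trees.

```python
# Hypothetical toy tree: group -> (parent group, defining SNP).
TREE = {
    "root": (None, None),
    "H1": ("root", "snp_a"),
    "H2": ("root", "snp_b"),
    "H2a": ("H2", "snp_c"),
}

def defining_path(group):
    """Return the defining SNPs on the path from the root down to `group`."""
    path = []
    while group is not None:
        parent, snp = TREE[group]
        if snp is not None:
            path.append(snp)
        group = parent
    return list(reversed(path))

def assign_haplogroup(haplotype):
    """Pick the deepest group whose whole chain of defining SNPs is present."""
    best, best_depth = "root", 0
    for group in TREE:
        path = defining_path(group)
        if len(path) > best_depth and set(path) <= haplotype:
            best, best_depth = group, len(path)
    return best

# A haplotype is the full set of SNPs one person carries, including private ones.
print(assign_haplogroup({"snp_b", "snp_c", "private_1"}))
print(assign_haplogroup({"snp_b", "private_2"}))
```

The first haplotype carries both snp_b and snp_c and so falls in H2a; the second carries only snp_b and stops at H2. The private SNPs play no part in the assignment.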


The standard diagrams of Y chromosome haplogroups and mtDNA haplogroups have been determined by sequencing DNA from people all over the world and adding each to the tree. They are shown in the figures, with indicators of their distribution in the world in the second figure for each.


Y chromosome: http://www.pnas.org/content/106/48/20174/F1.expansion.html


Geographic distribution of Y haplogroups:
http://www.pnas.org/content/106/48/20174/F2.expansion.html


A figure which reflects more recent refinements to the Y tree, but less precision in the geographic distributions:
https://genographic.nationalgeographic.com/y-tree-update/

mtDNA:
http://rstb.royalsocietypublishing.org/content/367/1590/770.full
Figure 2 in this paper.

Geographic distribution of mtDNA haplogroups: http://www.worldfamilies.net/mtdnamap

From the figures with the distributions of the haplogroups, it can be seen that the oldest Y haplogroups (those with the longest branches) and the oldest mtDNA haplogroups are found only in sub-Saharan Africa. If you were to compare the SNPs of an African in one of these groups to a native American or an aboriginal Australian, you would find that such a pair would share a lower percentage of their SNPs than two Australians, two native Americans, two Europeans or even a European and a native American. 


Two Africans could have a high percentage of their SNPs in common or a low percentage, depending on how long it has been since their last patrilineal or matrilineal common ancestor. It is the fact that any pairing of an African with these ancient haplogroups and any non-African will have a lower percentage of SNPs in common than can be seen in any pair from the rest of the world that indicates that these African haplogroups are the oldest populations in the world.
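
To make the "percentage of SNPs in common" comparison concrete, here is a toy Python sketch. The SNP labels and the three SNP sets are invented, and the measure used (shared SNPs divided by all SNPs seen in either person) is just one reasonable way to put a number on the idea, not necessarily the statistic used in the published studies.

```python
def shared_fraction(snps_x, snps_y):
    """Fraction of all SNPs seen in either person that both people carry."""
    union = snps_x | snps_y
    if not union:
        return 1.0  # two sequences with no SNPs at all are identical
    return len(snps_x & snps_y) / len(union)

# Toy SNP sets with hypothetical labels. "r1" is shared by everyone; the deep
# African lineage split off early, the other two lineages split much later.
deep_african = {"r1", "a1", "a2", "a3", "a4", "a5"}
european = {"r1", "n1", "n2", "n3", "e1", "e2"}
native_american = {"r1", "n1", "n2", "n3", "q1", "q2"}

print(shared_fraction(deep_african, european))
print(shared_fraction(european, native_american))
```

In this toy data the deep African lineage shares only 1 of 11 SNPs with the European one, while the European and native American lineages share 4 of 8, because their branch point is more recent.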


Thus it appears that certain Africans have the deepest rooting haplogroups for both Y chromosome and mitochondrial DNA, and the population of the rest of the world seems to be derived from them. It is possible to look at the order of emergence of haplogroups in the tree and compare it to where they are found today as large percentages of the population and get a broad brush view of the migration of people in the past.


It may be helpful to have an idea how many SNPs we are talking about in each DNA molecule and how often a new SNP occurs. The mitochondrial haplogroup tree is roughly 60 SNPs deep, that is, if you go from one person's sequence today back up the tree to the ancestral sequence, there are roughly 60 changes, although this number can vary quite a bit. A new mitochondrial SNP will occur in a particular lineage on average about every 2000-3000 years. For the Y chromosome, about 10 million base pairs can be readily mapped, and for a typical man roughly 1000 SNPs can be defined in those regions. A new Y SNP occurs on average about every 100-150 years. Both the number of SNPs and the time interval between new ones vary substantially. What I have given are current best estimates of the averages. 


When the SNPs in a given set of Y chromosomes have been determined, the rates at which new SNPs occur can be used to calculate rough estimates of the age of particular SNPs in the tree. These estimates can then be compared to dated archeological artifacts and ancient DNA sequence results to estimate when particular migrations took place. The rate of occurrence of new SNPs (effective mutation rate) has been determined by counting the number of variants that occur over a known time interval, either in a well documented genealogy or by comparing current sequences to dated ancient sequences. Recently it has been possible to measure this directly by sequencing parents and children in the same family. There is still some uncertainty about mutation rates and to what degree they vary over time and in different regions of the chromosome, which accounts for part of the uncertainty about exact dates.
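
The arithmetic behind these rough age estimates is simple enough to show as a short Python sketch. It uses only the round figures given above (about 60 mtDNA SNPs at one new SNP per 2000-3000 years, and about 1000 Y SNPs at one per 100-150 years) and assumes a constant average rate, which real dating methods do not simply take for granted. The function name is my own.

```python
def age_range_years(snp_count, years_per_snp_low, years_per_snp_high):
    """Rough age of a lineage: number of SNPs times the average interval
    between new SNPs, for the low and high ends of the rate estimate."""
    return snp_count * years_per_snp_low, snp_count * years_per_snp_high

# mtDNA: roughly 60 SNPs from a present-day sequence back to the root.
mt_low, mt_high = age_range_years(60, 2000, 3000)

# Y chromosome: roughly 1000 SNPs in the readily mapped 10 million base pairs.
y_low, y_high = age_range_years(1000, 100, 150)

print(f"mtDNA root: about {mt_low // 1000}-{mt_high // 1000} kya")
print(f"Y root: about {y_low // 1000}-{y_high // 1000} kya")
```

With these round numbers both molecules point to a root somewhere between 100 and 180 kya. The same multiplication applied to the SNPs below any branch point gives a rough age for that branch.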


The earliest targets of DNA-based migration studies were particular genes on the autosomes, but I won't be discussing those studies here. MtDNA came to prominence in the 1980s when it was first used to deduce a 'mitochondrial Eve' in Africa. The African origin of modern humans had been suggested by Darwin based on skeletal comparisons with apes, and the study of individual autosomal genes in various populations had likewise suggested an African origin.


Looking at the mtDNA tree, it can be seen that several L lineages (L1a, L1b, L1c, L1k, & L2 in the older naming system in this figure; L0, L1, L2, L4 & L5 in the current system) are the deepest lineages in the tree and are only seen in Africa. L3 has some branches in Africa and its remaining descendant groups (M, N and R) contain the lineages for the rest of the world. While some L lineages are only seen in Africa today, or in individuals whose recent ancestors left Africa in the last few centuries, I should make clear that these L haplogroups are not the only lineages seen in Africa today. They are just the deepest ones in terms of the number of SNPs they contain.


Over the last 150,000 years there have been a lot of population movements and the slow diffusion of genes that comes in the small increments of a man or woman moving as a result of marriage (or mating outside of marriage.) As a result, every population, whether defined by nation, language, religion, or region is a mixture of haplogroups. Usually several haplogroups together will account for a substantial majority of a local population and there will be a number of haplogroups that are found at low frequencies. (When younger SNPs are used to define subhaplogroups, the geographic ranges seen are often smaller.) In the figures in the 2007 paper in the references (unfortunately not open access,) it can be seen that the much younger mtDNA haplogroups M1 and U6, descendants of L3, are also seen in Africa today. These are probably the result of later migrations of people from Eurasia back into Africa.


Haplogroups M and N are of nearly equal age, and R branched from N at close to the same age.  Deep branching derivatives of these three groups are found along the Indian Ocean shores from Arabia to southeast Asia, New Guinea and Australia. This is believed to be the result of an exit from Africa by crossing the Red Sea at its mouth about 60 kya, followed by migration eastward on the shore of the Indian Ocean, reaching Australia and New Guinea by about 50 kya. (Australia and New Guinea and neighboring areas were a single land mass called Sahul at the time, because of lower sea levels.) It is still a matter of dispute whether there were one or more migrations of AMHs out of Africa, but in either case the total number of people involved is estimated from the genetic diversity decrease from Africans to non-Africans to have been a few hundred to a few thousand. 


Some people also migrated back much later into North Africa, which is isolated from sub-Saharan Africa in all but the wettest climates by the Sahara desert. Derivatives of haplogroups M, N and R migrated north from the Indian Ocean coast to inhabit all of Asia and later, Europe, by routes that are still being studied. The initial migration into Europe is dated to about 45 kya by both genetics and dating of archaeological sites. Some deep rooting branches of Hg N are present in the Middle East and Europe at low levels. Hg M and its branches (except M1 which re-entered Africa) are shifted to the east, where they are prevalent in east and southeast Asia and the Pacific islands. Derivatives of haplogroup M are very rare in west Eurasia. Haplogroup R and its branches range across Asia into Europe. 


Certain subhaplogroups of M and R are found in Melanesia (areas inhabited by ancient dark skinned peoples,) which includes New Guinea, Borneo and the islands northeast of Australia. The aboriginal inhabitants of these islands are believed to derive from the early migrants from Africa. This is based on both the genetic results and the archeology of these regions. Some of the subhaplogroups in Melanesia are also found in Australian aborigines, as well as several other subhaplogroups of M, N, and R. New Zealand and the more isolated Pacific islands were first inhabited by a much later migration beginning about 5000 years ago, possibly from Taiwan, called the Austronesian expansion, which has its own much younger haplogroup, B4a1a1, derived from R, and its own language group. Haplogroup determination and linguistic analysis indicate that the Austronesian expansion even extended westward to Madagascar off the east coast of Africa.


Around 15-20 kya groups that had weathered the last ice age in Beringia, an extension of Siberia that depended on the lower sea levels, migrated into North America and from there to South America. Current results indicate that there were at least three separate migrations corresponding to the three language groups of native Americans: Inuit, Na-Dene and Amerind. MtDNA haplogroups A2, B2, C1, C4, D1-4 and X1 (also derived from M, N and R) are the only mtDNA haplogroups found so far among native Americans. Of course, today native Americans are part of populations that also have the haplogroups of immigrants from all over the world. There is some evidence that a small number of Pacific islanders made it to South America (and back to Easter Island) by sea, but they made no significant contribution to the mtDNA haplogroups in South America. 


The mix of mitochondrial haplogroups in various populations around the world today can be seen at: http://www.scs.illinois.edu/~mcdonald/worldmtdnamap


The story of the Y chromosome differs in detail, but the broad picture is similar to mtDNA. The A and B haplogroups are the oldest in the Y tree and are African. These haplogroups are common among modern African hunter-gatherers (Khoisan, Pygmy, Hadza.) The next oldest haplogroup, DECF, composed of the pairs DE and CF, is believed to be the group (defined by a single SNP) that left Africa ~60,000 years ago. All the other Y-haplogroups in the world branch off from F and are younger. Haplogroup E is very prominent in Africa today, and it occurs in southern Europe as well. This may mean that E originated in the horn of Africa or it may result from a return from the Middle East to Africa. It seems that it was part of a population that migrated west across the Mediterranean region and north Africa as well as back into southern Africa. E1b1a is associated with the expansion of Bantu-speaking farmers from central Africa into the south about 3 kya.


Again, the younger haplogroups C-F and groups that derive from F populate Asia, Europe, Australia, the Pacific islands and the Americas. Haplogroups G-J are in west Eurasia and north Africa, and as far east as India. Groups M and S are characteristic of Australia and nearby islands. Haplogroup O is dominant in east Asia. The young groups R1a and R1b are associated with migrations into eastern and western Europe, respectively, in the Neolithic and Bronze Ages, where they encountered hunter-gatherers of older haplogroups such as haplogroups G-J, some of whom had migrated into western Asia and Europe over 40,000 years ago. The Siberian people that migrated to the Americas had C3 and Q4 Y haplogroups.


The mtDNA and Y chromosome haplogroup trees have been worked out in much finer detail at the tips than is shown in the figures linked above. The finer branches in most cases have not yet been mapped geographically, but where they have, often the finer branches separate to some extent within the ranges of the major haplogroup. The haplogroups common in Europe (and consequently the U.S. as well) have been defined in the most detail. Samples of the current state of the trees of haplogroups Y-R and mtDNA-R0 can be seen at these links:

Y tree for haplogroup R - http://www.isogg.org/tree/ISOGG_HapgrpR.html

mtDNA tree for haplogroup R0 - http://www.phylotree.org/tree/subtree_R0.htm

The migration hypotheses described above have largely been driven by determination of haplogroups in modern populations correlated with the results of archaeology and linguistics. In recent years it has been possible to test these ideas by sequencing ancient human DNA from dated archaeological material. Initially mtDNA was the most useful because its high copy number allowed enough DNA to survive in teeth and bones to be typed with the early techniques. In recent years a few specialized research groups have developed methods to analyze whole ancient genomes including mitochondrial and Y DNA. These studies have resulted in some adjustment of models, but generally have confirmed earlier results. 

In another recent development these studies of Y chromosomes and mtDNA have been supplemented with studies of whole modern human genomes from around the world, and even a few whole ancient genomes. The autosomes and the X chromosome contain many thousands of genes and billions of nucleotides, and because recombination occurs in these pairs of chromosomes during the formation of sperm and eggs, using them to study the history of populations is more complicated than what I have discussed here. However, there is far more information contained in whole genomes than in mtDNA and the Y chromosome, and the observations, to this point, support the same models of human migration that were suggested by the earlier Y chromosome and mtDNA studies. (See http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3334592/ )


In one major respect DNA results in general have changed the view of archaeologists. For many years, the fashion in archaeology was to invoke diffusion of ideas and technologies rather than movements of people. Genetics has shown that mass migrations of people have really occurred. The haplogroups prevalent in people of one era are often largely replaced by different haplogroups in a later era. Those movements of early farmers out of the Fertile Crescent, the invading Bronze Age warriors, Anglo-Saxons and Vikings that you heard about in school really happened, and the genetic traces of these events can be detected in both ancient and modern DNA.


There are many fascinating stories among the detailed studies of migrations. The genetics of the Jewish populations of the Diaspora have been studied in detail, revealing a common Middle Eastern genetic heritage with varying degrees of mixing with the surrounding populations of each Diaspora location. Recently, a particular Y chromosome sub-haplogroup was identified that occurs in a very high percentage of Ashkenazi (of eastern European descent) Levites (priestly caste), a lower proportion of other Jews and rarely in surrounding eastern European populations. Its origin appears to be in the eastern Mediterranean region, where it is seen in some Jewish and non-Jewish men.


I have discussed in this post what genealogists call deep ancestry - SNP testing on the Y and mtDNA generally deals with variants that occurred thousands of years ago. I haven't discussed another kind of variant, short tandem repeats (STRs) on the Y chromosome and autosomes, each of which can vary in the copy number of its particular repeat. Copy number changes in STRs happen more frequently than new SNPs, and thus are used to test possible common ancestors within the last few centuries (the "genealogical" time frame defined by the period in which surnames were adopted.) If you are interested in testing either SNPs or STRs in your own DNA, information on the different companies and tests is available at the wiki of the International Society of Genetic Genealogy (ISOGG.) http://www.isogg.org/
-------------
So, how should a Christian think about the Out of Africa hypothesis? What does the population geneticist have to say that is relevant to Christian beliefs? On the question of one couple named Adam and Eve, the answer I think is "not much." Neither archaeology nor genetics can say much about one ancient couple, unless they happen to be among the few whose bones are found. Dennis Venema at http://biologos.org has reviewed the reasons that population genetics seems to rule out one couple who lived a few thousand years ago as the unique progenitors of all mankind. The short version of the argument is that there is too much genetic diversity in the autosomes of the worldwide human population to have arisen from one couple who lived a few thousand years ago, or even 100 kya. 

In addition, AMHs were already spread around the world, including Australia, western Polynesia and the Americas long before the traditional Biblical dating. These things don't mean that there couldn't have been a couple singled out as representatives of humanity for a unique experience of (and test by) God. Robin Collins (at Biologos.org) has discussed different ways of thinking about the stories in Genesis, and I won't try to replicate that discussion here.


Science has very little to say about when or where such a thing might have happened. What the Christian is interested in is when and where did biological humans become spiritually and morally responsible beings. It's not possible to determine that from tools, bones, cave paintings or DNA. I have read Christians who suggested a literal Adam and Eve millions of years ago. This makes little sense to me. Highly developed language certainly seems to be a prerequisite for the transition to mature humanity, and it seems unlikely that language had progressed that far so long ago. Others have suggested the cultural flowering (cave art, primitive musical instruments and carved figurines) of ~40 kya as a sign of the arrival of "Homo divinus." This may be more reasonable, but AMHs were already scattered over Africa, Eurasia, and Australia by that point. 


Some people think that the appearance of sophisticated tools, art and music must imply full humanity. However, children learn sophisticated language, invent games, develop an acute sense of what is fair, draw pictures, sing songs, even learn to play instruments years before we regard them as having become responsible before the law and God. So I'm not convinced that any of these things are necessarily signs of spiritually mature humanity.


It is a remarkable observation that, while there are some earlier artifacts that can be taken as related to religion, religion seems to have arisen as a widespread phenomenon about 10 kya, more or less simultaneously around the world. From a scientific standpoint, it is not clear why it should not have happened earlier, or less synchronously. (http://rstb.royalsocietypublishing.org/content/363/1499/2041) The Christian or Jew may understandably think that this reflects God's action stirring humans to an awareness of Himself, and that this is reflected in the early chapters of Genesis.


It is difficult to tell what people were thinking before the development of written language, which seems to have begun no more than 6000 years ago. With only ancient artifacts and contemporary Neolithic cultures to study, the conclusions that science can draw about the origin of religion are limited. The scientific uncertainties and disagreements about Scriptural interpretation make it likely that there will not soon be any universally accepted view among Christians of the dawn of spiritual consciousness. Under these circumstances, I would suggest that we should agree to discuss these things and think about them and not regard any Christian as spiritually inferior because they disagree with us on these issues.
--------

References


All references are open access except the 2007 and 2014 articles.


2005 Single, Rapid Coastal Settlement of Asia Revealed by Analysis of Complete Mitochondrial Genomes. Science 308, 1034-1036.


2007 Use of Y chromosome and mitochondrial DNA population structure in tracing human migrations. Ann. Rev. Genet. 41, 539–64.

2008 Neuroscience, evolution and the sapient paradox: the factuality of value and of the sacred.

2009 Y chromosome diversity, human expansion, drift, and cultural evolution. Proc Natl Acad Sci U S A. 106, 20174-9.


2011 A world in a grain of sand: human history from genetic data. Genome Biol. 12, 234.


2012 Out-of-Africa, the peopling of continents and islands: tracing uniparental gene trees across the map. Phil. Trans. Royal Soc. B 367, 1590.

2012 The great human expansion. Proc. Natl. Acad. Sci. U.S.A. 109, 17758-17764. 


2013 Genetic and archaeological perspectives on the initial modern human colonization of southern Asia. Proc. Natl. Acad. Sci. U.S.A. 110, 10699–10704. 


2013 Phylogenetic applications of whole Y-chromosome sequences and the Near Eastern origin of Ashkenazi Levites.

2013 Bioenergetics in human evolution and disease: implications for the origins of biological complexity and the missing genetic variation of common diseases
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3685467/
Diagram of the migratory history of the human mtDNA haplogroups
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3685467/figure/RSTB20120267F1/

2014 The impact of whole genome sequencing on the reconstruction of human population history. Nature Reviews Genetics 15, 149-162.

Transposable Elements and Common Descent of Humans and other Primates

Note on this blog - I set this blog up twice several years ago. The second attempt was started with this post, but it fell into some kind of digital hole at google, and I couldn't find any way to get back to it to edit the first post or to add any new posts. I've now moved this post to the blog at the original address biomattersarising.com. I would like to change the address to artofthesoluble.com, but google has never responded to me about fixing this problem, so there seems to be no way to do it. You'll have to save a link, or remember to type in an address that doesn't match the name of the blog.
-------------
I am writing this because it seems to me that the type of evidence that I am going to describe here for evolution is both decisive and readily comprehensible by laymen. I am going to concentrate on humans, chimpanzees and other primates because this seems to be what concerns people the most, and because several primate genomes have been fully sequenced (including the human genome, which has of course been studied more intensely than any other.) This means that there is a staggering amount of evidence. I will just note that the same kind of analysis could be done for any other group of closely related organisms where substantial amounts of genome sequence are available, because transposable elements are present in nearly all animals and plants and in a large portion of microbes.

The argument, simply put, is this. Primate genomes contain large numbers (millions) of sequences (ranging from 50 or so bp to 6000 bp) which got where they are by being copied from another location in the genome and inserted where they are now. (There are longer sequences that have been duplicated to new locations in the human genome, but they are not my focus here.) The processes by which these sequences get inserted are not target specific. When a sequence segment "jumps" (actually in most cases the sequence is copied and inserted) its final location in the genome is largely random. And here is the essential point. When you compare the genomes of different species (say human and chimp) huge numbers of these transposed sequences are found in exactly the corresponding position in the 2 genomes. Depending on when the transposon at a particular site was inserted, it may be there in multiple species. Very old transposons can be present at a particular site in all mammals. Generally the more closely related two species are, the more transposon insertion sites they will share.

Some transposable elements are still "jumping" in the human and other genomes. About 80 instances of genetic diseases have been found where a transposable element inserted into a gene in an individual and was not present in either parent, and over 7000 locations have been identified in the human genomes sequenced so far where a transposable element has inserted in some chromosomes but is absent in other copies of the chromosome and in all chimp chromosomes. These latter new insertions are interesting but not relevant to my argument, except that they may serve to convince people that transposons really do get where they are by being copied and inserted. They aren't just repetitive sequences. One creationist web site did a whole series on Alu transposons and never mentioned that they get where they are by insertion.

Now logically, the presence of a transposable sequence at a particular location in different species could be the result of parallel transpositions of the same element in the two separate species. However, none of the transposon types that occur in primates are targeted to specific locations. There are transposons in bacteria that target specific sites that can be unique in a genome, but none of the transposons in primates work this way. When a transposon in a mammal jumps, it has about 3 billion base pairs to "choose from" when it lands. To target a single unique site would require a recognition sequence of at least 16 bp, and then only if the enzymes involved had absolute specificity for one particular sequence. (There are 4^16 = 2^32 possible sequences 16 bp long, which is over 4 billion sequences.) The transposons in mammals do have a statistical preference for much shorter (AT rich) sequences (about 5-6 bases long), but there are millions of short sequences in a mammalian genome that fit these preferences, and the preference isn't absolute. A transposon will sometimes land in a suboptimal sequence. Any break in the DNA caused by various kinds of damage can be a target for insertion of a transposon.
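The arithmetic behind the 16 bp figure can be checked with a few lines of Python. This is only a sketch under simplifying assumptions: a genome of 3 billion bp and four equally likely bases (real genomes are not uniform in base composition). The function names are mine, made up for illustration.

```python
GENOME_SIZE_BP = 3_000_000_000  # approximate mammalian genome size used in the text


def min_unique_site_length(genome_size_bp: int) -> int:
    """Smallest k for which the number of possible k-bp sequences (4**k)
    exceeds the number of positions in the genome."""
    k = 1
    while 4 ** k <= genome_size_bp:
        k += 1
    return k


def expected_occurrences(site_length_bp: int, genome_size_bp: int) -> float:
    """Expected number of times one particular sequence of the given length
    occurs, assuming (unrealistically) uniform random base composition."""
    return genome_size_bp / 4 ** site_length_bp


k = min_unique_site_length(GENOME_SIZE_BP)
print(k, 4 ** k)                                 # shortest length with more sequences than positions
print(expected_occurrences(5, GENOME_SIZE_BP))   # a single 5 bp sequence
print(expected_occurrences(6, GENOME_SIZE_BP))   # a single 6 bp sequence
```

Under these assumptions a 15 bp site is still too short to be unique (4^15 is only about 1.07 billion), while any single 5 or 6 bp sequence is expected to turn up hundreds of thousands to millions of times.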

The upshot of this is that even a single case of transposons inserting independently at exactly the corresponding site in two different species is a rare event. This has been reported, but it was possible to distinguish the 2 events even in this case because the two insertions were by different classes of transposon: the sequence that inserted at the site was completely different in the two species. (In the human genome over 900 classes and subclasses of transposable element have been distinguished.) It is also possible to distinguish insertions at the same site of the same element, because the inserted elements are often truncated from their full length, and the different lengths distinguish separate events.

The result of this is that the hundreds of thousands of cases of the same transposon being inserted at exactly the corresponding position in the genomes of different species can only be accounted for by the different species having common ancestors in which the transposon insertions took place.

An additional level of specificity is added by the fact that transposons often insert within previously inserted transposons. This is not surprising when you realize that the human genome (like other primate genomes) is at least 50% composed of transposable element sequences. When transposons have inserted into previously inserted transposons it is possible to analyze the sequences and determine the order in which the different transposons were inserted. There are over 600,000 clusters of multiple transposons like this in the human genome. When you compare the human and chimp genomes you find that the transposons were not only inserted at the same sites in the two genomes, they were inserted in the same order.
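The logic of reading insertion order from nesting can be sketched in a few lines of Python. This is a toy illustration, not how any real annotation pipeline works: the element names and coordinates are invented, and it assumes the two flanking fragments of an interrupted element have already been joined into a single start-to-end span. If one element lies entirely inside the span of another, the outer one must have been there first.

```python
from typing import NamedTuple


class Element(NamedTuple):
    name: str   # hypothetical label for a transposable element copy
    start: int  # first base of the element's full span (toy coordinates)
    end: int    # last base of the element's full span (toy coordinates)


def nesting_pairs(elements):
    """Return the set of (older, younger) name pairs, where the younger
    element lies strictly inside the span of the older one."""
    pairs = set()
    for outer in elements:
        for inner in elements:
            if outer.start < inner.start and inner.end < outer.end:
                pairs.add((outer.name, inner.name))
    return pairs


# Invented example: an L2 interrupted by an AluSx, itself interrupted by an AluY.
human = [Element("L2", 100, 2000), Element("AluSx", 500, 800), Element("AluY", 600, 700)]
# Same cluster in a second species; coordinates shifted by small insertions or deletions.
chimp = [Element("L2", 130, 2040), Element("AluSx", 530, 835), Element("AluY", 632, 733)]

print(sorted(nesting_pairs(human)))
print(nesting_pairs(human) == nesting_pairs(chimp))  # same inferred order in both
```

The coordinates do not have to match between the two species for the comparison to work. Only the containment relationships matter, which is why small insertions and deletions in the surrounding sequence do not disturb the inferred order.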

When you put all this together it is apparent that the odds against millions of transposon insertions (the human genome contains about 3 million total) occurring in parallel at the corresponding sites in different species and in the same order are astronomical. (Actually trans-astronomical. The calculation would produce a number larger than any number that is useful in astronomy.)
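To get a feel for the size of the number, here is a rough back-of-the-envelope sketch in Python. It makes deliberately crude assumptions that are mine, not taken from any of the references: each insertion lands uniformly at random among 3 billion positions, insertions are independent of each other, and the weak sequence preference and the ordering of nested insertions are ignored. The count of 100,000 shared sites is only a conservative stand-in for the "hundreds of thousands" mentioned above.

```python
import math

GENOME_SIZE_BP = 3_000_000_000  # approximate genome size used in the text


def log10_prob_all_coincide(n_insertions: int, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """log10 of the probability that n independent insertions in one species
    each land on the position corresponding to an insertion in another
    species, assuming uniformly random insertion sites."""
    return -n_insertions * math.log10(genome_size_bp)


print(log10_prob_all_coincide(1))        # a single coincidence
print(log10_prob_all_coincide(100_000))  # 100,000 coincidences (stand-in figure)
```

Working in logarithms is necessary because the probability itself is far too small to represent as an ordinary floating point number. Even if the sequence preference narrowed the choice to a few million favored sites, the exponent would shrink by less than half and the conclusion would be the same.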

The result of this is that there are only two possibilities to account for the transposons in animal genomes: common ancestors or miracles, millions and millions of miracles. But the trouble with miracles is that when you have invoked them, you have quit doing science, because miracles can account for anything. You can postulate that the whole world, including all our memories and all the physical evidence, was created 5 minutes ago. No one can prove that it didn't happen. It just isn't very interesting. Once you have started doing that, evidence is irrelevant. You might as well just go to the beach or watch TV or whatever you prefer. There's no reason to do all the work and spend all the money that it takes to do science. So, if you want to stick to science, the only way to account for all those transposons inserted at the same position in different species is that those species had common ancestors. At least that's the only scientific possibility I can think of.

That is the argument in brief. There is a huge mass of details that one can get into about the different kinds of transposons in mammalian genomes, their mechanisms of transposition, the different ways of estimating the age of individual insertion sites, the occurrence of sequences in the transposable elements that have some function for the cell, the many ways that transposons alter chromosome sequences by carrying neighboring sequences with them when they transpose, the mechanisms that cells employ to suppress the transcription of TEs most of the time, the way that new TE insertions get into the germ line so that they are inherited in the next generation, the occurrence of transposition in somatic cells and the effect that that sometimes has on induced cancer, etc. But what I have presented here is the basic argument that TEs provide for the fact that current species groups have common ancestors, i.e., that speciation and evolution have occurred.

References

2001 Endogenous retroviruses in the human genome sequence. Genome Biol. 2:reviews1017.

2001 Initial sequencing and analysis of the human genome. Nature 409, 860. See Section "Repeat content of the human genome."

2004 Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res. 14, 2245.

2006 Retroposed elements as archives for the evolutionary history of placental mammals. PLOS Biol. Apr 4, e91.


2007 Evolutionary History of Mammalian Transposons Determined by Genome-Wide Defragmentation. PLOS Comput. Biol.3(7): e137.

2008 Mammalian non-LTR retrotransposons: For better or worse, in sickness and in health. Genome Res. 18, 343.

2009 The impact of retrotransposons on human genome evolution. Nature Rev. Genetics 10, 691.

2013 Mobile element scanning (ME-Scan) identifies thousands of novel Alu insertions in diverse human populations. Genome Res. 23, 1170.

To give some idea of what interspecies comparison of TE insertion sites looks like, I am including a figure that aligns a 50,000 bp region of the human and chimp genomes, with the TEs marked.

Figure 1. The upper panel is repetitive sequences determined by Repeat Masker software in a segment of human chromosome 3. The bottom panel is the corresponding segment in chimp. Generally the elements present in human are present in chimp, although they may not line up perfectly due to small insertions or deletions in the intervening sequences. Darker shades of grey mean that the element is very similar to the standard sequence that the software uses for that type of element, and thus that the element is younger and has had less time for its sequence to diverge. In at least one case of an old, highly diverged L2 element, the software detected it in human but missed it in the chimp sequence. SINEs are short interspersed elements, the most common of which in humans are Alu elements. LINEs are long interspersed elements, the most common of which in humans are LINE-1s. LTR elements are endogenous retroviruses and related elements which lack the envelope gene and thus can't form virus particles. LTR stands for long terminal repeat, the diagnostic characteristic of this kind of element. DNA transposons are old elements that moved by a cut-and-paste mechanism, but none of them have been active in the line that led to humans for a very long time. The bottom 2 or 3 tracks of each part of the figure show TEs that have been interrupted by the insertion of another TE.