Thursday, April 2, 2015

On Finding Where Your Paternal Ancestor Came From using Y DNA Results

This post pretty much assumes that you know about genetic genealogy - Y SNPs and STRs. Here's a brief explanation.

Y-SNPs are single nucleotide variants that are used to define the branching history of Y chromosomes as they are passed from father to son. Whenever a new mutation occurs and is passed on, a new branch forms on this "root" system (which is usually referred to as a tree). A new mutation occurs about every 4 generations on average. SNP groups are now defined finely enough that most Europeans can place themselves under a SNP that arose a few thousand years ago, although a few lineages have been worked out to within a few hundred years.

STRs are short tandem repeats on the Y. Each STR locus has a repeat copy number at any point in time. When you get your STR results, you get a number at each locus for the number of copies present. It is possible to be tested at 12, 25, 37, 67 or 111 STR loci. The first three levels have been found to be pretty useless for anything other than predicting your SNP group. I have been tested to 67. As the Y chromosome accumulates SNPs over time, it also undergoes periodic shifts in copy number at STRs. The copy number of an STR can move up or down, usually, but not always, in steps of one. The rate at which these changes occur differs greatly between STRs, varying up to 30 fold among those used in the 67 panel. If you have a close match (< 8 steps) at 67 STRs with someone, you almost certainly share a patrilineal ancestor with them in the last few centuries.

I have been working my way around to trying something since I had my DNA tested a year ago. My Garrison clan knows that we came from England around 1700 or a little later, but we have no idea where in England we came from and we don't know any relatives over there. My 67 STR test done at FT confirmed that I was part of this clan. Soon after I got the results I joined the U106 project (I also did SNP testing with the National Geographic Geno 2 test). U106 is one of the major SNP groups in western Europe, accounting for about 15% of men. I proved to be in one of its subgroups - Z331 Z326-.

I soon started running my STRs on a spreadsheet that calculated simple genetic distance (GD) against other people's data. I tested people in my subclade of U106, Z331 Z326-, and people from nearby subclades to get a feel for what kind of distances they would have. I also searched Y-search (FT's public database where people can put their STR results, no matter what company they tested with) and found a handful of 67 STR kits that were close to me but with a GD of 8-13, just larger than the usual standards for same-surname clans.
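As a rough illustration of what such a spreadsheet computes, here is a minimal sketch of simple genetic distance, assuming the plain stepwise definition: the sum of the absolute differences in repeat count over the loci two kits share. The function name and the repeat counts are made up for the example, and testing companies may treat some multi-copy markers differently, which this sketch ignores.

```python
def simple_genetic_distance(kit_a, kit_b):
    """Sum of absolute repeat-count differences over the loci both kits share."""
    shared = kit_a.keys() & kit_b.keys()
    if not shared:
        raise ValueError("kits have no STR loci in common")
    return sum(abs(kit_a[locus] - kit_b[locus]) for locus in shared)


# Hypothetical repeat counts at a few Y-STR loci (a real kit would list 67).
mine = {"DYS393": 13, "DYS390": 23, "DYS19": 14, "DYS391": 11, "DYS388": 12}
other = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 10, "DYS388": 12}

# The kits differ by one step at DYS390 and one step at DYS391.
print(simple_genetic_distance(mine, other))  # prints 2
```

Every one-step difference counts the same here, whether it falls on a fast-mutating locus or a slow one.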

The idea that formed was that I might find people who have a common paternal ancestor with me in the centuries before surname adoption, and they might know where they are from. That way I could narrow down where we should look for relatives in England.

Of course, there are possible false positives. STR panels can occasionally "converge" by chance, so that someone whose STRs look close to yours is really much more distantly related to you, through a common ancestor thousands of years ago. If they are in a different SNP clade than you, then you know this has happened. Even if they really are close to you, someone could have migrated in an unusual way in the Old Country, so that their locale at the time of immigration tells you nothing about their ancestral locale or your own origin. Our own ancestor could have migrated from their long-time home before immigration, so we are really looking for that long-time home region before 1700.

In order to minimize the problems from false positives, we need to look for the closest non-family matches, since they are less likely to be due to convergence, and we need to find a bunch of kits close to us who know their origin and look at the geographic distribution to see if there is an obvious cluster. There will be noise from the false positives noted above, but if it's not too much we may be able to see a pattern.
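The filtering and tallying described above could be sketched along these lines. This is only an illustration: the function name, the record layout, the distance cutoff, and the placeholder surnames and regions are all assumptions, not real data.

```python
from collections import Counter


def origin_tally(kits, max_distance, family_surnames=()):
    """Count known origins among non-family kits within max_distance."""
    tally = Counter()
    for kit in kits:
        if kit["surname"] in family_surnames:
            continue  # same-surname matches say nothing new about origin
        if kit["distance"] > max_distance:
            continue  # distant matches are more likely to be convergence
        if kit["origin"] is None:
            continue  # the owner does not know where the line came from
        tally[kit["origin"]] += 1
    return tally


# Placeholder records; distances, surnames and regions are invented.
kits = [
    {"surname": "Garrison", "distance": 3, "origin": None},
    {"surname": "Surname1", "distance": 9, "origin": "Region A"},
    {"surname": "Surname2", "distance": 11, "origin": "Region A"},
    {"surname": "Surname3", "distance": 12, "origin": None},
    {"surname": "Surname4", "distance": 13, "origin": "Region B"},
    {"surname": "Surname5", "distance": 25, "origin": "Region C"},
]

tally = origin_tally(kits, max_distance=15, family_surnames={"Garrison"})
print(tally.most_common())  # prints [('Region A', 2), ('Region B', 1)]
```

With only three usable kits the "cluster" in this toy output means very little, which is the point: the tally only becomes informative once there are enough close kits with known origins.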

I recently developed what should be a better way of calculating genetic distance that corrects for the estimated mutation rate at each STR - weighted genetic distance (wGD). A difference at a slow STR counts proportionately more than a difference at a fast one. By this measure I am 4.5 from the modal haplotype of my Garrison clan. A modal haplotype is composed of the most common value at each STR in a group descended from a common ancestor. It is a way of approximating the STR haplotype of the common ancestor.
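The exact weighting is not spelled out here, so the following is only one plausible sketch. It assumes each step difference is scaled by the average mutation rate divided by that locus's rate, so that a step at an average locus counts as 1 and a step at a slow locus counts for more. The function names, locus names and rates are hypothetical. A modal haplotype helper is included as well; ties are broken by whichever value appears first.

```python
from collections import Counter


def modal_haplotype(kits):
    """Most common repeat count at each locus across a group of kits."""
    loci = set().union(*(kit.keys() for kit in kits))
    modal = {}
    for locus in loci:
        values = [kit[locus] for kit in kits if locus in kit]
        modal[locus] = Counter(values).most_common(1)[0][0]
    return modal


def weighted_genetic_distance(kit_a, kit_b, rates):
    """Step differences weighted by (mean rate / locus rate).

    The weighting scheme is an assumption made for illustration.
    """
    shared = kit_a.keys() & kit_b.keys() & rates.keys()
    if not shared:
        raise ValueError("no shared loci with a known mutation rate")
    mean_rate = sum(rates.values()) / len(rates)
    return sum(
        abs(kit_a[locus] - kit_b[locus]) * mean_rate / rates[locus]
        for locus in shared
    )


# Hypothetical per-generation mutation rates for three made-up loci.
rates = {"slow_locus": 0.001, "medium_locus": 0.002, "fast_locus": 0.003}
kit_a = {"slow_locus": 12, "medium_locus": 14, "fast_locus": 20}
kit_b = {"slow_locus": 13, "medium_locus": 14, "fast_locus": 19}

# Simple GD would be 2; here the slow-locus step weighs 2 and the fast one 2/3.
print(round(weighted_genetic_distance(kit_a, kit_b, rates), 2))

clan = [{"x": 12, "y": 14}, {"x": 12, "y": 15}, {"x": 13, "y": 15}]
print(modal_haplotype(clan))
```

Under this weighting, two kits with the same simple GD can end up at quite different weighted distances depending on which loci differ.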

I found that the handful of people in my SNP clade in the U106 project are not very close to me at the STR level: their wGDs are 18-30. (Since the maximum wGD within Z331 is about 45 and our SNP is 3000 years old, each unit of wGD should be roughly 75 years.) It is not too surprising that among only a handful of kits, none have a common ancestor with me later than about 2000 years ago.

I did find one kit in the U106 project that has not had SNP testing beyond U106 who is close to us (wGD = 9.8 - probably < 1000 years). He knows his origin is in Scotland.

I have also found two others, somewhat more distant, who are not in the U106 project. I found them by searching on Y-search (which is based on simple genetic distances) and made sure they were still close when wGD was tested (some that look close by simple GD don't when tested with wGD). Both are at a wGD of 14. I was able to find them in family projects and contact them. One is descended from a German who immigrated to America in 1700. I estimate the time of our common ancestor to be ~1500 years ago (the time of the Saxon migration to Britain). Since our common ancestor was likely in Germany 1500 years ago, that means that my ancestor had to leave Germany for Britain at that time or later. The second kit at a wGD of 14 belongs to someone who doesn't know their origin, but their surname sounds English.

I have also tested kits in the British Isles by County project. There are about 2000 kits there whose owners know their origin, and about 300 of those are probably in U106. However, this didn't yield any very close kits. This is not surprising, since I only found one kit among the 2100 in the U106 project who is very close to us. That's 0.05%. Z331 Z326- is about 0.5% of U106 kits. It is a generous estimate to think that people with a common ancestor with me less than 1000 years ago would be 1/10 of Z331 Z326-. The real fraction is probably lower than that, and I got lucky to even find one kit in the U106 project who is close to us.

What all this means is that, for my approach to work, we must have access to 67 STR matches at greater GDs than FT will give us now. Trying to find them on Y-search and fishing in family and regional projects is just not practical. It might work for people in very well populated SNP clades, but it isn't going to work for the rest of us. We need to find enough close kits whose owners know their origins that we can see a geographic pattern emerge. If I could search 30,000 U106 kits, I might find 15 people sufficiently close to be informative, and maybe half of them might know their origins. That's about the minimum for this to work.
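The arithmetic behind those numbers can be laid out as a short sketch. The fractions are the ones given above (0.5% of U106 kits in Z331 Z326-, a generous 1/10 of those sharing an ancestor within 1000 years, half of the owners knowing their origin); the function and variable names are just for illustration.

```python
def expected_informative_kits(n_kits, close_fraction, known_origin_fraction):
    """Expected number of close kits, and of close kits with a known origin."""
    close = n_kits * close_fraction
    return close, close * known_origin_fraction


clade_share = 0.005       # Z331 Z326- as a share of U106 kits
close_within_clade = 0.1  # generous share of the clade within ~1000 years
close_fraction = clade_share * close_within_clade  # 0.0005, i.e. 0.05%

close, informative = expected_informative_kits(30_000, close_fraction, 0.5)
print(f"{close:.1f} close kits, {informative:.1f} with known origin")
# prints: 15.0 close kits, 7.5 with known origin

# The same rate applied to the 2100 kits in the U106 project gives about one.
in_project, _ = expected_informative_kits(2_100, close_fraction, 0.5)
print(f"{in_project:.2f} expected in the U106 project")
# prints: 1.05 expected in the U106 project
```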

We can't rely on one or a few close matches - it's too easy for them to be misleading. I don't know how many U106 kits FT has tested at 67 STRs, but I'm guessing that it's enough that this approach might work if they (and the kit owners) would just let us see the matches. I think that if it were explained to the kit owners how other people could benefit from what they know, many would be willing to do it.