Introduction

Analysis of human whole genome sequences has shown that genetic variations tend to cluster on a single chromosome and form distinct haplotypes. The strong correlation of SNPs in a haplotype has been used by the 1000 genomes project to impute the association of SNPs for each of the 5,008 samples of haploid autosomal chromosomes in its phase 3 data. This notebook provides an overview of some of the major hierarchical structures identified with a set of notebooks that explore the associations of SNPs in the lactase persistence region of chromosome 2.

The unit used for this analysis is called a series. Series are labeled by two numbers. The first is the number of SNPs in the series. The second is the number of 1000 genomes samples that express the series. For example, the label 11_765 denotes a series of 11 SNPs expressed by 765 samples. Series of at least four SNPs were identified from the 5,008 samples in the 1000 genomes data with a simple algorithm. The algorithm looked for sequences of SNPs where 90% of the samples expressing any of the SNPs in a series expressed 90% of the SNPs in it. Over 940,000 of these series were identified in the 1000 genomes data for the human autosomal chromosomes.
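
The 90% rule can be read in more than one way. The following is a minimal sketch of one reading, not the algorithm actually used to build the series. It assumes a hypothetical input `carriers`, a dict that maps each SNP id to the set of sample indexes carrying that SNP, and it treats a candidate SNP set as a series when at least 90% of the samples carrying any of its SNPs carry at least 90% of them. The function name and data layout are illustrative only.

```python
from collections import Counter


def is_series(carriers, snp_ids, sample_frac=0.9, snp_frac=0.9):
    """Check one reading of the 90%/90% rule for a candidate SNP set.

    carriers: dict mapping SNP id -> set of sample indexes (assumed layout).
    snp_ids: candidate SNPs; the text requires at least four.
    """
    if len(snp_ids) < 4:
        return False
    # Count, for every sample, how many of the candidate SNPs it carries.
    hits = Counter()
    for snp in snp_ids:
        hits.update(carriers[snp])
    if not hits:
        return False
    # A sample "expresses" the candidate if it carries enough of the SNPs.
    needed = snp_frac * len(snp_ids)
    full = sum(1 for n in hits.values() if n >= needed)
    return full / len(hits) >= sample_frac


# Toy data: 20 samples carry three SNPs, 19 of them also carry the fourth.
toy = {
    "snp_a": set(range(20)),
    "snp_b": set(range(20)),
    "snp_c": set(range(20)),
    "snp_d": set(range(19)),
}
print(is_series(toy, ["snp_a", "snp_b", "snp_c", "snp_d"]))
```

In the toy data, 19 of the 20 samples carry all four SNPs, so the candidate passes; removing `snp_d` from half of the samples would make it fail.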

A specific region of chromosome 2 was chosen for an initial detailed analysis of the patterns formed by these SNP series. One reason for the interest in the region is its association with the phenotype of lactase persistence. An 870,000 base sequence was identified that contained an 11 SNP series including rs4988235, the SNP generally associated with lactase persistence. That series is associated with seven other series that are commonly expressed by samples that express rs4988235.

The pattern of genetic variations associated with the lactase persistence phenotype is only one example of the complex structures of series and their associations observed in this region of chromosome 2. This notebook attempts to introduce the major observed patterns through a discussion of some more examples.

In [1]:
import numpy as np
from IPython.display import HTML
from bokeh.plotting import output_notebook, show
import genomes_dnj.lct_interval.series_plots as dm
import genomes_dnj.lct_interval.chrom_plots as cp
output_notebook(hide_banner=True)

Lactase Persistence Region

Most samples appear to express at least one SNP series at most locations along a chromosome's length. But there is a large variation in the number of series expressed at any particular chromosome location, in the number of SNPs in a series, and in the length in DNA bases over which a series extends. The plot below illustrates the variation in the number of SNPs in series over the length of chromosome 2. The label LCT identifies the location of the lactase persistence series.

In [2]:
plt = cp.chrom2_stats('active_series_snp_count')
show(plt)
Out[2]:

<Bokeh Notebook handle for In[2]>

The plot below is the same as the one above except that it covers only a small region of chromosome 2. The lactase persistence region used for the analysis summarized in this notebook extends on both sides of the LCT label to the locations where the active series SNP count goes to zero. Those locations are 135,757,320 and 136,786,630, one on each side of the SNP series associated with lactase persistence. The region of chromosome 2 between them was used for the analysis introduced in this notebook.

Note the difference between the lactase persistence region and the others identified in this plot, both in the length of the region and in the number of SNPs in active series. Also note that the exceptional number of series SNPs is most pronounced in the lower part of the region.
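
One way to locate such boundaries is to walk outward from a seed position until the active series SNP count reaches zero. The sketch below assumes the count is available as a NumPy array sampled at sorted chromosome positions. The arrays `positions` and `counts` and the function name are illustrative and are not part of `genomes_dnj`.

```python
import numpy as np


def zero_bounded_interval(positions, counts, seed_pos):
    """Return (start, end): the zero-count positions flanking seed_pos.

    positions: sorted 1-D array of chromosome coordinates (assumed layout).
    counts: active series SNP count at each position.
    Falls back to the first or last position if no zero is found on a side.
    """
    positions = np.asarray(positions)
    counts = np.asarray(counts)
    # Index of the first sampled position at or after the seed.
    seed = min(int(np.searchsorted(positions, seed_pos)), len(positions) - 1)
    zeros = np.flatnonzero(counts == 0)
    left = zeros[zeros <= seed]
    right = zeros[zeros >= seed]
    start = positions[left[-1]] if left.size else positions[0]
    end = positions[right[0]] if right.size else positions[-1]
    return int(start), int(end)


# Toy profile: the count is zero at 200 and 500, and the seed sits between them.
print(zero_bounded_interval([100, 200, 300, 400, 500, 600],
                            [3, 0, 5, 7, 0, 2], seed_pos=350))
```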

In [3]:
plt = cp.chrom2_intvl_stats('active_series_snp_count')
show(plt)
Out[3]:

<Bokeh Notebook handle for In[3]>

Series Hierarchies

Two patterns are seen repeatedly in the structure of SNP series and their associations.

One is a hierarchy of series. This pattern appears to result from a process where an overexpressed series of SNPs emerges in a series hierarchy as a new child series and increases the overexpression of that hierarchy.

The other is a pattern of genetic recombination. Associations between series show a pattern of remarkable stability. But recombination events do occur, and series keep a stable identity across the varied associations that result from them.

Within the lactase persistence region, it appears possible to reconstruct the genetic history of the human expansion out of Africa by using the 1000 genomes data to reconstruct a record of these two kinds of processes. The hierarchies that have resulted from this history can be viewed as three trees. Those trees have been labeled the EUR tree, the SAS tree, and the EAS tree. All of the non-African populations are intermixed to a significant degree in all three trees. But each tree is biased towards its labeled population. There is also an overexpressed association of series in East Asian populations that appears to have formed from recombination events between series from the lower region of the EUR tree and series in the upper part of the studied region that are not part of any of the three trees.
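
The parent and child relationship between series can be sketched with sets of sample indexes. The example assumes, as an interpretation of the `min_match` argument used in the cells of this notebook, that a series counts as a child when at least `min_match` of its samples also express the parent. The function names and the set representation are hypothetical.

```python
def match_fraction(series_samples, selected_samples):
    """Fraction of a series' samples that fall inside a selected sample set."""
    if not series_samples:
        return 0.0
    return len(series_samples & selected_samples) / len(series_samples)


def is_child(child_samples, parent_samples, min_match=0.9):
    """Treat a series as a child when enough of its samples express the parent."""
    return match_fraction(child_samples, parent_samples) >= min_match


# Toy data: 10 of the candidate's 11 samples also express the parent.
parent = set(range(100))
candidate = set(range(90, 100)) | {200}
print(is_child(candidate, parent, min_match=0.9))
```

The tables in this notebook appear to fit this reading. For example, the table for the children of 193_843 lists 667 matches for the 815 samples of 4_815, a fraction of about 0.82, which passes a threshold of 0.8 but would not pass 0.9.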

EUR Tree

The plot below shows the association of SNP series that form the root of the EUR tree. A division of the root sequences into two regions is visible. The lower region extends to the start of series 26_1414. That 640,000 base lower region includes only 10 SNPs. In contrast, the 390,000 base upper region includes 118 SNPs.
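
The contrast between the two regions is easier to see as SNP density. The short calculation below uses only the figures quoted in this notebook; the helper name is illustrative.

```python
# Region boundaries and SNP counts quoted in the text.
region_start, region_end = 135_757_320, 136_786_630
lower_bases, lower_snps = 640_000, 10
upper_bases, upper_snps = 390_000, 118


def snps_per_100kb(snps, bases):
    """SNP density expressed per 100,000 bases."""
    return snps / bases * 100_000


# The two sub-regions together span roughly the whole studied interval.
print(region_end - region_start, lower_bases + upper_bases)
print(snps_per_100kb(lower_snps, lower_bases))
print(snps_per_100kb(upper_snps, upper_bases))
```

By these figures the upper region carries close to twenty times as many root SNPs per base as the lower region.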

Both the SAS and EAS tree roots show a similar division into lower and upper regions. The lower regions of both the EUR and EAS roots include only a small number of SNPs. But the lower region of the SAS tree root is very different. It is composed of four overlapping series that include a total of 495 SNPs. One series of 193 SNPs is specific to the SAS tree root. The other three series are also expressed in varying combinations by a large number of samples mostly from African populations.

The upper region of the EUR tree includes the series 26_1414, 10_2206, 64_1575, and 7_1868. All 107 SNPs appear to have some history as part of a single common series. Recombination events generated fragments of that series. Then two fragments evolved into 10_2206 and 7_1868 through mutations that reverted some SNPs to the reference nucleotides. The fragments recombined to form associations with other series that became overexpressed through some kind of selection process. The result was a significant population of the fragments independent of the original series. The SAS and EAS root upper regions each include a distinct collection of SNPs that appear to fit a pattern similar to the EUR tree root.

The other distinction between the lower and upper regions of all three trees is a difference in the stability of series. Some record of recombination events can be observed in all of the studied parts of chromosome 2. But the series in the lower region of the trees, which covers the genes rab3gap1, zranb3, and r3hdm1, appear to be significantly more stable than those in the upper part, which covers the genes ubxn4, lct, mcm6, and dars. The pattern commonly observed in the history of the human expansion from Africa starts with selection of a new series of SNPs rooted in one of the trees followed by some history of recombination events. Those events leave a stable series in the lower region of a tree. But the recombination events yield new associations of those stable series with a variety of series found in the 1000 genomes data for the upper part of the region.

In [4]:
plt_obj = dm.superset_yes_no([dm.di_26_1414, dm.di_6_1503], min_match=0.8)
plt = plt_obj.do_plot()
show(plt)
Out[4]:

<Bokeh Notebook handle for In[4]>

In [5]:
HTML(plt_obj.get_html())
Out[5]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
353921  136,501,840  53,819     10    2206 0.57  1251 0.99  12  0.05  37  0.47  271 1.56  113 0.45  598 2.38  104 1.14  116 0.76
354170  136,682,274  93,624     7     1868 0.59  1094 0.86  12  0.05  36  0.52  220 1.45  81  0.37  564 2.57  88  1.11  93  0.69
353462  135,915,358  79,721     4     1699 0.75  1268 1.00  13  0.05  37  0.47  275 1.57  115 0.45  602 2.36  107 1.16  119 0.77
353901  136,494,186  271,765    64    1575 0.70  1106 0.87  12  0.05  36  0.52  224 1.46  87  0.39  566 2.55  86  1.07  95  0.70
353283  135,771,974  368,330    6     1503 0.84  1270 1.00  13  0.05  37  0.46  276 1.57  115 0.45  602 2.36  107 1.16  120 0.77
353797  136,398,174  75,924     26    1414 0.90  1270 1.00  13  0.05  37  0.46  276 1.57  115 0.45  602 2.36  107 1.16  120 0.77

The plot below shows the five children of the EUR tree. The series 11_765 is a child of the series 4_911. The others are direct children of the tree root. There are 202 samples that express the EUR tree root but not any of its children. Only 13 samples from native African populations express the EUR tree root. The EUR root series may have been a small minority among native Africans that went through some kind of bottleneck as humans expanded outside Africa. But, the 2 native Africans that express 11_765 are very likely the result of some kind of back migration.
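
A count such as the samples that express the root but none of its children is a set difference. The sketch below assumes each series is represented by a set of sample indexes; the function name and toy data are hypothetical and do not reproduce the 1000 genomes counts.

```python
def root_only(root_samples, child_sample_sets):
    """Samples that express the root but none of the listed child series."""
    in_any_child = set().union(*child_sample_sets)
    return root_samples - in_any_child


# Toy data: ten root samples, two children covering samples 0 to 3,
# and a third child whose only sample lies outside the root.
root = set(range(10))
children = [{0, 1, 2}, {2, 3}, {20}]
print(sorted(root_only(root, children)))
```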

In [6]:
plt_obj = dm.subset_yes_no([dm.di_6_1503, dm.di_26_1414], min_match=0.9)
plt = plt_obj.do_plot()
show(plt)
Out[6]:

<Bokeh Notebook handle for In[6]>

In [7]:
HTML(plt_obj.get_html())
Out[7]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
353604  136,092,061  315,418    4     911  1.00  910  0.72  2   0.01  31  0.54  205 1.63  0   0.00  524 2.87  71  1.07  77  0.69
353380  135,837,906  870,076    11    765  1.00  765  0.60  2   0.01  32  0.67  146 1.38  0   0.00  484 3.15  56  1.01  45  0.48
353444  135,889,881  493,055    6     57   0.93  53   0.04  2   0.19  1   0.30  14  1.91  0   0.00  28  2.63  5   1.30  3   0.46
353407  135,859,157  121,781    6     35   0.97  34   0.03  1   0.15  1   0.47  12  2.55  0   0.00  17  2.49  2   0.81  1   0.24
353315  135,786,061  800,623    7     22   1.00  22   0.02  0   0.00  0   0.00  0   0.00  19  4.29  0   0.00  3   1.88  0   0.00

The plot below shows the full set of series expressed by the samples that express 11_765, the series associated with lactase persistence. The series 11_765 is both exceptionally long and exceptionally stable. All 484 European samples that express 11_765 express all of the associated series with no trace of any recombination events.

In [8]:
plt_obj = dm.superset_yes_no([dm.di_11_765], min_match=0.9)
plt = plt_obj.do_plot()
show(plt)
Out[8]:

<Bokeh Notebook handle for In[8]>

In [9]:
HTML(plt_obj.get_html())
Out[9]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
353921  136,501,840  53,819     10    2206 0.35  765  1.00  2   0.01  32  0.67  146 1.38  0   0.00  484 3.15  56  1.01  45  0.48
354170  136,682,274  93,624     7     1868 0.41  764  1.00  2   0.01  32  0.67  146 1.38  0   0.00  484 3.15  56  1.01  44  0.47
353462  135,915,358  79,721     4     1699 0.45  765  1.00  2   0.01  32  0.67  146 1.38  0   0.00  484 3.15  56  1.01  45  0.48
353901  136,494,186  271,765    64    1575 0.49  765  1.00  2   0.01  32  0.67  146 1.38  0   0.00  484 3.15  56  1.01  45  0.48
353283  135,771,974  368,330    6     1503 0.51  765  1.00  2   0.01  32  0.67  146 1.38  0   0.00  484 3.15  56  1.01  45  0.48
353797  136,398,174  75,924     26    1414 0.54  765  1.00  2   0.01  32  0.67  146 1.38  0   0.00  484 3.15  56  1.01  45  0.48
353604  136,092,061  315,418    4     911  0.84  762  1.00  2   0.01  31  0.65  146 1.38  0   0.00  484 3.16  54  0.97  45  0.48
353380  135,837,906  870,076    11    765  1.00  765  1.00  2   0.01  32  0.67  146 1.38  0   0.00  484 3.15  56  1.01  45  0.48

SAS Tree

The plot below shows the root of the SAS tree. The lower region of the root is defined by the series 117_1685, 123_1561, 62_1265, and 193_843. The series 193_843 is specific to the SAS tree. The other series are also expressed in varying combinations by a large number of samples from mostly African populations. The samples that express hierarchies rooted in this tree come from all populations. But African and European populations are substantially underrepresented while South Asian populations are the most overrepresented.
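
The population columns in the tables pair a sample count with a second number. That second number appears consistent with the population's share of the matching samples divided by its share of all 5,008 haploid samples. This is an inference from the printed values, not a documented definition. The sketch below assumes the 1000 genomes phase 3 panel size of 503 European individuals, which gives 1,006 haploid samples; the function name is illustrative.

```python
TOTAL_HAPLOID_SAMPLES = 5008  # phase 3 haploid autosomal samples, from the text


def representation_ratio(pop_matches, all_matches, pop_size,
                         total=TOTAL_HAPLOID_SAMPLES):
    """Observed share of matching samples over the share expected by size.

    Values above 1 suggest overrepresentation, values below 1 the opposite.
    """
    observed = pop_matches / all_matches
    expected = pop_size / total
    return observed / expected


# SAS tree root: 45 of the 518 matching samples are European (table values).
# The population size of 1006 haploid samples is an assumption, see above.
eur_ratio = representation_ratio(45, 518, pop_size=1006)
print(round(eur_ratio, 2))
```

Under these assumptions the result rounds to the 0.43 printed for Europeans in the SAS tree root table, which supports the reading but does not confirm it.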

In [10]:
plt_obj = dm.superset_yes_no([dm.di_193_843, dm.di_39_1014], min_match=0.9)
plt = plt_obj.do_plot()
show(plt)
Out[10]:

<Bokeh Notebook handle for In[10]>

In [11]:
HTML(plt_obj.get_html())
Out[11]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
353244  135,758,231  618,284    117   1685 0.31  517  1.00  31  0.30  15  0.46  121 1.69  99  0.95  45  0.43  71  1.89  135 2.13
353478  135,933,921  434,642    123   1561 0.33  518  1.00  31  0.30  15  0.46  121 1.69  99  0.95  45  0.43  72  1.91  135 2.13
353814  136,406,646  31,432     5     1460 0.32  470  0.91  31  0.33  15  0.51  121 1.86  91  0.96  45  0.48  53  1.55  114 1.98
353269  135,766,890  509,095    62    1265 0.41  518  1.00  31  0.30  15  0.46  121 1.69  99  0.95  45  0.43  72  1.91  135 2.13
353790  136,393,157  48,253     10    1218 0.39  474  0.92  31  0.32  15  0.50  121 1.84  91  0.95  45  0.47  54  1.57  117 2.01
353906  136,496,493  57,432     9     1170 0.40  468  0.90  31  0.33  15  0.51  118 1.82  92  0.98  44  0.47  55  1.62  113 1.97
353849  136,447,707  42,559     4     1149 0.41  470  0.91  31  0.33  15  0.51  118 1.81  92  0.97  44  0.47  53  1.55  117 2.03
353907  136,496,805  55,824     9     1023 0.46  467  0.90  31  0.33  14  0.48  118 1.82  92  0.98  44  0.47  55  1.62  113 1.97
353984  136,556,805  190,280    39    1014 0.51  518  1.00  31  0.30  15  0.46  121 1.69  99  0.95  45  0.43  72  1.91  135 2.13
353504  135,964,764  136,368    6     946  0.54  511  0.99  31  0.30  15  0.47  119 1.68  98  0.95  45  0.44  69  1.86  134 2.14
353807  136,403,994  81,102     13    911  0.52  471  0.91  31  0.33  15  0.51  118 1.81  92  0.97  44  0.47  54  1.58  117 2.03
353248  135,759,095  628,204    193   843  0.61  518  1.00  31  0.30  15  0.46  121 1.69  99  0.95  45  0.43  72  1.91  135 2.13

The plot below shows children of the SAS tree. In some cases a significant number of samples that express a child series did not express the full 193_843 series. The causes were two recombination events. One involved some samples expressing the hierarchy that includes 8_267 and 6_713. That event generated a recombinant tree root that was the parent for the new child series 9_39 shown in the next chart. The other event associated 4_815 with the series 5_684 in the EAS tree. The results of that event are illustrated in charts in the lower part of the notebook.

In [12]:
plt_obj = dm.subset_yes_no([dm.di_193_843], min_match=0.8)
plt = plt_obj.do_plot()
show(plt)
Out[12]:

<Bokeh Notebook handle for In[12]>

In [13]:
HTML(plt_obj.get_html())
Out[13]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
353504  135,964,764  136,368    6     946  0.88  832  0.99  78  0.47  26  0.50  141 1.22  165 0.99  82  0.49  114 1.89  226 2.22
353248  135,759,095  628,204    193   843  1.00  843  1.00  79  0.47  26  0.49  143 1.22  167 0.98  83  0.49  117 1.91  228 2.21
353384  135,839,805  84,662     4     815  0.82  667  0.79  47  0.35  18  0.43  135 1.46  147 1.09  71  0.53  79  1.63  170 2.08
353521  135,989,333  275,298    6     713  0.94  668  0.79  47  0.35  18  0.43  135 1.46  146 1.09  73  0.54  79  1.63  170 2.08
353511  135,977,595  306,204    4     416  1.00  415  0.49  33  0.40  9   0.35  77  1.34  96  1.15  24  0.29  53  1.76  123 2.42
353358  135,818,487  471,078    8     267  0.84  223  0.26  1   0.02  3   0.21  56  1.81  47  1.05  48  1.07  24  1.48  44  1.61
353320  135,789,764  571,688    4     91   1.00  91   0.11  0   0.00  0   0.00  64  5.08  25  1.36  2   0.11  0   0.00  0   0.00
353285  135,772,482  525,048    7     90   1.00  90   0.11  24  1.32  3   0.53  1   0.08  19  1.05  0   0.00  17  2.60  26  2.36
353396  135,846,793  454,499    7     79   1.00  79   0.09  0   0.00  0   0.00  66  6.03  11  0.69  2   0.13  0   0.00  0   0.00
353292  135,775,651  610,782    29    68   1.00  68   0.08  1   0.07  1   0.23  5   0.53  0   0.00  10  0.73  20  4.05  31  3.72
353809  136,404,524  150,033    15    59   0.92  54   0.06  0   0.00  0   0.00  2   0.27  1   0.09  9   0.83  16  4.08  26  3.93
353355  135,814,490  711,812    14    57   1.00  57   0.07  0   0.00  0   0.00  0   0.00  16  1.39  0   0.00  18  4.34  23  3.29
354133  136,654,760  116,007    20    56   0.80  45   0.05  1   0.11  2   0.71  0   0.00  9   0.99  1   0.11  9   2.75  23  4.17
353811  136,405,656  87,993     4     49   1.00  49   0.06  0   0.00  0   0.00  0   0.00  0   0.00  0   0.00  17  4.77  32  5.33
353626  136,133,578  283,624    4     45   1.00  45   0.05  0   0.00  0   0.00  40  6.41  0   0.00  0   0.00  1   0.31  4   0.73
353437  135,884,166  499,757    6     43   1.00  43   0.05  0   0.00  0   0.00  1   0.17  0   0.00  12  1.39  9   2.88  21  3.98
353262  135,763,383  650,325    14    39   1.00  39   0.05  30  3.82  6   2.45  3   0.56  0   0.00  0   0.00  0   0.00  0   0.00
353450  135,894,943  479,066    4     38   1.00  38   0.05  0   0.00  0   0.00  38  7.22  0   0.00  0   0.00  0   0.00  0   0.00
353413  135,865,015  494,175    9     29   1.00  29   0.03  2   0.34  0   0.00  7   1.74  0   0.00  7   1.20  4   1.90  9   2.53
353998  136,568,458  22,850     5     28   0.82  23   0.03  1   0.22  1   0.69  0   0.00  1   0.22  1   0.22  8   4.79  11  3.90
353719  136,288,733  138,847    4     27   1.00  27   0.03  0   0.00  0   0.00  24  6.41  1   0.18  2   0.37  0   0.00  0   0.00
353459  135,909,151  834,668    15    27   1.00  27   0.03  0   0.00  0   0.00  0   0.00  0   0.00  0   0.00  13  6.62  14  4.23
353280  135,770,895  701,964    19    27   1.00  27   0.03  23  4.23  3   1.77  1   0.27  0   0.00  0   0.00  0   0.00  0   0.00
353468  135,921,205  438,735    5     26   1.00  26   0.03  0   0.00  2   1.23  3   0.83  0   0.00  15  2.87  2   1.06  4   1.25
354047  136,595,044  44,512     15    25   0.80  20   0.02  1   0.25  2   1.59  0   0.00  1   0.25  1   0.25  6   4.13  9   3.67
353386  135,840,292  457,645    10    25   0.96  24   0.03  0   0.00  0   0.00  2   0.60  0   0.00  6   1.24  8   4.59  8   2.72
353568  136,037,384  313,943    6     19   1.00  19   0.02  0   0.00  0   0.00  3   1.14  0   0.00  4   1.05  2   1.45  10  4.29
353266  135,765,502  729,701    6     19   1.00  19   0.02  15  3.92  4   3.36  0   0.00  0   0.00  0   0.00  0   0.00  0   0.00
353548  136,013,433  296,033    8     17   0.88  15   0.02  0   0.00  0   0.00  0   0.00  0   0.00  0   0.00  7   6.42  8   4.35
353456  135,899,723  462,140    5     16   1.00  16   0.02  0   0.00  0   0.00  4   1.80  0   0.00  7   2.18  0   0.00  5   2.55

The plot below shows the series 9_39 hierarchy. The root of this hierarchy appears to have been generated through a recombination event between the lower part of the SAS tree and the higher part of the EAS tree. Enough SNPs were lost from both the 193_843 series and the 123_1561 series for their remaining fragments no longer to be counted as instances of the series. Some SNPs were also lost from 117_1685. But the remaining SNPs still met the criteria for instances of that series. The samples that express this hierarchy are a large portion of those that express 62_1265 without 123_1561.
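
The way lost SNPs can push a fragment below the level needed to count as an instance of a series can be illustrated with a simple threshold test. The sketch assumes that a chromosome counts as an instance when it carries at least 90% of the series' SNPs. That per-sample threshold, the function name, the placeholder SNP ids, and the numbers of SNPs kept are all assumptions made for illustration.

```python
def expresses(sample_snps, series_snps, snp_frac=0.9):
    """True if the sample carries at least snp_frac of the series' SNPs."""
    carried = len(sample_snps & series_snps)
    return carried >= snp_frac * len(series_snps)


# Toy series sized like 117_1685 and 123_1561 (SNP ids are placeholders).
series_117 = {f"a{i}" for i in range(117)}
series_123 = {f"b{i}" for i in range(123)}

# A hypothetical recombinant that kept 110 of 117 SNPs but only 60 of 123.
recombinant = {f"a{i}" for i in range(110)} | {f"b{i}" for i in range(60)}

print(expresses(recombinant, series_117))
print(expresses(recombinant, series_123))
```

With these toy numbers the recombinant still counts as an instance of the 117 SNP series, since 110 is above 90% of 117, but not of the 123 SNP series.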

In [14]:
plt_obj = dm.superset_yes_no([dm.di_8_267], [dm.di_193_843], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
Out[14]:

<Bokeh Notebook handle for In[14]>

In [15]:
HTML(plt_obj.get_html())
Out[15]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
354033  136,588,031  5,647      7     1760 0.02  34   0.77  1   0.13  0   0.00  9   2.00  0   0.00  21  2.79  0   0.00  3   0.95
353244  135,758,231  618,284    117   1685 0.02  42   0.95  1   0.11  0   0.00  13  2.34  0   0.00  25  2.69  0   0.00  3   0.77
354130  136,653,925  107,928    24    1504 0.02  31   0.70  1   0.14  0   0.00  9   2.19  0   0.00  18  2.62  0   0.00  3   1.04
353925  136,506,375  32,564     4     1442 0.02  34   0.77  1   0.13  0   0.00  9   2.00  0   0.00  21  2.79  0   0.00  3   0.95
353791  136,393,658  92,684     32    1361 0.03  42   0.95  1   0.11  0   0.00  13  2.34  0   0.00  25  2.69  0   0.00  3   0.77
353958  136,535,876  19,014     7     1303 0.03  34   0.77  1   0.13  0   0.00  9   2.00  0   0.00  21  2.79  0   0.00  3   0.95
354129  136,652,953  108,222    5     1296 0.02  31   0.70  1   0.14  0   0.00  9   2.19  0   0.00  18  2.62  0   0.00  3   1.04
353269  135,766,890  509,095    62    1265 0.03  44   1.00  1   0.10  0   0.00  13  2.23  2   0.23  25  2.56  0   0.00  3   0.74
353919  136,500,475  42,085     13    1227 0.03  34   0.77  1   0.13  0   0.00  9   2.00  0   0.00  21  2.79  0   0.00  3   0.95
354127  136,652,491  80,281     6     1114 0.03  31   0.70  1   0.14  0   0.00  9   2.19  0   0.00  18  2.62  0   0.00  3   1.04
353504  135,964,764  136,368    6     946  0.05  44   1.00  1   0.10  0   0.00  13  2.23  2   0.23  25  2.56  0   0.00  3   0.74
353902  136,494,985  278,653    81    857  0.04  31   0.70  1   0.14  0   0.00  9   2.19  0   0.00  18  2.62  0   0.00  3   1.04
353384  135,839,805  84,662     4     815  0.05  42   0.95  1   0.11  0   0.00  13  2.34  0   0.00  25  2.69  0   0.00  3   0.77
353521  135,989,333  275,298    6     713  0.06  44   1.00  1   0.10  0   0.00  13  2.23  2   0.23  25  2.56  0   0.00  3   0.74
353358  135,818,487  471,078    8     267  0.16  44   1.00  1   0.10  0   0.00  13  2.23  2   0.23  25  2.56  0   0.00  3   0.74
353559  136,028,572  409,893    9     39   1.00  39   0.89  1   0.11  0   0.00  13  2.52  0   0.00  22  2.55  0   0.00  3   0.83

EAS Tree

The plot below shows the association of series for the 542 samples that express 13_1696, 9_944, 32_1361, and 81_857. The appearance of 9_944 in this association appears to have been the initial event in the formation of the EAS tree.

In [16]:
plt_obj = dm.superset_yes_no([dm.di_9_944, dm.di_81_857, dm.di_13_1696], min_match=0.9)
plt = plt_obj.do_plot()
show(plt)
Out[16]:

<Bokeh Notebook handle for In[16]>

In [17]:
HTML(plt_obj.get_html())
Out[17]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
354033  136,588,031  5,647      7     1760 0.31  541  1.00  17  0.16  4   0.12  101 1.35  131 1.20  141 1.30  48  1.22  99  1.49
353240  135,757,320  20,184     13    1696 0.32  542  1.00  17  0.16  4   0.12  101 1.34  131 1.20  142 1.30  48  1.22  99  1.49
354130  136,653,925  107,928    24    1504 0.36  542  1.00  17  0.16  4   0.12  101 1.34  131 1.20  142 1.30  48  1.22  99  1.49
353925  136,506,375  32,564     4     1442 0.38  542  1.00  17  0.16  4   0.12  101 1.34  131 1.20  142 1.30  48  1.22  99  1.49
353791  136,393,658  92,684     32    1361 0.40  542  1.00  17  0.16  4   0.12  101 1.34  131 1.20  142 1.30  48  1.22  99  1.49
353958  136,535,876  19,014     7     1303 0.42  542  1.00  17  0.16  4   0.12  101 1.34  131 1.20  142 1.30  48  1.22  99  1.49
354129  136,652,953  108,222    5     1296 0.42  542  1.00  17  0.16  4   0.12  101 1.34  131 1.20  142 1.30  48  1.22  99  1.49
353919  136,500,475  42,085     13    1227 0.44  536  0.99  17  0.16  4   0.12  99  1.33  131 1.21  142 1.32  47  1.21  96  1.46
354127  136,652,491  80,281     6     1114 0.49  542  1.00  17  0.16  4   0.12  101 1.34  131 1.20  142 1.30  48  1.22  99  1.49
353372  135,829,592  591,316    9     944  0.57  542  1.00  17  0.16  4   0.12  101 1.34  131 1.20  142 1.30  48  1.22  99  1.49
353902  136,494,985  278,653    81    857  0.63  542  1.00  17  0.16  4   0.12  101 1.34  131 1.20  142 1.30  48  1.22  99  1.49

The plot below shows series that have appeared during the history of the EAS tree.

In [18]:
plt_obj = dm.subset_yes_no([dm.di_9_944], min_match=0.8)
plt = plt_obj.do_plot()
show(plt)
Out[18]:

<Bokeh Notebook handle for In[18]>

In [19]:
HTML(plt_obj.get_html())
Out[19]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
353372  135,829,592  591,316    9     944  1.00  944  1.00  19  0.10  5   0.08  126 0.96  367 1.93  165 0.87  85  1.24  177 1.53
353498  135,959,272  416,583    5     684  0.82  562  0.60  0   0.00  1   0.03  62  0.80  342 3.02  30  0.27  41  1.00  86  1.25
353485  135,942,182  432,560    4     67   1.00  67   0.07  0   0.00  0   0.00  0   0.00  66  4.89  0   0.00  1   0.21  0   0.00
353564  136,033,790  291,947    5     60   1.00  60   0.06  0   0.00  2   0.53  4   0.48  1   0.08  17  1.41  8   1.83  28  3.81
353917  136,499,236  176,092    7     59   0.81  48   0.05  0   0.00  0   0.00  17  2.56  0   0.00  24  2.49  2   0.57  5   0.85
353430  135,879,258  516,511    6     54   1.00  54   0.06  0   0.00  0   0.00  8   1.07  0   0.00  37  3.41  3   0.76  6   0.91
353394  135,845,676  641,900    14    43   0.98  42   0.04  0   0.00  0   0.00  14  2.41  0   0.00  22  2.61  2   0.66  4   0.78
353635  136,151,572  121,118    4     37   0.89  33   0.03  0   0.00  0   0.00  3   0.66  0   0.00  5   0.75  9   3.75  16  3.95
353308  135,782,087  144,668    4     34   0.94  32   0.03  0   0.00  0   0.00  2   0.45  0   0.00  5   0.78  10  4.30  15  3.82
353469  135,922,944  346,911    5     31   1.00  31   0.03  0   0.00  0   0.00  6   1.40  0   0.00  20  3.21  1   0.44  4   1.05
354116  136,648,736  114,114    4     26   0.81  21   0.02  0   0.00  0   0.00  4   1.37  0   0.00  10  2.37  3   1.97  4   1.55
354176  136,687,844  21,965     4     25   0.88  22   0.02  0   0.00  0   0.00  0   0.00  22  4.97  0   0.00  0   0.00  0   0.00
353467  135,920,394  608,562    5     17   1.00  17   0.02  0   0.00  0   0.00  5   2.12  0   0.00  11  3.22  1   0.81  0   0.00
353321  135,789,949  762,462    7     17   1.00  17   0.02  0   0.00  0   0.00  2   0.85  0   0.00  14  4.10  1   0.81  0   0.00
353705  136,266,716  297,530    4     16   1.00  16   0.02  0   0.00  0   0.00  4   1.80  0   0.00  12  3.73  0   0.00  0   0.00

The plot below shows the root series for the 238 samples that express the series 13_1696, 5_684, 32_1361, and 81_857. All but one also express 9_944. After the series 11_765 associated with lactase persistence, 5_684 appears to be the most overexpressed series in the lactase persistence region of chromosome 2 that has appeared during the human expansion from Africa. The 684 samples that express it include 486 that do not express any more specific series in the EAS tree hierarchy. It is not expressed by any of the samples from native African populations and is heavily overexpressed among East Asians. It appears to have first emerged as a new child series in a hierarchy that included 13_1696, 9_944, 32_1361, and 81_857.

In [20]:
plt_obj = dm.superset_yes_no([dm.di_13_1696, dm.di_5_684, dm.di_81_857], min_match=0.9)
plt = plt_obj.do_plot()
show(plt)
Out[20]:

<Bokeh Notebook handle for In[20]>

In [21]:
HTML(plt_obj.get_html())
Out[21]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
354033  136,588,031  5,647      7     1760 0.14  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06
353240  135,757,320  20,184     13    1696 0.14  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06
354130  136,653,925  107,928    24    1504 0.16  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06
353925  136,506,375  32,564     4     1442 0.17  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06
353791  136,393,658  92,684     32    1361 0.17  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06
353958  136,535,876  19,014     7     1303 0.18  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06
354129  136,652,953  108,222    5     1296 0.18  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06
353919  136,500,475  42,085     13    1227 0.19  236  0.99  0   0.00  0   0.00  42  1.28  122 2.57  25  0.53  17  0.99  30  1.04
354127  136,652,491  80,281     6     1114 0.21  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06
353372  135,829,592  591,316    9     944  0.25  237  1.00  0   0.00  0   0.00  42  1.28  122 2.56  25  0.53  17  0.99  31  1.07
353902  136,494,985  278,653    81    857  0.28  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06
353498  135,959,272  416,583    5     684  0.35  238  1.00  0   0.00  0   0.00  43  1.30  122 2.55  25  0.52  17  0.98  31  1.06

The plot below shows the root EAS series for the 271 samples that express 13_1696, 5_684, and 64_1575. More than 90% of them also express 9_944 and 32_1361. Recombination events between the lower and upper parts of the analyzed region appear to be relatively common. A variety of cases have been identified where recombination events associated 64_1575, 10_2206, and 7_1868 with series in the lower part of the region that are outside the EUR tree. But this case is the only one where the number of samples expressing that association is large enough to suggest some form of selection for the new association of series that resulted from the recombination events.

In [22]:
plt_obj = dm.superset_yes_no([dm.di_13_1696, dm.di_5_684, dm.di_64_1575], min_match=0.9)
plt = plt_obj.do_plot()
show(plt)
Out[22]:

<Bokeh Notebook handle for In[22]>

In [23]:
HTML(plt_obj.get_html())
Out[23]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
353921  136,501,840  53,819     10    2206 0.12  271  1.00  0   0.00  1   0.06  19  0.51  173 3.17  6   0.11  20  1.02  52  1.57
354170  136,682,274  93,624     7     1868 0.14  262  0.97  0   0.00  1   0.06  19  0.52  171 3.24  6   0.11  18  0.95  47  1.46
353240  135,757,320  20,184     13    1696 0.16  271  1.00  0   0.00  1   0.06  19  0.51  173 3.17  6   0.11  20  1.02  52  1.57
353901  136,494,186  271,765    64    1575 0.17  271  1.00  0   0.00  1   0.06  19  0.51  173 3.17  6   0.11  20  1.02  52  1.57
353791  136,393,658  92,684     32    1361 0.19  259  0.96  0   0.00  1   0.06  19  0.53  163 3.13  5   0.10  19  1.01  52  1.64
353372  135,829,592  591,316    9     944  0.28  261  0.96  0   0.00  1   0.06  19  0.53  165 3.14  5   0.10  19  1.00  52  1.63
353498  135,959,272  416,583    5     684  0.40  271  1.00  0   0.00  1   0.06  19  0.51  173 3.17  6   0.11  20  1.02  52  1.57

The plot below shows the root EAS series for the 61 samples that express the series 4_815, 5_684, and 64_1575. All but one express 32_1361. This association appears to be the result of a recombination between a chromosome expressing the EAS tree and one expressing the SAS tree. The series 13_1696 and the first two SNPs of 9_944 are lost from the typical EAS root series. Fragments containing the remaining 9_944 SNPs and SNP fragments from the major series of the SAS tree root are present.

In [24]:
plt_obj = dm.superset_yes_no([dm.di_5_684, dm.di_4_815, dm.di_64_1575], min_match=0.9)
plt = plt_obj.do_plot()
show(plt)
Out[24]:

<Bokeh Notebook handle for In[24]>

In [25]:
HTML(plt_obj.get_html())
Out[25]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
353921  136,501,840  53,819     10    2206 0.03  61   1.00  0   0.00  0   0.00  0   0.00  60  4.89  0   0.00  1   0.23  0   0.00
354170  136,682,274  93,624     7     1868 0.03  61   1.00  0   0.00  0   0.00  0   0.00  60  4.89  0   0.00  1   0.23  0   0.00
353901  136,494,186  271,765    64    1575 0.04  61   1.00  0   0.00  0   0.00  0   0.00  60  4.89  0   0.00  1   0.23  0   0.00
353791  136,393,658  92,684     32    1361 0.04  60   0.98  0   0.00  0   0.00  0   0.00  59  4.89  0   0.00  1   0.23  0   0.00
353384  135,839,805  84,662     4     815  0.07  61   1.00  0   0.00  0   0.00  0   0.00  60  4.89  0   0.00  1   0.23  0   0.00
353498  135,959,272  416,583    5     684  0.09  61   1.00  0   0.00  0   0.00  0   0.00  60  4.89  0   0.00  1   0.23  0   0.00

The plot below shows the root series for the 18 samples that express the series 4_815, 5_684, and 81_857. All of them express 32_1361. The presence of all four pairs of associations between the lower region and upper region series shows some history of multiple recombination events.

In [26]:
plt_obj = dm.superset_yes_no([dm.di_5_684, dm.di_4_815, dm.di_81_857], min_match=0.9)
plt = plt_obj.do_plot()
show(plt)
Out[26]:

<Bokeh Notebook handle for In[26]>
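
The four pairings of lower and upper region series described above can be tabulated with the sample counts quoted in this section. Seeing all four combinations of two alternatives at two locations is the same signal used by the classic four-gamete test for recombination, which is named here only as an analogy. The dictionary layout and function name are illustrative.

```python
# Sample counts quoted in the text for chromosomes that also express 5_684.
# Each key pairs a lower region series with an upper region series.
pair_counts = {
    ("13_1696", "81_857"): 238,
    ("13_1696", "64_1575"): 271,
    ("4_815", "64_1575"): 61,
    ("4_815", "81_857"): 18,
}


def all_four_pairings(counts):
    """True if every lower/upper combination is observed at least once."""
    lowers = {lower for lower, _ in counts}
    uppers = {upper for _, upper in counts}
    return all(counts.get((lower, upper), 0) > 0
               for lower in lowers for upper in uppers)


print(all_four_pairings(pair_counts))
```

The counts are very uneven, from 271 down to 18, which fits the description of one common association and several rarer recombinant ones.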

In [27]:
HTML(plt_obj.get_html())
Out[27]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
354033  136,588,031  5,647      7     1760 0.01  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
354130  136,653,925  107,928    24    1504 0.01  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
353925  136,506,375  32,564     4     1442 0.01  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
353791  136,393,658  92,684     32    1361 0.01  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
353958  136,535,876  19,014     7     1303 0.01  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
354129  136,652,953  108,222    5     1296 0.01  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
353919  136,500,475  42,085     13    1227 0.01  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
354127  136,652,491  80,281     6     1114 0.02  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
353902  136,494,985  278,653    81    857  0.02  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
353384  135,839,805  84,662     4     815  0.02  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91
353498  135,959,272  416,583    5     684  0.03  18   1.00  0   0.00  0   0.00  0   0.00  13  3.59  0   0.00  3   2.29  2   0.91

The plot below shows descendants of 5_684. All but 5_47 are also descendants of 9_944. Of the 684 samples that express 5_684, 486 express it as their most specific series in the EAS tree.

In [28]:
plt_obj = dm.subset_yes_no([dm.di_5_684], min_match=0.8)
plt = plt_obj.do_plot()
show(plt)
Out[28]:

<Bokeh Notebook handle for In[28]>

In [29]:
HTML(plt_obj.get_html())
Out[29]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
353498  135,959,272  416,583    5     684  1.00  684  1.00  0   0.00  1   0.02  63  0.66  444 3.23  31  0.23  49  0.99  96  1.14
353485  135,942,182  432,560    4     67   0.99  66   0.10  0   0.00  0   0.00  0   0.00  65  4.89  0   0.00  1   0.21  0   0.00
353554  136,021,814  347,522    5     47   0.94  44   0.06  0   0.00  0   0.00  0   0.00  40  4.52  0   0.00  3   0.94  1   0.19
353394  135,845,676  641,900    14    43   0.88  38   0.06  0   0.00  0   0.00  13  2.47  0   0.00  19  2.49  2   0.72  4   0.86
354176  136,687,844  21,965     4     25   0.88  22   0.03  0   0.00  0   0.00  0   0.00  22  4.97  0   0.00  0   0.00  0   0.00

EAS Exception

The plot below shows an association of series that is exceptional both because the 139 samples that express it are all East Asian and because it does not fit into a hierarchical tree structure. Together, the series 95_176 and 51_176 cover the upper half of the gene r3hdm1 and the genes ubxn4, lct, mcm6, and dars. This location makes the pair of series another alternative to the three different associations of series that cover the same region for the roots of the EAS, EUR, and SAS trees. The lower region includes the series 6_1503 and 4_1699 that are part of the root of the EUR tree. Their presence in this association of series is the result of recombination. The samples that express one of the 36 other instances of 95_176 or 51_176 form a variety of associations with other series that appear to be the result of many recombination events. The 139 samples that express 6_1503, 4_1699, 95_176, and 51_176 seem to indicate some kind of selection for the association formed by recombination.

In [30]:
plt_obj = dm.superset_yes_no([dm.di_6_1503, dm.di_95_176, dm.di_51_176], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[30]:

<Bokeh Notebook handle for In[30]>

In [31]:
HTML(plt_obj.get_html())
Out[31]:
index   first        length     snps  alleles    matches    afr       afx       amr       eas       eur       sas       sax
354033  136,588,031  5,647      7     1760 0.08  140  1.00  0   0.00  0   0.00  0   0.00  140 4.97  0   0.00  0   0.00  0   0.00
353462  135,915,358  79,721     4     1699 0.08  139  0.99  0   0.00  0   0.00  0   0.00  139 4.97  0   0.00  0   0.00  0   0.00
354130  136,653,925  107,928    24    1504 0.09  140  1.00  0   0.00  0   0.00  0   0.00  140 4.97  0   0.00  0   0.00  0   0.00
353283  135,771,974  368,330    6     1503 0.09  140  1.00  0   0.00  0   0.00  0   0.00  140 4.97  0   0.00  0   0.00  0   0.00
354129  136,652,953  108,222    5     1296 0.11  140  1.00  0   0.00  0   0.00  0   0.00  140 4.97  0   0.00  0   0.00  0   0.00
353849  136,447,707  42,559     4     1149 0.12  140  1.00  0   0.00  0   0.00  0   0.00  140 4.97  0   0.00  0   0.00  0   0.00
354061  136,603,638  28,487     16    511  0.27  140  1.00  0   0.00  0   0.00  0   0.00  140 4.97  0   0.00  0   0.00  0   0.00
354123  136,651,773  122,854    51    176  0.80  140  1.00  0   0.00  0   0.00  0   0.00  140 4.97  0   0.00  0   0.00  0   0.00
353787  136,392,474  250,856    95    176  0.80  140  1.00  0   0.00  0   0.00  0   0.00  140 4.97  0   0.00  0   0.00  0   0.00

African Base

A large number of samples mostly from African populations express some combination of the series 117_1685, 123_1561, and 62_1265 without 193_843 in the lower 600,000 bases of the lactase persistence region that includes the genes rab3gap1, zranb3, and the lower half of r3hdm1. The varying combination of these series and their large number of SNPs indicate some complex history that cannot be reconstructed from the 1000 genomes data. But the several cases of overexpressed series associations do generally include a complex series like 193_843 that is specific to some hierarchy of associations and that appears to have acted as a selector in some kind of process that resulted in a large overexpression of the root series associated with that hierarchy.

The plot below shows an example of the hierarchy that has been selected by the series 67_329. Among the hierarchies rooted in 62_1265, 123_1561, and 117_1685, the 329 chromosomes that express this one form the next largest group after the 843 that express 193_843 and the rest of the SAS tree root.

This example shows the complete hierarchy of series expressed by the 32 samples that express the series 6_32. It includes the 1,000,000 DNA base series 147_38 that is the longest one identified in the lactase persistence region.

In [32]:
plt_obj = dm.superset_yes_no([dm.di_6_32], min_match=0.7)
plt = plt_obj.do_plot()
show(plt)
Out[32]:

<Bokeh Notebook handle for In[32]>

In [33]:
HTML(plt_obj.get_html())
Out[33]:
index   first           length  snps    alleles  matches      afr      afx      amr      eas      eur      sas      sax
354033  136,588,031      5,647     7  1760 0.02  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353244  135,758,231    618,284   117  1685 0.02  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353478  135,933,921    434,642   123  1561 0.02  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354130  136,653,925    107,928    24  1504 0.02  31 0.97  26 4.17   5 2.57   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353925  136,506,375     32,564     4  1442 0.02  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353958  136,535,876     19,014     7  1303 0.02  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353269  135,766,890    509,095    62  1265 0.03  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353729  136,309,239     52,321     9   887 0.04  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353312  135,784,351    117,869     8   718 0.04  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353764  136,364,916     22,977     5   588 0.05  31 0.97  26 4.17   5 2.57   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353349  135,810,535     87,488     9   545 0.06  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354061  136,603,638     28,487    16   511 0.06  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353331  135,795,222    587,478    67   329 0.10  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354057  136,602,291     27,696     7   226 0.14  31 0.97  26 4.17   5 2.57   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354189  136,704,466     27,748     5   212 0.15  31 0.97  26 4.17   5 2.57   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353788  136,392,700    158,106    55   180 0.18  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354181  136,696,010     68,554    18   172 0.18  31 0.97  26 4.17   5 2.57   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354036  136,589,521     52,491    17   127 0.24  31 0.97  26 4.17   5 2.57   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353992  136,562,716     21,998     6   127 0.25  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353486  135,946,660    156,668     5    96 0.33  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354115  136,647,933     30,445    10    43 0.67  29 0.91  24 4.11   5 2.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354144  136,658,314    107,144    33    38 0.76  29 0.91  24 4.11   5 2.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353242  135,758,116  1,022,378   147    38 0.84  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353556  136,025,256    488,035     6    32 1.00  32 1.00  27 4.19   5 2.49   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
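
Each of the alleles, matches, and population columns in the table above appears to hold a count followed by a ratio. The notebook does not define these ratios, so the following is only a minimal sketch under stated assumptions: that the ratio under alleles is the match count divided by the number of chromosomes expressing the series, that the ratio under matches is the match count divided by the number of chromosomes expressing the query series, and that each population ratio is the population's share of the matches divided by its share of all 5,008 chromosomes. The function names and the figure of 1,008 chromosomes for the afr column are assumptions made for illustration; they are not taken from the `dm` module.

```python
TOTAL_CHROMOSOMES = 5008  # haploid samples in the 1000 genomes phase 3 data


def allele_fraction(matches, alleles):
    """Share of the chromosomes expressing a series that also match the query."""
    return round(matches / alleles, 2)


def match_fraction(matches, query_size):
    """Share of the query's chromosomes that also express the series."""
    return round(matches / query_size, 2)


def enrichment(pop_matches, matches, pop_chromosomes, total=TOTAL_CHROMOSOMES):
    """Population share among the matches relative to its share of all chromosomes."""
    observed = pop_matches / matches
    expected = pop_chromosomes / total
    return round(observed / expected, 2)


# Row for series 7_1760 in the 6_32 table: 1760 alleles, 32 matches, 27 of them afr.
ASSUMED_AFR_CHROMOSOMES = 1008  # assumption, not stated in the notebook

print(allele_fraction(32, 1760))                      # 0.02
print(match_fraction(32, 32))                         # 1.0
print(enrichment(27, 32, ASSUMED_AFR_CHROMOSOMES))    # 4.19
```

Under these assumptions the three printed values agree with the 0.02, 1.00, and 4.19 shown for that row, which supports the reading but does not confirm how the notebook computes them.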

Series 74_210

This plot shows an example of the hierarchy that has been selected by the series 74_210. This hierarchy is responsible for most of the cases where the series 117_1685 is expressed by itself without 123_1561, 62_1265, or 193_843. The 210 samples that express the 74 SNPs in series 74_210 modestly favor European populations over African ones. Otherwise, those samples are relatively evenly distributed across the 1000 genomes regional populations.

The plot below shows series associations for the instances of 74_210 that have been selected by the appearance of series 5_30. It is a series expressed by Europeans and South Asians. The tree hierarchy of 5_30 includes 18_131 as well as 74_210. The emergence of series 5_30 has selected an association of series that includes series 39_1014 and related series that also form the primary upper region of the SAS tree root.

In [34]:
plt_obj = dm.superset_yes_no([dm.di_5_30], min_match=0.7)
plt = plt_obj.do_plot()
show(plt)
Out[34]:

<Bokeh Notebook handle for In[34]>

In [35]:
HTML(plt_obj.get_html())
Out[35]:
index   first           length  snps    alleles  matches      afr      afx      amr      eas      eur      sas      sax
353244  135,758,231    618,284   117  1685 0.02  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353814  136,406,646     31,432     5  1460 0.02  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353925  136,506,375     32,564     4  1442 0.02  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353919  136,500,475     42,085    13  1227 0.02  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353790  136,393,157     48,253    10  1218 0.02  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353906  136,496,493     57,432     9  1170 0.02  28 0.93   0 0.00   0 0.00   1 0.26   0 0.00  16 2.84   4 1.97   7 2.04
353849  136,447,707     42,559     4  1149 0.03  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353907  136,496,805     55,824     9  1023 0.03  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353984  136,556,805    190,280    39  1014 0.03  27 0.90   0 0.00   0 0.00   1 0.27   0 0.00  16 2.95   4 2.04   6 1.81
353807  136,403,994     81,102    13   911 0.03  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353312  135,784,351    117,869     8   718 0.04  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353249  135,759,145    604,162    74   210 0.14  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353325  135,790,329    493,075    18   131 0.23  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18
353987  136,560,126    181,954     4    57 0.47  27 0.90   0 0.00   0 0.00   1 0.27   0 0.00  16 2.95   4 2.04   6 1.81
353519  135,987,259    328,739     5    30 1.00  30 1.00   0 0.00   0 0.00   1 0.24   0 0.00  16 2.66   5 2.29   8 2.18

Series 28_434

The series 28_434 is particularly interesting because it is at the root of four different hierarchies. The series and series SNP fragments associated with those hierarchies reflect a large amount of remodeling of SNPs in the lower 600,000 bases of the studied region. Different hierarchies include varying combinations of 117_1685, 123_1561, 13_1696, and fragments of 117_1685 and 123_1561.

The plot below shows one hierarchy rooted in one subset of the series selected by 28_434. The root of that hierarchy includes 49_136 and 6_68. It also includes 117_1685 but not 123_1561 or 62_1265. The plotted hierarchy that has emerged from that root includes the series 10_17, 6_28, and 14_40. The association of series in the upper part of the region selected by the emergence of 10_17 does not match any of the three tree roots. It does, however, include the series 7_1868, which is also found in the root of the EUR tree.

In [36]:
di_10_17 = 353401
plt_obj = dm.superset_yes_no([di_10_17], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[36]:

<Bokeh Notebook handle for In[36]>

In [37]:
HTML(plt_obj.get_html())
Out[37]:
index   first           length  snps    alleles  matches      afr      afx      amr      eas      eur      sas      sax
354170  136,682,274     93,624     7  1868 0.01  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354033  136,588,031      5,647     7  1760 0.01  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353244  135,758,231    618,284   117  1685 0.01  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353814  136,406,646     31,432     5  1460 0.01  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353729  136,309,239     52,321     9   887 0.02  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353312  135,784,351    117,869     8   718 0.02  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353851  136,448,855     17,976     4   612 0.03  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353764  136,364,916     22,977     5   588 0.03  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353349  135,810,535     87,488     9   545 0.03  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354061  136,603,638     28,487    16   511 0.03  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353614  136,115,507    269,773    28   434 0.04  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353804  136,402,778     90,652    21   332 0.05  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354057  136,602,291     27,696     7   226 0.08  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353931  136,509,361     28,756    11   173 0.10  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353538  136,008,173    373,078    70   166 0.10  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353304  135,780,924    236,215    49   136 0.12  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354036  136,589,521     52,491    17   127 0.13  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353910  136,497,006     91,732    19    77 0.22  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353581  136,056,699    377,203     5    69 0.25  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353247  135,758,713     19,520     6    68 0.25  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353354  135,814,021    610,527    14    40 0.43  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353338  135,798,652    364,908     6    28 0.61  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354042  136,592,964    134,843     4    27 0.63  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353401  135,850,274    858,652    10    17 1.00  17 1.00  13 3.80   4 3.75   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00

The plot below shows another hierarchy rooted in 28_434. The example is the hierarchy for the series 7_49. It is rooted in a recombination between a chromosome that expressed 28_434 and 49_136 and a chromosome that expressed the series 13_1696. The resulting association included the series 28_434, 49_136, 13_1696, and 117_1685. That association of SNPs was selected for overexpression by the process that generated the series 22_73. The series 7_49 is a child of 22_73 rooted in this hierarchy.

In [38]:
plt_obj = dm.superset_yes_no([dm.di_7_49], min_match=0.7)
plt = plt_obj.do_plot()
show(plt)
Out[38]:

<Bokeh Notebook handle for In[38]>

In [39]:
HTML(plt_obj.get_html())
Out[39]:
index   first           length  snps    alleles  matches      afr      afx      amr      eas      eur      sas      sax
353240  135,757,320     20,184    13  1696 0.03  48 0.98  31 3.21  16 5.32   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
353244  135,758,231    618,284   117  1685 0.03  49 1.00  31 3.14  17 5.53   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
354130  136,653,925    107,928    24  1504 0.03  39 0.80  25 3.18  13 5.32   1 0.19   0 0.00   0 0.00   0 0.00   0 0.00
353925  136,506,375     32,564     4  1442 0.03  43 0.88  27 3.12  15 5.56   1 0.17   0 0.00   0 0.00   0 0.00   0 0.00
353958  136,535,876     19,014     7  1303 0.03  43 0.88  27 3.12  15 5.56   1 0.17   0 0.00   0 0.00   0 0.00   0 0.00
354129  136,652,953    108,222     5  1296 0.03  39 0.80  25 3.18  13 5.32   1 0.19   0 0.00   0 0.00   0 0.00   0 0.00
354127  136,652,491     80,281     6  1114 0.04  39 0.80  25 3.18  13 5.32   1 0.19   0 0.00   0 0.00   0 0.00   0 0.00
353504  135,964,764    136,368     6   946 0.04  38 0.78  23 3.01  14 5.88   1 0.19   0 0.00   0 0.00   0 0.00   0 0.00
353729  136,309,239     52,321     9   887 0.06  49 1.00  31 3.14  17 5.53   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
353312  135,784,351    117,869     8   718 0.07  49 1.00  31 3.14  17 5.53   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
353764  136,364,916     22,977     5   588 0.08  49 1.00  31 3.14  17 5.53   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
353349  135,810,535     87,488     9   545 0.09  49 1.00  31 3.14  17 5.53   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
353614  136,115,507    269,773    28   434 0.11  49 1.00  31 3.14  17 5.53   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
353788  136,392,700    158,106    55   180 0.22  40 0.82  24 2.98  15 5.98   1 0.18   0 0.00   0 0.00   0 0.00   0 0.00
353538  136,008,173    373,078    70   166 0.30  49 1.00  31 3.14  17 5.53   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
353304  135,780,924    236,215    49   136 0.36  49 1.00  31 3.14  17 5.53   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
354125  136,651,969    124,016    24   113 0.33  37 0.76  23 3.09  13 5.60   1 0.20   0 0.00   0 0.00   0 0.00   0 0.00
353915  136,499,188    132,236    26    96 0.40  38 0.78  23 3.01  14 5.88   1 0.19   0 0.00   0 0.00   0 0.00   0 0.00
354037  136,589,701     51,567     4    80 0.45  36 0.73  21 2.90  14 6.20   1 0.20   0 0.00   0 0.00   0 0.00   0 0.00
353259  135,761,470    609,610    22    73 0.64  47 0.96  30 3.17  16 5.43   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
353806  136,403,061     84,783     4    67 0.63  42 0.86  26 3.08  15 5.70   1 0.17   0 0.00   0 0.00   0 0.00   0 0.00
353379  135,837,488    509,187     7    49 1.00  49 1.00  31 3.14  17 5.53   1 0.15   0 0.00   0 0.00   0 0.00   0 0.00
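
Every match ratio in the table above is at or above the `min_match=0.7` argument passed to `superset_yes_no`. The notebook does not document that argument, so the following is only a minimal sketch of a threshold filter, assuming that `min_match` is a lower bound on the fraction of the query series' chromosomes that also express a candidate series. `filter_by_min_match` is a hypothetical helper, not part of `dm`, and the `made_up` entry is invented solely to show a series being excluded.

```python
def filter_by_min_match(match_counts, query_size, min_match):
    """Keep series expressed by at least min_match of the query's chromosomes."""
    kept = {}
    for name, matches in match_counts.items():
        fraction = matches / query_size
        if fraction >= min_match:
            kept[name] = round(fraction, 2)
    return kept


# Match counts for three series from the 7_49 table (49 query chromosomes),
# plus one invented entry that falls below the threshold.
match_counts = {
    "117_1685": 49,
    "24_1504": 39,
    "4_80": 36,
    "made_up": 30,  # hypothetical, not in the notebook output
}

selected = filter_by_min_match(match_counts, query_size=49, min_match=0.7)
print(selected)  # {'117_1685': 1.0, '24_1504': 0.8, '4_80': 0.73}
```

The retained fractions agree with the 1.00, 0.80, and 0.73 shown for those series, while the invented entry at 30 of 49 (about 0.61) is dropped.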

The plot below shows another hierarchy rooted in 28_434. It is the one selected for overexpression by the process that generated 180_251, the 180 SNP series expressed by 251 samples. It is the major example of a hierarchy that has selected 123_1561 without 117_1685. The history of the emergence of 180_251 includes a recombination event between a chromosome that expressed 28_434, 117_1685, and 123_1561 and a chromosome that expressed 13_1696. The resulting association of series included 28_434, 123_1561, and 13_1696 but not 117_1685.

This example shows the specific hierarchy for the series 10_17. The hierarchy includes the series 6_88, 23_226, and 180_251.

In [40]:
plt_obj = dm.superset_yes_no([dm.di_10_17], min_match=0.7)
plt = plt_obj.do_plot()
show(plt)
Out[40]:

<Bokeh Notebook handle for In[40]>

In [41]:
HTML(plt_obj.get_html())
Out[41]:
index   first           length  snps    alleles  matches      afr      afx      amr      eas      eur      sas      sax
354033  136,588,031      5,647     7  1760 0.01  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353240  135,757,320     20,184    13  1696 0.01  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353478  135,933,921    434,642   123  1561 0.01  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353790  136,393,157     48,253    10  1218 0.01  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353906  136,496,493     57,432     9  1170 0.01  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353935  136,511,874     21,321     5   976 0.02  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353729  136,309,239     52,321     9   887 0.02  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353851  136,448,855     17,976     4   612 0.03  16 0.94  15 4.66   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353764  136,364,916     22,977     5   588 0.03  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354061  136,603,638     28,487    16   511 0.03  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353614  136,115,507    269,773    28   434 0.04  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353288  135,773,319    720,812   180   251 0.07  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353300  135,778,464    608,500    23   226 0.08  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353909  136,496,875     51,331    18   148 0.11  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353992  136,562,716     21,998     6   127 0.13  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354058  136,602,952     27,735    14   104 0.16  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353415  135,867,083    369,704     6    88 0.19  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353241  135,757,652    888,994    10    17 1.00  17 1.00  16 4.68   1 0.94   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00

Series 209_56

The plot below shows another hierarchy rooted in 117_1685, 123_1561, and 62_1265 but not 193_843. The series 209_56 appears to be the selector series for this hierarchy.

The example uses the series 14_48 to illustrate this hierarchy. This series is interesting because it has selected 26_1414 in a hierarchy that is totally independent of the EUR tree. That independent selection event shows that 26_1414 has an identity beyond the EUR tree.

In [42]:
di_14_48 = 353833
plt_obj = dm.superset_yes_no([di_14_48], min_match=0.7)
plt = plt_obj.do_plot()
show(plt)
Out[42]:

<Bokeh Notebook handle for In[42]>

In [43]:
HTML(plt_obj.get_html())
Out[43]:
index   first           length  snps    alleles  matches      afr      afx      amr      eas      eur      sas      sax
353921  136,501,840     53,819    10  2206 0.02  48 1.00  34 3.52  10 3.32   4 0.60   0 0.00   0 0.00   0 0.00   0 0.00
353244  135,758,231    618,284   117  1685 0.03  48 1.00  34 3.52  10 3.32   4 0.60   0 0.00   0 0.00   0 0.00   0 0.00
353478  135,933,921    434,642   123  1561 0.03  46 0.96  33 3.56   9 3.12   4 0.63   0 0.00   0 0.00   0 0.00   0 0.00
354130  136,653,925    107,928    24  1504 0.02  36 0.75  26 3.59   6 2.66   4 0.80   0 0.00   0 0.00   0 0.00   0 0.00
353797  136,398,174     75,924    26  1414 0.03  48 1.00  34 3.52  10 3.32   4 0.60   0 0.00   0 0.00   0 0.00   0 0.00
353269  135,766,890    509,095    62  1265 0.04  46 0.96  33 3.56   9 3.12   4 0.63   0 0.00   0 0.00   0 0.00   0 0.00
353729  136,309,239     52,321     9   887 0.05  46 0.96  33 3.56   9 3.12   4 0.63   0 0.00   0 0.00   0 0.00   0 0.00
353764  136,364,916     22,977     5   588 0.08  46 0.96  33 3.56   9 3.12   4 0.63   0 0.00   0 0.00   0 0.00   0 0.00
353349  135,810,535     87,488     9   545 0.08  46 0.96  33 3.56   9 3.12   4 0.63   0 0.00   0 0.00   0 0.00   0 0.00
354189  136,704,466     27,748     5   212 0.17  36 0.75  26 3.59   6 2.66   4 0.80   0 0.00   0 0.00   0 0.00   0 0.00
354143  136,658,238     19,681     6   167 0.22  36 0.75  26 3.59   6 2.66   4 0.80   0 0.00   0 0.00   0 0.00   0 0.00
353486  135,946,660    156,668     5    96 0.48  46 0.96  33 3.56   9 3.12   4 0.63   0 0.00   0 0.00   0 0.00   0 0.00
353276  135,769,240    617,518   209    56 0.82  46 0.96  33 3.56   9 3.12   4 0.63   0 0.00   0 0.00   0 0.00   0 0.00
353833  136,428,402    221,239    14    48 1.00  48 1.00  34 3.52  10 3.32   4 0.60   0 0.00   0 0.00   0 0.00   0 0.00
353347  135,809,270    491,372    10    43 0.88  38 0.79  28 3.66   7 2.94   3 0.57   0 0.00   0 0.00   0 0.00   0 0.00
354128  136,652,918    133,712    80    38 0.95  36 0.75  26 3.59   6 2.66   4 0.80   0 0.00   0 0.00   0 0.00   0 0.00