In [1]:
import numpy as np
from IPython.display import HTML
from bokeh.plotting import output_notebook, show
import genomes_dnj.lct_interval.series_plots as dm
import genomes_dnj.lct_interval.anal_series as an
import genomes_dnj.lct_interval_snp_anal.lct_interval_snp_anal as snp_anal
import genomes_dnj.lct_interval.series_masks as sm
from genomes_dnj.lct_interval.bokeh_match_count_plot import snp_match_count_plot_cls as smcp
output_notebook(hide_banner=True)

Methods

Full details of all of the methods are available in the source code. This notebook describes them and provides some examples to illustrate their use.

Identification of Series

SNPs were grouped into series through an algorithmic process. The data for that process was a two dimensional array. There was a row for each SNP expressed by at least 16 1000 genomes chromosome samples. There were 5008 columns, one for each 1000 genomes chromosome sample. Values were true if a sample expressed a SNP and false if it did not.

The process for identifying a series started with a specific SNP and an array of the column indexes for the samples that expressed that SNP. Those indexes were used to generate an array of counts of the number of the test samples that expressed each SNP farther along the chromosome. The counts were compared with the 90% match criterion to identify additional SNPs that were expressed by 90% of the test samples. A second array was generated with counts of all samples that expressed each of the SNPs selected by the initial match process. The second match check required the count of the test samples expressing the SNP to be at least 90% of the count of all samples expressing the SNP.

This process was carried out recursively by making the last identified SNP in a series the start of another attempt to find more SNPs in the series. The process was continued until a 1,000,000 base interval was scanned without finding any additional SNPs in the series. The cases where this process resulted in at least 4 SNPs were identified as a series. Samples that expressed at least 90% of the SNPs in a series were considered to express the series.

Rounding can result in the inclusion of samples that express fewer than 90% of the SNPs. The recursive process can favor identification of series that have some history of fragment generation in the lower part of the series. However, samples that express only the fragments are not counted as expressing the series.
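The two 90% checks described above can be sketched with NumPy. This is a minimal illustration under stated assumptions, not the code used in the study: the function name `matching_snps`, the array name `expr`, and the toy data are hypothetical, and the recursion and the 1,000,000 base scan limit are left out.

```python
import numpy as np

def matching_snps(expr, start_row, min_match=0.9):
    """Rows at or after start_row that pass both match checks (hypothetical helper).

    expr is a boolean array with one row per SNP and one column per sample.
    """
    test_cols = np.flatnonzero(expr[start_row])       # samples that express the start SNP
    test_counts = expr[:, test_cols].sum(axis=1)      # test samples expressing each SNP
    all_counts = expr.sum(axis=1)                     # all samples expressing each SNP
    first = test_counts >= min_match * test_cols.size   # SNP covers 90% of the test samples
    second = test_counts >= min_match * all_counts      # test samples are 90% of the SNP's samples
    later = np.arange(expr.shape[0]) >= start_row       # only look farther along the chromosome
    return np.flatnonzero(first & second & later)

# Toy data: 4 SNPs (rows) by 12 samples (columns).
expr = np.zeros((4, 12), dtype=bool)
expr[0, :10] = True   # start SNP, expressed by 10 samples
expr[1, :11] = True   # all 10 test samples plus 1 other: passes both checks
expr[2, :12] = True   # all 10 test samples, but 12 samples in total: fails the second check
expr[3, :5] = True    # only 5 of the 10 test samples: fails the first check

series_rows = matching_snps(expr, start_row=0)
print(series_rows)
```

In the toy data only the start SNP and the second SNP survive both checks, which is the behavior the two-way criterion is meant to enforce.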

Yes No Comparisons

Plots were constructed by selecting chromosome samples that satisfied some filter criteria. The criteria are based on a yes list of series and a no list of series. To be selected a chromosome sample must express all of the series in the yes list and none of the series in the no list.
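The yes no selection can be sketched as a combination of boolean masks. The function `yes_no_mask`, the series names, and the five-sample masks below are hypothetical and only illustrate the rule; the project's own selection code may be organized differently.

```python
import numpy as np

def yes_no_mask(series_masks, yes, no):
    """Select samples that express every yes series and none of the no series."""
    n_samples = len(next(iter(series_masks.values())))
    selected = np.ones(n_samples, dtype=bool)
    for series_id in yes:
        selected &= series_masks[series_id]     # must express each yes series
    for series_id in no:
        selected &= ~series_masks[series_id]    # must not express any no series
    return selected

# Hypothetical masks for three series over five samples.
series_masks = {
    "A": np.array([1, 1, 1, 0, 0], dtype=bool),
    "B": np.array([1, 1, 0, 1, 0], dtype=bool),
    "C": np.array([1, 0, 0, 0, 1], dtype=bool),
}
selected = yes_no_mask(series_masks, yes=["A", "B"], no=["C"])
print(selected)
```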

Series Representation

Series are identified by the number of SNPs in the series and the number of 1000 genomes chromosome samples that express it. The series 11_765 associated with lactase persistence includes 11 SNPs and is expressed by 765 of the 5008 1000 genomes samples of chromosome 2.

In [2]:
import genomes_dnj.lct_interval.plot_colors as clr
HTML(clr.pop_colors)
Out[2]:
African
American
East Asian
European
South Asian

Series are represented in a plot by a colored rectangle. The hue of the rectangle represents the most overexpressed population for the chromosome samples that express the series. More saturated colors represent more overexpressed populations. The table above shows the fully saturated colors used to represent overexpression of series by different 1000 genomes populations. The plot x axis maps to positions on chromosome 2. The height of the rectangle represents the log of the number of 1000 genomes chromosome 2 samples that express the series. Each SNP in the series is represented with a vertical line. Height and color are properties of the series regardless of how many chromosome samples express the series in a particular plot.

Superset Plots

Most plots in these notebooks are superset plots. The intent of a superset plot is to find all of the series in the studied region expressed by some set of chromosome samples. Some of those series may be expressed specifically by the samples that satisfy the filter test. For other series, those samples are just a subset of the 1000 genomes chromosome samples that express the series. A yes no comparison is used to select the input chromosome samples. The output series are identified by matching the input samples against the samples that express each of the series in the studied region. Series that meet the match criterion are added to the result set. When the intent was to limit the series to those expressed by all of the selected chromosome samples, the match criterion was generally 90%. When the intent was to identify the series most commonly expressed by the selected chromosome samples, the match criterion was generally 50%. A 50% match prevents the selection of overlapping series that are expressed by different subsets of the selected chromosome samples. When the intent was to display a complete hierarchy of series that are all expressed by the chosen set of chromosome samples, the match criterion was generally 0.1%. The data associated with each plot shows the number of input samples that actually express each of the result series.
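The superset match can be sketched as the fraction of input samples that express each series. The helper `superset_series` and the toy masks are hypothetical names chosen for this example; the sketch only shows how the 90%, 50%, and 0.1% criteria change the result set.

```python
import numpy as np

def superset_series(input_mask, series_masks, min_match):
    """Series expressed by at least min_match of the input samples (hypothetical helper)."""
    n_input = int(input_mask.sum())
    result = {}
    for series_id, mask in series_masks.items():
        matches = int((mask & input_mask).sum())   # input samples that express the series
        if n_input > 0 and matches / n_input >= min_match:
            result[series_id] = matches
    return result

idx = np.arange(20)
input_mask = idx < 10                 # 10 selected samples out of 20
series_masks = {
    "wide": np.ones(20, dtype=bool),  # expressed by every sample
    "half": idx < 6,                  # expressed by 6 of the 10 input samples
    "rare": idx < 1,                  # expressed by 1 of the 10 input samples
}

strict = superset_series(input_mask, series_masks, min_match=0.9)
common = superset_series(input_mask, series_masks, min_match=0.5)
everything = superset_series(input_mask, series_masks, min_match=0.001)
print(strict, common, everything)
```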

In [3]:
lct_plt_obj = dm.superset_yes_no([dm.di_11_765], min_match=0.9)
plt = lct_plt_obj.do_plot()
show(plt)
Out[3]:

<Bokeh Notebook handle for In[3]>

This plot is an example of a superset plot. The only criterion for selecting chromosome samples is the expression of the series 11_765. The series in the plot are the ones that are expressed by at least 90% of the 1000 genomes phase 3 chromosome 2 samples that express the series 11_765.

In [4]:
HTML(lct_plt_obj.get_html())
Out[4]:
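index | first | length | snps | alleles | matches | afr | afx | amr | eas | eur | sas | sax
353921 | 136,501,840 | 53,819 | 10 | 2206 0.35 | 765 1.00 | 2 0.01 | 32 0.67 | 146 1.38 | 0 0.00 | 484 3.15 | 56 1.01 | 45 0.48
354170 | 136,682,274 | 93,624 | 7 | 1868 0.41 | 764 1.00 | 2 0.01 | 32 0.67 | 146 1.38 | 0 0.00 | 484 3.15 | 56 1.01 | 44 0.47
353462 | 135,915,358 | 79,721 | 4 | 1699 0.45 | 765 1.00 | 2 0.01 | 32 0.67 | 146 1.38 | 0 0.00 | 484 3.15 | 56 1.01 | 45 0.48
353901 | 136,494,186 | 271,765 | 64 | 1575 0.49 | 765 1.00 | 2 0.01 | 32 0.67 | 146 1.38 | 0 0.00 | 484 3.15 | 56 1.01 | 45 0.48
353283 | 135,771,974 | 368,330 | 6 | 1503 0.51 | 765 1.00 | 2 0.01 | 32 0.67 | 146 1.38 | 0 0.00 | 484 3.15 | 56 1.01 | 45 0.48
353797 | 136,398,174 | 75,924 | 26 | 1414 0.54 | 765 1.00 | 2 0.01 | 32 0.67 | 146 1.38 | 0 0.00 | 484 3.15 | 56 1.01 | 45 0.48
353604 | 136,092,061 | 315,418 | 4 | 911 0.84 | 762 1.00 | 2 0.01 | 31 0.65 | 146 1.38 | 0 0.00 | 484 3.16 | 54 0.97 | 45 0.48
353380 | 135,837,906 | 870,076 | 11 | 765 1.00 | 765 1.00 | 2 0.01 | 32 0.67 | 146 1.38 | 0 0.00 | 484 3.15 | 56 1.01 | 45 0.48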

This table shows the plot data for the 11_765 superset plot. Each matched series has a row in the table. Series are ordered by the number of 1000 genomes chromosome samples that express the series. Matches give the number of those samples selected for the plot that express the particular series.

For example, the top line indicates that the 10 SNP series expressed by 2206 chromosome 2 samples is expressed by all 765 of the chromosome samples that satisfied this plot's yes no criteria. Those samples were 35% of the total 2206 1000 genomes chromosome 2 samples that express the series 10_2206, and 100% of the 765 chromosome samples that satisfied this plot's yes no criteria.

Number of matches and relative population expression are broken out by population. Both the African and South Asian populations are divided between those that came from the native area and those who live externally. The data in this table was the reason for that choice. The 32 instances of 11_765 found in American Southwest and Caribbean African populations are very unlikely to have come from Africa and significantly overrepresent the frequency of lactase persistence in African populations.

For calculations of population overexpression chromosome samples that failed the no test were excluded from the population under consideration. The relative expression calculation is based on a ratio of the samples that matched to the population size minus the samples that failed the no test.
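The relative expression values in the tables are consistent with a ratio of two rates: the match rate within a population divided by the match rate over all samples. The sketch below is an assumption about that calculation, not the notebook's code. The function name and parameters are hypothetical, excluding no-test failures from the overall denominator as well is a guess, and 1006 is the count of European chromosome samples in the phase 3 data (503 individuals), which is not stated in this notebook.

```python
def relative_expression(pop_matches, pop_size, total_matches, total_size,
                        pop_no_failed=0, total_no_failed=0):
    """Population match rate relative to the overall match rate (assumed formula)."""
    pop_rate = pop_matches / (pop_size - pop_no_failed)            # exclude samples that failed the no test
    overall_rate = total_matches / (total_size - total_no_failed)  # assumption: same exclusion overall
    return pop_rate / overall_rate

# 11_765 superset plot: 484 of the 765 matching samples are European, and the no list is empty.
eur = relative_expression(pop_matches=484, pop_size=1006,
                          total_matches=765, total_size=5008)
print(round(eur, 2))
```

Under these assumptions the result rounds to the 3.15 shown in the eur column of the 11_765 table.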

Subset Plots

A subset plot starts with a test sample set and identifies series that are expressed by some subset of that sample set. The samples that express each of the series in the result set must be a subset of the samples in the test set up to some match criterion. Generally, a 90% match criterion was used for these plots.
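The subset test can be sketched as the fraction of each series' own samples that fall inside the test set. The helper `subset_series` and the toy masks are hypothetical; the example also shows why relaxing the criterion from 0.9 to 0.8 can admit a series that has lost some of its samples from the test set.

```python
import numpy as np

def subset_series(test_mask, series_masks, min_match):
    """Series whose expressing samples are mostly inside the test set (hypothetical helper)."""
    result = {}
    for series_id, mask in series_masks.items():
        total = int(mask.sum())                    # all samples that express the series
        matches = int((mask & test_mask).sum())    # those that are also in the test set
        if total > 0 and matches / total >= min_match:
            result[series_id] = (matches, total)
    return result

idx = np.arange(20)
test_mask = idx < 10
series_masks = {
    "inside": (idx >= 2) & (idx <= 5),    # 4 of 4 samples in the test set
    "mostly": (idx >= 5) & (idx <= 10),   # 5 of 6 samples in the test set
    "outside": idx >= 12,                 # no samples in the test set
}

at_90 = subset_series(test_mask, series_masks, min_match=0.9)
at_80 = subset_series(test_mask, series_masks, min_match=0.8)
print(at_90, at_80)
```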

In [5]:
plt_obj = dm.subset_yes_no([dm.di_193_843], min_match=0.8)
plt = plt_obj.do_plot()
show(plt)
Out[5]:

<Bokeh Notebook handle for In[5]>

The plot above shows series that are expressed by chromosome 2 samples that are subsets of the chromosome samples that express the series 193_843. Those series are all part of the hierarchy that has been named the SAS tree. The match criterion has been relaxed to 0.8 because some of the chromosome samples that express the series 8_267 have experienced a recombination event that caused the loss of enough 193_843 SNPs for those samples not to match 193_843.

In [6]:
HTML(plt_obj.get_html())
Out[6]:
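index | first | length | snps | alleles | matches | afr | afx | amr | eas | eur | sas | sax
353504 | 135,964,764 | 136,368 | 6 | 946 0.88 | 832 0.99 | 78 0.47 | 26 0.50 | 141 1.22 | 165 0.99 | 82 0.49 | 114 1.89 | 226 2.22
353248 | 135,759,095 | 628,204 | 193 | 843 1.00 | 843 1.00 | 79 0.47 | 26 0.49 | 143 1.22 | 167 0.98 | 83 0.49 | 117 1.91 | 228 2.21
353384 | 135,839,805 | 84,662 | 4 | 815 0.82 | 667 0.79 | 47 0.35 | 18 0.43 | 135 1.46 | 147 1.09 | 71 0.53 | 79 1.63 | 170 2.08
353521 | 135,989,333 | 275,298 | 6 | 713 0.94 | 668 0.79 | 47 0.35 | 18 0.43 | 135 1.46 | 146 1.09 | 73 0.54 | 79 1.63 | 170 2.08
353511 | 135,977,595 | 306,204 | 4 | 416 1.00 | 415 0.49 | 33 0.40 | 9 0.35 | 77 1.34 | 96 1.15 | 24 0.29 | 53 1.76 | 123 2.42
353358 | 135,818,487 | 471,078 | 8 | 267 0.84 | 223 0.26 | 1 0.02 | 3 0.21 | 56 1.81 | 47 1.05 | 48 1.07 | 24 1.48 | 44 1.61
353320 | 135,789,764 | 571,688 | 4 | 91 1.00 | 91 0.11 | 0 0.00 | 0 0.00 | 64 5.08 | 25 1.36 | 2 0.11 | 0 0.00 | 0 0.00
353285 | 135,772,482 | 525,048 | 7 | 90 1.00 | 90 0.11 | 24 1.32 | 3 0.53 | 1 0.08 | 19 1.05 | 0 0.00 | 17 2.60 | 26 2.36
353396 | 135,846,793 | 454,499 | 7 | 79 1.00 | 79 0.09 | 0 0.00 | 0 0.00 | 66 6.03 | 11 0.69 | 2 0.13 | 0 0.00 | 0 0.00
353292 | 135,775,651 | 610,782 | 29 | 68 1.00 | 68 0.08 | 1 0.07 | 1 0.23 | 5 0.53 | 0 0.00 | 10 0.73 | 20 4.05 | 31 3.72
353809 | 136,404,524 | 150,033 | 15 | 59 0.92 | 54 0.06 | 0 0.00 | 0 0.00 | 2 0.27 | 1 0.09 | 9 0.83 | 16 4.08 | 26 3.93
353355 | 135,814,490 | 711,812 | 14 | 57 1.00 | 57 0.07 | 0 0.00 | 0 0.00 | 0 0.00 | 16 1.39 | 0 0.00 | 18 4.34 | 23 3.29
354133 | 136,654,760 | 116,007 | 20 | 56 0.80 | 45 0.05 | 1 0.11 | 2 0.71 | 0 0.00 | 9 0.99 | 1 0.11 | 9 2.75 | 23 4.17
353811 | 136,405,656 | 87,993 | 4 | 49 1.00 | 49 0.06 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 17 4.77 | 32 5.33
353626 | 136,133,578 | 283,624 | 4 | 45 1.00 | 45 0.05 | 0 0.00 | 0 0.00 | 40 6.41 | 0 0.00 | 0 0.00 | 1 0.31 | 4 0.73
353437 | 135,884,166 | 499,757 | 6 | 43 1.00 | 43 0.05 | 0 0.00 | 0 0.00 | 1 0.17 | 0 0.00 | 12 1.39 | 9 2.88 | 21 3.98
353262 | 135,763,383 | 650,325 | 14 | 39 1.00 | 39 0.05 | 30 3.82 | 6 2.45 | 3 0.56 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353450 | 135,894,943 | 479,066 | 4 | 38 1.00 | 38 0.05 | 0 0.00 | 0 0.00 | 38 7.22 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353413 | 135,865,015 | 494,175 | 9 | 29 1.00 | 29 0.03 | 2 0.34 | 0 0.00 | 7 1.74 | 0 0.00 | 7 1.20 | 4 1.90 | 9 2.53
353998 | 136,568,458 | 22,850 | 5 | 28 0.82 | 23 0.03 | 1 0.22 | 1 0.69 | 0 0.00 | 1 0.22 | 1 0.22 | 8 4.79 | 11 3.90
353719 | 136,288,733 | 138,847 | 4 | 27 1.00 | 27 0.03 | 0 0.00 | 0 0.00 | 24 6.41 | 1 0.18 | 2 0.37 | 0 0.00 | 0 0.00
353459 | 135,909,151 | 834,668 | 15 | 27 1.00 | 27 0.03 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 13 6.62 | 14 4.23
353280 | 135,770,895 | 701,964 | 19 | 27 1.00 | 27 0.03 | 23 4.23 | 3 1.77 | 1 0.27 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353468 | 135,921,205 | 438,735 | 5 | 26 1.00 | 26 0.03 | 0 0.00 | 2 1.23 | 3 0.83 | 0 0.00 | 15 2.87 | 2 1.06 | 4 1.25
354047 | 136,595,044 | 44,512 | 15 | 25 0.80 | 20 0.02 | 1 0.25 | 2 1.59 | 0 0.00 | 1 0.25 | 1 0.25 | 6 4.13 | 9 3.67
353386 | 135,840,292 | 457,645 | 10 | 25 0.96 | 24 0.03 | 0 0.00 | 0 0.00 | 2 0.60 | 0 0.00 | 6 1.24 | 8 4.59 | 8 2.72
353568 | 136,037,384 | 313,943 | 6 | 19 1.00 | 19 0.02 | 0 0.00 | 0 0.00 | 3 1.14 | 0 0.00 | 4 1.05 | 2 1.45 | 10 4.29
353266 | 135,765,502 | 729,701 | 6 | 19 1.00 | 19 0.02 | 15 3.92 | 4 3.36 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353548 | 136,013,433 | 296,033 | 8 | 17 0.88 | 15 0.02 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 7 6.42 | 8 4.35
353456 | 135,899,723 | 462,140 | 5 | 16 1.00 | 16 0.02 | 0 0.00 | 0 0.00 | 4 1.80 | 0 0.00 | 7 2.18 | 0 0.00 | 5 2.55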

The data in the table above shows 19 series where 100% of the chromosome samples that express the series also express 193_843.

Three other series, 4_815, 6_713, and 8_267, are of particular interest. All three are contained within the interval of chromosome 2 covered by 193_843. Recombination events that lose association with 193_843 must be somewhere within the DNA covered by the 193_843 series. More than 80% of the chromosome samples that express any of the three series also express 193_843. But a significant number of those chromosome samples do not.

The reasons are two recombination events that are analyzed in the part of this notebook focused on allele masks. One involved a chromosome that expressed the series 8_267 and 6_713. It resulted from a recombination event that lost the last 23 SNPs of 193_843. The result of that recombination event became overexpressed through a selection process associated with the appearance of the series 9_39.

The other recombination event involved a chromosome that expressed 193_843 and the other major series that form the root of the SAS tree with a chromosome that expressed the series 5_684 that is a major component of the EAS tree hierarchy. That recombination event resulted in a number of overexpressed variations on the association of 4_815 and 5_684 that appear to have conferred some kind of selective advantage on the recombinant chromosomes.

Another class of cases includes 15_59, 20_56, 5_28, and 15_25. All of these series cover parts of the studied region of chromosome 2 that are beyond the end of 193_843. As is commonly the case, hierarchies selected by these series have experienced recombination events between the more stable lower part of the region and the less stable upper part.

The two other cases are the series 10_25 and 8_17. Both have experienced recombination events within the region covered by 193_843. But neither case has generated an overexpressed result.

Basis Plots

A basis plot is used to match samples to the most specific series in a hierarchy that the samples express. Basis filtering can be used in combination with yes no comparisons to explore recombination events for samples that express particular parts of a hierarchy.
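Basis assignment can be sketched as choosing, for each sample, one series among those it expresses. The sketch assumes that "most specific" means the series expressed by the fewest samples overall; that reading, the helper `basis_counts`, and the toy masks are assumptions made for illustration, and the actual basis filtering may also take the hierarchy into account.

```python
import numpy as np

def basis_counts(series_masks):
    """Count samples whose most specific expressed series is each series (assumed rule)."""
    ids = list(series_masks)
    stack = np.array([series_masks[s] for s in ids], dtype=bool)   # series by samples
    totals = stack.sum(axis=1, dtype=np.int64)                     # samples expressing each series
    sentinel = stack.shape[1] + 1                                  # larger than any possible total
    scores = np.where(stack, totals[:, None], sentinel)            # total where expressed, else sentinel
    best = scores.argmin(axis=0)                                   # series with the fewest samples
    has_series = stack.any(axis=0)                                 # ignore samples with no series
    counts = np.bincount(best[has_series], minlength=len(ids))
    return dict(zip(ids, counts.tolist()))

idx = np.arange(8)
series_masks = {
    "wide": idx < 6,     # expressed by 6 samples
    "narrow": idx < 2,   # expressed by 2 of those 6 samples
}
basis = basis_counts(series_masks)
print(basis)
```

The two samples that express both series are counted only under the narrower one, so the wide series keeps the four samples that express nothing more specific.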

In [7]:
plt_obj = dm.basis_yes_no([dm.di_6_1503, dm.di_26_1414])
plt = plt_obj.do_basis_plot()
show(plt)
Out[7]:

<Bokeh Notebook handle for In[7]>

This plot shows the basis series for all of the chromosome 2 samples in the phase 3 1000 genomes data that express both the series 6_1503 and 26_1414. These criteria identify the chromosome 2 samples that express hierarchies that are part of the EUR tree.

Series are labeled in the top left corner by their id. The number at the end of the label is the number of chromosome 2 samples that express the series and do not express any other more specific series in the lactase persistence region of chromosome 2. For example, the 135 chromosome 2 samples that express the series 4_911 as a basis series do not express the series 11_765. All but 3 of the chromosome samples that express 11_765 also express 4_911. But, since 911 chromosome samples express 4_911 and only 765 chromosome samples express 11_765, 11_765 is the more specific series.

The color used to represent the basis series is the color used to represent the whole series. But the height of the basis series rectangles is a function of the log of the number of chromosome samples that actually express it as a basis series.

In [8]:
HTML(plt_obj.get_basis_html())
Out[8]:
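index | first | last | length | snps | alleles | basis | afr | afx | amr | eas | eur | sas | sax
353797 | 136,398,174 | 136,474,098 | 75,924 | 26 | 1414 | 202 | 6 0.15 | 2 0.16 | 39 1.39 | 80 1.97 | 20 0.49 | 19 1.29 | 36 1.45
354129 | 136,652,953 | 136,761,175 | 108,222 | 5 | 1296 | 3 | 0 0.00 | 0 0.00 | 0 0.00 | 3 4.97 | 0 0.00 | 0 0.00 | 0 0.00
354127 | 136,652,491 | 136,732,772 | 80,281 | 6 | 1114 | 7 | 0 0.00 | 0 0.00 | 0 0.00 | 2 1.42 | 1 0.71 | 2 3.93 | 2 2.33
353984 | 136,556,805 | 136,747,085 | 190,280 | 39 | 1014 | 9 | 0 0.00 | 0 0.00 | 0 0.00 | 7 3.86 | 1 0.55 | 1 1.53 | 0 0.00
353604 | 136,092,061 | 136,407,479 | 315,418 | 4 | 911 | 135 | 0 0.00 | 0 0.00 | 56 2.99 | 0 0.00 | 35 1.29 | 16 1.63 | 28 1.69
353902 | 136,494,985 | 136,773,638 | 278,653 | 81 | 857 | 2 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 2 8.16
353938 | 136,514,709 | 136,543,147 | 28,438 | 6 | 820 | 5 | 0 0.00 | 0 0.00 | 1 1.44 | 0 0.00 | 0 0.00 | 2 5.50 | 2 3.26
353380 | 135,837,906 | 136,707,982 | 870,076 | 11 | 765 | 765 | 2 0.01 | 32 0.67 | 146 1.38 | 0 0.00 | 484 3.15 | 56 1.01 | 45 0.48
354061 | 136,603,638 | 136,632,125 | 28,487 | 16 | 511 | 1 | 0 0.00 | 0 0.00 | 1 7.22 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
354064 | 136,605,402 | 136,624,686 | 19,284 | 8 | 328 | 3 | 0 0.00 | 0 0.00 | 1 2.41 | 0 0.00 | 2 3.32 | 0 0.00 | 0 0.00
354123 | 136,651,773 | 136,774,627 | 122,854 | 51 | 176 | 2 | 0 0.00 | 0 0.00 | 0 0.00 | 2 4.97 | 0 0.00 | 0 0.00 | 0 0.00
354051 | 136,596,162 | 136,785,729 | 189,567 | 17 | 137 | 3 | 0 0.00 | 0 0.00 | 0 0.00 | 1 1.66 | 2 3.32 | 0 0.00 | 0 0.00
353987 | 136,560,126 | 136,742,080 | 181,954 | 4 | 57 | 1 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 13.76 | 0 0.00
353444 | 135,889,881 | 136,382,936 | 493,055 | 6 | 57 | 53 | 2 0.19 | 1 0.30 | 14 1.91 | 0 0.00 | 28 2.63 | 5 1.30 | 3 0.46
353407 | 135,859,157 | 135,980,938 | 121,781 | 6 | 35 | 34 | 1 0.15 | 1 0.47 | 12 2.55 | 0 0.00 | 17 2.49 | 2 0.81 | 1 0.24
353252 | 135,759,628 | 136,443,550 | 683,922 | 12 | 32 | 1 | 0 0.00 | 0 0.00 | 1 7.22 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
354074 | 136,608,339 | 136,738,273 | 129,934 | 7 | 31 | 1 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
354116 | 136,648,736 | 136,762,850 | 114,114 | 4 | 26 | 1 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 4.98 | 0 0.00 | 0 0.00
354176 | 136,687,844 | 136,709,809 | 21,965 | 4 | 25 | 1 | 0 0.00 | 0 0.00 | 0 0.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00
353315 | 135,786,061 | 136,586,684 | 800,623 | 7 | 22 | 22 | 0 0.00 | 0 0.00 | 0 0.00 | 19 4.29 | 0 0.00 | 3 1.88 | 0 0.00
353570 | 136,039,834 | 136,349,562 | 309,728 | 7 | 20 | 17 | 0 0.00 | 1 0.94 | 4 1.70 | 0 0.00 | 11 3.22 | 0 0.00 | 1 0.48
353692 | 136,243,714 | 136,351,532 | 107,818 | 5 | 18 | 1 | 0 0.00 | 0 0.00 | 1 7.22 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353922 | 136,504,541 | 136,528,224 | 23,683 | 4 | 17 | 1 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00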

The data accompanying the basis plots needs to be consulted to understand the population distribution of the chromosome samples that express a particular series as a basis. The data in this table for the series 26_1414, 4_911, and 11_765 provide some insight into the evolution of lactase persistence.

The 202 chromosome samples that express 26_1414 as a basis suggest that the root series of the EUR tree already had some kind of selective advantage. Those samples are heavily underexpressed in African populations, modestly underexpressed in European populations, modestly overexpressed in American and South Asian populations, and most heavily overexpressed in East Asian populations. The root series of the EUR tree does seem to have been favored by the expansion out of Africa. But it appears to have been distributed over all of the other 1000 genomes population regions.

The 135 chromosomes that express 4_911 as a basis do not include any African or East Asian chromosomes. They are modestly overexpressed for European populations, more overexpressed for South Asian populations and most overexpressed for American populations. It seems unlikely that the American samples got to America through the migration either of East Asian or European populations.

Series SNP distributions

Another kind of plot is used to examine the number of SNPs from a series that are expressed by samples from the 1000 genomes data. This plot shows number of series SNPs on the x axis and number of 1000 genomes chromosome samples that express that number of the series SNPs on the y axis.

This example shows that kind of plot for the 117 SNPs from the series 117_1685.

In [9]:
count_data = an.sa_117_1685.unique_snps_per_allele()
plt_obj = smcp(count_data)
plt = plt_obj.do_plot()
show(plt)
Out[9]:

<Bokeh Notebook handle for In[9]>

This plot shows the numbers of samples that express different numbers of SNPs from the series 117_1685.

In [10]:
count_data
Out[10]:
array([(0, 2680), (1, 153), (2, 15), (3, 7), (4, 1), (5, 4), (6, 2),
       (7, 1), (8, 1), (10, 2), (11, 5), (12, 1), (13, 1), (17, 3),
       (20, 1), (21, 13), (22, 3), (24, 2), (43, 22), (44, 2), (46, 1),
       (47, 1), (50, 1), (53, 1), (54, 1), (57, 1), (58, 1), (59, 11),
       (60, 28), (61, 167), (62, 53), (63, 2), (64, 1), (68, 1), (70, 3),
       (75, 1), (83, 2), (87, 2), (92, 6), (93, 1), (94, 2), (95, 14),
       (96, 87), (97, 4), (98, 1), (100, 5), (102, 5), (104, 1), (105, 6),
       (106, 70), (107, 5), (108, 6), (109, 1), (110, 1), (111, 41),
       (112, 3), (113, 9), (114, 12), (115, 55), (116, 294), (117, 1182)], 
      dtype=[('count', '<u2'), ('snps', '<u2')])

This data shows the SNP fragment size and the number of samples expressing that number of 117_1685 SNPs as an array of tuples. Each tuple associates a number of SNPs with the number of samples that express that number of SNPs.

In this example the distribution is strongly peaked with 2680 samples expressing none of the 117 SNPs and 1182 samples expressing all 117 of them. This kind of result was observed for all of the 76 series documented in the detailed results of this study.

The distribution includes several cases of peaks for significant numbers of samples expressing the same size fragment of the 117 SNPs. Investigations documented in the detailed results showed that multiple fragments of the same size generally indicate expression of multiple instances of fragments with the same or similar sequences of SNPs.
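Pairs like the ones in the output above, a number of series SNPs and the number of samples that express that many, can be derived from the SNP by sample array with a column sum followed by `np.unique`. The function name and toy data are hypothetical; this is a sketch of the idea, not the code behind `unique_snps_per_allele`.

```python
import numpy as np

def snps_per_sample_distribution(expr, series_rows):
    """Pairs of (number of series SNPs, number of samples expressing that many)."""
    per_sample = expr[series_rows].sum(axis=0)     # series SNPs expressed by each sample
    values, counts = np.unique(per_sample, return_counts=True)
    return list(zip(values.tolist(), counts.tolist()))

# Toy data: 3 SNPs (rows) by 6 samples (columns).
expr = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
], dtype=bool)
distribution = snps_per_sample_distribution(expr, [0, 1, 2])
print(distribution)
```

Two samples express all three SNPs, two express a single SNP, and two express none, which is a small version of the strongly peaked shape described above.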

In [11]:
sa_64_1575 = snp_anal.series_anal_cls(dm.di_64_1575, sm.series_data)
count_data = sa_64_1575.unique_snps_per_allele()
plt_obj = smcp(count_data)
plt = plt_obj.do_plot()
show(plt)
Out[11]:

<Bokeh Notebook handle for In[11]>

This plot shows the SNP distribution analysis for the series 64_1575. It is another case where substantial numbers of fragments exist that generally can be traced back to recombination events and some kind of process that selected the fragment for overexpression.

In [12]:
count_data
Out[12]:
array([(0, 2689), (1, 191), (2, 136), (3, 2), (4, 33), (5, 40), (6, 25),
       (7, 4), (9, 7), (10, 1), (11, 17), (12, 1), (13, 1), (14, 1),
       (15, 2), (20, 3), (21, 42), (22, 4), (23, 1), (26, 7), (27, 3),
       (28, 9), (29, 1), (30, 5), (31, 37), (32, 6), (33, 4), (35, 1),
       (36, 12), (37, 6), (39, 2), (40, 50), (41, 8), (42, 1), (43, 15),
       (44, 2), (46, 2), (49, 6), (50, 1), (51, 2), (52, 46), (53, 3),
       (54, 1), (57, 3), (58, 31), (61, 21), (62, 19), (63, 135),
       (64, 1369)], 
      dtype=[('count', '<u2'), ('snps', '<u2')])

This data is the array of SNP counts and sample numbers for the 64_1575 analysis.

Allele Mask Plots

Detailed analysis of series fragments can be carried out with allele mask plots. An allele mask is an array with a boolean value for each of the 5008 chromosome samples in the 1000 genomes phase 3 data. The value is true for those chromosome samples that satisfy some logical test. Allele masks for samples that express fragments of a specific size from some series are useful for analyzing those series fragments.
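A mask for the samples that express a series, together with its complement, can be sketched from the rule that a sample expresses a series when it expresses at least 90% of the series SNPs. The helper `series_allele_mask`, the names `am` and `nam`, and the toy array are hypothetical stand-ins; the masks imported in the next cell are built by the project's own code.

```python
import numpy as np

def series_allele_mask(expr, series_rows, min_fraction=0.9):
    """True for samples that express at least min_fraction of the series SNPs."""
    per_sample = expr[series_rows].sum(axis=0)           # series SNPs expressed by each sample
    return per_sample >= min_fraction * len(series_rows)

# Toy data: a 5 SNP series over 4 samples.
expr = np.zeros((5, 4), dtype=bool)
expr[:, 0] = True     # sample 0 expresses all 5 SNPs
expr[:4, 1] = True    # sample 1 expresses 4 of 5, below the 90% threshold
expr[:2, 2] = True    # sample 2 expresses a 2 SNP fragment
                      # sample 3 expresses none of the SNPs

am = series_allele_mask(expr, [0, 1, 2, 3, 4])   # samples that express the series
nam = ~am                                        # samples that do not
print(am, nam)
```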

In [13]:
from genomes_dnj.lct_interval.series_masks import am_193_843, nam_193_843
an.sa_193_843.unique_snps_per_allele(am_193_843)
Out[13]:
array([(174, 4), (175, 1), (181, 3), (185, 1), (186, 1), (188, 2),
       (191, 15), (192, 120), (193, 696)], 
      dtype=[('count', '<u2'), ('snps', '<u2')])

The an.sa_193_843 object is an instance of the series_anal_cls. It is used for analysis of the samples that express any subset of the 193 SNPs in the series 193_843. The allele mask am_193_843 is a boolean mask that identifies the 1000 genomes chromosome 2 samples that express the series 193_843. The unique_snps_per_allele method returns an array of pairs, each giving a number of SNPs and the number of chromosome samples that express that number of 193_843 SNPs. This data shows that 696 of the 843 chromosome samples that express 193_843 express all 193 of its SNPs. Another 120 of the chromosomes express 192 of them. Smaller numbers of chromosome samples express smaller numbers of SNPs that are fragments of the whole series.

In [14]:
an.sa_193_843.unique_snps_per_allele(nam_193_843)
Out[14]:
array([(0, 3458), (1, 215), (2, 174), (3, 2), (4, 3), (5, 52), (6, 8),
       (7, 3), (8, 2), (9, 1), (10, 1), (11, 1), (12, 39), (15, 1),
       (17, 1), (18, 3), (19, 1), (21, 1), (23, 3), (26, 1), (27, 1),
       (28, 12), (29, 94), (30, 1), (44, 2), (54, 1), (55, 27), (63, 2),
       (65, 2), (98, 1), (100, 1), (101, 1), (139, 1), (158, 1), (164, 3),
       (165, 3), (169, 1), (170, 41)], 
      dtype=[('count', '<u2'), ('snps', '<u2')])

The nam_193_843 boolean mask identifies chromosome 2 samples that do not express 193_843. The results are an array of numbers of 193_843 SNPs and the number of chromosome 2 samples that express that number of 193_843 SNPs. The results show that 3458 samples express none of the 193 SNPs.

The data does show that a significant number of samples express 1 or 2 of the 193 SNPs. It also shows that 5, 6, 12, 28, 29, 55 and 170 SNP fragments are expressed by numbers of samples that indicate substantial overexpression. There also is a background of other fragments of the 193 SNPs that could be available for some kind of future selection process.

In [15]:
aps_29, am_29 = an.sa_193_843.snps_from_aps_value(29, nam_193_843)
aps_29
Out[15]:
array([93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93,
       93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,
        0,  1,  0,  0,  0,  0,  0,  0,  1,  0,  0,  1,  0,  1,  0,  0,  1,
        0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  1,  1,  1,  0,  0,
        0,  0,  1,  0,  1,  0,  0,  0,  0,  1,  0,  1,  0,  0,  1,  1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  0,  0,  0,  0,  1,
        1,  0,  0,  1,  0,  0,  1,  0,  1,  0,  1,  1,  0,  0,  1,  0,  0,
        0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0])

This data looks at the 29 SNP fragments that are expressed by 94 chromosome samples. The method snps_from_aps_value takes a count of the number of SNPs and a boolean mask of chromosome 2 samples. It returns an array of the number of samples that express each of the 193 SNPs and a boolean mask of samples that express 29 SNP fragments of the 193 SNPs. Only samples identified in the input mask are eligible for the output mask.
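The behavior described for snps_from_aps_value can be sketched as follows. This is an illustration based only on the description above: the function `snps_from_count`, its parameters, and the toy data are hypothetical, and the real method may differ in its details.

```python
import numpy as np

def snps_from_count(expr, series_rows, n_snps, in_mask):
    """Per-SNP sample counts and a mask for samples expressing exactly n_snps series SNPs."""
    series = expr[series_rows]
    per_sample = series.sum(axis=0)                  # series SNPs expressed by each sample
    out_mask = in_mask & (per_sample == n_snps)      # only samples in the input mask are eligible
    per_snp = series[:, out_mask].sum(axis=1)        # selected samples expressing each SNP
    return per_snp, out_mask

# Toy data: a 4 SNP series over 5 samples.
expr = np.array([
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
], dtype=bool)
in_mask = np.array([1, 1, 1, 0, 1], dtype=bool)      # sample 3 is excluded by the input mask

per_snp, out_mask = snps_from_count(expr, [0, 1, 2, 3], n_snps=2, in_mask=in_mask)
print(per_snp, out_mask)
```

In the toy result two samples share the same first two SNPs while a third expresses a different pair, a small analogue of a dominant fragment accompanied by an irregular one.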

In this case 93 of the chromosome samples express the first 29 SNPs. That pattern reflects the result of some recombination event. The other chromosome sample expresses a very irregular collection of 29 different SNPs generated by some kind of unknown process.

In [16]:
plt_obj = dm.superset_allele_mask(am_29, min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
Out[16]:

<Bokeh Notebook handle for In[16]>

This plot shows the series expressed by at least 50% of the chromosome samples that express a 29 SNP fragment of the series 193_843 SNPs. It identifies the results of recombination events that have associated the series 4_815 from the South Asian tree with 5_684 and 32_1361 from the East Asian tree and with 64_1575, 10_2206, and 7_1868 from the European tree.

In [17]:
HTML(plt_obj.get_html())
Out[17]:
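index | first | length | snps | alleles | matches | afr | afx | amr | eas | eur | sas | sax
353921 | 136,501,840 | 53,819 | 10 | 2206 0.03 | 60 0.64 | 0 0.00 | 0 0.00 | 0 0.00 | 59 4.89 | 0 0.00 | 1 0.23 | 0 0.00
354170 | 136,682,274 | 93,624 | 7 | 1868 0.03 | 62 0.66 | 1 0.08 | 0 0.00 | 0 0.00 | 59 4.73 | 1 0.08 | 1 0.22 | 0 0.00
353901 | 136,494,186 | 271,765 | 64 | 1575 0.04 | 60 0.64 | 0 0.00 | 0 0.00 | 0 0.00 | 59 4.89 | 0 0.00 | 1 0.23 | 0 0.00
353791 | 136,393,658 | 92,684 | 32 | 1361 0.06 | 79 0.84 | 1 0.06 | 0 0.00 | 0 0.00 | 71 4.47 | 1 0.06 | 4 0.70 | 2 0.21
353384 | 135,839,805 | 84,662 | 4 | 815 0.11 | 93 0.99 | 2 0.11 | 0 0.00 | 0 0.00 | 84 4.49 | 1 0.05 | 4 0.59 | 2 0.18
353498 | 135,959,272 | 416,583 | 5 | 684 0.13 | 87 0.93 | 0 0.00 | 0 0.00 | 0 0.00 | 82 4.68 | 0 0.00 | 4 0.63 | 1 0.09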

The data from the plot show that 93 out of the 94 chromosome samples that express 29 193_843 SNPs express the series 4_815.

In [18]:
plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815], min_match=0.5)
plt = plt_obj.do_plot()
am_4_815 = plt_obj.plot_context.yes_allele_mask
show(plt)
Out[18]:

<Bokeh Notebook handle for In[18]>

In [19]:
HTML(plt_obj.get_html())
Out[19]:
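index | first | length | snps | alleles | matches | afr | afx | amr | eas | eur | sas | sax
353921 | 136,501,840 | 53,819 | 10 | 2206 0.03 | 60 0.65 | 0 0.00 | 0 0.00 | 0 0.00 | 59 4.89 | 0 0.00 | 1 0.23 | 0 0.00
354170 | 136,682,274 | 93,624 | 7 | 1868 0.03 | 62 0.67 | 1 0.08 | 0 0.00 | 0 0.00 | 59 4.73 | 1 0.08 | 1 0.22 | 0 0.00
353901 | 136,494,186 | 271,765 | 64 | 1575 0.04 | 60 0.65 | 0 0.00 | 0 0.00 | 0 0.00 | 59 4.89 | 0 0.00 | 1 0.23 | 0 0.00
353791 | 136,393,658 | 92,684 | 32 | 1361 0.06 | 79 0.85 | 1 0.06 | 0 0.00 | 0 0.00 | 71 4.47 | 1 0.06 | 4 0.70 | 2 0.21
353384 | 135,839,805 | 84,662 | 4 | 815 0.11 | 93 1.00 | 2 0.11 | 0 0.00 | 0 0.00 | 84 4.49 | 1 0.05 | 4 0.59 | 2 0.18
353498 | 135,959,272 | 416,583 | 5 | 684 0.13 | 87 0.94 | 0 0.00 | 0 0.00 | 0 0.00 | 82 4.68 | 0 0.00 | 4 0.63 | 1 0.09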

The data from the plot show that 87 of the 93 chromosome samples that express 29 193_843 SNPs and the series 4_815 also express the series 5_684.

In [20]:
an.sa_193_843.alleles_per_snp(am_4_815)
Out[20]:
array([93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93,
       93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0])

This data confirms that the 93 chromosome samples that express 4_815 are the 93 chromosome samples that express the first 29 SNPs of the series 193_843.

In [21]:
plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_5_684, dm.di_32_1361, dm.di_64_1575], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[21]:

<Bokeh Notebook handle for In[21]>

This plot limits the chromosome samples to those that express all of the 6 series shown in the plot and no others.

In [22]:
HTML(plt_obj.get_html())
Out[22]:
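index | first | length | snps | alleles | matches | afr | afx | amr | eas | eur | sas | sax
353921 | 136,501,840 | 53,819 | 10 | 2206 0.03 | 58 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 57 4.88 | 0 0.00 | 1 0.24 | 0 0.00
354170 | 136,682,274 | 93,624 | 7 | 1868 0.03 | 58 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 57 4.88 | 0 0.00 | 1 0.24 | 0 0.00
353901 | 136,494,186 | 271,765 | 64 | 1575 0.04 | 58 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 57 4.88 | 0 0.00 | 1 0.24 | 0 0.00
353791 | 136,393,658 | 92,684 | 32 | 1361 0.04 | 58 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 57 4.88 | 0 0.00 | 1 0.24 | 0 0.00
353384 | 135,839,805 | 84,662 | 4 | 815 0.07 | 58 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 57 4.88 | 0 0.00 | 1 0.24 | 0 0.00
353498 | 135,959,272 | 416,583 | 5 | 684 0.08 | 58 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 57 4.88 | 0 0.00 | 1 0.24 | 0 0.00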

The data shows that 58 chromosomes satisfy the plot's criteria. All but 1 of them are East Asian.

The next 6 plots show the series expressed by the 6 chromosome samples that express the first 29 193_843 SNPs and series 4_815 but not series 5_684. Every one of those chromosome samples expresses a different association of series.

In [23]:
plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_32_1361, dm.di_81_857, dm.di_5_47], [dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[23]:

<Bokeh Notebook handle for In[23]>

In [24]:
HTML(plt_obj.get_html())
Out[24]:
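index | first | length | snps | alleles | matches | afr | afx | amr | eas | eur | sas | sax
354033 | 136,588,031 | 5,647 | 7 | 1760 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
354130 | 136,653,925 | 107,928 | 24 | 1504 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
353925 | 136,506,375 | 32,564 | 4 | 1442 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
353791 | 136,393,658 | 92,684 | 32 | 1361 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
353958 | 136,535,876 | 19,014 | 7 | 1303 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
354129 | 136,652,953 | 108,222 | 5 | 1296 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
353919 | 136,500,475 | 42,085 | 13 | 1227 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
354127 | 136,652,491 | 80,281 | 6 | 1114 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
353902 | 136,494,985 | 278,653 | 81 | 857 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
353384 | 135,839,805 | 84,662 | 4 | 815 0.00 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16
353554 | 136,021,814 | 347,522 | 5 | 47 0.02 | 1 1.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 1 8.16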
In [25]:
plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_32_1361, dm.di_81_857], [dm.di_5_47, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[25]:

<Bokeh Notebook handle for In[25]>

In [26]:
HTML(plt_obj.get_html())
Out[26]:
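index | first | length | snps | alleles | matches | afr | afx | amr | eas | eur | sas | sax
354033 | 136,588,031 | 5,647 | 7 | 1760 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
354130 | 136,653,925 | 107,928 | 24 | 1504 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353925 | 136,506,375 | 32,564 | 4 | 1442 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353791 | 136,393,658 | 92,684 | 32 | 1361 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353958 | 136,535,876 | 19,014 | 7 | 1303 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
354129 | 136,652,953 | 108,222 | 5 | 1296 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353919 | 136,500,475 | 42,085 | 13 | 1227 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
354127 | 136,652,491 | 80,281 | 6 | 1114 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353902 | 136,494,985 | 278,653 | 81 | 857 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353384 | 135,839,805 | 84,662 | 4 | 815 0.00 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00
353570 | 136,039,834 | 309,728 | 7 | 20 0.05 | 1 1.00 | 1 4.97 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00 | 0 0.00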
In [27]:
plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_32_1361], [dm.di_81_857, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[27]:

<Bokeh Notebook handle for In[27]>

In [28]:
HTML(plt_obj.get_html())
Out[28]:
index   first        length   snps  alleles    matches  afr      afx      amr      eas      eur      sas      sax
354170  136,682,274  93,624   7     1868 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353791  136,393,658  92,684   32    1361 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353384  135,839,805  84,662   4     815 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
In [29]:
plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_5_1460, dm.di_117_1685], 
                                  [dm.di_32_1361, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[29]:

<Bokeh Notebook handle for In[29]>

In [30]:
HTML(plt_obj.get_html())
Out[30]:
index   first        length   snps  alleles    matches  afr      afx      amr      eas      eur      sas      sax
354170  136,682,274  93,624   7     1868 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354033  136,588,031  5,647    7     1760 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353244  135,758,231  618,284  117   1685 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353478  135,933,921  434,642  123   1561 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353814  136,406,646  31,432   5     1460 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353790  136,393,157  48,253   10    1218 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353906  136,496,493  57,432   9     1170 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353935  136,511,874  21,321   5     976 0.00   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353729  136,309,239  52,321   9     887 0.00   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353384  135,839,805  84,662   4     815 0.00   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353851  136,448,855  17,976   4     612 0.00   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353764  136,364,916  22,977   5     588 0.00   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353614  136,115,507  269,773  28    434 0.00   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354064  136,605,402  19,284   8     328 0.00   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354124  136,651,964  94,288   4     163 0.01   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353909  136,496,875  51,331   18    148 0.01   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353992  136,562,716  21,998   6     127 0.01   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
In [31]:
plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_5_1460], 
                                  [dm.di_117_1685, dm.di_32_1361, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[31]:

<Bokeh Notebook handle for In[31]>

In [32]:
HTML(plt_obj.get_html())
Out[32]:
index   first        length   snps  alleles    matches  afr      afx      amr      eas      eur      sas      sax
353814  136,406,646  31,432   5     1460 0.00  1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353906  136,496,493  57,432   9     1170 0.00  1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353849  136,447,707  42,559   4     1149 0.00  1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353907  136,496,805  55,824   9     1023 0.00  1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353984  136,556,805  190,280  39    1014 0.00  1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353935  136,511,874  21,321   5     976 0.00   1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353807  136,403,994  81,102   13    911 0.00   1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353938  136,514,709  28,438   6     820 0.00   1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353384  135,839,805  84,662   4     815 0.00   1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
In [33]:
plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815], [dm.di_5_1460, dm.di_32_1361, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[33]:

<Bokeh Notebook handle for In[33]>

In [34]:
HTML(plt_obj.get_html())
Out[34]:
index   first        length   snps  alleles    matches  afr      afx      amr      eas      eur      sas      sax
353921  136,501,840  53,819   10    2206 0.00  1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
354170  136,682,274  93,624   7     1868 0.00  1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353901  136,494,186  271,765  64    1575 0.00  1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353797  136,398,174  75,924   26    1414 0.00  1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00
353384  135,839,805  84,662   4     815 0.00   1 1.00   0 0.00   0 0.00   0 0.00   1 4.97   0 0.00   0 0.00   0 0.00

The following plot shows the series expressed by the 1 chromosome sample that expresses 29 series 193_843 SNPs but does not express the series 4_815.

In [35]:
plt_obj = dm.superset_allele_mask(am_29, [dm.di_117_1685], [dm.di_4_815], min_match=0.001)
plt = plt_obj.do_plot()
am_not_4_815 = plt_obj.plot_context.yes_allele_mask
show(plt)
Out[35]:

<Bokeh Notebook handle for In[35]>

In [36]:
HTML(plt_obj.get_html())
Out[36]:
index   first        length   snps  alleles    matches  afr      afx      amr      eas      eur      sas      sax
353244  135,758,231  618,284  117   1685 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353478  135,933,921  434,642  123   1561 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354130  136,653,925  107,928  24    1504 0.00  1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354189  136,704,466  27,748   5     212 0.00   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354181  136,696,010  68,554   18    172 0.01   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354143  136,658,238  19,681   6     167 0.01   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
354156  136,666,402  102,253  17    121 0.01   1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
353866  136,468,307  15,591   5     18 0.06    1 1.00   1 4.97   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00   0 0.00
In [37]:
an.sa_193_843.alleles_per_snp(am_not_4_815)
Out[37]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

This data confirms that this chromosome sample is the one that expresses the divergent set of 29 series 193_843 SNPs.

In [38]:
aps_170, am_170 = an.sa_193_843.snps_from_aps_value(170, nam_193_843)
aps_170
Out[38]:
array([41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0])

This data shows, for each SNP in series 193_843, how many of the chromosome samples that express 170 of the series SNPs express that SNP. In this case, all 41 of the chromosome samples express the first 170 SNPs.
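
The per-SNP counts above can be illustrated in miniature with NumPy. The sketch below is not the code behind an.sa_193_843.snps_from_aps_value; the matrix values and both helper names are hypothetical, chosen only to show how a sample mask can be selected by SNP count and how an alleles-per-SNP array follows from that mask.

```python
import numpy as np

# Toy stand-in for one series: rows are SNPs, columns are chromosome samples.
# True means the sample expresses the SNP. These values are made up.
series_snps = np.array([
    [True,  True,  True,  False],
    [True,  True,  True,  False],
    [True,  True,  False, False],
    [False, False, True,  False],
])

def samples_with_snp_count(snp_matrix, count):
    """Boolean sample mask: samples expressing exactly `count` series SNPs."""
    return snp_matrix.sum(axis=0) == count

def alleles_per_snp(snp_matrix, sample_mask):
    """For each SNP, count how many of the selected samples express it."""
    return snp_matrix[:, sample_mask].sum(axis=1)

mask_3 = samples_with_snp_count(series_snps, 3)   # samples with exactly 3 SNPs
aps_3 = alleles_per_snp(series_snps, mask_3)      # per-SNP counts within that mask
print(mask_3)
print(aps_3)
```

In this toy case three samples express exactly 3 SNPs, and the per-SNP counts show which SNPs they share, in the same way that the run of 41s above shows that all 41 samples share the first 170 SNPs.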

In [39]:
plt_obj = dm.superset_allele_mask(am_170, min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
Out[39]:

<Bokeh Notebook handle for In[39]>

This plot indicates that a selection process associated with the emergence of the series 9_39 has resulted in overexpression of the first 170 series 193_843 SNPs. The series 9_39 appears to be rooted in a hierarchy formed by a recombination of 8_267 from the South Asian tree and the series including 32_1361 and 81_857 at the root of the upper region of the East Asian tree.

In [40]:
HTML(plt_obj.get_html())
Out[40]:
index   first        length   snps  alleles    matches  afr      afx      amr      eas      eur      sas      sax
354033  136,588,031  5,647    7     1760 0.02  34 0.83  1 0.15   0 0.00   9 1.91   0 0.00   21 3.07  0 0.00   3 0.72
353244  135,758,231  618,284  117   1685 0.02  41 1.00  1 0.12   0 0.00   13 2.29  0 0.00   24 2.91  0 0.00   3 0.60
354130  136,653,925  107,928  24    1504 0.02  31 0.76  1 0.16   0 0.00   9 2.10   0 0.00   18 2.89  0 0.00   3 0.79
353925  136,506,375  32,564   4     1442 0.02  34 0.83  1 0.15   0 0.00   9 1.91   0 0.00   21 3.07  0 0.00   3 0.72
353791  136,393,658  92,684   32    1361 0.03  41 1.00  1 0.12   0 0.00   13 2.29  0 0.00   24 2.91  0 0.00   3 0.60
353958  136,535,876  19,014   7     1303 0.03  34 0.83  1 0.15   0 0.00   9 1.91   0 0.00   21 3.07  0 0.00   3 0.72
354129  136,652,953  108,222  5     1296 0.02  31 0.76  1 0.16   0 0.00   9 2.10   0 0.00   18 2.89  0 0.00   3 0.79
353269  135,766,890  509,095  62    1265 0.03  41 1.00  1 0.12   0 0.00   13 2.29  0 0.00   24 2.91  0 0.00   3 0.60
353919  136,500,475  42,085   13    1227 0.03  34 0.83  1 0.15   0 0.00   9 1.91   0 0.00   21 3.07  0 0.00   3 0.72
354127  136,652,491  80,281   6     1114 0.03  31 0.76  1 0.16   0 0.00   9 2.10   0 0.00   18 2.89  0 0.00   3 0.79
353504  135,964,764  136,368  6     946 0.04   41 1.00  1 0.12   0 0.00   13 2.29  0 0.00   24 2.91  0 0.00   3 0.60
353902  136,494,985  278,653  81    857 0.04   31 0.76  1 0.16   0 0.00   9 2.10   0 0.00   18 2.89  0 0.00   3 0.79
353384  135,839,805  84,662   4     815 0.05   41 1.00  1 0.12   0 0.00   13 2.29  0 0.00   24 2.91  0 0.00   3 0.60
353521  135,989,333  275,298  6     713 0.06   41 1.00  1 0.12   0 0.00   13 2.29  0 0.00   24 2.91  0 0.00   3 0.60
353358  135,818,487  471,078  8     267 0.15   41 1.00  1 0.12   0 0.00   13 2.29  0 0.00   24 2.91  0 0.00   3 0.60
353559  136,028,572  409,893  9     39 0.97    38 0.93  1 0.13   0 0.00   13 2.47  0 0.00   21 2.75  0 0.00   3 0.64

The data associated with the plot shows that 38 of the 39 chromosome samples that express the series 9_39 express the first 170 SNPs of series 193_843.
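
The two fractions in the 9_39 row can be checked by hand. The reading used below is an assumption inferred from the numbers, not taken from the source code: the first fraction is taken to be the matches divided by all samples that express the series, and the second the matches divided by the number of samples in the mask. The variable names are illustrative only.

```python
# Values read from the 9_39 row of the table and from aps_170.
series_alleles = 39   # chromosome samples that express series 9_39
mask_samples = 41     # chromosome samples that express the first 170 SNPs
matches = 38          # samples in the mask that also express 9_39

# Assumed meaning of the two fractions reported in the row.
frac_of_series = matches / series_alleles
frac_of_mask = matches / mask_samples
print(f"{frac_of_series:.2f} {frac_of_mask:.2f}")
```

Under that reading the rounded values agree with the 0.97 and 0.93 shown in the row.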

In [41]:
aps_169, am_169 = an.sa_193_843.snps_from_aps_value(169, nam_193_843)
aps_169
Out[41]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

This data is for the 1 chromosome sample that expresses 169 series 193_843 SNPs. It expresses all but 1 of the first 170 SNPs.

In [42]:
plt_obj = dm.superset_allele_mask(am_169, min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
Out[42]:

<Bokeh Notebook handle for In[42]>

This plot shows that the 1 chromosome sample that expresses 169 series 193_843 SNPs is the remaining chromosome sample that expresses the series 9_39.

In [43]:
HTML(plt_obj.get_html())
Out[43]:
index   first        length   snps  alleles    matches  afr      afx      amr      eas      eur      sas      sax
353244  135,758,231  618,284  117   1685 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353791  136,393,658  92,684   32    1361 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353269  135,766,890  509,095  62    1265 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353906  136,496,493  57,432   9     1170 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353907  136,496,805  55,824   9     1023 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353984  136,556,805  190,280  39    1014 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353935  136,511,874  21,321   5     976 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353504  135,964,764  136,368  6     946 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353938  136,514,709  28,438   6     820 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353384  135,839,805  84,662   4     815 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353521  135,989,333  275,298  6     713 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353358  135,818,487  471,078  8     267 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353559  136,028,572  409,893  9     39 0.03    1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00

This data confirms that the 1 chromosome sample has matched all of the plotted series.

In [44]:
plt_obj = dm.superset_allele_mask(am_170, [dm.di_8_267, dm.di_81_857], [dm.di_9_39], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[44]:

<Bokeh Notebook handle for In[44]>

This plot shows the series association for chromosome samples that express 170 series 193_843 SNPs, the series 8_267, and the series 81_857, but not the series 9_39.

In [45]:
HTML(plt_obj.get_html())
Out[45]:
index   first        length   snps  alleles    matches  afr      afx      amr      eas      eur      sas      sax
354033  136,588,031  5,647    7     1760 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353244  135,758,231  618,284  117   1685 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
354130  136,653,925  107,928  24    1504 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353925  136,506,375  32,564   4     1442 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353791  136,393,658  92,684   32    1361 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353958  136,535,876  19,014   7     1303 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
354129  136,652,953  108,222  5     1296 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353269  135,766,890  509,095  62    1265 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353919  136,500,475  42,085   13    1227 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
354127  136,652,491  80,281   6     1114 0.00  2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353504  135,964,764  136,368  6     946 0.00   2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353902  136,494,985  278,653  81    857 0.00   2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353384  135,839,805  84,662   4     815 0.00   2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353521  135,989,333  275,298  6     713 0.00   2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00
353358  135,818,487  471,078  8     267 0.01   2 1.00   0 0.00   0 0.00   0 0.00   0 0.00   2 4.98   0 0.00   0 0.00

This data shows that this plot accounts for 2 of the chromosome samples that express 170 series 193_843 SNPs but do not express series 9_39.

In [46]:
plt_obj = dm.superset_allele_mask(am_170, [dm.di_8_267], [dm.di_9_39, dm.di_81_857], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)
Out[46]:

<Bokeh Notebook handle for In[46]>

This plot shows the series expressed by the 1 remaining chromosome sample that expresses 170 series 193_843 SNPs but does not express the series 9_39.

In [47]:
HTML(plt_obj.get_html())
Out[47]:
index   first        length   snps  alleles    matches  afr      afx      amr      eas      eur      sas      sax
353244  135,758,231  618,284  117   1685 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353791  136,393,658  92,684   32    1361 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353269  135,766,890  509,095  62    1265 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353906  136,496,493  57,432   9     1170 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353907  136,496,805  55,824   9     1023 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353984  136,556,805  190,280  39    1014 0.00  1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353935  136,511,874  21,321   5     976 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353504  135,964,764  136,368  6     946 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353938  136,514,709  28,438   6     820 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353384  135,839,805  84,662   4     815 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353521  135,989,333  275,298  6     713 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00
353358  135,818,487  471,078  8     267 0.00   1 1.00   0 0.00   0 0.00   0 0.00   0 0.00   1 4.98   0 0.00   0 0.00

The data confirms that the plotted association of series is expressed by 1 chromosome sample.
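
The selections in the cells above pass a starting sample mask, a list of series the samples must express, and a list of series they must not express. A minimal sketch of that kind of selection with boolean arrays follows. It is an illustration under stated assumptions, not the implementation of dm.superset_allele_mask; the function name combine_masks and the toy masks are hypothetical.

```python
import numpy as np

def combine_masks(base_mask, yes_masks=(), no_masks=()):
    """Keep samples in base_mask that are in every yes mask and in no no mask."""
    result = np.array(base_mask, dtype=bool)      # copy so the input is untouched
    for mask in yes_masks:
        result &= np.asarray(mask, dtype=bool)    # must express this series
    for mask in no_masks:
        result &= ~np.asarray(mask, dtype=bool)   # must not express this series
    return result

# Six toy chromosome samples; 1 means the sample expresses the item.
base     = [1, 1, 1, 1, 0, 0]   # e.g. samples selected by SNP count
series_a = [1, 1, 0, 1, 1, 0]   # required series
series_b = [1, 0, 0, 1, 1, 1]   # required series
series_c = [1, 0, 0, 0, 0, 0]   # excluded series

selected = combine_masks(base, [series_a, series_b], [series_c])
print(selected.astype(int), int(selected.sum()))
```

Moving one series from the required list to the excluded list, as successive cells above do, partitions the starting samples into groups that do not overlap, which is how each remaining sample ends up accounted for.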

SNP Series

SNP series are documented with a table that shows the properties of the series and the regional population samples that express it. The individual series SNPs are documented in a separate table that shows the properties of each individual SNP.

The column in the SNP data tables labeled "niv" means not_expressed_is_variant. The SNP allele with the lowest frequency in the 1000 genomes data is always considered to be the variant, even when the reference genome labeled the more common allele as the variant. An niv value of 1 means the allele considered as the variant for this analysis is the one that the reference genome considers to be the standard. Note that the word allele is used to indicate a single chromosome sample that expresses an individual SNP or a SNP series.
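
The niv rule can be sketched for a single SNP as follows. This is an illustration of the rule as described, not the code used to build the tables; the function name variant_view, the toy call arrays, and the handling of an exact 50/50 split (treated here as niv 0) are assumptions.

```python
import numpy as np

def variant_view(labeled_variant_calls):
    """Return (niv, expressed) for one SNP.

    labeled_variant_calls: boolean array, True where a chromosome sample
    carries the allele that the source data labels as the variant.
    """
    calls = np.asarray(labeled_variant_calls, dtype=bool)
    # niv = 1 when the labeled variant is the more common allele, so the
    # other, less frequent allele is treated as the variant instead.
    niv = int(calls.sum() * 2 > calls.size)
    expressed = ~calls if niv else calls.copy()
    return niv, expressed

# Labeled variant carried by 4 of 5 samples: the labels are flipped.
niv_a, expressed_a = variant_view([True, True, True, False, True])
# Labeled variant carried by 1 of 5 samples: the labels are kept.
niv_b, expressed_b = variant_view([False, True, False, False, False])
print(niv_a, expressed_a.astype(int))
print(niv_b, expressed_b.astype(int))
```

In both toy cases exactly one sample ends up expressing the SNP, because the less frequent allele is the one counted.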

In [48]:
HTML(dm.series_html(dm.di_11_765))
Out[48]:

11_765 series

index   first        length   snps  alleles  afr       afx       amr       eas       eur       sas       sax
353380  135,837,906  870,076  11    765      2 0.01    32 0.67   146 1.38  0 0.00    484 3.15  56 1.01   45 0.48

11_765 series snps

index   pos          id          niv  alleles  afr       afx       amr       eas       eur       sas       sax
875450  135,837,906  rs7570971   1    785      2 0.01    32 0.65   149 1.37  0 0.00    493 3.13  60 1.05   49 0.51
875770  135,907,088  rs6730157   1    791      2 0.01    32 0.65   151 1.38  0 0.00    493 3.10  61 1.06   52 0.54
875981  135,954,797  rs1375131   1    789      2 0.01    32 0.65   149 1.36  0 0.00    492 3.10  61 1.06   53 0.55
876953  136,138,627  rs3940549   1    785      2 0.01    32 0.65   150 1.38  0 0.00    492 3.12  61 1.07   48 0.50
877231  136,176,540  rs13384711  1    795      3 0.02    32 0.64   151 1.37  0 0.00    493 3.09  63 1.09   53 0.54
877937  136,328,890  rs56369224  1    793      2 0.01    32 0.64   151 1.37  2 0.01    492 3.09  62 1.08   52 0.53
878126  136,381,348  rs12465802  1    795      2 0.01    32 0.64   151 1.37  0 0.00    496 3.11  62 1.07   52 0.53
878351  136,429,366  rs62168795  1    807      2 0.01    34 0.67   154 1.38  0 0.00    506 3.12  60 1.02   51 0.52
879308  136,608,646  rs4988235   0    808      2 0.01    34 0.67   150 1.34  0 0.00    511 3.15  60 1.02   51 0.51
879345  136,616,754  rs182549    0    818      2 0.01    34 0.66   153 1.35  0 0.00    512 3.12  63 1.06   54 0.54
879828  136,707,982  rs6754311   1    821      0 0.00    34 0.66   154 1.35  0 0.00    514 3.12  62 1.04   57 0.57