import numpy as np
from IPython.display import HTML
from bokeh.plotting import output_notebook, show
import genomes_dnj.lct_interval.series_plots as dm
import genomes_dnj.lct_interval.anal_series as an
import genomes_dnj.lct_interval_snp_anal.lct_interval_snp_anal as snp_anal
import genomes_dnj.lct_interval.series_masks as sm
from genomes_dnj.lct_interval.bokeh_match_count_plot import snp_match_count_plot_cls as smcp
output_notebook(hide_banner=True)

Methods

Full details of all of the methods are available in the source code. This notebook describes them and provides some examples to illustrate their use.

Identification of Series

SNPs were grouped into series through an algorithmic process. The input to that process was a two-dimensional boolean array. There was a row for each SNP expressed by at least 16 of the 1000 genomes chromosome samples. There were 5008 columns, one for each 1000 genomes chromosome sample. Values were true if a sample expressed a SNP and false if it did not.

The process for identifying a series started with a specific SNP and an array of the column indexes for the samples that expressed that SNP. Those indexes were used to generate an array of counts of the number of the test samples that expressed each SNP farther along the chromosome. The counts were compared with the 90% match criterion to identify additional SNPs that were expressed by 90% of the test samples. A second array was generated with counts of all samples that expressed each of the SNPs selected by the initial match process. The second match check required the count of the test samples expressing the SNP to be at least 90% of the count of all samples expressing the SNP.
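The two match checks described above can be sketched with NumPy. This is a minimal illustration on toy data, not the project's implementation: the function name `matching_snps`, the array `expr`, and the assumption that rows are ordered by chromosome position are all hypothetical.

```python
import numpy as np

def matching_snps(expr, seed_row, min_match=0.9):
    """Rows after seed_row that pass both match checks (illustrative only)."""
    test_cols = np.flatnonzero(expr[seed_row])      # samples expressing the seed SNP
    test_counts = expr[:, test_cols].sum(axis=1)    # test samples expressing each SNP
    all_counts = expr.sum(axis=1)                   # all samples expressing each SNP
    # Check 1: the SNP is expressed by at least min_match of the test samples.
    first = test_counts >= min_match * test_cols.size
    # Check 2: the test samples are at least min_match of all samples expressing the SNP.
    second = test_counts >= min_match * all_counts
    passed = first & second
    passed[: seed_row + 1] = False                  # only look farther along the chromosome
    return np.flatnonzero(passed)

# Toy data: 4 SNPs (rows) by 12 samples (columns).
expr = np.zeros((4, 12), dtype=bool)
expr[0, :10] = True   # seed SNP, expressed by samples 0-9
expr[1, :9] = True    # 9 of the 10 test samples and nobody else: passes both checks
expr[2, :12] = True   # all test samples plus 2 others: 10/12 fails the second check
expr[3, :5] = True    # only 5 of the 10 test samples: fails the first check
matches = matching_snps(expr, seed_row=0)
print(matches)
```

Only the second SNP survives in this toy case, because the third is shared with too many samples outside the test set and the fourth with too few inside it.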

This process was carried out recursively by making the last identified SNP in a series the start of another attempt to find more SNPs in the series. The process was continued until a 1,000,000 base interval was scanned without finding any additional SNPs in the series. The cases where this process resulted in at least 4 SNPs were identified as a series. Samples that expressed at least 90% of the SNPs in a series were considered to express the series.

Rounding can result in the inclusion of samples that express less than 90% of the SNPs. The recursive process can favor identification of series that have some history of fragment generation of the lower part of the series. But samples that express only the fragments are not counted as expressing the series.
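The effect of rounding can be illustrated with the 117 SNP series. The sketch below assumes the minimum SNP count is 0.9 times the series length rounded to the nearest whole number; that rule is an inference, not taken from the source code, and the helper name `min_snps_to_express` is hypothetical. The pairs are the upper tail of the 117_1685 SNP distribution shown later in this notebook.

```python
def min_snps_to_express(n_snps, min_match=0.9):
    # Assumption: the threshold is rounded to the nearest whole SNP.
    return int(round(min_match * n_snps))

# (number of 117_1685 SNPs expressed, number of chromosome samples)
tail_117 = [(100, 5), (102, 5), (104, 1), (105, 6), (106, 70), (107, 5),
            (108, 6), (109, 1), (110, 1), (111, 41), (112, 3), (113, 9),
            (114, 12), (115, 55), (116, 294), (117, 1182)]

threshold = min_snps_to_express(117)
expressing = sum(samples for snps, samples in tail_117 if snps >= threshold)
print(threshold, expressing)
```

Under this assumption the threshold is 105 SNPs, which is about 89.7% of 117, and the samples at or above it add up to the 1685 in the series name.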

Yes No Comparisons

Plots were constructed by selecting chromosome samples that satisfied some filter criteria. The criteria are based on a yes list of series and a no list of series. To be selected a chromosome sample must express all of the series in the yes list and none of the series in the no list.
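A yes no comparison amounts to combining boolean sample masks. The following is a minimal sketch on toy data; `yes_no_mask` and the series names `A`, `B`, and `C` are hypothetical and do not come from the `genomes_dnj` package.

```python
import numpy as np

def yes_no_mask(series_masks, yes, no):
    """Select samples expressing every series in yes and no series in no."""
    n_samples = len(next(iter(series_masks.values())))
    selected = np.ones(n_samples, dtype=bool)
    for name in yes:
        selected &= series_masks[name]     # must express each yes series
    for name in no:
        selected &= ~series_masks[name]    # must not express any no series
    return selected

# Toy masks over 6 chromosome samples.
series_masks = {
    "A": np.array([1, 1, 1, 1, 0, 0], dtype=bool),
    "B": np.array([1, 1, 0, 0, 1, 0], dtype=bool),
    "C": np.array([1, 0, 0, 0, 0, 0], dtype=bool),
}
selected = yes_no_mask(series_masks, yes=["A", "B"], no=["C"])
print(np.flatnonzero(selected))
```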

Series Representation

Series are identified by the number of SNPs in the series and the number of 1000 genomes chromosome samples that express it. The series 11_765 associated with lactase persistence includes 11 SNPs and is expressed by 765 of the 5008 1000 genomes samples of chromosome 2.

import genomes_dnj.lct_interval.plot_colors as clr
HTML(clr.pop_colors)

Series are represented in a plot by a colored rectangle. The hue of the rectangle represents the most overexpressed population for the chromosome samples that express the series. More saturated colors represent more overexpressed populations. The table above shows the fully saturated colors used to represent overexpression of series by different 1000 genomes populations. The plot x axis maps to positions on chromosome 2. The height of the rectangle represents the log of the number of 1000 genomes chromosome 2 samples that express the series. Each SNP in the series is represented with a vertical line. Height and color are properties of the series regardless of how many chromosome samples express the series in a particular plot.

Superset Plots

Most plots in these notebooks are superset plots. The intent of a superset plot is to find all of the series in the studied region expressed by some set of chromosome samples. Some of those series may be expressed specifically by the samples that satisfy the filter test. For other series those samples are just a subset of the 1000 genomes chromosome samples that express the series. A yes no comparison is used to select input chromosome samples. The output series are identified by trying to match the input samples with the samples that express each of the series in the studied region. Those series that meet the match criterion are added to the result set. When the intent was to limit the series to those expressed by all of the selected chromosome samples, the match criterion was generally 90%. When the intent was to identify all of the series most commonly expressed by the selected chromosome samples, the match criterion was generally 50%. The use of a 50% match prevents the selection of overlapping series that are expressed by different subsets of the selected chromosome samples. When the intent was to display a complete hierarchy of series that are all expressed by all the chosen set of chromosome samples, the match criterion was generally 0.1%. The data associated with each plot shows the number of input samples that actually express each of the result series.
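The superset match can be sketched as the fraction of the input samples that express each candidate series. This is an illustration under assumed names (`superset_matches`, the toy series `X`, `Y`, `Z`); the real selection is done inside `dm.superset_yes_no`.

```python
import numpy as np

def superset_matches(input_mask, series_masks, min_match):
    """Series expressed by at least min_match of the input samples."""
    n_input = int(input_mask.sum())
    result = {}
    for name, mask in series_masks.items():
        matches = int((input_mask & mask).sum())   # input samples expressing the series
        if matches >= min_match * n_input:
            result[name] = matches
    return result

# Toy data: 10 samples, the first 4 are the input set.
input_mask = np.zeros(10, dtype=bool)
input_mask[:4] = True
series_masks = {
    "X": np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=bool),  # 4 of 4 input samples
    "Y": np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=bool),  # 3 of 4
    "Z": np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1], dtype=bool),  # 1 of 4
}
strict = superset_matches(input_mask, series_masks, 0.9)
common = superset_matches(input_mask, series_masks, 0.5)
everything = superset_matches(input_mask, series_masks, 0.001)
print(strict, common, everything)
```

Lowering the criterion from 0.9 to 0.5 to 0.001 widens the result set from series shared by essentially all input samples to any series expressed by at least one of them.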

lct_plt_obj = dm.superset_yes_no([dm.di_11_765], min_match=0.9)
plt = lct_plt_obj.do_plot()
show(plt)

This plot is an example of a superset plot. The only criterion for selecting chromosome samples is the expression of the series 11_765. The series in the plot are ones that are expressed by at least 90% of the 1000 genomes phase 3 chromosome 2 samples that express the series 11_765.

HTML(lct_plt_obj.get_html())

This table shows the plot data for the 11_765 superset plot. Each matched series has a row in the table. Series are ordered by the number of 1000 genomes chromosome samples that express the series. Matches give the number of those samples selected for the plot that express the particular series.

For example, the top line indicates that the 10 SNP series expressed by 2206 chromosome 2 samples is expressed by all 765 of the chromosome samples that satisfied this plot's yes no criteria. Those samples were 35% of the total 2206 1000 genomes chromosome 2 samples that express the series 10_2206. Those samples were 100% of the 765 chromosome samples that satisfied this plot's yes no criteria.

Number of matches and relative population expression are broken out by population. Both the African and South Asian populations are divided between those that came from the native area and those who live externally. The data in this table was the reason for that choice. The 32 instances of 11_765 found in American Southwest and Caribbean African populations are very unlikely to have come from Africa and significantly overrepresent the frequency of lactase persistence in African populations.

For calculations of population overexpression chromosome samples that failed the no test were excluded from the population under consideration. The relative expression calculation is based on a ratio of the samples that matched to the population size minus the samples that failed the no test.
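The relative expression values in the table can be reproduced with a simple ratio of rates. The sketch below is an assumption about the calculation, not the project's code: the function name and the treatment of the overall denominator are hypothetical, and the chromosome counts 1006 (EUR) and 694 (AMR) assume 503 and 347 individuals in those 1000 genomes phase 3 populations. The match counts come from the 11_765 row of the table, where no samples fail a no test.

```python
def relative_expression(pop_matches, pop_size, total_matches, total_size,
                        pop_failed_no=0, total_failed_no=0):
    """Population match rate divided by the overall match rate."""
    # Samples that failed the no test are removed from the denominators.
    pop_rate = pop_matches / (pop_size - pop_failed_no)
    overall_rate = total_matches / (total_size - total_failed_no)
    return pop_rate / overall_rate

eur = relative_expression(484, 1006, 765, 5008)
amr = relative_expression(146, 694, 765, 5008)
print(round(eur, 2), round(amr, 2))
```

Both values round to the figures listed for 11_765 in the table (3.15 for eur and 1.38 for amr), which suggests the ratio is at least close to what the notebook computes.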

Subset Plots

A subset plot starts with a test sample set and identifies series that are expressed by some subset of that sample set. The samples that express each of the series in the result set must be a subset of the samples in the test set up to some match criterion. Generally, a 90% match criterion was used for these plots.
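The subset test reverses the direction of the superset test: the fraction is taken over the samples that express each series rather than over the input samples. The following sketch uses hypothetical names (`subset_matches`, toy series `P`, `Q`, `R`) and is only meant to show the direction of the comparison.

```python
import numpy as np

def subset_matches(test_mask, series_masks, min_match):
    """Series whose expressing samples fall mostly inside the test set."""
    result = {}
    for name, mask in series_masks.items():
        total = int(mask.sum())                   # all samples expressing the series
        inside = int((mask & test_mask).sum())    # those that are also in the test set
        if total and inside >= min_match * total:
            result[name] = (inside, total)
    return result

# Toy data: 20 samples, the first 10 form the test set.
test_mask = np.zeros(20, dtype=bool)
test_mask[:10] = True
series_masks = {name: np.zeros(20, dtype=bool) for name in ("P", "Q", "R")}
series_masks["P"][0:5] = True     # 5 of 5 inside the test set
series_masks["Q"][5:11] = True    # 5 of 6 inside the test set
series_masks["R"][8:14] = True    # 2 of 6 inside the test set
at_90 = subset_matches(test_mask, series_masks, 0.9)
at_80 = subset_matches(test_mask, series_masks, 0.8)
print(at_90, at_80)
```

Relaxing the criterion from 0.9 to 0.8 admits the series `Q`, which has lost one of its six samples from the test set.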

plt_obj = dm.subset_yes_no([dm.di_193_843], min_match=0.8)
plt = plt_obj.do_plot()
show(plt)

The plot above shows series that are expressed by chromosome 2 samples that are subsets of the chromosome samples that express the series 193_843. Those series are all part of the hierarchy that has been named the SAS tree. The match criterion has been relaxed to 0.8 because some of the chromosome samples that express the series 8_267 have experienced a recombination event that caused the loss of enough 193_843 SNPs for those samples not to match 193_843.

HTML(plt_obj.get_html())

The data in the table above shows 19 series where 100% of the chromosome samples that express the series also express 193_843.

Three other series, 4_815, 6_713, and 8_267, are of particular interest. All three are contained within the interval of chromosome 2 covered by 193_843. Recombination events that lose association with 193_843 must be somewhere within the DNA covered by the 193_843 series. More than 80% of the chromosome samples that express any of the three series also express 193_843. But a significant number of those chromosome samples do not.

The reasons are two recombination events that are analyzed in the part of this notebook focused on allele masks. One involved a chromosome that expressed the series 8_267 and 6_713. It resulted from a recombination event that lost the last 23 SNPs of 193_843. The result of that recombination event became overexpressed through a selection process associated with the appearance of the series 9_39.

The other recombination event involved a chromosome that expressed 193_843 and the other major series that form the root of the SAS tree with a chromosome that expressed the series 5_684 that is a major component of the EAS tree hierarchy. That recombination event resulted in a number of overexpressed variations on the association of 4_815 and 5_684 that appear to have conferred some kind of selective advantage on the recombinant chromosomes.

Another class of cases includes 15_59, 20_56, 5_28, and 5_25. All of these series cover parts of the studied region of chromosome 2 that are beyond the end of 193_843. As is commonly the case, hierarchies selected by these series have experienced recombination events between the more stable lower part of the region and the less stable upper part.

The two other cases are the series 10_25 and 8_17. Both have experienced recombination events within the region covered by 193_843. But neither case has generated an overexpressed result.

Basis Plots

A basis plot is used to match samples to the most specific series in an hierarchy that the samples express. Basis filtering can be used in combination with yes no comparisons to explore recombination events for samples that express particular parts of an hierarchy.

plt_obj = dm.basis_yes_no([dm.di_6_1503, dm.di_26_1414])
plt = plt_obj.do_basis_plot()
show(plt)

This plot shows the basis series for all of the chromosome 2 samples in the phase 3 1000 genomes data that express both the series 6_1503 and 26_1414. These criteria identify the chromosome 2 samples that express hierarchies that are part of the EUR tree.

Series are labeled in the top left corner by their id. The number at the end of the label is the number of chromosome 2 samples that do express the series and do not express any other more specific series in the lactase persistence region of chromosome 2. For example, the 135 chromosome 2 samples that express the series 4_911 as a basis series do not express the series 11_765. All but 3 of the chromosome samples that express 11_765 also express 4_911. But, since 911 chromosome samples express 4_911 and only 765 chromosome samples express 11_765, 11_765 is the more specific series.
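The idea of a basis series can be sketched by assigning each sample to the most specific series it expresses. This sketch makes two simplifying assumptions that may not hold in the actual code: "more specific" is taken to mean "expressed by fewer samples", as in the 4_911 and 11_765 example, and each sample gets a single basis series. The function and series names are hypothetical.

```python
import numpy as np

def basis_assignment(series_masks):
    """Assign each sample the least widely expressed series that it expresses."""
    # Most specific (fewest expressing samples) first.
    names = sorted(series_masks, key=lambda name: int(series_masks[name].sum()))
    n_samples = len(series_masks[names[0]])
    basis = np.full(n_samples, -1)            # -1 means no series expressed
    for idx, name in enumerate(names):
        take = series_masks[name] & (basis == -1)
        basis[take] = idx                     # only samples not claimed by a more specific series
    counts = {name: int((basis == idx).sum()) for idx, name in enumerate(names)}
    return names, basis, counts

# Toy data over 8 samples: "narrow" is mostly, but not entirely, nested in "broad".
series_masks = {
    "broad": np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=bool),
    "narrow": np.array([1, 1, 1, 0, 0, 0, 1, 0], dtype=bool),
}
names, basis, counts = basis_assignment(series_masks)
print(counts)
```

Here 6 samples express `broad`, but only 3 of them express it as a basis, because the other 3 also express the more specific `narrow`.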

The color used to represent the basis series is the color used to represent the whole series. But the height of the basis series rectangles is a function of the log of the number of chromosome samples that actually express it as a basis series.

HTML(plt_obj.get_basis_html())

The data accompanying the basis plots needs to be consulted to understand the population distribution of the chromosome samples that express a particular series as a basis. The data in this table for the series 26_1414, 4_911, and 11_765 provide some insight in the evolution of lactase persistence.

The 202 chromosome samples that express 26_1414 as a basis suggest that the root series of the EUR tree already had some kind of selective advantage. Those samples are heavily under expressed in African populations, modestly under expressed in European populations, modestly overexpressed in American and South Asian populations, and most heavily overexpressed in East Asian populations. The root series of the EUR tree does seem to have been favored by the expansion out of Africa. But, it appears to have been distributed over all of the other 1000 genomes population regions.

The 135 chromosomes that express 4_911 as a basis do not include any African or East Asian chromosomes. They are modestly overexpressed for European populations, more overexpressed for South Asian populations and most overexpressed for American populations. It seems unlikely that the American samples got to America through the migration either of East Asian or European populations.

Series SNP distributions

Another kind of plot is used to examine the number of SNPs from a series that are expressed by samples from the 1000 genomes data. This plot shows number of series SNPs on the x axis and number of 1000 genomes chromosome samples that express that number of the series SNPs on the y axis.
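The data behind this kind of plot can be sketched as a count of series SNPs per sample followed by a tally of how many samples share each count. The function name `snps_per_sample_distribution` and the toy array are hypothetical; the notebook itself obtains this data from `unique_snps_per_allele`.

```python
import numpy as np

def snps_per_sample_distribution(series_expr, sample_mask=None):
    """Pairs of (number of series SNPs expressed, number of samples)."""
    per_sample = series_expr.sum(axis=0)          # series SNPs expressed by each sample
    if sample_mask is not None:
        per_sample = per_sample[sample_mask]      # optionally restrict the samples
    values, counts = np.unique(per_sample, return_counts=True)
    return list(zip(values.tolist(), counts.tolist()))

# Toy series: 3 SNPs (rows) by 6 samples (columns).
series_expr = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
], dtype=bool)
distribution = snps_per_sample_distribution(series_expr)
print(distribution)
```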

This example shows that kind of plot for the 117 SNPs from the series 117_1685.

count_data = an.sa_117_1685.unique_snps_per_allele()
plt_obj = smcp(count_data)
plt = plt_obj.do_plot()
show(plt)

This plot shows the numbers of samples that express different numbers of SNPs from the series 117_1685.

count_data

array([(0, 2680), (1, 153), (2, 15), (3, 7), (4, 1), (5, 4), (6, 2),
       (7, 1), (8, 1), (10, 2), (11, 5), (12, 1), (13, 1), (17, 3),
       (20, 1), (21, 13), (22, 3), (24, 2), (43, 22), (44, 2), (46, 1),
       (47, 1), (50, 1), (53, 1), (54, 1), (57, 1), (58, 1), (59, 11),
       (60, 28), (61, 167), (62, 53), (63, 2), (64, 1), (68, 1), (70, 3),
       (75, 1), (83, 2), (87, 2), (92, 6), (93, 1), (94, 2), (95, 14),
       (96, 87), (97, 4), (98, 1), (100, 5), (102, 5), (104, 1), (105, 6),
       (106, 70), (107, 5), (108, 6), (109, 1), (110, 1), (111, 41),
       (112, 3), (113, 9), (114, 12), (115, 55), (116, 294), (117, 1182)], 
      dtype=[('count', '<u2'), ('snps', '<u2')])

This data shows the SNP fragment size and the number of samples expressing that number of 117_1685 SNPs as an array of tuples. Each tuple associates a number of SNPs with the number of samples that express that number of SNPs.

In this example the distribution is strongly peaked with 2680 samples expressing none of the 117 SNPs and 1182 samples expressing all 117 of them. This kind of result was observed for all of the 76 series documented in the detailed results of this study.

The distribution includes several cases of peaks for significant numbers of samples expressing the same size fragment of the 117 SNPs. Investigations documented in the detailed results showed that multiple fragments of the same size generally indicate expression of multiple instances of fragments with the same or similar sequences of SNPs.

sa_64_1575 = snp_anal.series_anal_cls(dm.di_64_1575, sm.series_data)
count_data = sa_64_1575.unique_snps_per_allele()
plt_obj = smcp(count_data)
plt = plt_obj.do_plot()
show(plt)

This plot shows the SNP distribution analysis for the series 64_1575. It is another case where substantial numbers of fragments exist that generally can be traced back to recombination events and some kind of process that selected the fragment for overexpression.

count_data

array([(0, 2689), (1, 191), (2, 136), (3, 2), (4, 33), (5, 40), (6, 25),
       (7, 4), (9, 7), (10, 1), (11, 17), (12, 1), (13, 1), (14, 1),
       (15, 2), (20, 3), (21, 42), (22, 4), (23, 1), (26, 7), (27, 3),
       (28, 9), (29, 1), (30, 5), (31, 37), (32, 6), (33, 4), (35, 1),
       (36, 12), (37, 6), (39, 2), (40, 50), (41, 8), (42, 1), (43, 15),
       (44, 2), (46, 2), (49, 6), (50, 1), (51, 2), (52, 46), (53, 3),
       (54, 1), (57, 3), (58, 31), (61, 21), (62, 19), (63, 135),
       (64, 1369)], 
      dtype=[('count', '<u2'), ('snps', '<u2')])

This data is the array of SNP counts and sample numbers for the 64_1575 analysis.

Allele Mask Plots

Detailed analysis of series fragments can be carried out with allele mask plots. An allele mask is an array with a boolean value for each of the 5008 chromosome samples in the 1000 genomes phase 3 data. The value is true for those chromosome samples that satisfy some logical test. Allele masks for samples that express fragments of a specific size from some series are useful for analyzing those series fragments.
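A minimal sketch of an allele mask and its complement follows. It assumes a sample expresses a series when it carries at least 90% of the series SNPs, rounded to a whole SNP, and it uses a toy array in place of the real data. The names `am` and `nam` only mirror the naming pattern of the imported masks and are not the package's objects.

```python
import numpy as np

# Toy series: 10 SNPs (rows) by 5 samples (columns).
# Sample k expresses the first snp_counts[k] SNPs of the series.
snp_counts = [10, 9, 8, 1, 0]
series_expr = np.array([[row < count for count in snp_counts] for row in range(10)])

per_sample = series_expr.sum(axis=0)       # series SNPs expressed by each sample
threshold = int(round(0.9 * 10))           # assumed rule: 90% of the SNPs, rounded
am = per_sample >= threshold               # samples that express the series
nam = ~am                                  # samples that do not
print(am, nam)
```

The two masks partition the samples, so any analysis run once with `am` and once with `nam` covers every chromosome sample exactly once.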

from genomes_dnj.lct_interval.series_masks import am_193_843, nam_193_843
an.sa_193_843.unique_snps_per_allele(am_193_843)

array([(174, 4), (175, 1), (181, 3), (185, 1), (186, 1), (188, 2),
       (191, 15), (192, 120), (193, 696)], 
      dtype=[('count', '<u2'), ('snps', '<u2')])

The an.sa_193_843 object is an instance of series_anal_cls. It is used for analysis of the samples that express any subset of the 193 SNPs in the series 193_843. The allele mask am_193_843 is a boolean mask that identifies the 1000 genomes chromosome 2 samples that express the series 193_843. The unique_snps_per_allele method returns an array of pairs, each giving a number of SNPs and the number of chromosome samples that express that number of 193_843 SNPs. This data shows that 696 of the 843 chromosome samples that express 193_843 express all 193 of its SNPs. Another 120 of the chromosomes express 192 of them. Smaller numbers of chromosome samples express smaller numbers of SNPs that are fragments of the whole series.

an.sa_193_843.unique_snps_per_allele(nam_193_843)

array([(0, 3458), (1, 215), (2, 174), (3, 2), (4, 3), (5, 52), (6, 8),
       (7, 3), (8, 2), (9, 1), (10, 1), (11, 1), (12, 39), (15, 1),
       (17, 1), (18, 3), (19, 1), (21, 1), (23, 3), (26, 1), (27, 1),
       (28, 12), (29, 94), (30, 1), (44, 2), (54, 1), (55, 27), (63, 2),
       (65, 2), (98, 1), (100, 1), (101, 1), (139, 1), (158, 1), (164, 3),
       (165, 3), (169, 1), (170, 41)], 
      dtype=[('count', '<u2'), ('snps', '<u2')])

The nam_193_843 boolean mask identifies chromosome 2 samples that do not express 193_843. The results are a list of numbers of 193_843 SNPs and the number of chromosome 2 samples that express that number of 193_843 SNPs. The results show that 3458 samples express none of the 193 SNPs.

The data does show that a significant number of samples express 1 or 2 of the 193 SNPs. It also shows that 5, 6, 12, 28, 29, 55 and 170 SNP fragments are expressed by numbers of samples that indicate substantial overexpression. There also is a background of other fragments of the 193 SNPs that could be available for some kind of future selection process.

aps_29, am_29 = an.sa_193_843.snps_from_aps_value(29, nam_193_843)
aps_29

array([93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93,
       93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,
        0,  1,  0,  0,  0,  0,  0,  0,  1,  0,  0,  1,  0,  1,  0,  0,  1,
        0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  1,  1,  1,  0,  0,
        0,  0,  1,  0,  1,  0,  0,  0,  0,  1,  0,  1,  0,  0,  1,  1,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  0,  0,  0,  0,  1,
        1,  0,  0,  1,  0,  0,  1,  0,  1,  0,  1,  1,  0,  0,  1,  0,  0,
        0,  1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0])

This data looks at the 29 SNP fragments that are expressed by 94 chromosome samples. The method snps_from_aps_value takes a count of the number of SNPs and a boolean mask of chromosome 2 samples. It returns an array of the number of samples that express each of the 193 SNPs and a boolean mask of samples that express 29 SNP fragments of the 193 SNPs. Only samples identified in the input mask are eligible for the output mask.

In this case 93 of the chromosome samples express the first 29 SNPs. That pattern reflects the result of some recombination event. The other chromosome sample expresses a very irregular collection of 29 different SNPs generated by some kind of unknown process.
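The behavior described for snps_from_aps_value can be sketched on a toy array. The function below is a hypothetical stand-in written from that description, not the method's actual code, and `snps_from_count` is an invented name.

```python
import numpy as np

def snps_from_count(series_expr, n_snps, in_mask):
    """Per-SNP sample counts and a mask for samples expressing exactly n_snps."""
    per_sample = series_expr.sum(axis=0)
    # Only samples in the input mask are eligible for the output mask.
    out_mask = in_mask & (per_sample == n_snps)
    per_snp = series_expr[:, out_mask].sum(axis=1)   # selected samples expressing each SNP
    return per_snp, out_mask

# Toy series: 5 SNPs (rows) by 6 samples (columns).
series_expr = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
], dtype=bool)
in_mask = np.array([1, 1, 1, 1, 0, 1], dtype=bool)   # sample 4 is excluded up front
per_snp, out_mask = snps_from_count(series_expr, 2, in_mask)
print(per_snp, out_mask)
```

The per-SNP counts separate two samples that share the first two SNPs from one sample that expresses a different pair, the same kind of pattern seen in the 29 SNP fragment data.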

plt_obj = dm.superset_allele_mask(am_29, min_match=0.5)
plt = plt_obj.do_plot()
show(plt)

This plot shows the series expressed by at least 50% of the chromosome samples that express a 29 SNP fragment of the series 193_843 SNPs. It identifies the results of recombination events that have associated the series 4_815 from the South Asian tree with 5_684 and 32_1361 from the East Asian tree and with 64_1575, 10_2206, and 7_1868 from the European tree.

HTML(plt_obj.get_html())

The data from the plot show that 93 out of the 94 chromosome samples that express 29 193_843 SNPs express the series 4_815.

plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815], min_match=0.5)
plt = plt_obj.do_plot()
am_4_815 = plt_obj.plot_context.yes_allele_mask
show(plt)

HTML(plt_obj.get_html())

The data from the plot show that 87 of the 93 chromosome samples that express 29 193_843 SNPs and the series 4_815 also express the series 5_684.

an.sa_193_843.alleles_per_snp(am_4_815)

array([93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93,
       93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93, 93,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0])

This data confirms that the 93 chromosome samples that express 4_815 are the 93 chromosome samples that express the first 29 SNPs of the series 193_843.

plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_5_684, dm.di_32_1361, dm.di_64_1575], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)

This plot limits the chromosome samples to those that express all of the 6 series shown in the plot and no others.

HTML(plt_obj.get_html())

The data shows that 58 chromosomes satisfy the plot's criteria. All but 1 of them are East Asian.

The next 6 plots show the series expressed by the 6 chromosome samples that express the first 29 193_843 SNPs, series 4_815, but not series 5_684. Every one of those chromosome samples expresses a different association of series.

plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_32_1361, dm.di_81_857, dm.di_5_47], [dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)

HTML(plt_obj.get_html())

plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_32_1361, dm.di_81_857], [dm.di_5_47, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)

HTML(plt_obj.get_html())

plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_32_1361], [dm.di_81_857, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)

HTML(plt_obj.get_html())

plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_5_1460, dm.di_117_1685], 
                                  [dm.di_32_1361, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)

HTML(plt_obj.get_html())

plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815, dm.di_5_1460], 
                                  [dm.di_117_1685, dm.di_32_1361, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)

HTML(plt_obj.get_html())

plt_obj = dm.superset_allele_mask(am_29, [dm.di_4_815], [dm.di_5_1460, dm.di_32_1361, dm.di_5_684], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)

HTML(plt_obj.get_html())

The following plot shows the series expressed by the 1 chromosome sample that expresses 29 193_843 SNPs but does not express the series 4_815.

plt_obj = dm.superset_allele_mask(am_29, [dm.di_117_1685], [dm.di_4_815], min_match=0.001)
plt = plt_obj.do_plot()
am_not_4_815 = plt_obj.plot_context.yes_allele_mask
show(plt)

HTML(plt_obj.get_html())

an.sa_193_843.alleles_per_snp(am_not_4_815)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

This data confirms that this chromosome sample is the one that expresses the divergent series of 29 193_843 SNPs.

aps_170, am_170 = an.sa_193_843.snps_from_aps_value(170, nam_193_843)
aps_170

array([41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
       41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41, 41,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0])

This data shows the expressed SNPs for the chromosome samples that express 170 193_843 SNPs. In this case, all 41 of the chromosome samples express the first 170 SNPs.

plt_obj = dm.superset_allele_mask(am_170, min_match=0.5)
plt = plt_obj.do_plot()
show(plt)

This plot indicates that a selection process associated with the emergence of the series 9_39 has resulted in overexpression of the first 170 193_843 SNPs. The series 9_39 appears to be rooted in an hierarchy formed by a recombination of 8_267 from the South Asian tree and the series including 32_1361 and 81_857 at the root of the upper region of the East Asian tree.

HTML(plt_obj.get_html())

The data associated with the plot shows that 38 of the chromosome samples that express the series 9_39 express the first 170 SNPs of series 193_843.

aps_169, am_169 = an.sa_193_843.snps_from_aps_value(169, nam_193_843)
aps_169

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0])

This data is for the 1 chromosome sample that expresses 169 series 193_843 SNPs. It expresses all but 1 of the first 170 SNPs.

plt_obj = dm.superset_allele_mask(am_169, min_match=0.5)
plt = plt_obj.do_plot()
show(plt)

This plot shows that the 1 chromosome sample that expresses 169 series 193_843 SNPs is the remaining chromosome sample that expresses the series 9_39.

HTML(plt_obj.get_html())

This data confirms that the 1 chromosome sample has matched all of the plotted series.

plt_obj = dm.superset_allele_mask(am_170, [dm.di_8_267, dm.di_81_857], [dm.di_9_39], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)

This plot shows the series association for chromosome samples that express 170 series 193_843 SNPs, the series 8_267, and the series 81_857, but not the series 9_39.

HTML(plt_obj.get_html())

This data shows that this plot accounts for 2 of the chromosome samples that express 170 series 193_843 SNPs but do not express series 9_39.

plt_obj = dm.superset_allele_mask(am_170, [dm.di_8_267], [dm.di_9_39, dm.di_81_857], min_match=0.001)
plt = plt_obj.do_plot()
show(plt)

This plot shows the series expressed by the 1 remaining chromosome sample that expresses 170 series 193_843 SNPs but does not express the series 9_39.

HTML(plt_obj.get_html())

The data confirms that the plotted association of series is expressed by 1 chromosome sample.

SNP Series

SNP series are documented with a table that shows the properties of the series and the regional population samples that express it. The individual series SNPs are documented in a separate table that shows the properties of each individual SNP.

The column in the SNP data tables labeled "niv" means not_expressed_is_variant. The SNP allele with the lowest frequency in the thousand genome data is always considered to be the variant even when the reference genome labeled the more common allele as the variant. An niv value of 1 means the allele considered as the variant for this analysis is the one that the reference genome considers to be the standard. Note that the word allele is used to indicate a single chromosome sample that expresses an individual SNP or an SNP series.

HTML(dm.series_html(dm.di_11_765))

index	first	length	snps	alleles		matches		afr		afx		amr		eur		sas		sax
353921	136,501,840	53,819	10	2206	0.35	765	1.00	2	0.01	32	0.67	146	1.38	484	3.15	56	1.01	45	0.48
354170	136,682,274	93,624	7	1868	0.41	764	1.00	2	0.01	32	0.67	146	1.38	484	3.15	56	1.01	44	0.47
353462	135,915,358	79,721	4	1699	0.45	765	1.00	2	0.01	32	0.67	146	1.38	484	3.15	56	1.01	45	0.48
353901	136,494,186	271,765	64	1575	0.49	765	1.00	2	0.01	32	0.67	146	1.38	484	3.15	56	1.01	45	0.48
353283	135,771,974	368,330	6	1503	0.51	765	1.00	2	0.01	32	0.67	146	1.38	484	3.15	56	1.01	45	0.48
353797	136,398,174	75,924	26	1414	0.54	765	1.00	2	0.01	32	0.67	146	1.38	484	3.15	56	1.01	45	0.48
353604	136,092,061	315,418	4	911	0.84	762	1.00	2	0.01	31	0.65	146	1.38	484	3.16	54	0.97	45	0.48
353380	135,837,906	870,076	11	765	1.00	765	1.00	2	0.01	32	0.67	146	1.38	484	3.15	56	1.01	45	0.48

index	first	length	snps	alleles		matches		afr		afx		amr		eas		eur		sas		sax
353504	135,964,764	136,368	6	946	0.88	832	0.99	78	0.47	26	0.50	141	1.22	165	0.99	82	0.49	114	1.89	226	2.22
353248	135,759,095	628,204	193	843	1.00	843	1.00	79	0.47	26	0.49	143	1.22	167	0.98	83	0.49	117	1.91	228	2.21
353384	135,839,805	84,662	4	815	0.82	667	0.79	47	0.35	18	0.43	135	1.46	147	1.09	71	0.53	79	1.63	170	2.08
353521	135,989,333	275,298	6	713	0.94	668	0.79	47	0.35	18	0.43	135	1.46	146	1.09	73	0.54	79	1.63	170	2.08
353511	135,977,595	306,204	4	416	1.00	415	0.49	33	0.40	9	0.35	77	1.34	96	1.15	24	0.29	53	1.76	123	2.42
353358	135,818,487	471,078	8	267	0.84	223	0.26	1	0.02	3	0.21	56	1.81	47	1.05	48	1.07	24	1.48	44	1.61
353320	135,789,764	571,688	4	91	1.00	91	0.11	0	0.00	0	0.00	64	5.08	25	1.36	2	0.11	0	0.00	0	0.00
353285	135,772,482	525,048	7	90	1.00	90	0.11	24	1.32	3	0.53	1	0.08	19	1.05	0	0.00	17	2.60	26	2.36
353396	135,846,793	454,499	7	79	1.00	79	0.09	0	0.00	0	0.00	66	6.03	11	0.69	2	0.13	0	0.00	0	0.00
353292	135,775,651	610,782	29	68	1.00	68	0.08	1	0.07	1	0.23	5	0.53	0	0.00	10	0.73	20	4.05	31	3.72
353809	136,404,524	150,033	15	59	0.92	54	0.06	0	0.00	0	0.00	2	0.27	1	0.09	9	0.83	16	4.08	26	3.93
353355	135,814,490	711,812	14	57	1.00	57	0.07	0	0.00	0	0.00	0	0.00	16	1.39	0	0.00	18	4.34	23	3.29
354133	136,654,760	116,007	20	56	0.80	45	0.05	1	0.11	2	0.71	0	0.00	9	0.99	1	0.11	9	2.75	23	4.17
353811	136,405,656	87,993	4	49	1.00	49	0.06	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00	17	4.77	32	5.33
353626	136,133,578	283,624	4	45	1.00	45	0.05	0	0.00	0	0.00	40	6.41	0	0.00	0	0.00	1	0.31	4	0.73
353437	135,884,166	499,757	6	43	1.00	43	0.05	0	0.00	0	0.00	1	0.17	0	0.00	12	1.39	9	2.88	21	3.98
353262	135,763,383	650,325	14	39	1.00	39	0.05	30	3.82	6	2.45	3	0.56	0	0.00	0	0.00	0	0.00	0	0.00
353450	135,894,943	479,066	4	38	1.00	38	0.05	0	0.00	0	0.00	38	7.22	0	0.00	0	0.00	0	0.00	0	0.00
353413	135,865,015	494,175	9	29	1.00	29	0.03	2	0.34	0	0.00	7	1.74	0	0.00	7	1.20	4	1.90	9	2.53
353998	136,568,458	22,850	5	28	0.82	23	0.03	1	0.22	1	0.69	0	0.00	1	0.22	1	0.22	8	4.79	11	3.90
353719	136,288,733	138,847	4	27	1.00	27	0.03	0	0.00	0	0.00	24	6.41	1	0.18	2	0.37	0	0.00	0	0.00
353459	135,909,151	834,668	15	27	1.00	27	0.03	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00	13	6.62	14	4.23
353280	135,770,895	701,964	19	27	1.00	27	0.03	23	4.23	3	1.77	1	0.27	0	0.00	0	0.00	0	0.00	0	0.00
353468	135,921,205	438,735	5	26	1.00	26	0.03	0	0.00	2	1.23	3	0.83	0	0.00	15	2.87	2	1.06	4	1.25
354047	136,595,044	44,512	15	25	0.80	20	0.02	1	0.25	2	1.59	0	0.00	1	0.25	1	0.25	6	4.13	9	3.67
353386	135,840,292	457,645	10	25	0.96	24	0.03	0	0.00	0	0.00	2	0.60	0	0.00	6	1.24	8	4.59	8	2.72
353568	136,037,384	313,943	6	19	1.00	19	0.02	0	0.00	0	0.00	3	1.14	0	0.00	4	1.05	2	1.45	10	4.29
353266	135,765,502	729,701	6	19	1.00	19	0.02	15	3.92	4	3.36	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00
353548	136,013,433	296,033	8	17	0.88	15	0.02	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00	7	6.42	8	4.35
353456	135,899,723	462,140	5	16	1.00	16	0.02	0	0.00	0	0.00	4	1.80	0	0.00	7	2.18	0	0.00	5	2.55

index	first	last	length	snps	alleles	basis	afr		afx		amr		eas		eur		sas		sax
353797	136,398,174	136,474,098	75,924	26	1414	202	6	0.15	2	0.16	39	1.39	80	1.97	20	0.49	19	1.29	36	1.45
354129	136,652,953	136,761,175	108,222	5	1296	3	0	0.00	0	0.00	0	0.00	3	4.97	0	0.00	0	0.00	0	0.00
354127	136,652,491	136,732,772	80,281	6	1114	7	0	0.00	0	0.00	0	0.00	2	1.42	1	0.71	2	3.93	2	2.33
353984	136,556,805	136,747,085	190,280	39	1014	9	0	0.00	0	0.00	0	0.00	7	3.86	1	0.55	1	1.53	0	0.00
353604	136,092,061	136,407,479	315,418	4	911	135	0	0.00	0	0.00	56	2.99	0	0.00	35	1.29	16	1.63	28	1.69
353902	136,494,985	136,773,638	278,653	81	857	2	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00	2	8.16
353938	136,514,709	136,543,147	28,438	6	820	5	0	0.00	0	0.00	1	1.44	0	0.00	0	0.00	2	5.50	2	3.26
353380	135,837,906	136,707,982	870,076	11	765	765	2	0.01	32	0.67	146	1.38	0	0.00	484	3.15	56	1.01	45	0.48
354061	136,603,638	136,632,125	28,487	16	511	1	0	0.00	0	0.00	1	7.22	0	0.00	0	0.00	0	0.00	0	0.00
354064	136,605,402	136,624,686	19,284	8	328	3	0	0.00	0	0.00	1	2.41	0	0.00	2	3.32	0	0.00	0	0.00
354123	136,651,773	136,774,627	122,854	51	176	2	0	0.00	0	0.00	0	0.00	2	4.97	0	0.00	0	0.00	0	0.00
354051	136,596,162	136,785,729	189,567	17	137	3	0	0.00	0	0.00	0	0.00	1	1.66	2	3.32	0	0.00	0	0.00
353987	136,560,126	136,742,080	181,954	4	57	1	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00	1	13.76	0	0.00
353444	135,889,881	136,382,936	493,055	6	57	53	2	0.19	1	0.30	14	1.91	0	0.00	28	2.63	5	1.30	3	0.46
353407	135,859,157	135,980,938	121,781	6	35	34	1	0.15	1	0.47	12	2.55	0	0.00	17	2.49	2	0.81	1	0.24
353252	135,759,628	136,443,550	683,922	12	32	1	0	0.00	0	0.00	1	7.22	0	0.00	0	0.00	0	0.00	0	0.00
354074	136,608,339	136,738,273	129,934	7	31	1	1	4.97	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00
354116	136,648,736	136,762,850	114,114	4	26	1	0	0.00	0	0.00	0	0.00	0	0.00	1	4.98	0	0.00	0	0.00
354176	136,687,844	136,709,809	21,965	4	25	1	0	0.00	0	0.00	0	0.00	1	4.97	0	0.00	0	0.00	0	0.00
353315	135,786,061	136,586,684	800,623	7	22	22	0	0.00	0	0.00	0	0.00	19	4.29	0	0.00	3	1.88	0	0.00
353570	136,039,834	136,349,562	309,728	7	20	17	0	0.00	1	0.94	4	1.70	0	0.00	11	3.22	0	0.00	1	0.48
353692	136,243,714	136,351,532	107,818	5	18	1	0	0.00	0	0.00	1	7.22	0	0.00	0	0.00	0	0.00	0	0.00
353922	136,504,541	136,528,224	23,683	4	17	1	1	4.97	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00	0	0.00

index	first	length	snps	alleles		matches		afr		eas		eur		sas		sax
353921	136,501,840	53,819	10	2206	0.03	60	0.64	0	0.00	59	4.89	0	0.00	1	0.23	0	0.00
354170	136,682,274	93,624	7	1868	0.03	62	0.66	1	0.08	59	4.73	1	0.08	1	0.22	0	0.00
353901	136,494,186	271,765	64	1575	0.04	60	0.64	0	0.00	59	4.89	0	0.00	1	0.23	0	0.00
353791	136,393,658	92,684	32	1361	0.06	79	0.84	1	0.06	71	4.47	1	0.06	4	0.70	2	0.21
353384	135,839,805	84,662	4	815	0.11	93	0.99	2	0.11	84	4.49	1	0.05	4	0.59	2	0.18
353498	135,959,272	416,583	5	684	0.13	87	0.93	0	0.00	82	4.68	0	0.00	4	0.63	1	0.09

index	first	length	snps	alleles		matches		afr		eas		eur		sas		sax
353921	136,501,840	53,819	10	2206	0.03	60	0.65	0	0.00	59	4.89	0	0.00	1	0.23	0	0.00
354170	136,682,274	93,624	7	1868	0.03	62	0.67	1	0.08	59	4.73	1	0.08	1	0.22	0	0.00
353901	136,494,186	271,765	64	1575	0.04	60	0.65	0	0.00	59	4.89	0	0.00	1	0.23	0	0.00
353791	136,393,658	92,684	32	1361	0.06	79	0.85	1	0.06	71	4.47	1	0.06	4	0.70	2	0.21
353384	135,839,805	84,662	4	815	0.11	93	1.00	2	0.11	84	4.49	1	0.05	4	0.59	2	0.18
353498	135,959,272	416,583	5	684	0.13	87	0.94	0	0.00	82	4.68	0	0.00	4	0.63	1	0.09

index	first	length	snps	alleles		matches		sax
354033	136,588,031	5,647	7	1760	0.00	1	1.00	1	8.16
354130	136,653,925	107,928	24	1504	0.00	1	1.00	1	8.16
353925	136,506,375	32,564	4	1442	0.00	1	1.00	1	8.16
353791	136,393,658	92,684	32	1361	0.00	1	1.00	1	8.16
353958	136,535,876	19,014	7	1303	0.00	1	1.00	1	8.16
354129	136,652,953	108,222	5	1296	0.00	1	1.00	1	8.16
353919	136,500,475	42,085	13	1227	0.00	1	1.00	1	8.16
354127	136,652,491	80,281	6	1114	0.00	1	1.00	1	8.16
353902	136,494,985	278,653	81	857	0.00	1	1.00	1	8.16
353384	135,839,805	84,662	4	815	0.00	1	1.00	1	8.16
353554	136,021,814	347,522	5	47	0.02	1	1.00	1	8.16

index	first	length	snps	alleles	matches		eas
353814	136,406,646	31,432	5	1460	1	1.00	1	4.97
353906	136,496,493	57,432	9	1170	1	1.00	1	4.97
353849	136,447,707	42,559	4	1149	1	1.00	1	4.97
353907	136,496,805	55,824	9	1023	1	1.00	1	4.97
353984	136,556,805	190,280	39	1014	1	1.00	1	4.97
353935	136,511,874	21,321	5	976	1	1.00	1	4.97
353807	136,403,994	81,102	13	911	1	1.00	1	4.97
353938	136,514,709	28,438	6	820	1	1.00	1	4.97
353384	135,839,805	84,662	4	815	1	1.00	1	4.97

index	first	length	snps	alleles		matches		afr
353244	135,758,231	618,284	117	1685	0.00	1	1.00	1	4.97
353478	135,933,921	434,642	123	1561	0.00	1	1.00	1	4.97
354130	136,653,925	107,928	24	1504	0.00	1	1.00	1	4.97
354189	136,704,466	27,748	5	212	0.00	1	1.00	1	4.97
354181	136,696,010	68,554	18	172	0.01	1	1.00	1	4.97
354143	136,658,238	19,681	6	167	0.01	1	1.00	1	4.97
354156	136,666,402	102,253	17	121	0.01	1	1.00	1	4.97
353866	136,468,307	15,591	5	18	0.06	1	1.00	1	4.97

index	pos	id	niv	alleles	afr		afx		amr		eas		eur		sas		sax
875450	135,837,906	rs7570971	1	785	2	0.01	32	0.65	149	1.37	0	0.00	493	3.13	60	1.05	49	0.51
875770	135,907,088	rs6730157	1	791	2	0.01	32	0.65	151	1.38	0	0.00	493	3.10	61	1.06	52	0.54
875981	135,954,797	rs1375131	1	789	2	0.01	32	0.65	149	1.36	0	0.00	492	3.10	61	1.06	53	0.55
876953	136,138,627	rs3940549	1	785	2	0.01	32	0.65	150	1.38	0	0.00	492	3.12	61	1.07	48	0.50
877231	136,176,540	rs13384711	1	795	3	0.02	32	0.64	151	1.37	0	0.00	493	3.09	63	1.09	53	0.54
877937	136,328,890	rs56369224	1	793	2	0.01	32	0.64	151	1.37	2	0.01	492	3.09	62	1.08	52	0.53
878126	136,381,348	rs12465802	1	795	2	0.01	32	0.64	151	1.37	0	0.00	496	3.11	62	1.07	52	0.53
878351	136,429,366	rs62168795	1	807	2	0.01	34	0.67	154	1.38	0	0.00	506	3.12	60	1.02	51	0.52
879308	136,608,646	rs4988235	0	808	2	0.01	34	0.67	150	1.34	0	0.00	511	3.15	60	1.02	51	0.51
879345	136,616,754	rs182549	0	818	2	0.01	34	0.66	153	1.35	0	0.00	512	3.12	63	1.06	54	0.54
879828	136,707,982	rs6754311	1	821	0	0.00	34	0.66	154	1.35	0	0.00	514	3.12	62	1.04	57	0.57
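
The second number in each population column pair is not defined in this part of the notebook. The values are consistent with an enrichment ratio: the fraction of the carrying samples that come from a population, divided by that population's share of all 5008 chromosome samples. The sketch below checks that reading against two rows from the tables above. The function name and the population sizes (1006 chromosomes for eur, 1008 for eas) are assumptions based on the 1000 Genomes phase 3 sample counts; they are not taken from the source code.

```python
# Hypothetical helper, not part of genomes_dnj: reproduces the ratio that
# appears next to each population count under the assumed definition.
TOTAL_SAMPLES = 5008  # 1000 Genomes chromosome samples, as stated in Methods


def enrichment(pop_count, carrier_count, pop_size, total=TOTAL_SAMPLES):
    """Return (share of carriers in the population) / (population share of all samples)."""
    if carrier_count == 0 or pop_size == 0:
        return 0.0
    return (pop_count / carrier_count) / (pop_size / total)


# Assumed population sizes in chromosomes (2 per individual).
EUR_SIZE = 1006
EAS_SIZE = 1008

# rs4988235: 808 alleles, 511 of them in eur; the table shows 3.15.
eur_ratio = round(enrichment(511, 808, EUR_SIZE), 2)

# Series 353921: 60 matches, 59 of them in eas; the table shows 4.89.
eas_ratio = round(enrichment(59, 60, EAS_SIZE), 2)

print(eur_ratio, eas_ratio)
```

Under this reading a ratio near 1.00 means the population carries the SNP or series in proportion to its size, and a single carrier in eas gives 5008 / 1008, which rounds to the 4.97 seen in the one-match tables.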