import numpy as np
from IPython.display import HTML
from bokeh.plotting import output_notebook, show
import genomes_dnj.lct_interval.series_plots as dm
import genomes_dnj.lct_interval.chrom_plots as cp
output_notebook(hide_banner=True)
The studies of SNP series statistics showed large variations in the number of active series at different positions of the autosomal chromosomes, in the number of SNPs in those series, and in the lengths of the series in DNA bases. These studies also showed that chromosomes could be partitioned into regions bounded by positions where the number of active series went to zero. The studies in these notebooks focus on one of these regions of chromosome 2, the one where the gene lct is located.
For chromosome 2, the mean number of SNPs in a series is 10 and the mean length of a series is 55,000 DNA bases. But the lactase persistence region of chromosome 2 includes many series with SNP counts and lengths that far exceed these means and that are expressed by large numbers of the 1000 genomes chromosome 2 samples. Four of the most exceptional series are 117_1685, 123_1561, 62_1265, and 193_843. Those series include a total of 495 SNPs. Each of them covers a large part of the first 600,000 bases of the studied region of chromosome 2. That DNA segment includes the genes rab3gap1, zranb3 and the lower half of r3hdm1. Over 2,000 of the 5,008 1000 genomes chromosome 2 samples express at least one of these four series. All four of them are expressed by 842 of those samples.
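The series names used throughout appear to encode the SNP count followed by the number of samples that express the series (62_1265 is described later as the 62 SNP series expressed by 1265 samples). The following is a minimal sketch of that reading, not part of the genomes_dnj package; `SeriesId` and `parse_series_id` are hypothetical names introduced here only to check the arithmetic quoted above.

```python
from typing import NamedTuple


class SeriesId(NamedTuple):
    snp_count: int      # number of SNPs in the series
    sample_count: int   # number of samples that express the series


def parse_series_id(name: str) -> SeriesId:
    """Split a name such as '193_843' into (SNP count, sample count).

    Assumes the 'snps_samples' naming convention inferred from the text.
    """
    snps, samples = name.split("_")
    return SeriesId(int(snps), int(samples))


series_names = ["117_1685", "123_1561", "62_1265", "193_843"]
parsed = [parse_series_id(name) for name in series_names]

# Total SNPs across the four exceptional series.
total_snps = sum(s.snp_count for s in parsed)
print(total_snps)  # 495

# Share of the 5,008 samples that express all four series (842, from the text).
all_four_fraction = 842 / 5008
print(f"{all_four_fraction:.1%}")  # 16.8%
```

Under this reading, the four series together account for 495 SNPs, and roughly one sample in six expresses all four of them.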
There is no obvious process for generating series that include these large numbers of SNPs simply from some event that selects for the functional role of a single SNP. The problem posed by the varying combinations of the series expressed by different chromosome 2 samples is even more difficult. But the series do tend to be part of hierarchies where some series is specific to the hierarchy and appears to have played some kind of role in the process that selected that hierarchy for overexpression. For example, the series 193_843 is the selector series for the hierarchies that are associated with the SAS tree.
The series 28_434 is an example of a more complex selector series. It acts as a selector series for multiple hierarchies with varying combinations of the series 117_1685 and 123_1561. Those hierarchies have been generated by processes that include fragmentation of series, recombination events, and appearance of new series that selected some 28_434 subtree for overexpression.
This notebook provides an introduction to these four series and some of the events that appear to have caused them to be so widely expressed.
The 843 chromosome 2 samples that express the 193 SNPs in the 628,000 DNA base series 193_843 all also express the overlapping series 123_1561 and 62_1265. All but one of them express 117_1685. The series 193_843 is a major part of the root of the SAS tree. But no samples express just the root of that tree. The large number of samples expressing 193_843 appears to be the result of processes during the human expansion from Africa that generated a hierarchy of overexpressed new series. But the process that generated 193_843 itself has not left any obvious trace.
plt_obj = dm.superset_yes_no([dm.di_193_843], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
The identity of 62_1265, the 62 SNP series expressed by 1265 samples, as a series independent of 193_843 is primarily due to chromosome 2 samples that express the series 67_329. The 371 samples that express 62_1265, 123_1561, and 117_1685 without 193_843 include 309 that express 67_329.
plt_obj = dm.superset_yes_no([dm.di_62_1265, dm.di_117_1685, dm.di_123_1561], [dm.di_193_843], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
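The selections in this notebook pair series that samples must express with series they must not express, for example 62_1265, 117_1685, and 123_1561 without 193_843. The internals of `dm.superset_yes_no` are not shown here, so the following is only a sketch of that kind of set logic on toy data. The function `select_samples`, the mapping `expressed_by`, and the sample ids are all hypothetical and are not taken from the genomes_dnj package or the 1000 genomes data.

```python
def select_samples(expressed_by, yes, no=()):
    """Return the sample ids that express every series in `yes` and none in `no`.

    `expressed_by` maps a series name to the set of sample ids expressing it.
    `yes` must contain at least one series name.
    """
    # Samples common to all required series (a new set, inputs are not mutated).
    selected = set.intersection(*(expressed_by[name] for name in yes))
    # Drop any sample that expresses an excluded series.
    for name in no:
        selected -= expressed_by[name]
    return selected


# Toy data: series names and sample ids invented for illustration only.
expressed_by = {
    "A": {1, 2, 3, 4, 5},
    "B": {2, 3, 4, 5, 6},
    "C": {4, 5, 7},
}

result = select_samples(expressed_by, yes=["A", "B"], no=["C"])
print(sorted(result))  # [2, 3]
```

Samples 2 to 5 express both A and B, and removing those that also express C leaves samples 2 and 3.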
The overexpression of the 209 SNP series 209_56, expressed by 56 chromosome 2 samples, appears to have resulted from an independent selection process for chromosome 2 samples that express 62_1265, 123_1561, and 117_1685 but not 193_843. This process appears to have started with a more extended series that included 209_56 and 14_48. The full series of selected SNPs also included 117_1685, 123_1561, 62_1265, 26_1414 and 10_2206. All of these series have participated in other independent selection processes that formed different hierarchies.
plt_obj = dm.superset_yes_no([dm.di_209_56], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
There are 43 samples that express 62_1265 and 117_1685 but not 123_1561 or 193_843. Those samples include 39 that are the result of the appearance of the series 9_39. The hierarchy selected by this series is rooted in a genetic event that recombined series 8_267, which is part of the SAS tree, with the series 32_1361 and 81_857, which are part of the EAS tree root. Fragments of the series 193_843 and 123_1561 are still visible in the chromosome samples that express series that derive from this event.
plt_obj = dm.superset_yes_no([dm.di_117_1685, dm.di_62_1265], [dm.di_123_1561], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
plt_obj = dm.superset_yes_no([dm.di_117_1685, dm.di_74_210], [dm.di_28_434, dm.di_123_1561], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
The series 290_16 is another case of chromosomes expressing only 117_1685. The number of chromosomes expressing this series is modest, but its 290 SNPs are a particularly large number.
plt_obj = dm.superset_yes_no([dm.di_290_16], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
plt_obj = dm.superset_yes_no([dm.di_123_1561, dm.di_117_1685], [dm.di_62_1265], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
plt_obj = dm.superset_yes_no([dm.di_123_1561, dm.di_28_434], [dm.di_117_1685], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
The next plot shows another hierarchy expressed by 73 of the 28_434 samples. This hierarchy is formed by samples that express 28_434 and 117_1685 but not 123_1561 or 13_1696. It is expressed by 67 of the 6_68 samples and all 40 of the 14_40 samples.
plt_obj = dm.superset_yes_no([dm.di_117_1685, dm.di_28_434], [dm.di_13_1696, dm.di_123_1561], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
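The counts quoted for the hierarchy without 123_1561 or 13_1696 can be turned into proportions of each series' total sample count. This is a small sketch that assumes the second number in a series name is the number of samples expressing that series; `coverage` is a hypothetical helper, and the counts are the ones stated in the text.

```python
def coverage(series_name: str, expressing: int) -> float:
    """Fraction of a series' samples that fall inside a hierarchy.

    Assumes names follow the 'snps_samples' convention, e.g. '6_68'.
    """
    total = int(series_name.split("_")[1])
    return expressing / total


# Samples in this hierarchy per series, as quoted in the text.
hierarchy_counts = {"28_434": 73, "6_68": 67, "14_40": 40}

for name, count in hierarchy_counts.items():
    print(f"{name}: {count} samples, {coverage(name, count):.1%} of the series")
```

On these numbers the hierarchy holds about 17% of the 28_434 samples but nearly all of the 6_68 samples and every 14_40 sample.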
This plot shows the hierarchy rooted in the 28_434, 117_1685, and 13_1696 association. This hierarchy appears to have resulted from the appearance of the series 22_73.
plt_obj = dm.superset_yes_no([dm.di_117_1685, dm.di_28_434, dm.di_13_1696], [dm.di_123_1561], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())
plt_obj = dm.superset_yes_no([dm.di_22_35], min_match=0.5)
plt = plt_obj.do_plot()
show(plt)
HTML(plt_obj.get_html())