In [1]:
import numpy as np
from IPython.display import HTML
from bokeh.plotting import output_notebook, show
import genomes_dnj.lct_interval.chrom_plots as cp
output_notebook(hide_banner=True)

SNP Series Statistics

This notebook presents some statistics for all SNPs and series in the autosome analysis data set. Some plots of series data are presented for chromosome 2 and for a smaller region of chromosome 2 that includes the genetic variations responsible for lactase persistence.

Note that recreating the table below requires SNP data for all chromosomes. The other plots in this notebook can be recreated from the data for chromosome 2 alone.

In [2]:
import genomes_dnj.series_anal.chrom_series_stats as cs
stats_obj = cs.autosome_snp_series_stats_cls()
stats_obj.do_stats()
HTML(stats_obj.stats_html())
Out[2]:
chr   series     in_snps  not_in_snps  in_ratio  snp_mean  snp_med  snp_std  len_mean  len_med  len_std
  1   74,387     796,158      647,635      0.55        10        6    13.53    53,616   23,620   89,920
  2   81,160     865,501      682,045      0.56        10        6    13.24    55,164   25,382   84,828
  3   68,222     744,533      556,878      0.57        10        6    14.58    58,394   25,487   97,904
  4   69,448     784,085      545,117      0.59        11        7    15.20    49,059   25,030   70,950
  5   61,476     667,469      502,587      0.57        10        7    13.89    55,855   26,331   87,644
  6   61,184     701,990      483,992      0.59        11        7    16.66    57,924   25,325   99,090
  7   56,078     607,361      476,403      0.56        10        6    14.92    53,050   22,929   93,440
  8   53,578     575,287      451,502      0.56        10        6    14.01    51,636   22,112   96,151
  9   41,365     419,307      381,488      0.52        10        6    12.66    43,501   20,065   66,119
 10   47,184     509,139      410,134      0.55        10        6    15.83    49,812   21,916   85,663
 11   45,800     519,153      387,462      0.57        11        6    20.67    61,728   23,955  128,574
 12   45,356     491,595      385,595      0.56        10        6    17.83    56,691   23,884   99,609
 13   34,878     371,639      284,268      0.57        10        6    13.55    48,564   22,679   73,631
 14   30,811     330,382      267,406      0.55        10        6    14.05    50,916   22,996   83,338
 15   27,304     275,859      266,100      0.51        10        6    13.25    50,947   18,936   91,429
 16   29,271     281,091      305,987      0.48         9        6    11.79    42,896   15,399   93,677
 17   24,686     246,647      257,802      0.49         9        6    17.97    47,597   18,418   77,604
 18   27,266     274,494      243,652      0.53        10        6    12.05    41,691   19,793   64,610
 19   20,431     208,902      214,023      0.49        10        6    15.50    38,405   14,981   73,496
 20   20,845     203,339      200,647      0.50         9        6    11.39    41,315   17,681   73,079
 21   13,213     128,139      127,226      0.50         9        6    10.86    31,577   16,380   44,877
 22   12,675     121,440      134,968      0.47         9        6    11.40    42,662   14,358   82,520
all  946,618  10,123,510    8,212,917      0.55        10        6    14.79    51,709   22,524   89,020

A simple method was used with the 1000 Genomes phase 3 data to group SNPs into series across all of the autosomes. The criteria used to group the SNPs were roughly:

  • Each chromosome sample expressing the series needed to express 90% of the SNPs in it.
  • At least 90% of the chromosome samples expressing any SNP in the series had to express the series.
  • The series had to include at least 4 SNPs.
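
As a rough illustration of the three criteria above, the sketch below tests one candidate series against a boolean matrix of chromosome samples by SNPs. It is not the code used to build the data set: the function name `series_passes`, the matrix layout, and the exact handling of the 90% thresholds are assumptions made for this example.

```python
import numpy as np

def series_passes(haps, snp_idx, min_frac=0.9, min_snps=4):
    """Check one candidate series against the three grouping criteria.

    haps    : bool array, shape (n_samples, n_snps); True where a chromosome
              sample carries the SNP (assumed layout, for illustration only).
    snp_idx : column indices of the SNPs in the candidate series.
    """
    snp_idx = np.asarray(snp_idx)
    # Criterion 3: the series must include at least min_snps SNPs.
    if snp_idx.size < min_snps:
        return False
    carried = haps[:, snp_idx].sum(axis=1)  # series SNPs carried by each sample
    # Criterion 1: a sample expresses the series if it carries >= 90% of its SNPs.
    expresses = carried >= min_frac * snp_idx.size
    # Criterion 2: of the samples carrying any SNP of the series,
    # at least 90% must express the series.
    n_any = int((carried > 0).sum())
    if n_any == 0:
        return False
    return bool(expresses.sum() / n_any >= min_frac)

# Toy data (invented): 10 chromosome samples, 5 SNPs.
haps = np.zeros((10, 5), dtype=bool)
haps[:6, :4] = True   # samples 0-5 carry SNPs 0-3
haps[:, 4] = True     # SNP 4 is carried by every sample

print(series_passes(haps, [0, 1, 2, 3]))  # True
print(series_passes(haps, [0, 1, 2, 4]))  # False
print(series_passes(haps, [0, 1, 2]))     # False
```

The second candidate fails because SNP 4 is carried by four samples that do not express the rest of the series, so only 60% of the carriers express it. The third fails because it has fewer than 4 SNPs.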

The table above shows that over 940,000 series were identified and that more than half of the SNPs were grouped into some series. The average series contained 10 SNPs, but there was substantial variation in both the number of SNPs and the length of chromosome covered by a series. The median values for both the number of SNPs and the length are considerably lower than the means (6 versus 10 SNPs, and 22,524 versus 51,709 for length) because particularly long series with particularly large numbers of SNPs make a large contribution to the averages.
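
The summary figures quoted above can be checked against the "all" row of the table. The sketch below assumes that in_ratio is in_snps / (in_snps + not_in_snps) and that snp_mean is in_snps / series. Both are inferences from the column names, not something the notebook states. The `sizes` array is invented toy data, included only to show how a few large series pull the mean above the median.

```python
import numpy as np

# Values from the "all" row of the table.
series = 946_618
in_snps = 10_123_510
not_in_snps = 8_212_917

in_ratio = in_snps / (in_snps + not_in_snps)   # assumed definition
snps_per_series = in_snps / series             # assumed definition
print(f"{in_ratio:.2f}")         # 0.55
print(f"{snps_per_series:.2f}")  # 10.69

# Invented series sizes with a long right tail.
sizes = np.array([4, 4, 5, 5, 6, 6, 7, 8, 15, 40])
print(sizes.mean(), np.median(sizes))  # 10.0 6.0
```

The table reports snp_mean as 10, which suggests the displayed value is truncated rather than rounded.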

Chromosomes Expressing Active SNP Series

Statistics were accumulated for the active series at each position in a chromosome where a series started or ended. An active series is one that starts before the measurement position and ends after it. The plot below for chromosome 2 shows that it is common for almost all of the 5008 sampled chromosomes in the 1000 Genomes phase 3 data to be expressing an active series. The label "LCT" marks the region of chromosome 2 that contains the lactase gene and the series of SNPs associated with lactase persistence.

    In [3]:
    plt0 = cp.chrom2_stats('active_series_allele_count')
    show(plt0)
    
    Out[3]:

    <Bokeh Notebook handle for In[3]>

Active SNP Series

This plot shows the number of active series at each measured location of chromosome 2.

    In [4]:
    plt1 = cp.chrom2_stats('active_series')
    show(plt1)
    
    Out[4]:

    <Bokeh Notebook handle for In[4]>

SNPs In Active Series

This plot shows the number of SNPs in active series at each measured location. The region associated with lactase persistence has the largest such count on chromosome 2.

    In [5]:
    plt2 = cp.chrom2_stats('active_series_snp_count')
    show(plt2)
    
    Out[5]:

    <Bokeh Notebook handle for In[5]>
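
The per-position counts of active series and of SNPs in active series can be computed in one pass over the sorted series boundaries. The sketch below is an assumption about how such statistics might be accumulated, not the code behind `cp.chrom2_stats`: the function name `active_totals` and its arguments are hypothetical. It follows the definition given earlier, treating a series as active at a position when it starts before that position and ends after it, and it assumes every series starts before it ends.

```python
import numpy as np

def active_totals(starts, ends, weights=None):
    """Total weight of active series at every series start or end position.

    A series is active at position p when start < p < end.
    weights=None counts series; per-series SNP counts give SNP totals.
    """
    starts = np.asarray(starts)
    ends = np.asarray(ends)
    if weights is None:
        weights = np.ones(starts.size, dtype=int)
    weights = np.asarray(weights)

    # Measurement positions: every place where a series starts or ends.
    positions = np.unique(np.concatenate([starts, ends]))

    # Running weight totals in order of start and in order of end.
    s_order = np.argsort(starts)
    e_order = np.argsort(ends)
    s_cum = np.concatenate([[0], np.cumsum(weights[s_order])])
    e_cum = np.concatenate([[0], np.cumsum(weights[e_order])])

    # Series with start < p, minus series with end <= p, leaves start < p < end.
    n_started = np.searchsorted(starts[s_order], positions, side="left")
    n_ended = np.searchsorted(ends[e_order], positions, side="right")
    return positions, s_cum[n_started] - e_cum[n_ended]

# Toy example (invented): three overlapping series with 5, 7 and 4 SNPs.
starts = [10, 20, 40]
ends = [50, 45, 60]
snp_counts = [5, 7, 4]

positions, n_active = active_totals(starts, ends)
_, n_snps = active_totals(starts, ends, snp_counts)
print(positions.tolist())  # [10, 20, 40, 45, 50, 60]
print(n_active.tolist())   # [0, 1, 2, 2, 1, 0]
print(n_snps.tolist())     # [0, 5, 12, 9, 4, 0]
```

Summing per-series sample counts in the same way would count a chromosome sample once for every active series it expresses, so a count of distinct samples expressing an active series would need a different approach.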

Lactase Region Chromosome Samples Expressing Active Series

This plot shows the count of chromosome samples expressing an active series in a small part of chromosome 2 that includes the region associated with lactase persistence. Note the regular pattern in which the count drops to zero, or at least near zero. The region associated with lactase persistence is exceptionally long, but it does appear to have some internal structure.

    In [6]:
    plt3 = cp.chrom2_intvl_stats('active_series_allele_count')
    show(plt3)
    
    Out[6]:

    <Bokeh Notebook handle for In[6]>

Lactase Region Active Series

This plot shows the number of active series in the same region of chromosome 2.

    In [7]:
    plt4 = cp.chrom2_intvl_stats('active_series')
    show(plt4)
    
    Out[7]:

    <Bokeh Notebook handle for In[7]>

Lactase Region SNPs In Active Series

This plot shows the count of SNPs in active series for that same part of chromosome 2.

    In [8]:
    plt5 = cp.chrom2_intvl_stats('active_series_snp_count')
    show(plt5) 
    
    Out[8]:

    <Bokeh Notebook handle for In[8]>