import numpy as np
from IPython.display import HTML
from bokeh.plotting import output_notebook, show
import genomes_dnj.lct_interval.chrom_plots as cp
output_notebook(hide_banner=True)
This notebook presents summary statistics for all SNPs and series in the autosome analysis data set. It then plots series data for chromosome 2 as a whole and for a smaller region of chromosome 2 that includes the genetic variations responsible for lactase persistence.
Note that recreating the summary table produced by the next cell requires SNP data for all chromosomes. The plots in this notebook can be recreated from the chromosome 2 data alone.
import genomes_dnj.series_anal.chrom_series_stats as cs
stats_obj = cs.autosome_snp_series_stats_cls()
stats_obj.do_stats()
HTML(stats_obj.stats_html())
A simple method was used with the 1000 genomes phase 3 data to group SNPs into series across all of the autosomes. The criteria used to group the SNPs were roughly:
The table above shows that over 940,000 series were identified and that more than half of the SNPs were grouped into some series. The average series contained 10 SNPs, but there was substantial variation both in the number of SNPs in a series and in the length of chromosome that a series covers. The median values for both the number of SNPs and the length are considerably lower than the means, because particularly long series with particularly large numbers of SNPs make a large contribution to the averages.
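The gap between mean and median described above is typical of a long-tailed distribution. The small sketch below illustrates the effect with made-up SNP counts; the numbers are invented for illustration and are not taken from the analysis data set.

```python
import numpy as np

# Made-up SNP counts per series, for illustration only.
# Nine short series and one very long one.
snp_counts = np.array([2, 2, 3, 3, 4, 5, 6, 8, 12, 155])

mean_count = snp_counts.mean()        # pulled upward by the single long series
median_count = np.median(snp_counts)  # unaffected by how long the longest series is

print("mean:", mean_count)
print("median:", median_count)
```

A single series with 155 SNPs is enough to lift the mean well above the median, even though most of the series in this toy example have fewer than 10 SNPs.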
Statistics were accumulated for the active series at each position in a chromosome where a series started or ended. An active series is one that starts before the measurement position and ends after it. The plot below for chromosome 2 shows that it is common for almost all of the 5008 sampled chromosomes in the 1000 genomes phase 3 data to be expressing an active series. The label "LCT" marks the region of chromosome 2 where the lactase gene and the series of SNPs associated with lactase persistence are located.
plt0 = cp.chrom2_stats('active_series_allele_count')
show(plt0)
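The definition of an active series given above can be sketched in a few lines. This is a minimal illustration, not the code used by `genomes_dnj`: the `Series` record, the `active_series_stats` function, and the example positions are all hypothetical, and it assumes that each SNP belongs to at most one series so that SNP counts can simply be summed.

```python
from collections import namedtuple

# Hypothetical record type; not the data structure used by genomes_dnj.
Series = namedtuple("Series", ["start", "end", "snp_count"])

def active_series_stats(series_list):
    """Return (position, active_series, active_snp_count) tuples,
    one for every position where some series starts or ends."""
    positions = sorted({p for s in series_list for p in (s.start, s.end)})
    stats = []
    for pos in positions:
        # A series is active when it starts before pos and ends after pos.
        active = [s for s in series_list if s.start < pos < s.end]
        stats.append((pos, len(active), sum(s.snp_count for s in active)))
    return stats

# Invented positions and SNP counts, for illustration only.
example = [Series(100, 500, 4), Series(200, 300, 7), Series(250, 900, 12)]
for row in active_series_stats(example):
    print(row)
```

The loop rescans every series at every position, which is quadratic in the number of series. That is fine for a toy example; a sweep over sorted start and end events would be the natural choice at the scale of a whole chromosome.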
This plot shows the number of active series at each measured location on chromosome 2.
plt1 = cp.chrom2_stats('active_series')
show(plt1)
This plot shows the number of SNPs in active series at each measured location. The region associated with lactase persistence has the largest such count on chromosome 2.
plt2 = cp.chrom2_stats('active_series_snp_count')
show(plt2)
This plot shows the count of chromosome samples expressing an active series in a small part of chromosome 2 that includes the region associated with lactase persistence. Note the regular pattern in which this count drops to zero, or at least close to it. The region associated with lactase persistence is exceptionally long, but it does appear to have some internal structure.
plt3 = cp.chrom2_intvl_stats('active_series_allele_count')
show(plt3)
This plot shows the number of active series in the same region of chromosome 2.
plt4 = cp.chrom2_intvl_stats('active_series')
show(plt4)
This plot shows the count of SNPs in active series for that same part of chromosome 2.
plt5 = cp.chrom2_intvl_stats('active_series_snp_count')
show(plt5)