Statistics of Human Genetic Diversity

In [3]:
import genomes_dnj_2.notebooks_python.chrom_statistics_overview as npy
npy.output_notebook(hide_banner=True)

Both the physical structure of DNA and a large body of experimental evidence make a strong case for a role of DNA as a smart scaffold that enables controlled activity of protein and perhaps RNA complexes. Thousands of proteins are known to bind to DNA. Many of them are known to be transcription factors that influence transcription of specific RNAs in response to interactions with chemical signalling systems. Recent studies of cell type specific regulation of protein synthesis have established the role of cell type specific enhancer DNA segments. Protein complexes bind to these segments and interact with distant protein complexes that directly interact with DNA to promote and control RNA transcription.

Sequencing of whole human genomes has shown that only a small part of human DNA directly codes for protein amino acids. Large numbers of human genetic variations are found in the rest of the human genome. Statistical studies have shown that many of these genetic variations have some association with susceptibility to some kind of disease.

The thousand genome project established patterns of correlation of single nucleotide polymorphisms (SNPs) on haploid chromosomes and used these patterns to impute haploid chromosome associations of genetic variations for its 5008 sampled sets of autosomal chromosomes. The work labeled genomes_dnj exploited the thousand genome data with a heuristic technique to group SNPs expressed by at least 16 chromosome samples from the thousand genome data into series that included four or more SNPs. That work started with 18,336,427 SNPs and grouped 10,123,510 of them into 946,618 series of four or more SNPs. An outline of the procedure used to group the SNPs has been posted on the genomes_dnj blog.

The genomes_dnj work focused on a detailed study of 76 series that all are expressed in the region of chromosome 2 that is associated with the phenotype of lactase persistence. That work established the reality of the series and recovered a number of patterns for the history that generated them.

The work reported here is the start of an effort labeled genomes_dnj_2 that is intended to generalize the genomes_dnj work to the whole genome. The initial work is focused on the identification of some simple statistics that can be used to characterize the genome wide genetic variation that results from the 946,618 SNP series.

Some significant conclusions already suggested by the genomes_dnj work are strongly reinforced by the initial work with these statistics. Genetic variation is pervasive throughout the human genome. There is no standard genome. The variations have widespread population dependencies. These results suggest a model for human evolution that is driven by pervasive selection for quantitative traits with modest but significant functional impacts. The biochemistry of many of these functional variations is likely driven by interactions of protein complexes with multiple locations on the DNA scaffold.

In [4]:
npy.show(npy.snp_stats_plot)

These two tables show the results of the procedure for grouping SNPs. The procedure is described in more detail in a post on the genomes_dnj blog.

Basic Series Statistics

Each of the 946,618 series of four or more SNPs is characterized by three basic statistics. One is the number of the 5008 thousand genome chromosome samples that express the SNPs in the series. The second is the number of SNPs that are associated with the series. The third is the length of the series in DNA bases. That length is the difference in chromosome DNA positions between the last SNP in the series, the one with the highest chromosome position, and the first SNP in the series, the one with the lowest.
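The three statistics can be sketched in a few lines of Python. This is only an illustration: the `Series` container, its field names, and the example values are hypothetical and are not taken from the genomes_dnj data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Series:
    """Hypothetical container for one SNP series (not the genomes_dnj schema)."""
    snp_positions: tuple   # chromosome positions of the SNPs in the series
    sample_ids: frozenset  # indexes of the chromosome samples that express the series


def basic_stats(series):
    """Return (sample_count, snp_count, length_in_bases) for one series."""
    sample_count = len(series.sample_ids)
    snp_count = len(series.snp_positions)
    # Length is the highest SNP position minus the lowest SNP position.
    length = max(series.snp_positions) - min(series.snp_positions)
    return sample_count, snp_count, length


# Made-up example values, for illustration only.
example = Series(snp_positions=(1000, 1500, 4200, 9000),
                 sample_ids=frozenset({3, 17, 42, 99, 1500}))
print(basic_stats(example))  # (5, 4, 8000)
```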

In [5]:
npy.show(npy.all_series_stats_plot)

This table shows mean, median, and maximum values for these statistics across the 946,618 series. In all cases, the mean is substantially larger than the median and the maximum value is very large compared with either of the averages. The total population of series includes a significant number of series with values for the statistics that are very much larger than the average.
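A small sketch, using made-up numbers rather than the notebook's data, shows how a few very large values pull the mean far above the median. The `summarize` helper is a hypothetical name.

```python
import statistics


def summarize(values):
    """Return the mean, median, and maximum of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "max": max(values),
    }


# Made-up series lengths: five short series and one very long one.
lengths = [2_000, 3_000, 4_000, 5_000, 6_000, 500_000]
summary = summarize(lengths)
print(summary)
```

With these values the single long series raises the mean well above the median, which is the same kind of skew the table reports.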

Population type statistics have also been considered to characterize the source populations of the chromosome samples that express different series. Some discussion of that kind of statistic will be presented later in the notebook.

Genome Statistics

Another use of the statistics is aimed at evaluating the distribution of genetic variation across the genome. This use depends on characterizing the statistics of all of the series that are active at each position in the genome. A series is considered to be active between the positions of its first and last SNPs. The statistics for the active series change every time a new series starts or an active series ends. The data used for the tables and charts below are the result of an analysis that accumulated these statistics at all of the 1,893,236 chromosome positions in the human autosomal chromosomes where a series of four or more SNPs starts or ends.
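The accumulation described above can be sketched as a sweep over start and end events. This is a minimal sketch, not the genomes_dnj_2 implementation: the function name is hypothetical, the intervals are made up, and treating a series as no longer active at its end position is an assumption. Each series contributes one start and one end, which is consistent with 1,893,236 positions being twice 946,618.

```python
from collections import defaultdict


def active_series_counts(intervals):
    """Sweep over series boundaries and count the active series.

    intervals: list of (start, end) positions with start < end.
    Returns a sorted list of (position, active_count); each count holds
    from that position up to the next listed position.
    """
    delta = defaultdict(int)
    for start, end in intervals:
        delta[start] += 1  # a series becomes active at its first SNP
        delta[end] -= 1    # assumption: it stops counting at its last SNP
    counts = []
    active = 0
    for position in sorted(delta):
        active += delta[position]
        counts.append((position, active))
    return counts


# Made-up intervals for three overlapping series.
intervals = [(100, 500), (200, 900), (450, 700)]
events = active_series_counts(intervals)
print(events)
# [(100, 1), (200, 2), (450, 3), (500, 2), (700, 1), (900, 0)]
```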

The analysis uses seven different statistics:

  1. Samples in series are the number of samples among the 5008 thousand genome phase 3 chromosome samples that are expressing the SNPs in at least one series that is active at the observed genome chromosome position.
  2. The series count is the number of active series at the observed genome position.
  3. The sample weighted length is the sum of the product of the samples expressing a series times the length of the series for all active series at the observed position.
  4. The sample weighted SNPs is the equivalent statistic for the SNPs in the active series and the samples that express them.
  5. The SNPs in series is just the sum of the SNPs for all of the active series at the observed position.
  6. The mean series length is the mean length of all of the series that are active at the observed position.
  7. The mean series SNPs is the equivalent statistic for active series SNPs.
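
The seven statistics can be computed from the set of series active at one position. The sketch below assumes each active series is a dict with hypothetical keys `samples` (a set of sample indexes), `snps`, and `length`; the key names and example values are illustrative only.

```python
def position_stats(active):
    """Compute the seven statistics for the series active at one position."""
    n = len(active)
    total_snps = sum(s["snps"] for s in active)
    total_length = sum(s["length"] for s in active)
    return {
        # Samples expressing at least one active series (union, not sum).
        "samples_in_series": len(set().union(*(s["samples"] for s in active))),
        "series_count": n,
        "sample_weighted_length": sum(len(s["samples"]) * s["length"] for s in active),
        "sample_weighted_snps": sum(len(s["samples"]) * s["snps"] for s in active),
        "snps_in_series": total_snps,
        "mean_series_length": total_length / n if n else 0.0,
        "mean_series_snps": total_snps / n if n else 0.0,
    }


# Made-up example: two active series that share sample 3.
active = [
    {"samples": {1, 2, 3}, "snps": 4, "length": 1000},
    {"samples": {3, 4}, "snps": 6, "length": 3000},
]
stats = position_stats(active)
print(stats)
```

Sample 3 expresses both series but is counted once in `samples_in_series`, while it contributes to both terms of the sample weighted sums.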
In [6]:
npy.show(npy.genome_stats_html_plot)

This data shows a one row table for each of the statistics. These tables show the distribution of the statistics across all of the positions in the autosomal chromosomes after the SNP poor centromere regions have been subtracted out. The first column in each table is the mean of the value for the statistic over all of the non centromere chromosome positions. The second column is the median value. At half of the positions, the value is smaller than the median. At the other half it is larger.

The other columns provide a kind of distribution for each statistic. Except for the samples in series, the column values are on a power of 2 scale. The data value for each column is the fraction of chromosome positions that have a higher value for the statistic than the value in the column header. For example, at 0.30 of the non centromere autosomal chromosome positions, more than 4500 of the 5008 thousand genome haploid chromosome samples are expressing at least one active SNP series.
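A column value of this kind can be sketched as follows. The sketch assumes that each observation is a (value, length) pair, where the length is the number of DNA bases over which the value holds, and that the fraction is weighted by that length. The function name and the data are hypothetical.

```python
def fraction_above(segments, thresholds):
    """Fraction of total length whose statistic value exceeds each threshold.

    segments: list of (value, length_in_bases) pairs.
    thresholds: column header values.
    """
    total = sum(length for _, length in segments)
    return {
        t: sum(length for value, length in segments if value > t) / total
        for t in thresholds
    }


# Made-up segments and power of 2 column headers 2, 4, ..., 64.
segments = [(3, 100), (9, 300), (20, 400), (70, 200)]
thresholds = [2 ** k for k in range(1, 7)]
fractions = fraction_above(segments, thresholds)
print(fractions)
```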

The samples in series table establishes the most significant result in this notebook. At most positions in these chromosomes, most of the chromosome samples are expressing some series of genetic variants. Genetic variation is the norm in the human genome. The human genome has no single standard DNA sequence.

The series count table reinforces that conclusion and emphasizes the reality of pervasive human genetic diversity. At the median position for all of the autosomal chromosomes, 13 diverse series of SNPs are active.

The other statistics primarily add additional views of the range of genetic diversity. Significant diversity is observed at the average position in the human genome. But, the statistics also identify regions of the genome that show much higher measures of diversity than that average.

In [7]:
npy.show(npy.genome_stats_plots())

These plots provide another view of the distribution of the statistics of diversity throughout the genome. Each of the 1,893,236 series start or end positions in the human autosomal chromosomes associates a value for each of the statistics with the length of the chromosome DNA between that observation and the next one. The data for these plots associated a statistical value with each of those lengths. The array of values and lengths was sorted by the value. A cumulative sum of the lengths was used to determine how the value for the statistic increased as fractions of the whole genome were summed. The x axis represents the fraction of the genome. The y axis for the samples in series plot is linear. All of the other plots have a log y axis. The difference in the y axis values between any two points on the plotted curves represents the range of the statistic's values over the fraction of the genome that is the difference in the two points' x axis values.
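The sort and cumulative sum can be sketched as below. This is an illustration with made-up data and a hypothetical function name, not the plotting code behind `npy.genome_stats_plots()`.

```python
def cumulative_curve(segments):
    """Return (genome_fraction, value) points for a sorted cumulative plot.

    segments: list of (value, length_in_bases) pairs, one per interval
    between consecutive series start or end positions.
    """
    total = sum(length for _, length in segments)
    curve = []
    covered = 0
    for value, length in sorted(segments):  # sort by statistic value
        covered += length                   # cumulative sum of lengths
        curve.append((covered / total, value))
    return curve


# Made-up segments, deliberately unsorted.
segments = [(20, 400), (3, 100), (70, 200), (9, 300)]
curve = cumulative_curve(segments)
print(curve)
```

Reading the curve at an x value of 0.5 gives the length weighted median of the statistic, which is 20 for this made-up data.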

Chromosome Statistic Plots

Chromosome plots of these statistics make patterns of genetic variation visible. Significant genetic diversity is a pervasive reality throughout the human autosomal chromosomes. But, large variations both in the magnitude of that diversity and in the details of its patterns are easily observed with chromosome plots of these statistics. The next several groups of plots show a variety of examples intended to illustrate the range of the observed diversity and some of its more common patterns.

In [8]:
npy.show(npy.cp1.do_standard_field_plots(30000000, 40180000))

The first group of plots shows a 10 million DNA base segment of chromosome 1. The samples in series plot shows a pattern with basic features that are pervasive throughout the genome. That is a tendency for starts or ends of series to be aligned so that the count of samples expressing at least one series repeatedly goes from zero or close to zero to close to 5008 or the reverse over a short distance in chromosome positions. This chromosome segment shows some of the range of variation in the series statistics. The number of active series at different chromosome positions ranges from less than 10 to more than 60. The mean length of the series ranges from less than 20,000 to more than 300,000 DNA bases. Patterns of series start or end alignment often create rectangular shapes in samples_in_series charts.

In [9]:
npy.show(npy.cp1.do_standard_field_plots(34000000, 35000000))

This group of plots shows a one million base segment from the plots above. It illustrates a region with shorter and fewer series. The mean series length is generally around 20,000 DNA bases. The series count ranges from around 5 to 15. Many of the samples in series rectangles cover regions of the chromosome where only a minority of the chromosome samples are expressing any series. But, several cases of more than 4000 samples expressing at least one series are visible. Even in this region of relatively low magnitude statistics, patterns of genetic diversity are pervasive.

In [10]:
npy.show(npy.cp1.do_standard_field_plots(34040000, 34400000))

This group of plots looks at a shorter segment of the above plots. Steps in the series count chart show most of the individual series start and end positions.

In [11]:
npy.show(npy.cp1.do_standard_field_plots(70000000, 80000000))

This group of plots shows another ten million base segment of chromosome 1. Several examples of chromosome regions with more than average genetic activity are visible. These regions show active series counts in the range from 30 to more than 50 and mean series lengths that range from 100,000 to more than 300,000 DNA bases.

In [12]:
npy.show(npy.cp1.do_standard_field_plots(73000000, 75000000))

This group of plots focuses on a two million base segment of the plots above. They show the alternation between a high genetic activity region, a low genetic activity region, and another high genetic activity region. Although the measures of genetic diversity are much smaller in the middle region, they are still significant. That region includes three peaks in the series count that indicate close to 20 different series and where almost all of the chromosome samples are expressing at least one series.

In [13]:
npy.show(npy.cp11.do_standard_field_plots(44000000, 60000000))

This group of plots shows the centromere region of chromosome 11. By the measures of the plotted statistics, it appears to be the most genetically active region in the human genome. The peak count of active series is close to 200. The peak mean series length is close to 800,000 DNA bases. The association of high genetic diversity with the boundaries of centromere regions is not a universal pattern. But, it is a common one. Chromosomes 3, 5, 6, 7, 8, 10, 11, 12, 16, 18, 19, and 20 all show some pattern of exceptional genetic diversity associated with one or both borders of the chromosome's centromere.

In [14]:
npy.show(npy.cp11.do_standard_field_plots(44000000, 46270000))

This group of plots shows the lower activity region at the lower end of the above plot. While the measures of genetic diversity in this region are much less than those for the region closer to the centromere, they are still clearly significant. The more active area has series counts that peak several times at 20. The less active area has several cases of the series count peaking at 10. There are many peaks where almost all chromosome samples are expressing at least one series.

In [15]:
npy.show(npy.cp11.do_standard_field_plots(44619000, 45110000))

This group of plots focuses on the lowest activity region of the above two sets of plots. Its statistics are dwarfed by those of the more active regions in the above plots. But, they still indicate a pervasive pattern of genetic diversity.

Population Type Statistics

Population type statistics distinguish SNP series by the population sources of the thousand genome chromosome samples that express them. This analysis focuses on the relative contribution of African populations to the chromosome samples that express the studied 946,618 SNP series.

Series were classified by the ratio of the observed frequency of chromosome samples from a regional population to the frequency predicted from the number of that population's chromosome samples among the 5008 thousand genome total. Most of the classification was done by observed to predicted ratios for the African population living in Africa. But, one class identified series that were not expressed within East Asian, European, or South Asian thousand genomes populations. The specific criteria for the different classes are:

  1. The no_afr series have an African region observed to predicted ratio less than 0.1.
  2. The low_afr series have an African region observed to predicted ratio between 0.1 and 0.5.
  3. The some_afr series have an African region observed to predicted ratio between 0.5 and 2.0.
  4. The high_afr series have an African region observed to predicted ratio higher than 2.0 but are expressed by some chromosome samples from East Asian, European, or South Asian populations.
  5. The all_afr series have observed to predicted ratios less than 0.01 for East Asian, European, and South Asian chromosome samples.
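
The criteria above can be sketched as a small classifier. Several details are assumptions that the list does not settle: how values exactly on a boundary (0.1, 0.5, 2.0) are assigned, that the all_afr test is applied only to series with an African ratio above 2.0, and that the East Asian, European, and South Asian ratios are tested separately. The function names and the example counts are hypothetical.

```python
def obs_to_pred(region_expressing, total_expressing, region_samples, total_samples=5008):
    """Observed to predicted ratio for one regional population.

    observed  = share of the samples expressing the series that come from the region
    predicted = share of all chromosome samples that come from the region
    """
    observed = region_expressing / total_expressing
    predicted = region_samples / total_samples
    return observed / predicted


def classify(afr_ratio, non_afr_ratios):
    """Assign a population type from the African ratio and the non African ratios."""
    if afr_ratio > 2.0:
        # Assumption: all_afr is the subset of high ratio series that are
        # essentially absent from every listed non African region.
        if all(r < 0.01 for r in non_afr_ratios):
            return "all_afr"
        return "high_afr"
    if afr_ratio < 0.1:
        return "no_afr"
    if afr_ratio < 0.5:   # assumption: 0.1 itself falls in low_afr
        return "low_afr"
    return "some_afr"     # assumption: 0.5 and 2.0 fall in some_afr


# Made-up counts: 50 of 100 expressing samples come from a region that
# holds 1000 of 5000 total samples.
ratio = obs_to_pred(50, 100, 1000, total_samples=5000)
print(ratio, classify(ratio, [0.2, 0.0, 0.0]))
```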

Each of these classes accounts for a substantial portion of the 946,618 identified series of four or more SNPs. Most of these series are expressed by a significant number of African chromosome samples. Those series have a history that extends far back into the African past and cannot be explained just by some random process associated with the human expansion out of Africa. The large number of all African series provides evidence for a long history of a pattern of pervasive diversity in the selection of SNP series that extends far back into human genetic history. The large number of series expressed by both African and non African populations indicates a very large change in SNP and SNP series frequencies in the human population during the expansion out of Africa. The substantial number of non African series shows that the expansion was associated with substantial formation of new series, selection of very low frequency African series, or both.

In [16]:
npy.show(npy.pop_type_series_stats_plot)

This table provides the basic series statistics by population type.

Series Statistic Distribution By Population Type

Distributions of statistics across series were calculated by population type. The tabular presentation of this data includes tables for series length, the number of chromosome samples expressing the series, and the number of SNPs in the series. The column headers show power of 2 intervals across the range of observed statistic values. The data cell values are the number of series of the row's population type that had a value for the statistic between the value of the data value's column and the value of the preceding column. For example, 21,405 no_afr series have a series length between 40,000 and 80,000 DNA bases.
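The binning can be sketched as follows. The sketch assumes that column headers double from a first bound (10,000 here, which would produce the 40,000 and 80,000 headers mentioned above) and that a value equal to a header falls in that header's column. The function names and the rows are made up for illustration.

```python
from collections import Counter


def bin_upper_bound(value, first_bound):
    """Smallest header first_bound * 2**k (k >= 0) that is >= value."""
    bound = first_bound
    while value > bound:
        bound *= 2
    return bound


def bin_counts(rows, first_bound):
    """Count series per (population type, column header).

    rows: iterable of (pop_type, statistic_value) pairs.
    """
    return Counter((pop, bin_upper_bound(value, first_bound)) for pop, value in rows)


# Made-up series lengths by population type.
rows = [
    ("no_afr", 45_000),
    ("no_afr", 79_999),
    ("no_afr", 5_000),
    ("all_afr", 80_000),
    ("all_afr", 80_001),
]
counts = bin_counts(rows, first_bound=10_000)
print(counts[("no_afr", 80_000)])  # 2
```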

In [17]:
npy.show(npy.pop_type_bin_stats_plot)

These plots provide another view of the distribution of statistic values across series by population type. These plots were generated from series data that was sorted by statistic value. Each x axis represents a statistic value on a log scale. The y axis represents the fraction of series with a value for the statistic less than the x axis value. Different colored regions represent the fraction of series by population type.

In [18]:
npy.show(npy.spo.do_log_plot('length'))
In [19]:
npy.show(npy.spo.do_log_plot('snp_count'))
In [20]:
npy.show(npy.spo.do_log_plot('sample_count'))

SNP Counts and Series Lengths

This table shows a distribution of lengths and SNP counts across all of the series. Column labels represent SNP counts, and the first column contains the length labels. Each statistic value is the number of series that have an SNP count in the range between the preceding column label and the label of the value's column, and a length in the range between the preceding row label and the label of the value's row. For example, 11,162 series have an SNP count between 16 and 32 and a length between 10,000 and 20,000 DNA bases.

In [21]:
npy.show(npy.genome_length_snp_series_stats_plot)

The data shows a substantial correlation between series SNP counts and series lengths. But, there are many long series that don't have high SNP counts. Most high SNP count series are not the longest series.
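One way to put a number on such a correlation is a Pearson coefficient on log scaled values. This is only a sketch of a possible measure, not the analysis used for the table: the choice of Pearson on log2 values is an assumption, and the data are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up series lengths and SNP counts, compared on a log2 scale
# because both statistics span several powers of 2.
lengths = [1000, 2000, 4000, 8000]
snp_counts = [4, 8, 8, 32]
r = pearson([math.log2(v) for v in lengths],
            [math.log2(v) for v in snp_counts])
print(round(r, 3))
```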