
PEDIA’s Machine Learning Challenge

We used 155 cases with a molecularly confirmed diagnosis of a monogenic disease and annotated the following five scores for every gene with a rare variant in the exome:

  • gestalt-score,
  • feature-score,
  • pheno-score,
  • boqa-score,
  • cadd-score.

The gestalt score is based on image analysis from Face2Gene. Feature-, pheno- and boqa- scores are based on a deep phenotypic description of the patient by HPO terminology. The CADD score represents information from the molecular level and is given for the variant in each gene that has the highest pathogenicity.
Here is an example of these scores for the gene NSD1 in a patient with Sotos syndrome:

gene_symbol: NSD1
gestalt_score: 0.3846551123,
feature_score: 0.8401393759,
pheno_score: 0.9519,
boqa_score: 0.0005455328423121414,
cadd_score: 10.735047,

Besides NSD1, there are several hundred more genes, and the challenge is to build a classifier that identifies the disease-causing one with high accuracy. So far we have achieved our best results with an SVM from the sklearn library. We call its output the PEDIA score and visualize it as a Manhattan plot. Every red dot is the disease-causing gene in one sample, and if it lies above the imaginary horizontal line at 0, we would have classified it correctly.
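As a rough illustration of how a per-gene score turns into a prioritization, the sketch below ranks the genes of one case by score and checks the "above 0" criterion for the disease-causing gene. The function name, the placeholder genes other than NSD1, and all score values are made up for illustration; this is not the PEDIA pipeline itself.

```python
def rank_of_causal_gene(scores, causal_gene):
    """Return the 1-based rank of causal_gene when genes are sorted by
    descending score (rank 1 = top candidate)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked.index(causal_gene) + 1


# Hypothetical classifier outputs for one case (illustrative values only).
case_scores = {"NSD1": 1.7, "GENE_B": -0.4, "GENE_C": -1.2, "GENE_D": 0.1}

rank = rank_of_causal_gene(case_scores, "NSD1")
above_zero = case_scores["NSD1"] > 0  # the "horizontal line at 0" criterion
print(rank, above_zero)
```

Averaging such ranks (or counting how often the rank is 1) over all cases gives a simple way to compare two classifiers on the same data.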


Actually, the prioritization that we achieve with the PEDIA score is already a major improvement over conventional approaches that do not include data from image analysis, as you can see in the bar plots.

If you would like to train a classifier yourself, please contact us with the subject “PEDIA challenge” and we will send you a training dataset. If your classifier achieves a higher performance than our current version, we will also be happy to make you a coauthor of our study.

PEDIA study: Prioritization of Exome Data by Image Analysis

Interpreting thousands of sequence variants from an exome analysis is challenging, especially if only the data of the patient is available and no further family members can be used for filtering. In such cases a detailed description of the phenotypic findings is the key to prioritizing candidate mutations. In patients with dysmorphic features, automated image analysis has recently been shown to be very sensitive in detecting even mild phenotypic features. Face2Gene is a leader in this field, and we teamed up with them to test the performance of prioritization strategies that combine information from the molecular as well as from the phenotype level. We would be happy if you would join the PEDIA study and contribute a syndromic case from your lab. Please see the study website and contact us for more details.

Non-coding variant interpretation using GeneTalk

Variants in noncoding regions are always difficult to interpret on a genomic level. In a recently published article in Human Mutation entitled

Rare Noncoding Mutations Extend the Mutational Spectrum in the PGAP3 Subtype of HPMRS

we present an approach for filtering pathogenic noncoding variants and explain how to interpret the implications of these variants on the protein level. We analysed the variants on the RNA level by applying targeted gene panel sequencing to a cDNA library and sequenced all transcripts of the GPI anchor synthesis pathway.

When gDNA variant data is analysed, the filter is set to identify variants in the coding regions that have implications on the protein level. Conversely, when filtering cDNA variant data, variants in noncoding regions provide valuable information about pathogenicity. Variants that are found in retained introns indicate that the transcript is not correctly spliced. By comparing the allelic balance between gDNA and cDNA data, the regulation of a transcript's expression (miRNA-mediated, nonsense-mediated decay, or methylation) can be determined. In our example, a rare 3′UTR variant led to the downregulation of the healthy allele, resulting in predominant expression of the mutant allele.
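The comparison of allelic balance between gDNA and cDNA could be sketched as follows. The function names and the read counts are hypothetical; the sketch only shows the arithmetic, not the analysis used in the article.

```python
def alt_fraction(ref_reads, alt_reads):
    """Fraction of reads supporting the alternative allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this position")
    return alt_reads / total


def expression_shift(gdna_counts, cdna_counts):
    """Difference in alt-allele fraction between cDNA and gDNA.
    Positive: the alternative allele is over-represented in the transcripts."""
    return alt_fraction(*cdna_counts) - alt_fraction(*gdna_counts)


# Hypothetical (ref, alt) read counts at one heterozygous position.
gdna = (52, 48)  # roughly balanced on the genomic level
cdna = (15, 85)  # alternative allele dominates on the RNA level

shift = expression_shift(gdna, cdna)
print(round(shift, 2))
```

A shift close to 0 suggests that both alleles are expressed equally; a large positive or negative shift points to allele-specific regulation or degradation.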

By using GeneTalk on variant data from cDNA sequencing, pathogenic noncoding variants can be identified!


pLI score

Many of you perform trio exome sequencing to detect de novo mutations (DNMs) in an affected individual. This is for a good reason, as we all know that the probability that a coding DNM is disease-causing is very high. However, sometimes there is more than one loss-of-function (LoF) DNM in an exome, or the gene in which it occurs is simply not related to any disease yet. We looked over the shoulders of experts who are facing such a case. They usually go through the ExAC cohort data and count how many LoF variants they can find in such a gene. Now, there is a much more elegant way to do so. The ExAC consortium computed a score, called pLI, that indicates the probability that a gene is intolerant to a loss-of-function mutation. The statistical framework behind this score is explained in detail by Samocha et al. Basically, the departure of a certain mutational class from the expectation is quantified. The figure below shows the distribution of z-scores for synonymous (gray), missense (orange), and protein-truncating, PTV (red) mutations in about 18,000 genes. There is a considerable right-shift in the distribution of missense variants and PTVs, indicating that more genes are intolerant to these classes of mutations. The proportion of genes that are very likely intolerant of loss-of-function variation (pLI ≥ 0.9) is highest for ClinGen haploinsufficient genes, and stratifies by the severity and age of onset of the haploinsufficient phenotype. Vice versa, if you encounter a missense or protein-truncating DNM in a gene with a pLI close to 1, the chances are high that this mutation is disease-causing. As we don’t want you to count all the LoF mutations in a gene in ExAC by hand anymore, we added the pLI to the gene info in GeneTalk. See an example for ZEB2 below.


Analyzing Family Exome Data …

When you ask two lawyers for their opinion, you might get three different answers. In genetics it could get even worse: Corpas et al. analyzed four publicly available exomes of a family with different software tools.

Surprisingly, there was only little overlap in the sequence variants that were assessed as clinically relevant. We would like you to form your own opinion. That’s why we made the exome data of this study available in the Demo account. Enjoy!

Allele balance filter

The allele balance is defined as the fraction of reads that support the alternative allele at a position in a next-generation sequencing data set. In a typical VCF file this ratio can be computed from the information encoded in the AD, AO, and DP fields. For heterozygous genotypes, and especially for de novo calls, the allele balance is often a valuable indicator of quality. Krumm et al. showed, for example, that almost all candidates could be validated by Sanger sequencing if the allelic balance was restricted to the interval [0.3, 0.7].
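A minimal sketch of this computation, assuming a biallelic site and a sample column that carries the AD field (read depth for the reference allele, then for the alternative allele). The function names and read depths are illustrative, and callers that report AO instead of AD would need a different parser.

```python
def allele_balance(format_field, sample_field):
    """Compute alt reads / (ref + alt reads) from the AD entry of one
    biallelic VCF sample column. Returns None if AD is missing or unusable."""
    values = dict(zip(format_field.split(":"), sample_field.split(":")))
    ad = values.get("AD")
    if ad is None:
        return None
    parts = ad.split(",")
    if len(parts) < 2 or "." in parts:
        return None
    ref_depth, alt_depth = int(parts[0]), int(parts[1])
    total = ref_depth + alt_depth
    return alt_depth / total if total else None


def passes_balance_filter(balance, low=0.3, high=0.7):
    """True if the allele balance lies in the closed interval [low, high]."""
    return balance is not None and low <= balance <= high


# Example FORMAT and sample columns (made-up read depths).
ab = allele_balance("GT:AD:DP", "0/1:12,9:21")
print(round(ab, 3), passes_balance_filter(ab))
print(passes_balance_filter(allele_balance("GT:AD:DP", "0/1:40,4:44")))
```

The first call has 9 of 21 reads supporting the alternative allele and passes the filter; the second has only 4 of 44 and is rejected.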

In GeneTalk you can now use the allelic balance to refine your call set to high-confidence candidates. Enjoy!


Rare Variant Association Studies

Yippie, our new paper about rare variant association studies, RVAS, just appeared in Bioinformatics! In this work we describe strategies to optimise the probability of detecting the disease-causing mutations in a cohort of patients. Obviously, the detection power depends on the size of your case group and the genetic variability of the true disease gene. We tested multiple Mendelian disorders and found that our approach outperformed the existing analysis strategies that are based on simple intersection filtering. In the figure below you can see three example studies from the literature, a cohort of 10 patients with Kabuki make-up syndrome, 7 patients with Catel-Manzke syndrome, and 13 patients with Mabry syndrome, where the disease-causing gene can be readily identified. A suitable matching technique for the controls helps to decrease spurious artifacts from heterogeneous data quality and population backgrounds.
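To give a feeling for why a statistical comparison against controls can be more robust than intersection filtering, here is a generic gene-level burden comparison using a one-sided Fisher's exact test. This is a textbook technique chosen for illustration, not the method of the paper, and the carrier counts are hypothetical (only the cohort size of 10 echoes the example above).

```python
from math import comb


def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """P-value for observing at least case_carriers carriers among the cases
    (upper hypergeometric tail, i.e. a one-sided Fisher's exact test)."""
    total = n_cases + n_controls
    carriers = case_carriers + control_carriers
    upper = min(carriers, n_cases)
    tail = sum(
        comb(carriers, k) * comb(total - carriers, n_cases - k)
        for k in range(case_carriers, upper + 1)
    )
    return tail / comb(total, n_cases)


# Hypothetical counts: 8 of 10 cases carry a rare variant in the gene,
# compared with 1 of 100 matched controls.
p_value = fisher_one_sided(8, 10, 1, 100)
print(p_value)
```

Unlike intersection filtering, such a test still yields a signal when a few cases carry no variant in the gene, for example because of locus heterogeneity or low coverage.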

Maybe you also have some unsolved cases, so don’t hesitate to contact our tech support for more information about this approach.


We are looking forward to meeting you at our Booth #462 !

The frequency filter Hardy Weinberg dreamed of

You suspect a recessive mode of inheritance. What cutoff for the frequency filter should you select? Well, it depends: are you looking for a homozygous pathogenic mutation or for compound heterozygotes? This makes a big difference, and the Hardy-Weinberg principle might help you to decide. Let’s review the math: In a sufficiently large population there is a relationship between allele and genotype frequencies if certain conditions are met. If f(a) is the frequency of allele ‘a’, the frequency of the genotype ‘aa’ should be close to g(aa)=f(a)*f(a). Let’s have a look at the figure: There are 10 individuals, 1 of them shows genotype ‘aa’, 4 individuals are heterozygous and the remaining 5 show the wildtype genotype ‘AA’. One of the heterozygous individuals inherited their allele ‘a’ from their mother and the other 3 from their father. Thus, the genotype frequency of ‘aa’ is g(aa)=1/10. For the allele frequency we have to count the total number of ‘a’s and divide it by the number of all copies of the gene, f(a)=6/20. In this example we could state that the allele and genotype frequencies are in equilibrium, as f(a)*f(a)=36/400 is close to g(aa)=1/10.
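The arithmetic of this example can be reproduced with exact fractions. The function name is our own; the genotype counts are the ones from the figure described above.

```python
from fractions import Fraction


def frequencies(n_AA, n_Aa, n_aa):
    """Return (f(a), g(aa)) from genotype counts in a diploid population."""
    n = n_AA + n_Aa + n_aa
    allele_freq = Fraction(n_Aa + 2 * n_aa, 2 * n)  # copies of 'a' / all copies
    genotype_freq = Fraction(n_aa, n)
    return allele_freq, genotype_freq


f_a, g_aa = frequencies(n_AA=5, n_Aa=4, n_aa=1)
expected_g_aa = f_a * f_a  # Hardy-Weinberg expectation for genotype 'aa'
print(f_a, expected_g_aa, g_aa)  # 3/10 9/100 1/10
```

Fractions are printed in reduced form, so 6/20 appears as 3/10 and 36/400 as 9/100.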

Now, what will happen if there is selective pressure on individuals with genotype ‘aa’? This is certainly the case for pathogenic alleles in recessive disease genes, and the homozygous individual in the example above is already fading. In this case the ‘aa’s are removed from the pool, but the effect on the allele frequency is not so overwhelming: it is still at f(a)=4/18. However, the allele and genotype frequencies are no longer in Hardy-Weinberg equilibrium. Actually, Hardy-Weinberg disequilibrium is often a strong indication of pathogenicity, and the mere existence of homozygotes in a healthy control group is a strong argument for ruling out a candidate mutation.

Now let’s think about how you can use that information for your filtering strategy. Let’s assume the recessive disease you are trying to elucidate has an incidence of about 1 in 10,000 individuals. This means the carrier rate of the risk allele could be as high as 2%, or two in a hundred healthy individuals. However, if there are more than, say, 6 homozygotes in 60,000 controls, you should wonder whether this is really the disease-causing mutation.
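The numbers in this paragraph follow directly from the Hardy-Weinberg relationship, as the short sketch below shows. It assumes that a single pathogenic allele accounts for the whole incidence, which is why the result is an upper bound; the function name is our own.

```python
from math import sqrt


def carrier_rate_from_incidence(incidence):
    """Under Hardy-Weinberg equilibrium the incidence of a recessive disease
    is q**2, and the heterozygous carrier rate is 2 * q * (1 - q)."""
    q = sqrt(incidence)
    return 2 * q * (1 - q)


incidence = 1 / 10_000
carriers = carrier_rate_from_incidence(incidence)
expected_homozygotes = incidence * 60_000  # in a cohort of 60,000 individuals
print(round(carriers, 4), round(expected_homozygotes, 1))
```

The carrier rate comes out at roughly 2%, and about 6 homozygotes would be expected among 60,000 individuals drawn from the general population. In a cohort of healthy controls you would expect fewer, so finding more than that speaks against pathogenicity.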

When designing the new frequency filter, which works on genotype frequencies from several thousand healthy controls, we could almost hear your complaints: “Come on, do you really expect me to do this mental arithmetic every time I am analyzing a case?” That’s why we tried to be smart about the ‘aa’s: Once you set the cutoff for the heterozygous genotype frequency, we will automatically set the homozygous genotype frequency to a reasonably lower value and leave only the fine adjustment to you. Enjoy!
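The exact rule that GeneTalk applies is not spelled out here. One plausible way to derive a homozygous cutoff from a heterozygous one, assuming Hardy-Weinberg equilibrium, is to solve 2*q*(1-q) = cutoff for the allele frequency q and square it. The function below is a hypothetical sketch of that idea only.

```python
from math import sqrt


def homozygous_cutoff(het_cutoff):
    """Solve 2*q*(1-q) = het_cutoff for the allele frequency q (smaller root)
    and return q**2, the genotype frequency expected for homozygotes."""
    if not 0 <= het_cutoff <= 0.5:
        raise ValueError("heterozygous genotype frequency must be in [0, 0.5]")
    q = (1 - sqrt(1 - 2 * het_cutoff)) / 2
    return q * q


hom = homozygous_cutoff(0.02)  # heterozygous cutoff of 2%
print(hom)
```

For a heterozygous cutoff of 2% this gives a homozygous cutoff of roughly 1 in 10,000, which you could then tighten further for alleles under strong selection.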




Pedigree Predictor

The number of potentially disease-causing variants in an affected individual can effectively be reduced if further samples of the family are available for the analysis. As you already know, GeneTalk provides highly effective filters for different modes of inheritance that work on multiple VCF files.
However, these filters will only yield the correct results if the relationships between the samples have been defined properly. If the labels of samples have been erroneously mixed up or the DNA of another individual has been sequenced, your diagnostic workup will fail.
Luckily, with several thousands of genotypes in a multiple VCF file it is possible to reconstruct the relationships between samples. We implemented a new feature in GeneTalk and are proud to present the pedigree predictor: In the pedigree editor there is a new “guess” button that will estimate the relationships of your samples. The predicted family structure is displayed in a pedigree. If the predicted relationships do not agree with your expectations this might point to a sample mix-up. So don’t get tricked!
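How the pedigree predictor works internally is not described here. A common building block for this kind of check is identity-by-state (IBS) sharing: a parent and a child share at least one allele at every site, so sites with opposite homozygous genotypes (IBS0) should be practically absent. The sketch below illustrates that idea with made-up genotypes; the function names and the tolerance of 1% for genotyping errors are assumptions.

```python
def ibs_counts(genotypes_a, genotypes_b):
    """Count sites where two samples share 0, 1 or 2 alleles identical by
    state. Genotypes are alt-allele counts (0, 1, 2) at the same sites."""
    counts = {0: 0, 1: 0, 2: 0}
    for a, b in zip(genotypes_a, genotypes_b):
        counts[2 - abs(a - b)] += 1
    return counts


def compatible_with_parent_child(genotypes_a, genotypes_b, max_ibs0_rate=0.01):
    """True if the fraction of IBS0 sites is low enough for a parent-child
    pair (max_ibs0_rate is an assumed tolerance for genotyping errors)."""
    counts = ibs_counts(genotypes_a, genotypes_b)
    n_sites = sum(counts.values())
    return n_sites > 0 and counts[0] / n_sites <= max_ibs0_rate


# Made-up genotypes at six shared sites (real data would use thousands).
child = [0, 1, 2, 1, 0, 1]
mother = [1, 1, 1, 2, 0, 0]
unrelated = [2, 1, 0, 1, 2, 1]

print(compatible_with_parent_child(child, mother))
print(compatible_with_parent_child(child, unrelated))
```

A low IBS0 rate is compatible with a parent-child relationship but does not prove it, so other relationships such as siblings need additional statistics to be told apart.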

If you experience any predictions that you don’t trust, please let us know (verena.heinrich (at)!