PEDIA’s Machine Learning Challenge

We used 155 cases with a molecularly confirmed diagnosis of a monogenic disease and annotated the following five scores for every gene with a rare variant in the exome:

  • gestalt-score,
  • feature-score,
  • pheno-score,
  • boqa-score,
  • cadd-score.

The gestalt score is based on image analysis from Face2Gene. The feature, pheno and boqa scores are based on a deep phenotypic description of the patient in HPO (Human Phenotype Ontology) terminology. The CADD score represents information from the molecular level: for each gene, it is the score of the variant with the highest predicted pathogenicity.
Here is an example of these scores for the gene NSD1 in a patient with Sotos syndrome:

gene_symbol: NSD1
gestalt_score: 0.3846551123
feature_score: 0.8401393759
pheno_score: 0.9519
boqa_score: 0.0005455328423121414
cadd_score: 10.735047
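
As an illustration of how such a record could be handled in code, the following is a minimal Python sketch. It assumes a hypothetical input layout of one (gene, CADD) pair per rare variant, which is not the format of the actual training dataset. The second NSD1 variant and the gene name GENE_X are made-up placeholders; only the NSD1 scores are taken from the example above.

```python
# Hypothetical layout: one (gene_symbol, cadd_score) pair per rare variant.
# Only the first NSD1 value comes from the example; the rest are placeholders.
variants = [
    ("NSD1", 10.735047),
    ("NSD1", 4.2),       # made-up second variant in the same gene
    ("GENE_X", 22.9),    # made-up gene and score
]

def max_cadd_per_gene(variants):
    """Keep, for every gene, the CADD score of its most pathogenic variant."""
    best = {}
    for gene, cadd in variants:
        if gene not in best or cadd > best[gene]:
            best[gene] = cadd
    return best

# Fixed feature order so that every gene maps to a comparable 5-dimensional vector.
FEATURES = ["gestalt_score", "feature_score", "pheno_score", "boqa_score", "cadd_score"]

def feature_vector(record):
    """Turn one per-gene score record into a plain list of five floats."""
    return [float(record[name]) for name in FEATURES]

nsd1 = {
    "gene_symbol": "NSD1",
    "gestalt_score": 0.3846551123,
    "feature_score": 0.8401393759,
    "pheno_score": 0.9519,
    "boqa_score": 0.0005455328423121414,
    "cadd_score": max_cadd_per_gene(variants)["NSD1"],
}

print(feature_vector(nsd1))
```

Keeping the feature order in one place (FEATURES) avoids silently mixing up columns when records are later stacked into a matrix for a classifier.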

Besides NSD1, there are several hundred more genes, and the challenge is to build a classifier that identifies the disease-causing one with high accuracy. So far, we have achieved our best results with an SVM from the sklearn library. We call the resulting score the PEDIA score and visualize it as a Manhattan plot. Every red dot is the disease-causing gene in one sample; if it lies above the imaginary horizontal line at 0, it has been classified correctly.
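
For readers who want a starting point, here is a rough sketch of this kind of setup with sklearn. It is not the PEDIA implementation: the data are synthetic, the linear kernel and the class weighting are assumptions, and it assumes that the score plotted against the line at 0 behaves like the signed distance returned by `decision_function`, which the description suggests but does not state.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 cases, 50 candidate genes each, five scores per gene.
# By construction, the first gene of every case is the "disease-causing" one.
n_cases, genes_per_case = 20, 50
X = rng.random((n_cases * genes_per_case, 5))
y = np.zeros(n_cases * genes_per_case, dtype=int)
case_ids = np.repeat(np.arange(n_cases), genes_per_case)
causal_rows = np.arange(n_cases) * genes_per_case
X[causal_rows] += 0.5  # shift causal genes so the toy problem is learnable
y[causal_rows] = 1

# Split by case, not by gene, so that no case contributes to both sets.
train = case_ids < 15
test = ~train

# class_weight="balanced" compensates for one causal gene among many others.
clf = SVC(kernel="linear", class_weight="balanced")
clf.fit(X[train], y[train])

# Signed distance to the decision boundary; values above 0 are predicted causal.
scores = clf.decision_function(X[test])

def causal_rank(case_scores, case_labels):
    """1-based rank of the causal gene within one case (1 = top of the list)."""
    causal_score = case_scores[case_labels == 1][0]
    return int((case_scores > causal_score).sum()) + 1

test_cases = case_ids[test]
ranks = [
    causal_rank(scores[test_cases == c], y[test][test_cases == c])
    for c in range(15, n_cases)
]
print(ranks)
```

Splitting by case rather than by gene matters here: genes from the same patient share the same phenotype-derived information, so mixing them across training and test sets would overstate performance.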


The prioritization that we achieve with the PEDIA score is already a major improvement over conventional approaches that do not include data from image analysis, as the bar plots show.
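
One common way to quantify such a prioritization is top-k accuracy: the fraction of cases in which the disease-causing gene is ranked among the first k candidates. The sketch below assumes this metric for illustration only; the measure shown in the bar plots may differ, and the ranks are made-up numbers, not study results.

```python
def top_k_accuracy(ranks, k):
    """Fraction of cases whose disease-causing gene is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must contain at least one case")
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Made-up 1-based ranks of the disease-causing gene in five hypothetical cases.
example_ranks = [1, 3, 1, 12, 2]

print(top_k_accuracy(example_ranks, 1))   # 0.4
print(top_k_accuracy(example_ranks, 10))  # 0.8
```

Computing the same metric once with and once without the gestalt score as an input feature would give a direct comparison between the two approaches.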

We will also tweet the latest news about our study, so please follow us.

If you would like to train a classifier yourself, please contact us with the subject “PEDIA challenge” and we will send you a training dataset. If your classifier achieves a higher performance than our current version, we will also be happy to make you a coauthor of our study.