Many computational approaches exist for predicting the effects of amino acid substitutions from protein sequence. These are often (incorrectly) used for judging disease predisposition from individual exomic variation. Notably, all available prediction methods “top out” at about the same performance on a set of experimentally determined variant effects, regardless of the complexity of the underlying algorithms or the number of protein features considered in making the prediction. We note that prediction errors likely stem from the fact that different methods are trained to recognize different patterns. Some consider changes in protein molecular function, others focus on differences in selection pressure, but most aim to differentiate variation across orthologs from known severely damaging variants, e.g. those that cause monogenic disease. Regardless of the training set, however, the vast majority of these methods value evolutionary information as a key contributor to the final decision.
We considered whether the protein sequence position class (rheostat or toggle) affects these predictions. The classes are defined as follows: experimentally evaluated effects of amino acid substitutions at toggle positions are binary, while those at rheostat positions show progressive changes. In our testing, all evaluated methods failed two key expectations: toggle neutrals were incorrectly predicted as more non-neutral than rheostat non-neutrals, and toggle and rheostat neutrals were incorrectly predicted to be different. Since many toggle positions are conserved, and most rheostats are not, predictors appear to annotate position conservation better than mutational effect. This finding can explain why predictors assign disproportionate weight to evolutionary information as an input feature, as well as the field’s inability to improve predictor performance.
We thus propose that distinguishing between rheostat and toggle positions is necessary prior to attempting variant effect prediction.
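As an illustration only, the two expectations described above could be checked for a single predictor roughly as follows. This is a minimal sketch, not the procedure used in this work: the function name `check_expectations`, the `tolerance` threshold, the use of medians, the assumption that a higher score means "more likely non-neutral", and the toy scores are all hypothetical.

```python
from statistics import median


def check_expectations(scores, tolerance=0.1):
    """Check the two expectations for one predictor (illustrative sketch).

    `scores` maps (position_class, experimental_effect) to a list of
    predictor scores. Assumption: higher score = more likely non-neutral.
    `tolerance` is a hypothetical cutoff for calling two medians "similar".
    """
    toggle_neutral = median(scores[("toggle", "neutral")])
    rheostat_neutral = median(scores[("rheostat", "neutral")])
    rheostat_non_neutral = median(scores[("rheostat", "non-neutral")])
    return {
        # Expectation 1: toggle neutrals should score as less non-neutral
        # than rheostat non-neutrals.
        "toggle_neutral_below_rheostat_non_neutral":
            toggle_neutral < rheostat_non_neutral,
        # Expectation 2: neutrals should score similarly in both classes.
        "neutrals_similar_across_classes":
            abs(toggle_neutral - rheostat_neutral) <= tolerance,
    }


# Made-up scores that mimic the failure pattern described in the text;
# these are NOT experimental or published values.
toy_scores = {
    ("toggle", "neutral"): [0.70, 0.80, 0.75],
    ("rheostat", "neutral"): [0.20, 0.30, 0.25],
    ("rheostat", "non-neutral"): [0.50, 0.60, 0.55],
}
result = check_expectations(toy_scores)
print(result)
```

A real analysis would replace the median comparison with an appropriate statistical test on the score distributions; the medians are used here only to keep the sketch short.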