# GENEO and Explainable Machine Learning applied to protein pocket detection

This post presents GENEOs [1], a new family of geometric techniques for explainable machine learning, which our research group at the University of Milan and the University of Bologna is applying to problems of drug design and molecular docking. The research is carried out in collaboration with Dompé, an Italian pharmaceutical company.

The key concept here is that of a data observer: an agent, human or artificial, that looks at data to extract relevant and valuable information. For example, when a dermatologist examines a picture of skin moles, he or she plays the role of a data observer who must say something about the nature of the moles depicted.

This simple example highlights three facts. First, we are usually more interested in the observer's judgment than in the data itself: once we know that the moles are not anomalous, we can forget about the image. Second, if we have reason to doubt the dermatologist's judgment, we may want to consult other observers and compare their responses to see whether they agree. Finally, not every observer is suitable for every problem: we chose a dermatologist to examine moles because another observer, such as a cardiologist, would very likely not have given an equally accurate response.

So far we have talked about human observers, but in a digital world where the amount of data is growing extremely fast, we are interested in developing artificial data observers. GENEOs (Group Equivariant Non-Expansive Operators) were introduced to build a mathematical theory that formalizes the concept of an (artificial) data observer. In their simplest definition, GENEOs are operators that take as input functions defined on a topological space X and return functions of the same kind. They have two defining properties: equivariance and non-expansivity. Equivariance with respect to a group G of geometric transformations of X means that a GENEO F commutes with every element of G: transforming the input and then applying F gives the same result as applying F and then transforming the output. In this sense, we can say that GENEOs safely ignore the transformations in G.
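In symbols, the equivariance property can be written as follows for the simplified setting described here, where input and output functions live on the same space X. The names Φ (the set of admissible functions) and φ are notation assumed for this illustration, not taken from the post.

```latex
F(\varphi \circ g) = F(\varphi) \circ g
\qquad \text{for every } \varphi \in \Phi \text{ and every } g \in G .
```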

Non-expansivity is a form of regularity: the distance between two outputs is never larger than the distance between the corresponding inputs. This kind of regularity is useful because operators are often required to simplify the metric structure of the data. Moreover, under mild hypotheses it can be proven that the space of GENEOs is compact and convex. These results are crucial for establishing a new machine learning paradigm based on GENEOs.

Adopting GENEOs has some important advantages. Equivariance makes it possible to inject a priori knowledge into the model and guarantees that irrelevant geometric transformations are ignored, which reduces the number of parameters compared with other methods such as neural networks. Convexity means that a convex combination of GENEOs is again a GENEO, so different operators can be combined to obtain new ones that may perform better, and the problem of finding the ‘best’ GENEO for a given task can be approached as an optimization over this space.
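The two properties, and the effect of convexity, can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions, not the model described in this post: functions are sampled on a periodic 1D grid, the group G is the group of cyclic shifts, distances are measured with the sup norm, and the helper names `make_conv_operator` and `convex_combination` are hypothetical. A circular convolution whose kernel has unit L1 norm is both shift-equivariant and non-expansive, and so is any convex combination of such operators.

```python
import numpy as np

def make_conv_operator(kernel):
    """Return F(phi) = circular convolution of phi with `kernel`.

    The kernel is rescaled to unit L1 norm, which guarantees
    max|F(a) - F(b)| <= max|a - b| (non-expansivity in the sup norm).
    """
    kernel = np.asarray(kernel, dtype=float)
    kernel = kernel / np.abs(kernel).sum()

    def operator(phi):
        # Zero-pad the kernel to the signal length and convolve via FFT.
        padded = np.zeros(len(phi))
        padded[: len(kernel)] = kernel
        return np.real(np.fft.ifft(np.fft.fft(phi) * np.fft.fft(padded)))

    return operator

def convex_combination(operators, weights):
    """Combine operators with non-negative weights that sum to 1."""
    weights = np.asarray(weights, dtype=float)
    assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
    return lambda phi: sum(w * op(phi) for w, op in zip(weights, operators))

rng = np.random.default_rng(0)
a, b = rng.normal(size=64), rng.normal(size=64)

F1 = make_conv_operator([1, 2, 1])    # smoothing kernel (assumed example)
F2 = make_conv_operator([1, 0, -1])   # difference kernel (assumed example)
F = convex_combination([F1, F2], [0.3, 0.7])

shift = lambda phi: np.roll(phi, 5)   # one element of the group of cyclic shifts

# Equivariance: shifting then applying F equals applying F then shifting.
equivariant = np.allclose(F(shift(a)), shift(F(a)))
# Non-expansivity: outputs are no farther apart than inputs (sup norm).
non_expansive = np.max(np.abs(F(a) - F(b))) <= np.max(np.abs(a - b)) + 1e-12

print(equivariant, non_expansive)
```

The weights of the convex combination are ordinary real parameters, so a model built this way can be tuned by adjusting a handful of numbers while staying inside the space of admissible operators.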

Features such as the small number of parameters and the injection of prior knowledge make GENEO-based models considerably easier to interpret, which places this new approach within the field of explainable machine learning.

As a first, promising application of this method, we developed a model for detecting pockets on the surface of proteins, intended for the early phases of drug development. Our initial results suggest that most of the promises have been fulfilled. The GENEO-based model uses a data representation built with prior knowledge, and the data are processed by convolutional GENEOs constructed to be equivariant with respect to isometries of Euclidean space (the presence of a pocket does not depend on the position or orientation of the protein in space). The method has just 17 parameters, including those that define the GENEO families and those involved in the convex combination, and thanks to this simplicity the model is fully explainable. Last but most important, the model achieved performance comparable with, and sometimes better than, other state-of-the-art methods, both in identifying pockets and in ranking them from the most to the least promising.
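To give a feel for why a convolutional operator can be equivariant with respect to isometries, here is a minimal sketch under simplifying assumptions. It is not the model described above: it works on a periodic 2D grid rather than in 3D, the Gaussian-shell kernel and the names `radial_kernel` and `convolve` are hypothetical choices, and only the isometries that map the grid onto itself (shifts, 90-degree rotations, reflections) can be checked exactly. The key point is that the kernel depends only on the distance from the origin.

```python
import numpy as np

def radial_kernel(n, radius, width):
    """Gaussian shell centred on the periodic origin of an n x n grid.

    The value depends only on the distance from the origin, so the kernel
    is unchanged by 90-degree rotations and reflections of the grid.
    """
    idx = np.arange(n)
    d = np.minimum(idx, n - idx)                     # periodic distance per axis
    r = np.sqrt(d[:, None] ** 2 + d[None, :] ** 2)   # distance from the origin
    k = np.exp(-((r - radius) ** 2) / (2 * width ** 2))
    return k / k.sum()   # positive with unit sum: non-expansive in the sup norm

def convolve(phi, kernel):
    """Circular 2D convolution computed with the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(phi) * np.fft.fft2(kernel)))

rng = np.random.default_rng(0)
phi = rng.random((32, 32))                 # stand-in for a gridded input function
k = radial_kernel(32, radius=4.0, width=1.5)
out = convolve(phi, k)

# Moving the input and then convolving matches convolving and then moving.
rotation_ok = np.allclose(convolve(np.rot90(phi), k), np.rot90(out))
reflection_ok = np.allclose(convolve(np.flipud(phi), k), np.flipud(out))
translation_ok = np.allclose(
    convolve(np.roll(phi, (3, -7), axis=(0, 1)), k),
    np.roll(out, (3, -7), axis=(0, 1)),
)

print(rotation_ok, reflection_ok, translation_ok)
```

In this toy setting the whole operator is described by two numbers, `radius` and `width`, which illustrates how building the symmetry into the kernel keeps the parameter count small.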