Mathematical modeling is fundamental to understanding and predicting natural phenomena. The classic approach of hypothesis-driven modeling has been successful in many domains of science. In biology, however, the resulting models often have many unknown parameters, or it remains unclear which hypothesis should serve as the basis for the model. It has therefore been a long-standing dream of systems biologists to derive models and hypotheses directly from data in a purely data-driven and unbiased way. Ivo Sbalzarini, Professor of Computer Science at the TU Dresden and research group leader at the Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG), and his research group took on this challenge with the goal of developing an algorithm that can directly learn interpretable and physically correct mathematical models from data.

Suryanarayana Maddu, a computational scientist from India and PhD student in the group of Ivo Sbalzarini at the Center for Systems Biology Dresden (CSBD) in the framework of ScaDS.AI (the BMBF Center for Scalable Data Analytics and Artificial Intelligence Dresden/Leipzig), tackled this fundamental machine-learning problem in collaboration with Christian Müller, Professor of Statistics at the LMU München and group leader at the Flatiron Institute in New York City. The result is a statistical learning framework whose algorithm automatically learns mathematical models directly from data while adhering to fundamental physical laws and remaining robust to noisy data. This allows scientists to interpret their data faster, enables predictive computer simulations of complex space-time dynamics, and catalyzes mechanistic insight into biological processes. The algorithm is based on the idea of group-sparse regression, which incorporates existing physical or chemical knowledge into a machine learning problem and then finds the simplest physically consistent model that can robustly explain the data.
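To give a feel for what "group-sparse regression" means, here is a minimal sketch in Python. It is a generic group lasso solved by proximal gradient descent, not the authors' implementation, and the published algorithm may differ in its details. The function names, the synthetic data, the grouping of candidate terms, and the penalty strength `lam` are all assumptions made for illustration. The idea it shows is that candidate model terms are bundled into groups (standing in here for terms that prior knowledge says should appear or vanish together), and the penalty switches whole groups on or off rather than individual coefficients.

```python
import numpy as np


def group_soft_threshold(v, threshold):
    """Shrink the whole vector v toward zero; return exact zeros if its norm is small."""
    norm = np.linalg.norm(v)
    if norm <= threshold:
        return np.zeros_like(v)
    return (1.0 - threshold / norm) * v


def group_lasso(X, y, groups, lam, n_iter=2000):
    """Minimize (1/2n)||Xw - y||^2 + lam * sum_g sqrt(|g|) * ||w_g||_2.

    `groups` is a list of index lists and is assumed to partition the columns of X.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    # Step size 1/L, where L is the Lipschitz constant of the least-squares gradient.
    step = n_samples / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n_samples
        z = w - step * grad
        w_new = np.zeros(n_features)
        for idx in groups:
            # The proximal step keeps or removes each group as a unit.
            w_new[idx] = group_soft_threshold(z[idx], step * lam * np.sqrt(len(idx)))
        w = w_new
    return w


# Hypothetical toy problem: six candidate terms in three groups, only the first group is active.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
w_true = np.array([1.5, -2.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.05 * rng.standard_normal(200)

groups = [[0, 1], [2, 3], [4, 5]]
w_hat = group_lasso(X, y, groups, lam=0.2)
print(w_hat)
```

In this toy setting the two inactive groups are removed entirely, while the active group is retained with somewhat shrunken coefficients, a known side effect of this type of penalty. In practice the penalty strength would have to be chosen with care, for example by checking which groups are selected consistently across resampled data.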

In their study, the researchers led by Ivo Sbalzarini looked at examples from biology, such as the identification of cellular signal transduction pathways, the engines of chemical information processing in living cells. The algorithm was fed enzyme concentration measurements along with a few basic chemistry rules on how chemical reactions occur. It was then able to identify the correct signaling pathway and its associated kinetic reaction rates directly from a small amount of noisy data. The authors demonstrate that incorporating physics knowledge into the machine learning process makes the algorithm significantly more robust against perturbations in the data and uncertainties in the modeling process itself. The researchers also applied their algorithm to protein concentration data from a membrane patterning event that occurs in cells. The challenge was to learn the physical model responsible for the observed patterns. This time, the algorithm was additionally given a hint about the presence of a hidden variable for which no measurements were provided. The group-sparse regression algorithm was then able to learn the underlying physical model, along with the values of the hidden variable, which came as a pleasant bonus.

In the future, the Sbalzarini lab envisions integrating the algorithm into microscopes for real-time analysis and data-driven physical modeling of the measurement data. The biologist or physicist can then interact with the algorithm, possibly in the CSBD Virtual Reality CAVE, to explore the basic rules that the algorithm incorporates into the learning process in order to output a physically consistent mechanistic model that can explain the biological process at hand in space and time. This is an important step toward the data-driven and AI-powered digital laboratory of the future.

**Original Publication:**

Suryanarayana Maddu, Bevan L. Cheeseman, Christian L. Müller, and Ivo F. Sbalzarini: Learning physically consistent differential equation models from data using group sparsity. Phys. Rev. E 103, 042310 (13 April 2021). doi: 10.1103/PhysRevE.103.042310