
Arriving at the Cambridge train station on January 7th, I was ready for a one-month secondment as part of the REMODEL Staff Exchange project. This project focuses on the mathematical aspects of deep learning algorithms and their applications, a research topic of key relevance for my PhD work. Cambridge, with its long academic history and many beautiful colleges, is an ideal location for a research stay abroad. My host for this visit was Professor Carola-Bibiane Schönlieb, leader of the Cambridge Image Analysis (CIA) Group.
Understanding Large Language Models
Transformers have revolutionized AI, forming the foundation of large language models (LLMs) such as ChatGPT. These models rely on self-attention mechanisms, enabling them to process and generate text with remarkable fluency and coherence. Despite their success, transformers still present several challenges, including unstable training dynamics, poor interpretability, and significant computational costs. A key issue is how information propagates through transformer layers. Since each layer iteratively refines the input embeddings, the forward pass resembles a dynamical system in which data points evolve over multiple steps. As network depth increases, however, training can become unstable due to vanishing or exploding gradients, leading to unpredictable behavior and degraded performance in LLMs. My research aims to address these issues by viewing transformers through the lens of dynamical systems, providing new insights into their training behavior.
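To make this picture concrete, here is a minimal toy sketch in Python (my own illustration, not an actual transformer: the shared random weights, the tanh update, and the step size are assumptions chosen for demonstration). A stack of residual layers iterates x_{k+1} = x_k + h·f(x_k), and the gradient reaching the input is a product of per-layer Jacobians that can grow or shrink with depth:

```python
import torch

torch.manual_seed(0)

def input_grad_norm(depth: int, step: float, dim: int = 64) -> float:
    """Iterate the residual update x_{k+1} = x_k + step * tanh(W x_k)
    for `depth` layers and return the gradient norm at the input.
    A toy stand-in for a deep residual stack viewed as a discrete
    dynamical system (weights are shared and random, unlike a real model)."""
    W = torch.randn(dim, dim) / dim**0.5      # random layer weights
    x = torch.randn(dim, requires_grad=True)  # "input embedding"
    h = x
    for _ in range(depth):
        h = h + step * torch.tanh(W @ h)      # one residual layer = one time step
    h.sum().backward()                        # backprop a scalar readout
    return x.grad.norm().item()

# Smaller steps keep each layer's Jacobian (I + step * J_f) closer to the
# identity, which tempers how fast gradients grow or shrink with depth.
for step in (1.0, 0.1):
    print(f"step={step}: input gradient norm at depth 100 = "
          f"{input_grad_norm(100, step):.3e}")
```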
Transformers as Dynamical Systems
By interpreting transformer layers as the discretization steps of an ordinary differential equation (ODE), we can analyze their stability and improve their training. This mathematical perspective offers several advantages:
- Preventing Gradient Instabilities: Residual connections in transformers can be understood as discretized time steps of an ODE. By carefully designing these steps, we can prevent gradients from vanishing or exploding, ensuring stable training even in deep networks (the correspondence is spelled out after this list).
- Controlling Model Behavior: Transformers can be formulated using energy-based models, where attention and feed-forward layers correspond to energy functions. This structure allows us to impose constraints on how models generate outputs, reducing issues like biased or harmful content in LLMs (a minimal code sketch of this connection follows the list).
- Enhancing Interpretability: The dynamical systems approach provides a framework to understand how information flows through a network. By identifying conserved quantities or invariant structures, we can gain a deeper understanding of why certain LLMs perform well and how to improve them.
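To spell out the first point: a residual block is a forward Euler step of an ODE (the step size h below is introduced for illustration), and backpropagation multiplies the per-layer Jacobians, so keeping each factor close to the identity controls gradient growth:

```latex
% Residual update = forward Euler discretization (step size h) of an ODE:
\[
  x_{k+1} = x_k + h\, f_{\theta_k}(x_k)
  \qquad\longleftrightarrow\qquad
  \dot{x}(t) = f_{\theta(t)}\bigl(x(t)\bigr).
\]
% Backpropagation through K layers multiplies the per-layer Jacobians:
\[
  \frac{\partial x_K}{\partial x_0}
  = \prod_{k=0}^{K-1} \Bigl( I + h\, \frac{\partial f_{\theta_k}}{\partial x}(x_k) \Bigr),
\]
% so factors close to the identity (small h, or a constrained spectrum of
% the Jacobian of f) keep gradients from vanishing or exploding.
```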
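For the second point, one connection established in the literature (modern Hopfield networks) reads single-query dot-product attention as an update that does not increase an energy function. The sketch below assumes that formulation; `beta` and the pattern matrix `X` are illustrative choices, not values from any specific model:

```python
import torch

def hopfield_attention_step(xi: torch.Tensor, X: torch.Tensor,
                            beta: float = 1.0) -> torch.Tensor:
    """One update xi -> X @ softmax(beta * X^T xi), which coincides with
    single-query dot-product attention over stored patterns (columns of X)
    and, up to constants, does not increase the energy
        E(xi) = -(1/beta) * logsumexp(beta * X^T xi) + 0.5 * ||xi||^2.
    """
    weights = torch.softmax(beta * (X.T @ xi), dim=0)  # attention weights
    return X @ weights

# Stored patterns as columns; start from a noisy version of the first one.
X = torch.randn(32, 8)
xi = X[:, 0] + 0.3 * torch.randn(32)
for _ in range(3):
    xi = hopfield_attention_step(xi, X, beta=4.0)
print(torch.norm(xi - X[:, 0]))  # iterates are typically drawn toward the pattern
```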
Research and Collaboration in Cambridge
Beyond my own project, being in Cambridge has given me the opportunity to engage with a wider research community. I have attended seminars on topics as diverse as new methods for analyzing blood samples, image analysis for recovering sheet music from ancient books, and the question of what happens when large language models are repeatedly trained on generated content.
