Language models (LMs) have become central to natural language processing, but our understanding of how they operate is still incomplete. In this talk, I will discuss three aspects that are crucial to completing this picture. First, I will talk about tokenisation—how text is split into tokens before being fed to an LM—and its theoretical and empirical consequences. This will cover the NP-completeness of finding optimal tokenisers, methods for recovering word-level probabilities from token-level models, and recent work on estimating the bias introduced by tokenisers. Second, I will turn to optimisation, investigating the dynamics of memorisation and convergence throughout an LM’s training. Third, I will consider inference, examining the potential and limitations of causal abstraction as a tool for mechanistic interpretability. Finally, I will close the talk by connecting these perspectives, shedding light on how language models see, learn, and process language.

Theory 1 – Tiago Pimentel: “Tokenisation, Optimisation, and Inference: How LLMs See, Learn, and Process Language”
Speakers
Dr. Tiago Pimentel Martins Da Silva
Schedule
24 November 2025
15:45 - 16:15