Provably Overwhelming Transformer Models with Designed Inputs
Abstract
We develop an algorithm which, given a trained transformer model M as input, as well as a string of tokens s of length n_fix and an integer n_free, can generate a mathematical proof that M is "overwhelmed" by s, in time and space Õ(n_fix^2 + n_free^3). We say that M is "overwhelmed" by s when the output of the model evaluated on this string plus any additional string t, M(s + t), is completely insensitive to the value of the string t whenever length(t) ≤ n_free. Along the way, we prove a particularly strong worst-case form of "over-squashing", which we use to bound the model's behavior. Our technique uses computer-aided proofs to establish this type of operationally relevant guarantee about transformer models. We empirically test our algorithm on a single-layer transformer complete with an attention head, layer-norm, MLP/ReLU layers, and RoPE positional encoding. We believe that this work is a stepping stone towards the difficult task of obtaining useful guarantees for trained transformer models.
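To make the "overwhelmed" property concrete, here is a minimal brute-force sketch of what it asserts (not the paper's proof-producing algorithm, which certifies the property without enumeration). The names `is_overwhelmed` and `toy_model` are illustrative assumptions, and the toy model is a caricature of worst-case over-squashing in which an early token fixes the output:

```python
import itertools

def is_overwhelmed(model, s, n_free, vocab):
    """Brute-force check of the 'overwhelmed' property: the model's output
    on s + t must be identical for every suffix t with len(t) <= n_free.
    `model` maps a tuple of tokens to some comparable output."""
    baseline = model(tuple(s))  # output with the empty suffix t
    for k in range(1, n_free + 1):
        for t in itertools.product(vocab, repeat=k):
            if model(tuple(s) + t) != baseline:
                return False  # some suffix t changes the output
    return True

# Toy "model" (an assumption for illustration): if the token 0 appears in
# the first three positions, the output ignores everything that follows.
def toy_model(tokens):
    return 1 if 0 in tokens[:3] else sum(tokens) % 5

assert is_overwhelmed(toy_model, [0, 1, 1], n_free=2, vocab=[0, 1, 2])
assert not is_overwhelmed(toy_model, [1, 1, 1], n_free=2, vocab=[0, 1, 2])
```

This check costs time exponential in n_free (it enumerates all |vocab|^k suffixes), which is exactly why a certifying algorithm with Õ(n_fix^2 + n_free^3) cost, as in the paper, is the interesting object.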
Publication Details
- Authors
- Publication Type
- Journal Article
- Year of Publication
- 2025
- Journal
- arXiv preprint, https://arxiv.org/abs/2502.06038
- Date Published
- 02/2025