Séminaire Images Optimisation et Probabilités
Pierre Marion
(INRIA)
Conference room
5 February 2026 at 11:15
Attention-based models, such as Transformers, excel across
various tasks, but a comprehensive theoretical understanding of them is still lacking. To
address this gap, we introduce the single-location regression task,
where only one token in a sequence determines the output, and its
position is a latent random variable, retrievable via a linear
projection of the input. To solve this task, we propose a dedicated
predictor, which turns out to be a simplified version of a non-linear
self-attention layer. We study its theoretical properties, both in terms
of statistics (gap to Bayes optimality) and optimization (convergence of
gradient descent). In particular, despite the non-convex nature of the
problem, the predictor effectively learns the underlying structure. This
highlights the capacity of attention mechanisms to handle sparse token
information. Based on Marion et al., Attention Layers Provably Solve
Single-Location Regression, ICLR 2025, and Duranthon et al., Statistical
Advantage of Softmax Attention: Insights from Single-Location
Regression, submitted.
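
For readers who want a concrete picture, here is a minimal sketch in Python of the single-location regression task and of a simplified softmax-attention predictor. It assumes a toy data model (the relevant token is shifted along a direction k_star, and the label is its projection onto v_star); the exact formulation, scaling, and non-linearity studied by Marion et al. and Duranthon et al. may differ.

# Minimal, self-contained sketch (not the authors' exact construction): a toy
# single-location regression dataset and a simplified softmax-attention predictor.
# All names (k_star, v_star, the shift size, ...) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
L, d, n = 10, 32, 1000                      # sequence length, token dimension, sample size

k_star = rng.standard_normal(d); k_star /= np.linalg.norm(k_star)   # direction marking the relevant token
v_star = rng.standard_normal(d); v_star /= np.linalg.norm(v_star)   # regression (read-out) direction

def sample_batch(n):
    """Each sequence holds one relevant token at a latent position J0; that token
    is shifted along k_star, and the label is its projection onto v_star."""
    X = rng.standard_normal((n, L, d))
    J0 = rng.integers(0, L, size=n)          # latent location, uniform over positions
    X[np.arange(n), J0] += 5.0 * k_star      # signal making the position recoverable via a linear projection
    y = X[np.arange(n), J0] @ v_star         # output determined by the single relevant token
    return X, y

def attention_predictor(X, k, v):
    """Simplified non-linear self-attention layer: softmax scores over the tokens'
    projections onto k, then a linear read-out along v of the attended token."""
    scores = X @ k                                            # (n, L) relevance scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)             # softmax over positions
    pooled = np.einsum('nl,nld->nd', weights, X)              # attention-weighted token
    return pooled @ v

X, y = sample_batch(n)
y_hat = attention_predictor(X, k_star, v_star)                # oracle parameters: attention concentrates on J0
print("oracle mean squared error:", np.mean((y - y_hat) ** 2))

With the oracle parameters, the softmax weights concentrate on the relevant position, so the predictor approximately recovers the label; the papers cited above analyze when such parameters can actually be learned by gradient descent and how close the resulting predictor gets to Bayes optimality.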