Author: Alice Luna Sparkuhl
Abstract
We present kNN-Attention, an O(n log n) attention mechanism that retains full recall while keeping the per-token cost logarithmic in context length. By treating attention as a database query operation and using the Hierarchical Navigable Small World (HNSW) algorithm, we retrieve only the top-k most relevant tokens for the attention computation. This approach enables contexts of arbitrary length without the lossy compression inherent in linear attention variants. We demonstrate that while current CPU-based implementations show overhead until ~100k tokens, the asymptotic improvements are dramatic: extrapolations suggest billion-token contexts could be processed in approximately 4 hours versus 1 year with standard attention. The method is compatible with existing MHA/GQA implementations and requires no architectural changes during initial training. We discuss practical implementation challenges, present initial results, and outline future research directions including memory compression techniques and self-organizing memory systems.
Background
The O(n²) complexity of attention has been one of the biggest obstacles to long context windows since the Transformer architecture was introduced. Many attempts have been made to solve this problem: linear attention variants and state space models (SSMs) such as RWKV, Mamba, and Gated DeltaNet, sliding window attention, and a number of O(n log n) approaches such as Hyena. However, even though some of these perform better at language modeling than full attention, the problem is not yet solved.
Most current applications of LLMs, coding above all, involve manipulating data and making targeted changes within ever-larger contexts. Models that cannot accurately copy earlier sections of their context are fundamentally unsuited to this. Unfortunately, none of the attention variants with better complexity classes can copy past text accurately. All linear attention variants work in fundamentally the same way: they compress the past context into a fixed-size state and retrieve information from it. This is inherently lossy and can never achieve perfect recall of information the model saw arbitrarily long ago.
One way current models mitigate the cost of full attention while retaining the ability to copy is to train hybrid models that use mostly linear attention layers and only a few full attention layers. While this is an effective mitigation that allows for longer contexts, the fundamental issue of the complexity class remains.
Method
Attention works through three projections named 'Query', 'Key' and 'Value', often abbreviated to Q, K, V. We notice that these terms match those of standard operations for querying information in a vector database. We also notice that at any point in time, only very few tokens of context are actually relevant; most tokens are noise and actively harm the quality of the representations produced by the operation. Thus, we hypothesize that attending only to a fixed number of tokens, namely those that would receive the highest attention scores, would actually benefit the quality of inference, especially over extremely long contexts where the ratio of relevant tokens is even smaller.
We introduce kNN-Attention, an attention variant in which all past tokens are added to a standard vector database (USearch) built on the Hierarchical Navigable Small World (HNSW) algorithm, which provides logarithmic complexity for insertion and retrieval. Instead of attending over all past tokens, we use the current query vectors as the query for an approximate nearest neighbor (ANN) lookup, retrieving the top-k tokens that would be attended to most strongly, and then calculate attention as usual over these selected tokens.
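The following is a minimal single-head sketch of the decoding step, assuming the USearch Python bindings (usearch.index.Index) with an inner-product metric, float32 CPU tensors, and a plain dictionary standing in for the KV cache; the constants and helper names are illustrative rather than part of the method as specified.

```python
import numpy as np
import torch
from usearch.index import Index

HEAD_DIM, TOP_K = 64, 32

index = Index(ndim=HEAD_DIM, metric="ip", dtype="f32")  # HNSW index over past keys
kv_store = {}  # token position -> (key, value); stands in for the KV cache


def knn_attention_step(q, k, v, pos):
    """Attend the token at position `pos` (vectors of shape (HEAD_DIM,)) over its top-k past tokens."""
    positions = []
    if len(kv_store) > 0:
        # ANN lookup: positions whose stored keys score highest against the current query.
        matches = index.search(q.numpy(), min(TOP_K, len(kv_store)))
        positions = [int(key) for key in matches.keys]
    positions.append(pos)                 # the token always attends to itself
    kv_store[pos] = (k, v)

    K = torch.stack([kv_store[p][0] for p in positions])  # (top_k + 1, HEAD_DIM)
    V = torch.stack([kv_store[p][1] for p in positions])
    scores = (K @ q) / np.sqrt(HEAD_DIM)                  # standard scaled dot-product logits
    out = torch.softmax(scores, dim=-1) @ V

    index.add(pos, k.numpy())             # insert only after the lookup (causality)
    return out
```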
Since attention uses a number of attention heads, each with its own keys and values, there is an obvious design decision to be made: should we keep an individual database for every attention head, retrieving for each head the KV entries it would attend to most strongly? We believe this is not currently worth it, as it makes the number of lookup operations explode. In our implementation, the computational cost of lookups and insertions was already a severe bottleneck, so any steps to reduce this cost are necessary to make the method practical. To compensate, the value of top-k should be set higher.
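The text above leaves open how one shared database serves all heads. One plausible arrangement, sketched here purely as an assumption, is to index the concatenation of the per-head keys and to query with the concatenation of the per-head queries: the inner product of the concatenations equals the sum of the per-head logits, so retrieval favors tokens with a high total attention score across heads.

```python
import numpy as np
from usearch.index import Index

N_HEADS, HEAD_DIM, TOP_K = 8, 64, 128   # a larger top-k compensates for sharing one index

shared_index = Index(ndim=N_HEADS * HEAD_DIM, metric="ip", dtype="f32")

def add_token(pos, per_head_keys):
    """per_head_keys: (N_HEADS, HEAD_DIM) array, flattened into a single shared entry."""
    shared_index.add(pos, per_head_keys.reshape(-1))

def shared_lookup(per_head_queries, count=TOP_K):
    """Return positions whose per-head logits, summed over all heads, are highest."""
    matches = shared_index.search(per_head_queries.reshape(-1), count)
    return np.asarray(matches.keys, dtype=np.int64)
```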
A practical challenge is how to handle prefill, where many context tokens are processed at once rather than one-by-one as during autoregressive generation. A good middle ground is to process the prompt in chunks, with the chunk size chosen large enough to fully utilize the GPU. Care must also be taken to maintain causality: the ability to attend to the most recent tokens is often the most important, but standard database operations do not allow retrieving only keys with an index below a specific position when processing multiple queries with different positions simultaneously.
To solve this issue, during prefill with a chunk size of, say, 256 tokens, the mechanism always attends to the 256 most recent tokens in addition to the top-k retrieved tokens. The lookup happens before the new tokens are inserted into the database, which guarantees causality; only after the lookup do we insert the chunk's tokens. A sketch of this chunked prefill follows.
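Below is a rough sketch of this chunked prefill for a single head, again assuming USearch; it interprets the "most recent tokens" window as the current chunk under a causal mask, performs the per-query lookups in a simple CPU loop, and treats all sizes and names as illustrative.

```python
import numpy as np
import torch
from usearch.index import Index

HEAD_DIM, CHUNK, TOP_K = 64, 256, 32
index = Index(ndim=HEAD_DIM, metric="ip", dtype="f32")

def prefill(q, k, v):
    """q, k, v: (seq_len, HEAD_DIM) float32 CPU tensors. Returns outputs of the same shape."""
    outputs = []
    for start in range(0, q.shape[0], CHUNK):
        cq, ck, cv = q[start:start + CHUNK], k[start:start + CHUNK], v[start:start + CHUNK]
        n = cq.shape[0]

        # 1) Lookup *before* inserting this chunk: the index only holds tokens from
        #    earlier chunks, so no query can ever retrieve a future token.
        hits = set()
        if start > 0:                       # the index is empty for the first chunk
            for row in cq.numpy():          # a batched GPU lookup would replace this loop
                matches = index.search(row, min(TOP_K, start))
                hits.update(int(key) for key in matches.keys)
        if hits:
            ids = torch.as_tensor(sorted(hits))
            rk, rv = k[ids], v[ids]         # retrieved keys and values
        else:
            rk = rv = torch.empty(0, HEAD_DIM)

        # 2) Attend over the retrieved tokens plus the current chunk, with a causal
        #    mask inside the chunk acting as the "most recent tokens" window.
        K, V = torch.cat([rk, ck]), torch.cat([rv, cv])
        scores = cq @ K.T / HEAD_DIM ** 0.5               # (n, retrieved + n)
        causal = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
        scores[:, -n:] = scores[:, -n:].masked_fill(causal, float("-inf"))
        outputs.append(torch.softmax(scores, dim=-1) @ V)

        # 3) Only now insert the chunk's keys into the database.
        index.add(np.arange(start, start + n), ck.numpy())
    return torch.cat(outputs)
```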
Finally, there is the question of how an attention method like this affects the training process. Our method is compatible with normal MHA and GQA implementations. The context size chosen during training is usually quite small, so the first stage of pretraining should happen without any vector database operations at all; training without sparsity lets the model learn more efficiently what to attend to and what to ignore. Only once pretraining is finished do we recommend a short period of training with the sparse mechanism, so that the model learns to utilize it properly. Even without this step, however, the mechanism should work well, since attention operations already end up extremely sparse by nature, with negligible probability mass placed on the vast majority of tokens.
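This sparsity claim can be checked empirically on a trained model; the small helper below, an illustration rather than part of the method, measures how much softmax probability mass the k highest attention logits capture per query. If this mass is close to 1 for modest k, restricting attention to the top-k retrieved tokens discards little information.

```python
import torch

def topk_probability_mass(scores, k):
    """scores: (n_queries, n_keys) pre-softmax attention logits.
    Returns, per query, the softmax probability mass captured by the k highest logits."""
    probs = torch.softmax(scores, dim=-1)
    return probs.topk(k, dim=-1).values.sum(dim=-1)
```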
Results, Costs and Benefits
When we tested this mechanism with USearch as the vector database, performing the lookups on CPU, the overhead of the database operations caused a significant slowdown until very long contexts were reached; it only became faster than full attention at around 100,000 tokens. Extrapolating, we found that the better complexity class reduces the time to process a billion tokens from roughly one year to roughly four hours at a small model dimensionality.
Thus, the sensible way to implement this is to benchmark the context length at which it becomes a speed upgrade, calculate attention normally using full attention below that threshold, and apply this method only for contexts where it is an improvement; a sketch of this dispatch follows. It should also be noted that this comes with a memory overhead, as the vector database is similar in size to the KV cache itself, so the same data has to reside both on the GPU and on the CPU.
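A minimal sketch of such a dispatch is given below; the threshold constant and the knn_attention_fn callable are placeholders, to be filled in by benchmarking and by the mechanism described above.

```python
import torch
import torch.nn.functional as F

CROSSOVER_TOKENS = 100_000   # benchmark per model and hardware; ~100k was our CPU-based crossover

def dispatch_attention(q, k, v, knn_attention_fn):
    """Plain full attention below the crossover length, the database-backed path above it."""
    if k.shape[-2] < CROSSOVER_TOKENS:
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return knn_attention_fn(q, k, v)
```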
For this to become truly revolutionary or standard, there will need to be efficient GPU implementations of a vector database algorithm that does not exceed logarithmic insertion and lookup time; at the time of writing, none appears to exist. The most efficient implementation would reuse the scores calculated during the kNN lookup for the attention computation itself.
Further Research and Optimizations
We note that an attention mechanism of this complexity class which does not eventually break down would enable truly massive contexts. As long as there is memory for the database, nothing has to be recomputed, and in theory the complete history between a model and a user could remain accessible. One of the biggest hurdles is managing the memory cost of storing all the KVs. The first optimization comes from DeepSeek-V3.2-Exp: its Lightning Indexer approximates attention scores using an extremely compressed representation of the keys, which would reduce the memory and, above all, the lookup cost. DeepSeek's MLA is also worth considering: with the optimization in which attention scores are calculated in the latent space, extreme compression of the KV space is achieved. Further research is necessary to determine whether these can be unified with our mechanism.
Secondly, it may not be necessary to store all keys and all values. For example, if many keys have extremely similar values attached to them, one could keep only one of those values, pointed at by many keys.
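As a rough sketch of this idea, and purely as an assumption about how it might look, one could keep a second index over values and, before storing a new value, reuse an existing one that lies within a small distance threshold; the threshold and the cosine metric here are illustrative.

```python
import numpy as np
from usearch.index import Index

VALUE_DIM, DEDUP_THRESHOLD = 64, 0.02   # cosine distance below which two values are merged

value_index = Index(ndim=VALUE_DIM, metric="cos", dtype="f32")
value_store = {}     # value_id -> value vector
key_to_value = {}    # token position -> value_id

def store_value(pos, value):
    """Store `value` for the key at `pos`, reusing a nearly identical existing value if one exists."""
    if len(value_store) > 0:
        matches = value_index.search(value, 1)
        if len(matches.keys) > 0 and float(matches.distances[0]) < DEDUP_THRESHOLD:
            key_to_value[pos] = int(matches.keys[0])   # point this key at the existing value
            return
    new_id = len(value_store)
    value_store[new_id] = value
    value_index.add(new_id, value)
    key_to_value[pos] = new_id
```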
Thirdly, we could combine this with mechanisms that reuse the same KV-cache over several or all layers, further limiting the memory cost.
Eventually, we might assign a fixed amount of memory to this per user. Using various compression and caching mechanisms, we could attempt to retain the information shared by a user as completely as possible while staying within that fixed memory budget. This would transform the model into an alternative kind of SSM with self-organizing memory which, as opposed to current SSMs, would have logarithmic lookup and insertion time relative to its memory size. A model could manually organize and summarize parts of its memory, remove obsolete entries, and so on. This would represent a true evolution from the models we have today into something that has good access to significant chunks of its total history with a user, without relying on RAG mechanisms, which break down when a model doesn't know that potentially useful information exists and thus never makes the correct query. With retrieval fused directly into the queries of its attention mechanism, all information is available to the model at all times.