Unlearning Layer In Attention

Core Idea

If attention scores drive associations between tokens, then a small module that adjusts the scores before the softmax could weaken specific token relationships.

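The simplest version is a learned additive mask on the attention logits (equivalently, a multiplicative factor exp(M_unlearn) on the unnormalized attention weights):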

Attention(Q, K, V) = softmax((QK^T / sqrt(d)) + M_unlearn) V
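A minimal NumPy sketch of the formula above, for a single head with no batch dimension. The function name attention_with_unlearn and the targeted pair (query 2, key 1) are illustrative assumptions, and -1e9 stands in for a hard -inf entry; a trained M_unlearn would hold finite values. Adding M_ij to a logit multiplies that key's unnormalized weight by exp(M_ij) before renormalization, so a large negative entry drives its weight to zero.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; huge negative entries underflow to 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_with_unlearn(Q, K, V, M_unlearn):
    """softmax(Q K^T / sqrt(d) + M_unlearn) V for one head (no batch dimension)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n_q, n_k) attention logits
    weights = softmax(scores + M_unlearn)  # additive bias applied before the softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Hypothetical target: stop query position 2 from attending to key position 1.
M_unlearn = np.zeros((n, n))
M_unlearn[2, 1] = -1e9  # effectively -inf; a trained mask would use finite values

_, w_base = attention_with_unlearn(Q, K, V, np.zeros((n, n)))
_, w_edit = attention_with_unlearn(Q, K, V, M_unlearn)
print(w_base[2, 1], w_edit[2, 1])
```

The other keys in row 2 absorb the removed weight in proportion to their original weights, and every other row is untouched. If those remaining keys carry the same information, this redistribution could be one path by which the model routes around the suppression.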

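M_unlearn is trained to suppress a targeted association, for example by pushing the entries for the relevant query-key pairs strongly negative while keeping the rest near zero, so that general behavior is preserved.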
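One possible way to train such a mask, sketched under simplifying assumptions: M_unlearn is reduced to a single scalar bias b that is added wherever a hypothetical target (query token, key token) pair occurs, and b is fit by gradient descent on the mean attention to those entries plus an L2 penalty lam * b**2. The objective, the helper train_pair_bias, and the toy token ids are assumptions for illustration, not a method specified in this note. The gradient uses the softmax identity d a_ij / d b = a_ij * (1 - s_i), where s_i is the total weight row i places on targeted keys.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax with max subtraction for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def train_pair_bias(scores, token_ids, pair, lam=1e-3, lr=1.0, steps=300):
    """Learn a scalar bias b added to every logit where (query token, key token) == pair.

    Hypothetical objective: mean attention on the targeted entries + lam * b**2.
    Only targeted logits are modified; the L2 term keeps b from growing without bound.
    """
    hit = (token_ids[:, None] == pair[0]) & (token_ids[None, :] == pair[1])
    b = 0.0
    for _ in range(steps):
        a = softmax_rows(scores + b * hit)
        s = (a * hit).sum(axis=1, keepdims=True)  # targeted mass per query row
        # For a targeted (i, j): d a_ij / d b = a_ij * (1 - s_i)
        grad = (a * (1.0 - s))[hit].mean() + 2.0 * lam * b
        b -= lr * grad
    return b, hit

rng = np.random.default_rng(0)
token_ids = np.array([5, 7, 3, 7, 9])
scores = rng.standard_normal((5, 5))  # stand-in for QK^T / sqrt(d)

# Hypothetical target: token 9 should stop attending to token 7.
b, hit = train_pair_bias(scores, token_ids, pair=(9, 7))
before = softmax_rows(scores)
after = softmax_rows(scores + b * hit)
print(b, before[hit].mean(), after[hit].mean())
```

Tying the bias to token identity rather than position lets it apply wherever the pair recurs, but it still covers only that exact pair and says nothing about paraphrases of the same association.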

Most unlearning methods are training-heavy or model-editing-heavy. An attention-layer adapter could be a lighter-weight alternative, but it raises several open questions:

Does attention suppression actually remove behavior, or does the model route around it?
Can the adapter generalize beyond memorized token pairs?
Can it be applied only during sensitive requests?
How do we avoid broad collateral damage?
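A sketch of gating and of measuring collateral damage, assuming a hypothetical per-request gate value (for example, the output of a sensitive-request detector) that scales M_unlearn, and using per-row KL divergence from the unmodified attention as one possible drift metric. The names gated_attention_weights and row_kl, and the bias value -5.0, are illustrative assumptions. With gate = 0 the attention is unchanged exactly; with gate = 1 the drift is confined to the targeted row in this single-layer toy, whereas in a full model downstream layers would propagate the change further.

```python
import numpy as np

def softmax_rows(x):
    # Row-wise softmax; subtracting the row max keeps exp() from overflowing.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gated_attention_weights(scores, M_unlearn, gate):
    """Attention weights with the unlearning bias scaled by a gate in [0, 1].

    `gate` is a hypothetical per-request signal; gate = 0 recovers the
    unmodified attention exactly.
    """
    return softmax_rows(scores + gate * M_unlearn)

def row_kl(p, q, eps=1e-12):
    """KL(p || q) for each query row: one possible collateral-damage proxy."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

rng = np.random.default_rng(1)
scores = rng.standard_normal((6, 6))  # stand-in for QK^T / sqrt(d)
M_unlearn = np.zeros((6, 6))
M_unlearn[4, 2] = -5.0  # hypothetical trained bias on one targeted entry

base = softmax_rows(scores)
drift_off = row_kl(base, gated_attention_weights(scores, M_unlearn, gate=0.0))
drift_on = row_kl(base, gated_attention_weights(scores, M_unlearn, gate=1.0))
print(drift_off.round(4))
print(drift_on.round(4))
```

Gating shifts the collateral-damage question onto the detector: any false positive applies the mask to a request that did not need it, so drift should also be measured on inputs that are meant to be unaffected.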

Prototype: 4-8 weeks.
Paper-grade: 3-6 months.
Complexity: Medium-high.
Main risk: the association may be distributed outside the attention links being suppressed (for example, in MLP layers), letting the model route around the mask.