You can’t cheaply recompute without re-running the whole model – so KV cache starts piling up
Large language model ...
The anti-forgetting representation learning method reduces the interference of weight aggregation with model memory and improves representation performance.