Making inference faster and cheaper without losing quality.
Generation is sequential: one token at a time, each conditioned on everything generated so far. During that decode loop the GPU spends most of its time streaming model parameters from memory, not doing math. That memory bottleneck creates room for speedups.
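A rough back-of-envelope sketch makes the bottleneck concrete: each generated token requires reading (roughly) all of the weights once, so memory bandwidth caps single-stream decode speed. The numbers below are illustrative assumptions, not measurements of any particular model or GPU.

```python
# Back-of-envelope: decode throughput ceiling set by memory bandwidth.
# All figures are illustrative assumptions, not benchmarks.

def decode_tokens_per_second(param_count: float,
                             bytes_per_param: float,
                             memory_bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed: every token generated
    requires streaming all model weights through the memory system once
    (KV-cache traffic and compute time are ignored here)."""
    bytes_per_token = param_count * bytes_per_param   # weight bytes read per step
    bandwidth_bytes = memory_bandwidth_gb_s * 1e9     # GB/s -> bytes/s
    return bandwidth_bytes / bytes_per_token

# Example: a 7B-parameter model in fp16 (~14 GB of weights) on a GPU with
# roughly 2 TB/s of HBM bandwidth.
print(decode_tokens_per_second(7e9, 2, 2000))  # ~140 tokens/s ceiling
```

The arithmetic per token is tiny next to the 14 GB of weights it has to pull in, which is why the techniques that follow mostly attack the memory traffic rather than the FLOPs.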