Train a small adapter instead of the full model.
Full SFT updates every parameter in the model, which is expensive for a 7B model. LoRA freezes the original parameters and adds a tiny set of new ones to specific layers, typically less than 1% of the model's parameter count. Train those, leave everything else untouched. The result is a lightweight adapter that can be swapped in and out: one base model, many specializations.
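The core trick is a low-rank update: instead of learning a full d×k weight delta, learn two thin matrices whose product approximates it. A minimal NumPy sketch of one adapted layer (dimensions, rank, and scaling are illustrative, not from any particular model):

```python
import numpy as np

d, k = 4096, 4096   # shape of one attention projection (illustrative)
r = 8               # LoRA rank: the adapter's bottleneck dimension

# Frozen base weight: never updated during training.
W = (np.random.randn(d, k) * 0.02).astype(np.float32)

# Trainable adapter factors. B starts at zero, so at step 0 the
# adapted layer behaves exactly like the frozen base layer.
A = (np.random.randn(r, k) * 0.01).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)
alpha = 16          # scaling hyperparameter from the LoRA paper

def lora_forward(x):
    # y = W x + (alpha / r) * B (A x): base output plus low-rank update.
    return W @ x + (alpha / r) * (B @ (A @ x))

base_params = W.size
adapter_params = A.size + B.size
print(f"adapter is {adapter_params / base_params:.2%} of this layer")  # 0.39%
```

Only `A` and `B` receive gradients; the merged update `B @ A` can be folded into `W` at inference time, or kept separate so one base model serves many adapters.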
LoRA still loads the full frozen model into memory. QLoRA shrinks it first: aggressively quantize the base model, typically to 4-bit, then train the adapter on top in higher precision. It uses a fraction of the memory and makes fine-tuning possible on hardware that couldn’t hold the original model.
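The idea can be sketched with simple absmax quantization (real QLoRA uses a 4-bit NF4 data type with double quantization; int8 keeps this sketch short, and all dimensions are illustrative):

```python
import numpy as np

d, k, r = 1024, 1024, 8

W = (np.random.randn(d, k) * 0.02).astype(np.float32)

# Quantize the frozen base weight: per-row absmax scaling to int8.
# Stored cost drops from 4 bytes/weight to 1, plus a small scale vector.
scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_q = np.round(W / scale).astype(np.int8)

# The adapter stays in float32 and is the only thing trained.
A = (np.random.randn(r, k) * 0.01).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)
alpha = 16

def qlora_forward(x):
    # Dequantize on the fly for the forward pass; gradients flow
    # only into A and B, never into the quantized base weight.
    W_deq = W_q.astype(np.float32) * scale
    return W_deq @ x + (alpha / r) * (B @ (A @ x))

mem_base = W.nbytes
mem_quant = W_q.nbytes + scale.nbytes
print(f"base weight memory: {mem_base} -> {mem_quant} bytes")
```

The quantized weights are read-only, so the quantization error is a fixed perturbation of the base model; the trainable adapter can partially compensate for it during fine-tuning.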