ExLlama

ExLlama
Initial Release: 1 June 2023 (approximate)
Original Author / Maintainer: turboderp
GitHub: https://github.com/turboderp/exllama
License: MIT License
Batch Generation: Yes
Chat: Yes
Training: No
Quantization: No
Run on CPU alone: No
Run on GPU / CUDA: Yes
GUI: Basic built-in, or via oobabooga or other programs
Model Formats: GPTQ

ExLlama is a standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights.
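The snippet below is a minimal sketch of that workflow, modelled on the example scripts shipped with the ExLlama repository. It assumes the script is run from a checkout of the repository (ExLlama is not a pip-installable package, so the ExLlama, ExLlamaCache, ExLlamaConfig, ExLlamaTokenizer, and ExLlamaGenerator classes are imported from the repository's own modules), and the model directory path is a placeholder for a local 4-bit GPTQ Llama model.

import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Placeholder path to a directory holding a GPTQ-quantized Llama model
model_directory = "/models/llama-7b-4bit-128g/"

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(config_path)  # read model hyperparameters from config.json
config.model_path = model_path       # point at the 4-bit GPTQ weights

model = ExLlama(config)                       # load the quantized weights onto the GPU
tokenizer = ExLlamaTokenizer(tokenizer_path)  # SentencePiece tokenizer for Llama
cache = ExLlamaCache(model)                   # allocate the attention key/value cache
generator = ExLlamaGenerator(model, tokenizer, cache)

# Sampling settings; values here are illustrative
generator.settings.temperature = 0.95
generator.settings.top_p = 0.65

print(generator.generate_simple("Once upon a time,", max_new_tokens=128))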

ExLlama's focus is on performance, with the stated objective of being the fastest and most GPU-memory-efficient implementation for running large language models on modern GPUs.

See Also