| Initial Release | 1 June 2023 (approximately) |
| --- | --- |
| Original Author / Maintainer | turboderp |
| GitHub Link | [turboderp/exllama](https://github.com/turboderp/exllama) |
| License | MIT License |
| Batch Generation | Yes |
| Chat | Yes |
| Training | No |
| Quantization | No (consumes pre-quantized GPTQ weights) |
| Run on CPU alone | No |
| Run on GPU / CUDA | Yes |
| GUI | Basic built-in web UI, or via oobabooga and other front ends |
| Model Formats | GPTQ |
ExLlama is a standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights.
ExLlama's focus is on performance, with the stated objective of being the fastest and most GPU-memory-efficient way to run large language models on modern GPUs.
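
For illustration, here is a minimal generation script in the style of the examples shipped with the repository, assuming the ExLlama v1 module layout (`model.py`, `tokenizer.py`, `generator.py` on the import path). The model directory is a placeholder for a local Llama checkpoint quantized to 4-bit GPTQ:

```python
import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Placeholder path to a local Llama model quantized to 4-bit GPTQ
model_directory = "/models/llama-13b-4bit-128g/"

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

config = ExLlamaConfig(model_config_path)   # read the base model's config.json
config.model_path = model_path              # point at the GPTQ weight file

model = ExLlama(config)                     # load the weights onto the GPU
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)                 # allocate the KV cache
generator = ExLlamaGenerator(model, tokenizer, cache)

# Sampling settings (values here are illustrative)
generator.settings.temperature = 0.95
generator.settings.top_p = 0.65

prompt = "Once upon a time,"
print(generator.generate_simple(prompt, max_new_tokens=200))
```

Because ExLlama is a CUDA implementation, a working NVIDIA GPU and toolchain are required; there is no CPU-only fallback.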