Initial Release | 1 June 2023 (approximately) |
---|---|
Original Author / Maintainer | turboderp |
GitHub Link | https://github.com/turboderp/exllama |
License | MIT License |
Batch Generation | |
Chat | |
Training | |
Quantization | |
Run on CPU alone | |
Run on GPU / CUDA | |
GUI | Basic built-in, or via oobabooga or other programs |
Model Formats | GPTQ |
ExLlama is a standalone Python/C++/CUDA implementation of Llama for use with 4-bit GPTQ weights.
Its focus is on performance, with the stated objective of being the fastest and most GPU-memory-efficient implementation for running large language models on modern GPUs.
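As a rough sketch of what inference looks like, the snippet below follows the pattern of the example scripts bundled with the ExLlama repository (it imports the repo's `model`, `tokenizer`, and `generator` modules, so it assumes it is run from a checkout of the repo). The model directory path is a placeholder, and exact class and parameter names should be verified against the version of the repo in use.

```python
import glob
import os

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

# Placeholder: directory containing a 4-bit GPTQ Llama model
# (config.json, tokenizer.model, *.safetensors).
model_directory = "/path/to/gptq-model/"

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
config_path = os.path.join(model_directory, "config.json")
model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]

# Load the model config and point it at the quantized weights.
# ExLlama loads onto the GPU; it does not run on CPU alone.
config = ExLlamaConfig(config_path)
config.model_path = model_path
model = ExLlama(config)

# Tokenizer, KV cache, and generator wrap the model for text generation.
tokenizer = ExLlamaTokenizer(tokenizer_path)
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

# Sampling settings (illustrative values).
generator.settings.temperature = 0.8
generator.settings.top_p = 0.75

print(generator.generate_simple("Once upon a time,", max_new_tokens=128))
```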