
Commit c0789ab

Merge pull request #1 from AmpereComputingAI/daniel/readme
add quantization instructions to readme
2 parents 06e1efb + 341314b commit c0789ab

File tree

1 file changed: +25 -0 lines changed


README.md

Lines changed: 25 additions & 0 deletions
@@ -22,6 +22,31 @@ Quick start example will be presented at docker container launch:
Make sure to visit us at [Ampere Solutions Portal](https://solutions.amperecomputing.com/solutions/ampere-ai)!

## Quantization

The Ampere® optimized build of llama.cpp supports two new quantization methods, Q4_K_4 and Q8R16, which offer model size and perplexity similar to Q4_K and Q8_0, respectively, while running up to 1.5-2x faster at inference.
First, you'll need to convert the model to the GGUF format using [this script](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py):

```bash
python3 convert_hf_to_gguf.py [path to the original model] --outtype [f32, f16, bf16 or q8_0] --outfile [output path]
```

For example:

```bash
python3 convert_hf_to_gguf.py path/to/llama2 --outtype f16 --outfile llama-2-7b-f16.gguf
```
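The other `--outtype` values listed above work the same way. As an illustrative sketch only (the model path and output filename here are placeholders, not taken from the original instructions):

```bash
# Illustrative sketch: same conversion script, using the bf16 output type
# listed above. "path/to/llama2" and the output filename are placeholders.
python3 convert_hf_to_gguf.py path/to/llama2 --outtype bf16 --outfile llama-2-7b-bf16.gguf
```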
Next, you can quantize the model using the following command:

```bash
./llama-quantize [input file] [output file] [quantization method]
```
For example:

```bash
./llama-quantize llama-2-7b-f16.gguf llama-2-7b-Q8R16.gguf Q8R16
```
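Quantizing to the other new method, Q4_K_4, follows the same pattern. A sketch, assuming the same `llama-2-7b-f16.gguf` produced in the conversion example above and that the method name is passed exactly as in the Q8R16 example:

```bash
# Illustrative sketch: produce a Q4_K_4 variant of the converted model.
# Assumes llama-2-7b-f16.gguf from the conversion example above and that
# the method name "Q4_K_4" is passed the same way as "Q8R16".
./llama-quantize llama-2-7b-f16.gguf llama-2-7b-Q4_K_4.gguf Q4_K_4
```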
## Support

Please contact us at <ai-support@amperecomputing.com>
