* Updated source to support MPS and CUDA
* More changes.
* More changes.
* Current point in time.
* Missed a change.
* Updated files and README
* Committing changes before applying stash
* .gitignore.. whatever.
* Update the requirements.txt for llama.cpp and gguf
* Update to requirements for ctransformers
* Update requirements.txt
* Bumped llama-cpp-python
* Handle GGML files for model loading
* Checkpoint
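The commits about handling GGML files and updating the requirements for llama.cpp and gguf imply that the loader has to tell legacy GGML files apart from the newer GGUF format. Below is a minimal, hypothetical sketch of such a check based on the files' magic bytes; the helper name and the example path are illustrative and not taken from this repository.

```python
from pathlib import Path

# Hypothetical helper; not the repository's actual loader code.
# GGUF files start with the ASCII magic "GGUF". Older llama.cpp GGML-family
# files start with a little-endian uint32 magic ("ggml", "ggmf", or "ggjt"),
# which appears byte-reversed on disk.
GGUF_MAGIC = b"GGUF"
LEGACY_GGML_MAGICS = {b"lmgg", b"fmgg", b"tjgg"}

def model_file_format(path: str) -> str:
    """Classify a model file as 'gguf', 'ggml', or 'unknown' from its first four bytes."""
    with Path(path).open("rb") as f:
        magic = f.read(4)
    if magic == GGUF_MAGIC:
        return "gguf"
    if magic in LEGACY_GGML_MAGICS:
        return "ggml"
    return "unknown"

if __name__ == "__main__":
    # Placeholder path for illustration only.
    fmt = model_file_format("models/llama-2-7b.ggmlv3.q4_0.bin")
    if fmt == "ggml":
        print("Legacy GGML file: convert it to GGUF before loading with llama-cpp-python 0.1.81.")
```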
README.md: 36 additions & 22 deletions
@@ -1,6 +1,26 @@
-# OLD VERSION - 1.3.1 Patched for macOS and Apple Silicon
+# MERGED 1.5 Version. macOS TEST VERSION

-Patched and working with macOS and Apple Silicon M1/M2 GPU now.
+This is a development version; I have not yet added many of the changes I had planned. Please use it at your own risk, as there may be bugs not yet found.
+
+Items added in this version:
+* "Stop Server" under the Session tab. Use with caution in multi-user mode; this will probably be disabled for multi-user setups, but it offers a cleaner shutdown than just killing the process on the server.
+* Added a Python class for handling diverse GPU/compute devices such as CUDA, CPU, or MPS. The code now uses a single `torch.device` once it is set initially, and falls back to CPU. (A minimal sketch of this idea follows this hunk.)
+
+Items working and tested on macOS:
+* More support for Apple Silicon M1/M2 processors.
+* Working with the new llama-cpp-python 0.1.81.
+* Works with Llama 2 models.
+* GGML models will need conversion to GGUF format if using llama-cpp-python 0.1.81.
+* Earlier versions of llama-cpp-python still work.
+* Have not concluded testing of library dependencies; the build instructions for oobabooga-macOS will be updated once that is done.
+* Still mainly supporting GGML, and now GGUF (GG-Universal Format) files. You will have to convert your GGML files to GGUF format.
+
+Removed from this version:
+* Tried to continue what was already started in removing FlexGen from the repo.
+* Removed Docker - if someone wants to help maintain it for macOS, let me know.
+* Slowly removing information on CUDA, as it is not relevant to macOS.
+
+**Updated Installation Instructions** for libraries in the [oobabooga-macOS Quickstart](https://github.com/unixwzrd/oobabooga-macOS/blob/main/macOS_Apple_Silicon_QuickStart.md) and the longer [Building Apple Silicon Support](https://github.com/unixwzrd/oobabooga-macOS/blob/main/macOS_Apple_Silicon_QuickStart.md)

GGML support is in this release, and has not been extensively tested. From the look of upstream commits, there are some changes which must be made before this will work with Llama2 models.
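The GPU/compute device class referenced in the hunk above does not appear in this diff. The following is a minimal sketch, assuming a hypothetical class name and interface rather than the fork's actual code, of how a CUDA/MPS/CPU choice with a CPU fallback can be pinned to a single `torch.device`:

```python
from typing import Optional

import torch


class DeviceManager:
    """Hypothetical sketch of a compute-device selector; not the fork's actual class."""

    def __init__(self, preferred: Optional[str] = None):
        # Resolve the device once and reuse it everywhere afterwards.
        self.device = self._resolve(preferred)

    @staticmethod
    def _resolve(preferred: Optional[str]) -> torch.device:
        # Honor an explicit request only if that backend is actually available.
        if preferred == "cuda" and torch.cuda.is_available():
            return torch.device("cuda")
        if preferred == "mps" and torch.backends.mps.is_available():
            return torch.device("mps")
        # Otherwise auto-detect: CUDA first, then Apple Silicon MPS, then CPU.
        if torch.cuda.is_available():
            return torch.device("cuda")
        if torch.backends.mps.is_available():
            return torch.device("mps")
        return torch.device("cpu")


# Usage: pick the device once, then move models/tensors with .to(device).
dm = DeviceManager()
x = torch.ones(4, device=dm.device)
print(f"Running on {dm.device}")
```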
@@ -13,15 +33,15 @@ Otherwise, use these instructions I have on putting together the macOS Python en

I will be updating this README file with new information specifically regarding macOS and Apple Silicon.

-I would like to work closely with the oobaboogs team and try to implement simkilar solutions so the web UI can have a similar look and feel.
+I would like to work closely with the oobabooga team and try to implement similar solutions so the web UI can have a similar look and feel.

Maintaining and improving support for macOS and Apple Silicon in this project has required significant research, debugging, and development effort. If you find my contributions helpful and want to show your appreciation, you can Buy Me a Coffee, sponsor this project, or consider me for job opportunities.

While the focus of this branch is to enhance macOS and Apple Silicon support, I aim to maintain compatibility with Linux and POSIX operating systems. Contributions and feedback related to Linux compatibility are always welcome.

Anyone who would like to assist with supporting Apple Silicon, let me know. There is much to do and I can only do so much by myself.

-- [OLD VERSION - 1.3.1 Patched for macOS and Apple Silicon](#old-version---131-patched-for-macos-and-apple-silicon)
+- [MERGED 1.5 Version. macOS TEST VERSION](#merged-15-version--macos-test-version)
- [Features](#features)
- [Installation](#installation)
- [Downloading models](#downloading-models)
@@ -36,9 +56,9 @@ Anyone who would like to assist with supporting Apple Silicon, let me know. Ther
- [AutoGPTQ](#autogptq)
- [ExLlama](#exllama)
- [GPTQ-for-LLaMa](#gptq-for-llama)
-- [FlexGen](#flexgen)
- [DeepSpeed](#deepspeed)
- [RWKV](#rwkv)
+- [RoPE (for llama.cpp and ExLlama only)](#rope-for-llamacpp-and-exllama-only)
- [Gradio](#gradio)
- [API](#api)
- [Multimodal](#multimodal)
@@ -47,7 +67,6 @@ Anyone who would like to assist with supporting Apple Silicon, let me know. Ther
- [Community](#community)
- [Credits](#credits)

-
## Features

* 3 interface modes: default, notebook, and chat
@@ -56,7 +75,7 @@ Anyone who would like to assist with supporting Apple Silicon, let me know. Ther
* LoRA: load and unload LoRAs on the fly, load multiple LoRAs at the same time, train a new LoRA
* Precise instruction templates for chat mode, including Alpaca, Vicuna, Open Assistant, Dolly, Koala, ChatGLM, MOSS, RWKV-Raven, Galactica, StableLM, WizardLM, Baize, Ziya, Chinese-Vicuna, MPT, INCITE, Wizard Mega, KoAlpaca, Vigogne, Bactrian, h2o, and OpenBuddy
* [Multimodal pipelines, including LLaVA and MiniGPT-4](https://github.com/oobabooga/text-generation-webui/tree/main/extensions/multimodal)
-* 8-bit and 4-bit inference through bitsandbytes **CPU only mode for macOS, bitsandbytes does not support Apple Silicon M1/M2 processors**
+* 8-bit and 4-bit inference through bitsandbytes **CPU only mode for macOS, bitsandbytes does not support Apple Silicon GPU**
* CPU mode for transformers models
* [DeepSpeed ZeRO-3 inference](docs/DeepSpeed.md)
* [Extensions](docs/Extensions.md)
@@ -165,7 +184,7 @@ Optionally, you can use the following command-line flags:
-|`--loader LOADER`| Choose the model loader manually, otherwise, it will get autodetected. Valid options: transformers, autogptq, gptq-for-llama, exllama, exllama_hf, llamacpp, rwkv, flexgen |
+|`--loader LOADER`| Choose the model loader manually, otherwise, it will get autodetected. Valid options: transformers, autogptq, gptq-for-llama, exllama, exllama_hf, llamacpp, rwkv |

#### Accelerate/transformers
@@ -203,8 +222,8 @@ Optionally, you can use the following command-line flags:
|`--n_batch`| Maximum number of prompt tokens to batch together when calling llama_eval. |
|`--no-mmap`| Prevent mmap from being used. |
|`--mlock`| Force the system to keep the model in RAM. |
-|`--cache-capacity CACHE_CAPACITY`| Maximum cache capacity. Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. |
-|`--n-gpu-layers N_GPU_LAYERS`| Number of layers to offload to the GPU. Only works if llama-cpp-python was compiled with BLAS. Set this to 1000000000 to offload all layers to the GPU. |
+|`--cache-capacity CACHE_CAPACITY`| Maximum cache capacity. Examples: 2000MiB, 2GiB. When provided without units, bytes will be assumed. Does not apply to the Apple Silicon GPU, since it uses unified memory. |
+|`--n-gpu-layers N_GPU_LAYERS`| Number of layers to offload to the GPU. Only works if llama-cpp-python was compiled with Apple Silicon GPU support for BLAS and llama.cpp using Metal. Load the model and look for **llama_model_load_internal: n_layer** in STDERR; this shows the number of layers in the model. Set this value to that number, or possibly n + 2. This setting is very sensitive now and can overrun your data area or tensor cache, causing a segmentation fault. |
|`--n_ctx N_CTX`| Size of the prompt context. |
|`--llama_cpp_seed SEED`| Seed for llama-cpp models. Default 0 (random). |
|`--n_gqa N_GQA`| grouped-query attention. Must be 8 for llama2 70b. |
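For reference, the same layer-offload setting can be exercised directly through llama-cpp-python. This is a minimal sketch with a placeholder model path and layer count, not configuration taken from this repository; the layer count should not exceed what the model reports.

```python
from llama_cpp import Llama

# Minimal sketch using llama-cpp-python's Llama API; the path and the layer
# count below are placeholders.
llm = Llama(
    model_path="models/llama-2-7b.Q4_0.gguf",  # hypothetical GGUF file
    n_ctx=2048,        # size of the prompt context (--n_ctx)
    n_gpu_layers=32,   # layers offloaded to Metal; compare with the
                       # "llama_model_load_internal: n_layer" value in STDERR
)
out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```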
@@ -226,8 +245,6 @@ Optionally, you can use the following command-line flags:
|------------------|-------------|
|`--gpu-split`| Comma-separated list of VRAM (in GB) to use per GPU device for model layers, e.g. `20,7,7` |
|`--max_seq_len MAX_SEQ_LEN`| Maximum sequence length. |
-|`--compress_pos_emb COMPRESS_POS_EMB`| Positional embeddings compression factor. Should typically be set to max_seq_len / 2048. |
-|`--alpha_value ALPHA_VALUE`| Positional embeddings alpha factor for NTK RoPE scaling. Same as above. Use either this or compress_pos_emb, not both. |

#### GPTQ-for-LLaMa
@@ -243,14 +260,6 @@ Optionally, you can use the following command-line flags:
-|`--percent PERCENT [PERCENT ...]`| FlexGen: allocation percentages. Must be 6 numbers separated by spaces (default: 0, 100, 100, 0, 100, 0). |
-|`--compress-weight`| FlexGen: Whether to compress weight (default: False).|
-|`--pin-weight [PIN_WEIGHT]`| FlexGen: whether to pin weights (setting this to False reduces CPU memory by 20%). |
-

#### DeepSpeed

| Flag | Description |
@@ -266,6 +275,13 @@ Optionally, you can use the following command-line flags:
|`--rwkv-strategy RWKV_STRATEGY`| RWKV: The strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
|`--rwkv-cuda-on`| RWKV: Compile the CUDA kernel for better performance. |

+#### RoPE (for llama.cpp and ExLlama only)
+
+| Flag | Description |
+|------------------|-------------|
+|`--compress_pos_emb COMPRESS_POS_EMB`| Positional embeddings compression factor. Should typically be set to max_seq_len / 2048. |
+|`--alpha_value ALPHA_VALUE`| Positional embeddings alpha factor for NTK RoPE scaling. Scaling is not identical to embedding compression. Use either this or compress_pos_emb, not both. |
+

#### Gradio

| Flag | Description |
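As a worked example of the compression factor in the RoPE table above: a model trained with a native 2048-token context and run at `--max_seq_len 4096` would typically use `--compress_pos_emb 2` (4096 / 2048 = 2); if NTK RoPE scaling is preferred instead, `--alpha_value` would be raised on its own and `--compress_pos_emb` left unset.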
@@ -293,8 +309,6 @@ Optionally, you can use the following command-line flags: