Commit 5d46bab
authored
llama : initial Mamba-2 support (ggml-org#9126)
* llama : initial Mamba-2 support
* ggml : SIMD ggml_ssm_scan for Mamba-2
* ggml : improve ggml_mul speed when masking recurrent states
* llama : support running Mamba-Codestral-7B-v0.1
* llama : fix Mamba-2 conv state saving
* ggml : make the ggml_mul fast broadcast path more consistently formatted
* llama : remove unused variable
* llama : add missing break
* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.
* llama : avoid redundant state copy for Mamba 1 and 2
* metal : attempt to adapt SSM_SCAN for Mamba-2
* metal : fix SSM_SCAN pipeline scope
* metal : use log and exp instead of log1pf and expf in SSM_SCAN
* metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
* metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
* metal : fix SSM_SCAN state head offset
* metal : fix wrong number of tokens per sequence in SSM_SCAN
* ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.
* ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
* convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models
to avoid some reshapes.
Not sure if it's a good idea,
but it makes the graph slightly cleaner.
* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
* convert : fix flake8 lint
* metal : fix confusion between ; and ,
* metal : add missing args for nb references in ssm_scan_f32_group
* metal : single-user mamba2 inference works
* kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.
* convert : avoid AutoConfig for Mamba and Mamba2 hparams
* kv-cache : allow context shift for recurrent models
* graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
* ggml : fix mamba2 ssm scan when compiled with SVE
* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
* cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
* mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when runnning Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON
* cuda : graceful fallback for Mamba-1 models with weird embd size1 parent e17991c commit 5d46bab
File tree
24 files changed
+1083
-319
lines changed- ggml
- include
- src
- ggml-cpu
- ggml-cuda
- ggml-metal
- gguf-py/gguf
- src
- tests
24 files changed
+1083
-319
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
4781 | 4781 | | |
4782 | 4782 | | |
4783 | 4783 | | |
| 4784 | + | |
| 4785 | + | |
| 4786 | + | |
| 4787 | + | |
| 4788 | + | |
| 4789 | + | |
| 4790 | + | |
| 4791 | + | |
4784 | 4792 | | |
4785 | 4793 | | |
4786 | 4794 | | |
| |||
4855 | 4863 | | |
4856 | 4864 | | |
4857 | 4865 | | |
| 4866 | + | |
| 4867 | + | |
| 4868 | + | |
| 4869 | + | |
| 4870 | + | |
| 4871 | + | |
| 4872 | + | |
| 4873 | + | |
| 4874 | + | |
| 4875 | + | |
| 4876 | + | |
| 4877 | + | |
| 4878 | + | |
| 4879 | + | |
| 4880 | + | |
| 4881 | + | |
| 4882 | + | |
| 4883 | + | |
| 4884 | + | |
| 4885 | + | |
| 4886 | + | |
| 4887 | + | |
| 4888 | + | |
| 4889 | + | |
| 4890 | + | |
| 4891 | + | |
| 4892 | + | |
| 4893 | + | |
| 4894 | + | |
| 4895 | + | |
| 4896 | + | |
| 4897 | + | |
| 4898 | + | |
| 4899 | + | |
| 4900 | + | |
| 4901 | + | |
| 4902 | + | |
| 4903 | + | |
| 4904 | + | |
| 4905 | + | |
| 4906 | + | |
| 4907 | + | |
| 4908 | + | |
| 4909 | + | |
| 4910 | + | |
| 4911 | + | |
| 4912 | + | |
| 4913 | + | |
| 4914 | + | |
| 4915 | + | |
| 4916 | + | |
| 4917 | + | |
| 4918 | + | |
| 4919 | + | |
| 4920 | + | |
| 4921 | + | |
| 4922 | + | |
| 4923 | + | |
| 4924 | + | |
| 4925 | + | |
| 4926 | + | |
| 4927 | + | |
| 4928 | + | |
| 4929 | + | |
| 4930 | + | |
| 4931 | + | |
| 4932 | + | |
| 4933 | + | |
| 4934 | + | |
| 4935 | + | |
| 4936 | + | |
| 4937 | + | |
| 4938 | + | |
| 4939 | + | |
| 4940 | + | |
| 4941 | + | |
| 4942 | + | |
| 4943 | + | |
| 4944 | + | |
| 4945 | + | |
| 4946 | + | |
| 4947 | + | |
| 4948 | + | |
| 4949 | + | |
| 4950 | + | |
| 4951 | + | |
| 4952 | + | |
| 4953 | + | |
| 4954 | + | |
| 4955 | + | |
| 4956 | + | |
| 4957 | + | |
| 4958 | + | |
| 4959 | + | |
4858 | 4960 | | |
4859 | 4961 | | |
4860 | 4962 | | |
| |||
6615 | 6717 | | |
6616 | 6718 | | |
6617 | 6719 | | |
6618 | | - | |
| 6720 | + | |
| 6721 | + | |
| 6722 | + | |
| 6723 | + | |
| 6724 | + | |
| 6725 | + | |
| 6726 | + | |
6619 | 6727 | | |
6620 | 6728 | | |
6621 | 6729 | | |
6622 | 6730 | | |
6623 | 6731 | | |
| 6732 | + | |
| 6733 | + | |
6624 | 6734 | | |
6625 | 6735 | | |
6626 | 6736 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
2031 | 2031 | | |
2032 | 2032 | | |
2033 | 2033 | | |
2034 | | - | |
| 2034 | + | |
| 2035 | + | |
2035 | 2036 | | |
2036 | 2037 | | |
2037 | 2038 | | |
| |||
0 commit comments