
Conversation

@krisbiradar


Introduces LLamaFlashAttentionType enum and integrates flash attention configuration into LLamaContextParams. Adds support for diffusion-based models in SafeLlamaModelHandle. Updates NativeApi and SafeLLamaContextHandle with new adapter metadata and sequence state methods. Syncs llama.cpp submodule.
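
For reference, here is a minimal sketch of what the new enum might look like on the C# side. The member names are taken from the review discussion further down in this thread; the numeric values are an assumption that mirrors llama.cpp's llama_flash_attn_type and should be checked against llama.h at the pinned commit.

public enum LLamaFlashAttentionType
{
    /// <summary>Let llama.cpp decide whether to use flash attention (assumed value).</summary>
    LLAMA_FLASH_ATTENTION_TYPE_AUTO = -1,

    /// <summary>Never use flash attention (assumed value).</summary>
    LLAMA_FLASH_ATTENTION_TYPE_DISABLED = 0,

    /// <summary>Always use flash attention (assumed value).</summary>
    LLAMA_FLASH_ATTENTION_TYPE_ENABLED = 1,
}
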
@martindevans
Member

Thanks for putting this together. I've started a build here to produce new binaries; once those are ready we can update the csproj and run the tests.

@krisbiradar
Author

Hi @martindevans, I guess I need to hardcode the binary file URL for now; once the changes are tested and ready to merge we can restore the original URL, right?

@krisbiradar
Author

I've changed it, assuming I had to hardcode it. If anything else is required, do let me know.

@martindevans
Member

Sorry for the delay, I'll create a new release on https://github.com/SciSharp/LLamaSharpBinaries shortly, then you can just put the ID of that release into the csproj file.

@martindevans
Member

Ok I've created https://github.com/SciSharp/LLamaSharpBinaries/releases/tag/86587da, you can put that release ID into the csproj here and it should auto download on build (probably best to do a clean and rebuild to be sure).

@krisbiradar
Author

Will do it in some time.

@krisbiradar
Author

krisbiradar commented Sep 14, 2025

Hi @martindevans, a few tests are failing, but I checked out the last commit before my contribution and the tests fail there too... am I missing something?

@martindevans
Member

@martindevans left a comment

For reference, here's the llama.h diff I'm working from for this review: ggml-org/llama.cpp@11dd5a4...86587da#diff-dd7c3bc82b7d2728f1fe661e7820aebbfd53f558a5fdd85bbd6dd618213a118d

There are a few bits in the diff that are missing from this PR. The most important is the flash_attn parameter, which has been removed on the C++ side but not the C# side; that will cause all the other parameters to be misaligned. Hopefully fixing this might resolve some of the test failures 🤞
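
To illustrate the misalignment with a minimal sketch (the struct and field names here are hypothetical, not the real LLamaContextParams layout): with LayoutKind.Sequential, a field that no longer exists on the native side shifts the offset of every field declared after it.

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct ExampleContextParams
{
    public uint n_ctx;
    // public sbyte flash_attn;  // removed on the C++ side; keeping it in the managed
    //                           // mirror would shift offload_kqv and everything after
    //                           // it by one byte, corrupting every later field
    public sbyte offload_kqv;
    public sbyte embeddings;
}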

There are a few functions that are missing:

  • llama_adapter_get_alora_n_invocation_tokens
  • llama_adapter_get_alora_invocation_tokens
  • llama_flash_attn_type_name
  • llama_state_seq_flags (3 related functions)

If you're not interested in implementing those right now, please just dump their signatures into a comment in NativeApi.cs; that way we don't get out of sync.
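
For illustration, a rough sketch of the kind of placeholder comment that could go into NativeApi.cs. The native signatures below are written out from the llama.h diff linked above as best recalled, and should be verified against that header before any real bindings are added:

// TODO: native functions not yet bound here (verify against llama.h in release 86587da):
//
// uint64_t            llama_adapter_get_alora_n_invocation_tokens(const struct llama_adapter_lora * adapter);
// const llama_token * llama_adapter_get_alora_invocation_tokens(const struct llama_adapter_lora * adapter);
// const char *        llama_flash_attn_type_name(enum llama_flash_attn_type flash_attn_type);
// llama_state_seq_* variants taking the new llama_state_seq_flags parameter
// (get size / get data / set data).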

@krisbiradar
Author

I tried fixing these but the tests still failed. I'll try to dig deeper this weekend.

@martindevans
Member

I'm on holiday this week, so I won't be around much to help out. I'll try to keep tabs on this thread though.

A lot of the test failures seem to be similar, something to do with invalid batches in the Reranker. e.g.

decode: cannot decode batches with this context (calling encode() instead)
init: invalid seq_id[9][0] = 1 >= 1
encode: failed to initialize batch
llama_decode: failed to decode, ret = -1

That seems to be coming from here in llama.cpp; that code looks like it checks that none of the tokens in the batch are being assigned to a sequence >= n_seq_id.

The LLamaReranker is relatively new, so it's very possible there's a bug of some kind in it.

@krisbiradar
Author

Sure, will go through this.

@martindevans
Member

I just opened a PR on your fork into this branch that fixes 2 of the failing tests.

The other seems to be a genuine failure, something wrong with the way batching works. I tried out the BatchedExecutor forking demo as another check and that fails.

@krisbiradar
Author

Will check that. Sorry for the delay, I've been busy with other tasks lately.

@martindevans
Member

No worries, I was meaning to check this out myself ages ago too! I'm hoping to get some time to properly look through this and other PRs this weekend 🤞

@Lyrcaxis
Contributor

Lyrcaxis commented Oct 10, 2025

We got an answer that solves an issue caused by a significant change:
[screenshot of the upstream answer]

New LLamaContextParams.Default() needs to become like so to support current LLamaSharp behaviour:

static LLamaContextParams Default() {
    var _params = llama_context_default_params();
    _params.kv_unified = true; // keep a unified KV cache to match current LLamaSharp behaviour
    return _params;

    [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
    static extern LLamaContextParams llama_context_default_params();
}

result.type_k = @params.TypeK ?? GGMLType.GGML_TYPE_F16;
result.type_v = @params.TypeV ?? GGMLType.GGML_TYPE_F16;
result.offload_kqv = !@params.NoKqvOffload;
result.flash_attention = @params.FlashAttention;
@Lyrcaxis
Contributor

Lyrcaxis commented Oct 10, 2025

Instead of completely removing the option to use flash attention, can you pass it through to llama_flash_attn_type?
I would suggest keeping the previous FlashAttention bool as it was, but making it nullable, so null == Auto.

result.llama_flash_attn_type = @params.FlashAttention switch
{
    true => LLamaFlashAttentionType.LLAMA_FLASH_ATTENTION_TYPE_ENABLED,
    false => LLamaFlashAttentionType.LLAMA_FLASH_ATTENTION_TYPE_DISABLED,
    null => LLamaFlashAttentionType.LLAMA_FLASH_ATTENTION_TYPE_AUTO
};
result.kv_unified = true; // if we wanna hardcode it here instead of in `Default()`.

@krisbiradar
Author

Well, I tried doing this but it didn't seem to make a difference.
Will check on this today though.

@krisbiradar
Author

@martindevans

I took a look at the batched sampling test failure — it appears to be caused by a regression in the unified_kv_cache implementation. See PR #14363 and Issue #14847 for context.

As a solution I passed MaxSeq = 4 in the context params and it worked just fine.

Let me know if I got this right.
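
For reference, a minimal sketch of that workaround, assuming the parameter in question is the SeqMax property on ModelParams (the later commit notes in this thread also mention SeqMax); the model path is a placeholder:

using LLama;
using LLama.Common;

var parameters = new ModelParams("models/test-model.gguf") // placeholder path
{
    ContextSize = 2048,
    SeqMax = 4, // one slot per concurrent sequence used by the batched sampling test
};

using var weights = LLamaWeights.LoadFromFile(parameters);
using var context = weights.CreateContext(parameters);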

@krisbiradar
Author

Hi @martindevans, @Lyrcaxis, have you got some time to take a look at this?

@Lyrcaxis
Contributor

@Lyrcaxis left a comment

.

Set FlashAttention to true in BuilderExtensions for improved performance. Refactored NativeApi P/Invoke method signatures to single-line format for better readability and consistency.
FlashAttention is now enabled by default in model parameter initialization for embedding and text generation. The unused SeqMax parameter has been removed from unit tests to simplify configuration. Minor formatting improvements were made in IContextParamsExtensions and NativeApi for consistency.
@Lyrcaxis
Contributor

Lyrcaxis commented Nov 2, 2025

@krisbiradar @martindevans alright, I figured some way out.

So, if kv_unified is true, the context no longer splits into n_seq_max equal context blocks.
What's remaining is to indeed set n_seq_max to some high value. But I think it shouldn't be on executor level.

In my local integration tests, I solved everything at the constructor level, with basically just:

defaultParams.n_seq_max = (uint) Math.Min(ContextSize / 8, 256);
defaultParams.kv_unified = true;

This effectively maintains previous LLamaSharp behaviour, but I'm not sure if you'll like that, Martin. Let me know your thoughts.

Also, since the binaries for gemma 3n were built, new noteworthy models came out (e.g. LFM2-MOE). Might be worth an update.

@krisbiradar
Author

> @krisbiradar @martindevans alright, I figured some way out.
>
> So, if kv_unified is true, the context no longer splits into n_seq_max equal context blocks. What's remaining is to indeed set n_seq_max to some high value. But I think it shouldn't be on executor level.
>
> In my local integration tests, I solved everything at the constructor level, with basically just:
>
> defaultParams.n_seq_max = (uint) Math.Min(ContextSize / 8, 256);
> defaultParams.kv_unified = true;
>
> This effectively maintains previous LLamaSharp behaviour, but I'm not sure if you'll like that, Martin. Let me know your thoughts.
>
> Also, since the binaries for gemma 3n were built, new noteworthy models came out (e.g. LFM2-MOE). Might be worth an update.

Thanks, will push this right away!

@krisbiradar
Author

Any luck @Lyrcaxis, @martindevans?

@krisbiradar
Author

Are we good to go, or are there more changes? Do let me know, @martindevans.

@martindevans
Member

Sorry I haven't kept up with this PR! If it's functional now, I'm happy to merge it as-is. We'll probably need to address this new n_seq_max thing in a more complete way in future PRs, but let's get this merged first, it's long overdue! Thanks so much for your work on this and keeping it active ❤️

@krisbiradar
Author

> Sorry I haven't kept up with this PR! If it's functional now, I'm happy to merge it as-is. We'll probably need to address this new n_seq_max thing in a more complete way in future PRs, but let's get this merged first, it's long overdue! Thanks so much for your work on this and keeping it active ❤️

Sorry, I forgot to run all the tests when fixing the max_seq issue; I had only tested the failing ones. Now that I've fixed the max_seq issue, it's causing the context size tests to fail.

I guess the fix worked because we just happened to hardcode the correct value of the context size. The actual context size will be 256 per sequence, i.e. 256 * n_seq_max, so I changed those tests.

Hope this time I get the PR merged!!!

@krisbiradar
Author

Am I required to change anything here again? Because the tests do pass in a Windows environment, @Lyrcaxis @martindevans.

@martindevans
Member

Not sure what's going on there, it looks like it failed to find some files before it even started the tests. I've restarted the failed tests to see if it was a one-off.
