Commit 916fe13

Adding Compute-Context-Length (CCL)

Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com>
1 parent 19061c6

File tree

2 files changed (+6, -6 lines)


QEfficient/generation/text_generation_inference.py

Lines changed: 1 addition & 1 deletion

@@ -924,7 +924,7 @@ def run_continuous_batching_decode(self, prompt_queue, generation_len):
             max_position_id = np.max(decode_inputs["position_ids"])

             # Update ccl_id and comp_ctx_lengths_decode based on the maximum position id
-            ccl_id_initial = self.prefill_ccl_len
+            ccl_id_initial = 0
             ccl_id = ccl_id_initial
             for i in range(ccl_id_initial, len(self.comp_ctx_lengths_decode)):
                 if max_position_id < self.comp_ctx_lengths_decode[i]:
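The decode-side bucket selection in this hunk can be sketched as a standalone helper. This is a minimal illustration of the loop shown in the diff, not QEfficient API: the function name `select_ccl_id` and its signature are hypothetical, and after this commit the search starts at index 0 instead of `self.prefill_ccl_len`.

```python
def select_ccl_id(comp_ctx_lengths_decode, max_position_id):
    """Pick the index of the smallest compute-context-length bucket that
    still covers the largest position id in the current decode batch.

    Mirrors the loop in run_continuous_batching_decode (hypothetical
    helper, written for illustration only).
    """
    ccl_id_initial = 0  # after this commit, the search starts at index 0
    ccl_id = ccl_id_initial
    for i in range(ccl_id_initial, len(comp_ctx_lengths_decode)):
        if max_position_id < comp_ctx_lengths_decode[i]:
            ccl_id = i
            break  # first bucket large enough; stop searching
    return ccl_id


# With decode buckets [512, 1024, 2048]: a max position id of 600 falls
# past the 512 bucket but inside the 1024 bucket, so index 1 is chosen.
print(select_ccl_id([512, 1024, 2048], 600))  # → 1
```

As in the original loop, if `max_position_id` exceeds every bucket the index falls back to `ccl_id_initial`.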

examples/compute_context_length.py

Lines changed: 5 additions & 5 deletions

@@ -12,15 +12,15 @@
 from QEfficient import QEFFAutoModelForCausalLM

 ## Using the optional comp_ctx_lengths variables you can pass lists of context lengths. The model runs with its default context length if comp_ctx_lengths=None. ##
-## - The first Prefill_ccl_len numbers in this list are the context lengths that will be used during prefilling. ##
-## - During the decoding process, based on the position_id or cache index it will work with the specific compute-context-length in the list. It will start from a proper compute-context-length in the list based on input prompt length and will gradually increase the compute-context-length if the cache index passes the current compute-context-length. ##
+## - The first list, comp_ctx_lengths_prefill, holds the compute-context-lengths used during prefill. ##
+## - The second list, comp_ctx_lengths_decode, is used during decoding. Based on the position_id or cache index, decoding starts from the appropriate compute-context-length for the input prompt length and gradually increases it whenever the cache index passes the current compute-context-length. ##


-ctx_len = 2048
+ctx_len = 1024
 comp_ctx_lengths_prefill = [256]
-comp_ctx_lengths_decode = [512, 1024, ctx_len]
+comp_ctx_lengths_decode = [512, ctx_len]

-model_name = "Qwen/Qwen2.5-7B"
+model_name = "ibm-granite/granite-3.2-8b-instruct"
 model = QEFFAutoModelForCausalLM.from_pretrained(
     model_name,
     continuous_batching=True,
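The example's lists follow a convention worth checking up front: decode buckets increase and end at the full context length. The sketch below is a hypothetical sanity-check helper (not part of QEfficient) that validates CCL lists against that convention before passing them to the model.

```python
def check_ccl_lists(ctx_len, prefill_ccls, decode_ccls):
    """Validate CCL lists against the example's conventions.

    Hypothetical helper, written for illustration: the convention is
    inferred from examples/compute_context_length.py, where the decode
    list increases and its last entry is the full ctx_len.
    """
    if decode_ccls != sorted(decode_ccls):
        raise ValueError("decode CCLs must be in increasing order")
    if decode_ccls[-1] != ctx_len:
        raise ValueError("last decode CCL should equal the full ctx_len")
    if any(c > ctx_len for c in prefill_ccls):
        raise ValueError("prefill CCLs must not exceed ctx_len")
    return True


# Matches the values in the updated example above.
check_ccl_lists(1024, prefill_ccls=[256], decode_ccls=[512, 1024])
```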
