Commit 2d785ab

Merge pull request #2: Upstream sync Fix Sonnet 4 APAC
2 parents: ff70559 + 9c89baf

12 files changed: +364 -42 lines


.gitignore

Lines changed: 2 additions & 1 deletion
@@ -159,4 +159,5 @@ cython_debug/
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
 
-Config
+Config
+.vscode/launch.json

README.md

Lines changed: 43 additions & 0 deletions
@@ -26,7 +26,9 @@ If you find this GitHub repository useful, please consider giving it a free star
 - [x] Support Embedding API
 - [x] Support Multimodal API
 - [x] Support Cross-Region Inference
+- [x] Support Application Inference Profiles (**new**)
 - [x] Support Reasoning (**new**)
+- [x] Support Interleaved thinking (**new**)
 
 Please check [Usage Guide](./docs/Usage.md) for more details about how to use the new APIs.
 

@@ -148,7 +150,48 @@ print(completion.choices[0].message.content)
 
 Please check [Usage Guide](./docs/Usage.md) for more details about how to use embedding API, multimodal API and tool call.
 
+### Application Inference Profiles
 
+This proxy now supports **Application Inference Profiles**, which allow you to track usage and costs for your model invocations. You can use application inference profiles created in your AWS account for cost tracking and monitoring purposes.
+
+**Using Application Inference Profiles:**
+
+```bash
+# Use an application inference profile ARN as the model ID
+curl $OPENAI_BASE_URL/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $OPENAI_API_KEY" \
+  -d '{
+    "model": "arn:aws:bedrock:us-west-2:123456789012:application-inference-profile/your-profile-id",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Hello!"
+      }
+    ]
+  }'
+```
+
+**SDK Usage with Application Inference Profiles:**
+
+```python
+from openai import OpenAI
+
+client = OpenAI()
+completion = client.chat.completions.create(
+    model="arn:aws:bedrock:us-west-2:123456789012:application-inference-profile/your-profile-id",
+    messages=[{"role": "user", "content": "Hello!"}],
+)
+
+print(completion.choices[0].message.content)
+```
+
+**Benefits of Application Inference Profiles:**
+- **Cost Tracking**: Track usage and costs for specific applications or use cases
+- **Usage Monitoring**: Monitor model invocation metrics through CloudWatch
+- **Tag-based Cost Allocation**: Use AWS cost allocation tags for detailed billing analysis
+
+For more information about creating and managing application inference profiles, see the [Amazon Bedrock User Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-profiles-create.html).
 
 ## Other Examples
 
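Since the `model` field can now carry either a plain model ID or a profile ARN, the proxy has to tell the two apart when routing requests. A minimal sketch of that check; the helper name and regex are illustrative, based only on the ARN format shown above, and are not taken from the proxy's source:

```python
import re

# Illustrative pattern matching ARNs like the example above;
# not the proxy's actual routing code.
_APP_PROFILE_ARN = re.compile(
    r"^arn:aws:bedrock:[a-z0-9-]+:\d{12}:application-inference-profile/[\w-]+$"
)

def is_application_inference_profile(model_id: str) -> bool:
    """True when the OpenAI `model` field carries an application inference profile ARN."""
    return bool(_APP_PROFILE_ARN.match(model_id))
```

A request whose model matches this pattern would be forwarded with the ARN as-is, while plain IDs keep the existing model-lookup path.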

deployment/BedrockProxy.template

Lines changed: 2 additions & 0 deletions
@@ -151,6 +151,7 @@ Resources:
 Resource:
   - arn:aws:bedrock:*::foundation-model/*
   - arn:aws:bedrock:*:*:inference-profile/*
+  - arn:aws:bedrock:*:*:application-inference-profile/*
 - Action:
   - secretsmanager:GetSecretValue
   - secretsmanager:DescribeSecret

@@ -185,6 +186,7 @@ Resources:
 Ref: DefaultModelId
 DEFAULT_EMBEDDING_MODEL: cohere.embed-multilingual-v3
 ENABLE_CROSS_REGION_INFERENCE: "true"
+ENABLE_APPLICATION_INFERENCE_PROFILES: "true"
 MemorySize: 1024
 PackageType: Image
 Role:

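Both deployment templates gate the feature on the `ENABLE_APPLICATION_INFERENCE_PROFILES` environment variable. A small sketch of how a service might read such a flag; the variable name comes from the templates above, while the parsing rules are an assumption, not the proxy's actual code:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean feature flag (assumed parsing rules)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

# The templates set this to "true" for both the Lambda and Fargate deployments.
ENABLE_PROFILES = env_flag("ENABLE_APPLICATION_INFERENCE_PROFILES")
```

Accepting a few truthy spellings keeps the flag forgiving of `"true"` vs `"TRUE"` in template edits.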
deployment/BedrockProxyFargate.template

Lines changed: 3 additions & 0 deletions
@@ -193,6 +193,7 @@ Resources:
 Resource:
   - arn:aws:bedrock:*::foundation-model/*
   - arn:aws:bedrock:*:*:inference-profile/*
+  - arn:aws:bedrock:*:*:application-inference-profile/*
 Version: "2012-10-17"
 PolicyName: ProxyTaskRoleDefaultPolicy933321B8
 Roles:

@@ -222,6 +223,8 @@ Resources:
 Value: cohere.embed-multilingual-v3
 - Name: ENABLE_CROSS_REGION_INFERENCE
   Value: "true"
+- Name: ENABLE_APPLICATION_INFERENCE_PROFILES
+  Value: "true"
 Essential: true
 Image:
   Fn::Join:

docs/Usage.md

Lines changed: 54 additions & 2 deletions
@@ -15,6 +15,7 @@ export OPENAI_BASE_URL=<API base url>
 - [Multimodal API](#multimodal-api)
 - [Tool Call](#tool-call)
 - [Reasoning](#reasoning)
+- [Interleaved thinking (beta)](#interleaved-thinking-beta)
 
 ## Models API
 

@@ -135,6 +136,7 @@ print(doc_result[0][:5])
 **Example Request**
 
 ```bash
+curl $OPENAI_BASE_URL/chat/completions \
 curl $OPENAI_BASE_URL/chat/completions \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $OPENAI_API_KEY" \

@@ -340,7 +342,6 @@ curl $OPENAI_BASE_URL/chat/completions \
 -d '{
 "model": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
 "messages": [
-{
 "role": "user",
 "content": "which one is bigger, 3.9 or 3.11?"
 }

@@ -441,4 +442,55 @@ for chunk in response:
 reasoning_content += chunk.choices[0].delta.reasoning_content
 elif chunk.choices[0].delta.content:
 content += chunk.choices[0].delta.content
-```
+```
+
+## Interleaved thinking (beta)
+
+**Important Notice**: Please carefully review the following points before using reasoning mode with the Chat Completions API.
+
+Extended thinking with tool use in Claude 4 models supports [interleaved thinking](https://docs.aws.amazon.com/bedrock/latest/userguide/claude-messages-extended-thinking.html#claude-messages-extended-thinking-tool-use-interleaved), which enables Claude 4 models to think between tool calls and run more sophisticated reasoning after receiving tool results. This is helpful for more complex agentic interactions.
+With interleaved thinking, the `budget_tokens` can exceed the `max_tokens` parameter because it represents the total budget across all thinking blocks within one assistant turn.
+
+**Example Request**
+
+- Non-Streaming
+
+```bash
+curl http://127.0.0.1:8000/api/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer bedrock" \
+  -d '{
+    "model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
+    "max_tokens": 2048,
+    "messages": [{
+      "role": "user",
+      "content": "有一天,一个女孩参加数学考试只得了 38 分。她心里对父亲的惩罚充满恐惧,于是偷偷把分数改成了 88 分。她的父亲看到试卷后,怒发冲冠,狠狠地给了她一巴掌,怒吼道:“你这 8 怎么一半是绿的一半是红的,你以为我是傻子吗?”女孩被打后,委屈地哭了起来,什么也没说。过了一会儿,父亲突然崩溃了。请问这位父亲为什么过一会崩溃了?"
+    }],
+    "extra_body": {
+      "anthropic_beta": ["interleaved-thinking-2025-05-14"],
+      "thinking": {"type": "enabled", "budget_tokens": 4096}
+    }
+  }'
+```
+
+- Streaming
+
+```bash
+curl http://127.0.0.1:8000/api/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer bedrock" \
+  -d '{
+    "model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
+    "max_tokens": 2048,
+    "messages": [{
+      "role": "user",
+      "content": "有一天,一个女孩参加数学考试只得了 38 分。她心里对父亲的惩罚充满恐惧,于是偷偷把分数改成了 88 分。她的父亲看到试卷后,怒发冲冠,狠狠地给了她一巴掌,怒吼道:“你这 8 怎么一半是绿的一半是红的,你以为我是傻子吗?”女孩被打后,委屈地哭了起来,什么也没说。过了一会儿,父亲突然崩溃了。请问这位父亲为什么过一会崩溃了?"
+    }],
+    "stream": true,
+    "extra_body": {
+      "anthropic_beta": ["interleaved-thinking-2025-05-14"],
+      "thinking": {"type": "enabled", "budget_tokens": 4096}
+    }
+  }'
+```

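The curl examples in the Usage guide map directly onto the OpenAI Python SDK, with the Bedrock-specific fields passed through `extra_body`. A sketch under the same local endpoint and keys shown in the docs; the request arguments are built in a separate helper so they can be inspected without a server or the SDK installed:

```python
def build_request_kwargs(prompt: str) -> dict:
    """Arguments mirroring the non-streaming curl example in the Usage guide."""
    return dict(
        model="us.anthropic.claude-sonnet-4-20250514-v1:0",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "anthropic_beta": ["interleaved-thinking-2025-05-14"],
            "thinking": {"type": "enabled", "budget_tokens": 4096},
        },
    )

def send_interleaved(prompt: str,
                     base_url: str = "http://127.0.0.1:8000/api/v1",
                     api_key: str = "bedrock") -> str:
    # Imported lazily so building/inspecting the request needs no dependencies.
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key=api_key)
    completion = client.chat.completions.create(**build_request_kwargs(prompt))
    return completion.choices[0].message.content
```

`send_interleaved` assumes the proxy is running locally as in the curl examples; adjust `base_url` and `api_key` for your deployment.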
docs/Usage_CN.md

Lines changed: 55 additions & 1 deletion
@@ -15,6 +15,8 @@ export OPENAI_BASE_URL=<API base url>
 - [Multimodal API](#multimodal-api)
 - [Tool Call](#tool-call)
 - [Reasoning](#reasoning)
+- [Interleaved thinking (beta)](#interleaved-thinking-beta)
+
 
 ## Models API
 

@@ -440,4 +442,56 @@ for chunk in response:
 reasoning_content += chunk.choices[0].delta.reasoning_content
 elif chunk.choices[0].delta.content:
 content += chunk.choices[0].delta.content
-```
+```
+
+## Interleaved thinking (beta)
+
+**Important Notice**: Be sure to read the following carefully before using reasoning mode with the Chat Completion API.
+
+Claude 4 models support extended thinking with tool use, including [interleaved thinking](https://docs.aws.amazon.com/bedrock/latest/userguide/claude-messages-extended-thinking.html#claude-messages-extended-thinking-tool-use-interleaved). This lets Claude 4 think between tool calls and perform more sophisticated reasoning after receiving tool results, which is very helpful for more complex agentic AI interactions.
+
+In interleaved thinking mode, `budget_tokens` can exceed the `max_tokens` parameter because it represents the total token budget across all thinking blocks within one assistant turn.
+
+**Example Request**
+
+- Non-Streaming
+
+```bash
+curl http://127.0.0.1:8000/api/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer bedrock" \
+  -d '{
+    "model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
+    "max_tokens": 2048,
+    "messages": [{
+      "role": "user",
+      "content": "有一天,一个女孩参加数学考试只得了 38 分。她心里对父亲的惩罚充满恐惧,于是偷偷把分数改成了 88 分。她的父亲看到试卷后,怒发冲冠,狠狠地给了她一巴掌,怒吼道:“你这 8 怎么一半是绿的一半是红的,你以为我是傻子吗?”女孩被打后,委屈地哭了起来,什么也没说。过了一会儿,父亲突然崩溃了。请问这位父亲为什么过一会崩溃了?"
+    }],
+    "extra_body": {
+      "anthropic_beta": ["interleaved-thinking-2025-05-14"],
+      "thinking": {"type": "enabled", "budget_tokens": 4096}
+    }
+  }'
+```
+
+- Streaming
+
+```bash
+curl http://127.0.0.1:8000/api/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer bedrock" \
+  -d '{
+    "model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
+    "max_tokens": 2048,
+    "messages": [{
+      "role": "user",
+      "content": "有一天,一个女孩参加数学考试只得了 38 分。她心里对父亲的惩罚充满恐惧,于是偷偷把分数改成了 88 分。她的父亲看到试卷后,怒发冲冠,狠狠地给了她一巴掌,怒吼道:“你这 8 怎么一半是绿的一半是红的,你以为我是傻子吗?”女孩被打后,委屈地哭了起来,什么也没说。过了一会儿,父亲突然崩溃了。请问这位父亲为什么过一会崩溃了?"
+    }],
+    "stream": true,
+    "extra_body": {
+      "anthropic_beta": ["interleaved-thinking-2025-05-14"],
+      "thinking": {"type": "enabled", "budget_tokens": 4096}
+    }
+  }'
+```

src/api/app.py

Lines changed: 10 additions & 0 deletions
@@ -45,6 +45,16 @@ async def health():
 
 @app.exception_handler(RequestValidationError)
 async def validation_exception_handler(request, exc):
+    logger = logging.getLogger(__name__)
+
+    # Log essential info only - avoid sensitive data and performance overhead
+    logger.warning(
+        "Request validation failed: %s %s - %s",
+        request.method,
+        request.url.path,
+        str(exc).split('\n')[0]  # First line only
+    )
+
     return PlainTextResponse(str(exc), status_code=400)

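The new handler deliberately logs only the first line of the validation error, since validation messages can span many lines and may echo request payloads. The truncation idea in isolation (handler wiring omitted; the sample error text is invented):

```python
import logging

logger = logging.getLogger(__name__)

def first_line(exc: Exception) -> str:
    """Reduce a multi-line validation error to its summary line for logging."""
    return str(exc).split("\n")[0]

# Invented sample: pydantic-style validation errors often span several lines.
err = ValueError("1 validation error for Request\nbody -> messages\n  field required")
logger.warning("Request validation failed: %s %s - %s",
               "POST", "/api/v1/chat/completions", first_line(err))
```

Only the summary line reaches the log; the full error is still returned to the client in the 400 response body.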
src/api/models/base.py

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,4 @@
+import logging
 import time
 import uuid
 from abc import ABC, abstractmethod

@@ -14,6 +15,8 @@
 Error,
 )
 
+logger = logging.getLogger(__name__)
+
 
 class BaseChatModel(ABC):
     """Represent a basic chat model

@@ -46,6 +49,7 @@ def generate_message_id() -> str:
 @staticmethod
 def stream_response_to_bytes(response: ChatStreamResponse | Error | None = None) -> bytes:
     if isinstance(response, Error):
+        logger.error("Stream error: %s", response.error.message if response.error else "Unknown error")
         data = response.model_dump_json()
     elif isinstance(response, ChatStreamResponse):
         # to populate other fields when using exclude_unset=True

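For context, `stream_response_to_bytes` serializes each chunk for a server-sent-events stream, and the added `logger.error` call fires only for `Error` chunks. A simplified sketch of the SSE framing pattern; the real method serializes pydantic models, and the `[DONE]` sentinel is the OpenAI-style convention assumed here:

```python
import json
from typing import Optional

def stream_response_to_bytes(payload: Optional[dict] = None) -> bytes:
    """Frame one chunk as a server-sent event; None marks end of stream."""
    if payload is None:
        # OpenAI-style end-of-stream sentinel (an assumption, not from the diff).
        return b"data: [DONE]\n\n"
    # Each event is a "data:" line followed by a blank line.
    return f"data: {json.dumps(payload)}\n\n".encode("utf-8")
```

The blank line after each `data:` field is what delimits events for SSE clients, which is why the double `\n\n` matters.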