Commit 2d785ab

Merge pull request #2: Upstream sync Fix Sonnet 4 APAC
2 parents: ff70559 + 9c89baf

12 files changed: +364 -42 lines


.gitignore

Lines changed: 2 additions & 1 deletion
@@ -159,4 +159,5 @@ cython_debug/
 # option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
 
-Config
+Config
+.vscode/launch.json

README.md

Lines changed: 43 additions & 0 deletions
@@ -26,7 +26,9 @@ If you find this GitHub repository useful, please consider giving it a free star
 - [x] Support Embedding API
 - [x] Support Multimodal API
 - [x] Support Cross-Region Inference
+- [x] Support Application Inference Profiles (**new**)
 - [x] Support Reasoning (**new**)
+- [x] Support Interleaved thinking (**new**)
 
 Please check [Usage Guide](./docs/Usage.md) for more details about how to use the new APIs.
 

@@ -148,7 +150,48 @@ print(completion.choices[0].message.content)
 
 Please check [Usage Guide](./docs/Usage.md) for more details about how to use embedding API, multimodal API and tool call.
 
+### Application Inference Profiles
 
+This proxy now supports **Application Inference Profiles**, which allow you to track usage and costs for your model invocations. You can use application inference profiles created in your AWS account for cost tracking and monitoring purposes.
+
+**Using Application Inference Profiles:**
+
+```bash
+# Use an application inference profile ARN as the model ID
+curl $OPENAI_BASE_URL/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer $OPENAI_API_KEY" \
+  -d '{
+    "model": "arn:aws:bedrock:us-west-2:123456789012:application-inference-profile/your-profile-id",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Hello!"
+      }
+    ]
+  }'
+```
+
+**SDK Usage with Application Inference Profiles:**
+
+```python
+from openai import OpenAI
+
+client = OpenAI()
+completion = client.chat.completions.create(
+    model="arn:aws:bedrock:us-west-2:123456789012:application-inference-profile/your-profile-id",
+    messages=[{"role": "user", "content": "Hello!"}],
+)
+
+print(completion.choices[0].message.content)
+```
+
+**Benefits of Application Inference Profiles:**
+- **Cost Tracking**: Track usage and costs for specific applications or use cases
+- **Usage Monitoring**: Monitor model invocation metrics through CloudWatch
+- **Tag-based Cost Allocation**: Use AWS cost allocation tags for detailed billing analysis
+
+For more information about creating and managing application inference profiles, see the [Amazon Bedrock User Guide](https://docs.aws.amazon.com/bedrock/latest/userguide/inference-profiles-create.html).
 
 ## Other Examples
 
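Since the `model` field can now carry either a plain model ID or a profile ARN, the proxy has to tell the two apart when routing requests. A minimal sketch of that check; the helper name and regex are illustrative, based only on the ARN format shown above, and are not taken from the proxy's source:

```python
import re

# Illustrative pattern matching ARNs like the example above;
# not the proxy's actual routing code.
_APP_PROFILE_ARN = re.compile(
    r"^arn:aws:bedrock:[a-z0-9-]+:\d{12}:application-inference-profile/[\w-]+$"
)

def is_application_inference_profile(model_id: str) -> bool:
    """True when the OpenAI `model` field carries an application inference profile ARN."""
    return bool(_APP_PROFILE_ARN.match(model_id))
```

A request whose model matches this pattern would be forwarded with the ARN as-is, while plain IDs keep the existing model-lookup path.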

deployment/BedrockProxy.template

Lines changed: 2 additions & 0 deletions
@@ -151,6 +151,7 @@ Resources:
 Resource:
   - arn:aws:bedrock:*::foundation-model/*
   - arn:aws:bedrock:*:*:inference-profile/*
+  - arn:aws:bedrock:*:*:application-inference-profile/*
 - Action:
   - secretsmanager:GetSecretValue
   - secretsmanager:DescribeSecret

@@ -185,6 +186,7 @@ Resources:
 Ref: DefaultModelId
 DEFAULT_EMBEDDING_MODEL: cohere.embed-multilingual-v3
 ENABLE_CROSS_REGION_INFERENCE: "true"
+ENABLE_APPLICATION_INFERENCE_PROFILES: "true"
 MemorySize: 1024
 PackageType: Image
 Role:

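Both deployment templates gate the feature on the `ENABLE_APPLICATION_INFERENCE_PROFILES` environment variable. A small sketch of how a service might read such a flag; the variable name comes from the templates above, while the parsing rules are an assumption, not the proxy's actual code:

```python
import os

def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean feature flag (assumed parsing rules)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

# The templates set this to "true" for both the Lambda and Fargate deployments.
ENABLE_PROFILES = env_flag("ENABLE_APPLICATION_INFERENCE_PROFILES")
```

Accepting a few truthy spellings keeps the flag forgiving of `"true"` vs `"TRUE"` in template edits.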
deployment/BedrockProxyFargate.template

Lines changed: 3 additions & 0 deletions
@@ -193,6 +193,7 @@ Resources:
 Resource:
   - arn:aws:bedrock:*::foundation-model/*
   - arn:aws:bedrock:*:*:inference-profile/*
+  - arn:aws:bedrock:*:*:application-inference-profile/*
 Version: "2012-10-17"
 PolicyName: ProxyTaskRoleDefaultPolicy933321B8
 Roles:

@@ -222,6 +223,8 @@ Resources:
 Value: cohere.embed-multilingual-v3
 - Name: ENABLE_CROSS_REGION_INFERENCE
   Value: "true"
+- Name: ENABLE_APPLICATION_INFERENCE_PROFILES
+  Value: "true"
 Essential: true
 Image:
   Fn::Join:

docs/Usage.md

Lines changed: 54 additions & 2 deletions
@@ -15,6 +15,7 @@ export OPENAI_BASE_URL=<API base url>
 - [Multimodal API](#multimodal-api)
 - [Tool Call](#tool-call)
 - [Reasoning](#reasoning)
+- [Interleaved thinking (beta)](#interleaved-thinking-beta)
 
 ## Models API
 

@@ -135,6 +136,7 @@ print(doc_result[0][:5])
 **Example Request**
 
 ```bash
+curl $OPENAI_BASE_URL/chat/completions \
 curl $OPENAI_BASE_URL/chat/completions \
 -H "Content-Type: application/json" \
 -H "Authorization: Bearer $OPENAI_API_KEY" \

@@ -340,7 +342,6 @@ curl $OPENAI_BASE_URL/chat/completions \
 -d '{
 "model": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
 "messages": [
-{
 "role": "user",
 "content": "which one is bigger, 3.9 or 3.11?"
 }

@@ -441,4 +442,55 @@ for chunk in response:
 reasoning_content += chunk.choices[0].delta.reasoning_content
 elif chunk.choices[0].delta.content:
 content += chunk.choices[0].delta.content
-```
+```
+
+## Interleaved thinking (beta)
+
+**Important Notice**: Please carefully review the following points before using reasoning mode with the Chat Completions API.
+
+Extended thinking with tool use in Claude 4 models supports [interleaved thinking](https://docs.aws.amazon.com/bedrock/latest/userguide/claude-messages-extended-thinking.html#claude-messages-extended-thinking-tool-use-interleaved), which enables Claude 4 models to think between tool calls and run more sophisticated reasoning after receiving tool results. This is helpful for more complex agentic interactions.
+With interleaved thinking, the `budget_tokens` can exceed the `max_tokens` parameter because it represents the total budget across all thinking blocks within one assistant turn.
+
+**Example Request**
+
+- Non-Streaming
+
+```bash
+curl http://127.0.0.1:8000/api/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer bedrock" \
+  -d '{
+    "model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
+    "max_tokens": 2048,
+    "messages": [{
+      "role": "user",
+      "content": "有一天,一个女孩参加数学考试只得了 38 分。她心里对父亲的惩罚充满恐惧,于是偷偷把分数改成了 88 分。她的父亲看到试卷后,怒发冲冠,狠狠地给了她一巴掌,怒吼道:“你这 8 怎么一半是绿的一半是红的,你以为我是傻子吗?”女孩被打后,委屈地哭了起来,什么也没说。过了一会儿,父亲突然崩溃了。请问这位父亲为什么过一会崩溃了?"
+    }],
+    "extra_body": {
+      "anthropic_beta": ["interleaved-thinking-2025-05-14"],
+      "thinking": {"type": "enabled", "budget_tokens": 4096}
+    }
+  }'
+```
+
+- Streaming
+
+```bash
+curl http://127.0.0.1:8000/api/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer bedrock" \
+  -d '{
+    "model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
+    "max_tokens": 2048,
+    "messages": [{
+      "role": "user",
+      "content": "有一天,一个女孩参加数学考试只得了 38 分。她心里对父亲的惩罚充满恐惧,于是偷偷把分数改成了 88 分。她的父亲看到试卷后,怒发冲冠,狠狠地给了她一巴掌,怒吼道:“你这 8 怎么一半是绿的一半是红的,你以为我是傻子吗?”女孩被打后,委屈地哭了起来,什么也没说。过了一会儿,父亲突然崩溃了。请问这位父亲为什么过一会崩溃了?"
+    }],
+    "stream": true,
+    "extra_body": {
+      "anthropic_beta": ["interleaved-thinking-2025-05-14"],
+      "thinking": {"type": "enabled", "budget_tokens": 4096}
+    }
+  }'
+```

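The curl examples in the Usage guide map directly onto the OpenAI Python SDK, with the Bedrock-specific fields passed through `extra_body`. A sketch under the same local endpoint and keys shown in the docs; the request arguments are built in a separate helper so they can be inspected without a server or the SDK installed:

```python
def build_request_kwargs(prompt: str) -> dict:
    """Arguments mirroring the non-streaming curl example in the Usage guide."""
    return dict(
        model="us.anthropic.claude-sonnet-4-20250514-v1:0",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
        extra_body={
            "anthropic_beta": ["interleaved-thinking-2025-05-14"],
            "thinking": {"type": "enabled", "budget_tokens": 4096},
        },
    )

def send_interleaved(prompt: str,
                     base_url: str = "http://127.0.0.1:8000/api/v1",
                     api_key: str = "bedrock") -> str:
    # Imported lazily so building/inspecting the request needs no dependencies.
    from openai import OpenAI
    client = OpenAI(base_url=base_url, api_key=api_key)
    completion = client.chat.completions.create(**build_request_kwargs(prompt))
    return completion.choices[0].message.content
```

`send_interleaved` assumes the proxy is running locally as in the curl examples; adjust `base_url` and `api_key` for your deployment.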
docs/Usage_CN.md

Lines changed: 55 additions & 1 deletion
@@ -15,6 +15,8 @@ export OPENAI_BASE_URL=<API base url>
 - [Multimodal API](#multimodal-api)
 - [Tool Call](#tool-call)
 - [Reasoning](#reasoning)
+- [Interleaved thinking (beta)](#interleaved-thinking-beta)
+
 
 ## Models API
 

@@ -440,4 +442,56 @@ for chunk in response:
 reasoning_content += chunk.choices[0].delta.reasoning_content
 elif chunk.choices[0].delta.content:
 content += chunk.choices[0].delta.content
-```
+```
+
+## Interleaved thinking (beta)
+
+**Important Notice**: Be sure to read the following carefully before using reasoning mode with the Chat Completion API.
+
+Claude 4 models support extended thinking with tool use, including [interleaved thinking](https://docs.aws.amazon.com/bedrock/latest/userguide/claude-messages-extended-thinking.html#claude-messages-extended-thinking-tool-use-interleaved). This lets Claude 4 think between tool calls and perform more sophisticated reasoning after receiving tool results, which is very helpful for more complex agentic AI interactions.
+
+In interleaved thinking mode, `budget_tokens` can exceed the `max_tokens` parameter because it represents the total token budget across all thinking blocks within one assistant turn.
+
+**Example Request**
+
+- Non-Streaming
+
+```bash
+curl http://127.0.0.1:8000/api/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer bedrock" \
+  -d '{
+    "model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
+    "max_tokens": 2048,
+    "messages": [{
+      "role": "user",
+      "content": "有一天,一个女孩参加数学考试只得了 38 分。她心里对父亲的惩罚充满恐惧,于是偷偷把分数改成了 88 分。她的父亲看到试卷后,怒发冲冠,狠狠地给了她一巴掌,怒吼道:“你这 8 怎么一半是绿的一半是红的,你以为我是傻子吗?”女孩被打后,委屈地哭了起来,什么也没说。过了一会儿,父亲突然崩溃了。请问这位父亲为什么过一会崩溃了?"
+    }],
+    "extra_body": {
+      "anthropic_beta": ["interleaved-thinking-2025-05-14"],
+      "thinking": {"type": "enabled", "budget_tokens": 4096}
+    }
+  }'
+```
+
+- Streaming
+
+```bash
+curl http://127.0.0.1:8000/api/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer bedrock" \
+  -d '{
+    "model": "us.anthropic.claude-sonnet-4-20250514-v1:0",
+    "max_tokens": 2048,
+    "messages": [{
+      "role": "user",
+      "content": "有一天,一个女孩参加数学考试只得了 38 分。她心里对父亲的惩罚充满恐惧,于是偷偷把分数改成了 88 分。她的父亲看到试卷后,怒发冲冠,狠狠地给了她一巴掌,怒吼道:“你这 8 怎么一半是绿的一半是红的,你以为我是傻子吗?”女孩被打后,委屈地哭了起来,什么也没说。过了一会儿,父亲突然崩溃了。请问这位父亲为什么过一会崩溃了?"
+    }],
+    "stream": true,
+    "extra_body": {
+      "anthropic_beta": ["interleaved-thinking-2025-05-14"],
+      "thinking": {"type": "enabled", "budget_tokens": 4096}
+    }
+  }'
+```

src/api/app.py

Lines changed: 10 additions & 0 deletions
@@ -45,6 +45,16 @@ async def health():
 
 @app.exception_handler(RequestValidationError)
 async def validation_exception_handler(request, exc):
+    logger = logging.getLogger(__name__)
+
+    # Log essential info only - avoid sensitive data and performance overhead
+    logger.warning(
+        "Request validation failed: %s %s - %s",
+        request.method,
+        request.url.path,
+        str(exc).split('\n')[0]  # First line only
+    )
+
     return PlainTextResponse(str(exc), status_code=400)

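The new handler deliberately logs only the first line of the validation error, since validation messages can span many lines and may echo request payloads. The truncation idea in isolation (handler wiring omitted; the sample error text is invented):

```python
import logging

logger = logging.getLogger(__name__)

def first_line(exc: Exception) -> str:
    """Reduce a multi-line validation error to its summary line for logging."""
    return str(exc).split("\n")[0]

# Invented sample: pydantic-style validation errors often span several lines.
err = ValueError("1 validation error for Request\nbody -> messages\n  field required")
logger.warning("Request validation failed: %s %s - %s",
               "POST", "/api/v1/chat/completions", first_line(err))
```

Only the summary line reaches the log; the full error is still returned to the client in the 400 response body.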
src/api/models/base.py

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,4 @@
+import logging
 import time
 import uuid
 from abc import ABC, abstractmethod

@@ -14,6 +15,8 @@
 Error,
 )
 
+logger = logging.getLogger(__name__)
+
 
 class BaseChatModel(ABC):
     """Represent a basic chat model

@@ -46,6 +49,7 @@ def generate_message_id() -> str:
 @staticmethod
 def stream_response_to_bytes(response: ChatStreamResponse | Error | None = None) -> bytes:
     if isinstance(response, Error):
+        logger.error("Stream error: %s", response.error.message if response.error else "Unknown error")
         data = response.model_dump_json()
     elif isinstance(response, ChatStreamResponse):
         # to populate other fields when using exclude_unset=True

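For context, `stream_response_to_bytes` serializes each chunk for a server-sent-events stream, and the added `logger.error` call fires only for `Error` chunks. A simplified sketch of the SSE framing pattern; the real method serializes pydantic models, and the `[DONE]` sentinel is the OpenAI-style convention assumed here:

```python
import json
from typing import Optional

def stream_response_to_bytes(payload: Optional[dict] = None) -> bytes:
    """Frame one chunk as a server-sent event; None marks end of stream."""
    if payload is None:
        # OpenAI-style end-of-stream sentinel (an assumption, not from the diff).
        return b"data: [DONE]\n\n"
    # Each event is a "data:" line followed by a blank line.
    return f"data: {json.dumps(payload)}\n\n".encode("utf-8")
```

The blank line after each `data:` field is what delimits events for SSE clients, which is why the double `\n\n` matters.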