
Commit b8d7f90

AsyncAPI docs (#1974)
(cherry picked from commit 4c1e903)
1 parent 4248f1b commit b8d7f90

File tree

11 files changed: +728 -0 lines changed


docs/summary.md

Lines changed: 9 additions & 0 deletions

@@ -30,18 +30,27 @@
* [Example](workloads/realtime/traffic-splitter/example.md)
* [Configuration](workloads/realtime/traffic-splitter/configuration.md)
* [Troubleshooting](workloads/realtime/troubleshooting.md)
+* [Async APIs](workloads/async/introduction.md)
+* [Example](workloads/async/example.md)
+* [Predictor](workloads/async/predictors.md)
+* [Configuration](workloads/async/configuration.md)
+* [Statuses](workloads/async/statuses.md)
+* [Webhooks](workloads/async/webhooks.md)
+* [Metrics](workloads/async/metrics.md)
* Batch APIs
* [Example](workloads/batch/example.md)
* [Predictor](workloads/batch/predictors.md)
* [Configuration](workloads/batch/configuration.md)
* [Jobs](workloads/batch/jobs.md)
* [Statuses](workloads/batch/statuses.md)
+* [Metrics](workloads/batch/metrics.md)
* Task APIs
* [Example](workloads/task/example.md)
* [Definition](workloads/task/definitions.md)
* [Configuration](workloads/task/configuration.md)
* [Jobs](workloads/task/jobs.md)
* [Statuses](workloads/task/statuses.md)
+* [Metrics](workloads/task/metrics.md)
* Dependencies
* [Example](workloads/dependencies/example.md)
* [Python packages](workloads/dependencies/python-packages.md)

Lines changed: 108 additions & 0 deletions

# Autoscaling

Cortex auto-scales AsyncAPIs on a per-API basis based on your configuration.

## Autoscaling replicas

**`min_replicas`**: The lower bound on how many replicas can be running for an API.

<br>

**`max_replicas`**: The upper bound on how many replicas can be running for an API.

<br>

**`target_replica_concurrency`** (default: 1): This is the desired number of in-flight requests per replica, and is the metric which the autoscaler uses to make scaling decisions. It is recommended to leave this parameter at its default value.

Replica concurrency is simply how many requests have been sent to the queue and have not yet been responded to (also referred to as in-flight requests). Therefore, it includes requests which are currently being processed and requests which are waiting in the queue.

The autoscaler uses this formula to determine the number of desired replicas:

`desired replicas = sum(in-flight requests across all replicas) / target_replica_concurrency`

<br>

**`max_replica_concurrency`** (default: 1024): This is the maximum number of in-queue messages before requests are rejected with HTTP error code 503. `max_replica_concurrency` includes requests that are currently being processed as well as requests that are waiting in the queue (a replica can actively process one request concurrently, and will hold any additional requests in a local queue). Decreasing `max_replica_concurrency` and configuring the client to retry when it receives 503 responses will improve queue fairness across replicas by preventing requests from sitting in long queues.

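For example, a client that retries on 503 responses could look something like the following sketch (it assumes the Python `requests` package and a placeholder endpoint URL; the function name and backoff policy are illustrative and not part of Cortex):

```python
# client_retry.py (illustrative sketch, not part of the Cortex codebase)
import time

import requests

ENDPOINT = "http://<load_balancer_url>/<api_name>"  # placeholder; use the endpoint shown by `cortex get <api_name>`


def submit_with_retry(payload: dict, max_retries: int = 5, backoff_seconds: float = 1.0) -> dict:
    """POST a workload, retrying when the replica queue is full (HTTP 503)."""
    for attempt in range(max_retries):
        response = requests.post(ENDPOINT, json=payload)
        if response.status_code != 503:
            response.raise_for_status()
            return response.json()  # e.g. {"id": "..."}
        # the selected replica's queue is full; back off and try again
        time.sleep(backoff_seconds * (2 ** attempt))
    raise RuntimeError("request rejected with 503 after all retries")
```
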
<br>

**`window`** (default: 60s): The time over which to average the API's in-flight requests (which is the sum of in-flight requests in each replica). The longer the window, the slower the autoscaler will react to changes in API-wide in-flight requests, since it is averaged over the `window`. API-wide in-flight requests are calculated every 10 seconds, so `window` must be a multiple of 10 seconds.

<br>

**`downscale_stabilization_period`** (default: 5m): The API will not scale below the highest recommendation made during this period. Every 10 seconds, the autoscaler makes a recommendation based on all of the other configuration parameters described here. It will then take the max of the current recommendation and all recommendations made during the `downscale_stabilization_period`, and use that to determine the final number of replicas to scale to. Increasing this value will cause the cluster to react more slowly to decreased traffic, and will reduce thrashing.

<br>

**`upscale_stabilization_period`** (default: 1m): The API will not scale above the lowest recommendation made during this period. Every 10 seconds, the autoscaler makes a recommendation based on all of the other configuration parameters described here. It will then take the min of the current recommendation and all recommendations made during the `upscale_stabilization_period`, and use that to determine the final number of replicas to scale to. Increasing this value will cause the cluster to react more slowly to increased traffic, and will reduce thrashing.

<br>

**`max_downscale_factor`** (default: 0.75): The maximum factor by which to scale down the API on a single scaling event. For example, if `max_downscale_factor` is 0.5 and there are 10 running replicas, the autoscaler will not recommend fewer than 5 replicas. Decreasing this number will allow the cluster to shrink more quickly in response to dramatic dips in traffic.

<br>

**`max_upscale_factor`** (default: 1.5): The maximum factor by which to scale up the API on a single scaling event. For example, if `max_upscale_factor` is 10 and there are 5 running replicas, the autoscaler will not recommend more than 50 replicas. Increasing this number will allow the cluster to grow more quickly in response to dramatic spikes in traffic.

<br>

**`downscale_tolerance`** (default: 0.05): Any recommendation falling within this factor below the current number of replicas will not trigger a scale down event. For example, if `downscale_tolerance` is 0.1 and there are 20 running replicas, a recommendation of 18 or 19 replicas will not be acted on, and the API will remain at 20 replicas. Increasing this value will prevent thrashing, but setting it too high will prevent the cluster from maintaining its optimal size.

<br>

**`upscale_tolerance`** (default: 0.05): Any recommendation falling within this factor above the current number of replicas will not trigger a scale up event. For example, if `upscale_tolerance` is 0.1 and there are 20 running replicas, a recommendation of 21 or 22 replicas will not be acted on, and the API will remain at 20 replicas. Increasing this value will prevent thrashing, but setting it too high will prevent the cluster from maintaining its optimal size.

<br>

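The rules above can be summarized in a short sketch (an illustrative simplification, not the actual autoscaler implementation; it mirrors the configuration field names and omits the stabilization periods):

```python
# autoscaling_sketch.py -- illustrative simplification of the scaling rules described above
import math


def recommend_replicas(
    in_flight_requests: float,      # API-wide in-flight requests, averaged over `window`
    current_replicas: int,
    target_replica_concurrency: float = 1.0,
    min_replicas: int = 1,
    max_replicas: int = 100,
    max_downscale_factor: float = 0.75,
    max_upscale_factor: float = 1.5,
    downscale_tolerance: float = 0.05,
    upscale_tolerance: float = 0.05,
) -> int:
    # desired replicas = sum(in-flight requests across all replicas) / target_replica_concurrency
    raw = in_flight_requests / target_replica_concurrency

    # ignore recommendations within the tolerance band around the current size
    if current_replicas * (1 - downscale_tolerance) <= raw <= current_replicas * (1 + upscale_tolerance):
        return current_replicas

    # limit the size of a single scaling event
    lower = math.floor(current_replicas * max_downscale_factor)
    upper = math.ceil(current_replicas * max_upscale_factor)
    bounded = min(max(raw, lower), upper)

    # respect the configured replica bounds
    return int(min(max(round(bounded), min_replicas), max_replicas))


# example: 30 in-flight requests across the API with 20 running replicas
# -> raw recommendation 30, within the upscale cap of 20 * 1.5 = 30
print(recommend_replicas(in_flight_requests=30, current_replicas=20))
```
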
## Autoscaling instances

Cortex spins up and down instances based on the aggregate resource requests of all APIs. The number of instances will be at least `min_instances` and no more than `max_instances` (configured during installation and modifiable via `cortex cluster configure`).

## Autoscaling responsiveness

Assuming that `window` and `upscale_stabilization_period` are set to their default values (1 minute), it could take up to 2 minutes of increased traffic before an extra replica is requested. As soon as the additional replica is requested, the replica request will be visible in the output of `cortex get`, but the replica won't yet be running. If an extra instance is required to schedule the newly requested replica, it could take a few minutes for AWS to provision the instance (depending on the instance type), plus a few minutes for the newly provisioned instance to download your API image and for the API to initialize (via its `__init__()` method).

If you want the autoscaler to react as quickly as possible, set `upscale_stabilization_period` and `window` to their minimum values (0s and 10s respectively).

If it takes a long time to initialize your API replica (i.e. install dependencies and run your predictor's `__init__()` function), consider building your own API image to use instead of the default image. With this approach, you can pre-download/build/install any custom dependencies and bake them into the image.

Lines changed: 75 additions & 0 deletions

# Configuration

```yaml
- name: <string>
  kind: AsyncAPI
  predictor: # detailed configuration below
  compute: # detailed configuration below
  autoscaling: # detailed configuration below
  update_strategy: # detailed configuration below
  networking: # detailed configuration below
```

## Predictor

### Python Predictor

<!-- CORTEX_VERSION_BRANCH_STABLE x3 -->

```yaml
predictor:
  type: python
  path: <string> # path to a python file with a PythonPredictor class definition, relative to the Cortex root (required)
  dependencies: # (optional)
    pip: <string> # relative path to requirements.txt (default: requirements.txt)
    conda: <string> # relative path to conda-packages.txt (default: conda-packages.txt)
    shell: <string> # relative path to a shell script for system package installation (default: dependencies.sh)
  config: <string: value> # arbitrary dictionary passed to the constructor of the Predictor (optional)
  python_path: <string> # path to the root of your Python folder that will be appended to PYTHONPATH (default: folder containing cortex.yaml)
  image: <string> # docker image to use for the Predictor (default: quay.io/cortexlabs/python-predictor-cpu:0.31.0, quay.io/cortexlabs/python-predictor-gpu:0.31.0-cuda10.2-cudnn8, or quay.io/cortexlabs/python-predictor-inf:0.31.0 based on compute)
  env: <string: string> # dictionary of environment variables
  log_level: <string> # log level that can be "debug", "info", "warning" or "error" (default: "info")
  shm_size: <string> # size of shared memory (/dev/shm) for sharing data between multiple processes, e.g. 64Mi or 1Gi (default: Null)
```

## Compute

```yaml
compute:
  cpu: <string | int | float> # CPU request per replica. One unit of CPU corresponds to one virtual CPU; fractional requests are allowed, and can be specified as a floating point number or via the "m" suffix (default: 200m)
  gpu: <int> # GPU request per replica. One unit of GPU corresponds to one virtual GPU (default: 0)
  mem: <string> # memory request per replica. One unit of memory is one byte and can be expressed as an integer or by using one of these suffixes: K, M, G, T (or their power-of two counterparts: Ki, Mi, Gi, Ti) (default: Null)
```

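For reference, the CPU and memory quantities above can be interpreted as in the following sketch (the helper functions are illustrative and not part of Cortex):

```python
# resource_units_sketch.py -- illustrative parsing of the CPU/memory quantities above
def parse_cpu(value) -> float:
    """'200m' -> 0.2 virtual CPUs; 1 or '1.5' -> that many virtual CPUs."""
    s = str(value)
    return float(s[:-1]) / 1000 if s.endswith("m") else float(s)


def parse_mem(value: str) -> int:
    """'1G' -> 10**9 bytes, '1Gi' -> 2**30 bytes, a plain integer is bytes."""
    decimal = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
    binary = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
    for suffix, factor in {**binary, **decimal}.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)


print(parse_cpu("200m"), parse_mem("1Gi"))  # 0.2 1073741824
```
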
## Autoscaling

```yaml
autoscaling:
  min_replicas: <int> # minimum number of replicas (default: 1)
  max_replicas: <int> # maximum number of replicas (default: 100)
  init_replicas: <int> # initial number of replicas (default: <min_replicas>)
  max_replica_concurrency: <int> # the maximum number of in-flight requests per replica before requests are rejected with error code 503 (default: 1024)
  target_replica_concurrency: <float> # the desired number of in-flight requests per replica, which the autoscaler tries to maintain (default: 1) (aws only)
  window: <duration> # the time over which to average the API's concurrency (default: 60s) (aws only)
  downscale_stabilization_period: <duration> # the API will not scale below the highest recommendation made during this period (default: 5m) (aws only)
  upscale_stabilization_period: <duration> # the API will not scale above the lowest recommendation made during this period (default: 1m) (aws only)
  max_downscale_factor: <float> # the maximum factor by which to scale down the API on a single scaling event (default: 0.75) (aws only)
  max_upscale_factor: <float> # the maximum factor by which to scale up the API on a single scaling event (default: 1.5) (aws only)
  downscale_tolerance: <float> # any recommendation falling within this factor below the current number of replicas will not trigger a scale down event (default: 0.05) (aws only)
  upscale_tolerance: <float> # any recommendation falling within this factor above the current number of replicas will not trigger a scale up event (default: 0.05) (aws only)
```

## Update strategy

```yaml
update_strategy:
  max_surge: <string | int> # maximum number of replicas that can be scheduled above the desired number of replicas during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%) (set to 0 to disable rolling updates)
  max_unavailable: <string | int> # maximum number of replicas that can be unavailable during an update; can be an absolute number, e.g. 5, or a percentage of desired replicas, e.g. 10% (default: 25%)
```

## Networking

```yaml
networking:
  endpoint: <string> # the endpoint for the API (default: <api_name>)
```

docs/workloads/async/example.md

Lines changed: 173 additions & 0 deletions

# AsyncAPI

Create APIs that process your workloads asynchronously.

## Implementation

Create a folder for your API. In this case, we are deploying an iris-classifier AsyncAPI. This folder will have the following structure:

```shell
./iris-classifier
├── cortex.yaml
├── predictor.py
└── requirements.txt
```

We will now create the necessary files:

```bash
mkdir iris-classifier && cd iris-classifier
touch predictor.py requirements.txt cortex.yaml
```

```python
# predictor.py

import os
import pickle
from typing import Dict, Any

import boto3
from botocore import UNSIGNED
from botocore.client import Config

labels = ["setosa", "versicolor", "virginica"]


class PythonPredictor:
    def __init__(self, config):
        if os.environ.get("AWS_ACCESS_KEY_ID"):
            s3 = boto3.client("s3")  # client will use your credentials if available
        else:
            s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))  # anonymous client

        s3.download_file(config["bucket"], config["key"], "/tmp/model.pkl")
        self.model = pickle.load(open("/tmp/model.pkl", "rb"))

    def predict(self, payload: Dict[str, Any]) -> Dict[str, str]:
        measurements = [
            payload["sepal_length"],
            payload["sepal_width"],
            payload["petal_length"],
            payload["petal_width"],
        ]

        label_id = self.model.predict([measurements])[0]

        # result must be json serializable
        return {"label": labels[label_id]}
```

```
# requirements.txt

boto3
```

```yaml
# cortex.yaml

- name: iris-classifier
  kind: AsyncAPI
  predictor:
    type: python
    path: predictor.py
```

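If you would like to sanity-check the predictor locally before deploying, you can instantiate it directly. In the sketch below, the bucket and key are hypothetical placeholders for the S3 location of your pickled model, and unpickling it requires scikit-learn to be installed locally:

```python
# local_test.py -- optional local check, not required for deployment
from predictor import PythonPredictor

# hypothetical S3 location of a pickled iris model
predictor = PythonPredictor(config={"bucket": "my-bucket", "key": "iris/model.pkl"})

print(predictor.predict({
    "sepal_length": 5.2,
    "sepal_width": 3.6,
    "petal_length": 1.5,
    "petal_width": 0.3,
}))
# e.g. {"label": "setosa"} (per the result shown later in this guide)
```
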
## Deploy

We can now deploy our API with the `cortex deploy` command. This command can be re-run to update your API configuration or predictor implementation.

```bash
cortex deploy cortex.yaml

# creating iris-classifier (AsyncAPI)
#
# cortex get                  (show api statuses)
# cortex get iris-classifier  (show api info)
```

## Monitor

To check whether the deployed API is ready, we can run the `cortex get` command with the `--watch` flag.

```bash
cortex get iris-classifier --watch

# status   up-to-date   requested   last update
# live     1            1           10s
#
# endpoint: http://<load_balancer_url>/iris-classifier
#
# api id                                                             last deployed
# 6992e7e8f84469c5-d5w1gbvrm5-25a7c15c950439c0bb32eebb7dc84125       10s
```

## Submit a workload

Now we want to submit a workload to our deployed API. We will start by creating a file with a JSON request payload, in the format expected by our `iris-classifier` predictor implementation.

This is the JSON file (`sample.json`) we will submit to our iris-classifier API.

```json
{
  "sepal_length": 5.2,
  "sepal_width": 3.6,
  "petal_length": 1.5,
  "petal_width": 0.3
}
```

Once we have our sample request payload, we will submit it with a `POST` request to the endpoint URL previously displayed in the `cortex get` command. We will quickly get a request `id` back.

```bash
curl -X POST http://<load_balancer_url>/iris-classifier -H "Content-Type: application/json" -d '@./sample.json'

# {"id": "659938d2-2ef6-41f4-8983-4e0b7562a986"}
```

## Retrieve the result

The obtained request id will allow us to check the status of the running payload and retrieve its result. To do so, we submit a `GET` request to the same endpoint URL with an appended `/<id>`.

```bash
curl http://<load_balancer_url>/iris-classifier/<id>  # <id> is the request id that was returned in the previous POST request

# {
#   "id": "659938d2-2ef6-41f4-8983-4e0b7562a986",
#   "status": "completed",
#   "result": {"label": "setosa"},
#   "timestamp": "2021-03-16T15:50:50+00:00"
# }
```

Depending on the status of your workload, you will get different responses back. The possible workload statuses are `in_queue | in_progress | failed | completed`. The `result` and `timestamp` keys are returned if the status is `completed`.

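If you are scripting this round trip instead of using `curl`, a simple polling client along these lines works (a sketch only; it assumes the Python `requests` package and the placeholder endpoint from above):

```python
# poll_result.py -- illustrative client; replace the endpoint placeholder with your own
import time

import requests

ENDPOINT = "http://<load_balancer_url>/iris-classifier"


def submit_and_wait(payload: dict, poll_interval: float = 1.0, timeout: float = 300.0) -> dict:
    """Submit a workload and poll until it completes or fails."""
    request_id = requests.post(ENDPOINT, json=payload).json()["id"]

    deadline = time.time() + timeout
    while time.time() < deadline:
        status_response = requests.get(f"{ENDPOINT}/{request_id}").json()
        if status_response["status"] in ("completed", "failed"):
            return status_response
        time.sleep(poll_interval)  # still in_queue or in_progress
    raise TimeoutError(f"workload {request_id} did not finish within {timeout}s")


print(submit_and_wait({"sepal_length": 5.2, "sepal_width": 3.6, "petal_length": 1.5, "petal_width": 0.3}))
```
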
It is also possible to set up a webhook in your predictor to get the response sent to a pre-defined web server once the workload completes or fails. You can read more about it in the [webhook documentation](./webhooks.md).

## Stream logs

If necessary, you can stream the logs from a random running pod from your API with the `cortex logs` command. This is intended for debugging purposes only. For production logs, you can view the logs in the logging solution of the cloud provider your cluster is deployed in.

```bash
cortex logs iris-classifier
```

## Delete the API

Finally, you can delete your API with a simple `cortex delete` command.

```bash
cortex delete iris-classifier
```
