# Autoscaling

Cortex autoscales AsyncAPIs on a per-API basis according to your configuration.

## Autoscaling replicas

**`min_replicas`**: The lower bound on how many replicas can be running for an API.

<br>

**`max_replicas`**: The upper bound on how many replicas can be running for an API.

<br>

**`target_replica_concurrency`** (default: 1): This is the desired number of in-flight requests per replica, and is the
metric which the autoscaler uses to make scaling decisions. It is recommended to leave this parameter at its default
value.

Replica concurrency is simply how many requests have been sent to the queue and have not yet been responded to (also
referred to as in-flight requests). Therefore, it includes requests which are currently being processed and requests
which are waiting in the queue.

The autoscaler uses this formula to determine the number of desired replicas:

`desired replicas = sum(in-flight requests across all replicas) / target_replica_concurrency`

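For illustration, here is a minimal Python sketch of that formula (the function and its inputs are hypothetical, and rounding up is an assumption, not confirmed autoscaler behavior):

```python
import math

def desired_replicas(in_flight_per_replica, target_replica_concurrency=1):
    """Sketch of the scaling formula above; rounding up is an assumption."""
    total_in_flight = sum(in_flight_per_replica)  # API-wide in-flight requests
    return math.ceil(total_in_flight / target_replica_concurrency)

# e.g. 3 replicas with 2, 5, and 1 in-flight requests, and the default target of 1
print(desired_replicas([2, 5, 1]))  # 8
```
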
<br>

**`max_replica_concurrency`** (default: 1024): This is the maximum number of in-queue messages before requests are
rejected with HTTP error code 503. `max_replica_concurrency` includes requests that are currently being processed as
well as requests that are waiting in the queue (a replica can actively process one request concurrently, and will hold
any additional requests in a local queue). Decreasing `max_replica_concurrency` and configuring the client to retry when
it receives 503 responses will improve queue fairness across replicas by preventing requests from sitting in long
queues.

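On the client side, that retry behavior might look like the following sketch (the function, endpoint URL handling, and backoff schedule are illustrative, not part of Cortex):

```python
import time

import requests  # third-party; pip install requests

def submit_with_retry(url, payload, max_retries=5):
    """Retry on 503 (replica queues full) with exponential backoff."""
    for attempt in range(max_retries):
        response = requests.post(url, json=payload)
        if response.status_code != 503:
            return response
        time.sleep(2 ** attempt)  # back off before retrying while queues drain
    raise RuntimeError("request was rejected with 503 after all retries")
```
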
<br>

**`window`** (default: 60s): The time over which to average the API's in-flight requests (which is the sum of in-flight
requests in each replica). The longer the window, the slower the autoscaler will react to changes in API-wide in-flight
requests, since it is averaged over the `window`. The API-wide in-flight request count is calculated every 10 seconds,
so `window` must be a multiple of 10 seconds.

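A rough sketch of the windowed average described above (only the 10-second sampling interval comes from this page; the helper itself is hypothetical):

```python
def averaged_in_flight(samples, window_seconds=60, sample_interval=10):
    """Average the most recent in-flight request samples over the window."""
    if window_seconds % sample_interval != 0:
        raise ValueError("window must be a multiple of 10 seconds")
    count = window_seconds // sample_interval  # samples covered by the window
    recent = samples[-count:]
    return sum(recent) / len(recent)

# one sample every 10 seconds; a 60s window averages the last 6 samples
print(averaged_in_flight([4, 6, 8, 10, 12, 14]))  # 9.0
```
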
<br>

**`downscale_stabilization_period`** (default: 5m): The API will not scale below the highest recommendation made during
this period. Every 10 seconds, the autoscaler makes a recommendation based on all of the other configuration parameters
described here. It will then take the max of the current recommendation and all recommendations made during
the `downscale_stabilization_period`, and use that to determine the final number of replicas to scale to. Increasing
this value will cause the cluster to react more slowly to decreased traffic, and will reduce thrashing.

<br>

**`upscale_stabilization_period`** (default: 1m): The API will not scale above the lowest recommendation made during
this period. Every 10 seconds, the autoscaler makes a recommendation based on all of the other configuration parameters
described here. It will then take the min of the current recommendation and all recommendations made during
the `upscale_stabilization_period`, and use that to determine the final number of replicas to scale to. Increasing this
value will cause the cluster to react more slowly to increased traffic, and will reduce thrashing.

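To make the max/min logic concrete, here is an illustrative sketch of both stabilization rules (the helper is hypothetical):

```python
def stabilized_recommendation(current_recommendation, recent_recommendations, scaling_up):
    """Stabilization rule: min over the upscale period, max over the downscale period."""
    candidates = recent_recommendations + [current_recommendation]
    return min(candidates) if scaling_up else max(candidates)

# recommendations made every 10s during the stabilization period, plus the current one
print(stabilized_recommendation(8, [5, 6, 7], scaling_up=True))   # 5: don't scale above the lowest
print(stabilized_recommendation(3, [5, 6, 7], scaling_up=False))  # 7: don't scale below the highest
```
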
<br>

**`max_downscale_factor`** (default: 0.75): The maximum factor by which to scale down the API on a single scaling event.
For example, if `max_downscale_factor` is 0.5 and there are 10 running replicas, the autoscaler will not recommend fewer
than 5 replicas. Increasing this number will allow the cluster to shrink more quickly in response to dramatic dips in
traffic.

<br>

**`max_upscale_factor`** (default: 1.5): The maximum factor by which to scale up the API on a single scaling event. For
example, if `max_upscale_factor` is 10 and there are 5 running replicas, the autoscaler will not recommend more than 50
replicas. Increasing this number will allow the cluster to grow more quickly in response to dramatic spikes in traffic.

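A hedged sketch of how these limits might clamp a single scaling event; the exact formulas are assumptions inferred from the two examples above, not confirmed implementation details:

```python
import math

def clamp_scaling_event(current_replicas, recommendation,
                        max_downscale_factor=0.75, max_upscale_factor=1.5):
    """Limit how far one scaling event can move the replica count.

    The formulas are assumptions; both are consistent with the examples above
    (10 replicas with a 0.5 downscale factor floors at 5, and 5 replicas with
    a 10x upscale factor caps at 50).
    """
    ceiling = math.floor(current_replicas * max_upscale_factor)
    floor = math.ceil(current_replicas * (1 - max_downscale_factor))
    return max(floor, min(ceiling, recommendation))

print(clamp_scaling_event(10, 2, max_downscale_factor=0.5))  # 5
print(clamp_scaling_event(5, 80, max_upscale_factor=10))     # 50
```
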
<br>

**`downscale_tolerance`** (default: 0.05): Any recommendation falling within this factor below the current number of
replicas will not trigger a scale-down event. For example, if `downscale_tolerance` is 0.1 and there are 20 running
replicas, a recommendation of 18 or 19 replicas will not be acted on, and the API will remain at 20 replicas. Increasing
this value will prevent thrashing, but setting it too high will prevent the cluster from maintaining its optimal size.

<br>

**`upscale_tolerance`** (default: 0.05): Any recommendation falling within this factor above the current number of
replicas will not trigger a scale-up event. For example, if `upscale_tolerance` is 0.1 and there are 20 running
replicas, a recommendation of 21 or 22 replicas will not be acted on, and the API will remain at 20 replicas. Increasing
this value will prevent thrashing, but setting it too high will prevent the cluster from maintaining its optimal size.

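A sketch of the tolerance band; the helper is hypothetical, and treating the exact boundary as within tolerance is an assumption consistent with the examples above:

```python
def should_scale(current_replicas, recommendation,
                 downscale_tolerance=0.05, upscale_tolerance=0.05):
    """Return False when the recommendation falls inside the tolerance band."""
    lower = current_replicas * (1 - downscale_tolerance)  # 20 replicas, 0.1 -> 18.0
    upper = current_replicas * (1 + upscale_tolerance)    # 20 replicas, 0.1 -> 22.0
    return not (lower <= recommendation <= upper)

print(should_scale(20, 19, 0.1, 0.1))  # False: within tolerance, stay at 20
print(should_scale(20, 25, 0.1, 0.1))  # True: scale up
```
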
<br>

## Autoscaling instances

Cortex spins up and down instances based on the aggregate resource requests of all APIs. The number of instances will be
at least `min_instances` and no more than `max_instances` (configured during installation and modifiable
via `cortex cluster configure`).

## Autoscaling responsiveness

Assuming that `window` and `upscale_stabilization_period` are set to their default values (1 minute), it could take up
to 2 minutes of increased traffic before an extra replica is requested. As soon as the additional replica is requested,
the replica request will be visible in the output of `cortex get`, but the replica won't yet be running. If an extra
instance is required to schedule the newly requested replica, it could take a few minutes for AWS to provision the
instance (depending on the instance type), plus a few minutes for the newly provisioned instance to download your API
image and for the API to initialize (via its `__init__()` method).

If you want the autoscaler to react as quickly as possible, set `upscale_stabilization_period` and `window` to their
minimum values (0s and 10s respectively).

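As a quick sanity check on the 2-minute figure (a hypothetical helper, not part of Cortex):

```python
def worst_case_upscale_delay(window_seconds=60, upscale_stabilization_seconds=60):
    """Traffic must fill the averaging window, then the higher recommendation
    must hold for the entire upscale stabilization period."""
    return window_seconds + upscale_stabilization_seconds

print(worst_case_upscale_delay())       # 120 seconds with the defaults
print(worst_case_upscale_delay(10, 0))  # 10 seconds with the minimum values
```
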
If it takes a long time to initialize your API replica (i.e. install dependencies and run your predictor's `__init__()`
function), consider building your own API image to use instead of the default image. With this approach, you can
pre-download/build/install any custom dependencies and bake them into the image.