# Running in production

_WARNING: you are on the master branch, please refer to the docs on the branch that matches your `cortex version`_

**Tips for batch and realtime APIs:**

* Consider using [spot instances](../cluster-management/spot-instances.md) to reduce cost (see the sketch after this list).

* If you're using multiple clusters and/or multiple developers are interacting with your cluster(s), see our documentation on [environments](../miscellaneous/environments.md).

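Below is a minimal sketch of enabling spot instances in your cluster configuration file; the other fields and values are illustrative placeholders, and the full set of options is described in the [spot instances](../cluster-management/spot-instances.md) and [cluster configuration](../cluster-management/config.md) docs:

```yaml
# cluster.yaml (sketch -- values are placeholders)
cluster_name: cortex
region: us-west-2
instance_type: m5.large
min_instances: 1
max_instances: 5
spot: true  # allow spot instances for the cluster's worker nodes
```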
**Additional tips for realtime APIs:**

* Consider tuning `processes_per_replica` and `threads_per_process` in your [Realtime API configuration](../deployments/realtime-api/api-configuration.md). Each model behaves differently, so the best way to find good values is to run a load test on a single replica (you can set `min_replicas` to 1 to avoid autoscaling); see the first sketch after this list. Here is [additional information](../deployments/realtime-api/parallelism.md#concurrency) about these fields.

* You may wish to customize the autoscaler for your APIs. The [autoscaling documentation](../deployments/realtime-api/autoscaling.md) describes each of the parameters that can be configured.

* When creating an API that you will send large amounts of traffic to all at once, set `min_replicas` at (or slightly above) the number of replicas you expect will be necessary to handle the load at steady state. After traffic has been fully shifted to your API, `min_replicas` can be reduced to allow automatic downscaling.

* [Traffic splitters](../deployments/realtime-api/traffic-splitter.md) can be used to route a subset of traffic to an updated API. For example, you can create a traffic splitter named `my-api`, and route requests sent to `my-api` to any number of Realtime APIs (e.g. `my-api_v1`, `my-api_v2`, etc.). The percentage of traffic that the traffic splitter routes to each API can be updated on the fly (see the traffic splitter sketch after this list).

* If initialization of your API replicas takes a while (e.g. due to downloading large models from slow hosts or installing dependencies), and responsive autoscaling is important to you, consider pre-building your API's Docker image (see the image sketch after this list). See [here](../deployments/system-packages.md#custom-docker-image) for instructions.

* If your API is receiving many queries per second and you are using the TensorFlow Predictor, consider enabling [server-side batching](../deployments/realtime-api/parallelism.md#server-side-batching) (see the batching sketch after this list).

* [Overprovisioning](../deployments/realtime-api/autoscaling.md#overprovisioning) can be used to reduce the chance of large queues building up. This can be especially important when inferences take a long time.

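The following is a minimal sketch of a Realtime API configuration that sets the concurrency and autoscaling fields mentioned above. The API name, predictor path, and the specific values are placeholder assumptions to be tuned via load testing; the full list of fields is in the [Realtime API configuration](../deployments/realtime-api/api-configuration.md) docs:

```yaml
# cortex.yaml (sketch -- values are placeholders)
- name: my-api
  kind: RealtimeAPI
  predictor:
    type: python
    path: predictor.py
    processes_per_replica: 4   # number of worker processes per replica
    threads_per_process: 16    # number of threads per worker process
  autoscaling:
    min_replicas: 1            # raise this before shifting a large amount of traffic all at once
    max_replicas: 10           # set max_replicas to 1 as well when load testing a single replica
```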
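Here is a sketch of a traffic splitter; the API names and weights are illustrative, and the exact format is described in the [traffic splitter](../deployments/realtime-api/traffic-splitter.md) docs:

```yaml
# cortex.yaml (sketch)
- name: my-api
  kind: TrafficSplitter
  apis:
    - name: my-api_v1
      weight: 80   # 80% of requests to my-api are routed to my-api_v1
    - name: my-api_v2
      weight: 20   # 20% are routed to my-api_v2
```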
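If you pre-build your API's Docker image, the sketch below assumes the image is referenced via the predictor's `image` field; the registry URL is a placeholder, and the [custom Docker image](../deployments/system-packages.md#custom-docker-image) instructions describe how to build and reference the image:

```yaml
  predictor:
    type: python
    path: predictor.py
    image: <your-registry>/my-api-image:latest   # pre-built image with dependencies baked in
```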
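Server-side batching for the TensorFlow Predictor might look like the fragment below; the batch size and interval are illustrative values that should be tuned, and the rest of the predictor configuration is omitted:

```yaml
  predictor:
    type: tensorflow
    # ... path, models, etc. ...
    server_side_batching:
      max_batch_size: 32     # maximum number of requests aggregated into a single inference
      batch_interval: 0.1s   # maximum time to wait while filling a batch
```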
**Additional tips for inferences that take a long time:**

* Consider using [GPUs](../deployments/gpus.md) or [Inferentia](../deployments/inferentia.md) to speed up inference (see the compute sketch after this list).

* Consider setting a low value for `max_replica_concurrency`, since if many requests accumulate in the queue, newly received requests will wait a long time before being processed. See the [autoscaling docs](../deployments/realtime-api/autoscaling.md) for more details.

* Keep in mind that API Gateway has a 29-second timeout; if your requests take longer (due to a long inference time and/or long request queues), you will need to disable API Gateway for your API by setting `api_gateway: none` in the `networking` config in your [Realtime API configuration](../deployments/realtime-api/api-configuration.md) and/or [Batch API configuration](../deployments/batch-api/api-configuration.md) (see the networking sketch after this list). Alternatively, you can disable API Gateway for all APIs in your cluster by setting `api_gateway: none` in your [cluster configuration file](../cluster-management/config.md) before creating your cluster.

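As a sketch, requesting accelerators in an API's `compute` section might look like this; the resource amounts are placeholders, and the [GPU](../deployments/gpus.md) and [Inferentia](../deployments/inferentia.md) docs cover the details:

```yaml
  compute:
    cpu: 2
    mem: 8Gi
    gpu: 1   # request one GPU per replica (use `inf: 1` instead to request an Inferentia chip)
```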
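And a sketch combining the last two tips, assuming the fields live in the `autoscaling` and `networking` sections of the API configuration as described in the linked docs:

```yaml
  autoscaling:
    max_replica_concurrency: 4   # keep per-replica queues short when each inference is slow
  networking:
    api_gateway: none            # bypass API Gateway's 29-second timeout for this API
```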