# Autoscaling

Cortex autoscales AsyncAPIs on a per-API basis according to your configuration.

## Autoscaling replicas

**`min_replicas`**: The lower bound on how many replicas can be running for an API.

<br>

**`max_replicas`**: The upper bound on how many replicas can be running for an API.

<br>

**`target_replica_concurrency`** (default: 1): This is the desired number of in-flight requests per replica, and is the
metric which the autoscaler uses to make scaling decisions. It is recommended to leave this parameter at its default
value.

Replica concurrency is simply how many requests have been sent to the queue and have not yet been responded to (also
referred to as in-flight requests). Therefore, it includes requests which are currently being processed and requests
which are waiting in the queue.

The autoscaler uses this formula to determine the number of desired replicas:

`desired replicas = sum(in-flight requests across all replicas) / target_replica_concurrency`

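For illustration, here is a minimal Python sketch of that formula (the function and its inputs are hypothetical, and rounding up is an assumption, not confirmed autoscaler behavior):

```python
import math

def desired_replicas(in_flight_per_replica, target_replica_concurrency=1):
    """Sketch of the scaling formula above; rounding up is an assumption."""
    total_in_flight = sum(in_flight_per_replica)  # API-wide in-flight requests
    return math.ceil(total_in_flight / target_replica_concurrency)

# e.g. 3 replicas with 2, 5, and 1 in-flight requests, and the default target of 1
print(desired_replicas([2, 5, 1]))  # 8
```
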
<br>

**`max_replica_concurrency`** (default: 1024): This is the maximum number of in-queue messages before requests are
rejected with HTTP error code 503. `max_replica_concurrency` includes requests that are currently being processed as
well as requests that are waiting in the queue (a replica can actively process one request concurrently, and will hold
any additional requests in a local queue). Decreasing `max_replica_concurrency` and configuring the client to retry when
it receives 503 responses will improve queue fairness across replicas by preventing requests from sitting in long
queues.

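On the client side, that retry behavior might look like the following sketch (the function, endpoint URL handling, and backoff schedule are illustrative, not part of Cortex):

```python
import time

import requests  # third-party; pip install requests

def submit_with_retry(url, payload, max_retries=5):
    """Retry on 503 (replica queues full) with exponential backoff."""
    for attempt in range(max_retries):
        response = requests.post(url, json=payload)
        if response.status_code != 503:
            return response
        time.sleep(2 ** attempt)  # back off before retrying while queues drain
    raise RuntimeError("request was rejected with 503 after all retries")
```
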
<br>

**`window`** (default: 60s): The time over which to average the API's in-flight requests (which is the sum of in-flight
requests in each replica). The longer the window, the slower the autoscaler will react to changes in API-wide in-flight
requests, since it is averaged over the `window`. The API-wide in-flight request count is calculated every 10 seconds,
so `window` must be a multiple of 10 seconds.

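A rough sketch of the windowed average described above (only the 10-second sampling interval comes from this page; the helper itself is hypothetical):

```python
def averaged_in_flight(samples, window_seconds=60, sample_interval=10):
    """Average the most recent in-flight request samples over the window."""
    if window_seconds % sample_interval != 0:
        raise ValueError("window must be a multiple of 10 seconds")
    count = window_seconds // sample_interval  # samples covered by the window
    recent = samples[-count:]
    return sum(recent) / len(recent)

# one sample every 10 seconds; a 60s window averages the last 6 samples
print(averaged_in_flight([4, 6, 8, 10, 12, 14]))  # 9.0
```
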
<br>

**`downscale_stabilization_period`** (default: 5m): The API will not scale below the highest recommendation made during
this period. Every 10 seconds, the autoscaler makes a recommendation based on all of the other configuration parameters
described here. It will then take the max of the current recommendation and all recommendations made during
the `downscale_stabilization_period`, and use that to determine the final number of replicas to scale to. Increasing
this value will cause the cluster to react more slowly to decreased traffic, and will reduce thrashing.

<br>

**`upscale_stabilization_period`** (default: 1m): The API will not scale above the lowest recommendation made during
this period. Every 10 seconds, the autoscaler makes a recommendation based on all of the other configuration parameters
described here. It will then take the min of the current recommendation and all recommendations made during
the `upscale_stabilization_period`, and use that to determine the final number of replicas to scale to. Increasing this
value will cause the cluster to react more slowly to increased traffic, and will reduce thrashing.

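To make the max/min logic concrete, here is an illustrative sketch of both stabilization rules (the helper is hypothetical):

```python
def stabilized_recommendation(current_recommendation, recent_recommendations, scaling_up):
    """Stabilization rule: min over the upscale period, max over the downscale period."""
    candidates = recent_recommendations + [current_recommendation]
    return min(candidates) if scaling_up else max(candidates)

# recommendations made every 10s during the stabilization period, plus the current one
print(stabilized_recommendation(8, [5, 6, 7], scaling_up=True))   # 5: don't scale above the lowest
print(stabilized_recommendation(3, [5, 6, 7], scaling_up=False))  # 7: don't scale below the highest
```
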
<br>

**`max_downscale_factor`** (default: 0.75): The maximum factor by which to scale down the API on a single scaling event.
For example, if `max_downscale_factor` is 0.5 and there are 10 running replicas, the autoscaler will not recommend fewer
than 5 replicas. Increasing this number will allow the cluster to shrink more quickly in response to dramatic dips in
traffic.

<br>

**`max_upscale_factor`** (default: 1.5): The maximum factor by which to scale up the API on a single scaling event. For
example, if `max_upscale_factor` is 10 and there are 5 running replicas, the autoscaler will not recommend more than 50
replicas. Increasing this number will allow the cluster to grow more quickly in response to dramatic spikes in traffic.

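A hedged sketch of how these limits might clamp a single scaling event; the exact formulas are assumptions inferred from the two examples above, not confirmed implementation details:

```python
import math

def clamp_scaling_event(current_replicas, recommendation,
                        max_downscale_factor=0.75, max_upscale_factor=1.5):
    """Limit how far one scaling event can move the replica count.

    The formulas are assumptions; both are consistent with the examples above
    (10 replicas with a 0.5 downscale factor floors at 5, and 5 replicas with
    a 10x upscale factor caps at 50).
    """
    ceiling = math.floor(current_replicas * max_upscale_factor)
    floor = math.ceil(current_replicas * (1 - max_downscale_factor))
    return max(floor, min(ceiling, recommendation))

print(clamp_scaling_event(10, 2, max_downscale_factor=0.5))  # 5
print(clamp_scaling_event(5, 80, max_upscale_factor=10))     # 50
```
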
<br>

**`downscale_tolerance`** (default: 0.05): Any recommendation falling within this factor below the current number of
replicas will not trigger a scale-down event. For example, if `downscale_tolerance` is 0.1 and there are 20 running
replicas, a recommendation of 18 or 19 replicas will not be acted on, and the API will remain at 20 replicas. Increasing
this value will prevent thrashing, but setting it too high will prevent the cluster from maintaining its optimal size.

<br>

**`upscale_tolerance`** (default: 0.05): Any recommendation falling within this factor above the current number of
replicas will not trigger a scale-up event. For example, if `upscale_tolerance` is 0.1 and there are 20 running
replicas, a recommendation of 21 or 22 replicas will not be acted on, and the API will remain at 20 replicas. Increasing
this value will prevent thrashing, but setting it too high will prevent the cluster from maintaining its optimal size.

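A sketch of the tolerance band; the helper is hypothetical, and treating the exact boundary as within tolerance is an assumption consistent with the examples above:

```python
def should_scale(current_replicas, recommendation,
                 downscale_tolerance=0.05, upscale_tolerance=0.05):
    """Return False when the recommendation falls inside the tolerance band."""
    lower = current_replicas * (1 - downscale_tolerance)  # 20 replicas, 0.1 -> 18.0
    upper = current_replicas * (1 + upscale_tolerance)    # 20 replicas, 0.1 -> 22.0
    return not (lower <= recommendation <= upper)

print(should_scale(20, 19, 0.1, 0.1))  # False: within tolerance, stay at 20
print(should_scale(20, 25, 0.1, 0.1))  # True: scale up
```
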
<br>

## Autoscaling instances

Cortex spins up and down instances based on the aggregate resource requests of all APIs. The number of instances will be
at least `min_instances` and no more than `max_instances` (configured during installation and modifiable
via `cortex cluster configure`).

## Autoscaling responsiveness

Assuming that `window` and `upscale_stabilization_period` are set to their default values (1 minute), it could take up
to 2 minutes of increased traffic before an extra replica is requested. As soon as the additional replica is requested,
the replica request will be visible in the output of `cortex get`, but the replica won't yet be running. If an extra
instance is required to schedule the newly requested replica, it could take a few minutes for AWS to provision the
instance (depending on the instance type), plus a few minutes for the newly provisioned instance to download your API
image and for the API to initialize (via its `__init__()` method).

If you want the autoscaler to react as quickly as possible, set `upscale_stabilization_period` and `window` to their
minimum values (0s and 10s respectively).

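As a quick sanity check on the 2-minute figure (a hypothetical helper, not part of Cortex):

```python
def worst_case_upscale_delay(window_seconds=60, upscale_stabilization_seconds=60):
    """Traffic must fill the averaging window, then the higher recommendation
    must hold for the entire upscale stabilization period."""
    return window_seconds + upscale_stabilization_seconds

print(worst_case_upscale_delay())       # 120 seconds with the defaults
print(worst_case_upscale_delay(10, 0))  # 10 seconds with the minimum values
```
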
If it takes a long time to initialize your API replica (i.e. install dependencies and run your predictor's `__init__()`
function), consider building your own API image to use instead of the default image. With this approach, you can
pre-download/build/install any custom dependencies and bake them into the image.