| 1 | +--- |
| 2 | +title: "Troubleshooting Provisioning Request in Kueue" |
| 3 | +date: 2024-05-20 |
| 4 | +weight: 3 |
| 5 | +description: > |
| 6 | + Troubleshooting the status of a Provisioning Request in Kueue |
| 7 | +--- |
| 8 | + |
| 9 | +This document helps you troubleshoot ProvisioningRequests, an API defined by [ClusterAutoscaler](https://github.com/kubernetes/autoscaler/blob/4872bddce2bcc5b4a5f6a3d569111c11b8a2baf4/cluster-autoscaler/provisioningrequest/apis/autoscaling.x-k8s.io/v1beta1/types.go#L41). |
| 10 | + |
| 11 | +Kueue creates ProvisioningRequests via the [Provisioning Admission Check Controller](/docs/admission-check-controllers/provisioning/), and treats them like an [Admission Check](/docs/concepts/admission_check/). In order for Kueue to admit a Workload, the ProvisioningRequest created for it needs to succeed. |
| 12 | + |
| 13 | +## Before you begin |
| 14 | + |
| 15 | +Before you begin troubleshooting, make sure your cluster meets the following requirements: |
| 16 | +- Your cluster has ClusterAutoscaler enabled, and ClusterAutoscaler supports the ProvisioningRequest API. |
| 17 | +Check your cloud provider's documentation to determine the minimum versions that support ProvisioningRequest. If you use GKE, your cluster should be running version `1.28.3-gke.1098000` or newer. |
| 18 | +- You use a node type that supports ProvisioningRequest. Supported node types may vary depending on your cloud provider. |
| 19 | +- Kueue's version is `v0.5.3` or newer. |
| 20 | +- You have enabled the `ProvisioningACC` in [the feature gates configuration](/docs/installation/#change-the-feature-gates-configuration). This feature gate is enabled by default for Kueue `v0.7.0` or newer. |
| 21 | + |
| 22 | +## Identifying the Provisioning Request for your job |
| 23 | + |
| 24 | +See the [Troubleshooting Jobs guide](/docs/tasks/troubleshooting/troubleshooting_jobs/#identifying-the-workload-for-your-job) to learn how to identify the Workload for your job. |
| 25 | + |
| 26 | +You can run the following command to see a brief summary of the state of a Provisioning Request (and other Admission Checks) in the `admissionChecks` field of the Workload's status. |
| 27 | + |
| 28 | +```bash |
| 29 | +kubectl describe workload WORKLOAD_NAME |
| 30 | +``` |
| 31 | + |
| 32 | +Kueue creates ProvisioningRequests using a naming pattern that helps you identify the request corresponding to your workload. |
| 33 | + |
| 34 | +``` |
| 35 | +[NAME OF YOUR WORKLOAD]-[NAME OF THE ADMISSION CHECK]-[NUMBER OF RETRY] |
| 36 | +``` |
| 37 | +For example: |
| 38 | +```bash |
| 39 | +sample-job-2zcsb-57864-sample-admissioncheck-1 |
| 40 | +``` |
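The naming pattern above can be sketched as a small shell function. This is only an illustration: `provreq_name` is a hypothetical helper, not part of Kueue or kubectl, and it assumes you already know the Workload name, the Admission Check name, and the retry number.

```shell
# provreq_name is a hypothetical helper (not part of Kueue or kubectl).
# It joins the three parts of the naming pattern with hyphens.
provreq_name() {
  workload="$1"   # Workload name, e.g. sample-job-2zcsb-57864
  check="$2"      # Admission Check name, e.g. sample-admissioncheck
  attempt="$3"    # retry number, e.g. 1
  printf '%s-%s-%s\n' "$workload" "$check" "$attempt"
}

# Prints: sample-job-2zcsb-57864-sample-admissioncheck-1
provreq_name sample-job-2zcsb-57864 sample-admissioncheck 1
```

You could pass the result to `kubectl get provisioningrequest` in place of `PROVISIONING_REQUEST_NAME`.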
| 41 | + |
| 42 | +When nodes for your job are provisioned, Kueue also adds the annotation `cluster-autoscaler.kubernetes.io/consume-provisioning-request` to the `.admissionChecks[*].podSetUpdates[*]` field in the Workload's status. The value of this annotation is the ProvisioningRequest's name. |
| 43 | + |
| 44 | +The output of the `kubectl describe workload` command should look similar to the following: |
| 45 | + |
| 46 | +```bash |
| 47 | +[...] |
| 48 | +Status: |
| 49 | +  Admission Checks: |
| 50 | +    Last Transition Time: 2024-05-22T10:47:46Z |
| 51 | +    Message: Provisioning Request was successfully provisioned. |
| 52 | +    Name: sample-admissioncheck |
| 53 | +    Pod Set Updates: |
| 54 | +      Annotations: |
| 55 | +        cluster-autoscaler.kubernetes.io/consume-provisioning-request: sample-job-2zcsb-57864-sample-admissioncheck-1 |
| 56 | +        cluster-autoscaler.kubernetes.io/provisioning-class-name: queued-provisioning.gke.io |
| 57 | +      Name: main |
| 58 | +    State: Ready |
| 59 | +``` |
| 60 | + |
| 61 | +## What is the current state of my Provisioning Request? |
| 62 | + |
| 63 | +One possible reason your job is not running is that its ProvisioningRequest is still waiting to be provisioned. |
| 64 | +To find out whether this is the case, view the ProvisioningRequest's state by running the following command: |
| 65 | + |
| 66 | +```bash |
| 67 | +kubectl get provisioningrequest PROVISIONING_REQUEST_NAME |
| 68 | +``` |
| 69 | + |
| 70 | +If this is the case, the output should look similar to the following: |
| 71 | + |
| 72 | +```bash |
| 73 | +NAME                                             ACCEPTED   PROVISIONED   FAILED   AGE |
| 74 | +sample-job-2zcsb-57864-sample-admissioncheck-1   True       False         False    20s |
| 75 | +``` |
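To make the table above easier to read at a glance, a minimal sketch like the following could translate the three condition columns into a short summary. `summarize_provreq` is a hypothetical helper, and it assumes the column order shown above (`NAME`, `ACCEPTED`, `PROVISIONED`, `FAILED`, `AGE`); rows with fewer than five fields are treated as not fully populated yet.

```shell
# summarize_provreq is a hypothetical helper. It reads the table printed by
# `kubectl get provisioningrequest` on stdin and prints one summary per row.
# Assumption: columns are NAME ACCEPTED PROVISIONED FAILED AGE, in that order.
summarize_provreq() {
  awk 'NR > 1 {
    if (NF < 5)            state = "still being processed (not all conditions set)"
    else if ($4 == "True") state = "failed"
    else if ($3 == "True") state = "provisioned"
    else if ($2 == "True") state = "accepted, waiting for nodes"
    else                   state = "not yet accepted"
    print $1 ": " state
  }'
}

# Sample input copied from the output above.
summarize_provreq <<'EOF'
NAME                                             ACCEPTED   PROVISIONED   FAILED   AGE
sample-job-2zcsb-57864-sample-admissioncheck-1   True       False         False    20s
EOF
```

For the sample row, this prints `sample-job-2zcsb-57864-sample-admissioncheck-1: accepted, waiting for nodes`. `Failed` is checked first because a failed request may also have been accepted earlier.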
| 76 | + |
| 77 | +You can also view a more detailed status of your ProvisioningRequest by running the following command: |
| 78 | + |
| 79 | +```bash |
| 80 | +kubectl describe provisioningrequest PROVISIONING_REQUEST_NAME |
| 81 | +``` |
| 82 | + |
| 83 | +If your ProvisioningRequest fails to provision nodes, the error output may look similar to the following: |
| 84 | +```bash |
| 85 | +[...] |
| 86 | +Status: |
| 87 | +  Conditions: |
| 88 | +    Last Transition Time: 2024-05-22T13:04:54Z |
| 89 | +    Message: Provisioning Request wasn't accepted. |
| 90 | +    Observed Generation: 1 |
| 91 | +    Reason: NotAccepted |
| 92 | +    Status: False |
| 93 | +    Type: Accepted |
| 94 | +    Last Transition Time: 2024-05-22T13:04:54Z |
| 95 | +    Message: Provisioning Request wasn't provisioned. |
| 96 | +    Observed Generation: 1 |
| 97 | +    Reason: NotProvisioned |
| 98 | +    Status: False |
| 99 | +    Type: Provisioned |
| 100 | +    Last Transition Time: 2024-05-22T13:06:49Z |
| 101 | +    Message: max cluster limit reached, nodepools out of resources: default-nodepool (cpu, memory) |
| 102 | +    Observed Generation: 1 |
| 103 | +    Reason: OutOfResources |
| 104 | +    Status: True |
| 105 | +    Type: Failed |
| 106 | +``` |
| 107 | + |
| 108 | +Note that the `Reason` and `Message` values of the `Failed` condition in your output may differ, |
| 109 | +depending on what prevented the provisioning. |
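When the `describe` output is long, it can help to print only the conditions whose status is `True`. The following is a minimal sketch, not an official tool: `true_conditions` is a hypothetical helper, and it assumes each condition is printed with `Message`, `Reason`, `Status`, and `Type` in that order, as in the output above.

```shell
# true_conditions is a hypothetical helper. It reads `kubectl describe`
# output on stdin and prints "Type (Reason): Message" for every condition
# whose Status is True.
# Assumption: within a condition, Type is printed after Message, Reason
# and Status, as in the sample output above.
true_conditions() {
  awk '
    # Strip the leading "Key:" and the whitespace after it.
    function value(line) { sub(/^[^:]*:[ \t]*/, "", line); return line }
    /^[ \t]*Message:/ { message = value($0) }
    /^[ \t]*Reason:/  { reason = value($0) }
    /^[ \t]*Status:/  { status = value($0) }
    /^[ \t]*Type:/ {
      if (status == "True") print value($0) " (" reason "): " message
      status = ""   # reset so a later unrelated "Type:" line is ignored
    }
  '
}

# Sample input taken from the output above (shortened to two conditions).
true_conditions <<'EOF'
Status:
  Conditions:
    Last Transition Time:  2024-05-22T13:04:54Z
    Message:               Provisioning Request wasn't provisioned.
    Observed Generation:   1
    Reason:                NotProvisioned
    Status:                False
    Type:                  Provisioned
    Last Transition Time:  2024-05-22T13:06:49Z
    Message:               max cluster limit reached, nodepools out of resources: default-nodepool (cpu, memory)
    Observed Generation:   1
    Reason:                OutOfResources
    Status:                True
    Type:                  Failed
EOF
```

For the sample input, this prints `Failed (OutOfResources): max cluster limit reached, nodepools out of resources: default-nodepool (cpu, memory)`.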
| 110 | + |
| 111 | +The Provisioning Request state is described in the `.conditions[*].status` field. |
| 112 | +An empty field means the ProvisioningRequest is still being processed by the ClusterAutoscaler. |
| 113 | +Otherwise, it falls into one of the states listed below: |
| 114 | +- `Accepted` - indicates that the ProvisioningRequest was accepted by ClusterAutoscaler, so ClusterAutoscaler will attempt to provision the nodes for it. |
| 115 | +- `Provisioned` - indicates that all of the requested resources were created and are available in the cluster. ClusterAutoscaler will set this condition when the VM creation finishes successfully. |
| 116 | +- `Failed` - indicates that it is impossible to obtain resources to fulfill this ProvisioningRequest. Condition Reason and Message will contain more details about what failed. |
| 117 | +- `BookingExpired` - indicates that the ProvisioningRequest previously had the `Provisioned` condition and its capacity reservation time has expired. |
| 118 | +- `CapacityRevoked` - indicates that the requested resources are no longer valid. |
| 119 | + |
| 120 | +The state transitions are as follows: |
| 121 | + |
| 122 | + |
| 123 | + |
| 124 | +## Why is a Provisioning Request not created? |
| 125 | + |
| 126 | +If Kueue did not create a Provisioning Request for your job, try checking the following requirements: |
| 127 | + |
| 128 | +### a. Ensure Kueue's controller manager has the `ProvisioningACC` feature gate enabled |
| 129 | + |
| 130 | +Run the following command to check whether Kueue's controller manager has the `ProvisioningACC` feature gate enabled: |
| 131 | + |
| 132 | +```bash |
| 133 | +kubectl describe pod -n kueue-system kueue-controller-manager- |
| 134 | +``` |
| 135 | + |
| 136 | +The arguments for the Kueue container should be similar to the following: |
| 137 | + |
| 138 | +```bash |
| 139 | +    ... |
| 140 | +    Args: |
| 141 | +      --config=/controller_manager_config.yaml |
| 142 | +      --zap-log-level=2 |
| 143 | +      --feature-gates=ProvisioningACC=true |
| 144 | +``` |
| 145 | + |
| 146 | +Note that for Kueue `v0.7.0` or newer, the feature gate is enabled by default, so you may see different output. |
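If you prefer not to scan the pod description by eye, a sketch like the one below could check it for the flag. `has_provisioning_acc` is a hypothetical helper that only looks for an explicit `ProvisioningACC=true` entry in `--feature-gates`; on Kueue `v0.7.0` or newer, a missing flag does not mean the feature is disabled.

```shell
# has_provisioning_acc is a hypothetical helper. It reads `kubectl describe pod`
# output on stdin and succeeds if --feature-gates explicitly contains
# ProvisioningACC=true (possibly alongside other comma-separated gates).
has_provisioning_acc() {
  grep -Eq -- '--feature-gates=([^[:space:]]*,)?ProvisioningACC=true(,|[[:space:]]|$)'
}

# Sample input copied from the arguments shown above.
if has_provisioning_acc <<'EOF'
    Args:
      --config=/controller_manager_config.yaml
      --zap-log-level=2
      --feature-gates=ProvisioningACC=true
EOF
then
  echo "ProvisioningACC is explicitly enabled"
else
  echo "no explicit ProvisioningACC=true flag found"
fi
```

With the sample input, this prints `ProvisioningACC is explicitly enabled`.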
| 147 | + |
| 148 | +### b. Ensure your Workload has reserved quota |
| 149 | + |
| 150 | +To check if your Workload has reserved quota in a ClusterQueue, check your Workload's status by running the following command: |
| 151 | + |
| 152 | +```bash |
| 153 | +kubectl describe workload WORKLOAD_NAME |
| 154 | +``` |
| 155 | + |
| 156 | +The output should be similar to the following: |
| 157 | + |
| 158 | +```bash |
| 159 | +[...] |
| 160 | +Status: |
| 161 | +  Conditions: |
| 162 | +    Last Transition Time: 2024-05-22T10:26:40Z |
| 163 | +    Message: Quota reserved in ClusterQueue cluster-queue |
| 164 | +    Observed Generation: 1 |
| 165 | +    Reason: QuotaReserved |
| 166 | +    Status: True |
| 167 | +    Type: QuotaReserved |
| 168 | +``` |
| 169 | + |
| 170 | +If the output you get is similar to the following: |
| 171 | + |
| 172 | +```bash |
| 173 | +  Conditions: |
| 174 | +    Last Transition Time: 2024-05-22T08:48:47Z |
| 175 | +    Message: couldn't assign flavors to pod set main: insufficient unused quota for memory in flavor default-flavor, 4396Mi more needed |
| 176 | +    Observed Generation: 1 |
| 177 | +    Reason: Pending |
| 178 | +    Status: False |
| 179 | +    Type: QuotaReserved |
| 180 | +``` |
| 181 | + |
| 182 | +This means you do not have sufficient free quota in your ClusterQueue. |
| 183 | + |
| 184 | +Other reasons why your Workload has not reserved quota may relate to LocalQueue/ClusterQueue misconfiguration, e.g.: |
| 185 | + |
| 186 | +```bash |
| 187 | +Status: |
| 188 | +  Conditions: |
| 189 | +    Last Transition Time: 2024-05-22T08:57:09Z |
| 190 | +    Message: ClusterQueue cluster-queue doesn't exist |
| 191 | +    Observed Generation: 1 |
| 192 | +    Reason: Inadmissible |
| 193 | +    Status: False |
| 194 | +    Type: QuotaReserved |
| 195 | +``` |
| 196 | + |
| 197 | +You can also check whether your ClusterQueues and LocalQueues are ready to admit your Workloads. |
| 198 | +See [Troubleshooting Queues](/docs/tasks/troubleshooting/troubleshooting_queues/) for more details. |
| 199 | + |
| 200 | + |
| 201 | +### c. Ensure the Admission Check is active |
| 202 | + |
| 203 | +To check if the Admission Check that your job uses is active, run the following command: |
| 204 | + |
| 205 | +```bash |
| 206 | +kubectl describe admissionchecks ADMISSIONCHECK_NAME |
| 207 | +``` |
| 208 | + |
| 209 | +Where `ADMISSIONCHECK_NAME` is the name configured in your ClusterQueue spec. See the [Admission Check documentation](/docs/concepts/admission_check/) for more details. |
| 210 | + |
| 211 | +The status of the Admission Check should be similar to: |
| 212 | + |
| 213 | +```bash |
| 214 | +... |
| 215 | +Status: |
| 216 | +  Conditions: |
| 217 | +    Last Transition Time: 2024-03-08T11:44:53Z |
| 218 | +    Message: The admission check is active |
| 219 | +    Reason: Active |
| 220 | +    Status: True |
| 221 | +    Type: Active |
| 222 | +``` |
| 223 | + |
| 224 | +If none of the above steps resolves your problem, contact us at the [Slack `wg-batch` channel](https://kubernetes.slack.com/archives/C032ZE66A2X). |