
Commit 993de6f

k8s-infra-cherrypick-robot, PBundyra, cortespao, alculquicondor, and tenzen-y authored
[website] Add troubleshooting guide for ProvisioningRequest (#2357)
* Add troubleshooting guide for ProvisioningRequest
* Fix description of the Provisioned state
* Apply suggestions from code review
  Co-authored-by: Paola Cortés <[email protected]>
  Co-authored-by: Aldo Culquicondor <[email protected]>
* Update site/content/en/docs/tasks/troubleshooting/troubleshooting_provreq.md
  Co-authored-by: Yuki Iwai <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Aldo Culquicondor <[email protected]>
* Improve ProvisioningRequest troubleshooting guide, add more examples
* Improve a ProvisioningRequest diagram
* Bump Kueue's version to 0.5.3
* Update site/content/en/docs/tasks/troubleshooting/troubleshooting_provreq.md
  Co-authored-by: Yaroslava Serdiuk <[email protected]>
* Update site/content/en/docs/tasks/troubleshooting/troubleshooting_provreq.md
  Co-authored-by: Yaroslava Serdiuk <[email protected]>
* Improve readability
* Improve a ProvisioningRequest diagram
* Update site/content/en/docs/tasks/troubleshooting/troubleshooting_provreq.md
  Co-authored-by: Aldo Culquicondor <[email protected]>
* Apply suggestions from code review
  Co-authored-by: Yuki Iwai <[email protected]>
* Improve naming in ProvisioningRequest troubleshooting guide
* Apply suggestions from code review
  Co-authored-by: Aldo Culquicondor <[email protected]>
* Improve the ProvisioningRequest troubleshooting guide
* Update site/content/en/docs/tasks/troubleshooting/troubleshooting_provreq.md
  Co-authored-by: Aldo Culquicondor <[email protected]>
* Improve the ProvisioningRequest troubleshooting guide

---------

Co-authored-by: Patryk Bundyra <[email protected]>
Co-authored-by: Patryk Bundyra <[email protected]>
Co-authored-by: Paola Cortés <[email protected]>
Co-authored-by: Aldo Culquicondor <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
Co-authored-by: Yaroslava Serdiuk <[email protected]>
1 parent a32228f commit 993de6f


3 files changed: +226 -1 lines changed

Lines changed: 224 additions & 0 deletions
@@ -0,0 +1,224 @@
---
title: "Troubleshooting Provisioning Request in Kueue"
date: 2024-05-20
weight: 3
description: >
  Troubleshooting the status of a Provisioning Request in Kueue
---

This document helps you troubleshoot ProvisioningRequests, an API defined by [ClusterAutoscaler](https://github.com/kubernetes/autoscaler/blob/4872bddce2bcc5b4a5f6a3d569111c11b8a2baf4/cluster-autoscaler/provisioningrequest/apis/autoscaling.x-k8s.io/v1beta1/types.go#L41).

Kueue creates ProvisioningRequests via the [Provisioning Admission Check Controller](/docs/admission-check-controllers/provisioning/), and treats them like an [Admission Check](/docs/concepts/admission_check/). For Kueue to admit a Workload, the ProvisioningRequest created for it must succeed.

## Before you begin

Before you begin troubleshooting, make sure your cluster meets the following requirements:
- Your cluster has ClusterAutoscaler enabled, and that ClusterAutoscaler supports the ProvisioningRequest API.
  Check your cloud provider's documentation to determine the minimum versions that support ProvisioningRequest. If you use GKE, your cluster should be running version `1.28.3-gke.1098000` or newer.
- You use a node type that supports ProvisioningRequest. Supported node types may vary depending on your cloud provider.
- Kueue's version is `v0.5.3` or newer.
- You have enabled the `ProvisioningACC` feature gate in [the feature gates configuration](/docs/installation/#change-the-feature-gates-configuration). This feature gate is enabled by default in Kueue `v0.7.0` and newer.

## Identifying the Provisioning Request for your job

See the [Troubleshooting Jobs guide](/docs/tasks/troubleshooting/troubleshooting_jobs/#identifying-the-workload-for-your-job) to learn how to identify the Workload for your job.

Run the following command to see a summary of the Provisioning Request's state (and of other Admission Checks) in the `admissionChecks` field of the Workload's status.

```bash
kubectl describe workload WORKLOAD_NAME
```

Kueue creates ProvisioningRequests using a naming pattern that helps you identify the request corresponding to your workload.

```
[NAME OF YOUR WORKLOAD]-[NAME OF THE ADMISSION CHECK]-[NUMBER OF RETRY]
```
For example:
```bash
sample-job-2zcsb-57864-sample-admissioncheck-1
```

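The naming pattern above can be reproduced with a small shell helper, which may be convenient when scripting lookups. This is only a sketch: the function name `provreq_name` is hypothetical, and it assumes you already know the retry number (for example `1`, as in the sample name above).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Kueue): builds the expected
# ProvisioningRequest name from the Workload name, the Admission Check
# name, and the retry number, following the pattern described above.
provreq_name() {
  local workload="$1" check="$2" attempt="$3"
  printf '%s-%s-%s\n' "$workload" "$check" "$attempt"
}

# Example with the values used in this guide.
provreq_name sample-job-2zcsb-57864 sample-admissioncheck 1
```

The printed name can then be passed to `kubectl get provisioningrequest` or `kubectl describe provisioningrequest`.
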
When nodes for your job are provisioned, Kueue also adds the annotation `cluster-autoscaler.kubernetes.io/consume-provisioning-request` to the `.admissionChecks[*].podSetUpdates[*]` field in the Workload's status. The value of this annotation is the Provisioning Request's name.

The output of the `kubectl describe workload` command should look similar to the following:

```bash
[...]
Status:
  Admission Checks:
    Last Transition Time: 2024-05-22T10:47:46Z
    Message: Provisioning Request was successfully provisioned.
    Name: sample-admissioncheck
    Pod Set Updates:
      Annotations:
        cluster-autoscaler.kubernetes.io/consume-provisioning-request: sample-job-2zcsb-57864-sample-admissioncheck-1
        cluster-autoscaler.kubernetes.io/provisioning-class-name: queued-provisioning.gke.io
      Name: main
    State: Ready
```

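If you only need the Provisioning Request's name, the annotation value can be filtered out of the `kubectl describe workload` output. The sketch below assumes the human-readable layout shown above, which is not a stable interface; the function name `consumed_provreq` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl describe workload` output on stdin
# and prints the value of the consume-provisioning-request annotation.
# Prints nothing if the annotation has not been added yet.
consumed_provreq() {
  awk '$1 == "cluster-autoscaler.kubernetes.io/consume-provisioning-request:" { print $2 }'
}

# Usage against a live cluster (WORKLOAD_NAME is a placeholder):
#   kubectl describe workload WORKLOAD_NAME | consumed_provreq
```

An empty result suggests that nodes have not been provisioned for the Workload yet.
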
## What is the current state of my Provisioning Request?

One possible reason your job is not running is that its ProvisioningRequest is still waiting to be provisioned.
To find out whether this is the case, view the Provisioning Request's state by running the following command:

```bash
kubectl get provisioningrequest PROVISIONING_REQUEST_NAME
```

If this is the case, the output should look similar to the following:

```bash
NAME                                             ACCEPTED   PROVISIONED   FAILED   AGE
sample-job-2zcsb-57864-sample-admissioncheck-1   True       False         False    20s
```

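The three condition columns can be condensed into a one-line summary per request. The following is a sketch only: the function name `provreq_summary` and the summary wording are hypothetical, the precedence (failed, then provisioned, then accepted) is an assumption made for illustration, and the parsing assumes that the `ACCEPTED`, `PROVISIONED`, and `FAILED` columns are all populated.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get provisioningrequest` table output
# on stdin and prints "<name>: <summary>" for every row after the header.
# Assumes columns: NAME ACCEPTED PROVISIONED FAILED AGE, all non-empty.
provreq_summary() {
  awk 'NR > 1 {
    if ($4 == "True")      state = "failed"
    else if ($3 == "True") state = "provisioned"
    else if ($2 == "True") state = "accepted, waiting for nodes"
    else                   state = "not yet accepted"
    print $1 ": " state
  }'
}

# Usage against a live cluster:
#   kubectl get provisioningrequest | provreq_summary
```
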
You can also view a more detailed status of your ProvisioningRequest by running the following command:

```bash
kubectl describe provisioningrequest PROVISIONING_REQUEST_NAME
```

If your ProvisioningRequest fails to provision nodes, the output may look similar to the following:
```bash
[...]
Status:
  Conditions:
    Last Transition Time: 2024-05-22T13:04:54Z
    Message: Provisioning Request wasn't accepted.
    Observed Generation: 1
    Reason: NotAccepted
    Status: False
    Type: Accepted
    Last Transition Time: 2024-05-22T13:04:54Z
    Message: Provisioning Request wasn't provisioned.
    Observed Generation: 1
    Reason: NotProvisioned
    Status: False
    Type: Provisioned
    Last Transition Time: 2024-05-22T13:06:49Z
    Message: max cluster limit reached, nodepools out of resources: default-nodepool (cpu, memory)
    Observed Generation: 1
    Reason: OutOfResources
    Status: True
    Type: Failed
```

Note that the `Reason` and `Message` values for the `Failed` condition in your output may differ, depending on the
reason that prevented the provisioning.

The Provisioning Request state is described in the `.conditions[*].status` field.
An empty field means the ProvisioningRequest is still being processed by the ClusterAutoscaler.
Otherwise, it falls into one of the states listed below:
- `Accepted` - indicates that the ProvisioningRequest was accepted by ClusterAutoscaler, so ClusterAutoscaler will attempt to provision the nodes for it.
- `Provisioned` - indicates that all of the requested resources were created and are available in the cluster. ClusterAutoscaler sets this condition when the VM creation finishes successfully.
- `Failed` - indicates that it is impossible to obtain resources to fulfill this ProvisioningRequest. The condition's Reason and Message contain more details about what failed.
- `BookingExpired` - indicates that the ProvisioningRequest previously had the Provisioned condition and the capacity reservation time has expired.
- `CapacityRevoked` - indicates that the requested resources are no longer valid.

The state transitions are as follows:

![Provisioning Request's states](/images/prov-req-states.svg)

## Why is a Provisioning Request not created?

If Kueue did not create a Provisioning Request for your job, check the following requirements:

### a. Ensure Kueue's controller manager has the `ProvisioningACC` feature gate enabled

Run the following command to check whether Kueue's controller manager has the `ProvisioningACC` feature gate enabled:

```bash
kubectl describe pod -n kueue-system kueue-controller-manager-
```

The arguments for the Kueue container should be similar to the following:

```bash
...
Args:
  --config=/controller_manager_config.yaml
  --zap-log-level=2
  --feature-gates=ProvisioningACC=true
```

Note that in Kueue `v0.7.0` and newer the feature gate is enabled by default, so your output may differ.

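The check above can be scripted by looking for the `--feature-gates` argument in the pod description. This is a minimal sketch with a hypothetical function name, `provisioning_acc_gate`; it only inspects the container arguments, so a result of "not set" means the version's default applies (enabled by default in `v0.7.0` and newer).

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl describe pod` output on stdin and
# reports how the ProvisioningACC feature gate is set in the container args.
provisioning_acc_gate() {
  local args
  # Keep only the lines carrying the --feature-gates argument, if any.
  args="$(grep -- '--feature-gates=' || true)"
  case "$args" in
    *ProvisioningACC=true*)  echo "explicitly enabled" ;;
    *ProvisioningACC=false*) echo "explicitly disabled" ;;
    *)                       echo "not set (default applies)" ;;
  esac
}

# Usage against a live cluster (POD_NAME is a placeholder for the full
# name of the kueue-controller-manager pod):
#   kubectl describe pod -n kueue-system POD_NAME | provisioning_acc_gate
```
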
### b. Ensure your Workload has reserved quota

To check whether your Workload has reserved quota in a ClusterQueue, check your Workload's status by running the following command:

```bash
kubectl describe workload WORKLOAD_NAME
```

The output should be similar to the following:

```bash
[...]
Status:
  Conditions:
    Last Transition Time: 2024-05-22T10:26:40Z
    Message: Quota reserved in ClusterQueue cluster-queue
    Observed Generation: 1
    Reason: QuotaReserved
    Status: True
    Type: QuotaReserved
```

If the output you get is similar to the following:

```bash
  Conditions:
    Last Transition Time: 2024-05-22T08:48:47Z
    Message: couldn't assign flavors to pod set main: insufficient unused quota for memory in flavor default-flavor, 4396Mi more needed
    Observed Generation: 1
    Reason: Pending
    Status: False
    Type: QuotaReserved
```

This means you do not have sufficient free quota in your ClusterQueue.

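To pull out just the `QuotaReserved` condition, the describe output can be filtered with `awk`. This is a sketch under the assumption that each condition block lists `Message`, `Reason`, and `Status` before `Type`, as in the outputs above; the function name `quota_reserved_condition` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl describe workload` output on stdin and
# prints "<status> <reason>: <message>" for the QuotaReserved condition.
# Relies on Type being the last field of each condition block.
quota_reserved_condition() {
  awk '
    $1 == "Message:" { sub(/^[ \t]*Message:[ \t]*/, ""); msg = $0; next }
    $1 == "Reason:"  { reason = $2; next }
    $1 == "Status:"  { status = $2; next }
    $1 == "Type:" && $2 == "QuotaReserved" { print status " " reason ": " msg }
  '
}

# Usage against a live cluster (WORKLOAD_NAME is a placeholder):
#   kubectl describe workload WORKLOAD_NAME | quota_reserved_condition
```

A status of `False` together with the printed reason and message usually points to either insufficient quota or a queue misconfiguration.
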
Other reasons why your Workload has not reserved quota may relate to LocalQueue or ClusterQueue misconfiguration, for example:

```bash
Status:
  Conditions:
    Last Transition Time: 2024-05-22T08:57:09Z
    Message: ClusterQueue cluster-queue doesn't exist
    Observed Generation: 1
    Reason: Inadmissible
    Status: False
    Type: QuotaReserved
```

In that case, check whether your ClusterQueues and LocalQueues are ready to admit your Workloads.
See [Troubleshooting Queues](/docs/tasks/troubleshooting/troubleshooting_queues/) for more details.


### c. Ensure the Admission Check is active

To check whether the Admission Check that your job uses is active, run the following command:

```bash
kubectl describe admissionchecks ADMISSIONCHECK_NAME
```

Here, `ADMISSIONCHECK_NAME` is the name configured in your ClusterQueue spec. See the [Admission Check documentation](/docs/concepts/admission_check/) for more details.

The status of the Admission Check should be similar to:

```bash
...
Status:
  Conditions:
    Last Transition Time: 2024-03-08T11:44:53Z
    Message: The admission check is active
    Reason: Active
    Status: True
    Type: Active
```

If none of the above steps resolves your problem, contact us at the [Slack `wg-batch` channel](https://kubernetes.slack.com/archives/C032ZE66A2X).

site/content/en/docs/tasks/troubleshooting/troubleshooting_queues.md

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ status:
 
 In the example above, the `Active` condition has status `False` because the configured flavor
 does not exist.
-Read [Aminister ClusterQueues](/docs/tasks/manage/administer_cluster_quotas) to learn how
+Read [Administer ClusterQueues](/docs/tasks/manage/administer_cluster_quotas) to learn how
 to configure a ClusterQueue.
 
 If the ClusterQueue is properly configured, the status will be similar to the following:

site/static/images/prov-req-states.svg

Lines changed: 1 addition & 0 deletions