Nutanix GPU support implementation #8745

adiantum · 2024-09-11T15:19:09Z

Description of changes:
Implemented GPU support for the Nutanix provider. Both vGPU and passthrough modes are supported.

Testing (if applicable):

$ eksctl anywhere create cluster -f ./cluster-ntnx-gpu.yaml -v 10 --bundles-override bin/local-bundle-release.yaml
2024-09-11T14:12:05.046Z	V6	Executing command	{"cmd": "/usr/bin/docker version --format {{.Client.Version}}"}
2024-09-11T14:12:05.065Z	V6	Executing command	{"cmd": "/usr/bin/docker info --format '{{json .MemTotal}}'"}
2024-09-11T14:12:05.118Z	V4	Reading bundles manifest	{"url": "bin/local-bundle-release.yaml"}
2024-09-11T14:12:05.138Z	V4	Using CAPI provider versions	{"Core Cluster API": "v1.7.2+7b521fe", "Kubeadm Bootstrap": "v1.7.2+74bd9a3", "Kubeadm Control Plane": "v1.7.2+d29bc82", "External etcd Bootstrap": "v1.0.13+4d890d2", "External etcd Controller": "v1.0.22+a8279bb", "Cluster API Provider Nutanix": "v1.3.5+0f39da7"}
2024-09-11T14:12:05.370Z	V5	Retrier:	{"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2024-09-11T14:12:05.370Z	V2	Pulling docker image	{"image": "public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:12:05.370Z	V6	Executing command	{"cmd": "/usr/bin/docker pull public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:12:05.953Z	V5	Retry execution successful	{"retries": 1, "duration": "582.779276ms"}
2024-09-11T14:12:05.953Z	V3	Initializing long running container	{"name": "eksa_1726063925370486292", "image": "public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:12:05.953Z	V6	Executing command	{"cmd": "/usr/bin/docker run -d --name eksa_1726063925370486292 --network host -w /home/ubuntu/eksa-tests/gpus-feature -v /var/run/docker.sock:/var/run/docker.sock -v /home/ubuntu/eksa-tests/gpus-feature/eksa-ntnx-gpu:/home/ubuntu/eksa-tests/gpus-feature/eksa-ntnx-gpu -v /home/ubuntu/eksa-tests/gpus-feature:/home/ubuntu/eksa-tests/gpus-feature -v /home/ubuntu/eksa-tests/gpus-feature:/home/ubuntu/eksa-tests/gpus-feature --entrypoint sleep public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110 infinity"}
2024-09-11T14:12:06.119Z	V1	Using the eksa controller to create the management cluster
2024-09-11T14:12:06.119Z	V4	Task start	{"task_name": "setup-validate"}
2024-09-11T14:12:06.119Z	V0	Performing setup and validations
2024-09-11T14:12:06.119Z	V0	ValidateClusterSpec for Nutanix datacenter	{"NutanixDatacenter": "eksa-ntnx-gpu"}
2024-09-11T14:12:15.144Z	V0	✅ Nutanix Provider setup is valid
2024-09-11T14:12:15.144Z	V0	✅ Validate OS is compatible with registry mirror configuration
2024-09-11T14:12:15.144Z	V0	✅ Validate certificate for registry mirror
2024-09-11T14:12:15.144Z	V0	✅ Validate authentication for git provider
2024-09-11T14:12:15.144Z	V0	✅ Validate cluster's eksaVersion matches EKS-A version
2024-09-11T14:12:15.144Z	V4	Task finished	{"task_name": "setup-validate", "duration": "9.025697406s"}
2024-09-11T14:12:15.144Z	V4	----------------------------------
2024-09-11T14:12:15.144Z	V4	Task start	{"task_name": "bootstrap-cluster-init"}
2024-09-11T14:12:15.144Z	V0	Creating new bootstrap cluster
...
2024-09-11T14:23:44.708Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726063925370486292 kubectl get clusters.cluster.x-k8s.io -o json --kubeconfig eksa-ntnx-gpu/generated/eksa-ntnx-gpu.kind.kubeconfig --namespace eksa-system"}
2024-09-11T14:23:44.867Z	V5	Retry execution successful	{"retries": 1, "duration": "158.96818ms"}
2024-09-11T14:23:44.867Z	V5	Retrier:	{"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2024-09-11T14:23:44.867Z	V4	Deleting kind cluster	{"name": "eksa-ntnx-gpu-eks-a-cluster"}
2024-09-11T14:23:44.867Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726063925370486292 kind delete cluster --name eksa-ntnx-gpu-eks-a-cluster"}
2024-09-11T14:23:45.957Z	V5	Retry execution successful	{"retries": 1, "duration": "1.089860832s"}
2024-09-11T14:23:45.957Z	V0	🎉 Cluster created!
2024-09-11T14:23:45.957Z	V4	Task finished	{"task_name": "delete-kind-cluster", "duration": "1.534960627s"}
...

$ kubectl apply -f ./cuda-vectoradd.yaml
pod/cuda-vectoradd created

$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

2024-09-11T14:51:49.291Z	V6	Executing command	{"cmd": "/usr/bin/docker version --format {{.Client.Version}}"}
2024-09-11T14:51:49.310Z	V6	Executing command	{"cmd": "/usr/bin/docker info --format '{{json .MemTotal}}'"}
2024-09-11T14:51:49.355Z	V4	Reading bundles manifest	{"url": "bin/local-bundle-release.yaml"}
2024-09-11T14:51:49.373Z	V4	Using CAPI provider versions	{"Core Cluster API": "v1.7.2+7b521fe", "Kubeadm Bootstrap": "v1.7.2+74bd9a3", "Kubeadm Control Plane": "v1.7.2+d29bc82", "External etcd Bootstrap": "v1.0.13+4d890d2", "External etcd Controller": "v1.0.22+a8279bb", "Cluster API Provider Nutanix": "v1.3.5+0f39da7"}
2024-09-11T14:51:49.601Z	V5	Retrier:	{"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2024-09-11T14:51:49.601Z	V2	Pulling docker image	{"image": "public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:51:49.601Z	V6	Executing command	{"cmd": "/usr/bin/docker pull public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:51:50.292Z	V5	Retry execution successful	{"retries": 1, "duration": "691.054922ms"}
2024-09-11T14:51:50.292Z	V3	Initializing long running container	{"name": "eksa_1726066309601289865", "image": "public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110"}
2024-09-11T14:51:50.292Z	V6	Executing command	{"cmd": "/usr/bin/docker run -d --name eksa_1726066309601289865 --network host -w /home/ubuntu/eksa-tests/gpus-feature -v /var/run/docker.sock:/var/run/docker.sock -v /home/ubuntu/eksa-tests/gpus-feature/eksa-ntnx-gpu:/home/ubuntu/eksa-tests/gpus-feature/eksa-ntnx-gpu -v /home/ubuntu/eksa-tests/gpus-feature:/home/ubuntu/eksa-tests/gpus-feature -v /home/ubuntu/eksa-tests/gpus-feature:/home/ubuntu/eksa-tests/gpus-feature --entrypoint sleep public.ecr.aws/l0g8r8j6/eks-anywhere-cli-tools:v0.20.4-eks-a-v0.21.0-dev-build.110 infinity"}
2024-09-11T14:51:50.468Z	V4	Task start	{"task_name": "setup-validate-create"}
2024-09-11T14:51:50.468Z	V0	ValidateClusterSpec for Nutanix datacenter	{"NutanixDatacenter": "eksa-wrk-ntnx-gpu"}
2024-09-11T14:51:59.497Z	V0	✅ Workload cluster's nutanix Provider setup is valid
2024-09-11T14:51:59.497Z	V0	✅ Validate OS is compatible with registry mirror configuration
2024-09-11T14:51:59.497Z	V0	✅ Validate certificate for registry mirror
2024-09-11T14:51:59.497Z	V0	✅ Validate authentication for git provider
2024-09-11T14:51:59.497Z	V0	✅ Validate cluster's eksaVersion matches EKS-A version
2024-09-11T14:51:59.497Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get clusters.cluster.x-k8s.io -o json --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig --namespace eksa-system"}
2024-09-11T14:51:59.637Z	V0	✅ Validate cluster name
2024-09-11T14:51:59.637Z	V0	✅ Validate gitops
2024-09-11T14:51:59.637Z	V5	skipping ValidateIdentityProviderNameIsUnique
2024-09-11T14:51:59.637Z	V0	✅ Validate identity providers' name
2024-09-11T14:51:59.637Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get customresourcedefinition clusters.cluster.x-k8s.io --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig"}
2024-09-11T14:51:59.763Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get customresourcedefinition clusters.anywhere.eks.amazonaws.com --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig"}
2024-09-11T14:51:59.908Z	V0	✅ Validate management cluster has eksa crds
2024-09-11T14:51:59.908Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get clusters.anywhere.eks.amazonaws.com -A -o jsonpath={.items[0]} --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig --field-selector=metadata.name=eksa-ntnx-gpu"}
2024-09-11T14:52:00.075Z	V0	✅ Validate management cluster name is valid
2024-09-11T14:52:00.075Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get clusters.anywhere.eks.amazonaws.com -A -o jsonpath={.items[0]} --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig --field-selector=metadata.name=eksa-ntnx-gpu"}
2024-09-11T14:52:00.214Z	V0	✅ Validate management cluster eksaVersion compatibility
2024-09-11T14:52:00.214Z	V6	Executing command	{"cmd": "/usr/bin/docker exec -i eksa_1726066309601289865 kubectl get --ignore-not-found -o json --kubeconfig ./eksa-ntnx-gpu/eksa-ntnx-gpu-eks-a-cluster.kubeconfig EKSARelease.v1alpha1.anywhere.eks.amazonaws.com --namespace eksa-system eksa-v0-0-0"}
2024-09-11T14:52:00.339Z	V0	✅ Validate eksa release components exist on management cluster
2024-09-11T14:52:00.339Z	V4	Task finished	{"task_name": "setup-validate-create", "duration": "9.87119141s"}
2024-09-11T14:52:00.339Z	V4	----------------------------------
2024-09-11T14:52:00.339Z	V4	Task start	{"task_name": "create-workload-cluster"}
2024-09-11T14:52:00.339Z	V0	Creating workload cluster
2024-09-11T14:52:00.339Z	V3	Applying cluster spec
...
2024-09-11T14:59:31.213Z	V4	----------------------------------
2024-09-11T14:59:31.213Z	V4	Task start	{"task_name": "write-cluster-config"}
2024-09-11T14:59:31.213Z	V0	Writing cluster config file
2024-09-11T14:59:31.216Z	V0	🎉 Cluster created!
2024-09-11T14:59:31.216Z	V4	Task finished	{"task_name": "write-cluster-config", "duration": "2.839671ms"}
2024-09-11T14:59:31.216Z	V4	----------------------------------
2024-09-11T14:59:31.216Z	V4	Tasks completed	{"duration": "7m40.748367154s"}
2024-09-11T14:59:31.216Z	V3	Cleaning up long running container	{"name": "eksa_1726066309601289865"}
2024-09-11T14:59:31.216Z	V6	Executing command	{"cmd": "/usr/bin/docker rm -f -v eksa_1726066309601289865"}

$ kubectl apply -f ./cuda-vectoradd.yaml
pod/cuda-vectoradd configured

$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
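
The block count in the `cuda-vectoradd` output follows from ceiling division of the element count by the threads per block. A minimal Go sketch of that arithmetic, using the values from the log above; the helper name `blocksNeeded` is hypothetical and is not part of the CUDA sample or of this PR:

```go
package main

import "fmt"

// blocksNeeded returns the smallest number of thread blocks such that
// blocks*threadsPerBlock covers every element (ceiling division).
// Hypothetical helper for illustration only.
func blocksNeeded(numElements, threadsPerBlock int) int {
	return (numElements + threadsPerBlock - 1) / threadsPerBlock
}

func main() {
	// 50000 elements and 256 threads per block, as reported in the pod logs.
	fmt.Println(blocksNeeded(50000, 256)) // prints 196
}
```

With 196 blocks of 256 threads the launch covers 50176 threads, so the last block has some threads with no element to process.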

Documentation added/planned (if applicable):
Planned docs: GPU support for Nutanix clusters

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

eks-distro-bot · 2024-09-11T15:19:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign panktishah26 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

eks-distro-bot · 2024-09-11T15:19:21Z

Hi @adiantum. Thanks for your PR.

I'm waiting for an aws member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

abhinavmpandey08 · 2024-09-11T16:10:16Z

/ok-to-test

abhinavmpandey08 · 2024-09-11T16:12:20Z

Can you add an example of what the cluster config will look like with the GPUs configured?

codecov · 2024-09-11T16:16:09Z

Codecov Report

Attention: Patch coverage is 91.35802% with 14 lines in your changes missing coverage. Please review.

Project coverage is 73.65%. Comparing base (2f63f88) to head (44d54a9).
Report is 7 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/providers/nutanix/validator.go	92.40%	8 Missing and 4 partials ⚠️
pkg/api/v1alpha1/nutanixmachineconfig.go	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #8745      +/-   ##
==========================================
+ Coverage   73.57%   73.65%   +0.07%     
==========================================
  Files         578      578              
  Lines       36629    36784     +155     
==========================================
+ Hits        26951    27094     +143     
- Misses       7951     7960       +9     
- Partials     1727     1730       +3
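
The project coverage percentages can be re-derived from the Hits and Lines rows of the table. A small Go sketch, assuming coverage is simply hits divided by lines; integer arithmetic is used so the result is truncated to two decimals, which happens to match the figures in this report (an observation about these numbers, not a statement about how Codecov computes them). The function name is hypothetical:

```go
package main

import "fmt"

// coverageHundredths returns hits/lines as a percentage expressed in
// hundredths of a percent, truncated rather than rounded.
// Hypothetical helper for illustration only.
func coverageHundredths(hits, lines int) int {
	return hits * 10000 / lines
}

func main() {
	// Hits and Lines taken from the coverage diff table above.
	base := coverageHundredths(26951, 36629)
	head := coverageHundredths(27094, 36784)
	fmt.Printf("base %d.%02d%%\n", base/100, base%100) // prints base 73.57%
	fmt.Printf("head %d.%02d%%\n", head/100, head%100) // prints head 73.65%
}
```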


adiantum · 2024-09-11T17:26:08Z

> Can you add an example of what the cluster config will look like with the GPUs configured?

Sure, I have it in tests:
https://github.com/aws/eks-anywhere/pull/8745/files#diff-8837b0b2c467097c587ca47d1d535adfec94fc5c306a3526a9acfccd210bba9eR64-R68

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: NutanixMachineConfig
metadata:
  name: eksa-unit-test
  namespace: default
spec:
  vcpusPerSocket: 1
  vcpuSockets: 4
  memorySize: 8Gi
  ...
  gpus:
  - type:     deviceID
    deviceID: 8757
  - type:     name
    name:     "Ampere 40"
  systemDiskSize: 40Gi
  osFamily: "ubuntu"
...
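
Each entry in the `gpus` list above pairs a `type` with the field that type names (`deviceID` or `name`). A minimal Go sketch of a consistency check over such entries follows. It is only an illustration under stated assumptions: the `GPU` struct, the `validateGPU` function, and the rule that the field matching `type` must be set are all hypothetical, and the actual checks in `pkg/providers/nutanix/validator.go` may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// GPU mirrors one entry of the gpus list in the example config.
// Illustrative type only; not the API type added by this PR.
type GPU struct {
	Type     string
	DeviceID int64
	Name     string
}

// validateGPU checks that the field named by Type is populated.
// The rule itself is an assumption made for this sketch.
func validateGPU(g GPU) error {
	switch g.Type {
	case "deviceID":
		if g.DeviceID == 0 {
			return errors.New("gpu type deviceID requires a non-zero deviceID")
		}
	case "name":
		if g.Name == "" {
			return errors.New("gpu type name requires a non-empty name")
		}
	default:
		return fmt.Errorf("unsupported gpu type %q", g.Type)
	}
	return nil
}

func main() {
	// The two entries from the example NutanixMachineConfig.
	gpus := []GPU{
		{Type: "deviceID", DeviceID: 8757},
		{Type: "name", Name: "Ampere 40"},
	}
	for i, g := range gpus {
		if err := validateGPU(g); err != nil {
			fmt.Printf("gpus[%d]: %v\n", i, err)
			return
		}
	}
	fmt.Println("all GPU entries are consistent")
}
```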

Initial GPU support implementation

fa5b416

eks-distro-bot added needs-ok-to-test and size/XXL labels Sep 11, 2024

eks-distro-bot added ok-to-test and removed needs-ok-to-test labels Sep 11, 2024

Fix test

44d54a9
