Setting up Torch
Here, we’ll install Torch into a conda environment, but we’ll do it using a pod dedicated to the install. Unlike in the previous example, we won’t be accessing conda interactively.
Installing Conda Packages with a Dedicated Pod
In order to install a conda package with a dedicated pod, we’ll need to run all of the usual installation steps in an automated fashion. This can be a fantastic way to set up your environments, as it allows you to quickly construct and deconstruct initial environments and confirm that they work, all from a single script. Note this script took about 15 minutes to run in testing.
For this example, we’ll use the following yml file:
apiVersion: v1
kind: Pod
metadata:
  name: pytorch-install
spec:
  restartPolicy: Never
  volumes:
    - name: home-volume
      persistentVolumeClaim:
        claimName: dsmillerrunfol-rwm # Ensure this is your correct PVC
  containers:
    - name: pytorch-setup-container
      image: "nvidia/samples:vectoradd-cuda11.2.1"
      resources:
        requests:
          memory: "32Gi"
          nvidia.com/gpu: 1
        limits:
          memory: "32Gi"
          nvidia.com/gpu: 1
      volumeMounts:
        - name: home-volume
          mountPath: /kube/home/
      command:
        - /bin/bash
        - -c
        - |
          # Set the Miniconda path and initialize
          export MINICONDA_PATH="/kube/home/.envs/conda"
          export PATH="$MINICONDA_PATH/bin:$PATH"
          # Check if the conda environment already exists
          if conda info --envs | grep -q "torchEnv"; then
            echo "Conda environment 'torchEnv' already exists. Deleting and rebuilding."
            conda env remove -n torchEnv --yes
          fi
          # Create a new conda environment with Python 3.11
          conda create -y -n torchEnv python=3.11
          # Activate the environment
          source activate torchEnv
          # Install PyTorch, torchvision, and pytorch-cuda
          conda install -y -n torchEnv pytorch torchvision pytorch-cuda=12.1 -c nvidia -c pytorch
          # Test that we have a GPU, and that it's registering in Torch.
          echo "Testing visibility of GPUs on system and in Python."
          python -c "import torch; gpus = torch.cuda.device_count(); print(f'Available GPUs: {gpus}'); [print(f'GPU {gpu}: {torch.cuda.get_device_name(gpu)}') for gpu in range(gpus)]"
          nvidia-smi
          # Check for GPU availability using Python and PyTorch
          GPU_COUNT=$(python -c "import torch; print(torch.cuda.device_count())")
          # Export the Conda environment if at least one GPU is detected
          if [ "$GPU_COUNT" -gt 0 ]; then
            echo "At least one GPU detected, exporting Conda environment."
            echo "Exporting environment yaml file to:" $MINICONDA_PATH
            conda env export -n torchEnv > $MINICONDA_PATH/torchEnv.yml
          else
            echo "No GPUs detected. Something may have gone wrong, or you may not have asked for any in your pod."
          fi
Notably:
- We are requesting a GPU. This is so we can test that Python can actually detect and use the GPU for processing.
- In our script, we are installing the appropriate version of pytorch-cuda (12.1) for the NVIDIA drivers available in our base image.
- Finally, we have some short diagnostic code that loads torch within Python and confirms it can detect GPUs; a standalone version of this check is sketched just after this list.
- This script has no sleep command, so the pod will terminate as soon as the installation and checks complete.
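Because the script ends by running conda env export, a copy of the environment definition (torchEnv.yml) will be sitting on your persistent volume once the pod completes. As a rough sketch of how you might use it later, assuming the same PVC is mounted at /kube/home/ and Miniconda is installed at /kube/home/.envs/conda as in the script above:

# Point at the Miniconda install on the mounted volume (same paths as the script above)
export MINICONDA_PATH="/kube/home/.envs/conda"
export PATH="$MINICONDA_PATH/bin:$PATH"
# Rebuild the environment from the exported file (on a system where torchEnv doesn't already exist)...
conda env create -f $MINICONDA_PATH/torchEnv.yml
# ...or simply activate the existing environment and re-run the GPU check
source activate torchEnv
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

Keeping the exported file around makes it straightforward to rebuild the environment from scratch if it is ever deleted or corrupted.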
Go ahead and create this yml file, and apply it with kubectl apply -f 2_createTorchEnv.yml. As before, you can watch the progress of the pod with kubectl get pods and kubectl logs pytorch-install. If it is taking a while, you can also use kubectl describe pod pytorch-install to get a bit more information on the status of the pod. Note that this pod may take a while to register as complete, as the relevant packages total more than a gigabyte of install size. When the status of the pod changes from “Running” to “Completed”, you’ll know it’s done; alternatively, at the end of the logs you should see the total number of GPUs accessible by Python once the script completes.
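Put together, the workflow looks roughly like this:

# Create the pod from the manifest above
kubectl apply -f 2_createTorchEnv.yml
# Check its status; it should move from Running to Completed when the script finishes
kubectl get pods
# Follow the install output as it runs
kubectl logs pytorch-install
# Get more detail if the pod seems stuck (e.g., Pending while it waits for a GPU)
kubectl describe pod pytorch-install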
Sometimes, you may want to run watch kubectl get pods on your frontend to monitor pods that may take a while to complete.
If everything works correctly, after about 15 minutes kubectl logs pytorch-install should return something similar to: