Updating NVIDIA Driver and CUDA Toolkit
Steps to install a new NVIDIA Driver and CUDA Toolkit. This procedure must be executed as root on a machine that is not running jobs.
Draining the nodes
The compute nodes must be drained in SLURM before the update.
The scontrol command changes the state of a node when given the node name and the new state.
You must provide a reason when disabling a node.
Disable:
[trcis001 ~]# scontrol update NodeName=tcogq[001-006] State=DRAIN Reason="CUDA UPDATE"
Enable:
[trcis001 ~]# scontrol update NodeName=tcogq[001-006] State=RESUME
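Setting State=DRAIN only stops new jobs from starting; a node with running jobs reports "draining" until they finish and "drained" afterwards. The following is a minimal sketch of a check to run before touching the driver, assuming sinfo is available on the head node. The helper name all_drained is hypothetical and not part of SLURM.

```shell
#!/bin/sh
# all_drained: read "<node> <state>" lines on stdin and succeed only if
# there is at least one line and every state starts with "drained".
# A node in state "draining" still has running jobs and fails the check.
all_drained() {
    awk '{ seen = 1; if ($2 !~ /^drained/) busy = 1 }
         END { exit (busy || !seen) }'
}

# Example call, skipped when SLURM's sinfo is not on the PATH.
# The node list tcogq[001-006] is the one used in the scontrol commands.
if command -v sinfo >/dev/null 2>&1; then
    if sinfo -N -h -n "tcogq[001-006]" -o "%N %T" | all_drained; then
        echo "all nodes drained"
    else
        echo "some nodes are still draining or not drained"
    fi
fi
```

The parsing is kept in a function that reads standard input so it can be tried with sample text on a machine without SLURM.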
Updating the NVIDIA Driver and CUDA Toolkit
Connect via SSH to the compute node and execute these steps:
1. Stop the daemon that uses the nvidia kernel module
#> systemctl stop nvidia-dcgm
2. Unload all kernel modules that depend on nvidia, then the nvidia module itself
#> rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia
3. Install the NVIDIA Driver and CUDA Toolkit
#> sh /shared/src/cuda_12.2.1_535.86.10_linux.run --silent --driver
#> sh /shared/src/cuda_12.2.1_535.86.10_linux.run --silent --toolkit
4. Restart the machine
#> reboot
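If rmmod in step 2 fails because a module is still in use, the driver install in step 3 runs against a loaded module. A small check between steps 2 and 3 can catch this. The sketch below only reports what is still loaded; the helper name loaded_nvidia_modules is hypothetical.

```shell
#!/bin/sh
# loaded_nvidia_modules: read lsmod-style output on stdin and print the
# name of every module whose name starts with "nvidia".
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Example call, skipped when lsmod is not available. It only reports;
# it does not stop or unload anything.
if command -v lsmod >/dev/null 2>&1; then
    remaining=$(lsmod | loaded_nvidia_modules)
    if [ -n "$remaining" ]; then
        echo "still loaded:" $remaining
    else
        echo "no nvidia modules loaded"
    fi
fi
```

An empty result after step 2 suggests it is safe to continue with the installer. A non-empty result usually means some process or daemon still holds the GPU.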
Verify NVIDIA Driver and CUDA Toolkit versions
On the infrastructure machine, verify the versions using Ansible:
[trcis009 ~]# ansible thorny-gpu -m shell -a "nvidia-smi | grep CUDA"
tcogq001.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
tcogq002.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
tcogq003.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
tcogq006.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
tcogq004.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
tcogq005.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
tbmgq001.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
tbegq201.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
tbegq202.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
tbmgq100.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
tbegq200.hpc.wvu.edu | CHANGED | rc=0 >>
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
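With many hosts it is easy to miss one that still reports an older driver; in the output above, tbmgq001 and tbmgq100 show 530.30.02. The following sketch filters Ansible output of that shape and prints only the hosts whose driver version differs from an expected value. The helper name outdated_hosts is hypothetical, and the parsing assumes the two-line-per-host layout shown above; hosts that Ansible reports as failed or unreachable are not covered.

```shell
#!/bin/sh
# outdated_hosts EXPECTED_DRIVER
# Read "ansible -m shell" output on stdin: a "<host> | CHANGED | rc=0 >>"
# line followed by the nvidia-smi banner line. Print each host whose
# "Driver Version:" value is not EXPECTED_DRIVER.
outdated_hosts() {
    awk -v want="$1" '
        $2 == "|" && ($3 == "CHANGED" || $3 == "SUCCESS") { host = $1; next }
        /Driver Version:/ {
            for (i = 2; i < NF; i++) {
                if ($(i - 1) == "Driver" && $i == "Version:") {
                    if ($(i + 1) != want) print host
                    break
                }
            }
        }'
}

# Example call, skipped when ansible is not installed. The group name
# thorny-gpu and the version 535.86.10 come from the text above.
if command -v ansible >/dev/null 2>&1; then
    ansible thorny-gpu -m shell -a "nvidia-smi | grep CUDA" | outdated_hosts 535.86.10
fi
```

No output means every host that answered reports the expected driver.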