Updating NVIDIA Driver and CUDA Toolkit
=======================================

Steps to install a new NVIDIA Driver and CUDA Toolkit.

This procedure must be executed as root on a machine that is not running jobs.

Draining the nodes
------------------

The compute nodes must be drained in SLURM before the update.

In SLURM, the ``scontrol`` command changes the state of a machine, given the node name and the new state.
You must provide a reason when disabling a node.

* **Disable** (before the update)::

    [trcis001 ~]# scontrol update NodeName=tcogq[001-006] State=DRAIN Reason="CUDA UPDATE"

* **Enable** (once the update is finished)::

    [trcis001 ~]# scontrol update NodeName=tcogq[001-006] State=RESUME

Updating the NVIDIA Driver and CUDA Toolkit
-------------------------------------------

Connect via SSH to the compute node and execute these steps:

1. Stop the daemon that uses the ``nvidia`` kernel module ::

    #> systemctl stop nvidia-dcgm

2. Remove all modules depending on ``nvidia`` ::

    #> rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia

3. Install the NVIDIA Driver and CUDA Toolkit ::

    #> sh /shared/src/cuda_12.2.1_535.86.10_linux.run --silent --driver
    #> sh /shared/src/cuda_12.2.1_535.86.10_linux.run --silent --toolkit

4. Restart the machine ::

    #> reboot

Verify NVIDIA Driver and CUDA Toolkit versions
----------------------------------------------

On the infrastructure machine, verify the versions using Ansible ::

    [trcis009 ~]# ansible thorny-gpu -m shell -a "nvidia-smi | grep CUDA"
    tcogq001.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
    tcogq002.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
    tcogq003.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
    tcogq006.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
    tcogq004.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
    tcogq005.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
    tbmgq001.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
    tbegq201.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
    tbegq202.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
    tbmgq100.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
    tbegq200.hpc.wvu.edu | CHANGED | rc=0 >>
    | NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
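
With many nodes, mismatched versions are easy to miss when scanning the Ansible output by eye. The following is a minimal sketch, not part of the original procedure, that reads that output on standard input and prints only the hosts whose driver version differs from the expected one. The function name ``find_outdated_hosts`` is hypothetical, and the parsing assumes the two-line-per-host layout shown above (a ``host | CHANGED | rc=0 >>`` line followed by the ``nvidia-smi`` header line).

```shell
#!/usr/bin/env bash
# Sketch: list hosts whose NVIDIA driver version differs from the expected one.
# find_outdated_hosts is a hypothetical helper, not an existing tool.
# Usage: <ansible output> | find_outdated_hosts <expected-driver-version>

find_outdated_hosts() {
    local expected="$1"
    awk -v expected="$expected" '
        # Host header line, e.g. "tcogq001.hpc.wvu.edu | CHANGED | rc=0 >>"
        $1 != "|" && $2 == "|" && $NF == ">>" { host = $1; next }

        # nvidia-smi header line: find the token after "Driver Version:"
        /Driver Version:/ {
            for (i = 1; i + 2 <= NF; i++) {
                if ($i == "Driver" && $(i + 1) == "Version:" && $(i + 2) != expected) {
                    # Print the host name and the version it actually reports
                    print host, $(i + 2)
                }
            }
        }
    '
}
```

For example, piping the command above into ``find_outdated_hosts 535.86.10`` would, for the output shown, list ``tbmgq001.hpc.wvu.edu`` and ``tbmgq100.hpc.wvu.edu``, which still report driver 530.30.02. Unreachable or failed hosts produce differently shaped Ansible lines and are not handled by this sketch.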