Fixing GPU driver problems on Linux

When both of the following commands fail, work through the checks below one step at a time.

> nvcc -V
zsh: command not found: nvcc

> nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

  1. Check whether the CUDA directory is in $PATH.
> echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games

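If it is not, add export PATH=/usr/local/cuda/bin:$PATH to ~/.zshrc (if you use zsh) so that zsh knows where nvcc is located. After adding the line, reload the file and run nvcc again; it should now print its version: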

❯ source ~/.zshrc # reload the settings from the file
❯ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
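
The PATH change above can also be made idempotent, so that reloading the shell configuration several times does not keep stacking the same entry. The following is a minimal sketch, not the only way to do it; the helper names `path_has` and `prepend_once` are hypothetical, and /usr/local/cuda/bin is the location used in this note, which may differ on other machines.

```shell
#!/bin/sh
# Return 0 if directory $1 is already an entry of the colon-separated list $2.
path_has() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the list $2 with $1 prepended, unless $1 is already in it.
prepend_once() {
    if path_has "$1" "$2"; then
        printf '%s\n' "$2"
    else
        printf '%s\n' "$1:$2"
    fi
}

# Location taken from this note; adjust if CUDA is installed elsewhere.
CUDA_BIN=/usr/local/cuda/bin
PATH=$(prepend_once "$CUDA_BIN" "$PATH")
export PATH
```

Placed in ~/.zshrc in place of the plain export line, this keeps $PATH free of duplicate CUDA entries.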
  2. Next, deal with the nvidia-smi problem. Start by checking the status of the persistence daemon:
> sudo systemctl status nvidia-persistenced

[sudo] password for Sam504:
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2023-05-31 14:51:18 CST; 43s ago
  Process: 955 ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced (code=exited, status=0/SUCCESS)
  Process: 893 ExecStart=/usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose (code=exited, st

May 31 14:51:18 liboffice nvidia-persistenced[897]: Started (897)
May 31 14:51:18 liboffice nvidia-persistenced[897]: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/
May 31 14:51:18 liboffice nvidia-persistenced[897]: PID file unlocked.
May 31 14:51:18 liboffice nvidia-persistenced[893]: nvidia-persistenced failed to initialize. Check syslog for more details.
May 31 14:51:18 liboffice nvidia-persistenced[897]: PID file closed.
May 31 14:51:18 liboffice nvidia-persistenced[897]: The daemon no longer has permission to remove its runtime data directory /var
May 31 14:51:18 liboffice nvidia-persistenced[897]: Shutdown (897)
May 31 14:51:18 liboffice systemd[1]: nvidia-persistenced.service: Control process exited, code=exited status=1
May 31 14:51:18 liboffice systemd[1]: nvidia-persistenced.service: Failed with result 'exit-code'.
May 31 14:51:18 liboffice systemd[1]: Failed to start NVIDIA Persistence Daemon.
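
The log above shows the daemon failing to query the NVIDIA devices. One possible cause (an assumption, since the log does not confirm it) is that the nvidia kernel module is not loaded. A minimal sketch for checking this by reading /proc/modules; the helper name `module_loaded` is hypothetical.

```shell
#!/bin/sh
# Return 0 if a kernel module named $1 is listed in the file $2.
# $2 defaults to /proc/modules, which lists one loaded module per line
# with the module name in the first column.
module_loaded() {
    file=${2:-/proc/modules}
    [ -r "$file" ] || return 1
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "$file"
}

if module_loaded nvidia; then
    echo "nvidia kernel module is loaded"
else
    echo "nvidia kernel module is NOT loaded"
fi
```

If the module is reported as not loaded, that is consistent with the driver being missing or broken, which is what the reinstall addresses.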

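The service is in the failed state, so try reinstalling the driver. Use the following command to list the drivers available for the card: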

> ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C03sv00001462sd00003281bc03sc00i00
vendor   : NVIDIA Corporation
model    : GP106 [GeForce GTX 1060 6GB]
driver   : nvidia-driver-418 - third-party free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-515 - third-party free
driver   : nvidia-driver-510 - third-party free
driver   : nvidia-driver-440 - third-party free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-460 - third-party free
driver   : nvidia-driver-465 - third-party free
...
driver   : xserver-xorg-video-nouveau - distro free builtin

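My approach is to install the newest version offered in the list. The command below uses nvidia-driver-418 as an example; substitute the version you chose from your own list.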

sudo apt install nvidia-driver-418
sudo reboot
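
Picking the highest-numbered driver from the `ubuntu-drivers devices` output can be sketched as a small filter. This assumes the line layout shown in the listing above (`driver   : nvidia-driver-NNN - ...`) and deliberately skips the `-server` variants; the function name `newest_driver` is hypothetical, and the highest number is not necessarily the best choice for every card.

```shell
#!/bin/sh
# Read `ubuntu-drivers devices` output on stdin and print the
# nvidia-driver-NNN package with the largest NNN (non-server only).
newest_driver() {
    awk '$1 == "driver" && $3 ~ /^nvidia-driver-[0-9]+$/ {
             n = substr($3, 15) + 0      # digits after "nvidia-driver-"
             if (n > best) { best = n; name = $3 }
         }
         END { if (name != "") print name }'
}

# Only query the tool when it is actually installed.
if command -v ubuntu-drivers >/dev/null 2>&1; then
    ubuntu-drivers devices | newest_driver
fi
```

The printed package name can then be passed to `sudo apt install`.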

If the installation runs into problems, try removing all NVIDIA drivers, rebooting, and then installing again.

sudo apt remove --purge '^nvidia-.*'
sudo apt autoremove
sudo apt autoclean

Once this is done, be sure to reboot before running nvidia-smi.
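
After the reboot, both commands from the start of this note can be rechecked in one pass. A minimal sketch; the helper name `check_cmd` is hypothetical, and it only verifies that the tools are on $PATH, not that the driver itself responds.

```shell
#!/bin/sh
# Report whether command $1 is on PATH; return non-zero if it is not.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

status=0
for cmd in nvcc nvidia-smi; do
    check_cmd "$cmd" || status=1
done

if [ "$status" -eq 0 ]; then
    echo "both tools are on PATH; run nvidia-smi to confirm the driver responds"
else
    echo "at least one tool is still missing"
fi
```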

That completes the driver repair.

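While the GPU is working, the following command shows its status live, refreshing every 0.1 seconds and highlighting the values that changed: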

watch -n 0.1 -d nvidia-smi