[Ubuntu] How to fix "Unable to determine the device handle for GPU0000:06:00.0: Unknown Error"

Situation

I was training a model on a server machine (already 3 to 4 epochs in, so I did not expect a code-level error) when training suddenly stopped with the following error message:

RuntimeError: CUDA error: the launch timed out and was terminated

So I ran

nvidia-smi

in a terminal to check whether training was still running.

~~(If it were still training, the GPU would be in use.)~~

Instead, nvidia-smi spat out this message:

Unable to determine the device handle for GPU0000:06:00.0: Unknown Error
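When a long training job runs unattended, it can help to check for this message from a script rather than by eye. The following is a minimal sketch, not an official NVIDIA tool: the function names `find_lost_gpus` and `run_nvidia_smi` are made up for illustration, and the regular expression is written only from the message text quoted in this post.

```python
import re
import subprocess

# Pattern built from the message quoted in this post. The bus ID sometimes
# follows "GPU" directly and sometimes after a space, so both are accepted.
HANDLE_ERROR = re.compile(
    r"Unable to determine the device handle for GPU\s*"
    r"([0-9A-Fa-f]{4,8}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])"
)


def find_lost_gpus(text):
    """Return the PCI bus IDs named in 'device handle' error messages."""
    return HANDLE_ERROR.findall(text)


def run_nvidia_smi():
    """Run nvidia-smi and return stdout and stderr joined into one string.

    Returns an empty string if the nvidia-smi binary is not installed.
    """
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
    except FileNotFoundError:
        return ""
    return result.stdout + result.stderr
```

On the affected machine, `find_lost_gpus(run_nvidia_smi())` should return the bus ID from the message (here `0000:06:00.0`), and an empty list once the GPU is visible again.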

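Cause

There can be many reasons for an error like this, but the ones I found boil down to the following.

Physical failure of the GPU
GPU temperature currently too high
Poor contact in the GPU connection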
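Of the causes above, temperature is the one that can be watched from software. The sketch below is an assumption-laden example: it uses the `--query-gpu` option of nvidia-smi, the helper names are hypothetical, and the 85 °C limit is an arbitrary illustrative value, not a figure from NVIDIA or from this post. Note that while the GPU is in the error state the query itself fails, so this is mainly useful for monitoring after the GPU has come back.

```python
import subprocess

# nvidia-smi prints one line per GPU, e.g. "0, 45" (index, temperature in C).
QUERY = ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"]


def parse_temperatures(csv_text):
    """Parse lines such as '0, 45' into {gpu_index: temperature_in_celsius}."""
    temps = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue  # not an "index, temperature" line (e.g. an error message)
        try:
            temps[int(parts[0])] = int(parts[1])
        except ValueError:
            continue  # a field that is not a plain integer
    return temps


def hot_gpus(temps, limit_c=85):
    """Return indices of GPUs at or above limit_c (85 is an assumed value)."""
    return sorted(i for i, t in temps.items() if t >= limit_c)


def read_temperatures():
    """Query nvidia-smi; return {} if the command is missing or fails."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True)
    except FileNotFoundError:
        return {}
    if result.returncode != 0:
        return {}
    return parse_temperatures(result.stdout)
```

Calling `hot_gpus(read_temperatures())` periodically during training gives a rough early warning. Pick the limit from your own card's specification.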

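Solution

I have compiled the results of several Google searches.

I recommend trying the steps in the following order.

Reboot -------------- this first step fixed it for me
Reconnect the GPU auxiliary power cable (PCI-E)
Reinstall the GPU driver
Reseat the GPU in its current slot
Move the GPU to a different slot
Upgrade the BIOS
Replace the mainboard or the GPU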
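Before opening the case, the kernel log can hint at whether the hardware steps are likely to be needed. The NVIDIA forum reply listed in the references traces the same message to a fatal PCIe error reported as `pcieport ...: AER: Uncorrected (Fatal)`. The following is a minimal sketch that looks for such lines. The function names are hypothetical, the pattern is derived only from the single log line quoted in that reply, and reading `dmesg` may require root on some systems.

```python
import re
import subprocess

# Pattern derived from the log line quoted in the NVIDIA forum reply:
#   [ 695.203791] pcieport 0000:00:02.0: AER: Uncorrected (Fatal)
AER_FATAL = re.compile(
    r"pcieport ([0-9A-Fa-f]{4}:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f]): "
    r"AER: Uncorrected \(Fatal\)"
)


def find_fatal_aer_ports(log_text):
    """Return PCIe port addresses with a fatal AER report, without duplicates."""
    ports = []
    for port in AER_FATAL.findall(log_text):
        if port not in ports:
            ports.append(port)
    return ports


def read_kernel_log():
    """Return the output of dmesg, or '' if the command is unavailable."""
    try:
        result = subprocess.run(["dmesg"], capture_output=True, text=True)
    except FileNotFoundError:
        return ""
    return result.stdout
```

If `find_fatal_aer_ports(read_kernel_log())` returns a non-empty list, that may suggest a reboot or driver reinstall will only help temporarily, and that the reseating, slot, BIOS, and mainboard steps are worth trying. An empty list does not rule out a hardware problem.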

* References

https://forums.developer.nvidia.com/t/unable-to-determine-the-device-handle-for-gpu-000000-0-unknown-error/197974/2

Unable to determine the device handle for GPU 0000:02:00.0: Unknown Error

You’re getting a fatal pcie error on the root bus so the gpu is disconnected. Please try reseating the gpu in its slot, try a different slot, check for a bios upgrade, check/replace mainboard. [ 695.203791] pcieport 0000:00:02.0: AER: Uncorrected (Fatal)


https://github.com/NVIDIA/nvidia-container-toolkit/issues/69

Unable to determine the device handle for GPU0000:65:00.0: Unknown Error -- only from container · Issue #69 · NVIDIA/nvidia-co

Hi here, I posted this issue in the nvidia container issue also. not sure the root cause. I need your help :) I installed 2 A4000 video cards on my Dell T5820 which got the RHEL 8.6 running. After ...

