nvidia card going bad / New AMD GPU recommendations
Stupendous Man commented on 10 November 2018 at 1:16 pm UTC

Lately I've had a ton of problems with my GTX 660. When playing any graphics-intensive game I get constant errors in the log like these:
Nov  4 12:35:34 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov  4 12:35:34 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00002384, Data 40000001, ErrorCode 0000000c
Nov  4 12:57:30 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov  4 12:57:30 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0030, Class 0000a097, Offset 00001c80, Data 40000000, ErrorCode 0000000c
Nov  4 23:53:41 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov  4 23:53:41 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 1f789000
Nov  5 22:44:50 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 32, Channel ID 00000050 intr 00040000
Nov  5 22:54:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 2faac600
Nov  5 23:11:48 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 31, Ch 00000050, engmask 00000101, intr 10000000
Nov  6 23:48:26 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov  6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00001b00, Data 00004100, ErrorCode 0000000c
Nov  6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: EXTRA_MACRO_DATA
Nov  6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ESR 0x404490=0x80000002
Nov  6 23:48:26 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 13, Graphics Exception: ChID 0058, Class 0000a097, Offset 00001b00, Data 00004100
Nov  6 23:48:35 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000058 beef9097 0000a097 00001414 00000000
Nov  6 23:51:10 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00001418, Data 00000004, ErrorCode 0000000c
Nov  7 13:02:52 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed
Nov  7 13:02:52 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000038 beef3901 0000a040 000001b8 ffffffff
Nov  7 13:06:37 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000038 beef3901 0000a040 000001b8 ffffffff

If I play Euro Truck Simulator 2, the game crashes instantly with graphical artifacts when an error happens. Other games run flawlessly even though there are errors in the system log.
The card gets to around 55-60 degrees.

So, is my card broken, or are the nvidia drivers acting up? The drivers are version 396.54 and I'm running a 4.18.12 gentoo kernel. I've been running the same driver for weeks before the errors started.

tuubi commented on 10 November 2018 at 1:55 pm UTC

Stupendous Man: So, is my card broken, or are the nvidia drivers acting up? The drivers are version 396.54 and I'm running a 4.18.12 gentoo kernel. I've been running the same driver for weeks before the errors started.
Same driver, but not the same kernel?

Stupendous Man commented on 10 November 2018 at 2:10 pm UTC

tuubi: Same driver, but not the same kernel?
I upgraded the kernel a week or two before the problems started showing. But you are right, I should try downgrading and checking. Not sure if the kernel I used is still available though.

Stupendous Man commented on 10 November 2018 at 2:35 pm UTC

I just tried downgrading to a much earlier kernel, but no luck; I still get errors.

tuubi commented on 10 November 2018 at 3:28 pm UTC

According to this documentation, one of those error messages (number 69) could be HW related. Maybe try reinstalling the driver, or try a different driver version? If that doesn't help, it probably is your GPU.
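
To see at a glance which Xid codes dominate a log like the one posted above, a small script can tally them. This is only a rough sketch and not part of any NVIDIA tooling: the function name `count_xids`, the regular expression, and the `sample` list are illustrative assumptions, and the pattern is based solely on the line format seen in this thread.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt in this thread:
#   ... kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ...
XID_RE = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+),")


def count_xids(lines):
    """Return a Counter mapping each Xid code (int) to how often it appears."""
    counts = Counter()
    for line in lines:
        match = XID_RE.search(line)
        if match:  # lines without an Xid entry are simply skipped
            counts[int(match.group("code"))] += 1
    return counts


# A few lines copied from the log above, used here as sample input.
sample = [
    "Nov  4 12:35:34 unicorn kernel: NVRM: GPU at PCI:0000:0a:00: GPU-dfde4129-ba3c-74bc-84aa-ea76a1cf90ed",
    "Nov  4 12:35:34 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0058, Class 0000a097, Offset 00002384, Data 40000001, ErrorCode 0000000c",
    "Nov  4 12:57:30 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 69, Class Error: ChId 0030, Class 0000a097, Offset 00001c80, Data 40000000, ErrorCode 0000000c",
    "Nov  4 23:53:41 unicorn kernel: NVRM: Xid (PCI:0000:0a:00): 12, COCOD 00000050 beef3901 0000a040 000001b8 1f789000",
]

counts = count_xids(sample)
for code, n in sorted(counts.items()):
    print(f"Xid {code}: {n} occurrence(s)")
```

The same function should work on any iterable of lines, for example an open file handle on a saved copy of the kernel log. The counts only show which codes to look up in the Xid documentation; they say nothing by themselves about whether the cause is hardware or software.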

Stupendous Man commented on 10 November 2018 at 3:38 pm UTC

That's a great document, thanks! Apparently I need to practice my Google-Fu as I did not find it before.

I'll investigate and see if I can solve this.

Stupendous Man commented on 11 November 2018 at 11:59 am UTC

Guest: What happens if you run the Unigine Heaven benchmark for a while?
That runs just fine. I've run it 3 times back to back with no errors.
Yesterday I re-seated the RAM sticks and recompiled the Nvidia drivers, and at the moment that seems to have helped. I've been running Euro Truck Simulator 2 for an hour and again, no errors.
I won't call this solved just yet, but it looks promising.

EDIT: I'm still getting errors, and they all have one thing in common: defective hardware. I guess I'll have to get a new GPU. Can anyone recommend a good AMD GPU in the 200-300€ price range? I'm not going with nVidia anymore, I'm tired of the drivers not being open-source.

sr_ls_boy commented on 12 November 2018 at 1:25 pm UTC

Cryptocurrency mining went bust, so there are plenty of used cards on eBay right now. Get whatever you can that will give you good performance with Deus Ex: Mankind Divided. The Vega cards are too expensive for your price point, but you can find good prices on the Polaris line of cards. Try an 8 GB Radeon RX 580.

lucinos commented on 12 November 2018 at 2:37 pm UTC

I am very pleased with my RX 560. In the price range you want, the obvious choice is the RX 580.

Stupendous Man commented on 13 November 2018 at 10:54 pm UTC

In the end my trouble appears to be a faulty RAM stick, not the GPU, even though all errors have been graphics related. Memtest didn't find any errors even after 9 hours of running, but if I remove the faulty stick the system is completely stable; when I put it back in, games crash after 3 minutes. Weird.

Thanks everyone for your suggestions anyway!

Avehicle7887 commented on 13 November 2018 at 11:22 pm UTC

Looks like you ain't parting with that GTX 660 for a little longer. 9 hours is quite the test, I once had a similar issue and the errors only showed up after 2 hours. It's good your nightmare is finally over ;)

