Title: PC freeze with Vega and should I buy a RTX 2060 super?
Dax Tailor 6 Mar 2020
Hi everyone.
A while ago I built a new Ryzen 3000 PC with a Vega 56 GPU. For months now my PC has been freezing randomly. By freezing I mean nothing works anymore: no virtual console, no ping, no kernel SysRq. I have to do a hardware reset. And there is never an error in the kernel log.
I tried everything I could find on the internet (BIOS changes, Kernel Boot options etc.) but nothing helps. Even the RAM is currently on the low default clock.
The main problem is that I can't reproduce it. But it always happens while a game is running. It never happens when I watch videos, for example. So I don't think this is the C6 state problem. It even happens with games that have low CPU and GPU usage, so it is not related to high power usage.
About 2 weeks ago I watched a video from AdoredTV and he said that this happens on Windows too. Because I thought this was a Linux problem I never looked into Windows-related search results :(
However, the Windows problem is related to the Vega and Navi GPUs, not the Ryzen CPU.
I really hate to say this, but I'm very close to buying an Nvidia (2060 Super) card. If this problem does not go away very soon I might even give up gaming on the PC. (Which is a bit of a pity with at least 30 unplayed games on Steam, but I already found a new hobby, building Lego-style sets, and I'm 53.)
Anyway, I think my question at the moment is not so much about how to fix the problem, although I might try if someone has an idea, but: is there any known issue with an RTX 2060 Super on Linux at the moment? I don't want to spend over €450,- when there are other big problems.
Currently running Manjaro unstable? with the 5.5.7 kernel.
Thanks for reading this,
Bye
sr_ls_boy 6 Mar 2020
You could try to ssh into your computer to gather some more information.
Dax Tailor 6 Mar 2020
You mean when it's frozen?
I tried this; no connection is possible, even a ping no longer works.
For some testing I had a little script read out the current GPU state from the /sys directory and display it on the other computer every second. When the gaming PC stops working, the updates stop too.
Btw, the other computer is a Dell laptop with a Pentium 3M, 512MB RAM and a 64GB PATA SSD, and it runs Mint with the i3 WM. (I thought that was a little bit funny to mention.)
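
The once-per-second monitoring described above could look roughly like the following. This is only a minimal sketch, not the script actually used: the file path is taken from the journal excerpt posted later in this thread, while the address, the port and all function names are hypothetical placeholders.

```python
import socket
import time
from pathlib import Path

# Assumed values: the path matches the file named in the journal excerpt
# later in this thread; host and port are placeholders for illustration.
GPU_STATE_FILE = Path("/sys/kernel/debug/dri/0/amdgpu_pm_info")
MONITOR_HOST = "192.0.2.10"  # hypothetical address of the second computer
MONITOR_PORT = 5140          # hypothetical UDP port


def read_state(path: Path) -> str:
    """Return the file contents, or a marker line if the read fails."""
    try:
        return path.read_text(errors="replace")
    except OSError as exc:
        return f"read failed: {exc}\n"


def build_packet(seq: int, now: float, state: str) -> bytes:
    """Prefix the state with a sequence number and a timestamp."""
    header = f"seq={seq} time={now:.3f}\n"
    return (header + state).encode("utf-8")


def monitor(path: Path, host: str, port: int, interval: float = 1.0) -> None:
    """Send one UDP datagram per interval until interrupted."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        seq = 0
        while True:
            packet = build_packet(seq, time.time(), read_state(path))
            sock.sendto(packet, (host, port))
            seq += 1
            time.sleep(interval)


# Example call (debugfs is normally readable by root only):
# monitor(GPU_STATE_FILE, MONITOR_HOST, MONITOR_PORT)
```

On the receiving side, the last sequence number and timestamp that arrived give a rough idea of when the machine stopped responding.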
sr_ls_boy 6 Mar 2020
Maybe ssh in the other direction and dump your kernel logs on your Dell laptop.
I would ask on the [mesa issues](https://gitlab.freedesktop.org/mesa/mesa/issues) page for help.

EDIT
Give us something to read. Start a game and then post the contents of dmesg.
Tell us about your graphics stack. I don't have Manjaro. What version of Mesa?
What version of libdrm? How about the contents of /var/log/Xorg.0.log? Do you
use ACO as your shader compiler?

Last edited by sr_ls_boy on 6 Mar 2020 at 9:17 pm UTC
Dax Tailor 6 Mar 2020
Hmm, I had not thought about that. I have to check how to write the logs over the LAN.
I'm not sure how much of the driver is in Mesa (at least the 3D part). Can Mesa even crash the kernel? But it is worth a look there. I can't recall anything Mesa-related coming up in my Google searches.
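
One way to get kernel messages onto another machine is sketched below. The kernel's own netconsole module is the established tool for this; the script is only an illustration of the idea, and the address, port and function names are assumptions. It reads records from /dev/kmsg, whose header has the form `priority,sequence,timestamp_in_microseconds,flags;message`, and sends each one as a UDP datagram.

```python
import os
import socket

LOG_HOST = "192.0.2.10"  # hypothetical address of the laptop
LOG_PORT = 5141          # hypothetical UDP port


def parse_kmsg(record: str) -> dict:
    """Split one /dev/kmsg record into its header fields and message text."""
    header, _, rest = record.partition(";")
    fields = header.split(",")
    priority = int(fields[0])
    return {
        "level": priority & 7,       # syslog severity, 0 (emerg) to 7 (debug)
        "facility": priority >> 3,
        "seq": int(fields[1]),
        "seconds": int(fields[2]) / 1_000_000,  # timestamp is in microseconds
        "message": rest.split("\n", 1)[0],      # drop key=value continuation lines
    }


def format_line(record: str) -> str:
    """Render a record in a dmesg-like one-line form."""
    info = parse_kmsg(record)
    return f"[{info['seconds']:12.6f}] <{info['level']}> {info['message']}"


def forward_kmsg(host: str, port: int) -> None:
    """Forward every kernel log record to host:port until interrupted."""
    fd = os.open("/dev/kmsg", os.O_RDONLY)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            while True:
                try:
                    # Each read() returns exactly one record.
                    record = os.read(fd, 8192).decode("utf-8", errors="replace")
                except BrokenPipeError:
                    continue  # older records were overwritten before being read
                sock.sendto(format_line(record).encode("utf-8"), (host, port))
    finally:
        os.close(fd)


# Example call (reading /dev/kmsg usually requires root):
# forward_kmsg(LOG_HOST, LOG_PORT)
```

If the machine locks up hard, nothing more will be sent, but whatever reached the laptop before the freeze is preserved there.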
Thanks
sr_ls_boy 6 Mar 2020
Give us a dmesg and a /var/log/Xorg.0.log if able.
Dax Tailor 6 Mar 2020
Ok, you asked for it ;)
(But keep in mind the problem has been there for months now, and the first entries I found during my Google search are over 2 years old.)

Mesa 20.0.1
libdrm 2.4.100
ACO: I don't think so. The only package I found (mesa-aco) is not installed. I guess it's LLVM then?
The kernel command line: oops=panic udev.log_priority=3 audit=0 amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_debug=1 amdgpu.gpu_recovery=1 processor.max_cstate=3 rcu_nocbs=all
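
To check which of these options the running kernel actually received, the line can be compared with /proc/cmdline. The following is a minimal sketch; the helper names are hypothetical, and the parser assumes no quoted values containing spaces.

```python
# The options exactly as posted above.
POSTED_CMDLINE = (
    "oops=panic udev.log_priority=3 audit=0 amdgpu.ppfeaturemask=0xffffffff "
    "amdgpu.vm_debug=1 amdgpu.gpu_recovery=1 processor.max_cstate=3 rcu_nocbs=all"
)


def parse_cmdline(cmdline: str) -> dict:
    """Map each key to its value; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def module_params(params: dict, module: str) -> dict:
    """Select the 'module.option=value' entries for one kernel module."""
    prefix = module + "."
    return {k[len(prefix):]: v for k, v in params.items() if k.startswith(prefix)}


posted = parse_cmdline(POSTED_CMDLINE)
print(module_params(posted, "amdgpu"))

# On the affected machine, the live value could be compared like this:
# with open("/proc/cmdline") as f:
#     running = parse_cmdline(f.read())
# print({k: v for k, v in posted.items() if running.get(k) != v})
```

An empty result from the commented comparison would indicate that every posted option is present on the running kernel with the same value.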

The Xorg log is too long to post here, but it is mostly modelines from AMDGPU. Nothing unusual, I would say.
This log has to go to the laptop too.

This looks a little bit odd. At the end of the Xorg log is this:
[ 11880.078] (II) AMDGPU(0): EDID vendor "GSM", prod id 30436
[ 11880.078] (II) AMDGPU(0): Using EDID range info for horizontal sync
[ 11880.078] (II) AMDGPU(0): Using EDID range info for vertical refresh
[ 11880.078] (II) AMDGPU(0): Printing DDC gathered Modelines:
[ 11880.078] (II) AMDGPU(0): Modeline "3440x1440"x0.0  319.75  3440 3488 3520 3600  1440 1443 1453 1481 +hsync -vsync (88.8 kHz eP)
[ 11880.078] (II) AMDGPU(0): Modeline "3440x1440"x0.0  429.80  3440 3584 3680 3880  1440 1448 1452 1476 +hsync -vsync (110.8 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "3440x1440"x0.0  157.75  3440 3488 3520 3600  1440 1443 1453 1461 +hsync -vsync (43.8 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "2560x1080"x0.0  185.58  2560 2624 2688 2784  1080 1083 1093 1111 -hsync -vsync (66.7 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1280x720"x0.0   74.25  1280 1390 1430 1650  720 725 730 750 +hsync +vsync (45.0 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "720x480"x0.0   27.00  720 736 798 858  480 489 495 525 -hsync -vsync (31.5 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1920x1080"x0.0  148.50  1920 2008 2052 2200  1080 1084 1089 1125 +hsync +vsync (67.5 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "640x480"x0.0   25.18  640 656 752 800  480 490 492 525 -hsync -vsync (31.5 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1920x1080"x0.0  148.50  1920 2448 2492 2640  1080 1084 1089 1125 +hsync +vsync (56.2 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1280x720"x0.0   74.25  1280 1720 1760 1980  720 725 730 750 +hsync +vsync (37.5 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "720x576"x0.0   27.00  720 732 796 864  576 581 586 625 -hsync -vsync (31.2 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "800x600"x0.0   40.00  800 840 968 1056  600 601 605 628 +hsync +vsync (37.9 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "640x480"x0.0   31.50  640 656 720 840  480 481 484 500 -hsync -vsync (37.5 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1280x1024"x0.0  135.00  1280 1296 1440 1688  1024 1025 1028 1066 +hsync +vsync (80.0 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1024x768"x0.0   78.75  1024 1040 1136 1312  768 769 772 800 +hsync +vsync (60.0 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1024x768"x0.0   65.00  1024 1048 1184 1344  768 771 777 806 -hsync -vsync (48.4 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "832x624"x0.0   57.28  832 864 928 1152  624 625 628 667 -hsync -vsync (49.7 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "800x600"x0.0   49.50  800 816 896 1056  600 601 604 625 +hsync +vsync (46.9 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1152x864"x0.0  108.00  1152 1216 1344 1600  864 865 868 900 +hsync +vsync (67.5 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1152x864"x60.0   81.75  1152 1216 1336 1520  864 867 871 897 -hsync +vsync (53.8 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1280x1024"x0.0  108.00  1280 1328 1440 1688  1024 1025 1028 1066 +hsync +vsync (64.0 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1600x900"x59.9  118.25  1600 1696 1856 2112  900 903 908 934 -hsync +vsync (56.0 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1680x1050"x0.0  146.25  1680 1784 1960 2240  1050 1053 1059 1089 -hsync +vsync (65.3 kHz e)
[ 11880.078] (II) AMDGPU(0): Modeline "1280x800"x0.0   83.50  1280 1352 1480 1680  800 803 809 831 -hsync +vsync (49.7 kHz e)


Repeated 3 times with different time stamps.

Searching the Mesa issues site I found this: [Random crash on amdgpu due to temperature missrepoorting](https://gitlab.freedesktop.org/mesa/mesa/issues/1044)
Sounds interesting. I will try what he/she wrote to log this.

Thanks.
sr_ls_boy 6 Mar 2020
Try [comment 23](https://bugs.freedesktop.org/show_bug.cgi?id=105251#c23) and set GALLIUM_DDEBUG.
Also consider posting dmesg and the Xorg log and use the spoiler tags. I get those modelines as well.
damarrin 7 Mar 2020
Can you borrow an Nvidia card off someone? If it works ok you'll know what's what.
Dax Tailor 8 Mar 2020
@damarrin
I have an old GTX 970 that I tried. Because the freezing is not reproducible, and the RTX 2060 certainly has a different driver, I can't tell from the old card whether the RTX will work or not. I used the GTX 970 for a couple of years in my old Intel-based PC with Linux Mint and Arch Linux and never ran into this kind of problem. And I don't know anybody with an RTX card.

@sr_ls_boy
I tried what was suggested in comment 23 and set GALLIUM_DDEBUG. I played some games yesterday and let some games run in demo mode, but the PC never froze. Not sure if the problem solved itself or not. There is one difference: I'm running the 5.5.8 kernel now (it came with a Manjaro update), and according to the kernel changelog some things were fixed in the AMDGPU driver.
Comment 23 suggested running these commands if the error occurs:
sudo umr -lb > umr_dump
sudo umr -O verbose,use_colour -R gfx[.] >> umr_dump
sudo umr -O halt_waves,use_colour -wa >> umr_dump

I tried this and the 2nd one instantly reboots my PC. (These commands do not work with zsh, by the way.)
However, I don't know how to run this when the PC is frozen.
I'm still not so sure this is a driver problem. I mean, the Windows driver is based on different source code. I'm not sure, but I think the driver developers for Windows and Linux are two different teams.

I will post more, if I have more info on this. Not sure how many people with an AMD Vega card are reading this post and have the same problems. According to a poll from Hardware Unboxed [Can We Still Recommend Radeon GPUs?](https://youtu.be/1uynVO4ZXl0) there are about 19% AMD GPU users with problems.
I forgot to post the link to [Still Something Wrong At Radeon](https://youtu.be/_x-QSi_yvoU) from AdoredTV.

Thanks for your help.
Dax
damarrin 8 Mar 2020
I’m pretty sure a 20xx card uses the same driver as a 9xx card.
Dax Tailor 8 Mar 2020
@damarrin
Sure, they use the same driver installation blob, but they are different HW architectures. I would be very surprised if there were no differences in the driver.

So, I just played Minecraft for a while and it happened again. The PC froze. This time I had an ssh connection open the whole time and this was dead.
The Xorg log shows only these lines at the end. You can see the time difference.

[ 59.125] (II) AMDGPU(0): Modeline "1280x800"x0.0 83.50 1280 1352 1480 1680 800 803 809 831 -hsync +vsync (49.7 kHz e)
[ 5691.334] (EE) client bug: timer event5 debounce: scheduled expiry is in the past (-0ms), your system is too slow


The journal gives this in the end (The pam messages are from the monitoring I did every second.):
Mär 08 10:56:03 moritz sudo[52433]: alfred : TTY=pts/2 ; PWD=/home/alfred ; USER=root ; COMMAND=/usr/bin/cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Mär 08 10:56:03 moritz sudo[52433]: pam_unix(sudo:session): session opened for user root by alfred(uid=0)
Mär 08 10:56:03 moritz sudo[52433]: pam_unix(sudo:session): session closed for user root
Mär 08 10:56:04 moritz sudo[52442]: alfred : TTY=pts/2 ; PWD=/home/alfred ; USER=root ; COMMAND=/usr/bin/cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Mär 08 10:56:04 moritz sudo[52442]: pam_unix(sudo:session): session opened for user root by alfred(uid=0)
Mär 08 10:56:04 moritz sudo[52442]: pam_unix(sudo:session): session closed for user root
Mär 08 10:56:05 moritz sudo[52452]: alfred : TTY=pts/2 ; PWD=/home/alfred ; USER=root ; COMMAND=/usr/bin/cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Mär 08 10:56:05 moritz sudo[52452]: pam_unix(sudo:session): session opened for user root by alfred(uid=0)
Mär 08 10:56:05 moritz sudo[52452]: pam_unix(sudo:session): session closed for user root
Mär 08 10:56:06 moritz sudo[52461]: alfred : TTY=pts/2 ; PWD=/home/alfred ; USER=root ; COMMAND=/usr/bin/cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Mär 08 10:56:06 moritz sudo[52461]: pam_unix(sudo:session): session opened for user root by alfred(uid=0)
Mär 08 10:56:06 moritz sudo[52461]: pam_unix(sudo:session): session closed for user root
Mär 08 10:56:07 moritz sudo[52470]: alfred : TTY=pts/2 ; PWD=/home/alfred ; USER=root ; COMMAND=/usr/bin/cat /sys/kernel/debug/dri/0/amdgpu_pm_info
Mär 08 10:56:07 moritz sudo[52470]: pam_unix(sudo:session): session opened for user root by alfred(uid=0)
Mär 08 10:56:07 moritz sudo[52470]: pam_unix(sudo:session): session closed for user root
-- Reboot --
Mär 08 10:58:21 moritz kernel: Linux version 5.5.8-1-MANJARO (builder@216fb1516504) (gcc version 9.2.1 20200130 (Arch Linux 9.2.1+20200130-2)) #1 SMP PREEMPT Thu Mar 5 20:29:51 UTC 2020
Mär 08 10:58:21 moritz kernel: Command line: BOOT_IMAGE=/vmlinuz-5.5-x86_64 root=UUID=7f7d3134-e671-4bf4-b00c-dac4ecf90413 rw oops=panic udev.log_priority=3 audit=0 amdgpu.ppfeaturemask=0xffffffff amdgpu.vm_debug=1 amdgpu.vm_fault_stop=2 amdgpu.gpu_recovery=1 processor.max_cstate=3 rcu_nocbs=all

I think this has cost enough of my (and your) time already. I have spent at least 30 hours on this by now, and every time I think it's working, it happens again. I will order an RTX 2060 Super today.
Putting some time into finding a solution for a problem is not an issue if there are at least some hints about what's going on. But this situation is not what I have in mind when I want to play a game after working the whole day writing software.
(Anyone want to buy a Vega 56 Sapphire Pulse? :)

Thank you all for your support
Dax
damarrin 8 Mar 2020
Well, the purpose of changing the gfx card is to see if your computer continues hanging with a different card and if it doesn’t you’ll know the Vega is at fault and not something else.
Dax Tailor 8 Mar 2020
That was how I understood you. Of course there could be other problems with the PC (mainboard, power supply etc.), but the freezing is related to having some OpenGL application running. It never happened while I watched YouTube, read mail or used Firefox, unless a game was running in the background. In addition to that, there are a lot of people reporting the same problems with Vega and Navi cards.
The order for a new graphics card went out 10 minutes ago. (Because it is being sent to my mother I will get it next Saturday.)
It is not because I want an Nvidia card. When the Vega is working it does a great job and I'm actually happy with the performance, and even the fan is barely noticeable.
If it's not the GPU, then I will buy other components. In the end I might have 3 PCs here and only one working :'(
The reason I chose an AMD GPU is that I don't like Nvidia's policies, but as I said in my first post, I might end up not using a PC for games at all anymore. I'm not at this point just yet.

Dax
debiangamer 14 Mar 2020
I have bought only Asus graphics cards and motherboards after decades of building my own computers. Asus has good product quality and a 3-year warranty here. Sapphire products have a 2-year warranty. You use Vulkan with DXVK, and many people mostly play Windows games.
Dax Tailor 14 Mar 2020
Years ago I had an ASUS mainboard which had some problems with the RAM. I think it was a Socket A board. The ASUS support was very kind and tried to help me, but as far as I remember the problem could never be solved. I then tried MSI, and since then MSI boards have been the best way to go for me. Before ASUS, Gigabyte was the best, or so I thought.
The GPU card manufacturers I have used so far are ELSA, ASUS, Gigabyte, MSI, Sapphire and Palit (there is still a GTX 560 Ti on the shelf). The RTX 2060 Super I ordered is from Palit. (I will get it today.)
The MSI GTX 970 has a bug in the fan control which is known to MSI but has never been fixed. From time to time one fan stops spinning and the other goes up to full speed. I never closed the case of my last PC because I had to give the stalled fan a short nudge to get it spinning again.

What I'm saying is that the manufacturer is not how someone should select PC components. From time to time every company produces a bad component. Checking tests is the best way I know. Of course not all tests are without bias and some are very bad. A while ago I found the YouTube channel IgorsLab (in German). He uses very high-end equipment to test HW. I had never found anyone who actually measured the 10 ms peak power consumption of GPUs, which could be a problem for the power supply.
(No, I'm not going to start writing about power supplies; that would end up as a short novel :)

I don't think the problem I have with the Vega is related to Sapphire but to AMD. But it would be interesting to know whether some manufacturers are more likely to have this problem. As I wrote, my PC freezes completely. That means the Linux kernel is not running anymore. I'm not sure a driver can actually do this. My hunch is that the GPU is holding the DMA or an IRQ, or causes some noise on the power supply, so the CPU stops working.

Thanks for reading,
Dax

PS.
As soon as I have more information I will post it here. I might buy other components to build a 2nd PC for the Vega card to test this. Maybe this is not one problem but a combination of more than one.

PPS.
I work as a SW developer on embedded systems, and yesterday I could finally build the test system to evaluate the new APU board we would like to use in the next generation of our devices. It's an [AMD R1605B](https://en.wikichip.org/wiki/amd/ryzen_embedded/v1605b) APU with Vega graphics. First tests, using Debian unstable, look good. Hopefully this APU does not have the same problems I have. Our devices run 24/7 in industrial production lines.
Dax Tailor 7 Apr 2020
Just a short update.

The 2060 has now been in use for over 2 weeks without a single freeze. I think it's safe to say the Vega has a problem. Whether it is hardware or driver is still an open question.

I'm working from home at the moment. My PC is running much longer than usual.