Working around Ryzen CPU freezes
Shmerl Mar 1, 2018
For a long time, many Ryzen CPUs have been affected by random freezes and reboots, which some users managed to narrow down to C6 power states. Even an RMA often didn't help with these.

Recently I found an actual kernel bug report about it: https://bugzilla.kernel.org/show_bug.cgi?id=196683

Apparently, AMD said that they are going to release a microcode update (or motherboard manufacturers are going to release firmware updates?) that handles this. Until that happens, you can work around it without disabling C6 states by building the kernel with

CONFIG_RCU_NOCB_CPU=y

And then use the rcu_nocbs=0-... kernel boot parameter to enable it.

I was bugged by this issue for a long time, and recently decided to build a kernel like above. It happens to be quite easy with Debian. It indeed works around the problem.

Here is an example HOWTO on doing it:
_______________________________________________________________________
If this is useful to anyone, here is what I did on Debian testing:

Install current Linux source package and some tools and libraries:

sudo apt-get install linux-source build-essential kernel-package libncurses5-dev libelf-dev libssl-dev

Note: in my case I pulled linux-source from Debian sid, since the default kernel in testing had not yet been bumped to Linux 4.15.x, even though 4.15 is available in general.

Unpack the source, for example to $HOME/build.

I like using explicit tar parameters even though they are long. It makes the command much more readable and easier to understand:

linux_ver="4.15"
config_ver="4.15.0-1-amd64"

linux_src="/usr/src/linux-source-${linux_ver}.tar.xz"
build_dir="${HOME}/build"
source_dir="${build_dir}/linux-source-${linux_ver}"

mkdir -p "${build_dir}"
tar --extract --verbose --use-compress-program "xz --decompress" --file "${linux_src}" --directory "${build_dir}"
cd "${source_dir}"


Now you are in the unpacked source directory. You need to create a proper .config file. My goal was to make only minimal tweaks to the stock Debian kernel, so I copied the default config from /boot first:

cp -v "/boot/config-${config_ver}" "${source_dir}/.config"

Now, you need to enable the actual workaround. There is a useful tool for configuring kernel parameters - menuconfig (that's why libncurses5-dev was needed above).

make menuconfig

It will build the tool in place and run it. Find the RCU options under General Setup > RCU Subsystem





And there enable "Make expert-level adjustments to RCU configuration" and "Offload RCU callback processing from boot-selected CPUs"



Then select Save (into .config), and Exit a few times to leave the tool.

One more thing is needed: comment out CONFIG_SYSTEM_TRUSTED_KEYS in the resulting ${source_dir}/.config, otherwise the build will fail.
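Before building, it may be worth confirming that both options actually landed in .config. Here is a minimal sketch of such a check. The function name check_rcu_config is just something I made up for illustration, and I'm assuming the two menu entries correspond to the CONFIG_RCU_EXPERT and CONFIG_RCU_NOCB_CPU symbols:

```shell
# Report whether the two RCU options are set to "y" in a kernel .config file.
# Takes the path to the config file; defaults to .config in the current directory.
# Returns non-zero if either option is missing.
check_rcu_config() {
    rcu_cfg_file="${1:-.config}"
    rcu_cfg_status=0
    for rcu_opt in CONFIG_RCU_EXPERT CONFIG_RCU_NOCB_CPU; do
        if grep -q "^${rcu_opt}=y\$" "$rcu_cfg_file"; then
            echo "${rcu_opt}: enabled"
        else
            echo "${rcu_opt}: NOT enabled"
            rcu_cfg_status=1
        fi
    done
    return "$rcu_cfg_status"
}
```

Run it from the source directory as check_rcu_config .config. If either line says NOT enabled, go back into make menuconfig before starting the build.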

perl -p -i -e 's/^CONFIG_SYSTEM_TRUSTED_KEYS=/#CONFIG_SYSTEM_TRUSTED_KEYS=/' .config

Now you are ready to build it (I chose suffixes -rcu, and -1 for versions):

make -j$(nproc) deb-pkg LOCALVERSION=-rcu KDEB_PKGVERSION=$(make kernelversion)-1

Press Enter a few times to accept the defaults for the configuration prompts caused by the modifications, and it should proceed with the build. After it completes, you should have the finished packages in $build_dir:

cd ..
ls -1 linux-image*
linux-image-4.15.4-rcu_4.15.4-1_amd64.deb
linux-image-4.15.4-rcu-dbg_4.15.4-1_amd64.deb
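To make it clearer how LOCALVERSION and KDEB_PKGVERSION end up in the file name, here is a small sketch that rebuilds the expected name from its parts. The pattern is only inferred from the listing above, so treat it as an assumption (other kernel versions or architectures may name things differently), and expected_image_deb is a made-up helper name:

```shell
# Build the expected linux-image package file name.
# Arguments: kernel version (as printed by "make kernelversion"),
#            LOCALVERSION suffix, package revision, architecture (default amd64).
# The naming pattern is inferred from the build output above, not from documentation.
expected_image_deb() {
    deb_kver="$1"
    deb_local="$2"
    deb_rev="$3"
    deb_arch="${4:-amd64}"
    echo "linux-image-${deb_kver}${deb_local}_${deb_kver}-${deb_rev}_${deb_arch}.deb"
}
```

For the build in this post, expected_image_deb 4.15.4 -rcu 1 gives the name of the first package in the listing.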


Install the result:

sudo dpkg -i linux-image-4.15.4-rcu_4.15.4-1_amd64.deb

Now open /etc/default/grub with your editor (under sudo), and add this to your kernel boot parameters (assuming you have an 8-core / 16-thread processor):

GRUB_CMDLINE_LINUX_DEFAULT="... rcu_nocbs=0-15"

Here ... means whatever was there before the change. Don't remove it, just add the new parameter after a space!

Save and update the system GRUB configuration:
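If your processor has a different number of threads, the range has to be adjusted. Here is a minimal sketch that prints the parameter for all logical CPUs, numbered from 0. The function name rcu_nocbs_param is made up for illustration, and it assumes nproc from coreutils is available:

```shell
# Print an rcu_nocbs= parameter covering logical CPUs 0..N-1.
# Takes the logical CPU count as an optional argument; defaults to "nproc --all".
rcu_nocbs_param() {
    nocbs_cpus="${1:-$(nproc --all)}"
    echo "rcu_nocbs=0-$((nocbs_cpus - 1))"
}
```

On a 16-thread processor, rcu_nocbs_param (or rcu_nocbs_param 16) prints rcu_nocbs=0-15, which is the value used above.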

sudo update-grub

That's it. Reboot the system and you are ready to use the workaround kernel.
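After the reboot you may want to confirm that the parameter really reached the kernel. Here is a rough sketch. has_rcu_nocbs is a made-up helper that only looks at a command line string, so it says nothing about whether the kernel was built with the option:

```shell
# Succeed if the given kernel command line contains an rcu_nocbs= parameter.
# Pass the command line as a single string, e.g. the contents of /proc/cmdline.
has_rcu_nocbs() {
    case " $1 " in
        *" rcu_nocbs="*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Use it as has_rcu_nocbs "$(cat /proc/cmdline)" && echo "parameter is set". Also check that uname -r shows the -rcu suffix, so you know you booted the new kernel and not the stock one.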
Xpander Mar 1, 2018
I haven't had any issues with it since I disabled the C6 power state in the BIOS. I also have the AGESA 1.0.0.0a BIOS.
Weird that some need to enable kernel configs and boot parameters. My CPU is a launch one, so in theory it should also have that segfault bug when lots of parallel compiles are running. Though I have run that ryzen-kill test for 2 hours twice and for 4 hours once with no issues.

xpander@arch ~ $ uptime
 19:29:19 up 7 days, 21:01,  1 user,  load average: 1.52, 1.43, 0.85

xpander@arch ~ $ inxi -F
System:    Host: arch Kernel: 4.15.4-1-ARCH x86_64 bits: 64 Desktop: MATE 1.20.0
           Distro: Arch Linux
Machine:   Device: desktop Mobo: ASUSTeK model: PRIME X370-PRO v: Rev X.0x serial: N/A
           UEFI [Legacy]: American Megatrends v: 3803 date: 01/22/2018
CPU:       8 core AMD Ryzen 7 1700X Eight-Core (-MT-MCP-) cache: 4096 KB
           clock speeds: max: 3925 MHz 1: 1881 MHz 2: 1813 MHz 3: 2691 MHz 4: 2382 MHz
           5: 1842 MHz 6: 1855 MHz 7: 1862 MHz 8: 1873 MHz 9: 2958 MHz 10: 2413 MHz
           11: 1709 MHz 12: 1710 MHz 13: 1727 MHz 14: 1716 MHz 15: 1710 MHz
           16: 1711 MHz
Graphics:  Card: NVIDIA GP104 [GeForce GTX 1070]
           Display Server: x11 (X.Org 1.19.6 ) driver: nvidia
           Resolution: [email protected][email protected]
           OpenGL: renderer: GeForce GTX 1070/PCIe/SSE2 version: 4.6.0 NVIDIA 390.25
Audio:     Card-1 NVIDIA GP104 High Def. Audio Controller driver: snd_hda_intel
           Card-2 M-Audio driver: USB Audio
           Card-3 Focusrite-Novation Focusrite Scarlett 2i2 driver: USB Audio
           Sound: Advanced Linux Sound Architecture v: k4.15.4-1-ARCH
Network:   Card: Intel I211 Gigabit Network Connection driver: igb
           IF: enp8s0 state: up speed: 1000 Mbps duplex: full mac: 60:45:cb:9a:09:31
Drives:    HDD Total Size: 7751.6GB (64.3% used)
           ID-1: /dev/nvme0n1 model: Samsung_SSD_960_EVO_250GB size: 250.1GB
           ID-2: /dev/sdc model: ST3000DM001 size: 3000.6GB
           ID-3: /dev/sda model: Samsung_SSD_850 size: 500.1GB
           ID-4: /dev/sdb model: ST2000DM001 size: 2000.4GB
           ID-5: /dev/sdd model: ST2000DM001 size: 2000.4GB
Partition: ID-1: / size: 230G used: 129G (59%) fs: ext4 dev: /dev/nvme0n1p1
Sensors:   System Temperatures: cpu: 38.0C mobo: 29.0C gpu: 31C
           Fan Speeds (in rpm): cpu: 0 sys-1: 1163 sys-2: 919
Info:      Processes: 345 Uptime: 7 days Memory: 11726.2/32166.7MB
           Client: Shell (bash) inxi: 2.3.56

Shmerl Mar 1, 2018
Quoting: Xpander
I haven't had any issues with it after i disabled C6 power state in the BIOS. also have AGESA 1.0.0.0a BIOS
weird that some need to enable kernel configs and boot parameters.

I updated the first post with some details. Disabling C6 is really a very crude fix (which in the case of my MB firmware doesn't even work, since C6 doesn't get disabled). It causes higher CPU temperature and power usage. The RCU workaround helps without disabling C6, so the temperature is practically unaffected.
Xpander Mar 1, 2018
Judging by the clock numbers, mine doesn't disable C6 either. I can see the cores go down as far as 1.7 GHz, though they should only go down to 2.2 GHz with C6 disabled. I didn't notice any temperature increase either. So this is a bit of a weird one. Maybe Arch adds this to the kernel by default, though I don't see any kernel parameters.

Edit: I also had lots of those computer freezes or black screens when I was on 4.14 kernels, and the C6 setting didn't help. But I have no idea whether 4.15 made them go away or some BIOS/microcode updates did.
Shmerl Mar 2, 2018
Quoting: Xpander
edit: also i had lots of those computer freezes or black screens when i was on 4.14 kernels and the C6 setting didnt help. But i have no idea if 4.15 made it go away or some BIOS/microcode updates.

What is your current AGESA version and microcode? You should be able to see the first somewhere in the firmware, and the second like this:

grep microcode /proc/cpuinfo
microcode       : 0x8001129
...


So I currently have 0x8001129.
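Since the grep prints one line per logical CPU, a condensed version can be easier to read. This is a small sketch. microcode_revisions is a made-up helper name, and it assumes the usual "microcode : value" layout of /proc/cpuinfo shown above:

```shell
# Print the distinct microcode revisions found in a cpuinfo-formatted file.
# Takes the file path as an optional argument; defaults to /proc/cpuinfo.
microcode_revisions() {
    awk -F': *' '/^microcode/ { print $2 }' "${1:-/proc/cpuinfo}" | sort -u
}
```

Normally this prints a single value. More than one line would mean the cores report different revisions.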
nox Mar 2, 2018
What kind of freezes are we talking about here?
I have the same ryzen as xpander, and I haven't had any issues to speak of at all. So, this intrigues me!
Shmerl Mar 2, 2018
Complete system freezes, you can't even access the system remotely over ssh when they happen. This is a hardware problem, and not every chip has it. So you might have a good one.
Xpander Mar 3, 2018
Same microcode version as yours, and AGESA 1.0.0.0a.
When I had freezes, I think it might have been because of my RAM OC or a BIOS that was just bad.
But yeah, no issues since the 4.15 kernels. I updated the BIOS around the same time though, so I don't know which of those gave me the stability.
SirBubbles Mar 3, 2018
Would this have anything to do with weird momentary freezes when doing just about anything under GNOME Shell? I mean, I'm on Ubuntu 17.10 with GNOME Shell, NVIDIA drivers, kernel 4.13.0-36, and I often get freezes of around 5-10 seconds at random. No idea of the cause, but I do have a Ryzen 1700 at 3.7 GHz. Any idea if this is the same issue?
(*EDIT* Note that I don't get mystery lock-ups such as you're describing here, so it might be a separate issue.)
GustyGhost Mar 3, 2018
I am affected by this bug. From what I understand, it only affected the first few months' production of Ryzen (Summit Ridge) chips. The flaw was supposedly fixed for chips manufactured in the following quarter, but don't quote me on that.

All in all, the freezes only occur maybe once a week with moderate use. Not ideal but I'm not about to go recompiling my kernel over it either.
Shmerl Mar 4, 2018
Browsing around my ASRock X370 Taichi firmware settings, I found this one:

Advanced > AMD CBS > Zen Common Options > Power Supply Idle Control.

I changed it from auto to low. Let's see if it helps with the stock kernel.