Working around Ryzen CPU freezes
devnull commented on 8 May 2018 at 12:08 am UTC

Hmm.. I was about to create a new thread, but it might actually be related to this. Has anyone seen the same thing with other Ryzen CPUs? I get this weird latency on one of the four CCXs of a 1950X. It's weird in that it's almost exactly double. I can work around it a bit by bumping the clock on those CPUs. Possibly another architecture "quirk" (aka bug): runtimes of processes where inter-CCX thread migrations happen are in general almost double when the target CPU speed is lower (XFR/boost, or intentional).

i.e.

sCCX | dCCX | sCPU | dCPU | Result
-----+------+------+------+-------------------------
  0  |  0   |  0   |  16  | Fast - expected
  0  |  0   |  0   |   1  | Fast - expected
  0  |  1   |  0   |   8  | Slow - problem
  0  |  1   |  0   |  24  | Slow - problem
  0  |  1   |  0   |  12  | Slower - expected though
  0  |  1   |  0   |  28  | Slower - expected though

I can go into more detail about the test itself, but it's basically a timed busy loop in bash. CPU threads are set realtime, nohz, etc.; everything I could do to isolate them. While it's not scientific (+/- 40ms), it doesn't have to be when one is looking at a 480ms difference.

Shmerl commented on 8 May 2018 at 1:41 am UTC

I didn't specifically analyze it, so I can't say. I can run your test and compare the results.

wolfyrion commented on 8 May 2018 at 11:53 am UTC

Hi,
I also own a 1950X, with the ASRock Fatal1ty X399 Professional Gaming motherboard.
I haven't had any problems at all so far with everything I've thrown at this monster.

I am running it at 3.6GHz on all cores, with memory at 3200MHz.
(My only problem is that it gets a bit hot when compiling stuff; for example, compiling Unreal Engine it goes up to 85°C with a water cooler. Idle ranges from 35-50°C.)

Also, I am using the latest BIOS from ASRock:
http://www.asrock.com/mb/AMD/Fatal1ty%20X399%20Professional%20Gaming/index.asp#BIOS

Here is an inxi; in case you need me to test something, let me know.
[inxi output screenshot]
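For scripting a check on those compile-time temperatures, the k10temp hwmon interface can be polled; a sketch, assuming the k10temp driver is loaded (the hwmon index varies per machine, so it matches by name):

```shell
#!/bin/bash
# Print the CPU temperature from whichever hwmon node belongs to k10temp.
for h in /sys/class/hwmon/hwmon*; do
  name=$(cat "$h/name" 2>/dev/null)
  if [ "$name" = "k10temp" ]; then
    t=$(cat "$h/temp1_input")                            # millidegrees C
    printf 'k10temp: %d.%03d C\n' $((t / 1000)) $((t % 1000))
  fi
done
```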

devnull commented on 8 May 2018 at 6:08 pm UTC

It isn't pretty, but it was created as a working test case.. more like a working draft.



#!/bin/bash
# Pin a parent/child pair of processes to chosen CPUs at chosen clocks,
# then time a busy loop to expose inter-CCX migration latency.

sysp=/sys/bus/cpu/devices

# Force a CPU to its maximum clock by raising its minimum frequency.
fast() {
  cpu=cpu$1
  read ttfreq < $sysp/$cpu/cpufreq/cpuinfo_max_freq
  echo $ttfreq > $sysp/$cpu/cpufreq/scaling_min_freq
}

# Reset a CPU's minimum frequency to its hardware minimum.
slow() {
  cpu=cpu$1
  read ttfreq < $sysp/$cpu/cpufreq/cpuinfo_min_freq
  echo $ttfreq > $sysp/$cpu/cpufreq/scaling_min_freq
}

# $1 = PID, $2 = CPU: pin the PID to the CPU and raise its priority.
priority() {
  taskset -pc $2 $1
  renice -20 -p $1   # renice the PID, not the CPU number
}

doit() {
  read ttcpu < $sysp/cpu${1}/cpufreq/scaling_min_freq
  read tpcpu < $sysp/cpu${2}/cpufreq/scaling_min_freq

  printf "OUT: Parent:%s:%s Target:%s:%s\n" "$pcpu" "$tpcpu" "$tcpu" "$ttcpu"

  # The timed busy loop itself.
  # for i in {1..200000}
  for i in {1..50000}
  do
    :
  done
}

usage() {
printf "%s" "
Required:
        --target [ fast | slow ]:TargetCPU
        --source [ fast | slow ]:SourceCPU

"
exit 1
}

get_args() {
  while [[ "$1" ]]; do
    case "$1" in
      "--target") tcpu=${2/*:/}; tcmd=${2/:*/} ;;
      "--source") pcpu=${2/*:/}; pcmd=${2/:*/} ;;
      *) printf "Unknown option: %s\n" "$1"; usage ;;
    esac
    shift 2
  done
}

# Unused alternative to get_args: positional [ fast | slow ]:CPU pairs.
zz() {
  tcpu=${1/*:/}
  tcmd=${1/:*/}

  shift
  pcpu=${1/*:/}
  pcmd=${1/:*/}
}

if [ $# -eq 4 ]   # expect exactly: --target X:N --source Y:M
then
  printf "## %s\n" "$*"
  get_args "$@" || exit $?

  $tcmd $tcpu           # set target CPU clock (fast/slow)
  priority $$ $tcpu     # pin this script to the target CPU

  $pcmd $pcpu           # set source CPU clock (fast/slow)
  priority $PPID $pcpu  # pin the parent (perf) to the source CPU

  doit $tcpu $pcpu

  slow $pcpu            # restore lowest clocks afterwards
  slow $tcpu
else
  usage
fi


Invocation:
Local
Start on CPU4 @ slowest speed
Test on CPU2 @ slowest speed


perf stat -d -d -d  ./child.sh --target slow:2 --source slow:4  2>&1 | egrep '(OUT:|task-clock)'|xargs


CCX jump
Start on CPU4 @ slowest speed
Test on CPU25 @ slowest speed


perf stat -d -d -d  ./child.sh --target slow:25 --source slow:4  2>&1 | egrep '(OUT:|task-clock)'|xargs
OUT: Parent:4:2200000 Target:25:2200000 278.624873 task-clock (msec) # 0.998 CPUs utilized


The same CCX jump, but with a faster target clock
Start on CPU4 @ slowest speed
Test on CPU25 @ fast speed



perf stat -d -d -d  ./child.sh --target fast:25 --source slow:4  2>&1 | egrep '(OUT:|task-clock)'|xargs
OUT: Parent:4:2200000 Target:25:4100000 140.126839 task-clock (msec) # 0.997 CPUs utilized



Example output:


OUT: Parent:4:2200000 Target:2:2200000 130.223193 task-clock (msec) # 0.994 CPUs utilized


Longer version:

# perf stat -d -d -d ./child.sh --target slow:2 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:2:2200000 130.223193 task-clock (msec) # 0.994 CPUs utilized
# perf stat -d -d -d ./child.sh --target slow:25 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:25:2200000 278.624873 task-clock (msec) # 0.998 CPUs utilized
# perf stat -d -d -d ./child.sh --target fast:25 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:25:4100000 140.126839 task-clock (msec) # 0.997 CPUs utilized


## do all the things...
for spd in fast slow; do for target in $spd:{0..31}; do perf stat -d -d -d  ./child.sh --target $target --source slow:4  2>&1 | egrep '(OUT:|task-clock)'|xargs ; done ; done



Explanation:

OUT: Parent:4:2200000 - CPU and clock we're on now
Target:2:2200000 - CPU and clock we're testing
130.223193 task-clock (msec) # 0.994 CPUs utilized - how long it took

High variance in the time is what I'm after; there shouldn't be much.
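The variance can be quantified with a small wrapper that repeats a run and reports the spread. This is a hypothetical helper, not part of the script above; `timespread` is a made-up name:

```shell
#!/bin/bash
# Hypothetical helper: run a command N times and report the min/max/spread
# of its wall-clock runtime in milliseconds.
timespread() {
  local runs=$1; shift
  local start end
  local -a times=()
  for ((i = 0; i < runs; i++)); do
    start=$(date +%s%N)               # nanoseconds since epoch (GNU date)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    times+=( $(( (end - start) / 1000000 )) )
  done
  printf '%s\n' "${times[@]}" | sort -n |
    awk 'NR==1 { min=$1 } { max=$1 }
         END { printf "min=%dms max=%dms spread=%dms\n", min, max, max-min }'
}

# e.g. five cross-CCX runs of the test:
# timespread 5 ./child.sh --target slow:25 --source slow:4
```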

There are a lot of assumptions made that aren't scripted, since it was a quick test. I'm assuming, for example, that the current cpufreq governor is ondemand, though I've seen the same with conservative. In particular, if you test with fast but find the clock drops back down, it could be the governor, or thermal throttling (neither of which is accounted for atm).

Though I try to pin the test as much as I can, some CPUs are isolated on boot. GRUB line:

nohz_full=0,16,1,17,8,24,9,25,10,26,11,27
rcu_nocbs=0,16,1,17,8,24,9,25,10,26,11,27
isolcpus=0,16,1,17,8,24,9,25,10,26,11,27

That is intentional, as I pin VMs. It also doesn't really affect the test, from what I can tell.
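Whether the isolation actually took effect can be checked against what the kernel parsed at boot; a sketch, assuming a reasonably recent kernel for the sysfs paths:

```shell
#!/bin/bash
# Print an isolation-related sysfs file if it exists, prefixed by its path.
show() { [ -r "$1" ] && printf '%s: %s\n' "$1" "$(cat "$1")"; }

show /sys/devices/system/cpu/isolated    # CPUs covered by isolcpus=
show /sys/devices/system/cpu/nohz_full   # CPUs covered by nohz_full=

# And the raw parameters as the kernel actually received them:
tr ' ' '\n' < /proc/cmdline | grep -E '^(isolcpus|nohz_full|rcu_nocbs)=' || true
```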

Some other things to note:
- X399 AORUS Gaming 7 board
- There is _zero_ scripted thermal monitoring
- BIOS supports setting custom power states
- C6 is disabled, so 2.2GHz is the lowest clock for me; ymmv.
- I can hit a faster OC, but it's not needed to validate the test.

Clocks are intentionally reset to the lowest atm due to the way Ryzen works: not all cores can run full OC/XFR, ymmv. That shouldn't be a problem with most governors unless you've pinned them higher; just something to be aware of. I should check what they were before changing them, but I'm lazy.

What kind of makes this worse is that it's running in UMA / "Creator mode", _NOT_ NUMA / "Gaming mode".

Shmerl commented on 9 May 2018 at 3:08 am UTC

Here is what I get with a Ryzen 2700X (CCX jump numbers adjusted):

perf stat -d -d -d ./child.sh --target slow:2 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:2:2200000 99.580689 task-clock (msec) # 0.740 CPUs utilized


perf stat -d -d -d ./child.sh --target slow:15 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:15:2200000 102.475220 task-clock (msec) # 0.995 CPUs utilized


perf stat -d -d -d ./child.sh --target fast:15 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:15:3700000 96.033168 task-clock (msec) # 0.992 CPUs utilized


But maybe your issue is Threadripper-specific.

devnull commented on 9 May 2018 at 3:54 am UTC

Something is weird with those. If the layout of the 2700X is the same (lstopo or "lscpu --all -y --extended" is awesome for this), you want to test between:

0-> {0..3,8..15}
1-> {4..7,16..23}

It's interesting that you still see a few ms gain, though.
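For checking the layout without lstopo, the shared-L3 sets in sysfs give the same answer: on Zen each CCX has its own L3, so CPUs that share an L3 are on the same CCX. A sketch, assuming `cache/index3` is the L3 on your kernel:

```shell
#!/bin/bash
# List each logical CPU with its core id and the CPUs it shares an L3 with.
for c in /sys/devices/system/cpu/cpu[0-9]*; do
  cpu=${c##*/cpu}
  core=$(cat "$c/topology/core_id" 2>/dev/null || echo '?')
  l3=$(cat "$c/cache/index3/shared_cpu_list" 2>/dev/null || echo '?')
  printf 'cpu%-3s core=%-3s shares-L3-with=%s\n' "$cpu" "$core" "$l3"
done
```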

Shmerl commented on 9 May 2018 at 3:57 am UTC

2700X has only 8 physical cores (16 virtual).

lstopo is a neat tool - never heard of it before.

devnull commented on 9 May 2018 at 5:22 am UTC

Hmm.. brain fart. Aren't you still testing within the same CCX, though? The first one is close, but you didn't include fast.

Assuming it's:

             CCX0    |    CCX1
cores:    0 1  2  3  |  4  5  6  7
---------------------+---------------
threads:  0 1  2  3  |  4  5  6  7
          8 9 10 11  | 12 13 14 15


lstopo/hwloc is quite handy indeed. It misses some things, like identifying NVMe drives (it lists the bus, of course), but you can export the topology to XML and add whatever you want. I have VMs mapped into it, for example. Almost like porn on massive servers.

I don't know why the forum is eating that ascii. It looks fine in preview but the post gets garbled.. hm.
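The XML round-trip looks roughly like this (hwloc tools; `topo.xml`/`topo.png` are placeholder names):

```shell
lstopo topo.xml                    # export the topology (format inferred from .xml)
# ...annotate topo.xml by hand, e.g. with VM mappings...
lstopo --input topo.xml topo.png   # render the annotated topology
```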

devnull commented on 10 May 2018 at 10:56 am UTC

Possibly related: it appears I wasn't the only one to notice this. There have been some scheduling changes in 4.16. @Shmerl, were your tests above still on 4.15?

I've seen something quite similar with Samsung NVMe drives. The latency remains high because the drive remains in a lower power state.

From ioping; the drop is right after starting dd in another terminal.

4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=19 time=5.59 ms
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=20 time=5.81 ms (slow)
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=21 time=5.66 ms (slow)
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=22 time=5.80 ms (slow)
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=23 time=5.62 ms
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=24 time=5.78 ms (slow)
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=25 time=192.7 us (fast)
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=26 time=180.0 us (fast)
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=27 time=69.9 us (fast)
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=28 time=54.1 us (fast)
4 KiB <<< /dev/nvme1n1 (block device 953.9 GiB): request=29 time=56.1 us (fast)
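For reference, that symptom matches NVMe autonomous power state transitions (APST): the drive parks in a deep power state until sustained I/O wakes it. A commonly used check and workaround, assuming nvme-cli is installed and run as root:

```shell
# Inspect the drive's APST configuration (feature 0x0c, human-readable):
nvme get-feature -f 0x0c -H /dev/nvme1n1

# Frequently cited workaround: cap the deepest allowed APST latency on the
# kernel command line (0 disables the deeper states entirely), then reboot:
#   nvme_core.default_ps_max_latency_us=0
```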

Shmerl commented on 11 May 2018 at 1:26 am UTC

I think I upgraded to 4.16 before running the tests.
