i.e.:

sCCX | dCCX | sCPU | dCPU | Result
-----|------|------|------|-------------------------
  0  |  0   |  0   |  16  | Fast - expected
  0  |  0   |  0   |  1   | Fast - expected
  0  |  1   |  0   |  8   | Slow - problem
  0  |  1   |  0   |  24  | Slow - problem
  0  |  1   |  0   |  12  | Slower - expected though
  0  |  1   |  0   |  28  | Slower - expected though
I can go into more detail about the test itself, but it's basically a timed busy loop in bash. CPU threads set realtime, nohz, etc. - everything I could do to isolate them. While not scientific (+/- 40 ms), it doesn't have to be when one is looking at a 480 ms difference.
I also own a 1950X, with the ASRock Fatal1ty X399 Professional Gaming motherboard.
I haven't had any problems at all so far with everything I have thrown at this monster :P
I am running it at 3.6 GHz on all cores, memory at 3200 MHz.
(My only problem is that it gets a bit hot when compiling stuff - compiling Unreal Engine, for example, takes it up to 85°C with a water cooler; idle ranges from 35-50°C.)
I am also using the latest BIOS from ASRock:
http://www.asrock.com/mb/AMD/Fatal1ty%20X399%20Professional%20Gaming/index.asp#BIOS
Here is an inxi; in case you need me to test something, let me know.
#!/bin/bash
sysp=/sys/bus/cpu/devices

# Pin a CPU's minimum frequency to its hardware maximum (force full clock).
fast() {
    cpu=cpu$1
    read -r ttfreq < "$sysp/$cpu/cpufreq/cpuinfo_max_freq"
    echo "$ttfreq" > "$sysp/$cpu/cpufreq/scaling_min_freq"
}

# Drop a CPU's minimum frequency back to its hardware minimum.
slow() {
    cpu=cpu$1
    read -r ttfreq < "$sysp/$cpu/cpufreq/cpuinfo_min_freq"
    echo "$ttfreq" > "$sysp/$cpu/cpufreq/scaling_min_freq"
}

# Pin a PID ($1) to a CPU ($2) and raise its priority.
priority() {
    taskset -pc "$2" "$1"
    renice -20 -p "$1"    # was "renice -20 -p $2": renice takes the PID, not the CPU
}

# Report the clocks we're pinned at, then burn time in a busy loop.
doit() {
    read -r ttcpu < "$sysp/cpu${1}/cpufreq/scaling_min_freq"
    read -r tpcpu < "$sysp/cpu${2}/cpufreq/scaling_min_freq"
    printf "OUT: Parent:%s:%s Target:%s:%s\n" "$pcpu" "$tpcpu" "$tcpu" "$ttcpu"
    # for i in {1..200000}
    for i in {1..50000}
    do
        :
    done
}

usage() {
    printf "%s" "
$@ Required:
    --target
    --source
"
    exit 1
}

## --target [ fast | slow ]:TargetCPU
## --source [ fast | slow ]:SourceCPU
get_args() {
    while [[ "$1" ]]; do
        case "$1" in
            "--target") tcpu=${2/*:/}; tcmd=${2/:*/} ;;
            "--source") pcpu=${2/*:/}; pcmd=${2/:*/} ;;
            *) printf "Unknown option: %s\n" "$1"; usage ;;
        esac
        shift 2
    done
}

# Unused positional variant of get_args, kept for reference.
zz() {
    tcpu=${1/*:/}
    tcmd=${1/:*/}
    shift
    pcpu=${1/*:/}
    pcmd=${1/:*/}
}

if [ $# -eq 4 ]    # was "-le 4", which also let an empty command line through
then
    printf "## %s\n" "$*"
    get_args "$@" || exit $?
    $tcmd "$tcpu"             # set the target CPU's clock (fast/slow)
    priority $$ "$tcpu"       # pin this script to the target CPU
    $pcmd "$pcpu"             # set the source (parent) CPU's clock
    priority $PPID "$pcpu"    # pin the parent shell to the source CPU
    doit "$tcpu" "$pcpu"
    slow "$pcpu"              # reset clocks when done
    slow "$tcpu"
else usage
fi
Invocation:

Local
Start on CPU4 @ slowest speed
Test on CPU2 @ slowest speed
perf stat -d -d -d ./child.sh --target slow:2 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs

CCX jump
Start on CPU4 @ slowest speed
Test on CPU25 @ slowest speed
perf stat -d -d -d ./child.sh --target slow:25 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:25:2200000 278.624873 task-clock (msec) # 0.998 CPUs utilized

Same CCX jump, faster clock
Start on CPU4 @ slowest speed
Test on CPU25 @ fast speed
perf stat -d -d -d ./child.sh --target fast:25 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:25:4100000 140.126839 task-clock (msec) # 0.997 CPUs utilized
Example output:
OUT: Parent:4:2200000 Target:2:2200000 130.223193 task-clock (msec) # 0.994 CPUs utilized

Longer version:
# perf stat -d -d -d ./child.sh --target slow:2 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:2:2200000 130.223193 task-clock (msec) # 0.994 CPUs utilized
# perf stat -d -d -d ./child.sh --target slow:25 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:25:2200000 278.624873 task-clock (msec) # 0.998 CPUs utilized
# perf stat -d -d -d ./child.sh --target fast:25 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:25:4100000 140.126839 task-clock (msec) # 0.997 CPUs utilized

## do all the things...
for spd in fast slow; do for target in $spd:{0..31}; do perf stat -d -d -d ./child.sh --target $target --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs ; done ; done
Explanation:
OUT: Parent:4:2200000 - CPU and clock we're on now
Target:2:2200000 - CPU and clock we're testing
130.223193 task-clock (msec) # 0.994 CPUs utilized - how long it took
High variance in the time is what I'm looking for - there shouldn't be much.
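For a concrete read on those numbers: at the same pinned 2.2 GHz clock, the cross-CCX run takes roughly twice as long as the intra-CCX one. A quick sanity calculation, just reusing the two task-clock figures from the perf output above:

```shell
# Intra-CCX (130.2 ms) vs cross-CCX (278.6 ms) task-clock
# at the same 2.2 GHz pinned clock, from the runs above.
intra=130.223193
cross=278.624873
awk -v a="$intra" -v b="$cross" \
    'BEGIN { printf "penalty: %.1f ms (%.2fx)\n", b - a, b / a }'
# -> penalty: 148.4 ms (2.14x)
```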
There are a lot of assumptions made that aren't scripted, since it was a quick test. I'm assuming, for example, that the current cpufreq governor is ondemand, though I've seen the same with conservative. Especially if you test with fast but find the clock drops back down, it could be the governor, or throttling (neither of which is accounted for atm).
Though I try to pin the test as much as I can, some CPUs are isolated on boot. GRUB line:
nohz_full=0,16,1,17,8,24,9,25,10,26,11,27
rcu_nocbs=0,16,1,17,8,24,9,25,10,26,11,27
isolcpus=0,16,1,17,8,24,9,25,10,26,11,27
That is intentional as I pin vms. It also doesn't really affect the test from what I can tell.
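If anyone wants to check their own setup, the isolation can be verified at runtime too - these are standard procfs/sysfs paths, nothing board-specific:

```shell
# Confirm the boot-time isolation flags actually took effect.
if [ -r /sys/devices/system/cpu/isolated ]; then
    cat /sys/devices/system/cpu/isolated
fi
tr ' ' '\n' < /proc/cmdline | grep -E 'isolcpus|nohz_full|rcu_nocbs' \
    || echo "no isolation flags on the kernel command line"
```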
Some other things to note:
- X399 AORUS Gaming 7 board
- There is _zero_ scripted thermal monitoring
- BIOS supports setting custom power states
- c6 is disabled thus 2.2GHz is the lowest clock for me, ymmv.
- I can hit faster OC but not needed to validate the test
Clocks are intentionally reset to the lowest atm due to the way Ryzen works - not all cores can run at full OC/XFR, ymmv. That shouldn't be a problem with most governors unless you've pinned them higher; just something to be aware of. I should check what they were before changing, but I'm lazy.
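On the "should check what they were before changing" point, here's a small sketch of snapshotting every scaling_min_freq up front and restoring the originals on exit (standard cpufreq sysfs layout; needs root to actually write back):

```shell
#!/bin/bash
# Snapshot each CPU's scaling_min_freq, restore the originals on exit.
declare -A saved
for f in /sys/bus/cpu/devices/cpu*/cpufreq/scaling_min_freq; do
    if [ -r "$f" ]; then
        read -r "saved[$f]" < "$f"
    fi
done
restore() {
    local f
    for f in "${!saved[@]}"; do
        echo "${saved[$f]}" > "$f" 2>/dev/null || true
    done
}
trap restore EXIT
# ...run the fast/slow test here; the originals come back automatically...
```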
What kinda makes this worse is that it's running in UMA / "Creator mode", _NOT_ NUMA / "Gaming".
perf stat -d -d -d ./child.sh --target slow:2 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:2:2200000 99.580689 task-clock (msec) # 0.740 CPUs utilized
perf stat -d -d -d ./child.sh --target slow:15 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:15:2200000 102.475220 task-clock (msec) # 0.995 CPUs utilized
perf stat -d -d -d ./child.sh --target fast:15 --source slow:4 2>&1 | egrep '(OUT:|task-clock)' | xargs
OUT: Parent:4:2200000 Target:15:3700000 96.033168 task-clock (msec) # 0.992 CPUs utilized
But maybe your issue is Threadripper-specific.
0-> {0..3,8..15}
1-> {4..7,16..23}
It's interesting you still see a few ms gain though.
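For reference, those {0..3,8..15}-style lists can be read straight from sysfs, which is where the kernel publishes each node's CPU set:

```shell
# Print each NUMA node's cpulist as the kernel sees it.
for n in /sys/devices/system/node/node*; do
    if [ -r "$n/cpulist" ]; then
        echo "${n##*/} -> $(cat "$n/cpulist")"
    fi
done
```

On a machine in UMA mode this collapses to a single node0 entry covering every CPU.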
lstopo is a neat tool - never heard of it before :)
Assuming it's:

   CCX0    |    CCX1
-------------------------
 0  1  2  3 |  4  5  6  7
 8  9 10 11 | 12 13 14 15
-------------------------
lstopo/hwloc is quite handy indeed. It misses some things, like identifying nvme drives (it lists the bus of course), but you can export to XML and add whatever you want. I have VMs mapped, for example. Almost like porn on massive servers :P
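For anyone wanting to try the XML route, a sketch of the round-trip (flags as documented in hwloc's lstopo; the filename is arbitrary):

```shell
# Round-trip the topology through XML so you can annotate it by hand.
if command -v lstopo >/dev/null; then
    lstopo --of xml topo.xml                 # dump the live topology to XML
    # ...edit topo.xml: add VM pins, which nvme is which, etc...
    lstopo --input topo.xml --of console     # render the annotated copy
    rm -f topo.xml
fi
```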
Don't know why the forum is eating that ascii. It looks fine on preview but post gets garbled.. hm.
I've seen something quite similar with Samsung NVMe drives. The latency remains high because the drive stays in a lower power state.
From ioping: the drop comes after starting dd in another terminal.
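If it's the same effect, the drive's APST (autonomous power state transition) behaviour is worth a look. A sketch assuming nvme-cli is installed and /dev/nvme0 is your drive:

```shell
# Inspect the NVMe power state table and the APST feature (id 0x0c).
if command -v nvme >/dev/null && [ -e /dev/nvme0 ]; then
    nvme id-ctrl /dev/nvme0 | grep -A8 '^ps '   # power states the drive offers
    nvme get-feature /dev/nvme0 -f 0x0c -H      # current APST settings
fi
# A blunt test knob: boot with nvme_core.default_ps_max_latency_us=0
# to keep the drive out of deep power states entirely.
```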
I ended up setting a 6 MiB offset for partitions to account for some potentially weird erase block sizes (like the 1536 KiB one). And fio produces some of [these results](https://www.reddit.com/r/linuxquestions/comments/8hzz20/how_to_configure_samsung_evo_970_for_optimal/).
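The 6 MiB figure works out because it's an exact multiple of that odd 1536 KiB erase block (4 x 1536 KiB = 6 MiB), so it stays aligned for the usual 512 KiB / 1 MiB sizes too. Quick check:

```shell
# Verify a 6 MiB partition start is aligned to a 1536 KiB erase block.
offset=$((6 * 1024 * 1024))     # 6 MiB partition start, in bytes
eb=$((1536 * 1024))             # the odd 1536 KiB erase block size
echo $(( offset % eb ))         # -> 0, i.e. aligned
```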
See:
* https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/
* https://superuser.com/questions/1243559/is-partition-alignment-to-ssd-erase-block-size-pointless
* https://forums.anandtech.com/threads/samsung-tlc-erase-block-sizes.2448833/