Shadow of Mordor on AMD Ryzen CPUs suffers from a performance hit due to non-optimal thread scheduling

By ghem - 27 May 2017 at 8:27 pm UTC

While doing some comparative benchmarks between my RX 470 and GTX 1060 on a Ryzen 1700 CPU and an i7-2700k CPU, I encountered odd behaviour with Shadow of Mordor.

At the 1080p high preset, this benchmark is almost exclusively CPU-bound on both a Ryzen 1700 (3.75GHz) and an i7-2700k (4.2GHz). So when I got 30 to 40% better performance on the i7 than on the Ryzen with the GTX 1060, I was shocked and began to investigate what was causing such a performance drop on Ryzen.

Interesting to note is that, on Ryzen, the performance of the GTX 1060 and the RX 470 was identical in CPU-bound parts of the benchmark, even though AMD’s open source driver (Mesa 17.2-git in this case) still has a significantly higher CPU overhead than Nvidia's proprietary driver. So this pointed to a driver-independent bottleneck on the game side itself.

With that information, I started suspecting a thread allocation problem, either from the Linux kernel (4.12rc1) or from the game (if it forces the scheduling through CPU affinity).

You see, Ryzen has a specific architecture, quite different from Intel's i5 and i7. Ryzen is a bit like CPU Lego, with the CCX as the base building block. A CCX (core complex) comprises 4 CPU cores with SMT (simultaneous multithreading) and the associated memory caches (level 1 to 3). A mainstream Ryzen CPU is made of 2 CCXes linked by AMD’s Infinity Fabric (a high-speed communication channel). Even the 4-core Ryzen CPUs are made this way (on these CPUs, two cores are disabled in each CCX).

If you’re interested in the subject, you can find more in-depth information here: Anandtech.com review of Ryzen

So how does this all relate to Shadow of Mordor? Well, AMD’s architecture is made to scale efficiently to high core counts (up to 32), but it has a drawback: communication between CPU cores that are not on the same CCX is slower because it has to go through the Infinity Fabric.

On a lot of workloads this won’t be a problem because threads don’t need to communicate much (for example in video encoding, or serving web pages) but in games threads often need to synchronize with each other. So it’s better if threads that are interdependent are scheduled on the same CCX.

This is not happening with Shadow of Mordor, so performance takes a huge hit, as you can see in the graph below.

This graph shows the FPS observed on a Ryzen 1700 @ 3.75GHz with an RX 470 during the automated benchmark of Shadow of Mordor. The blue line shows the FPS with the default scheduling and the red line the FPS with the game forced onto the first CCX. The yellow line shows the performance increase (in %) going from default to manual scheduling.

As you can see, manual scheduling yields roughly a 30% performance improvement in CPU-bound parts of the benchmark. Quite nice, eh?

So how does one manually schedule Shadow of Mordor on a Ryzen CPU?

It’s quite simple really. Just edit the launch options of the game in Steam like this:
taskset -c 0-7 %command%
This command restricts the game to logical cores 0-7, which are all located on the first CCX.
Note: due to SMT, there are twice as many logical cores as physical cores. This is because SMT allows two threads to run simultaneously on each physical core (though not both at full speed).

The above command is for an 8-core / 16-thread Ryzen CPU (model 1700 and higher).
On a 6-core Ryzen (models 1600/1600X), the command would be taskset -c 0-5 %command% and on a 4-core Ryzen (models 1400/1500X) taskset -c 0-3 %command%
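
Before relying on a range such as 0-7, it may be worth checking which logical cores actually share a level 3 cache on your machine, since the numbering is decided by the kernel and firmware. Here is a minimal Python sketch, assuming Linux exposes the cache topology under /sys/devices/system/cpu/cpuN/cache/index3/shared_cpu_list and that index3 is the L3 cache (the helper names parse_cpu_list and l3_groups are hypothetical, not part of any tool):

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-3,8,10-11" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct groups of logical CPUs that share an L3 cache.

    Assumption: cache/index3 is the L3 cache, which is the usual layout but
    is not guaranteed. Returns an empty list if the files are missing.
    """
    groups = set()
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "cache", "index3", "shared_cpu_list"
    )
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                groups.add(tuple(parse_cpu_list(handle.read())))
        except OSError:
            continue  # unreadable entry: skip it rather than guess
    return sorted(groups)


if __name__ == "__main__":
    # Print one candidate Steam launch option per L3 group found.
    for group in l3_groups():
        print("taskset -c", ",".join(map(str, group)), "%command%")
```

If the first group printed is not the range given above, the printed list is probably the safer one to put in the launch options.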

Caveat: on a 4 core Ryzen limiting the game to the first CCX will only give it 2 cores / 4 threads to work with. This may prove insufficient and counter-productive compared to running the game with the default scheduling. You’ll have to try it for yourself to see what option gives the best performance.

Due to its specific architecture, Ryzen needs special care in thread scheduling from the OS and games. If you think a game does not have the performance level it should have, you can try forcing the scheduling onto the first CCX and see if it improves performance. In my (admittedly limited) experience though, Shadow of Mordor is the only game where manual scheduling mattered. The Linux scheduler usually does a pretty good job.

Article taken from GamingOnLinux.com.

Tags: AMD, Benchmark, Feral Interactive, Hardware | Apps: Middle-earth: Shadow of Mordor

27 Likes

29 comments

ghem 28 May 2017

Quoting: Guest
When I was doing Linux development at VP, one of the things we determined was that Linux REALLY needs to implement PTHREAD_SCOPE_PROCESS to allow games to better prioritize how their threads get scheduled, for reasons just like this. They quite often do on Windows, and this carried along correctly on OS X, but Linux only has PTHREAD_SCOPE_SYSTEM which effectively makes it impossible for a process to prioritize its own threads well.

This is really interesting, it's the first time I've heard about this.
Could you give more information on the difference between PTHREAD_SCOPE_PROCESS and PTHREAD_SCOPE_SYSTEM? I've read the manual but still don't clearly understand what difference it makes in practice.

On this specific issue, I thought the solution would be more to separate each CCX as different NUMA nodes so the scheduler could take into account the additional cost of having interdependent threads on different CCXes?

0 Likes

ghem 28 May 2017

Quoting: Samsai
This is a fantastic piece of research right here. Mad respect for the guest writer!

Thanks a lot, it's much appreciated ^_^

Quoting: soulsource
The right place to "fix" this issue is the task scheduler of the OS, and I would be very surprised if we didn't see patches from AMD that address such issues soon.

Quoting: shirethrone
Very interesting read. I like to have background information.
Do developers or the kernel have to act to fix this issue?

In this case, I don't know for sure.

IIRC the Linux kernel was patched by AMD in version 4.9 so it should be aware of Ryzen's topology and schedule threads accordingly. And indeed, the scheduling generally works fine.

At some point I even tried all sorts of manual scheduling with Dirt3 on Wine, and every single time the result ranged from slightly worse to much worse than letting the scheduler do its job.
That's why I was surprised to see this performance hit with Shadow of Mordor. At this point I pretty much expected manual scheduling to be useless.

So it may be a problem with the game forcing scheduling in a manner that doesn't work well with Ryzen but we would need a confirmation from Feral to be sure.

Quoting: pete910
To stop the stuttering, it may be better to pin it to the real cores, not the first 8 or so
[...]
Better for me using
taskset -c 0,3,5,7,9,11,13,15 %command%

I just did a quick test where I scheduled the game only on even logical cores (so at most one thread per real core) and I obtained a similar result as with the default scheduling. So it seems that it is indeed an issue with the CCXes and not SMT. This would make sense as I have no such issue on the i7 2700k.
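
For anyone wanting to repeat this "one thread per real core" test, here is a minimal Python sketch that builds the launch option from a map of SMT siblings. The layout used below (siblings numbered in adjacent pairs, as on my system) is an assumption; on a real machine the sibling lists can be read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list. The helper names are hypothetical.

```python
def one_cpu_per_core(siblings):
    """Pick the lowest-numbered logical CPU of every physical core.

    `siblings` maps each logical CPU to the list of logical CPUs that share
    its physical core (including itself).
    """
    return sorted({min(group) for group in siblings.values()})


def taskset_option(cpus):
    """Format a Steam launch option that restricts the game to `cpus`."""
    return "taskset -c " + ",".join(str(cpu) for cpu in cpus) + " %command%"


# Assumed layout: 16 logical CPUs, SMT siblings numbered in adjacent pairs
# (0 and 1 share a core, 2 and 3 share a core, and so on).
siblings = {cpu: [cpu - cpu % 2, cpu - cpu % 2 + 1] for cpu in range(16)}

print(taskset_option(one_cpu_per_core(siblings)))
# prints: taskset -c 0,2,4,6,8,10,12,14 %command%
```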

1 Likes

Egonaut 28 May 2017

That doesn't surprise me: at only 2400 MHz, your memory is at the lower end of possible DDR4 clocks. By the way, you should be able to run your dual-ranked memory at 2666 MHz. My dual-ranked memory can run at that speed, but you might need to adjust your memory timings. At the very least set an even value for CL; odd CL timings don't work with current BIOSes, and that seems to be the biggest problem.

My memory has an XMP profile of 3000 MHz at CL15. To run it at 2666 MHz I need to set a CL of 16, but with that it runs perfectly fine. If you have 4 RAM modules, you might need to use only 2 to get higher speeds. With current BIOSes it's hard to reach higher frequencies with 4 dual-ranked DIMMs.

Also try setting a higher SOC voltage, which can give you much better OC capabilities. Up to 1.10 volts is fine. AMD engineers recommend that; here's an interesting video where AMD engineers explain how to overclock on Ryzen boards: https://youtu.be/vZgpHTaQ10k

Last edited by Egonaut on 28 May 2017 at 1:30 pm UTC

1 Likes

ghem 28 May 2017

Quoting: Guest
I'll try to explain PTHREAD_SCOPE better...

Thanks for the very comprehensive answer! <3

Quoting: berillions
So ...
I would like to buy a Ryzen 5 1400, but with this problem, which is the better deal:
- Buy this processor despite this issue in this game and maybe in future games, even if a fix exists
- Buy an Intel processor ...

There are a lot of problems with Ryzen processors on Linux; I don't know what I should do ... Intel or AMD processor ...

I did a quick run of benchmarks to simulate the performance of 4C and 6C Ryzen CPUs by restricting the cores the game has access to.
Keep in mind it's only a simulation, so a real 4C/6C Ryzen CPU could behave differently.

First Shadow of Mordor:

[Benchmark graph: not available in this copy]

4C/6C performance is lower than 8C (to be honest I was expecting 6C performance to be similar to 8C so it's a surprise here) while the worst by far is with default scheduling.
All in all the performance hit with a 4C is significant but not something that would make the game unplayable.

Now Hitman:

[Benchmark graph: not available in this copy]

This is quite different from Shadow of Mordor:
- the default scheduling is fine and the CCXes don't seem to create any problem
- there is very little difference between 4C, 6C and 8C. And the benchmark is mostly cpu bound.

I personally think what you see with Hitman will be closer to the average experience you'll get with a 4C Ryzen cpu. But I would need to do a lot more benchmarks to confirm this.
Don't forget also that more and more titles will use Vulkan which will reduce the cpu overhead so things should get better too.

On the flip side, please note the Ryzen 1400 has half the level 3 cache of all other Ryzen cpus (8MB vs 16MB) so this could adversely affect performance.

If you don't want to have to bother with all these things, you might indeed be better off with an Intel CPU, though Intel has no competition at the R5 1400 price (Core i3s are useless now that there is the Pentium G4560). But with a Ryzen CPU, you have a very good upgrade path down the road, just by changing the CPU alone.

Honestly, I think there are four possibilities:
- you're on a tight budget: get the Pentium G4560 and a good GPU. Try to grab a second-hand 7700 / 7700k in a couple of years.
- you have a medium budget: get the Ryzen R5 1600 and overclock it. You will be able to upgrade to more cores / better single-thread performance down the road (Ryzen 2)
- you have a high budget and want the best performance now: get a 7700k and overclock it
- you have a high budget and want some future proofing: get a Ryzen 7 1700 and overclock it

If you opt for a Ryzen CPU, make sure to get good DDR4 (at least 2666 MHz, single rank if you can get that information).

0 Likes

Egonaut 28 May 2017

Quoting: ghem
If you opt for a Ryzen cpu, make sure to get good DDR4 (at least 2666Mhz, single rank if you can get that information).

I disagree here: you should look for dual-ranked memory. Dual-ranked memory performs better at the same frequency. Yes, there are currently still some compatibility problems with dual-ranked memory, but these will be fixed soon.

Also, I would target higher frequencies if the budget allows it, but 2666 MHz should be the lowest target.

0 Likes

niarbeht 28 May 2017

I'm left wondering whether there isn't some cgroup magic that could accomplish something similar, but I dunno. I was planning to leave researching how cgroups work until I had a Ryzen or Ryzen 2 system capable of doing passthrough correctly.

0 Likes

F.Ultra 29 May 2017

Supporter

Quoting: Guest
I'll try to explain PTHREAD_SCOPE better...

As you know, processes on Linux all have a "priority" value, which is how the kernel scheduler picks the next process to be run. Every process has at least one thread, and games often use multiple threads to do things. Each thread in a process also has a priority value, and this influences which thread in a process the scheduler picks to run.

On Windows, OS X and BSD, thread priorities are *relative* to the process's priority. That means when the kernel picks the process to be run, the scheduler can also see which of that process's threads most urgently need attention. The process itself is completely free to set its threads' priorities as appropriate.

However on Linux, there is a bit more of a potted history with threading. Originally the kernel didn't have threads at all, only processes. Threads were later added by creating processes which shared another process's address space. So it's a bit of a bodge really. This is why if you use the right parameters to "ps", you can see threads, and why threads have a pid_t value.
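
The point about threads having their own pid_t value can be observed from user space. Here is a minimal Python sketch, assuming Python 3.8 or newer (for threading.get_native_id); the helper name worker_thread_id is hypothetical. On Linux the main thread's ID normally matches the process ID, while each additional thread gets its own.

```python
import os
import threading


def worker_thread_id():
    """Start one thread and return the kernel thread ID it reports."""
    result = []
    thread = threading.Thread(
        target=lambda: result.append(threading.get_native_id())
    )
    thread.start()
    thread.join()
    return result[0]


if __name__ == "__main__":
    print("process ID:      ", os.getpid())
    print("main thread ID:  ", threading.get_native_id())
    print("worker thread ID:", worker_thread_id())
```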

So on Linux, a thread's priority isn't relative to its owning process; it's relative to all the other process priorities too. The problem with this is that changing process priority is, quite rightly, a privileged operation... otherwise it'd be easy for a rogue process to starve out more essential system processes by setting a high priority. This also means that a process cannot effectively set its own threads' priorities, so a game is unable to hint to the kernel effectively which of its threads should get priority over others. This is often the cause of stuttering in sound, for example, because you can't hint that your sound engine thread needs priority over that background IO thread.

On the other hand, a process scope is not so much fun if you run other applications at the same time, since all the process-scope threads in your application will now be seen as a single entity when it competes for CPU time against all the threads from the other applications. These days the inter-thread, intra-process priority is handled with cgroups, but that is not something that I've tried.

Quoting: shirethrone
Very interesting read. I like to have background information.
Do developers or the kernel have to act to fix this issue?

This should mainly be a developer's fix. What is happening here is that you have an application where several different threads perform writes and reads to shared memory, while at the same time you have lots of applications where none of the threads share memory for anything (like, say, a web server). There is no way for the kernel to know which of your threads are sharing memory, so there is not much it can do. This behaviour should therefore be apparent on any OS.

Last edited by F.Ultra on 29 May 2017 at 1:38 pm UTC

0 Likes

F.Ultra 30 May 2017

Supporter

Quoting: Guest
From a programmer's perspective if you can manage your own thread priorities, the user can raise the overall process priority appropriately... or some other process manager could.. for example Windows auto boosts the front process's priority.

Moreover, cgroups is Linux specific, and PTHREAD_SCOPE_PROCESS is not. You have more chance of getting it well supported if it works like every other OS does.

Of course, but as I wrote, it's not fun if/when there are other processes running on the same machine competing for CPU time with your threads, since all the process-scope threads in your application are seen as a single item by the global scheduler. Now, I have no experience with programming games since I mainly do server stuff, but the problems that you describe sound more like they are due to starting more threads than there are cores on the system, which makes them compete with each other in the first place.

The Windows auto boost of the interactive thread is perhaps nice for a game but creates havoc for system daemon writers like me (and thankfully it can be disabled by the user [but only for the system as a whole]), with latencies going completely haywire every time an end user moves a window around and so on.

Quoting: Guest
You are thinking of forked processes, which are copy-on-write. Web servers typically use forked processes for resilience - if a worker sub-process dies, it can be re-spawned and doesn't crash the whole web server. It also provides for process isolation which increases security.

Threads implicitly share their process's memory. Again the confusion here is because threads were originally poorly implemented on Linux as a "hack" on fork().

Yes, this particular scenario of a web server such as Apache works this way by forking, but it's not mandatory, nor is it always desirable. You can avoid heavy-overhead IPC (and do note that such IPC has a much heavier overhead than the Infinity channel has on a Ryzen-type CPU architecture) by using threads instead of forked processes, even when most of the workload is not shared. So there is still no way for the kernel/system to know what to do with threads on a CPU architecture such as Ryzen, other than to give developers tools for managing this on their own. Only they know whether thread A will often share data with thread B but not with thread C, and can thus pin them so that threads A and B always run on the same CCX while thread C can roam free wherever there is idle processor time.
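
As a rough illustration of that kind of per-thread pinning, here is a minimal Python sketch. It assumes Linux, where os.sched_setaffinity with a pid of 0 applies to the calling thread only; the helper names pin_current_thread and run_pinned are hypothetical, and "half of the allowed CPUs" is only a stand-in for one CCX.

```python
import os
import threading


def pin_current_thread(cpus):
    """Restrict the calling thread to `cpus`; return the affinity now in effect.

    Assumption: Linux, where pid 0 means "the calling thread".
    """
    os.sched_setaffinity(0, cpus)
    return os.sched_getaffinity(0)


def run_pinned(cpus, work):
    """Run `work()` in a new thread pinned to `cpus`; return that thread's affinity."""
    seen = {}

    def body():
        seen["affinity"] = pin_current_thread(cpus)
        work()

    thread = threading.Thread(target=body)
    thread.start()
    thread.join()
    return seen["affinity"]


if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    # Stand-in for "one CCX": the lower half of the CPUs we are allowed to use.
    one_ccx = set(allowed[: max(1, len(allowed) // 2)])
    # Threads A and B would both be started with `one_ccx`;
    # thread C would simply be started without any pinning.
    print("pinned thread runs on:", sorted(run_pinned(one_ccx, lambda: None)))
    print("main thread still on: ", sorted(os.sched_getaffinity(0)))
```

A native game would more likely do the same thing through pthread_setaffinity_np, but the idea is identical: only the threads that share data are restricted, and the rest stay free to run anywhere.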

The problem here is not that the threads have access to a shared memory pool; the problem is that some of these threads actively work on the same memory locations, which, if they run on different CCXes, will cause massive amounts of copying over the Infinity channel.

IF this was "solved" by the kernel, the consequence would be that all threads created by an application would be pinned to the same CCX, leaving quite a few cores completely idle, and all this just to handle a particular workload where the threads do heavy work on the same memory.

0 Likes

pete910 30 May 2017

I wonder if we'll be able to get a patch like the one Tomb Raider has got for Ryzen?

40% boost they've managed apparently

Last edited by pete910 on 30 May 2017 at 5:40 pm UTC

0 Likes
