More Ryzen / Threadripper problems or hardware?
devnull Oct 30, 2017
Second try; hope this doesn't repost. If it does, I'll remove one.

TL;DR: Anyone with Threadripper or Ryzen hardware still seeing stability problems? Specifically with PCIe errors.


Long version: Between motherboards, power supplies, SSDs and video cards I've now spent several thousand upgrading to a 1950X, yet I'm still getting hardware problems.

Example errors:

#
pcieport 0000:00:01.1: AER: Corrected error received: id=0000
pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Transmitter ID)
pcieport 0000:00:01.1: device [1022:1453] error status/mask=00001000/00006000
pcieport 0000:00:01.1: [12] Replay Timer Timeout
pcieport 0000:00:01.1: AER: Corrected error received: id=0000
pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0009(Receiver ID)
pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000080/00006000
pcieport 0000:00:01.1: [ 7] Bad DLLP
#
# dmesg |grep pciep |grep -- '\['|sort | uniq -c
316 pcieport 0000:00:01.1: [12] Replay Timer Timeout
1689 pcieport 0000:00:01.1: [ 6] Bad TLP
17 pcieport 0000:00:01.1: [ 7] Bad DLLP
1652 pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000040/00006000
17 pcieport 0000:00:01.1: device [1022:1453] error status/mask=00000080/00006000
279 pcieport 0000:00:01.1: device [1022:1453] error status/mask=00001000/00006000
37 pcieport 0000:00:01.1: device [1022:1453] error status/mask=00001040/00006000
46 pcieport 0000:01:00.2: [12] Replay Timer Timeout
46 pcieport 0000:01:00.2: device [1022:43b1] error status/mask=00001000/00002000
# lspci -vv | egrep '(1453|43b1)'
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1453 (prog-if 00 [Normal decode])
00:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1453 (prog-if 00 [Normal decode])
01:00.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 43b1 (rev 02) (prog-if 00 [Normal decode])
pcilib: sysfs_read_vpd: read failed: Input/output error
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1453 (prog-if 00 [Normal decode])
40:01.3 PCI bridge: Advanced Micro Devices, Inc. [AMD] Device 1453 (prog-if 00 [Normal decode])
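
For anyone trying to read the status/mask values in the log above: assuming they are the standard PCIe AER Correctable Error Status bits (which matches the bit numbers the kernel prints, e.g. [6], [7], [12]), a minimal sketch like the following decodes them. The helper name `decode_correctable` is made up for illustration and is not part of any tool.

```python
# Bit positions of the PCIe AER Correctable Error Status register.
# The table follows the PCIe specification; the helper itself is only
# an illustrative sketch, not something the kernel or lspci provides.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_correctable(value):
    """Return the names of all bits set in a status or mask value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(CORRECTABLE_BITS.get(bit, "Unknown bit %d" % bit))
    return names


if __name__ == "__main__":
    # Values copied from the "error status/mask=" lines in the log.
    for raw in ("00000040", "00000080", "00001000", "00001040", "00006000"):
        print(raw, decode_correctable(int(raw, 16)))
```

Read this way, 00001040 is Bad TLP and Replay Timer Timeout reported together, and the 00006000 mask only covers the Advisory Non-Fatal and Corrected Internal Error bits, so none of the errors being logged here are masked.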


My concerns are twofold: getting to the root cause, and warranty coverage should this turn out to be hardware related. Trying to find answers from vendors is an exercise in futility. I don't do Windows, at all. I don't even own a pirated copy. Disabling AER is not an option either, since the hardware is throwing errors for a reason. Neither is increasing

Something else odd: the errors appear to change in frequency depending on where the hardware physically is. Maybe RFI / shielding problems?
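
One way to put numbers on that kind of comparison is to count errors per port and error type for each physical placement and compare the totals. The sketch below is a rough Python equivalent of the `dmesg | grep | sort | uniq -c` pipeline used earlier; the function name `count_errors` and the regular expression are assumptions based only on the log format shown in this thread, so other kernel versions may need a different pattern.

```python
import re
from collections import Counter

# Matches lines such as:
#   pcieport 0000:00:01.1: [12] Replay Timer Timeout
# with or without a leading dmesg timestamp. The pattern is an assumption
# derived from the log lines quoted in this thread.
ERROR_LINE = re.compile(
    r"pcieport (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"\[\s*(?P<bit>\d+)\]\s+(?P<name>.+?)\s*$"
)


def count_errors(log_text):
    """Count AER error lines, keyed by (device, error name)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            counts[(match.group("dev"), match.group("name"))] += 1
    return counts


if __name__ == "__main__":
    # A few lines in the format shown above, used as sample input.
    sample = "\n".join([
        "pcieport 0000:00:01.1: [12] Replay Timer Timeout",
        "pcieport 0000:00:01.1: [ 7] Bad DLLP",
        "pcieport 0000:00:01.1: [12] Replay Timer Timeout",
        "pcieport 0000:00:01.1: device [1022:1453] error status/mask=00001000/00006000",
    ])
    for (dev, name), n in sorted(count_errors(sample).items()):
        print(f"{n:6d} {dev} {name}")
```

Saving the `dmesg` output to a file for each placement and running both files through the same function would give directly comparable counts, provided the uptime is roughly the same in each case.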

Hardware:
Ryzen Threadripper 1950X (16 core)
Asus ROG Zenith
2x Samsung SM961 NVMe
2x Samsung 960 Pro SSDs
4x 3TB WD Reds
1x EVGA 1080 Ti FTW3
32GB DDR4
1x EVGA 1kW SuperNOVA P2

4.13.9 Kernel.

Also tried with (and still freaking own): the AORUS board, different RAM, 950 Pros, an EVGA 850W PSU, and an EVGA 1080 Ti Kingpin (returned due to coil whine. I should have kept it, as the FTW3 had whine too. Though there was nothing else wrong, replacing the power supply fixed it. EVGA wouldn't comment on why).

Basically a beast.
Xpander Oct 31, 2017
no issues for me with my Asus X370 Prime and Ryzen 1700X

xpander@arch ~ $ dmesg |grep pciep |grep -- '\['|sort | uniq -c
      1 [    1.297189] pcieport 0000:00:01.3: AER enabled with IRQ 28
      1 [    1.297204] pcieport 0000:00:03.1: AER enabled with IRQ 29


and what stability issue?

I haven't had any issues since April of this year, after a few BIOS versions that fixed all this.
Longest uptime has been only 15 days, but I do reboot to update the kernel, so I really haven't kept the system going for longer periods.
lucinos Oct 31, 2017
have you tried with kernel 4.9?
Xpander Oct 31, 2017
Quoting: lucinos
have you tried with kernel 4.9?

4.9 doesn't even support Ryzen. 4.10 had initial support and 4.11 added more.

I'm on 4.13 as well.
sterky Oct 31, 2017
Hey,

Ryzen 1800x
Asus Rog Crosshair VI Hero
PCIe NVME drive
Nvidia GTX 1070

$ dmesg |grep pciep |grep -- '\['|sort | uniq -c
      1 [    1.182122] pcieport 0000:00:01.1: AER enabled with IRQ 28
      1 [    1.182150] pcieport 0000:00:01.3: AER enabled with IRQ 29
      1 [    1.182164] pcieport 0000:00:03.1: AER enabled with IRQ 30


Take a look here, maybe it's something with Nvidia 1080 cards?
GTX 1080 Throwing Bad TLP PCIe Bus Errors

Good luck
devnull Oct 31, 2017
Fun fact, I booted a 2.6.32 kernel on it (long story, needed reiserfs which isn't in elrepo for 4.x kernels).

Regarding the GeForce URL, it's broken. I spent 20 minutes filling out CAPTCHAs to no avail. NVIDIA uses Incapsula on a lot of their sites, which breaks them all... and although they use Google anyway, it's only THEM.

I'm guessing the link is like everywhere else, telling people to disable memory-mapped PCI configuration access (pci=nommconf) or to drop back to gen3 PCI. Neither is _really_ an option. FWIW gen2 does make the errors go away, which furthers my suspicion that it's a hardware problem. Given the thousands people have spent, I'm guessing hardware vendors will not be opening that can of worms.
pete910 Nov 1, 2017
Quoting: Xpander
no issues for me with my Asus X370 Prime and Ryzen 1700X

xpander@arch ~ $ dmesg |grep pciep |grep -- '\['|sort | uniq -c
      1 [    1.297189] pcieport 0000:00:01.3: AER enabled with IRQ 28
      1 [    1.297204] pcieport 0000:00:03.1: AER enabled with IRQ 29


and what stability issue?

I haven't had any issues since April of this year, after a few BIOS versions that fixed all this.
Longest uptime has been only 15 days, but I do reboot to update the kernel, so I really haven't kept the system going for longer periods.

Same as above, with a Gigabyte Aorus K7 and 1700X. No issues since release bar memory speed, which was fixed by the F4 BIOS at the end of May.

dmesg |grep pciep |grep -- '\['|sort | uniq -c
      1 [    1.216868] pcieport 0000:00:01.1: AER enabled with IRQ 28
      1 [    1.216889] pcieport 0000:00:01.3: AER enabled with IRQ 29
      1 [    1.216904] pcieport 0000:00:03.1: AER enabled with IRQ 30
      1 [    1.216910] pcieport 0000:00:01.1: Signaling PME with IRQ 28
      1 [    1.216917] pcieport 0000:00:01.3: Signaling PME with IRQ 29
      1 [    1.216926] pcieport 0000:00:03.1: Signaling PME with IRQ 30
      1 [    1.216940] pcieport 0000:00:07.1: Signaling PME with IRQ 31
      1 [    1.216956] pcieport 0000:00:08.1: Signaling PME with IRQ 33
wolfyrion Nov 1, 2017
I have a Threadripper 1950X as well, and I've already solved all the problems I had, as well as tweaking the kernel.

For your case, I think if you add this kernel parameter in GRUB you should be fine :)

pcie_aspm=off

I am using grub Customizer just for easy editing
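
After editing the GRUB entry and rebooting, it is worth confirming that the parameter actually reached the kernel. The running kernel's command line is exposed in `/proc/cmdline`; the sketch below parses it and reports the value of `pcie_aspm`. The helper name `parse_cmdline` is hypothetical, and the parsing is deliberately simple (it does not handle quoted values containing spaces).

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (for example "quiet") map to None. Quoted values
    containing spaces are not handled; this is only a simple sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as handle:
            params = parse_cmdline(handle.read())
    except OSError:
        # Not on Linux, or /proc is not mounted.
        params = {}
    print("pcie_aspm =", params.get("pcie_aspm", "(not set)"))
```

If the output shows `(not set)`, the GRUB configuration was probably not regenerated or a different boot entry was used.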
lucinos Nov 1, 2017
Quoting: Xpander
Quoting: lucinos
have you tried with kernel 4.9?

4.9 doesn't even support Ryzen. 4.10 had initial support and 4.11 added more.

I'm on 4.13 as well.

I dunno about Ryzen, but usually all x86 CPUs are backwards compatible. You can even run 32-bit OSes. I would suppose it is the new Ryzen features that are not supported. So running an older kernel would not be optimal, but it should run, and maybe the bugs would not be triggered either. So 4.9 may work, and may even work without problems. That has happened to me many, many times. On the other hand, some new CPUs really do seem to demand new kernels, so maybe you are right.