I helped fix sleep-wake hangs on Linux with AMD GPUs

nyanpasu64.gitlab.io

752 points by fanf2 3 days ago

jorvi 3 days ago

> Through some digging, I found that when a desktop enters S3 sleep, the system cuts power to PCIe GPUs

I am not sure how correct this assumption is. S3 is supposed to cut power to everything but RAM, but for example Gigabyte Aorus motherboards are notorious for an NVMe SSD sleep bug that randomly prevents the system from properly sleeping or waking.

This is fixed by adding the following udev rule:

  # Generic PCIe fix for sleep bugs by preventing wakeup from any PCIe port
  ACTION=="add", SUBSYSTEM=="pci", DRIVER=="pcieport", ATTR{power/wakeup}="disabled"

or more targeted:

  # Gigabyte sleep fix by preventing wakeup from problematic PCIe port, depends on motherboard model
  ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x8086", ATTR{device}=="0x43bc", ATTR{power/wakeup}="disabled"

You can find any glitched PCIe wakeup device with:

  1. cat /proc/acpi/wakeup (you'll have to trial and error your way through the wakeup devices if it isn't immediately clear)
  2. cat /sys/class/pci_bus/*/*/yourWakeupDevicePci/uevent | grep PCI_ID
  3. prepend "0x"
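
Step 1 can be partly automated. The following is a minimal sketch, assuming the usual four-column layout of /proc/acpi/wakeup (device, S-state, status, optional sysfs node). The names `WakeupEntry`, `parse_acpi_wakeup`, and `enabled_pci_nodes` are made up for illustration, and the sample text stands in for a real machine's file.

```python
from typing import List, NamedTuple, Optional

# Sample in the layout /proc/acpi/wakeup uses; the device names are illustrative.
SAMPLE = """Device\tS-state\t  Status   Sysfs node
GPP0\t  S4\t*enabled   pci:0000:00:01.1
GPP8\t  S4\t*disabled
XHC0\t  S4\t*enabled   pci:0000:0b:00.3
"""


class WakeupEntry(NamedTuple):
    device: str                # ACPI device name, e.g. "GPP0"
    s_state: str               # deepest sleep state it can wake from
    enabled: bool              # True if wakeup is currently enabled
    sysfs_node: Optional[str]  # e.g. "pci:0000:00:01.1", or None if absent


def parse_acpi_wakeup(text: str) -> List[WakeupEntry]:
    """Parse the text of /proc/acpi/wakeup into structured entries."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 3 or fields[0] == "Device":
            continue
        node = fields[3] if len(fields) > 3 else None
        enabled = fields[2].lstrip("*") == "enabled"
        entries.append(WakeupEntry(fields[0], fields[1], enabled, node))
    return entries


def enabled_pci_nodes(entries: List[WakeupEntry]) -> List[str]:
    """Return PCI addresses of wake sources that are currently enabled."""
    prefix = "pci:"
    return [
        e.sysfs_node[len(prefix):]
        for e in entries
        if e.enabled and e.sysfs_node and e.sysfs_node.startswith(prefix)
    ]


for entry in parse_acpi_wakeup(SAMPLE):
    print(entry)
```

On a real system you would pass `open("/proc/acpi/wakeup").read()` instead of `SAMPLE`; the enabled PCI addresses are the candidates to try one by one.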

You also have the option of:

  udevadm info --attribute-walk /dev/whatever

but for that you need to know some basic identifier of your glitchy device.

Or if you want to shellscript it (less reliable than letting udev do it for you and needs to be done via systemd service file or another automation):

  # Gigabyte sleep fix, port depends on mobo model
  /bin/bash -c 'if grep RP05 /proc/acpi/wakeup | grep -q enabled; then echo RP05 > /proc/acpi/wakeup; fi'
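
The check for "enabled" matters because writing a device name to /proc/acpi/wakeup toggles its state rather than setting it. Here is a rough Python sketch of the same guard; `needs_toggle` and `disable_wakeup` are hypothetical helper names, and the write needs root.

```python
def needs_toggle(wakeup_text: str, device: str, want_enabled: bool = False) -> bool:
    """Return True if toggling `device` would move it to the wanted state."""
    for line in wakeup_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == device:
            currently_enabled = fields[2].lstrip("*") == "enabled"
            return currently_enabled != want_enabled
    return False  # device not listed: nothing to do


def disable_wakeup(device: str, path: str = "/proc/acpi/wakeup") -> bool:
    """Disable wakeup for `device` only if it is currently enabled.

    Returns True if a toggle was written, False if nothing had to change.
    """
    with open(path) as f:
        text = f.read()
    if not needs_toggle(text, device):
        return False
    # Writing the name flips the state, so this must run at most once
    # per "enabled" reading.
    with open(path, "w") as f:
        f.write(device)
    return True
```

Because of the guard, calling `disable_wakeup("RP05")` repeatedly is harmless, whereas an unconditional write would re-enable the port on every second run.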

Yes, I really hate this (and other) Linux sleep issues.

nyanpasu64 3 days ago
Hmm, on my motherboard I had to disable spontaneous wake by adding to /etc/udev/rules.d/:
```
  ACTION=="add", KERNELS=="0000:00:01.1", ATTR{power/wakeup}="disabled"
```
And my Logitech Bolt receiver instantly wakes several of my Linux computers. I don't know why it doesn't do that on Windows, and I haven't tried doing a USB capture (I don't know what equipment I'd need to try it out: a logic analyzer? A Glasgow?). In the meantime I've added a rule to block that:
```
  ACTION=="add", SUBSYSTEM=="usb", DRIVERS=="usb", ATTRS{idVendor}=="046d", ATTRS{idProduct}=="c548", ATTR{power/wakeup}="disabled"
```
- jorvi 2 days ago
  
  Good tip!
  KERNELS=="0000:00:01.1" sounds like an interesting way to do it, since you can target separate functions of the PCI device (in this case: domain 0, bus 0, slot 1, function 1).
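
For reference, the domain:bus:slot.function breakdown can be sketched in Python as follows. `PciAddress` and `parse_pci_address` are illustrative names, not part of any udev or kernel API; the range checks reflect that a PCI slot is a 5-bit field and a function a 3-bit field.

```python
import re
from typing import NamedTuple


class PciAddress(NamedTuple):
    domain: int
    bus: int
    slot: int
    function: int


# Four hex digits, two hex digits, two hex digits, one octal digit.
_PCI_RE = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$"
)


def parse_pci_address(addr: str) -> PciAddress:
    """Split an address like '0000:00:01.1' into its numeric parts."""
    match = _PCI_RE.match(addr)
    if match is None:
        raise ValueError(f"not a PCI address: {addr!r}")
    domain, bus, slot, function = (int(part, 16) for part in match.groups())
    if slot > 0x1F:  # only 32 slots (devices) exist per bus
        raise ValueError(f"slot out of range in {addr!r}")
    return PciAddress(domain, bus, slot, function)


print(parse_pci_address("0000:00:01.1"))
```
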
VMG 3 days ago

As somebody with an Aorus motherboard who has probably burned a few kWh on this issue, I was really excited to try these solutions - no luck. Thank you anyway!
- jorvi 3 days ago
  
  Did you try the general fix? And reload udev rules?
  You also have to make sure it applies after the default rules.
  You can check if the rule applies once you have everything set up by doing a `udevadm` attribute walk of your SSD device (not partition), and then following it all the way up the device tree until you see your specific device port (targeted fix) or PCIe driver subsystem (general fix). Then check if "power/wakeup" is set to "disabled". If it is set to disabled, something else is keeping your device awake on sleep.
  For that you can check /proc/acpi/wakeup, and there's also a specific systemd invocation (that I forgot) that shows if your device slept, how long it slept, and how much battery was drained, and if your device woke up, slept, or failed to resume, it'll give you a reason, to the best of its ability.
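
The "walk up the device tree and look at power/wakeup" check can also be done directly against sysfs. This is a sketch under the assumption that you start from a resolved /sys/devices/... path; `wakeup_chain` and `read_sysfs_attr` are hypothetical helpers, and the reader is injectable so the logic can be tried without real hardware.

```python
import posixpath
from typing import Callable, List, Optional, Tuple


def read_sysfs_attr(path: str) -> Optional[str]:
    """Read one sysfs attribute, returning None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def wakeup_chain(
    device_dir: str,
    read_attr: Callable[[str], Optional[str]] = read_sysfs_attr,
    stop_at: str = "/sys/devices",
) -> List[Tuple[str, str]]:
    """List (directory, power/wakeup value) pairs from a device up to the root.

    Directories without a power/wakeup attribute (or with an empty one)
    are skipped.
    """
    chain = []
    path = posixpath.normpath(device_dir)
    while path not in (stop_at, "/", "", "."):
        value = read_attr(posixpath.join(path, "power", "wakeup"))
        if value:
            chain.append((path, value))
        path = posixpath.dirname(path)
    return chain


# Typical use on a real machine (the device name is an example):
#   import os
#   start = os.path.realpath("/sys/block/nvme0n1/device")
#   for directory, state in wakeup_chain(start):
#       print(state, directory)
```

Any ancestor still reporting "enabled" after the rule should have fired is a hint that the rule did not match that port.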
  - VMG 2 days ago
    
    Maybe I should have clarified more - I'm using a desktop PC, not a laptop
    The system seems to go into sleep state just fine, it just freezes when waking up again
    
    jorvi 2 days ago
    
    Yes, my problem was with an Aorus ATX motherboard too.
    > The system seems to go into sleep state just fine, it just freezes when waking up again
    The behavior differs per device. Some sleep but freeze on wakeup, some immediately wake up after sleeping, and some (like mine) will go to sleep 98% of the way, but the fans keep spinning. That was a dead giveaway for me that the bug was triggered. I suspect that on some systems, either the fans turn off and the system succeeds in getting to sleep 99% of the way, or the fans are so quiet that people do not notice them. And from that limbo state you can only recover with a hardware shutdown or by interrupting power.
    I'm fairly certain this issue is fixable for you as well, either with only this fix (and we just have to find the proper port(s)), or with this fix plus one for whatever other device might be causing this issue.
    
    nyanpasu64 a day ago
    
    Yikes, on my B550M DS3H, previously when I woke my PC from sleep immediately after it slept (e.g. by pressing the keyboard or case power button), it would "wake" after 0.5 seconds asleep and turn on the power light but not respond to user input, and not even shut down if I held the power button for 4 seconds! I had to pull power at the wall. This behavior occurred on both Windows and Linux, and was fixed at some point in a BIOS update.
mafuyu 2 days ago

Wow, thanks for this tip! I've been dealing with suspend issues with an X570 Aorus Master as well.
Running `echo GPP0 >> /proc/acpi/wakeup` from a systemd unit at boot solved the issue for me... except the first sleep after a boot would always wake back up immediately.
I applied your udev rule and that issue seems to be resolved as well!
- jorvi 2 days ago
  
  This is more so for your future unit file use: did you use `Type=oneshot` and `RemainAfterExit=yes`?
  I remember there being some strange interaction with the wakeup behaviour being toggled otherwise. But this could be due to me being on NixOS.
  - mafuyu a day ago
    
    I just did `ExecStart` with `multi-user.target`. That implies the unit is `simple`, so it very well could be sequencing incorrectly at boot and failing. That's a good point; I'll have to keep that in mind!
    
    jorvi 18 hours ago
    
    Apologies for the confusion, I don't mean it was failing to run.
    If you don't add "RemainAfterExit", the service will run at every boot, because after a reboot it is considered "inactive". This will execute your shell code, which effectively toggles wakeup.
    "RemainAfterExit" is meant for unit files that change the state of your system. After running once, the service will be considered "active", until you manually deactivate it, which will execute whatever you might have set in "ExecStop".
    "Type=oneshot" is necessary for "RemainAfterExit".
    In this case I still would prefer doing it via udev though. I've made it my rule of thumb to evade shell scripting wherever feasible, because it usually ends up being more brittle, and more prone to footgunning :)
bArray 2 days ago

> I am not sure how correct this assumption is. S3 is supposed to cut power to everything but RAM, but for example Gigabyte Aorus motherboards are notorious for an NVMe SSD sleep bug that randomly prevents the system from properly sleeping or waking.
You would hope that you could probe the hardware to see if it really is in sleep or not, or that re-waking the hardware would not cause issue if it never went to sleep.
Also I would expect that you could send a sleep command to the PCIe device, then try to sleep the bus itself. Then to wake, you would bring back the bus and then wake the device.
krastanov 2 days ago

sigh, I have been struggling with this issue for a while, but this did not seem to work either. I have documented it here: https://bbs.archlinux.org/viewtopic.php?id=302440
Any further insight you might have on these Aorus wakeup issues? In particular, it seems the wakeup in my case is coming from `.../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:45/wakeup/wakeup6` which does not really mean anything to me.
- jorvi 2 days ago
  It doesn't to me either, but following a tip from someone higher up the comments you could try
  ACTION=="add", KERNELS=="0000:00:45.6", ATTR{power/wakeup}="disabled"
  Maybe ACTION=="change" and / or KERNELS=="0000:00:45" or KERNELS=="0000:00:45.?"
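
One way to make a `wakeup/wakeupN` entry less opaque is to read the attributes next to it. The sketch below assumes the wakeup-source directory exposes `name` and `wakeup_count`, and that the parent ACPI device directory exposes a `path` attribute holding its ACPI namespace path; check that these exist on your kernel. `describe_wakeup_source` and `read_sysfs_attr` are hypothetical helpers, and the path in the usage comment is a made-up example.

```python
import posixpath
from typing import Callable, Dict, Optional


def read_sysfs_attr(path: str) -> Optional[str]:
    """Read one sysfs attribute, returning None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def describe_wakeup_source(
    source_dir: str,
    read_attr: Callable[[str], Optional[str]] = read_sysfs_attr,
) -> Dict[str, Optional[str]]:
    """Collect identifying attributes for a '<device>/wakeup/wakeupN' directory."""
    source_dir = posixpath.normpath(source_dir)
    # Two levels up from .../<device>/wakeup/wakeupN is the device itself.
    device_dir = posixpath.dirname(posixpath.dirname(source_dir))
    return {
        "device_dir": device_dir,
        "name": read_attr(posixpath.join(source_dir, "name")),
        "wakeup_count": read_attr(posixpath.join(source_dir, "wakeup_count")),
        # Assumption: ACPI device directories carry a 'path' attribute.
        "acpi_path": read_attr(posixpath.join(device_dir, "path")),
    }


# Example call (hypothetical path; substitute the one from your own logs):
#   describe_wakeup_source(
#       "/sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/device:1f/wakeup/wakeup9")
```

If `acpi_path` comes back with a name that also appears in /proc/acpi/wakeup, that is the entry to try disabling.
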
- faraggi 2 days ago
  
  I had this issue and this MB for years. I eventually solved it by physically removing a crappy USB-C PCIe card I'd bought because my case didn't have any USB-C ports.
  (Additionally, I had also previously disabled PCI wakeup buses and haven't touched that again since it's working.)
  Hope that info helps.
gU9x3u8XmQNG 2 days ago

Are these fixes, or workarounds?

lorenzbrun 2 days ago

Author of memreserver (one of the mentioned userspace workarounds) here. I've debugged this a few years back, only public comment I can quickly find is [1]. I also remember some mailing list discussions, but it basically came down to the issue that Linux didn't have staggered suspend hooks that reliably ran before disks and parts of the memory subsystem were frozen. Apparently this is now possible. Sadly the Freedesktop Gitlab doesn't seem indexable so this knowledge seems to have gotten lost.

[1] https://gitlab.freedesktop.org/drm/amd/-/issues/2125#note_17...

sabujp 3 days ago

This is amazing work! If folks have ever wondered why suspend is so difficult to get working on linux and why debugging it is equally difficult, this is a single datapoint with lots of information about all the things that can go wrong. Even now I have a thinkpad P1G4 where the fans won't turn off automatically unless I turn them off before going into suspend. Recently I also started having crackling issues with my bluetooth headphones after resuming from suspend and had to disable node suspension there also (https://wiki.archlinux.org/title/PipeWire#Noticeable_audio_d...).

Apofis 3 days ago

Remarkable that it's 2025 and laptop sleep/suspend still doesn't work right on linux. I think the first time I encountered this was probably 15 years ago now?
- xondono 3 days ago
  
  Sleep & suspend doesn’t work on Windows either.
  Power control is the kind of stuff that benefits from very tight integration, and PCs just don’t have that.
  Firmware is seen by most vendors as a pure cost to minimize, so you get a fragmented market full of subcontractors delivering the bare minimum that is considered “working”. Manufacturers also know most people aren’t going to use a big part of the functions they’re supposed to provide to OSes, and no one is really checking them, so it’s very common for devices to have only partial support for things they supposedly do.
  - kiwijamo 3 days ago
    
    Even Apple struggled to get it working perfectly in my experience across several models in the PPC/x86 era. Yes they are better(-ish) but when I had Apple laptops I'd still see weird sleep/wake issues in around 1 in every ~50 sleep/wake cycles. I also had one Apple laptop which had its battery going from 100% to 0% overnight during sleep requiring a cold start in the morning on a regular basis despite it being put to sleep the evening before and seemingly going to sleep without issues. Lenovo manages to do sleep/wake fine in Linux almost as well as Apple in my experience and I sleep/wake my Lenovo laptop regularly -- this is across two different models I have used so far (X1 and X390). Hopefully Apple has improved this in their ARM laptops but haven't used them much so can't really comment on ARM.
    
    nyarlathotep_ 2 days ago
    
    Even now on ARM it's not perfect. My M1 Mini will wake from sleep to a greenscreen and then crash/reboot ~once every few months, irrespective of uptime.
    My work Mac (M1 Pro) occasionally locks up and reboots when waking from sleep too, at about the same frequency.
    Always wondered why this was such a difficult problem to solve, seemingly irrespective of operating system. (Linux has never been acceptable or reliable in this respect, IME, regardless of distro and hardware configuration.)
    
    turtlebits 2 days ago
    
    I've never had an issue with wake/sleep on any of my M1/2/4 devices. The only issue I can ever recall with sleep was the 2019 16" Intel (which had a host of issues).
    
    adolph 2 days ago
    
    > Lenovo manages to do sleep/wake fine in Linux almost as well as Apple in my experience and I sleep/wake my Lenovo laptop regularly -- this is across two different models I have used so far (X1 and X390).
    I'd rate my OpenBSD X1 as not terrible too. Not as smooth as MacBook but adequate.
    
    winrid 2 days ago
    
    Carbon or Extreme? I can't even get resume on windows to work properly on my 1st gen X1 Extremes
    
    adolph 17 hours ago
    
    Carbon, gen6, chosen after reading Joshua Stein [0]. As I understand it, OpenBSD and NVIDIA aren't a lovely pairing.
    0. https://jcs.org/2019/08/14/x1c7
    
    lyu07282 2 days ago
    
    I always had the suspicion that this has more to do with a lot of kernel hackers using Thinkpads, so they fix them up, rather than Lenovo doing a good job.
    
    sroussey 2 days ago
    
    It’s much better on ARM. And their external monitor support is so much faster and more reliable now. Having control over all their hardware has made a noticeable improvement.
    
    tiagod 2 days ago
    
    This is not my experience. My M1 Pro MacBook has very strange issues with sound over HDMI. I usually need to reboot it when I connect it to my TV or media won't play if the sound is output over HDMI.
    
    Shadowmist 2 days ago
    
    Every time I turn on my M1 Max Mac I have to unplug and replug the HDMI but other than that the machine is a dream to use.
    
    smallpipe 2 days ago
    
    Same here on an M1. If you find a solution…
    
    bzzzt 2 days ago
    
    That's annoying but could also be caused by the monitor or TV HDMI implementation.
    
    tiagod a day ago
    
    I don't think so. In that state, video will jump and stutter wildly, run very fast, and audio will be very broken. It's not just the sound that's affected.
  - stephen_g 2 days ago
    
    I was amazed when I got an 11th-generation Intel NUC and it didn't sleep/wake properly.
    I would have expected it with a cheap clone, but this was a fully integrated computer made by the same people who designed and manufactured the processor itself!
    I can't remember if it ever fixed it - maybe after a year or two there were finally newer EFI and graphics drivers that fixed the issue, but maybe it never did? In the end I ended up getting a newer machine to replace it anyway and it became my home server so just runs Proxmox now and doesn't need to ever sleep...
    
    emmelaich 2 days ago
    
    I know the NUC I had had its Ethernet port put PERMANENTLY to sleep.
    There was, eventually, a Windows program which hacked the device to take it out of sleep. But I had no Windows OS on it. Before I got around to installing enough Windows, the NUC died for other reasons :-(
  - mindcrime 2 days ago
    
    > Sleep & suspend doesn’t work on Windows either.
    Or Macintosh. My $DAYJOB Macbook sleeps properly about 2 out of 10 times, at best. Most of the time it fails to sleep and by the next morning when I open it up, the battery is dead. :-(
    For comparison, my System76 laptop running PopOS! sleeps perfectly, every time with no issues. shrug
    
    vlovich123 2 days ago
    
    > My $DAYJOB Macbook sleeps properly about about 2 out of 10 times, at best. Most of the time it fails to sleep and by the next morning when I open it up, the battery is dead. :-(
    Important to remember that work laptops typically install all sorts of crap spyware and fleet management software that causes the system to misbehave. That’s not as much on Apple although not protecting their brand against such software is on them.
    
    mindcrime 2 days ago
    
    Good point. And this machine does have the typical stack of enterprise spyware crapola. Curiously enough though, sleep / suspend do work occasionally. Just not consistently.
    
    stock_toaster 2 days ago
    
    > Or Macintosh. My $DAYJOB Powerbook...
    You use a 20+ year old[1] PowerPC laptop for your dayjob?
    Talk about hardware longevity!
    [1]: Powerbooks were last made in 2006 I think?
    
    mindcrime 2 days ago
    
    Typo/brainfart/whatever. I meant to say Macbook. It's about 2 years old.
  - bachmeier 3 days ago
    
    > Sleep & suspend doesn’t work on Windows either.
    What's strange is that it never used to be a problem. There are five Windows laptops floating around our house at various times (mixture of work and personal) and suspend works properly on none of them. Oddly, it works on my personal laptop with Debian Stable almost every time, failing maybe 1/25 times. Other distros are about the same as Windows.
    
    asmor 3 days ago
    
    Modern Standby. Windows wanted to do the Apple "power nap" stuff, but never realized how painful it'd be if you don't control all the hardware and have millions of different hardware permutations (with a lot of terrible drivers) instead of just a few. Not that it would've helped: half the time my machine is either overheating or off, it seems to be wake timers doing Windows updates (which yes, you can disable, but most wouldn't).
    I don't get why S3 sleep had to die for this, but it did.
    
    carlhjerpe 3 days ago
    
    I'm super happy with S0 on Linux. The implementation is all about doing as little as possible but effectively remain "on".
    
    vladvasiliu 2 days ago
    
    What's the point, though? If it doesn't actually do anything, why keep it on?
    
    carlhjerpe 11 hours ago
    
    Because you don't have to reinitialize hardware if you don't shut it off. Which is hard and causes problems. It's easier to just go into power saving++.
    If you want your device to be off, power it off.
    
    jorvi 2 days ago
    
    > Windows wanted to do the Apple "power nap" stuff
    On Linux, you can run a systemd unit file that will trigger `rfkill` on sleep and a different `rfkill` invocation on wake, and you effectively dodge all that crap because the laptop isn't connected to WiFi and thus will sit around realizing it's spinning its wheels and will shut down further down the S0 chain.
    > I don't get why S3 sleep had to die for this, but it did.
    Worse yet, the dirty little secret is that many laptops that offer both S0 and S3 will actually drain more energy in S3 than in S0 because the S3 mode has had poor QA.
    
    OvbiousError 3 days ago
    
    My colleague showed me his windows machine recently. The rubber on the back around the fans has melted from the times he forgot to shut it down and sleep didn't trigger when he packed it away in his backpack.
    
    dhon_ 3 days ago
    
    Linus Tech Tips on YouTube did a video about a Windows bug where sleeping while charging would allow the laptop to wake up to check for updates etc., but often caused this issue of turning on in a bag.
    
    whizzter 2 days ago
    
    It wouldn't happen to be that this feature was released around early/mid 2020? Windows sleep used to be semi-reliable, but it's been shit for a couple of years.
    (Any link to the video/docs for turning it off?)
    
    dharmab 2 days ago
    
    Search "LTT Windows Modern Standby" on YouTube. Sadly all the workarounds to turn it off no longer work reliably. For reliable sleep, buy a Framework (only current Windows laptop that still supports S3 Sleep) or Macbook.
    
    isodude 2 days ago
    
    TL;DR: pull the plug from the laptop _before_ closing the lid. That way it will not be sleeping while thinking it has power from the wall.
  - jml78 2 days ago
    
    Are there issues in Windows? Sure, but if you give me 100 laptops, 80% will do this right without any issue. Maybe 30% of those laptops will work right on any Linux distro without major fucking around with bullshit trying to make it work. Yes, those numbers are made up, but I have been running versions of Linux since Slackware in the 90s. I still have a desktop with an AMD CPU and Nvidia GPU that I can’t get to sleep/suspend right. Works fine when dual-booted in Windows. I just gave up and manually do shit now when using Linux.
    
    dharmab 2 days ago
    
    Oh no, Windows Modern Standby is infamously terrible and unreliable. Here is a youtube video with millions of views explaining the problems in detail: https://youtu.be/OHKKcd3sx2c
    
    iamtedd 2 days ago
    
    Two years ago, that video was published.
    Good thing it's all fixed now, eh! /s
    
    weberer 2 days ago
    
    0% of new Windows laptops support proper S3 sleep mode since Microsoft gutted it in favor of "modern standby".
    
    dharmab 2 days ago
    
    Framework does! I think they might be the _only_ one to consistently support it.
    
    lenkite 2 days ago
    
    Once upon a time, Windows sleep was reliable 99.xy% of the time. If you put a Windows laptop to sleep it stayed asleep cozily.
    Now it's like an elderly nursing home patient who wakes up several times and raids the kitchen to eat all the time.
    The fall of a once great OS is sad to see. I guess AI matters more to Microsoft than their Core OS nowadays.
- washadjeffmad 3 days ago
  
  This space is problematic enough that you could reliably segfault 2017-2019 Intel MacBooks by closing the lid before unplugging HID peripherals, preventing suspend (and cooking it in your bag on the commute home).
  It also plagues Windows on custom PC builds, even when there are vendor drivers. Not every component plays nicely with suspend states, ASPM, C-states, load line calibration, etc. And while often the capability exists natively to address issues (in BIOS, Linux, etc), how many people know how to start looking?
  - janderson215 3 days ago
    
    Hmm I thought that was a feature, not a bug. I used to leave everything plugged and close the lid if I wanted big downloads to keep going or wanted an even quicker start up.
    
    c-hendricks 3 days ago
    
    The computer staying awake when there's a monitor / keyboard / other HID connected is one thing.
    The computer locking up when said devices are removed, and then not properly going to sleep, is a different (much worse) issue.
  - burnte 3 days ago
    
    My custom-build desktop had an issue with the previous AM4 motherboard I had where if you told Windows to hibernate, the entire machine would shut down as normal, but then a few seconds later it'd wake back up by itself and unhibernate. I had to turn off the PSU power switch during those few moments to keep it in hibernate mode. New mobo and that's gone now. BIOS updates never helped. Really odd.
    
    iamtedd 2 days ago
    
    Probably no help to you now, but if someone has a Windows machine that keeps waking up by itself, you can see what triggered the wake by running
    powercfg /lastwake
    in a command window as admin.
    
    burnte 2 days ago
    
    True, and that helps if it won't go to sleep. This was some weird bug in either the board or the bios.
- rikthevik 2 days ago
  
  After my (closed) gaming laptop started making annoying Windows noises earlier today, I'm led to believe that it doesn't work properly on Windows either.
  It seems like it's basically hardware whack-a-mole at this point. The only reason Apple does it reasonably well is they control more of the stack and they support less hardware. The only reason Windows does it better than Linux is they have more eyes on it.
  - ch33zer 2 days ago
    
    The reason it works better on Mac and Windows is because they're designed to be desktop OSs. On Linux the funding goes to things like cloud services and Android use, which decidedly does not include suspension or other desktop features.
    
    heavyset_go 2 days ago
    
    The article we're replying under has the author working with Mario Limonciello, an AMD employee and kernel developer.
    I've interacted with him countless times on the kernel mailing list and bug tracker, he's literally paid by AMD to work on Linux support for AMD's consumer desktop hardware.
- carlhjerpe 3 days ago
  
  If you have a modern machine with S0 sleep, which is "modern standby" it's very much solved. What it does is it pauses all userspace processes, disables all cores but one and keeps it running on the lowest frequency. The system stays "on" but all devices go in power-saving state which is good enough for days.
  So it's not really a problem unless you really wanna do deeper sleeps.
  - grayhatter 3 days ago
    
    > so it's not really a problem unless you really wanna do deeper sleeps.
    the way I parsed this was; so it's not really a problem unless you want to use your computer the way you want to use it.
    I get things are complicated, and hardware support is a mixed bag. But it doesn't have to be this way.
    
    carlhjerpe 2 days ago
    
    My honest experience is that S0 is a godsend, when you use a device on a weekly basis S0 is good enough and it just works, no messing and fiddling and tweaking, just running.
    Chasing "real sleep" gives me nothing but pain. Also Android devices "sleep" fully awake so it's really "what people are doing".
  - makeitdouble 2 days ago
    
    Another way to put it: nobody solved this problem, so the next move was to embrace never sleeping and market it as a feature.
    Microsoft also went that route with the Surface line, it just never sleeps.
- bongodongobob 3 days ago
  
  It doesn't work right on Windows either to be fair. With a mixed laptop fleet at work, we've just disabled sleep/hibernate company wide because it causes way too many problems.
  - larrik 3 days ago
    
    That seems like a good way to cook your laptop when you throw it into a bag with the CPU pegged.
    
    bongodongobob 3 days ago
    
    Not if you turn it off first.
    
    Joker_vD 2 days ago
    
    For some reason, people really insist on being able to put their machines to sleep, I honestly don't know why. Long boot times, maybe?
    
    bongodongobob 2 days ago
    
    I mean I totally get it. At the end of the day I have a billion tabs open and 14 instances of notepad. But after getting burned by sleep failing enough times, you gotta change your habits.
    
    Joker_vD 2 days ago
    
    Firefox preserves opened tabs on close/re-open, and so does Notepad++/Sublime (and for those, it works even for tabs you've never saved as files). And let's be honest, losing most of those browser tabs is inconsequential.
    So while I get the "I just want sleep to work, dangit" attitude (it really should just work, to be honest) the fact is that it barely does work. Seriously, it took this long to realize that VRAM contents may not fit into RAM entirely, so powering down the drives that hold the swap should probably be postponed to the last moment.
    
    fc417fc802 2 days ago
    
    That won't restore my 3+ neovim instances, including loaded buffers, associated undo trees, tabs, and splits. Neither will it restore the various PDFs I have open, the file browser instances pointed at specific locations, nor the containing window and desktop layouts for all of that.
    It's pretty unbelievable when you think about it. The majority of mainstream progress on application state management has taken place in mobile operating systems and essentially amounts to the expectation that you won't lose data if and when your process is unexpectedly forced to terminate without user interaction. Forget actually picking up exactly where you left off.
    And as long as I'm complaining. All of my sshfs mounts tend to break if I sleep-wake as well. Remounting them generally doesn't fix programs that were using them (for obvious reasons) - I usually have to manually close and reopen all of that.
    
    robertlagrant 2 days ago
    
    MacOS is pretty good at it as well.
    
    bongodongobob 2 days ago
    
    I know, I'm agreeing. Leaving everything "running" is just the ideal path of least resistance. So we have sleep mode.
    
    trinsic2 2 days ago
    
    LOL yeah... this always makes me laugh. I always get the "well it should work". My response: "Well, shoulds don't mean anything when it comes to computer technology". People want to conform to what the technology may be able to do, instead of working around it when it can't do what it was designed to do. It's a huge time sink for most people.
    Just change your fucking habits and move on. I think Linus Tech Tips complained about an unrelated but equally frustrating issue with S0 a while back. I fucking hate S0; it makes me feel like tech companies are trying to force people to have always-on computers by removing S3 support entirely from modern hardware.
- dismalaf 2 days ago
  
  Recent Windows laptops have even more issues. My wife literally never suspends her Windows laptop for this reason. Meanwhile my Intel/Nvidia laptop running Debian works flawlessly (albeit with Nouveau drivers, gave up on the proprietary ones for reasons unrelated to suspend).
- caycep 3 days ago
  
  it's arguably not great on windows either... (see Gigabyte Aorus comment above)
- trelane 2 days ago
  
  This is highly hardware dependent. Modern hardware is complex enough that it pretty much has to explicitly support Linux or it won't work well.
  I have been buying System76 for about twenty years now, and this has not been an issue for me either.
- xtracto 2 days ago
  
  Even hibernate has problems. The power features are some of the things I wish worked better in Linux in general :(
- ycui1986 2 days ago
  
  They don’t work on Windows either. Several of my laptops crash on wake-up 30% of the time.
jamesdutc 2 days ago

It can be really hit-or-miss, and it can be really hard to debug errors like in the post.
A lot of workarounds that are suggested for various issues are also not really viable. Some of the workarounds involve turning off different power-saving modes; however, the point of enabling sleep is often to increase the amount of usable time between charges, and turning off these power-saving modes can often dramatically shorten battery life.
But getting sleep to work (even S0ix!) is not impossible.
I have a bunch of handheld AMD 7840U and AMD 8840U devices that I have installed Arch Linux on: GPD Win Max 2, GPD Win Mini, GPD Win 4, Minisforum V3, OneXPlayer X1 Ryzen. These devices were not designed with Linux support in mind. I would be very surprised if the companies that made them ever tested them with Linux. Yet with just a small amount of work (generally fiddling with `/proc/acpi/wakeup` and `/sys/devices/*/*/*/power/wakeup` to disable sources of spurious wakeups,) I have gotten essentially flawless S0ix support (… on all but the newest OneXPlayer X1 Ryzen.)
(In general, out-of-the-box stock Linux kernel support on these devices is fantastic. Touchscreens work, pen input works, wifi and Bluetooth work well. The only gap I've seen is fingerprint reader support.)
I suspect that given how small these manufacturers are (and how small their production batches must be,) there's much less extreme-customization and tight-integration of components. This is visibly evident in the form-factors of these devices, which are many millimeters thicker than they might otherwise be. (Of course, these devices are primarily advertised to a gaming audience who are eager to avoid the thermal-throttling that happens with ultra-thin devices like Surface Pro…) I partially suspect that the lack of extreme-customization, the lack of tight-integration, and the smaller production batches mean that the manufacturers make much more conservative choices in components. Maybe this explains the exceptional Linux support?
- hasperdi 2 days ago
  
  Hi, do you have these tweaks published somewhere? I'm particularly interested in knowing your GPD Win Mini tweaks.
  Thanks
fulafel 2 days ago

To add: for the end user, the way to easily get working suspend is to buy known-good compatible hardware.
It's been solid on every business Thinkpad for a long long time for me, and it consistently seems that people on Windows with the same models have more sleep problems.

Gormo 3 days ago

My sincere personal thanks for this. My main laptop is a Ryzen-based ThinkPad running Linux that I suspend and hibernate regularly, and I sporadically encounter this issue. Looking forward to 6.14!

imp0cat 3 days ago

This. Thanks a lot!

mkesper 2 days ago

Why was dm->cached_state storing -12 instead of a pointer? Most likely this happened because earlier during suspend, dm_suspend() assigned dm.cached_state = drm_atomic_helper_suspend(adev_to_drm(adev)). The callee drm_atomic_helper_suspend() could return either a valid pointer, or ERR_PTR(err) which encoded errors as negative pointers. But the caller function assigned the return value directly to a pointer which gets dereferenced upon resume, instead of testing the return value for an error.

One more point for rust in the kernel. Just can't happen if you're required to handle a Result type.

vlovich123 2 days ago

You can also get algebraic sum types in C with the C preprocessor: https://github.com/Hirrolot/datatype99
But of course defaults matter and the kernel’s rich history of not modernizing coding practices is going to work against improvements in C land. Ironically, it’s that same resistance that frustrates the Rust devs so much because they're resistant to even cleaning up their own subsystems or putting down markers documenting how the subsystems are supposed to work.
Maybe https://github.com/llvm/llvm-project/issues/74205 would help once it trickles down into the kernel, but I suspect that people are still going to choose to do this manual overloading of the pointer instead of using types for safety.

jph 3 days ago

Your work will help me on a Framework AMD laptop with the GPU extension and dual boot Linux/Windows. May I donate to you or to your favorite charity? My contact info is in my profile.

lelandfe 3 days ago

Love this!

dekhn 3 days ago

I used to think that naming things, cache invalidation, and off-by-one errors were the 2 biggest problems in CS, but then I learned about the "sleep/wake" problem and realized it's NP-complete.

verall 2 days ago

I think sleep/wake is a subset of cache invalidation - if all peripherals were stateless, probably it wouldn't be an issue.
nikanj 3 days ago

Only on Linux though, on Windows it’s O(n²) and on macOS it’s O(log n)
- ncann 3 days ago
  
  With how much trouble I had with trying (and failing) to make my brand new Dell laptop sleep properly and not the "Modern Standby" crap, plus my desktop randomly breaking GPU hardware acceleration in browser after waking up, I would say it's around O(n⁴) now. Or maybe even O(n!).
- itsn0tm3 3 days ago
  
  Well only as long as you don‘t hackintosh. That stuff used to be a horror sometimes!
  - 0x38B 2 days ago
    
    Remembering the hours I spent going through KEXTs and bootloader config in ill-fated attempts to set up a Hackintosh fills me now with a kind of horror. Worst of all were the ACPI tables - SSDT and the like.
    In contrast to that, running MacOS in a VM is heaven. Figuring out how to pass my iPhone through to Xcode took as long as the initial setup.

jchw 3 days ago

Memory management and specifically OOM conditions remain an unbelievably painful nightmare on Linux. It's not like I run into these issues constantly, but I've definitely tried to debug issues like these (unsuccessfully). Ultimately if I OOM a machine I usually wind up installing more RAM, which is wasteful/expensive, but it's pretty clear that handling OOM conditions gracefully is going to be a hard problem for Linux to solve into the future.

This is really great work and will serve as a reference point for debugging similar issues in the future. Pretty happy about systemd's debug-shell feature, I had no idea that existed. I don't think my X670E Steel Legend board has a serial header anywhere on it, though. How do modern built-in serial ports work, anyway? Are they attached off of the chipset PCIe lanes?

Something that's also very useful when trying to dive into the Linux kernel is that there's a bunch of great talks discussing Linux kernel subsystems from conferences like FOSDEM and Linux Plumber's Conference which you can usually find recordings of online. For example, there's this one for TTM, the memory subsystem that most of the desktop GPU DRM drivers use:

https://www.youtube.com/watch?v=MG7_tUNKSt0

nyanpasu64 3 days ago

Windows says that my motherboard serial port is connected to the Pci Bus → PCI standard ISA bridge. Long live DOS!
Thanks for the video about TTM, I'll watch it when I have a chance.
Skunkleton 3 days ago

I’ve had good luck containing ooms with cgroups. I’m not sure if there is a state of the art for handling oom conditions beyond what Linux does. If anyone knows and can recommend some reading I would appreciate it.
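One possible shape of this, as a sketch and not necessarily the setup described above: cap a job with a cgroup memory limit, then check whether the OOM killer fired inside that cgroup. The parsing below assumes the cgroup v2 `memory.events` format (one `key value` pair per line), and the function name is made up.

```bash
#!/usr/bin/env bash
# Print how many times the OOM killer fired inside a cgroup.
# Expects the contents of a cgroup v2 memory.events file on stdin.
# The function name is hypothetical.
cgroup_oom_kills() {
    awk '$1 == "oom_kill" { print $2; found = 1 } END { if (!found) print 0 }'
}

# Example against this shell's own cgroup, assuming cgroup v2 is mounted at
# /sys/fs/cgroup and the memory controller is enabled for it.
cg_path="/sys/fs/cgroup$(sed -n 's/^0:://p' /proc/self/cgroup 2>/dev/null)"
if [ -r "$cg_path/memory.events" ]; then
    cgroup_oom_kills < "$cg_path/memory.events"
fi
```

A heavy job can be started under a cap with something like `systemd-run --user --scope -p MemoryMax=8G make -j32`, where the 8G figure is only a placeholder. Whether the user session is allowed to set memory limits depends on the distro's cgroup delegation.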
- jchw 3 days ago
  
  There's really two problems as I understand it:
  - Overcommit. Linux will "overcommit" memory: allocations will succeed when there's no memory, and then hang when the page is actually mapped if no physical pages are available (to my understanding.) Windows NT doesn't do this. Not sure exactly how macOS/XNU handles it.
  - The OOM killer. Because allocations don't fail, to actually recover from an OOM situation the kernel will enumerate processes and try to kill ones that are using a lot of memory, by scoring them using heuristics. The big problem? If there isn't a single process hogging the memory, this approach is likely to work very poorly. As an example, consider a highly parallel task like make -j32. An individual C++ compiler invocation is unlikely to use more than a gigabyte or two of memory, so it's more likely that things like Electron apps will get caught first. The thrashing of memory combined with the high CPU consumption of compilers that are not getting killed will grind the machine to a near-complete halt. If you are lucky, then it will finally pick a compiler to kill, and set off a chain reaction that ends your make invocation.
  There are solutions... Indeed, you can use quotas with cgroups. There's tools like systemd-oomd that try to provide better userspace OOM killing using cgroups. You can disable overcommit, but some software will not function very well like this as they like to allocate a ton of pages ahead of time and potentially use them later. Overcommit fundamentally improves the ability to efficiently utilize all available memory. Ultimately I think overcommit is probably a bad idea... but it is hard to come up with a zero-compromises solution that keeps optimal memory/CPU utilization but avoids pathological OOM conditions by design.
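  To see which policy a given machine is running, a small sketch follows. The three values are the documented modes of `vm.overcommit_memory`; the function name is made up for illustration.

  ```bash
  # Translate a vm.overcommit_memory value into its documented meaning.
  # The function name is hypothetical.
  describe_overcommit_mode() {
      case "$1" in
          0) echo "heuristic overcommit (default)" ;;
          1) echo "always overcommit" ;;
          2) echo "strict accounting, no overcommit" ;;
          *) echo "unknown mode: $1" ;;
      esac
  }

  # Show the current policy and the kernel's commit accounting, if available.
  if [ -r /proc/sys/vm/overcommit_memory ]; then
      describe_overcommit_mode "$(cat /proc/sys/vm/overcommit_memory)"
      grep -E '^(CommitLimit|Committed_AS):' /proc/meminfo || true
  fi
  ```

  Switching to strict mode is `sysctl vm.overcommit_memory=2` as root. In that mode, `Committed_AS` approaching `CommitLimit` is the point where allocations start failing instead of being granted optimistically.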
  - fc417fc802 2 days ago
    
    > two problems ... overcommit
    Is there any other sensible way to do this though? It would be quite inefficient to constantly call mmap for additional small(ish) pieces of memory. In effect overcommit just means that until the page is actually written to it hasn't really been allocated. (Aside: I believe a malloc implementation that zero'd out blocks on allocation would fail abruptly rather than later in case that happens to be what bugs you about it.)
    Additionally how do you suppose fork should be implemented efficiently? Currently it performs copy-on-write. At minimum you'd need a way to mark pages as "never going to write to these, don't reserve space for a copy". Except such an API is either very awkward to use in practice or else leaves you with some very awkward edge cases to deal with in your program logic.
    > You can disable overcommit, but some software will not function very well
    Yeah about that.
    Chromium runs (AFAIK) 1 PID namespace per tab. On my machine right now it reports 1.1 TiB virtual memory with a little over 100 MiB resident per tab. 1.1 TiB mapped PER TAB. Of the resident I have no idea how much is actually unique (ie written to following the initial fork).
    Firefox is much more reasonable at a mere 18 GiB mapped per PID.
    
    nolist_policy 2 days ago
    
    > Chromium runs (AFAIK) 1 PID namespace per tab. On my machine right now it reports 1.1 TiB virtual memory with a little over 100 MiB resident per tab. 1.1 TiB mapped PER TAB. Of the resident I have no idea how much is actually unique (ie written to following the initial fork).
    This is most likely a trick for garbage collection or memory bug hardening or both. Haskell programs also map 1tb.
    
    jchw 2 days ago
    
    A potential workaround would be to still allow giant mmaps but not hang a program when it runs out of pages and instead send a signal to it. Obviously, neither Chrome nor Firefox actually use this much memory in practice.
    
    fc417fc802 a day ago
    
    Rather than a workaround I think that would just be an overall better approach. Receive an actionable error when the allocation happens "for real", whether that's at an arbitrary point in user code or when malloc zeros out the block ahead of time.
    However I think you'd need per-thread signal handlers for that to work sensibly. Which the kernel supports (see man 2 clone) but would require updates to (at least) posix and glibc.
    It would probably also be nice to have a way to allocate pages without writing to them. Currently we have mlock but that prevents swapping which isn't desirable in this context.
Avamander 2 days ago

> Memory management and specifically OOM conditions remain an unbelievably painful nightmare on Linux.
Yes. It's horrendous to put it mildly. Linux does not handle OOM conditions properly.
I know I can set up a few guardrails with cgroups. I know I can also install earlyoom. I know I can increase swap or use zram. In the end these are all fundamentally just nasty hacks that might spare one once in a while. They do not fix how these conditions are handled. Please do not offer these as solutions.
I've seen LUKS volumes mount themselves read-only because the kernel couldn't allocate memory in dm_crypt, for the love of god just kill something in userspace. The current state is utterly unacceptable and I'm tired of all the excuses.
jorvi 2 days ago

Have you been running zswap / zram?
With zstd you can turn 8GB of RAM into 20GB of 'RAM' without much issue, or 16GB into 40GB. Hell, if you're feeling adventurous (and Android does this, so it's very stable) you can overcommit your memory past 100%.

dralley 3 days ago

Fantastic news. AMD's linux graphics drivers have mostly worked great for me but this has been the one exception that I've hit multiple times.

MegaDeKay 3 days ago

My luck has been a little less good. Latest problem I'm having is the driver spamming my logs after waking from sleep with "[drm] scheduler comp_1.0.n is not ready, skipping" after "WARNING: CPU: 12 PID: 11871 at drivers/gpu/drm/amd/amdgpu/../display/dc/dc_helper.c:100 generic_reg_update_ex+0x1d2/0x290 [amdgpu]"
https://gitlab.freedesktop.org/drm/amd/-/issues/3911
- binkHN 2 days ago
  
  I have something similar with dmesg spam and a possibly related issue, sadly:
  https://gitlab.freedesktop.org/drm/amd/-/issues/3790
  - MegaDeKay 2 days ago
    
    Using an ICC profile per the last comment in that report didn't fix it for you?
juujian 3 days ago

Same great experience, but I experience similar issues when I disconnect thunderbolt with monitors when my machine is asleep. Laptop though, so very different driver set, no GPU via pci
- jorvi 3 days ago
  
  You can probably write a udev rule for the Thunderbolt / USB-C port with either ACTION=="offline", "remove" or "online".
  "offline" is for when your system turns off or suspends, "online" vice versa, and "remove" is self-explanatory.
  If you go with "offline", I'd look into hard disconnecting the monitors. This might cause monitor rearrangement (= you'll need to manually assign) or blinking on laptop bootup. Might also stop any charging. But that could be mitigated by checking the "subsystems" attribute.
  If you go with "online", you probably need to run some sort of clean-up / refresh script or rule.
  If you go with "remove", you'll need the same clean-up / refresh script.
  It'll take some trial-and-error, but it'll be satisfying once it works. Also highly recommended to check the NixOS repositories and official wiki and Arch wiki to see if your laptop or monitors have workarounds for their quirks.

dado3212 2 days ago

> So I did the natural thing: I saved and extracted the amdgpu.ko kernel module, decompiled it in Ghidra, and mapped the location of the crash in dm_resume to the corresponding lines in the kernel source.

This is always my favorite part of debugging.

gmokki a day ago

Has anyone tried the AMD suspend/resume patches from Alibaba? https://lists.freedesktop.org/archives/amd-gfx/2025-January/...

"We have tried to solve these issues case by case, but found that may not be the right way. Especially about the unbalanced irq reference count, there will be new issues appear once we fixed the current known issues. After analyzing related source code, we found that there may be some fundamental implementation flaws behind these resource tracking issues.

So we try to fix those issues by two enhancements/refinements to current device management state machines."

Daunk 3 days ago

For all the years I've been using Linux, I've always had some kind of sleep issues. I've used Intel, AMD, ATI, and NVIDIA hardware across countless distros and setups, yet nothing seems to make a difference, there's always something that doesn’t work properly with sleep or hibernation.

Honestly, it's one of the main issues I wish the Linux community would take a closer look at and finally fix!

kiwijamo 3 days ago

Interestingly sleep/wake is something I've found to work almost always just fine out of the box in Linux, including on machines where Windows has sleeping issues! It used to be quite bad but things have improved heaps over the last 10 or so years -- however I've also stuck with Lenovo laptops, which do generally seem to have better support in Linux.
- kristianp 2 days ago
  
  I agree, I have a Thinkpad with Intel processor + nvidia GPU purchased in 2023 and I have not had sleep issues. Ubuntu 22.04.
  The nearest thing to a sleep issue is that the screen is visible for a fraction of a second on some wakeups before the lock screen covers it. A bit of a privacy issue.
  Good to see that AMD might be getting better support, in part because of its growing popularity.
- freedomben 2 days ago
  
  Indeed, whatever Lenovo has seems mostly good. Not perfect, but does the right thing 19 out of 20 times, maybe more. Unfortunately that one time it doesn't work and roasts in my backpack it's a catastrophe :-(
Narishma 3 days ago

I think it's because there are too many subsystems involved in sleep/resume all being worked on as independent projects (kernel, drivers which sometimes have both kernel and user space components, init system, display server, desktop environment, probably others I'm not aware of). That said, I've had my share of sleep issues on Windows as well over the years, I suspect for the same reasons.
1970-01-01 3 days ago

This. Linux users must resort to bronze-age tooling in 2025; Crafting and launching handmade scripts by candlelight to diagnose their plethora of sleep issues. But the community likes it this way. Meanwhile, Mac users continue to have sleep that 'just works' and Windows users have an entire sleep troubleshooting toolkit:
https://learn.microsoft.com/en-us/windows-hardware/design/de...
- nyanpasu64 2 days ago
  
  I once ran into a corporate laptop that had been downgraded from Windows 11 to 10, that would burn up its CPU during Modern Standby and eventually enter hibernate after burning a good fraction of its battery. The sleep study identified various PCIe devices and I tried installing drivers but they did not help. I wonder if it would've worked better on Windows 11 with stock drivers.

nyanpasu64 3 days ago

Update: I upgraded to an Intel Arc B570 GPU... and ran into the exact same problem on an independent driver: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4288

KronisLV 2 days ago

Oh hey, another Intel Arc user! I got the LE B580 and so far it's been a pretty good experience for me, though I don't really use suspend.
Also I got curious about your blog (haven't heard about Zola, recently migrated my own blog over to a newer version of Grav, which is a flat file CMS with no DB) and saw your post about looking for a new job. Hope things work out for you, stay safe!
- nyanpasu64 a day ago
  
  All the non-scalped B580s were out of stock by the time I decided to buy a GPU :( Thanks for the moral support!

whatever1 3 days ago

Apple became a trillion dollar company by mastering sleeping / waking up of electronic devices.

Why does nobody else see this?

kiwijamo 3 days ago

I have had several Apple devices over the last 20 years and every single one has had the occasional sleep/wake issues. Yes it is generally better but I would stop short of saying Apple has mastered it. I've observed wake failures on all Apple laptops I've owned -- say 1 in every 50 or so wakes will fail. I had one Apple laptop in particular struggle to stay in a sleep state overnight to the point it'd often be completely flat the next morning requiring a cold boot. YMMV but Apple hasn't completely solved the sleep/wake issues in my experience. My Lenovo laptops under Linux and Windows achieve sleep/wake reliability almost as good as Apple/macOS IMHO.
meowkit 3 days ago

Because they now own the hardware, and macOS and iOS run exclusively on their hardware.
Source: I work on windows power management and I know system engineers at apple.
- tredre3 3 days ago
  
  That's the often repeated argument. But as a counter point you have the Google Pixel. Google owns the hardware, even the SoC, and the software. And yet, battery is still poorer than third party Android phone manufacturers. And let's not even compare to Apple.
  So controlling the entire stack isn't enough. There has to be a desire to do better, as well as technical competency.
  - dismalaf 2 days ago
    
    Dunno, I have a Pixel 8 Pro, battery lasts all day unless I play demanding games (which nowadays is never).
  - pxx 2 days ago
    
    third-party manufacturers cheat for battery life benchmarks and they don't even get that much better battery life. the current state of the pixel phones allows me all day battery life without such headaches.
    in my experience with OnePlus, it aggressively kills and/or throttles applications that you want to be running in the background (e.g. fitness trackers or even audio players) even when you put them on the battery allowlist.
    at the same time, my iPad just randomly decides to drain all of its battery even when it's been sitting unused without any apps installed. I don't have this problem on Pixel devices. (you do get this problem as soon as you install some misbehaving apps but that's something you've done to yourself)
  - fulafel a day ago
    
    It's mostly unrelated to the platform "mastering" sleep/wakeup, those work really well on Linux as used in Android.
rafaelmn 3 days ago

I still have a 2018 i9 that will drain overnight in sleep mode, and that sleep discharge put the most battery cycles on the device. I think they only fixed it when they ditched the x86 ecosystem.
edoceo 3 days ago

That and app store fees
megous 3 days ago

Pretty much all smartphones can do this.
- whatever1 2 days ago
  
  *Recent. It took many years for smartphones to catch up with iPhones in terms of energy drain. Even today iPhones have the smallest batteries across their competitors.
  - megous 15 hours ago
    
    Battery life has nothing to do with suspend/resume reliability.
    Every smartphone does suspend/resume thousands of times each day, it has to be nearly 100% reliable feature and it is, otherwise users would be majorly pissed off.
thomasjudge 3 days ago

I think there was a little more to it
talldayo 3 days ago

I think it could be argued that the Mac contributes nearly nothing to Apple's current trillion-dollar valuation. If the Mac was spun out into its own business it would be lucky to crest a $100B market cap.

voytec 3 days ago

I've had zero problems with S3 wake/sleep on AMD ThinkPad with FreeBSD for years. And FreeBSD uses AMD drivers pulled from Linux. How is this still a problem on Linux?

    hw.acpi.lid_switch_state=S3

larrik 3 days ago

I battled sleep issues on my laptop for months, and had very different results than most people with my model. I think the behavior is a total crapshoot from machine to machine.
In fact, it turned out that a BIOS update that happened in the middle of my issues broke sleep functionality for the whole machine for a few months, so that wasn't even my fault.
kiwijamo 3 days ago

Ditto. My Thinkpad X1 and X390 both have reasonably reliable sleep/wake under Debian.

zrm 3 days ago

> To make room for VRAM, memreserver allocates system RAM based on used VRAM plus 1 gigabyte, then fills the RAM with 0xFF bytes and mlocks the memory (so none of it is swapped out).

That seems like a bit of trouble if you have 16GB of system RAM and a 24GB GPU.
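A quick back-of-the-envelope check of that worry, as a sketch: it takes the quoted rule (used VRAM plus 1 GiB) at face value and ignores whatever else is already occupying system RAM. The function names are made up for illustration.

```bash
#!/usr/bin/env bash
# Size of the reservation described in the quote: used VRAM plus 1 GiB.
# Takes and prints bytes. The function names are hypothetical.
reserve_bytes() {
    echo $(( $1 + 1024 * 1024 * 1024 ))
}

# Succeeds (exit 0) if the reservation for $1 bytes of used VRAM fits in
# $2 bytes of system RAM.
reservation_fits() {
    [ "$(reserve_bytes "$1")" -le "$2" ]
}

# Worked example: a fully used 24 GiB card on a 16 GiB machine.
gib=$(( 1024 * 1024 * 1024 ))
if reservation_fits $(( 24 * gib )) $(( 16 * gib )); then
    echo "fits"
else
    echo "does not fit"
fi
```

Since the rule is based on used rather than total VRAM, the problem only bites when the card is actually close to full. On amdgpu the current figure can be read from `/sys/class/drm/card0/device/mem_info_vram_used` (the `card0` index is an assumption and varies per system).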

sidkshatriya 3 days ago

TL;DR:

During suspend, for graphics cards, GPU VRAM needs to be transferred to system RAM.

However, during high memory usage scenarios the VRAM + RAM usage could exceed system memory -- this would ordinarily involve system swap coming into play and handling the temporarily out of memory issue. However system swap was already deactivated when it came time to suspending the AMD card causing all sorts of problems.

The fix was asking the GPU to evict its VRAM to system RAM via the hook ("suspend prepare") before swap was deactivated in linux kernel.

nyanpasu64 3 days ago

<s>Technically it was the suspend notifier; even suspend prepare executes after swap is disabled.</s> See replies.
- sidkshatriya 3 days ago
  
  Isn’t the hook called PM_SUSPEND_PREPARE as per https://nyanpasu64.gitlab.io/blog/amdgpu-sleep-wake-hang/#so...
  - nyanpasu64 3 days ago
    
    Sorry I had forgotten my own article :( It is confusing that PM_SUSPEND_PREPARE and dpm_prepare share the same word.

badsectoracula 2 days ago

I wonder if this will help a similar problem i have with my AMD GPU: very often when i wake/resume the PC, the output is almost frozen. "Almost" because it actually isn't frozen, if there is any output/animation/etc going on it plays fine, but once i try to move the mouse it freezes and everything updates at a single frame per couple of seconds - sometimes freezing completely. I can usually Ctrl+Alt+Fn to another virtual desktop in text mode and, if that is possible (i.e. the computer hasn't completely frozen, though sometimes it takes about a minute to switch), i can Ctrl+Alt+Fn back and everything works fine. Dmesg has a ton of spam messages from amdgpu after that.

AFAICT (from the behavior) something isn't properly saved/restored and communicating with the GPU (the mouse cursor is a hardware cursor thus needs to send commands to the GPU to update its position) causes some sort of issue. Switching to another virtual terminal that is running in text mode probably forces the driver to reset its graphics state. Of course that is just my assumption based on what i see going on.

Weirdly enough this only happens after i replaced my RX 5700 XT with a RX 7900 XTX so it might be something GPU (or GPU arch) specific.

I've been considering plugging in my laptop to see if there is something i can figure out (GPU aside the PC is usable, but i guess if this is a kernel-side thing i'd need a second computer connected to it to debug it), but as this isn't something i've tried before (though i know someone who has and said it isn't anything special) my annoyance still hasn't gone over the "i need to get to the bottom of this" threshold :-P.

It'd be nice if 6.14 fixes the issue, though i am not sure as i rarely have more than 1/3rd of the system RAM (32GB) in use and VRAM (24GB) barely goes above 1-2GB of use outside games. But this post might be helpful in diagnosing the issue next time it happens :-).

nyanpasu64 2 days ago

I've heard reports of reduced frame rate issues on laptops from DMUB/DMCUB/panel self-refresh, though your issue sounds different and I probably can't usefully debug it without a full dmesg/journalctl --system log.

Asmod4n 3 days ago

> On my laptop, I opened a terminal and ran sudo minicom --device /dev/ttyUSB0 --baudrate 115200 to monitor the computer over serial. In addition to saving logs

You can just use screen for that and have a working terminal with color support et al.

vbernat 2 days ago

Another lightweight option is tio.

fowl2 2 days ago

I guess "hibernating" (writing VRAM to swap) works better than expecting userspace to gracefully handle device resets. One linear read vs. a thundering herd of processes re-initialising, decompressing, etc. should be more predictable/reliable at least.

I do wonder however how much VRAM is "volatile" - ie. framebuffers - and could just be thrown away. And web browsers seem to handle GPU resets just fine, so maybe they could opt-in?

nyanpasu64 2 days ago

I didn't mention in the article but the Nvidia drivers at one point would drop VRAM rather than preserving it, leading to corrupted RGB noise textures in window managers and browsers (and potential crashes though I don't think I encountered them). I suggested doing this on the AMD bug tracker (https://gitlab.freedesktop.org/drm/amd/-/issues/2362#note_20...), but the amdgpu developers weren't interested.
- fowl2 2 days ago
  
  Oh wow didn't really expect anything other than whole device loss. Just returning garbage does sound bad.
  Do you know if they tried to communicate with clients and were just ignored/not implemented or if the APIs just don't support it?
  A quick search indicates that "residency"[1] exists, but no idea the extent it's useful/implemented.
  [1] https://learn.microsoft.com/en-us/windows/win32/direct3d12/r...
  - nyanpasu64 2 days ago
    
    As I understand, KWin does call glGetGraphicsResetStatusARB(), and on Nvidia GPUs, sleep-wake (or opening apps if it crashed the GPU) would cause KDE to detect a graphics reset and print "Desktop effects were restarted due to a graphics reset" (https://github.com/KDE/kwin/blob/10c04995c1f9f82ddbd6610e5e0...). I haven't used Nvidia GPUs in years and don't know if this is still an issue. I think many apps don't check for graphics resets?
    I'm not sure if residency is relevant here; the Microsoft link indicates that eviction makes memory inaccessible from the GPU to make room for other memory, which explains why Linux uses the same name for "backing up" VRAM before sleep (through the same underlying mechanism).

yellow_lead 3 days ago

I have an Nvidia GPU and a sporadic crash (black screen) with no logs on Linux. I suspect it's a driver issue too. Going to try some of these tips to enable the debug shell, but I'm not sure if they'll be effective.

Anyone have other tips for this type of thing? I did try upgrading drivers/kernels already

an_ko 3 days ago

Run a memtest. Graphics cards usually crash badly when given invalid data, which can happen sporadically if you have bad RAM.
If memtest shows a specific memory region as failing, swap out sticks to check which it is, and buy a new one. (Or if you're on a tight budget, you can disable that region with kernel boot options.)
If memtest gives errors in lots of places, might be a bad overclock. Loosen timings or give it more voltage.
- porridgeraisin 3 days ago
  
  This.
  I would also recommend running it on one memory slot only (do this one by one).
  Recently it happened that even a multi-day memtest run did not catch the defect in one of my ram slots.
  But when I emptied it, all the gpu driver problems immediately stopped.
  The symptom was a crash followed by garbage on screen and a "zzzz" sound. Sometimes followed by a shutdown.
- bmicraft 3 days ago
  
  > If memtest shows a specific memory region as failing, swap out sticks to check which it is, and buy a new one. (Or if you're on a tight budget, you can disable that region with kernel boot options.)
  Honestly, I'd always do that as the first option. In most cases you can still get years of life out of that stick of ram (but do check regularly, like after a week and then double the interval every time if it didn't get worse).
- yellow_lead 2 days ago
  
  Thanks, I'll try it!
nyanpasu64 3 days ago

Does `sudo systemctl enable nvidia-suspend` help?
- yellow_lead 2 days ago
  
  Hey, great article :)
  I'll try it.

rakejake 2 days ago

I have a problem similar to what OP faced but with an NVIDIA GPU (RTX 4080). When I wake the computer from suspend, the computer will usually wake up and show me the timestamp changing on the lock screen. But sometimes, randomly, the timestamp will not change and the screen will freeze on the old timestamp. After this, it is either REISUB or hard reset.

@nyanpasu64, do you think enabling nvidia-suspend/nvidia-resume services will do the trick? I didn't go through the codefix in the above post in detail, but it looks like the fix is to raise a relevant notification to all listeners in the suspend_prepare() method for amdgpus, not nvidia.

Running Ubuntu 24.10.

nyanpasu64 a day ago

I would try both nvidia-suspend enabled and disabled; https://forums.developer.nvidia.com/t/fixed-suspend-resume-i... suggests turning it off, but I'd try with it both on and off.
fata1ity_ 2 days ago

I have encountered this running Ubuntu with an nvidia gpu. Have you tried adding button.lid_init_state=open as a kernel boot parameter? That is what solved it for me.
- rakejake 2 days ago
  
  Thanks. Will try.

raffraffraff 3 days ago

> I dug a PS/2 keyboard out of a dusty closet and plugged it into my system (only safe when the PC is off!)

Lol, I remember.

0xTJ 3 days ago

Very excited to see 6.14 hit Arch! Hanging around sleep (with symptoms that sound like what's described in the write-up) has been the one persistent occasional issue, so I'm hoping that this fixes what I'm seeing.

Voultapher 3 days ago

For a couple of Linux versions now (since around 6.10, IIRC), my Nvidia system has been waking into a black screen, though with a cursor, and alt shells work. KDE Plasma specifically seems bugged here, but they say it's a kernel issue; at least there are dozens of separate issues open about this kind of bug. It's rather annoying that I can't put my machine to sleep.

If anyone has ideas what could fix this I'd really appreciate it. The machine is dual booted with Windows, and there sleep works without issue, so it's clearly possible, as it was for years before that on Linux as well.

saint_yossarian 2 days ago

I had a similar issue recently and at some point realized it was because systemd started freezing all processes before suspend, and didn't always succeed (either with the freeze, or with the unfreezing after resume).
Check your logs if you have a message "Freezing user space processes failed". You can disable this behaviour with the instructions at https://github.com/systemd/systemd/issues/33626#issuecomment...
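A minimal sketch of that log check, assuming the message lands in the kernel log and that the hang forced a reboot, so the interesting boot is the previous one. The function name is made up.

```bash
#!/usr/bin/env bash
# Succeeds (exit 0) if the log text on stdin contains the freeze-failure
# message quoted above. The function name is hypothetical.
freeze_failed() {
    grep -q 'Freezing user space processes failed'
}

# Example: scan kernel messages from the previous boot, if journalctl exists.
if command -v journalctl >/dev/null 2>&1; then
    if journalctl -k -b -1 --no-pager 2>/dev/null | freeze_failed; then
        echo "found a failed freeze in the previous boot"
    fi
fi
```

If the machine recovered without a reboot, dropping `-b -1` scans the current boot instead.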
thangalin 3 days ago
I have a similar issue. When turning off the monitor using its power button, the system semi-crashes. After powering the monitor back on, I have to go into an alt shell and kill lightdm, which also kills all running GUI applications. Not the greatest workflow.
```
    $ uname -a
    Linux hostname 6.13.1-arch1-1 #1 SMP PREEMPT_DYNAMIC Sun, 02 Feb 2025 01:02:29 +0000 x86_64 GNU/Linux

    NVIDIA-SMI 570.86.16
    Driver Version: 570.86.16
    CUDA Version: 12.8
```
Instead of powering off the monitor, I've created an alias:
```
    alias off='xset dpms force off && await.sh && xset -dpms'
```
And shell script (press any key to unblank the screen):
```
    $ cat ~/bin/await.sh 
    #!/usr/bin/env bash

    xinput test-xi2 --root 2>&1 | \
      grep --line-buffered -m 1 'EVENT type 2 (KeyPress)' &> /dev/null
```
This allows me to run "off", a compromise that puts the monitor in low-power mode. This solves the problem of having to kill lightdm, which improves the workflow. Still, sometimes I have to open an alt shell and then press Alt+F7 to return to X. Rarely, I'll have to go to the alt shell/Alt+F7 a second time to suppress sporadic screen glitching.
Adding nomodeset to the GRUB configuration was another suggestion:
```
    # grep vmlinuz /boot/grub/grub.cfg | head -1
    linux /vmlinuz-linux root=UUID=... rw nomodeset quiet resume=UUID=...
```
Hope it helps.
- saltcured 2 days ago
  
  Now I'm not sure if my Fedora 41 experience is better or worse than your problem.
  An older Thinkpad with secondary NVIDIA dGPU used to work fine, but now every time the monitor powers off (or is unplugged), Xorg instantly dies with no apparent logs.
  So I don't have to do anything special to recover except log in again and start a whole new MATE session.
nyanpasu64 3 days ago

Does `sudo systemctl enable nvidia-suspend` help?
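Before enabling anything, it may help to see what state those units are in. This is a rough sketch, assuming the driver package ships the usual `nvidia-suspend`, `nvidia-hibernate` and `nvidia-resume` units; the `SYSTEMCTL` override and the function name are only there for illustration and dry runs.

```shell
# Report systemd's enablement state for each NVIDIA power-management unit.
# SYSTEMCTL is overridable so the function can be dry-run with a stub.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

nvidia_sleep_units() {
  local unit state
  for unit in nvidia-suspend nvidia-hibernate nvidia-resume; do
    # "is-enabled" exits non-zero for disabled or missing units, so the
    # status is ignored and only the printed state is kept.
    state="$("$SYSTEMCTL" is-enabled "$unit.service" 2>/dev/null || true)"
    echo "$unit: ${state:-unknown}"
  done
}

# Usage:
#   nvidia_sleep_units
```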
mathfailure 3 days ago

Just use an older branch of drivers (like 535).

devilsdata 2 days ago

Props to you for this.

I am not clever/experienced enough to solve my own issues with sleep-wake hangs on Linux at work.

I’ve instead opted to work around it. I use Firefox, Obsidian, and Tmux with Neovim for all my work. Tmux has resurrect and a plugin that saves my entire terminal state automatically every few minutes.

I also have a command that automatically sets up my i3wm/regolith windows exactly how I like.

Basically if I run `wkup`, I’m exactly where I was, down to the line of code open on NeoVim, Firefox tab, and dev server or cargo running.

bbkane 2 days ago

Could you link to your i3wm/regolith window setup code? I'd like to set up something similar!
- devilsdata a day ago
  
  Here we go:
  https://github.com/lkdm/dotfiles/blob/main/bin/executable_wk...
  Hope it helps you out

schainks 2 days ago

Ugh the first time I was debugging problems like this, it was in production and for some IoT hardware we had deployed in the field.

Fortunately, although that's not the focus of this article, system hibernate is WAY more reliable than system sleep in Linux due to the way it works.

Use system hibernate if your SSD is fast enough. It works better than system sleep and isn't a ton slower.
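If you want to check whether hibernate is even an option on a given box, something like the sketch below might help. The function names are hypothetical, and "swap at least as large as RAM" is a conservative rule of thumb I'm assuming here, not a kernel requirement.

```shell
# Succeeds if the wanted sleep state (e.g. "disk" for hibernate, "mem" for
# suspend) is listed in a /sys/power/state-style string like "freeze mem disk".
state_supported() {
  local wanted="$1" states="$2" s
  for s in $states; do
    [ "$s" = "$wanted" ] && return 0
  done
  return 1
}

# Succeeds if swap (kB) is at least as large as RAM (kB).
swap_covers_ram() {
  [ "$1" -ge "$2" ]
}

# Usage on a real system:
#   state_supported disk "$(cat /sys/power/state)" && echo "hibernate available"
#   swap_covers_ram "$(awk '/^SwapTotal/ {print $2}' /proc/meminfo)" \
#                   "$(awk '/^MemTotal/ {print $2}' /proc/meminfo)"
```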

asmor 3 days ago

Some AMD integrated GPUs are surprisingly fragile with this. I have a GPD Win Max 2 8840U (a "concept car" handheld laptop hybrid) and when I got it last year, it would fail to wake from suspend and hibernate about half the time in Windows, with Linux actually being more reliable (but also not perfect), and only this year did an AMD GPU driver fix this.

empiricus 3 days ago

I'm confused about how the code needed for the GPU to sleep was implemented. It was failing when simply saving/copying gigabytes of flat memory, but on the other hand it was able to successfully recover the previous complex hardware and software state and data structures?! I guess it makes sense if, after waking up, that data is actually dropped and the GPU and UI are reinitialized and redrawn.

nyanpasu64 3 days ago

As I understand it, if the GPU fails to save VRAM, it drops the RAM copy(?) before restoring the GPU data structures or abandoning sleep entirely. But if it saves VRAM and then crashes after the GPU is suspended, it will often fail to wake up the GPU, resulting in no monitor signal. I'm fuzzy on the details, though (and the order of operations depends on which kernel you're running).

stycznik 2 days ago

It seems like the GPU should grow the capability to keep its VRAM intact through suspend. It's already complex[0] enough that it's basically another computer attached to your computer anyway.

[0] https://github.com/jhuber6/doomgeneric

Namidairo 2 days ago

For a second I thought this was referring to the other reset bug on Polaris, Vega and Navi. (These apparently have broken Function Level Reset sequences, requiring quite specific reset code as a separate module or a system reboot to bring back to a working state.)

isodude 2 days ago

There should exist something like memtest86, but for S3 and S0, that you can run on the laptop to identify hardware that does not suspend properly.
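Not a memtest86 equivalent, but a crude stress loop can be built on util-linux's `rtcwake`. The sketch below is an assumption about how one might do it: the function name is made up, the `RTCWAKE` variable exists only so the loop can be dry-run, and the real command needs root.

```shell
# Run N suspend/resume cycles, using the RTC alarm to wake the machine.
# RTCWAKE is overridable for dry runs; the default needs root privileges.
RTCWAKE="${RTCWAKE:-rtcwake}"

suspend_cycles() {
  local cycles="$1" mode="${2:-mem}" seconds="${3:-30}" i=1
  while [ "$i" -le "$cycles" ]; do
    echo "cycle $i/$cycles: suspending to '$mode' for ${seconds}s"
    # -m selects the sleep mode (mem, freeze, disk); -s sets the wake delay.
    "$RTCWAKE" -m "$mode" -s "$seconds" || { echo "cycle $i failed"; return 1; }
    i=$((i + 1))
  done
  echo "all $cycles cycles resumed"
}

# Usage (as root): suspend_cycles 20 mem 30
```

If the machine hangs, the last "cycle" line that reached the log shows how many cycles it survived. This doesn't identify which device is at fault.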

podiki 2 days ago

I didn't have issues with sleep/wake until somewhat recently (not sure when) and found this post. Grabbing the patch from the commit referenced and using it on top of 6.12 and 6.13 kernels seems to have fixed it for me too (for the past couple of weeks and counting).

Great work!

sim7c00 2 days ago

Very nice and detailed writeup, so much interesting stuff in here. Reading about systemd (bugreport) always hurts my brain (bugreport), but all the low-level interactions between the OS and drivers/firmware regarding these states, what kinds of issues can happen between them, and how to better find out what's happening: very nice :). Many thanks for the year-long hunt and the excellent writeup!

progforlyfe 3 days ago

Extremely high level genius stuff -- nice work and thank you for your efforts!

Bobaso 2 days ago

My ThinkPad E14 Gen 5 (Intel) with Ubuntu 24.04 is my first ever laptop where sleep works exactly as intended, with ~1% battery waste per hour.
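When comparing drain figures like this, it helps to know which suspend variant the machine actually uses. A small sketch that parses `/sys/power/mem_sleep`, where the kernel marks the active mode with brackets (`deep` is S3 suspend-to-RAM, `s2idle` is suspend-to-idle); the function name is hypothetical.

```shell
# Extract the active mode (the bracketed entry) from a /sys/power/mem_sleep
# style string such as "s2idle [deep]".
active_mem_sleep() {
  printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Usage on a real system:
#   active_mem_sleep "$(cat /sys/power/mem_sleep)"
```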

deepsun 3 days ago

I have to unplug Logitech wireless receivers for mouse/keyboard, otherwise desktop wakes up immediately.
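As an alternative to unplugging, you can look for USB devices that are allowed to wake the system. This is a sketch under the assumption that the receiver shows up under `/sys/bus/usb/devices` with a `power/wakeup` attribute; the function name is made up, and the root directory is a parameter only so it can be tried against a test tree.

```shell
# Print each USB device under the given sysfs root whose power/wakeup
# attribute reads "enabled", followed by its product string if it has one.
list_usb_wakeup_enabled() {
  local root="${1:-/sys/bus/usb/devices}" f dev name
  for f in "$root"/*/power/wakeup; do
    [ -r "$f" ] || continue
    if [ "$(cat "$f")" = "enabled" ]; then
      dev="${f%/power/wakeup}"
      name="$(cat "$dev/product" 2>/dev/null || true)"
      echo "${dev##*/}${name:+ $name}"
    fi
  done
}

# Usage:
#   list_usb_wakeup_enabled
# To disable wakeup for one device until the next reboot (as root),
# where 1-4 is whatever ID the listing printed:
#   echo disabled > /sys/bus/usb/devices/1-4/power/wakeup
```

Writing to sysfs doesn't survive a reboot or a replug, so a udev rule is still the more durable fix.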

igtztorrero 2 days ago

Thanks, this happened to me; I avoided the sleep functions.

mistyvales 3 days ago

Highly relevant! Thanks for this.

thrdbndndn 2 days ago

what is 'agd5f/linux'?

nyanpasu64 2 days ago

As I understand it, https://gitlab.freedesktop.org/agd5f/linux is the personal Git tree/repo of Alex Deucher, the amdgpu maintainer (https://docs.kernel.org/process/maintainers.html#radeon-and-...), where amdgpu changes get merged before they reach the main Linux tree (and/or the drm/tip repo; I'm not sure of their relationship).

kkarpkkarp 3 days ago

omg, thank you

tombot 2 days ago

> This took over a year of debugging and multiple attempts by many people to fix.

2025 finally Linux on the desktop

Nezghul 2 days ago

I'm also a programmer, and bugs like this immediately remind me of all my managers demanding the exact time it would take me to fix such a bug. Telling them it could take anywhere from 1 day to 1 year was never taken seriously :(

tgsovlerkhgsel 3 days ago

AMD GPU linux drivers are (were?) a nightmare in general, and this includes iGPUs in their processors. Sadly, I don't have the impression that AMD is actively working on fixing this.

Just to make sure I'm not griping over something long fixed, I took a quick look and instantly found someone with a very similar issue to the one I ran into happening on a semi-recent kernel: https://community.amd.com/t5/pc-drivers-software/linux-amdgp...

It looks to me that if you want to have a working computer under Linux, it's worth the extra cost to avoid AMD.

tostiheld 3 days ago

> It looks to me that if you want to have a working computer under Linux, it's worth the extra cost to avoid AMD.
I think this is a rather hasty conclusion. The popular opinion is the opposite: if you want a working computer under Linux, it's worth it to avoid NVIDIA, especially for laptops. Sure, AMD are not perfect contributors to the kernel, but they are contributing more than NVIDIA[0]. NVIDIA has made some moves recently[1], but the AMD GPUs are still better integrated. Notably, since the Steam Deck was released, the situation has been excellent.
Anecdotally, my laptop with an NVIDIA GPU has many issues that have persisted over the years with things like high idle power draw or frequent straight up crashes, or incomplete Wayland support. My 3 devices that have an AMD GPU (1 desktop, 2 laptops) however, have been working flawlessly from day 1.
[0] https://www.phoronix.com/news/NVIDIA-Contributions-2010s-Ker...
[1] https://github.com/NVIDIA/open-gpu-kernel-modules
- MegaDeKay 2 days ago
  
  It depends on the use case. AMD is notorious for their "AMD reset bug" when passing a GPU through to a VM using VFIO. Restart the guest and most cards will lock up because AMD doesn't handle PCI resets properly. You then need to reboot the host to fix it (!). This has been a problem since Polaris if not before and AMD hasn't fixed it, despite knowing full well that the problem exists. At least in this regard, NVIDIA (and Intel as far as I know) work fine.
  The community has been able to come up with a workaround for some older cards but the problem persists even in their current cards.
  https://github.com/gnif/vendor-reset
- jamesdutc 2 days ago
  
  Agreed.
  I have first-hand experience across five distinct AMD 7840U and AMD 8840U devices that near-perfect, out-of-the-box Linux-support (with stock kernels and no dodgy kernel flags!) is possible. This includes support for S0ix suspend.
  https://news.ycombinator.com/item?id=43083669
  I don't doubt it when people recount their bad experiences with AMD devices; however, my experience should serve as an existence proof that it's not a universal experience.
  In the case of each device mentioned in the comment above, I followed a standard installation procedure from an Arch installer USB. I use only stock kernels: linux, linux-lts, and linux-zen. For almost all of the devices, the only kernel flags I pass are for enabling hibernate or handling FDE. (In one or two cases, the devices have portrait displays that have been installed for use in landscape-orientation. These need an `fbcon=rotate:…` kernel flag.)
  In all but one case (the OneXPlayer X1 Ryzen) everything (except fingerprint readers) works flawlessly. In the case of the OneXPlayer X1 Ryzen, there is an intermittent issue with hang on suspend, but that may have gone away with a recent kernel update. If not, I'll probably come back to this blog post and see what I can do…
- mardifoufs 2 days ago
  
  Are you using proprietary Nvidia drivers? They work flawlessly for me, even better than the AMD open source "mainline" kernel drivers.
- tgsovlerkhgsel 2 days ago
  
  In my case, the choice wasn't between AMD and Nvidia but between the iGPUs included in the CPUs, i.e. AMD or Intel.
trelane 2 days ago

This used to be the case.
However, they have published docs on their hardware. For instance, https://gpuopen.com/amd-gpu-architecture-programming-documen... Their driver is also in-tree: https://wiki.archlinux.org/title/AMDGPU
It may or may not be as good as Intel. I don't have the depth of experience there right now.
It looks like AMD and Intel GPUs should be roughly similar in terms of support / out of the box experience on Linux, but AMD will have better performance.
Nvidia is still super proprietary, so it will generally have more headaches and hoops to deal with.
So, generally, I rank them by default:
1. AMD
2. Intel
3. Nvidia
Unless there is a reason (firmware or configuration differences for Intel/AMD, or needing to run/develop CUDA code for Nvidia), I try to stick with that. I've not yet had an AMD laptop, though.