Tracking down why my desktop fails every second resume

Collecting a thread to pull on...

Had an interesting problem for a while now - my desktop under linux would mostly suspend and resume just fine, except when it didn't. This is annoying as I'm the type of person who likes to leave a big dev environment running and come back to it.

Power management problems are the worst type of problems to debug in many ways, so documenting any progress I made was fairly important.

The Ubuntu guide to kernel suspend is the most useful one I found: https://wiki.ubuntu.com/DebuggingKernelSuspend

And the important bit is this action:

sync && echo 1 > /sys/power/pm_trace && pm-suspend

This does some kernel magic which encodes suspend/resume progress into the systems RTC clock via a hash, which allows - if things freeze - to reboot and grab the point at which they did. You have about 3 minutes to do so after the next boot before the data vanishes and you grab it from dmesg.

This led to an immediate reproduction - suspend->resume worked the first time, and then hung my system on the second time. So it works, but something gets corrupted through the process and we need to (hopefully) just reset it on resume to avoid the problem.

$ dmesg | grep -A10 Magic
[    3.607642]   Magic number: 0:474:178
[    3.625900]   hash matches /build/linux-B4zRAA/linux-4.8.0/drivers/base/power/main.c:1070
[    3.644583] acpi device:0e: hash matches
[    3.663313]  platform: hash matches

That's the easy part. What the hell does it mean?

Goto the Source

We get a source line out of that request, and we're running an ubuntu kernel which has a convenient source package we can grab. So let's get that so we can account for the Ubuntu packages:

$ cd ~/tmp
$ uname -a
Linux will-desktop 4.8.0-19-generic #21-Ubuntu SMP Thu Sep 29 19:39:23 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ apt-get source linux-image-4.8.0-19-generic-generic

Which leads to this:

static void async_suspend_noirq(void *data, async_cookie_t cookie)
{
    struct device *dev = (struct device *)data;
    int error;

    error = __device_suspend_noirq(dev, pm_transition, true);
    if (error) {
        dpm_save_failed_dev(dev_name(dev));
        pm_dev_err(dev, pm_transition, " async", error);
    }

    put_device(dev);
}

So the error line we're getting puts right on that if (error) line which hopefully means this is just some device failure we can add a PM script for.

From dmesg above we've got two more things to look at - whatever acpi_device:0e is and the platform driver for. Some googling shows that this puts us into the category of very annoying problems: we're not even successfully getting into the resume code, so the failure on the second resume happens very early. https://lkml.org/lkml/2016/7/14/160

Time to rebuild the kernel...

Which is often less work then it sounds, but judging from that LKML link it's pretty much the only lead we have to go on since we don't have a Thinkpad but the problem is probably suspiciously similar.