Timers stop firing when system uptime is 7 weeks


#1

I’m using juce 1.51 to build an app on Mac (Leopard and Snow Leopard) and Windows (XP, Vista, and 7). I’ve only tested the Timer with top-of-tree (as of 22:14 GMT today) and 1.51 on Snow Leopard, but looking at the code, I’m 99% sure the same is true on all platforms in both versions.

If your system has been up for more than 2^32 millisecs (just over 7 weeks), Timer objects stop firing for running apps. For a newly-launched app, the first Timer you register will fire once, but that’s it.
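For reference, the arithmetic behind the "just over 7 weeks" figure can be checked with a small sketch. The constant names are illustrative only and do not come from JUCE.

```cpp
// Number of distinct values a 32-bit millisecond counter can hold (2^32).
constexpr double kCounterRangeMs = 4294967296.0;

// Milliseconds in one day: 1000 ms * 60 s * 60 min * 24 h.
constexpr double kMsPerDay = 1000.0 * 60.0 * 60.0 * 24.0;

// Uptime, in days, at which a uint32 millisecond counter runs out of range
// (roughly 49.7 days, i.e. just over 7 weeks).
constexpr double kRolloverDays = kCounterRangeMs / kMsPerDay;
```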

The root problem is that Time::getMillisecondCounter() is documented to return a monotonically increasing value of millis since system startup in a uint32, which is impossible. After 2^32 millisecs of uptime, the counter has to either roll over (thereby breaking monotonicity) or peg at UINT_MAX (breaking “accurate to within a few millisecs”). Stepping through the Time code in the debugger may actually change the behavior (by affecting the 1000ms difference check), but it doesn’t really matter; there’s no way it can do what it’s documented to do–and, either way, Timer::run() will never fire any timers. Each time through the loop, now <= lastTime, so it waits 2ms and continues the loop.

I think the Timer problem can be fixed entirely within Timer::run. I’ll try it locally and submit a patch, if it’s as simple as it should be.

As for the root problem, the simplest fix might be to explicitly make it roll over, and change the documentation to explain the problem (cf. timeGetTime on MSDN), and of course make sure to fix anything other than Timer that relies on it. (Also, I haven’t looked through the rest of the API to see if the same problem exists in other functions, but if so, they’d need to be fixed the same way.)

Other options off the top of my head include switching getMillisecondCounter to return a uint64 or adding a parallel getMillisecondCounter64 call (which is no fun to implement on top of timeGetTime; you might want to look at reading the 64-bit value directly out of the system shared memory page…), or counting millis since process launch instead of system start (which still has a 7-week rollover problem, but fewer people would be affected, and the workaround isn’t as bad as “reboot your computer”), and/or adding an explicit “reset counter” call that could be used for app-level workarounds.


#2

Hmm…all that does is delay the problem until much later in the future. You might think this is improbable but if the client software is running near the event horizon of a black hole, time dilation effects could cause that timer to arrive sooner than you would expect.


#3

First, the Timer problem can’t be fixed purely within Timer. If Time were guaranteed to roll over, then Timer could be fixed with a one-liner, but as it is, it pegs on some platforms, and may not roll over properly on others.

I believe the Mac implementation just needs to replace the (uint32) with (uint32)(int64) in millisecondsSinceStartup (although I think this is only guaranteed portable to systems with IEEE doubles), the Linux implementation just needs to add (uint32) casts to tv_sec and tv_usec, and the Windows implementation is fine as-is. I haven’t looked at Android or iOS.

The Mac implementation casts a double to a uint32_t, which does different things depending on your hardware and compiler settings, but for x86 with default settings it pegs to UINT_MAX, which can’t be handled usefully in Timer. I think just replacing the (uint32_t) with (uint32_t)(int64_t) will fix that. I think the Linux implementation might have some issues as well (I haven’t tested yet), but tossing in a couple of extra (uint32_t) casts will definitely fix that. Windows is already guaranteed to roll over as implemented. I haven’t looked at other platforms.
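To illustrate the difference between the two casts: in C++, converting a double straight to an unsigned 32-bit integer is undefined behaviour when the truncated value does not fit, whereas converting to a signed 64-bit integer first and then narrowing is well defined, because the narrowing step wraps modulo 2^32. A minimal sketch, using a hypothetical helper name (not a JUCE function) and assuming the millisecond value stays within int64 range:

```cpp
#include <cstdint>

// Hypothetical helper: reduce a millisecond count held in a double to a
// uint32 that wraps modulo 2^32 instead of relying on out-of-range behaviour.
constexpr std::uint32_t wrapMillis (double millis)
{
    // double -> int64 is defined as long as the value fits in int64.
    // int64 -> uint32 is defined to wrap modulo 2^32.
    return static_cast<std::uint32_t> (static_cast<std::int64_t> (millis));
}
```

With this, a value five milliseconds past the 2^32 boundary comes out as 5 rather than pegging at UINT_MAX.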

[quote]Hmm…all that does is delay the problem until much later in the future. You might think this is improbable but if the client software is running near the event horizon of a black hole, time dilation effects could cause that timer to arrive sooner than you would expect.[/quote]No, you’ve got it backwards.

As far as the CPU clock, and any local users, are concerned, half a billion years of CPU time is half a billion years of user time, because they’re experiencing the same time dilation.

As far as remote users are concerned, they’re seeing the CPU run slower; half a billion years of CPU time is a few googol years of user time. (And I’m even taking into account the fact that the black hole will probably eventually evaporate due to Hawking radiation, which means the time dilation won’t last forever.) Most people will get bored and quit the app before the heat death of the universe, and I’m willing to give a refund to anyone who doesn’t.


#4

I hope you’ve got that clause in your EULA!

Very interesting post - and well spotted about the cast to uint32 from a double. TBH I have to admit that when casting doubles to ints I don’t think I’ve often given much thought about what would happen when the value’s out of range, I’ll try to be more vigilant about that in the future!

AFAICT the latest Timer code is absolutely fine as it stands - it contains a check for roll-over and should carry on without a glitch. I’ll have a good look at the getMillisecondCounter code, and see whether it makes sense on all platforms to go to 64-bit, or just to make sure it rolls over (which is what I always designed it to do, despite not having been explicit about that in the documentation for that function)

Thanks!


#5

[quote=“jules”]I hope you’ve got that clause in your EULA![/quote]Hmm… better get the lawyers on that one. I suspect that if any of them can find a usable precedent, they’ll also get a Nobel Prize in physics.

[quote=“jules”]Very interesting post - and well spotted about the cast to uint32 from a double. TBH I have to admit that when casting doubles to ints I don’t think I’ve often given much thought about what would happen when the value’s out of range[/quote]The most annoying thing about this is that the rules are different depending on your CPU, compiler, and even minor settings. So, it’s very hard to test for these problems.

[quote=“jules”]AFAICT the latest Timer code is absolutely fine as it stands - it contains a check for roll-over and should carry on without a glitch.[/quote]I’m looking at Timer::run (juce/src/events/juce_Timer.cpp:75) in the current latest git version, and I don’t think so. The lastTime variable still pegs. So, if you launch an app after 49.7 days of uptime, now quickly rolls over to 0 while lastTime is still near 0xFFFFFFFF, and you can never get out of that state.

So, timers will still fail to function in any app whose running time is over (7 weeks - system uptime). That’s not as bad as without rollover (where 7 weeks - system uptime could effectively be negative, so all apps would fail), and the end-user workaround is better (quit and relaunch the app, rather than power down and reboot the computer), but it’s still a bug that will affect people.

Changing this check to “if (now <= lastTime && now > lastTime - 0x3FFFFFFF)” should fix it. It’s still not perfect–for example, you can easily hibernate a computer across the rollover. But then there’s no way to make this perfect (you can hibernate a computer for 49 days, after all). And it’s probably good enough.
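A different way to get a similar effect, which is not the check proposed above and not necessarily what JUCE does, is to compare timestamps through unsigned subtraction. Unsigned arithmetic wraps modulo 2^32, so the difference is the true elapsed time across a single rollover. A sketch with hypothetical function names, assuming the counter really does roll over rather than peg:

```cpp
#include <cstdint>

// Elapsed milliseconds between two readings of a wrapping uint32 counter.
// Correct as long as less than 2^32 ms passed between the two readings.
constexpr std::uint32_t elapsedMillis (std::uint32_t now, std::uint32_t lastTime)
{
    return static_cast<std::uint32_t> (now - lastTime);
}

// Treat differences of 2^31 ms or more as "the clock went backwards"
// rather than as a huge forward jump. The threshold is an assumption.
constexpr bool hasTimeAdvanced (std::uint32_t now, std::uint32_t lastTime)
{
    return elapsedMillis (now, lastTime) != 0
        && elapsedMillis (now, lastTime) < 0x80000000u;
}
```

For example, with lastTime at 0xFFFFFFFB and now at 5, the elapsed time is 10 ms and the loop would proceed instead of waiting forever.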

[quote=“jules”]I’ll have a good look at the getMillisecondCounter code, and see whether it makes sense on all platforms to go to 64-bit, or just to make sure it rolls over (which is what I always designed it to do, despite not having been explicit about that in the documentation for that function)[/quote]Yeah, I got the feeling this function was intended to be a cross-platform (and less-stupidly-named) version of Windows’ timeGetTime.

For rollover, I think this is all you need:

//juce_android_SystemStats.cpp:137
return (uint32)t.tv_sec * 1000 + (uint32)(t.tv_nsec / 1000000);
//juce_linux_SystemStats.cpp:150
return (uint32)t.tv_sec * 1000 + (uint32)(t.tv_nsec / 1000000);
//juce_mac_SystemStats.mm:199
return (uint32)(int64)(mach_absolute_time() * highResTimerToMillisecRatio);
I believe the Mac version isn’t actually guaranteed to be portable, but it is guaranteed to work on x86, x86_64, and ARM. If that’s not good enough, the best answer is to keep the numerator and denominator around as integers instead of a double ratio.
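The integer-ratio alternative could look roughly like the sketch below. On the Mac, the numerator and denominator would come from mach_timebase_info, which converts ticks to nanoseconds; here they are passed in as plain parameters so the arithmetic stands alone. The function name is hypothetical, and the sketch assumes numer and denom are small, as they are in practice.

```cpp
#include <cstdint>

// Convert a tick count to milliseconds using an integer numer/denom ratio
// (ticks * numer / denom = nanoseconds), then wrap the result to 32 bits.
constexpr std::uint32_t ticksToWrappedMillis (std::uint64_t ticks,
                                              std::uint32_t numer,
                                              std::uint32_t denom)
{
    // One millisecond is 1,000,000 ns, so fold that into the divisor.
    const std::uint64_t divisor = static_cast<std::uint64_t> (denom) * 1000000u;

    // Split into quotient and remainder so that ticks * numer is never
    // formed directly, which keeps intermediate values small.
    const std::uint64_t whole = (ticks / divisor) * numer;
    const std::uint64_t frac  = (ticks % divisor) * numer / divisor;

    // Narrowing a uint64 to uint32 wraps modulo 2^32 by definition.
    return static_cast<std::uint32_t> (whole + frac);
}
```

No floating point is involved, so there is no out-of-range cast to worry about on any architecture.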

If you really want to go 64-bit, it is a huge pain on Windows if you have to support pre-Vista systems. The only two options are to synthesize the 64-bit time out of a less-accurate 64-bit time and timeGetTime (or some other ms-accurate value), or to use the undocumented value maintained by ntdll in the page it shares with every process. If you want to know more about the latter, I can probably dig up some code (which doesn’t work on Vista and later, but you don’t need it there).
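One way to read the first option is: take the high bits from the coarse 64-bit clock and the low 32 bits from the accurate counter, then correct by one rollover period if the two disagree about which side of a rollover they are on. The sketch below is only that idea in isolation. The function name is hypothetical, and it assumes both clocks count milliseconds from the same starting point and that the coarse clock is never off by 2^31 ms or more; a real Windows implementation would also have to reconcile the different epochs of the two sources.

```cpp
#include <cstdint>

// Pick the 64-bit value whose low 32 bits equal preciseLow32 and which lies
// closest to coarseMs.
constexpr std::uint64_t extendTo64 (std::uint64_t coarseMs, std::uint32_t preciseLow32)
{
    constexpr std::uint64_t span = 1ull << 32;   // one rollover period
    constexpr std::uint64_t half = span / 2;

    // Start with the coarse clock's high bits and the precise low bits.
    std::uint64_t candidate = (coarseMs & ~(span - 1)) | preciseLow32;

    // If that lands more than half a period away, the two clocks straddle
    // a rollover, so shift the candidate by one period towards coarseMs.
    if (candidate > coarseMs && candidate - coarseMs > half && candidate >= span)
        candidate -= span;
    else if (coarseMs > candidate && coarseMs - candidate > half)
        candidate += span;

    return candidate;
}
```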

PS, I’m pretty sure your Windows implementation of the hi-res counter isn’t right either. You’re relying on the 2008/Win7 behavior of QPC (and even there “you can get different results on different processors due to bugs in the basic input/output system (BIOS) or the hardware abstraction layer (HAL)”), which means the counter can only be safely used on a single thread with hard processor affinity, and in 2000/XP/2003 it still often won’t work right if sleep mode or even a SpeedStep change kicks in while the app is running. Also, on multiprocessor, older multicore, and older variable-clock-speed systems, QPC can be very slow–hundreds of cycles, or even hundreds of microsecs, and possibly flushing the cache across all cores (because the only way to implement what the API claims to offer is to either RDTSC and sync, or use the ACPI timer). Read http://blogs.msdn.com/b/psssql/archive/2010/08/18/how-it-works-timer-outputs-in-sql-server-2008-r2-invariant-tsc.aspx for a description of how Microsoft themselves deal with the problem of trying to make usable timers on Windows (and follow the three links near the top) to see how hard it is to get this right–and that’s for a server, where people aren’t trying to write sample-accurate stored procedures and run them on home PCs…


#6

Thanks, I did some work on this yesterday and sorted out the basics - I’ll check it in today. You’re right, the timer stuff was still wrong, but it was actually quite easy to make it handle the roll-over.

The win32 high-res stuff is all a complete minefield, and I’ve never been quite sure what to do about it. Would be nice if MS could just give an “official” description of how best to get a decent timer on all their platforms.