Something fishy in PNGImageFormat file operations

I have a sad tale and a suspicion or two. My plugin has a loop in which it loads a number of PNG files. Works on Mac–everywhere. Works on anything in WinXP (including Cubase 7–I figured how to install it there). Works on anything in Win7–except Cubase. Then it crashes without so much as a thank you. I have two virtual machines (32 and 64-bit). I have a laptop (64-bit). It crashes without fail on all of those machines. It’s not always the same file that kills it, but it’s almost always in the first 8-10 files.

Smells like a race condition, right? I made sure it was thread-safe (at least with my plugin) by protecting the image database with a mutex. I did it with external files and I did it by embedding the files within the plugin itself. Always crashes under Cubase7 on Win7 and never anywhere else (haven’t been on Win8 in a while, so let’s exclude it from the discussion). I’m not able to run Cubase under debug, so I had to revert to the time-honored tradition of dumping messages to a text file. I made sure to fflush after each write.

I found that the crash always occurred somewhere in [color=#0000FF]PNGImageFormat::decodeImage(InputStream&)[/color]. This function breaks down into two halves–in the first half, the file is read and some basic header data extracted. In the second half, the data are converted to the Image format. The crash is always in the first half–the file-reading/analysis half. I started to put more little debug prints into the function in that area. And it began to work…

I didn’t look real hard at the IO functions, but it sure looks to me like you’re using some sort of asynchronous IO. My speculation is that there’s some sort of global mode that Cubase sets in a way you don’t expect. The callbacks might be coming back before the IO is really complete and the logic dies because it’s operating on bad data. I’ve dealt with DMA controllers that interrupt before transfers are really complete and you have to be very careful. My sprintfs take long enough to allow file operations to complete.

I’ve been digging into this one for more than a week. I’ve rewritten the controlling logic enough different ways to change timing around in all sorts of ways. Yet it always happens right inside [color=#0000FF]PNGImageFormat::decodeImage[/color]. Jules, do you (or anyone else) remember that file IO well enough to comment?

I strongly suspect that PNG is innocent - most likely you’ve got some kind of dangling pointer that’s overwritten some memory, and PNG just happens to be the scapegoat that triggers a crash because it does a lot of allocation. I’ve often seen crashes in there myself, but always as a side-effect of something else being broken.

+1

I’ve been using PNG file with Juce for ages and it has been just fine the whole time.

[quote=“otristan”]+1

I’ve been using PNG file with Juce for ages and it has been just fine the whole time.[/quote]

I have too–years now. It continues to work fine in every other case, but the failure on Cubase 7/Win 7 is really egregious. I don’t think it’s a loose pointer–that would inevitably cause other problems, but the plugins are rock-solid. As I tried to make clear, my suspicion isn’t the PNG logic–it’s the apparent asynchronous IO. I suppose I’m going to have to dig still deeper.

It doesn’t use async IO… It just reads from a normal juce InputStream, so wouldn’t expect any problems.

I haven’t gone away on this one. I’ve been searching for ways I might have caused it for a couple of weeks. I’m more convinced that there’s a problem here. I’ve been looking at crash dumps and I get a very uniform result. Before I say what that is, I should say that I get exactly the same thing in two different plugs. One is in a tight loop loading 94 png files. The other loads over 100. In both cases, the crash happens fairly early, perhaps after a dozen files. Sometimes it’s more, sometimes less.

These calls are in a critical region. Nothing else is going on in my plug. It happens in more than one DAW (Cubase being the easiest to drive the error). At first I wondered why it didn’t happen on the Mac, but I discovered you’re using CoreImage on the Mac, so this code isn’t exercised. If I start from a fresh system boot, it will always happen., whichever plugin I use.

So with the preface out of the way, the crash happens on in a free call (it was called by png_destroy_struct, which was called by png_destroy_read_struct, which was called by decodeImage). The crash log complains of a corrupted heap.

I should point out that the actual conversion of the png file works perfectly well. My suspicion is that there is a fencepost error somewhere in one of the routines you use here. Because malloc will usually give you more than you ask for, I think this occasional error is usually harmless. But depending on the fragmentation state of the heap, you don’t always get this little safety margin in an allocated chunk of memory. I can typically play around on the machine for a while, churning up the heap, and then all these files will load without difficulty. As far as I can tell, this is just a single heap inside the context of the DAW, but I don’t really know.

But I really do think this one’s real.

There isn’t much allocation done in the PNG loader… There are a couple of HeapBlocks, but it all looks fine to me, and if there’d been an overflow, I think it would have been caught in valgrind by now. I guess you could try making those HeapBlocks allocate a bit more space in case libpng is overflowing a bit, but it doesn’t feel likely to me.

He’s baaaack…

I’ve found that I can comment out the second call to png_destroy_read_struct inside PNGimageFormat::decodeImage and the crash goes away. The plugin remains stable after that. If I allow that call to be made under specific (and common) circumstances, it will crash the application with a corrupted heap. It never crashes anywhere else. With the call commented out, it runs just fine. Now I’m the first to say this isn’t a longterm solution. I don’t see anything in the immediate area that steps outside of the boundaries of the structure or anything like that. But I think this is a telling symptom of a problem that’s been introduced in a few new DAWs. The new problem is caching of DLL code.

I know this to be a fact in an upcoming version of the “DAW that runs AAX”. I experienced it and have had it directly confirmed by the vendor. I strongly suspect that it’s part of Cubase 7 and Studio One (made by many of the people who made Cubase). This causes terrible problems if you have static elements in your plugins (other that basic things like integers and pointers). The problem is caused by the fact that you can’t count on constructors to be run on more complex objects. The code loader will run the very first time the DLL is loaded, and from everything I can tell the constructors are run at that time. But–even if you remove all instantiations of a plug–it will be loaded from cache after that. The methodology seems to differ from DAW to DAW. In some cases, the cache lasts through reboot of the computer. In other cases it doesn’t.

But when the DLL comes in from cache, it may be as simple as a binary copy. Those constructors aren’t run. There is of course good housekeeping that can be done at shutdown: make sure to delete every dynamic object and set the pointer to NULL. Set various flags and counters to a known uninitialized state. And so on. But it also means that any object that dynamically allocates resources (many Juce classes do this) may be left in a dangerous state. Even a statically-declared String might be a problem, especially if you need to modify it later. That String will probably have pointers to memory in the heap that existed the last time it was alive. As soon as you modify that string, the old buffer will be freed and the heap will be corrupted.

This problem may not be experienced by many Juce users if they don’t use statics. But I need to communicate between plugin instantiations in order to share some resources and pass messages. Over the last several weeks, I’ve assiduously worked on my code to make sure that I have no static complex objects and that I’ve cleaned up. All of those statics have the same values they are given in the source code. This isn’t hard, once you get the hang of it. You can always tell when the destructor is running for the last existing object and can clean up accordingly.

I’m asking that you take a quick spin through Juce to see if there are static objects. I think there may be a couple. If that’s a case, you’re going to need some sort of initialization and shutdown calls for those modules. You can’t count on constructors and you probably can’t count on destructors either. I now know of 3 DAWs that do caching of this sort. Unfortunately, they’re good ones.

I think you’re probably misdiagnosing that.

png_destroy_read_struct calls free(). During the free() call, the MS debug allocator will check the heap for corruption and assert. That does NOT mean that the heap corruption was caused by png_destroy_read_struct - that’s just where it got detected. The damage could have happened long before, and just because you remove the code that detects it doesn’t mean that everything’s ok!

I also think you may also be misunderstanding how DLLs work - there’s no such thing as a “cache” for them (??) The OS loads and frees the DLLs, and statics won’t be destroyed until the DLL is about to be unloaded from memory. Perhaps instead of looking at statics, the thing to check would be that the initialiseJuce/shutdownJuce calls are being made correctly, because if there was a mistake so that it shutdown juce when the last instance is deleted, but failed to re-initialise it when another is subsequently created, that might make things go wrong?

I thought that’s what I said. I said the call was a symptom–not the cause. I fully concur that this is quite likely to happen sometime in advance of this. But here’s the thing. On the Mac the actual PNG code isn’t run: CoreImage is run instead. There are absolutely no heap problems. Nada. On many other Windows DAWs that don’t appear to cache. there are no such problems. I have reordered things at least a dozen times–radically. I’ve searched quite exhaustively for loose pointers and such. But Iin every single core dump, I get exactly the same culprit–the call to free from the function I’ve mentioned.

And by the way, these are in release builds, so I don’t have the advantage of the MS heap analysis. If I try to run the debug build from the normal app startup, I get killed by non-germaine juce asserts (don’t even get me started there). Can’t launch the apps from the debugger because they detect the debug state and exit (as a protection mechanism). I’m quite sure that the MS heap tool would tell me there’s a problem. If you know a library I can link into the app that will analyze the heap whenever I want, I’d be happy to run it. This doesn’t appear to be a race condition at all, so I don’t think a slowdown would make any difference.

I understand them perfectly well. Been writing the damned things long enough. What I’m telling you is that these DAWs are circumventing that mechanism. That’s straight from the horse’s mouth. Contact me offline and I’ll give you contact information for the horse. These DAWs are trying to save startup and verification time. They are doing something–I’m assuming close to a binary copy–and keeping a more quickly loadable form of the plugin after its initial load. The OS and its mechanisms are not involved except at the first load.

It’s more interesting, as it always is. Plugin verification (AUval, ProTools eval, whatever Cubase does, etc) typically evaluates only the AudioProcessor part of the plugin. They create it and put it away, in some cases not even calling PrepareToPlay (actually AUval does, but it’s a much tougher evaluator). The editor is not created and thus not evaluated. So the DSP part of the plug gets evaluated, the thing gets cached, and when you really launch it the editor gets instantiated, and our little symptom function steps on the heap.

Well, I’ve had another look through the PNG code, and I really can’t see anything at all in there that’s even the slightest bit dodgy! Not sure what else to suggest. Valgrind would be a good tool for this, if only it was possible to run that on the PC.

Ok, understood! Well, that makes things tricky… TBH I’m amazed that something like that would work at all!

Ok, understood! Well, that makes things tricky… TBH I’m amazed that something like that would work at all![/quote]
Well it appears that it doesn’t work all the time.

I don’t know that the png code causes this–it’s just where it turns up. I took a quick, non-exhaustive drive through the Juce code and found many areas that might be problematic–more in a paragraph or two. I found that with the not-to-be-named-DAW-that-runs-64-bit-AAX, that I could not count on constructors for static objects–except the the very first load. It appears that the standard loading/linking takes place then. After that, the cached code is used. Whether it’s a prebound memory image or something else, I simply don’t know. That cached code did not run load-time constructors, so I had to put any static items into a known safe state when I shut down the last instantiation of a plug. Things that had been Strings (for example) needed to become pointers to strings. That way I could NULL the pointer at shutdown and create a new object at startup. Same is true for Arrays of many sorts and so on. Once I got that taken care of, things started to run again.

Cubase and Studio One appear to use a different strategy that does not last over a system reboot. So if I reboot the system, then run Cubase, it will scan the plugins (just instantiating the AudioProcessor part). This means that on that first run, the plugin gets loaded twice–and apparently cached in some form. In those cases I always get the crash in the png area. On subsequent runs (without rebooting), the scan doesn’t happen. The plugin gets loaded only once–and the png crash never happens. Since the program is completely reloaded, it must have a brand-new heap. The only thing that’s different is that double-load. That’s why I think there’s some sort of caching going on. I’m open to another interpretation if you’ve got one.

I should mention some of the other things I tried. Instead of reading the png files in through the png conversion utility, I read them into memory streams instead–just to perturb the heap. But that didn’t move the crash from where I’ve described. I also added in some routines ahead of the png conversion just to fragment the heap–I allocated dozens large and small chunks, overwrote them, then freed them in different orders. I thought that surely I’d find any heap corruption there. But those routines ran fine and the code still crashed at the same point in the png code.

I went through that png conversion code and I didn’t see anything obvious either. The only even possible place that I might wonder about would be the decompression. I used Photoshop to generate and compress the pngs. Any possibility they might decompress to a buffer larger than the one that’s provided? It would be wonderful if that turned out to be the problem. But I still suspect that it might actually lie in code that ran before. With that in mind, I perused some of the juce code for static items that might expect a constructor or some load-time initialization. Here are a few:

juce_Component.cpp - there’s a pointer named currentlyFocusedComponent. If the last component destructor doesn’t null this pointer, perhaps it could be a problem.

juce_ComponentPeer.cpp - there’s an Array called heavyweightpeers. What’s the status of this piece of memory when the last plugin goes away?

juce_DeletedAtShutdown.cpp - there’s a SpinLock called ShutdownLock. What happens if no constructor is run on this?

That’s enough in the way of examples. I want to hasten to say that I’m not complaining that I think there’s something wrong with Juce. There wasn’t anything wrong with my old code either. I could count on all static initializers and constructors to be run when the DLL or the component got loaded. But this whole caching crap has added a new twist. I may be wrong on many of the details–after all I’m trying to suss out what different DAW vendors are doing and they’re not inclined to share. I’m going to see if I can learn a little more about this at NAB, if the vendors let any engineers tag along.

I’ll keep looking, as I have been for the last month. But I wouldn’t surprised to see these things gradually appearing more frequently.

The more I think about it, the less convinced I am that anyone would attempt to “cache” a DLL… Are you sure you’ve not misunderstood that? For it to work, it would have to involve taking a snapshot of everything on the heap that the DLL has allocated, and to later restore it at the same memory address… Surely not possible?

If someone asked me how e.g. tracktion’s DLL loading works, then I might also tell them that it “caches” the DLL, meaning that if you open a plugin, and the DLL gets loaded, and then you close the plugin, it won’t immediately release that DLL, it’ll leave it in memory for a while in case you decide to use it again. Are you sure that’s not what they meant?

It doesn’t cache the whole DLL… just the Description

Rail

[quote=“Rail Jon Rogut”]It doesn’t cache the whole DLL… just the Description
[/quote]
Thanks for that, Rail. Now what’s a Description?

I have just found something that makes ever so much more sense. I’ve found that there is a double free going on. I discovered just what I was looking for, a function called _Heapwalk. It lets me crawl the heap whenever I want. I can look at each heap block and tell whether an address is in the heap or not. During the course of this, I saw the free function called twice for the same address. There are a number of frees that happen in one of your structure destroy operations, and for some reason a redundant pointer has been passed in.

It’s going to take me some more time to nail it exactly. Right now the code is over-instrumented, so I’ll have to begin pulling some of the instrumentation out.

Search for the text Descriptor in the AAX wrapper… pretty sure you’ll find the method which defines the plugin in the wrapper (if memory serves).

EDIT: The method is: createDescriptor()

The SDK includes a PDF which describes how this works (the cache)

Rail

I found the bug. It’s a forehead slapper. I was way out there in the weeds for a while, convinced that the issue must be external to the png conversion code. But I was wrong. It’s right there.

The function png_destroy_read_struct calls the function png_read_destroy which calls png_info_destroy which calls png_info_init_3. That last function does a resize of the info struct and passes back an updated pointer (the resize operation performs a free and then another alloc). The updated pointer is passed back up the chain, but then you’ll notice that the updated pointer doesn’t really make it back to the function png_destroy_read_function. It has passed the info_ptr to png_read_destroy by value, rather than by address. This means that the updated and resized buffer pointed to by info_ptr is not updated. A little later, png_destroy_struct is called with the out-of-date info_ptr. Boom.

So how did this ever work at all? What happens is that that resize operation usually returns the same buffer that it just freed. If you’ve ever written a heap manager (I have), you’ll know that a malloc often gives you a larger buffer than you asked for. If the resize operation is fairly modest, chances are likely that you’ll get back what you just gave up. That’s what happens here. Since you almost always get back the same address, it doesn’t matter that the pointer being used in png_destroy_read_struct is out of date.

But in my case, it appears that Cubase is busy with the heap in other threads. The heap itself is thread-safe (I never saw any corruption), but it appears some higher-priority thread jumped in between the free and the malloc of that resize operation. The little fragment I just gave up got stolen, and I was awarded a different chunk of memory. Now the out-of-date pointer in png_destroy_read_struct is lethal. When that function tries to free it again, we hit the exception.

Of course, it doesn’t even have to be the case that another thread grabs that chunk of memory. Depending on the fragmentation state of the heap, the heap manager might find a better fit and return a different node anyway. I think my bad fortune may result from the fact that I load a lot of pngs. The odds weren’t in my favor. But you can see the bug, plain as day, once you know about it.

Wow. That pnglib function must be called billions of times per day, in millions of apps. Like you say, it’s a miracle it ever worked!

Do you know if this is something that’s fixed in newer versions? If so I guess I should update the version that’s embedded in juce…

[quote=“jules”]
Do you know if this is something that’s fixed in newer versions? If so I guess I should update the version that’s embedded in juce…[/quote]
It’s in the tip, so I’m the lucky finder I suppose. Do I get flowers? 8)

I should say it’s in the Juce tip. I don’t know anything about the current state of that library outside of Juce.