Two problems with String::formatted (one a buffer overflow!)

I know that Jules doesn’t like String::formatted - but I have an internationalized(*) application which has a lot of strings like: “Unable to open file %s with error code %s” and there really isn’t another way to do it.

First, there’s a buffer overflow. The buffer used is 256 bytes long - if the results of String::formatted are greater than 256 bytes, it simply writes into “random memory”. This 256 character limitation isn’t documented, and as you know, buffer overflows are very dangerous…

So far, I haven’t hit this limitation - I found it while debugging the next problem - but since I use String::formatted to make fairly long error messages that include full file paths it is a matter of certainty that if enough people use my program, someone will get an error with a file with a very long path and I’ll have something completely unpredictable happen.

My second issue is that String::formatted seemed to work fine on the Mac to print strings, but I got non-UTF-8 (“garbage”) characters generated on the PC.

After some debugging, I narrowed it down to the fact that Juce’s String::formatted only seems to accept wide characters on the PC, and only narrow characters on the Mac!

The following code works fine, but requires different actual calls on Mac and on PC…
[code]String MAIL_SUBJECT("Support Request: %s");
String title = "Some title here";

const char* narrow = title.toUTF8().getAddress();
const wchar_t* wide = title.toWideCharPointer();

String res =
#if JUCE_WINDOWS
String::formatted(MAIL_SUBJECT, wide);
#else
String::formatted(MAIL_SUBJECT, narrow);
#endif
[/code]
The code sample works fine on both Mac and PC. If I change the condition to !JUCE_WINDOWS, it works incorrectly on both Mac and PC, so I can’t use just one of the two alternatives…

This doesn’t seem right. Is there a better way to do this, or am I missing something obvious?

(* - admittedly, we haven’t prepared a translation yet, but it’s all set up to do so…)

Damn, I hate that function.

:slight_smile:

I knew that. But there really isn’t an alternative for internationalized applications. You need SOME sort of templated output function - it’s not just that word order changes from language to language, but if you try to split your messages up into tiny substrings and then put them together, it’s almost impossible for the translator to do it correctly.

The only alternative is to use some full-scale templating language like Clearsilver - but that’s a really heavy hammer for a problem that printf and its numerous inbred cousins do quite well.

I don’t know how to fix the wide/narrow character problem, but the simple solution to the buffer overflow is to create another version where the first argument is a length, and then deprecate the original one.
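For illustration only, here is a minimal sketch of a related approach in plain C++ (not JUCE code): instead of asking the caller for a length, it lets `std::vsnprintf` report how many bytes are needed and then allocates exactly that. The name `formatSafely` is made up for this example.

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper (not part of JUCE): printf-style formatting that
// measures the output first, so there is no fixed-size buffer to overflow.
std::string formatSafely (const char* fmt, ...)
{
    assert (fmt != nullptr);

    va_list args;
    va_start (args, fmt);

    // vsnprintf consumes the va_list, so keep a copy for the second pass.
    va_list argsCopy;
    va_copy (argsCopy, args);

    // With a null buffer and size 0, vsnprintf only reports the length
    // needed (excluding the terminating null), or a negative value on error.
    const int needed = std::vsnprintf (nullptr, 0, fmt, args);
    va_end (args);

    if (needed < 0)
    {
        va_end (argsCopy);
        return {};
    }

    // Second pass: the buffer is sized from the measurement, plus the null.
    std::vector<char> buffer (static_cast<std::size_t> (needed) + 1);
    std::vsnprintf (buffer.data(), buffer.size(), fmt, argsCopy);
    va_end (argsCopy);

    return std::string (buffer.data(), static_cast<std::size_t> (needed));
}
```

A call such as `formatSafely ("Unable to open file %s with error code %s", path, code)` then works for paths of any length, at the cost of formatting twice.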

Confess - you love C++, with all its warts. This is one of them - you should learn to love its expediency and its history, and accept its gnarliness.

Really, formatted (like sprintf) kind of sucks for localization too.

Sure, you can take “%d nuns dancing on the head of %d pins” and translate it to some other western languages, but the order of the items inserted is fixed. That makes for really convoluted grammar in some languages depending on the subject. You generally want something like tokens: “%1 nuns dancing on the head of %2 pins”. That way if the language is most natural with something like “On pins of %2, %1 nuns gyrate…”, you can do it.

juce::String already has all the members you’d need to put together a pretty spiffy localized parameter string class.

Yes, absolutely good point. You absolutely can’t be sure that the order of terms is the same. I’m planning to have six languages - English, French, German, Indonesian, Spanish, Italian - because I know the first five fairly well and the last one is dead easy - and in these languages the order of nouns is basically the same - but Japanese would be very important and I’m fairly sure that its noun order can be different.

Hmm… makes me think here a bit. In particular, you only need to have the equivalent of %s - because for these messages, numbers are rare, and because you can just pre-format them as strings. So all you really need is %1, %2, %3 and nothing else.
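A rough sketch of what such a %1, %2, %3 formatter might look like, written against `std::string` so that it stands alone (a JUCE version would presumably take `juce::String` instead). The function name `formatPositional` and the `%%` escape are assumptions made for this example, not anything from JUCE.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: replaces %1..%9 in a template with args[0]..args[8].
// "%%" yields a literal '%'. Unknown or out-of-range tokens are left as is.
// The output is built in one left-to-right pass, so text inserted from an
// argument is never re-scanned for tokens.
std::string formatPositional (const std::string& pattern,
                              const std::vector<std::string>& args)
{
    std::string result;
    result.reserve (pattern.size());

    for (std::size_t i = 0; i < pattern.size(); ++i)
    {
        const char c = pattern[i];

        if (c == '%' && i + 1 < pattern.size())
        {
            const char next = pattern[i + 1];

            if (next == '%')
            {
                result += '%';
                ++i;            // skip the second '%'
                continue;
            }

            if (next >= '1' && next <= '9')
            {
                const std::size_t index = static_cast<std::size_t> (next - '1');

                if (index < args.size())
                {
                    result += args[index];
                    ++i;        // skip the digit
                    continue;
                }
            }
        }

        result += c;
    }

    return result;
}
```

With arguments `{ "10", "3" }`, both "%1 nuns dancing on the head of %2 pins" and "On pins of %2, %1 nuns gyrate" work, which is exactly the reordering a translator needs.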

Didn’t I send Jules something like this about a year ago?! But I can’t find it.

I might whip something out this afternoon…

(* - that is, me)

Well, interesting - I ran into a limitation of juce::String that’s preventing me from doing a really good job on this.

The issue is that there’s no efficient way to set a single character in a string! operator[] returns a const juce_wchar and there seems to be no setter method - so building a string by adding one character at a time is potentially quadratic in time, even if you have a good maximum bound in advance as to the length of the string.

Such a setter (not necessarily operator[]) should be added to juce::String. For parsing purposes, you often need to do this…

That’s crazy talk… if the underlying juce::String is UTF8 or UTF16 encoded then there is no 1:1 mapping between logical characters and physical positions in the memory block used to store the string. Attempting to discover the physical index of a logical character would run in O(N) where N ~= logical index, and then actually changing the character would be either O(1) or O(N) where N ~= logical index depending on the difference between the original code point and the new code point.

However, a manly way to resolve this would be to provide a non-const operator ONLY for a juce::String which uses UTF32 (since there is a 1:1 mapping). I believe you can do this yourself by calling toUTF32(), removing the const, and doing the work using UTF32 code points. Then you could convert it back I suppose, and wrap this all up in a nice interface that hides the mess.
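To make the cost argument concrete, here is a standalone sketch of the decode, edit, re-encode round trip using only the standard library (`std::u32string` stands in for a UTF-32 buffer; none of this is JUCE's API, and the function names are invented). It assumes well-formed UTF-8 and does no error handling.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes well-formed UTF-8 into one char32_t per
// code point. Malformed input is not detected.
std::u32string decodeUtf8 (const std::string& utf8)
{
    std::u32string out;
    std::size_t i = 0;

    while (i < utf8.size())
    {
        const unsigned char lead = static_cast<unsigned char> (utf8[i]);
        char32_t cp;
        std::size_t extra;   // number of continuation bytes that follow

        if      (lead < 0x80) { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }

        ++i;
        for (std::size_t k = 0; k < extra && i < utf8.size(); ++k, ++i)
            cp = (cp << 6) | (static_cast<unsigned char> (utf8[i]) & 0x3F);

        out += cp;
    }

    return out;
}

// Hypothetical helper: encodes code points back into UTF-8.
std::string encodeUtf8 (const std::u32string& text)
{
    std::string out;

    for (char32_t cp : text)
    {
        if (cp < 0x80)
        {
            out += static_cast<char> (cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char> (0xC0 | (cp >> 6));
            out += static_cast<char> (0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char> (0xE0 | (cp >> 12));
            out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char> (0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char> (0xF0 | (cp >> 18));
            out += static_cast<char> (0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char> (0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char> (0x80 | (cp & 0x3F));
        }
    }

    return out;
}

// O(N) decode, O(1) indexed write, O(N) encode. Out-of-range indices are ignored.
std::string setCodePoint (const std::string& utf8, std::size_t index, char32_t replacement)
{
    std::u32string wide = decodeUtf8 (utf8);

    if (index < wide.size())
        wide[index] = replacement;

    return encodeUtf8 (wide);
}
```

The indexed write itself is constant time, but each call still pays for two linear conversions, so this only pays off if many edits are batched between one decode and one encode.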

[quote]if the underlying juce::String is UTF8 or UTF16 encoded[/quote]

DOH! I should have realized that on my own, having done an awful lot with UTF-8 (ultra-quibble - note that there’s a dash in the official name).

As an aside, I have zero understanding of why anyone would use UTF-16 - it seems to have the worst of all worlds, not being backward compatible with ASCII, not having a predictable character size, and being twice as long as UTF-8 for coding “plain old ASCII” strings.

It’s a great choice if you want to call the Unicode version of the Win32 API functions.

In fact, it’s the only choice.

:frowning: Quite so. Yes, I vaguely knew that, but that doesn’t mean it’s rational!

Perhaps I should have spoken slightly differently and said, “What was going through the mind of the people who invented UTF-16 is beyond me.”

[quote=“TomSwirly”]but Japanese would be very important and I’m fairly sure that its noun order can be different.
[/quote]

Yes, you can see it for yourself with something like Google Translate. The # pins would be first in Japanese, and Hebrew too. German would be something like 10 Nonnen tanzen… in this case, but I’ve run into grammatical problems with technical phrases before.

If Microsoft hadn’t used it for their entire win32 API then probably nobody else would ever have bothered with it. I bet there’s a MS employee somewhere who really regrets making that decision…

Not sure about how efficient it is but doesn’t String::replaceSection do what you want?

It sure does, but he’s looking for better-than-linear performance (as would I). Look at replaceSection:

	while (i < index) {
		//...
		++insertPoint;
		++i;
	}

Just like I predicted, it is O(N) with N ~= insert index. And the rest of the function has to do the dirty work of adding up the sizes of some of the remaining code points, followed by a bunch of memory shuffles.

TheVinn gets it exactly.

I frankly shouldn’t worry for my own code, because my strings are a) short b) don’t have too many segments.

But a lot of the naïve ways to build strings are quadratic in the number of pieces or the length of the string - something that bites you down the road when you run your code on an entire document and wonder why it never comes back…
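As a small illustration of the difference, here is a sketch in plain C++ (`std::string` rather than `juce::String`; `joinPieces` is a made-up name). Writing `result = result + piece` in a loop copies everything built so far on each step, which is where the quadratic behaviour comes from. Measuring first and reserving once keeps the total work linear in the output length.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: concatenates all pieces with a single allocation.
std::string joinPieces (const std::vector<std::string>& pieces)
{
    // First pass: work out the final length.
    std::size_t total = 0;
    for (const auto& piece : pieces)
        total += piece.size();

    std::string result;
    result.reserve (total);   // one allocation up front

    // Second pass: append in place; earlier text is never copied again.
    for (const auto& piece : pieces)
        result += piece;

    return result;
}
```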

I’m trying to avoid “work” (that is, going back to my Windows build and debugging it) so I wrote FormatString and the implementation is here, with some tests.

Have fun with it!

Microsoft regretting anything other than not making more money by whatever means necessary? Oh man, I gotta leave before I say something even snarkier about Microsoft… Seriously, I’m outta here.

You’re putting yourself on the front-lines with that one Tom! You’re a brave soul!

Didn’t we have a thread about this same thing fairly recently…?

Like I think I said in that other thread (which I can’t seem to find now…), my preference is still to avoid any kind of printf-style nonsense, and to just write things like this:

[code]String translatedString = TRANS ("the {animal} sat on the {surface}.")
                              .replace ("{animal}", getAnimal())
                              .replace ("{surface}", getSurface());
[/code]

…which I think is much more readable than “%1” or “%s”, particularly in the actual translation file. And it’s totally clear from the code what’s going on. And it’ll be just as efficient as a template-based string formatting class.
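For anyone who wants to try the named-placeholder idea outside JUCE, here is a standalone sketch with `std::string`. The helper `replaceAll` is a hypothetical stand-in written for this example; it only mimics the general behaviour of a replace-all call and is not JUCE's implementation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a replace-all call: substitutes every
// occurrence of `token` in `text` with `value`.
std::string replaceAll (std::string text,
                        const std::string& token,
                        const std::string& value)
{
    if (token.empty())
        return text;   // nothing sensible to search for

    std::size_t pos = 0;

    while ((pos = text.find (token, pos)) != std::string::npos)
    {
        text.replace (pos, token.size(), value);
        pos += value.size();   // skip the inserted text so it is not re-scanned
    }

    return text;
}
```

Chaining works as in the snippet above: `replaceAll (replaceAll (pattern, "{animal}", "cat"), "{surface}", "mat")`. One caveat of chaining is that a later call also scans text inserted by an earlier one, so a value that happens to contain "{surface}" would be substituted too.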

Hmm, not a bad idea really… and people aren’t likely to use {}…