Two problems with String::formatted (one a buffer overflow!)

I have to say, Jules’ solution is far superior.

Well, that was just an example anyway - you could use %foo or $(foo) or <foo> or anything you like, as long as you replace it correctly. It just seems to make more sense to me that the replaced symbols should be self-explanatory, rather than just being a number whose meaning is buried in the code somewhere.

No sarcasm meant - I mean, I have between one and two hundred user messages in my program and none of them have a { or }. Why would I need that? I have ( and ), and if I needed another sort of bracket, [ and ].

It’s probably a better idea than what I was scribbling down while waiting for SLOW Windows builds (I’m slack and working on the Mac with VMWare…)

Now that I look at my code using internationalization, I see two downsides to Jules’ strategy, though they could perhaps be worked around.

They both have to do with errors: what if you try to substitute a token that isn’t in the string? What if you forget to substitute a token that is in the string?

This isn’t such a big deal for me, because I’m both the programmer and the translator, and because the program isn’t huge. But in “enterprise” organizations, the programmer probably has little or no contact with the translator, and there might easily be thousands of messages (the last internationalized program I worked on had tens of thousands!). You cannot rely on someone looking at the output of every one of these in the context of the program - particularly since a lot of error messages are hard to reproduce.

I’m still going to use the {} version for my own work, as it’s faster and easier. It’s possible I might be the only person actually to internationalize a Juce application (I don’t see anyone mentioning it on the net) so we can deal with these issues when someone actually does have an app with a thousand messages in it… :smiley:

Using Jules’ technique you would get a more descriptive output. i.e. “{surface}” instead of “$1”

That’s not an acceptable solution (for an enterprise operation) - as I mentioned above, you can’t have someone check all those messages by hand every time you make a build, and it might be very difficult even to make all the messages appear (because many of them are error messages that are hard to deliberately cause).

You can’t “break it and then let someone find it” in a large system with thousands or tens of thousands of messages. There needs to be a way to automatically do this checking at build time.

I don’t think there’s anything inherently more error-prone in using {foo} than using %1, %2. In either case you can make the mistakes that you mention above. But when things have labels rather than magic numbers, it does cut down the risk of your parameters being in the wrong order.

If I had to manage a really huge project then I’d probably create a special function that will assert if the substring isn’t found, and perhaps something that would assert if there seems to be an un-replaced symbol left behind in the string.

Probably the best solution would be a nice function that could do it all in one hit, and check the result, e.g.:

String translated = translateWithReplacements ("the {animal} sat on the {surface}.", "{animal}", getAnimal(), "{surface}", getSurface());

This would be able to do a much smarter job of error-checking than a generic printf-style approach.
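For what it’s worth, here’s a minimal sketch of what such a function might look like. translateWithReplacements is hypothetical (it isn’t part of JUCE), and I’m assuming a standard JUCE project so that String and jassert are available; the asserts fire on exactly the two error cases raised above.

[code]
// Hypothetical helper - not part of JUCE. Replaces two named tokens
// and asserts on both failure modes discussed above.
#include <JuceHeader.h>

static String translateWithReplacements (const String& text,
                                         const String& token1, const String& value1,
                                         const String& token2, const String& value2)
{
    String result (text);

    jassert (result.contains (token1));  // trying to substitute a token that isn't in the string
    result = result.replace (token1, value1);

    jassert (result.contains (token2));
    result = result.replace (token2, value2);

    // A brace surviving to this point means a token was never substituted.
    jassert (! result.containsChar ('{') && ! result.containsChar ('}'));
    return result;
}
[/code]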

I agree that the named parameters are easier to use, and avoid the “order issue”.

But there’s no way to force verification other than executing every possible codepath that generates messages. With %1 and such, you can automatically verify at compile time or “startup time” that you have exactly the right number of arguments.

You simply have a set of classes - Trans0, Trans1, Trans2… At construction, you verify that e.g. Trans2’s message has only two parameters in it. And Trans2 only allows exactly two parameters to format so you catch that sort of error at compile time.

The one issue is making sure that all of these are constructed during startup, but that’s pretty easy to do.
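A minimal sketch of that scheme (all names are hypothetical): the constructor verifies at construction time that the message contains its placeholders, and format() takes exactly two arguments, so an arity mistake is a compile error.

[code]
// Sketch of the TransN idea - names are hypothetical.
#include <cassert>
#include <string>
#include <utility>

class Trans2
{
public:
    explicit Trans2 (std::string msg) : message (std::move (msg))
    {
        // Construction-time check: both placeholders must be present.
        assert (message.find ("%1") != std::string::npos);
        assert (message.find ("%2") != std::string::npos);
    }

    // Exactly two parameters - passing the wrong number won't compile.
    std::string format (const std::string& a, const std::string& b) const
    {
        std::string result = message;
        replaceAll (result, "%1", a);
        replaceAll (result, "%2", b);
        return result;
    }

private:
    static void replaceAll (std::string& s, const std::string& from, const std::string& to)
    {
        for (std::string::size_type pos = 0;
             (pos = s.find (from, pos)) != std::string::npos;
             pos += to.length())
            s.replace (pos, from.length(), to);
    }

    std::string message;
};

// Defined at file scope so the construction-time check runs at startup,
// even if the message is only ever shown on a rare error path.
static const Trans2 diskFullMessage ("Could not write %1: only %2 bytes free.");
[/code]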

[quote=“jules”]If I had to manage a really huge project then I’d probably create a special function that will assert if the substring isn’t found, and perhaps something that would assert if there seems to be an un-replaced symbol left behind in the string.[/quote]

Same problem yet again. For a large program, you can’t force every single codepath to be executed in every language during testing.

The verification has to happen either at compile time, or at construction time (because it is fairly easy to force all variables of a given type to be constructed at startup). Verification that happens later on doesn’t work.

I did work on such a large system. Even though we had an error rate below 1%, which was pretty good, with tens of thousands of messages and hundreds of them changing or being translated every week, that still worked out to multiple new issues appearing every day.

This is all sort of academic - I don’t think your target audience really is large programs with dozens of programmers.

FWIW, I only meant to convey the idea that you want tokens, not fixed param order. I strongly concur with Jules’ position that meaningful tokens are the way to go. On top of everything else, it makes it easier for the person doing a translation. %1 is not as clear as {number_of_nuns}.

Also like Jules, I can see no real difference between %1 and {toe jam}; either way, you have to have an escape mechanism for the token character(s) to be used in strings, etc.

The issue of errors, missing tokens, and even missing strings is common to every localization scheme. I’ve always taken the approach of a ‘master language’: the one in which all the error messages, etc. are present and expected to work correctly. I then always have a mechanism to compare translations to it - basically a simple tool to make sure that a) every string is accounted for, b) no string exceeds hard limits (more of an issue in embedded work, where you have fixed displays, etc.), and c) tokens match in strings that have them (not in order, but the same tokens). Otherwise you need a QA team to test every case in every language, which still misses stuff.
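A sketch of what checks (a) and (c) might look like, assuming (hypothetically) that each language ships as a key-to-string table: extract the {token} set from each string, and require that the translation has every key with exactly the same tokens as the master.

[code]
// Hypothetical build-time validator: compares a translated string table
// against the master language table.
#include <iostream>
#include <map>
#include <set>
#include <string>

static std::set<std::string> extractTokens (const std::string& s)
{
    std::set<std::string> tokens;
    for (auto open = s.find ('{'); open != std::string::npos; open = s.find ('{', open + 1))
    {
        const auto close = s.find ('}', open);
        if (close == std::string::npos)
            break;
        tokens.insert (s.substr (open, close - open + 1));  // e.g. "{animal}"
    }
    return tokens;
}

static bool validateTranslation (const std::map<std::string, std::string>& master,
                                 const std::map<std::string, std::string>& translated)
{
    bool ok = true;

    for (const auto& [key, masterText] : master)
    {
        const auto it = translated.find (key);

        if (it == translated.end())   // check (a): every string accounted for
        {
            std::cerr << "missing translation: " << key << '\n';
            ok = false;
        }
        else if (extractTokens (masterText) != extractTokens (it->second))
        {
            // check (c): same tokens, in any order
            std::cerr << "token mismatch in: " << key << '\n';
            ok = false;
        }
    }

    return ok;
}
[/code]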

Something I’ve only adopted in recent years is to do something similar at run time. Basically, I do the same tests with localized content that I used to do at build/compile time, and fall back to the master language if I can’t square the content. I also tend to be more forgiving about token formatting ({ toe jam } is accepted as {toe jam}). This came from one of our startups wanting to let international distributors do their own localizations.
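Reusing extractTokens from the sketch above, the run-time fallback might look something like this (again hypothetical): normalize the token spacing, and only accept the localized string if its token set matches the master’s.

[code]
#include <set>
#include <string>
// extractTokens as defined in the previous sketch.

// Strip spaces just inside braces so "{ toe jam }" matches "{toe jam}".
static std::string normalizeTokens (const std::string& s)
{
    std::string out;
    bool inToken = false;

    for (const char c : s)
    {
        if (c == '{')  inToken = true;
        if (c == '}')  inToken = false;
        if (inToken && c == ' ')
            continue;   // forgive extra spacing inside a token
        out += c;
    }
    return out;
}

// Accept the localized string only if its token set matches the master's;
// otherwise fall back to the (better tested) master language.
static std::string pickString (const std::string& localized, const std::string& master)
{
    const std::string cleaned = normalizeTokens (localized);
    return extractTokens (cleaned) == extractTokens (master) ? cleaned : master;
}
[/code]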

jfitzpat: the one advantage of %1 is that you can do the verification at compilation/construction time - aside from that, it’s inferior.

Unfortunately, I can’t see any way with named parameters to do verification without going through every code path that might generate a message…

I think the take-away point here is that internationalization is messy, there is no “perfect” solution, and when deciding on how to implement the translations in a project you must carefully balance mutually exclusive properties to meet business needs.

[quote=“TomSwirly”]jfitzpat: the one advantage of %1 is that you can do the verification at compilation/construction time - aside from that, it’s inferior.
[/quote]

You lost me there. I use tokens of many kinds in lots of projects. Not just languages, but graphical skins, etc. and I detect resource problems when the project is built.

[quote=“TomSwirly”]
Unfortunately, I can’t see any way with named parameters to do verification without going through every code path that might generate a message…[/quote]

That problem exists in English, long before an app is localized. I’ve seen plenty of apps pop up blank error messages in their ‘native’ language. The point I was trying to make above is that you have hopefully tested one language this way. If not, then at least it is the ‘most tested’ language.

So I automatically test target resources (not just strings) against the master for accuracy and completeness. If you test resource equivalency then the translated language should be no more or less well behaved than the master language. Resources are simpler than code paths, which never get properly covered anyway, so I’ve always validated there.

But I’m not going to argue with you. I agree with Vinnie’s point. Localization is like giving an enema to an elephant. You can proclaim to have a superior technique, but the reality is that you are worried, filthy, and sincerely wishing you were working on something else…

[quote=“jfitzpat”]You lost me there. I use tokens of many kinds in lots of projects. Not just languages, but graphical skins, etc. and I detect resource problems when the project is built.[/quote]

So - how do you do it in C++?

Let’s get specific. Suppose the sentence is phrase = "Please put the {foo} in the {bar}" - but the programmer mistakenly says phrase.replace ("{foo}", "cat"); and forgets about the second token.

How do you detect that at build time?

You can do a good job at localization. I worked somewhere where they did. It took a lot of resources, though - they more or less had a named token for each message and for each parameter in a message…!

The same way I know that the source is all checked in and tagged…

Seriously, the superficial response would be, ‘make a class for tokens’, then the programmer can’t put in gibberish, but that isn’t what you are really asking.

But we appear to have radically different understandings of ‘build’. Think of it this way: when I started programming in C++, there was no such thing as a C++ compiler. You ran a C++ preprocessor which turned your C++ source into C, then a C compiler which converted that C into assembly language, then an assembler. Further, when you work on embedded platforms, you often don’t have things like ‘file systems’, so you might need to run self-rolled tools to package binary resources for inclusion in flash, etc…

This is conceptually not much different from, say, the Introjucer. Which takes abstract content and generates platform specific content, as well as bundling things like binary resources in a portable way.

So, while there are ways to enforce proper syntax in C++ - between template metaprogramming and elaborate pre-processor use, you can do some fairly amazing things at compile time - I find that stuff gibberish to read and hard to debug, so I tend to split it out into a separate pass. This has the benefit of skipping it until you get ready for an external build.

If you always use the same class/methods to convert tokens, it is not hard to scan your source for it with a utility and validate strings/tokens. I also tend to look for things like hardcoded strings outside of the standard method. I’ve seen people roll this sort of stuff up with lint, piping, grep, etc. but, again, I find that hard to read/maintain gibberish. I tend to just write a tool (or more likely, reuse a tool I’ve written before).
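A sketch of that sort of utility, assuming (hypothetically) that all user-visible strings go through a single entry point spelled TRANS ("…"): it walks a source tree and prints each literal with its location, ready to be checked against the string table. A real tool would also need to handle escaped quotes, concatenation, and calls split across lines.

[code]
// Hypothetical source scanner: finds TRANS ("...") calls and lists the
// string literals so they can be validated against the string table.
#include <filesystem>
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main()
{
    const std::regex call (R"rx(TRANS\s*\(\s*"([^"]*)")rx");

    for (const auto& entry : std::filesystem::recursive_directory_iterator ("src"))
    {
        const auto ext = entry.path().extension();
        if (ext != ".cpp" && ext != ".h")
            continue;

        std::ifstream in (entry.path());
        std::string line;
        int lineNumber = 0;

        while (std::getline (in, line))
        {
            ++lineNumber;
            for (std::sregex_iterator it (line.begin(), line.end(), call), end; it != end; ++it)
                std::cout << entry.path().string() << ':' << lineNumber
                          << "  \"" << (*it)[1] << "\"\n";  // literal to validate
        }
    }
    return 0;
}
[/code]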

FWIW I spent 10+ years at a huge multi-national. We did very good localization as well. That’s part of the reason I think of builds differently: if you want reproducible builds, you don’t go developer->world. The ‘build master’ replicates your build on a separate, dedicated machine. It’s also where I started checking resource integrity for alternate languages. It cuts down on QA resources. We tried it on some releases and tracked language-specific reports to see if the savings came with a cost.