L20n/Background
When writing localizable software, particularly in an environment based on a toolkit, software authors and localizers have to deal with composed strings in the UI. Examples are strings assembled from templates and computed data, such as an address or a generic toolkit error message.
Localizers face three major challenges in this context:
- The string composition code includes assumptions about the grammar or composition of text.
- Different plural forms depending on a value.
- Different grammatical forms, depending on a given value.
The first item really just covers annoying bugs in the software, but they are way too common, so it is listed here.
The second and third items are a tad more tricky, so let's look at an example. It's based on two rather unfounded assumptions: first, that Jägermeister is such a gross drink that it would be of male gender anywhere in the world, and second, that daisies are cute enough to be female.
English sample
10 small Jägermeister hang around.
9 small Jägermeister hang around.
... English is a bore ...
1 small Jägermeister hangs around.
Head off and order new ones!
The same with daisies:
10 small daisies hang around.
... well, yeah, ditto ...
1 small daisy hangs around.
Head off and protect your environment!
Now for the same in German (which is still pretty darn simple):
German sample
10 kleine Jägermeister standen rum.
... German is a bore ...
1 kleiner Jägermeister stand rum.
Bestelle eine neue Runde!
10 kleine Butterblümchen standen rum.
... still ...
1 kleines Butterblümchen stand rum.
Es ist Zeit für ein bisschen Umweltschutz!
Let's leave the zero value out of this discussion for the moment, because I really just entered fun values there. Let's assume that we're in the scenario of a generic toolkit message, and that Jägermeister and daisy are product-specific. You could just as well think of Firefox and Thunderbird. Let's have two variables in this sample: name for Jägermeister or daisy, and number for the values from 10 to 1.
For positive integers, gettext offers a reasonable approach: a per-language rule returns an index based on a number, and you use that index to pick one entry from an array of plural forms. gettext does not offer a solution for the grammar and gender problems in this example, though.
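The index-into-an-array idea can be sketched in a few lines. This is not the gettext API itself, just a stand-in for the mechanism; the names plural_index, FORMS_EN and format_plural are made up for illustration, and the rule shown is the usual one for English (one form for exactly 1, another for everything else).

```python
# Illustrative sketch only: these names are hypothetical, not part of gettext.

def plural_index(number: int) -> int:
    """English-style rule: index 0 for exactly one item, index 1 otherwise."""
    return 0 if number == 1 else 1

# One array of forms per message; the plural rule selects the entry.
FORMS_EN = [
    "{number} small {name} hangs around.",  # index 0: singular
    "{number} small {name} hang around.",   # index 1: plural
]

def format_plural(forms, number, name):
    """Pick the form by plural index, then substitute the variables."""
    return forms[plural_index(number)].format(number=number, name=name)

print(format_plural(FORMS_EN, 10, "Jägermeister"))
print(format_plural(FORMS_EN, 1, "Jägermeister"))
```

Note what the sketch cannot do: it has no way to turn "daisy" into "daisies", and it knows nothing about "kleiner" versus "kleines". That is exactly the gap described above.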
For the given languages (and most languages, really), you need to know the gender of name to create the right composed string. In the common scenario where the value of name is actually part of the localization data of your application, the localizer should know the gender. For other scenarios, the application may provide hints for common requests like this. In a CRM application, for example, extracting the gender of a contact person shouldn't be a problem. In the generic case, localizers will have to work around the lack of information like they do today. There's no way to tell the gender of a person called "Andrea" or "Mic", or "Moon".
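Here is a minimal sketch of the case where name lives in the localization data and carries its gender with it. The data layout, the property name "gender" and its values are assumptions made for this example, not a defined format. The German sample uses "kleiner" for Jägermeister and "kleines" for Butterblümchen, so the sketch labels them "masculine" and "neuter".

```python
# Hypothetical German localization data; the structure is an assumption.
ENTITIES_DE = {
    "jaegermeister": {"text": "Jägermeister", "gender": "masculine"},
    "daisy": {"text": "Butterblümchen", "gender": "neuter"},
}

# The singular template depends on the gender of the name.
SINGULAR_DE = {
    "masculine": "{number} kleiner {name} stand rum.",
    "neuter": "{number} kleines {name} stand rum.",
}
# In this sample the plural template is the same for both names.
PLURAL_DE = "{number} kleine {name} standen rum."

def hang_around_de(entity_id, number):
    """Select the template by number and gender, then fill in the values."""
    entity = ENTITIES_DE[entity_id]
    if number == 1:
        template = SINGULAR_DE[entity["gender"]]
    else:
        template = PLURAL_DE
    return template.format(number=number, name=entity["text"])

print(hang_around_de("jaegermeister", 1))
print(hang_around_de("daisy", 1))
print(hang_around_de("daisy", 10))
```

The point of keeping the gender next to the name is that the localizer who translates the name is also the one who knows which gender to record.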
Going back to the three target audiences, the intent of this approach is to shield the application logic from as much detail of a particular language as possible. Thus, no software author should ever see how many grammar variations the currently chosen language has; all this information should be strictly encapsulated within the localization and the localization library. In addition, it is up to the third group, people who know the internals of quite a bunch of languages, to create a set of properties and values that applications can use to mark up, for example, the gender of an item.
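To make the encapsulation concrete, here is a rough sketch in which the application calls a single function and passes only an item identifier and a number. Everything about plural forms and gender stays on the localization side. All names here (ENTITIES, MESSAGES, hang_around) and the data layout are hypothetical.

```python
# Hypothetical per-locale data; each locale stores whatever properties it needs.
ENTITIES = {
    "en": {"daisy": {"one": "daisy", "other": "daisies"}},
    "de": {"daisy": {"text": "Butterblümchen", "gender": "neuter"}},
}

def _hang_around_en(entity, number):
    # English only needs the singular and plural spelling of the name.
    if number == 1:
        return f"{number} small {entity['one']} hangs around."
    return f"{number} small {entity['other']} hang around."

def _hang_around_de(entity, number):
    # German needs the gender to pick the adjective ending in the singular.
    if number == 1:
        adjective = {"masculine": "kleiner", "neuter": "kleines"}[entity["gender"]]
        return f"{number} {adjective} {entity['text']} stand rum."
    return f"{number} kleine {entity['text']} standen rum."

# Localization side: one formatter per locale.
MESSAGES = {"en": _hang_around_en, "de": _hang_around_de}

def hang_around(locale, entity_id, number):
    """The only call the application makes; it carries no grammar details."""
    return MESSAGES[locale](ENTITIES[locale][entity_id], number)

print(hang_around("en", "daisy", 1))
print(hang_around("de", "daisy", 10))
```

The application code is identical for both locales, and adding a language with more grammar variations would not change it.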
Going back to the zero case in this example: for a known finite set of values, even this odd-looking case can be handled.
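One way this could look, as a sketch: when the set of names is known and finite, the localization can list exact messages for specific combinations and fall back to the general rule otherwise. The table EXACT_EN and the lookup order are assumptions for illustration; the zero messages are the fun values from the English sample.

```python
# Hypothetical exact-match table: (item, number) pairs with their own message.
EXACT_EN = {
    ("jaegermeister", 0): "Head off and order new ones!",
    ("daisy", 0): "Head off and protect your environment!",
}

# Singular and plural spellings of each known name.
NAMES_EN = {
    "jaegermeister": {"one": "Jägermeister", "other": "Jägermeister"},
    "daisy": {"one": "daisy", "other": "daisies"},
}

def hang_around_en(entity_id, number):
    """Exact matches win; everything else goes through the plural rule."""
    exact = EXACT_EN.get((entity_id, number))
    if exact is not None:
        return exact
    names = NAMES_EN[entity_id]
    if number == 1:
        return f"{number} small {names['one']} hangs around."
    return f"{number} small {names['other']} hang around."

print(hang_around_en("daisy", 1))
print(hang_around_en("daisy", 0))
print(hang_around_en("jaegermeister", 0))
```

This only works because every name is known in advance; for arbitrary user-supplied names there is nothing to key the special case on.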