SummerOfCode/2012/l10n-tool

From MozillaWiki

l10n tool for Standardization of Localization

Localization currently lacks a standardized way of translating recurring words. Without an exhaustive list of standardized usage, l10n contributors run into various problems, and new contributors wanting to join the community even more so. In a nutshell, the goal of the project is to create an exhaustive database of entries, terms and words together with the suggested translation for each. This should also extend to small phrases and sentences.

The idea is to run an MT system on the existing localization work. Essentially, this means writing scripts that convert the existing localization work into a format suitable for training an MT system. These scripts will extend those of "Transvision", which produces tmx files nightly. From the output of this step, a list will be created containing the entries, terms and words and their suggested translations. Corrections to this list, if any, will be made manually. The list will then be organized into a database, along with a small web portal that helps l10n contributors find words and their preferred translations easily. Again, the work will use the Transvision portal as a base.

At the outset the aim is to do this for 4 languages (due to language restrictions in initial verification), later extending to all languages supported by Mozilla. Finally, the suggestions from the tool will be compared for quality against the already localized strings, and the quality score from the MT system will be used to find any inconsistencies in the localized strings. As can be seen, Transvision has already partly achieved a few of the goals listed above. Hence, the plan is to leverage Transvision and extend it. [Detailed Proposal][GSoC]
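As a rough illustration of the conversion step, the sketch below turns a TMX document into the two line-aligned plain-text files that statistical MT training typically expects (one segment per line, same line number on both sides). It is not the project's actual script: the function names `tmx_to_pairs` and `write_parallel` are hypothetical, and it assumes the TMX marks languages with the `xml:lang` attribute on each `tuv` element.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def tmx_to_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) string pairs found in a TMX document.

    Hypothetical helper: assumes each <tu> holds one <tuv> per language.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.iter("tuv"):
            seg = variant.find("seg")
            if seg is not None:
                # Collapse whitespace so every segment fits on one line.
                text = " ".join("".join(seg.itertext()).split())
                segments[variant.get(XML_LANG)] = text
        source = segments.get(source_lang)
        target = segments.get(target_lang)
        # Keep only units that have non-empty text on both sides.
        if source and target:
            pairs.append((source, target))
    return pairs


def write_parallel(pairs, source_path, target_path):
    """Write the pairs to two files whose lines correspond one to one."""
    with open(source_path, "w", encoding="utf-8") as src, \
            open(target_path, "w", encoding="utf-8") as tgt:
        for source, target in pairs:
            src.write(source + "\n")
            tgt.write(target + "\n")
```

Units that lack either language are dropped rather than written with an empty line, since a missing line on one side would shift the alignment of everything after it.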

Objective

The main objective of this project was to give l10n teams across languages a tool that helps them translate strings by leveraging an MT tool such as Moses. From an implementation point of view, the goals were:

  1. Portal to query for standardized terms
  2. Translation for Mozilla l10n-files
  3. Supplementary scripts

Latest Updates

Portal

The portal to query for standard terms is up and running and as of now supports two languages:

  1. French
  2. Spanish

The l10n-tool [Portal] is similar to Transvision, which is used by the Frenchmozilla team.

How To

The portal is meant as an aid for localizers who want to quickly check how a particular word was translated in previously translated pages. It is straightforward to use.

There are three fields

  • Language - The language you want to run the query on
  • Accuracy - An indicator of how certain the MT tool was about its translation. You can adjust it to filter out results.
  • Search Term - The term you want to search for.
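The way the three fields combine can be sketched as a simple filter. This is only an illustration, not the portal's actual code: the `TermEntry` record, the `search_terms` function and the 0.0 to 1.0 accuracy scale are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TermEntry:
    """Hypothetical record for one suggested translation."""
    language: str      # locale code, e.g. "fr"
    source: str        # English term or phrase
    translation: str   # suggested translation
    accuracy: float    # assumed MT confidence, from 0.0 to 1.0


def search_terms(entries, language, term, min_accuracy=0.0):
    """Return entries for one language whose source text contains the term.

    Entries below min_accuracy are filtered out; the rest are sorted
    with the most confident suggestion first.
    """
    needle = term.casefold()
    hits = [
        entry for entry in entries
        if entry.language == language
        and entry.accuracy >= min_accuracy
        and needle in entry.source.casefold()
    ]
    return sorted(hits, key=lambda entry: entry.accuracy, reverse=True)
```

Raising `min_accuracy` trades recall for precision: fewer suggestions are shown, but the ones that remain are those the MT tool was more certain about.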

Translating Corpus

Achieving this was probably the most desired goal of this project.

This has been successfully achieved for the French l10n-files. Moses, after being trained on the existing translations from the "release" repo, was asked to translate a "beta" repo file from English to French. The results were more than good.

The results were then compared to the existing French "beta" repo file. Most of the results matched. For those that did not, a comparison of the two strings suggested that the translations produced by Moses were much better.
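A comparison of this kind can be sketched as follows. The function names and the whitespace-only normalization are assumptions for illustration; the sketch reports how many strings match exactly and collects the rest for manual review, which is where the judgment about which translation is better still has to be made by a person.

```python
def normalize(text):
    """Collapse runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def compare_translations(machine, reference):
    """Compare MT output with existing translations.

    Both arguments are dicts mapping a string ID to its translation.
    Returns (match_rate, mismatches), where mismatches is a list of
    (string_id, machine_text, reference_text) tuples for manual review.
    Only IDs present in both dicts are compared.
    """
    shared = sorted(machine.keys() & reference.keys())
    mismatches = [
        (key, machine[key], reference[key])
        for key in shared
        if normalize(machine[key]) != normalize(reference[key])
    ]
    matched = len(shared) - len(mismatches)
    match_rate = matched / len(shared) if shared else 0.0
    return match_rate, mismatches
```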

Present Status

The exercise of translating the corpus has been completed for

  1. French
  2. Spanish
  3. Hindi

Corpus and Results [Download]

Supplementary Scripts

Additional supplementary scripts have been written to make the tool easier to set up and use.

The [l10n-tool scripts] are hosted on GitHub.

Midterm Progress

Week 1

To Do                                              Status
Study Machine Translation Tools                    Done
Study Input format for MT Tools                    Done
Scripts to Convert l10n files to required format   Done

Week 2

To Do                                              Status
Install Moses                                      Done
Parallel Corpus Study                              Done
Scripts to Convert l10n files to required format   Partial

Week 3

To Do                                              Status
Scripts to Convert l10n files to required format   Done

After initial work with the Mozilla l10n files and Moses, it seems that the existing data might not be sufficient. Furthermore, not all of the data in l10n files such as .dtd and .properties can be used, which poses a potential problem for effective learning. One way around this was to use Transvision to generate additional data; the idea was to use the Transvision glossary. However, that didn't work out, as the alignment is not reliable enough to feed its data to the MT system.

So as of now I have the following options:

  1. Use l10n files of other projects
  2. Improve the alignment accuracy of Transvision
  3. Some other way I haven't thought of yet
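For reference, the step of turning .dtd and .properties files into aligned training data can be sketched as below. This is a simplified illustration rather than the project's scripts: the function names are hypothetical, the .dtd regular expression does not skip entities inside comments, and the .properties parser ignores line continuations, escapes and the ":" separator. Pairing strings by their entity ID across the English and localized files avoids the unreliable alignment described above.

```python
import re

# Matches <!ENTITY name "value"> or <!ENTITY name 'value'>.
DTD_ENTITY = re.compile(
    r'<!ENTITY\s+([\w.\-]+)\s+(?:"([^"]*)"|\'([^\']*)\')\s*>'
)


def parse_dtd(text):
    """Return {entity_name: value} for the entities in a .dtd file."""
    entries = {}
    for match in DTD_ENTITY.finditer(text):
        double_quoted, single_quoted = match.group(2), match.group(3)
        value = double_quoted if double_quoted is not None else single_quoted
        entries[match.group(1)] = value
    return entries


def parse_properties(text):
    """Return {key: value} for simple key=value lines in a .properties file."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, separator, value = line.partition("=")
        if separator:
            entries[key.strip()] = value.strip()
    return entries


def align_by_id(source_entries, target_entries):
    """Pair source and target strings that share the same ID.

    IDs missing on either side, or with an empty string, are skipped.
    """
    shared = sorted(source_entries.keys() & target_entries.keys())
    return [
        (source_entries[key], target_entries[key])
        for key in shared
        if source_entries[key] and target_entries[key]
    ]
```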

Summary

Based on reading and discussions with my mentor, a decision has been made to first adopt Moses, a statistical MT tool. However, there are potential problems that we could face with MT. The first is the lack of sufficient data to learn from when using just the Mozilla l10n files. This could be addressed by using supplementary material from other l10n projects or from well-established repositories of translated work. The question remains whether this will give good enough results. The answer, I think, as of now is yes. Furthermore, it needs to be considered whether the tool can be integrated with existing Mozilla localization tools such as Mozilla Translator.