Bilingual termbase creation

This summer I will be working on a GSoC project with the Intellego team at Mozilla to lay the foundation for an automatic terminology translation tool for websites. During the first few weeks I will be researching the basics of how terminology extraction works.

The first step is to create a bilingual termbase of Mozilla-specific terminology from Mozilla's l10n resources (http://transvision.mozfr.org/downloads/). The Transvision site hosts TMX (Translation Memory eXchange) files containing the strings used in the Mozilla browser and FirefoxOS, together with their translations between many pairs of languages.

Here is a snippet from the memoire_en-US_es-ES.tmx file.

<tu tuid="browser/chrome/browser/browser.dtd:bookmarkThisPageCmd.label" srclang="en-US">
    <tuv xml:lang="en-US"><seg>Bookmark This Page</seg></tuv>
    <tuv xml:lang="es-ES"><seg>Añadir esta página a marcadores</seg></tuv>
</tu>
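
To get a feel for the format, here is a minimal Python sketch of how such translation units could be read into source/target pairs using only the standard library. The function name read_pairs and the bare <tmx><body> wrapper around the snippet are my own assumptions for illustration; the real file has a full header and many more units.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The snippet above, wrapped in a minimal (assumed) TMX skeleton.
SAMPLE = """<tmx version="1.4"><body>
<tu tuid="browser/chrome/browser/browser.dtd:bookmarkThisPageCmd.label" srclang="en-US">
    <tuv xml:lang="en-US"><seg>Bookmark This Page</seg></tuv>
    <tuv xml:lang="es-ES"><seg>Añadir esta página a marcadores</seg></tuv>
</tu>
</body></tmx>"""


def read_pairs(tmx_text, src="en-US", tgt="es-ES"):
    """Return (tuid, source segment, target segment) for each translation unit."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        # Map each language code to the text of its <seg> element.
        segs = {tuv.get(XML_LANG): (tuv.findtext("seg") or "") for tuv in tu.iter("tuv")}
        # Keep only units that have both languages.
        if src in segs and tgt in segs:
            pairs.append((tu.get("tuid"), segs[src], segs[tgt]))
    return pairs


pairs = read_pairs(SAMPLE)
```

Each entry in pairs keeps the tuid, so a phrase can always be traced back to the string it came from.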

The goal over the coming weeks is to extract the key terms from these phrases and statistically analyse them to build up a corpus of term pairs that we know are direct translations between the two languages. Since this involves many different techniques, we decided it was best to use third-party software, preferably open source, to carry out this task for us.
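
To give a rough idea of what "statistically analyse" can mean, here is a toy sketch that scores word pairs by how often they co-occur in aligned segments, using the Dice coefficient. This is not the method of any particular tool, only one simple association measure; the function name dice_scores is hypothetical, and the second and third sentence pairs are invented for illustration rather than taken from the TMX file.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs):
    """Score every (source word, target word) pair by its Dice coefficient."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        # Naive tokenisation: lowercase and split on whitespace.
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        # Count each source/target word combination once per segment pair.
        joint.update(product(s_words, t_words))
    # Dice = 2 * co-occurrences / (source occurrences + target occurrences)
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


sentence_pairs = [
    ("Bookmark This Page", "Añadir esta página a marcadores"),
    ("Save Page As", "Guardar página como"),  # invented example
    ("Print Page", "Imprimir página"),        # invented example
]
scores = dice_scores(sentence_pairs)
```

With so little data, words that appear only once score as highly as genuine matches, which is exactly why a real extraction tool needs a large corpus and better filtering than this.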

Using this corpus, a web interface would be created to dynamically replace the DOM contents of a web page with the one-to-one translation mappings collected.

Ideally we want to convert the TMX files to TBX (TermBase eXchange) files that contain translations of key terms, which we can then use to automatically translate websites.
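
As a sketch of what the output side might look like, the following builds a very small TBX-style document from extracted term pairs. It assumes the TBX-Basic element layout (martif, text, body, termEntry, langSet, tig, term) and leaves out the martifHeader that a valid file would need. The function name build_tbx, the entry ids, and the term pair "bookmark"/"marcador" are hypothetical placeholders, not results.

```python
import xml.etree.ElementTree as ET

# ElementTree serialises this qualified name as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tbx(term_pairs, src="en-US", tgt="es-ES"):
    """Build a header-less TBX-style tree with one termEntry per term pair."""
    martif = ET.Element("martif", {"type": "TBX-Basic", XML_LANG: src})
    body = ET.SubElement(ET.SubElement(martif, "text"), "body")
    for i, (src_term, tgt_term) in enumerate(term_pairs, start=1):
        # Hypothetical id scheme: c1, c2, ...
        entry = ET.SubElement(body, "termEntry", {"id": f"c{i}"})
        for lang, term in ((src, src_term), (tgt, tgt_term)):
            lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
            tig = ET.SubElement(lang_set, "tig")
            ET.SubElement(tig, "term").text = term
    return martif


# Placeholder term pair, for illustration only.
tbx = build_tbx([("bookmark", "marcador")])
tbx_xml = ET.tostring(tbx, encoding="unicode")
```

Each termEntry groups the same concept across languages, which is the main structural difference from TMX, where each unit holds a whole translated segment.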