Week 11

Continuing on from last week's small improvement of accommodating singular and plural words, we kept thinking of ways to improve the translation engine. I pointed out that certain words were still translated incorrectly because of how they are aligned in the transvision file.

<tu tuid="mail/chrome/messenger-newsblog/feed-subscriptions.dtd:subscriptionDesc.label" srclang="en-US">
    <tuv xml:lang="en-US"<segNote: Removing or changing the folder for a feed will not affect previously downloaded articles.</seg</tuv
    <tuv xml:lang="es-ES"<segNota: eliminar o cambiar la carpeta de un canal no afectará a los artículos descargados previamente.</seg</tuv
</tu>

articles,caducados,0.665205148335

In the above example the words articles and artículos should be paired together, as they are the correct translated pair. However, as you can see in the sentences above, the positions of the two words are not exactly aligned, so the algorithm returns an incorrect pair. My mentor suggested that I could develop an algorithm to align the words based on the results of a POS tagger.

I developed a function that takes both sentences and runs them through a POS tagger. It then compares the tag of each source word and matches it to a target word with the same tag, continuing this way while iterating through the sentence. These results are passed back and appended to the corpus. I appended the results instead of replacing the corpus because this gives us increased accuracy in the results without reducing the number of translated pairs.
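A minimal sketch of that idea, assuming the pattern library's taggers are used for both languages (the real function also appends its output to the existing corpus rather than standing alone):

from pattern.en import tag as tag_en
from pattern.es import tag as tag_es

def align_by_pos(source_sentence, target_sentence):
    # Tag both sentences, e.g. [('articles', 'NNS'), ...]
    source_tagged = tag_en(source_sentence)
    target_tagged = tag_es(target_sentence)
    pairs = []
    used = set()
    # Walk through the source words and pair each one with the first
    # unused target word that carries the same POS tag.
    for source_word, source_pos in source_tagged:
        for i, (target_word, target_pos) in enumerate(target_tagged):
            if i not in used and target_pos == source_pos:
                pairs.append((source_word, target_word))
                used.add(i)
                break
    return pairs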

After this change I noticed that the previously incorrect pairs were fixed.

articles,artículos,0.752740413515

Week 10

I imported the CSV of the extracted bilingual terms into the database before showing my mentor the translation results. On the surface it did seem that we were translating more words than before. It also meant that the early goal of using the transvision TMX files to create a corpus of terminology pairs was near completion.
My mentor pointed out that words needed to be converted between singular and plural based on their source. For example, if a source word in English is plural but the translation for the term is stored only as a singular, we need to detect this and convert the target word to a plural before replacement.

I had received a couple of links to research:
http://cldr.unicode.org/index/cldr-spec/plural-rules
https://developer.mozilla.org/en-US/docs/Mozilla/Localization/Localization_and_Plurals#Usage

While I was researching how to use these plural rules in the existing Python term extraction script, I came across the pattern library (http://www.clips.ua.ac.be/pages/pattern). It serves a similar function to NLTK but supports more features and other languages. The library is written in Python and contains methods to convert words between their singular and plural forms.

Utilising the parsetree module I analysed the part-of-speech tree of a segment, looking for tags (http://www.clips.ua.ac.be/pages/mbsp-tags) that indicate whether a word is singular or plural. A tag of NN (noun, singular or mass) or NNP (noun, proper singular) indicates that the source word is singular, in which case we convert the target word to its singular form, and vice versa for plurals.
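As a rough sketch of that logic, assuming pattern.es exposes pluralize() and singularize() the same way pattern.en does:

from pattern.en import parsetree
from pattern.es import pluralize, singularize

def match_number(source_segment, source_word, target_word):
    # Find the source word in the parsed segment and inspect its POS tag.
    for sentence in parsetree(source_segment):
        for word in sentence.words:
            if word.string.lower() != source_word.lower():
                continue
            if word.type in ('NN', 'NNP'):       # singular (or proper singular) noun
                return singularize(target_word)
            if word.type in ('NNS', 'NNPS'):     # plural noun
                return pluralize(target_word)
    # Leave the target word untouched if the source word was not found.
    return target_word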

Week 8 - 9

Over these weeks I experimented with various Python libraries for parsing HTML and XML, such as BeautifulSoup and lxml. Using a parser worked well, but it missed some elements. I found that using regular expressions was more reliable for replacing text inside elements, as the approach is more generic.
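As an illustration of the regex approach (a simplified sketch; the real replacement also has to respect capitalisation and plurals):

import re

def translate_text_nodes(raw_html, term_pairs):
    # term_pairs is a dict of source term -> translated term.
    def translate(match):
        text = match.group(1)
        for source, target in term_pairs.items():
            text = text.replace(source, target)
        return '>' + text + '<'
    # Only touch the text that sits between a closing '>' and the next '<',
    # so tag names and attributes are never modified.
    return re.sub(r'>([^<>]+)<', translate, raw_html)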

This week I backtracked to one of the project's earlier objectives and focused on extracting terminology from the TMX files in Transvision. After researching NLTK and reading the Natural Language Processing with Python book (http://www.nltk.org/book/), I decided to tackle this problem again.

I started working on a Python script to do bilingual term extraction with the help of NLTK. NLTK has a module called align, which uses statistical methods to predict translated pairs. NLTK provides a couple of algorithms for term extraction from aligned sentences (https://github.com/nltk/nltk/wiki/Machine-Translation); I decided to use the IBM Model 1 for now. The script parses the TMX file and creates a list of tuples of aligned pairs of translated sentences. Iterating through these pairs, they can be cleaned up using NLTK before being added to a corpus of AlignedSent objects (http://www.nltk.org/howto/align.html). The IBM model is then passed this corpus of aligned sentences so that it can train and figure out which words occur most often aligned with their translated counterpart. Each match is given a rank indicating how sure the model is that the pair is correct.
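The core of the script looks roughly like the sketch below. Note that NLTK has since moved these classes from nltk.align to nltk.translate; the example sentence pairs here are made up, and in NLTK's convention AlignedSent takes the target-language words first:

from nltk.translate import AlignedSent, IBMModel1

# Build the corpus of aligned sentence pairs (in the real script these come
# from the parsed TMX file and are cleaned up first).
bitext = [
    AlignedSent(['añadir', 'esta', 'página', 'a', 'marcadores'],
                ['bookmark', 'this', 'page']),
    AlignedSent(['página', 'no', 'encontrada'],
                ['page', 'not', 'found']),
]

# Train IBM Model 1 with a few EM iterations.
ibm1 = IBMModel1(bitext, 5)

# translation_table[target_word][source_word] is the model's estimate of how
# likely the two words are to be a translated pair.
print(ibm1.translation_table['página']['page'])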

I continued optimising the script: lowercasing all the tokens, eliminating tokens of fewer than 2 characters, eliminating words known to be stop words in each respective language, and keeping only tokens that consist solely of letters.
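Those filtering steps amount to a small helper along these lines (assuming NLTK's stopword corpora have been downloaded):

from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words('english')) | set(stopwords.words('spanish'))

def clean_tokens(tokens):
    cleaned = []
    for token in tokens:
        token = token.lower()          # lowercase every token
        if len(token) < 2:             # drop tokens shorter than 2 characters
            continue
        if token in STOPWORDS:         # drop known stop words
            continue
        if not token.isalpha():        # keep only tokens made up of letters
            continue
        cleaned.append(token)
    return cleaned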

Throughout these changes I saved the output of the model to CSV files to track the improvements of the script.

Week 7

Last week I progressed with the project by making a working prototype of the web page translator. My mentor pointed out a few issues to fix to improve the system. The TBX file we used had a few errors: some segment pairs had translated words separated by commas. The string replacement was also not taking pluralisation or capitalisation into account when replacing the source string. I also found that the text contents were sent all at once, and I needed to try a different method to check for any differences; a segment-by-segment approach seemed like a good one.

The current approach to translating the text uses Javascript to replace the text found in the DOM. Through research I came across NLTK in Python, which has many utilities we could reuse for our project, such as tokenizing segments of text. Moving the translation to the server side means that utilities provided by NLTK can be used in the process to translate the DOM, which is then sent to the browser already translated into the target language.
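For example, splitting a segment into tokens on the server side is a one-liner with NLTK (assuming the punkt tokenizer data has been downloaded):

from nltk.tokenize import word_tokenize

tokens = word_tokenize("Bookmark This Page")
print(tokens)   # ['Bookmark', 'This', 'Page']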

All hyperlinks on the website have to be rewritten so that, when clicked, they load within the iframe and go through our proxy. Many sites, including the Mozilla support sites, send the X-Frame-Options: DENY header, meaning that the website cannot be browsed within an iframe. To get around this issue, we load each link through our translation engine: it fetches the raw HTML, translates the DOM and sends it to the client for the iframe to load.
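Rewriting the links amounts to something like the following sketch, where the proxy endpoint name is made up for illustration:

from urllib.parse import quote, urljoin
from lxml import html

PROXY_URL = '/translate?url='   # hypothetical endpoint of our translation engine

def rewrite_links(doc, page_url):
    # Point every hyperlink back at the proxy so clicks stay inside the iframe.
    for anchor in doc.xpath('//a[@href]'):
        absolute = urljoin(page_url, anchor.get('href'))
        anchor.set('href', PROXY_URL + quote(absolute, safe=''))
    return doc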

Week 6 - Translating web pages

Last week I worked on connecting the API and frontend interface together to test our prototype. So far the frontend interface has a text area for input similar to Google Translate, where the user can enter a string and press the translate button to view the translated string in the text box to the right.

This week I am focusing on translating a full web page. To accomplish this I will load the URL's DOM contents in an iframe and then translate the text contents of the web page through Javascript. Google Translate uses the same approach, because loading a remote URL directly in an iframe means our Javascript cannot modify its contents due to the same-origin policy. Therefore, the target web page needs to be requested through the server and served from the same domain as the iframe.

One of the project's aims is to use the Mozilla support sites as a test bench for how well our translation engine is working. Using Python libraries such as requests, lxml and re, I obtained the raw contents of the response, stripped the script tags and added a base tag so that relative URLs resolve correctly. The script tags of the remote page had to be stripped to avoid any of its Javascript being executed outside the iframe.
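A stripped-down version of that fetching step, assuming requests and lxml are available:

import requests
from lxml import etree, html

def fetch_page(url):
    response = requests.get(url)
    doc = html.fromstring(response.content)
    # Remove every <script> element so none of the page's Javascript runs
    # once we serve it from our own domain.
    for script in doc.xpath('//script'):
        script.getparent().remove(script)
    # Insert a <base> tag so relative URLs keep resolving against the
    # original site.
    head = doc.find('head')
    if head is not None:
        head.insert(0, etree.Element('base', href=url))
    return html.tostring(doc)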

Through Javascript I was able to query the amaGama database with the text contents of the page and obtain a JSON array of any translation matches. Using string replacement functions within Javascript I translated the text of the document dynamically.

Week 5 - API and Web Interface

I began investigating amaGama and reading the project's source code. I used amaGama to build the database of the terms we had collected, using the following commands:

amagama-manage initdb -s en -s es
amagama-manage build_tmdb --verbose -s en -t es -i ../GaiaGlossary.tbx

amaGama exposes a REST API that allows a query to be passed in urlencoded format. The string is queried against the database and the result is returned as a JSON data structure. amaGama uses the Levenshtein distance to rank the results, which caused some queries to be omitted from the results.
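Querying it from Python looks roughly like this (host and port are assumed here; amaGama's TM server exposes a /tmserver/<source>/<target>/unit/<text> endpoint):

import requests
from urllib.parse import quote

query = "Bookmark This Page"
url = "http://127.0.0.1:8888/tmserver/en/es/unit/" + quote(query, safe='')

# Each match in the JSON response carries the source text, the translated
# target text and a quality/rank score.
for match in requests.get(url).json():
    print(match['source'], '->', match['target'])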

amaGama is a Flask-based project, and we had decided early on that a Python-based framework like Django or Flask would be ideal for building the web interface on. Therefore, I downloaded the source and started extending amaGama.

After setting up the project locally I looked into how amaGama stores the terminology in the database. Each terminology entry is split into a vector of stemmed words.

[Screenshot: a terminology entry stored in the database as a vector of stemmed words]

When a word or segment is queried through the REST API, it searches through these vectors for matches and returns the translated equivalent through a foreign key lookup.

amaGama was not made for searching through terminology, though, so some queries did not return any results even though there was a clear match based on what is contained in the database. I modified the query function by removing the Levenshtein distance and ranking metric; this made the query more generic and therefore returned more results. This gave us a simple base from which to start modifying the SQL query to our needs.

While the REST API still needed improvement, for now I was happy with the results it returned. I started to put together the web interface to try out the results.

[Screenshot: the prototype web interface]

Week 4

Continuing on from last week, I researched how the CSV file containing the translated terminology could be converted to the TBX format we wanted. That is when I came across the Translate Toolkit project. It contains many command line utilities (http://translate-toolkit.readthedocs.org/en/latest/commands/index.html) that are useful for our project.

One of these is a tool called poterminology, which extracts terminology from the TMX files from transvision. I experimented with this tool to try to get the terminology extraction accurate. Using another of the toolkit's command line utilities, I was able to convert the CSV file to a TBX file.

Related to the toolkit is a web service called amaGama (http://docs.translatehouse.org/projects/amagama/en/latest/) that uses a bilingual terminology file to create a translation memory data store that we can query over a REST API. I used amaGama to build a database of terminology by importing the Gaia CSV file. This tool was very important to the progress of the project, since we were not having much luck with our first objective of extracting terminology from the TMX files.
The idea is to build on top of amaGama, utilising PostgreSQL as the database. The main thing amaGama provides is a way to search through the terminology database with great speed. We can build a web interface on top of this platform, utilising its features.

Week 3

We were still on the hunt for a good terminology extraction tool. My mentor sent me a list of resources to look into for extracting terminology:

https://code.google.com/p/maui-indexer/
https://pypi.python.org/pypi/topia.termextract/
http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm
http://ngram.sourceforge.net/
http://texlexan.sourceforge.net/

The most promising tool I came across for our purpose was the Okapi term extraction tool. I was able to extract terms in order of frequency. However, it produced two separate files, one for the source language and one for the target language. The problem was that these extracted terms were not aligned.

EN:
430 message
360 You
349 file
332 server
305 page
304 messages
291 brandShortName
277 want

ES:
614 de
347 en
290 mensaje
289 que
234 página
232 web
220 para
201 conexión

I was unable to produce an aligned terminology extraction this way, so I discussed the issues I was facing with my mentor.

My mentor mentioned that he had a CSV file of extracted terminology (https://www.transifex.com/projects/p/gaia-l10n/glossary/l/es/) that we could use as a starting point, allowing us to skip the current step. We decided that we would use the CSV to continue with the project while I kept researching methods of terminology extraction.

Terminology Extraction Tools Research - Week 2

Continuing on from last week, I am dedicating my time to finding a terminology extraction tool. I had discussed the possibility of creating such a tool, but we agreed that it would be better to find an open source tool that did the job, due to the time constraints we faced.

Through research I came across various resources on the topic, including a comparison of tools (http://michellez1231.blogspot.co.uk/2010/07/research-on-terminology-extraction.html). However, most of these tools are not free, and those that do offer a free version usually limit usage or time. More importantly, most of these tools are not open source.

I continued my research and came across https://code.google.com/p/extract-tmx-corpus/, a Windows-only tool. Since I use OS X, I took the time to set up a Windows VM to test out the program. The program had a very simple UI, however its functionality did not seem to work: the output files were all empty and no terms were extracted. After a couple of attempts with different data sets, I gave up and moved on.

I discussed with my mentor that I was having trouble finding a suitable terminology extractor; he mentioned that he would help me out by reaching out to other developers.

Bilingual termbase creation

This summer I will be working on a GSOC project with the Intellego team at Mozilla to lay the foundation for an automatic terminology translation tool for websites. During the first weeks I will be researching the basics of how terminology extraction works.

The first step is to create a bilingual termbase consisting of Mozilla-specific terminology from the Mozilla l10n resources (http://transvision.mozfr.org/downloads/). The transvision site holds TMX files for the strings used in the Mozilla browser and Firefox OS, containing translations of text between many pairs of languages.

Here is a snippet from the memoire_en-US_es-ES.tmx file.

<tu tuid="browser/chrome/browser/browser.dtd:bookmarkThisPageCmd.label" srclang="en-US">
    <tuv xml:lang="en-US"><seg>Bookmark This Page</seg></tuv>
    <tuv xml:lang="es-ES"><seg>Añadir esta página a marcadores</seg></tuv>
</tu>

The goal over the coming weeks is to extract the key terms in these phrases and statistically analyse them to build up a corpus of terms that we know are direct translations between the languages. Since this involves many different techniques, we decided it was best to use third party software, preferably open source, to carry out this task for us.

Using this corpus, a web interface would be created to dynamically replace the DOM contents of a web page with the one-to-one translation mappings collected.

Ideally we want to convert the TMX files to TBX files containing translations of key terms, which we can use to automatically translate websites.