Week 7

Last week I progressed with the project by building a working prototype of the web page translator. My mentor pointed out a few issues to fix in order to improve the system. The TBX file we used had a few errors: in some segment pairs the translated words were separated by commas. The string replacement also did not take pluralisation or capitalisation into account when replacing the source string. I also found that the text contents were sent all at once, so I needed to try a different method to check for any differences. A segment-by-segment method seemed like a good approach.
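A minimal sketch of how the capitalisation and pluralisation issue could be handled, not the project's actual code. The function names, the naive "add an s" plural rule and the example glossary pair are all assumptions made for illustration.

```python
import re


def match_case(source: str, target: str) -> str:
    """Copy the capitalisation pattern of `source` onto `target`."""
    if len(source) > 1 and source.isupper():
        return target.upper()
    if source[:1].isupper():
        return target[:1].upper() + target[1:]
    return target


def replace_term(text, term, translation, plural_translation=None):
    """Replace `term` in `text`, keeping capitalisation and plural form.

    Assumption: the source plural is formed by appending "s". Real
    terminology data would need proper morphological handling.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"(s?)\b", re.IGNORECASE)

    def substitute(match):
        is_plural = bool(match.group(1))
        if is_plural and plural_translation is None:
            # No plural form available: leave the source text untouched
            # rather than insert a wrong form.
            return match.group(0)
        replacement = plural_translation if is_plural else translation
        return match_case(match.group(0), replacement)

    return pattern.sub(substitute, text)


# Hypothetical glossary pair, for illustration only.
result = replace_term(
    "Bookmarks are saved. Open a bookmark.",
    "bookmark", "marcador", "marcadores",
)
```

The sentence-initial "Bookmarks" keeps its capital letter and plural form, while the lowercase singular is replaced as is.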

The current approach to translating the text uses JavaScript to replace the text found in the DOM. Through research I came across NLTK, a Python library with many utilities that we could reuse for our project, such as tokenizing segments of text. Moving the translation to the server side means that the utilities provided by NLTK can be used to translate the DOM, which is then sent to the browser already translated into the target language.
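The server-side idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: it uses only the Python standard library, with a simple regular expression standing in for an NLTK tokenizer, and the names `translate_segment`, `TextTranslator` and `translate_html` are made up for the example.

```python
import re
from html.parser import HTMLParser


def translate_segment(segment, glossary):
    """Translate one text segment token by token.

    Assumption: a regex split stands in for a real tokenizer here.
    """
    tokens = re.findall(r"\w+|\W+", segment)
    return "".join(glossary.get(tok.lower(), tok) for tok in tokens)


class TextTranslator(HTMLParser):
    """Re-emit the HTML, translating text nodes outside script/style."""

    def __init__(self, glossary):
        super().__init__(convert_charrefs=False)
        self.glossary = glossary
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            self.out.append(data)  # never translate code or CSS
        else:
            self.out.append(translate_segment(data, self.glossary))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def translate_html(html, glossary):
    parser = TextTranslator(glossary)
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


page = "<p>Open the <b>bookmark</b></p><script>var bookmark = 1;</script>"
translated = translate_html(page, {"bookmark": "marcador"})
```

Only the visible text node is changed; the identifier inside the script element is left alone.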

All hyperlinks on the website have to be changed so that, when clicked, they load within the iframe and go through our proxy. Many sites, including the Mozilla Support sites, send the X-Frame-Options: DENY header, meaning that the website cannot be browsed within an iframe. To get around this issue, we load each link through our translation engine: it fetches the raw HTML, translates the DOM and sends it to the client for the iframe to load.
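A rough sketch of the link rewriting step, assuming a hypothetical proxy endpoint at /translate that takes the target address in a url query parameter. Neither the endpoint nor the function names come from the project. The regular expression only handles double-quoted href attributes on anchor tags, so a real implementation would more likely use an HTML parser.

```python
import re
from urllib.parse import urlencode, urljoin

PROXY_ENDPOINT = "/translate"  # hypothetical path of the translation proxy

# Matches <a ... href="..."> with a double-quoted value only.
A_HREF = re.compile(r'(<a\b[^>]*?\bhref=")([^"]*)(")', re.IGNORECASE)


def proxied_url(href, page_url):
    """Resolve `href` against the current page and wrap it in a proxy URL."""
    absolute = urljoin(page_url, href)
    return PROXY_ENDPOINT + "?" + urlencode({"url": absolute})


def rewrite_links(html, page_url):
    """Point every anchor at the proxy so clicks stay inside the iframe."""

    def substitute(match):
        href = match.group(2)
        # In-page anchors and non-HTTP schemes are left untouched.
        if href.startswith(("#", "javascript:", "mailto:")):
            return match.group(0)
        return match.group(1) + proxied_url(href, page_url) + match.group(3)

    return A_HREF.sub(substitute, html)


rewritten = rewrite_links(
    '<a href="/kb/help">Help</a> <a href="#top">Top</a>',
    "https://support.mozilla.org/en-US/",
)
```

Because the proxy serves the translated page itself, the framing restriction set by the original site's response does not apply to what the iframe loads, provided the proxy does not forward that header.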