Week 11

Continuing from last week's small improvements for handling singular and plural words, we kept looking for ways to improve the translation engine. I pointed out that certain words were still paired incorrectly because of how they are aligned in the Transvision file.

<tu tuid="mail/chrome/messenger-newsblog/feed-subscriptions.dtd:subscriptionDesc.label" srclang="en-US">
    <tuv xml:lang="en-US"><seg>Note: Removing or changing the folder for a feed will not affect previously downloaded articles.</seg></tuv>
    <tuv xml:lang="es-ES"><seg>Nota: eliminar o cambiar la carpeta de un canal no afectará a los artículos descargados previamente.</seg></tuv>
</tu>

articles,caducados,0.665205148335

In the example above, the words "articles" and "artículos" should be paired, since they are the correct translation of each other. However, the two words do not sit at the same position in their sentences, so the algorithm returns the wrong pair. My mentor suggested that I develop an algorithm that aligns the words based on the output of a POS (part-of-speech) tagger.

I developed a function that takes both sentences and runs them through a POS tagger. For each source word, it looks at the word's tag and matches it to a target word with the same tag, repeating this while iterating through the sentence. The resulting pairs are passed back and appended to the corpus. I appended them instead of replacing the existing entries because this should increase accuracy without reducing the number of translated pairs.
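
The matching step can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function name `align_by_pos`, the simplified sentence, and the hand-written tags are all assumptions, since the post does not say which tagger or tag set was used. In practice the tagged lists would come from a real POS tagger for each language.

```python
from typing import List, Tuple

# A tagged sentence is a list of (word, tag) pairs.
Tagged = List[Tuple[str, str]]


def align_by_pos(source: Tagged, target: Tagged) -> List[Tuple[str, str]]:
    """Pair each source word with the first unused target word sharing its tag.

    Hypothetical helper: a greedy left-to-right match, assumed here for
    illustration only.
    """
    used = set()  # indices of target words that are already paired
    pairs = []
    for src_word, src_tag in source:
        for i, (tgt_word, tgt_tag) in enumerate(target):
            if i not in used and tgt_tag == src_tag:
                used.add(i)
                pairs.append((src_word, tgt_word))
                break  # move on to the next source word
    return pairs


# Hand-tagged, simplified fragments of the example sentence (assumed tags).
english = [
    ("the", "DET"), ("folder", "NOUN"), ("will", "AUX"), ("not", "ADV"),
    ("affect", "VERB"), ("downloaded", "ADJ"), ("articles", "NOUN"),
]
spanish = [
    ("la", "DET"), ("carpeta", "NOUN"), ("no", "ADV"), ("afectará", "VERB"),
    ("a", "ADP"), ("los", "DET"), ("artículos", "NOUN"), ("descargados", "ADJ"),
]

pairs = align_by_pos(english, spanish)
for src, tgt in pairs:
    print(f"{src},{tgt}")
```

In this sketch a source word with no same-tagged counterpart (here "will") is simply skipped, and the adjective and noun are paired correctly even though their order differs between the two languages. The pairs it returns would then be appended to the corpus alongside the existing entries.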

After this change, I noticed that the previously mis-detected pairs were fixed.

articles,artículos,0.752740413515