Week 10

I imported the CSV of the extracted bilingual terms into the database before showing my mentor the translation results. On the surface, it did seem that we were translating more words than before. It also meant that the early goal of using the Transvision TMX files to create a corpus of terminology pairs was nearly complete.
My mentor had pointed out that target words needed to match the grammatical number (singular or plural) of their source. For example, if a source word in English is plural but the stored translation for the term is only in the singular, we need to detect this and convert the target word to its plural before replacement.

I had received a couple of links to research:
http://cldr.unicode.org/index/cldr-spec/plural-rules
https://developer.mozilla.org/en-US/docs/Mozilla/Localization/Localization_and_Plurals#Usage

While I was researching how to use these plural rules in the existing Python term extraction script, I came across the Pattern library (http://www.clips.ua.ac.be/pages/pattern). It serves a similar function to NLTK but supports more features and other languages. The library is written in Python, and it also contains methods to convert words between their singular and plural forms.

Using parsetree, I analysed the part-of-speech tree of a segment, looking for tags (http://www.clips.ua.ac.be/pages/mbsp-tags) that indicated whether a word was singular or plural. A tag of NN (noun, singular or mass) or NNP (proper noun, singular) indicated that the source word was singular, and in this case we would convert the word to its singular in the target language. The reverse applied for plurals (NNS and NNPS): the target word would be converted to its plural.
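
To make the tag check concrete, here is a minimal sketch of the decision in plain Python. It is not the project's actual code: the function names (noun_number, match_number) and the two toy converters are made up for illustration, and in the real script the tags would come from the parser and the conversion from the library's singular/plural methods for the target language.

```python
# Penn Treebank noun tags grouped by grammatical number.
SINGULAR_TAGS = {"NN", "NNP"}
PLURAL_TAGS = {"NNS", "NNPS"}


def noun_number(tag):
    """Return 'singular', 'plural', or None for tags that are not nouns."""
    if tag in SINGULAR_TAGS:
        return "singular"
    if tag in PLURAL_TAGS:
        return "plural"
    return None


def match_number(source_tag, target_word, singularize, pluralize):
    """Convert target_word to the grammatical number of the source noun.

    singularize and pluralize are callables for the target language,
    passed in by the caller (hypothetical stand-ins are defined below).
    """
    number = noun_number(source_tag)
    if number == "plural":
        return pluralize(target_word)
    if number == "singular":
        return singularize(target_word)
    return target_word  # not a noun: leave the translation untouched


# Toy converters for illustration only; real inflection is irregular
# and language specific, so these are assumptions, not a real API.
def toy_pluralize(word):
    return word if word.endswith("s") else word + "s"


def toy_singularize(word):
    return word[:-1] if word.endswith("s") else word


# Source word tagged NNS (plural), stored translation is singular.
fixed = match_number("NNS", "marque-page", toy_singularize, toy_pluralize)
print(fixed)
```

Passing the converters in as arguments keeps the tag logic independent of whichever inflection library ends up being used for a given target language.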