Week 3

We were still on the hunt for a good terminology extraction tool. My mentor had sent me a list of resources to look into:

https://code.google.com/p/maui-indexer/
https://pypi.python.org/pypi/topia.termextract/
http://okapi.sourceforge.net/Release/Utilities/Help/termextraction.htm
http://ngram.sourceforge.net/
http://texlexan.sourceforge.net/

The most promising tool I came across for our purpose was the Okapi term extraction utility. With it I was able to extract terms ranked by frequency. However, it produced two separate files, one for the source language and one for the target language. The problem was that the extracted terms were not aligned: nothing in the output linked a source term to its translation, as the top entries from each file show.

EN:
430 message
360 You
349 file
332 server
305 page
304 messages
291 brandShortName
277 want

ES:
614 de
347 en
290 mensaje
289 que
234 página
232 web
220 para
201 conexión
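
To illustrate why the two lists cannot simply be paired up by rank, here is a minimal Python sketch of frequency-based term counting. It is not Okapi's implementation, just an approximation of the idea; the function name `term_frequencies` and the sample sentences are made up for illustration.

```python
import re
from collections import Counter

def term_frequencies(text, min_length=2):
    """Return (word, count) pairs, most frequent first.

    Hypothetical helper: a rough stand-in for frequency-based
    term extraction, not the algorithm Okapi actually uses.
    """
    # Match runs of letters only (Unicode-aware, so "página" stays whole).
    words = re.findall(r"[^\W\d_]+", text)
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common()

# Made-up sample text for each language.
en = "The message was sent. Open the message on the server."
es = "El mensaje fue enviado. Abra el mensaje en el servidor."

en_terms = term_frequencies(en)
es_terms = term_frequencies(es)

# Each language is counted independently, so the ranks are unrelated.
for rank, (en_item, es_item) in enumerate(zip(en_terms, es_terms), start=1):
    print(rank, en_item, es_item)
```

Because each file is counted on its own, a word's rank depends only on how often it occurs in that language. Function words such as "de" and "en" dominate the Spanish list, so the same rank in both lists rarely holds a term and its translation.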

I was unable to produce an aligned terminology list from these two files. I discussed the issues I was facing with my mentor.

My mentor mentioned that he had a CSV file of already extracted terminology (https://www.transifex.com/projects/p/gaia-l10n/glossary/l/es/) that we could use as a starting point, allowing us to skip the extraction step for now. We decided to continue the project with the CSV while I kept researching methods of terminology extraction.
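
As a rough sketch of how such a glossary CSV could be turned into an aligned source-to-target mapping, the Python below reads it with the standard `csv` module. The column names `term` and `translation_es` are assumptions (I have not checked them against the actual export), and the sample rows are made up for illustration.

```python
import csv
import io

def load_glossary(csv_file, source_col="term", target_col="translation_es"):
    """Build a {source term: translation} dict from a glossary CSV.

    The default column names are assumptions; adjust them to match
    the header row of the real file.
    """
    glossary = {}
    for row in csv.DictReader(csv_file):
        source = (row.get(source_col) or "").strip()
        target = (row.get(target_col) or "").strip()
        # Skip rows that lack either side of the pair.
        if source and target:
            glossary[source] = target
    return glossary

# Made-up sample rows standing in for the real glossary file.
sample = io.StringIO(
    "term,translation_es\n"
    "message,mensaje\n"
    "page,página\n"
    "server,\n"
)

glossary = load_glossary(sample)
print(glossary)
```

For the real file, the `io.StringIO` sample would be replaced with something like `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.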