Week 3

We were still on the hunt for a good terminology extraction tool. My mentor had sent me a list of resources to look into.


The most promising tool I came across for our purpose was the Okapi term extraction tool. With it I was able to extract terms ordered by frequency. However, it produced two separate files, one for the source language and one for the target language, and the extracted terms were not aligned: nothing in the output links a source term to its translation. The top entries of the two files (English first, then Spanish) looked like this:

430 message
360 You
349 file
332 server
305 page
304 messages
291 brandShortName
277 want

614 de
347 en
290 mensaje
289 que
234 página
232 web
220 para
201 conexión
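
To illustrate why two frequency-ranked lists like these cannot simply be matched up line by line, here is a minimal Python sketch of frequency-based extraction. It is not how Okapi works internally; the function name `term_frequencies` and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter

def term_frequencies(text, top_n=8):
    """Return the top_n most frequent word tokens as (count, token) pairs."""
    # \w+ matches Unicode letters, so accented words like "página" stay whole.
    tokens = re.findall(r"\w+", text)
    return [(count, token) for token, count in Counter(tokens).most_common(top_n)]

# Hypothetical sample sentences, not taken from the project's files.
source = "The page could not load. Reload the page or check the server."
target = "No se pudo cargar la página. Recargue la página o compruebe el servidor."

for label, text in (("source", source), ("target", target)):
    print(label)
    for count, token in term_frequencies(text, top_n=3):
        print(count, token)
```

Each list is ranked on its own counts, so the position of a word in one list says nothing about the position of its translation in the other. Function words such as "de" or "que" also dominate the ranking unless they are filtered out.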

I was unable to get the tool to produce aligned terminology, so I discussed the issues I was facing with my mentor.

My mentor mentioned that he had a CSV file of extracted terminology (https://www.transifex.com/projects/p/gaia-l10n/glossary/l/es/) that we could use as a starting point, allowing us to skip the extraction step for now. We decided to continue the project with the CSV while I kept researching terminology extraction methods.
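
As a rough idea of how such a CSV could be loaded as an already aligned glossary, here is a small Python sketch. The column names `term` and `translation` are assumptions, since I have not described the actual layout of the export here, and the sample rows are illustrative only.

```python
import csv
import io

# Illustrative rows only; not taken from the actual glossary export.
SAMPLE_CSV = """term,translation
message,mensaje
page,página
"""

def load_glossary(handle, source_col="term", target_col="translation"):
    """Map each source term to its translation, skipping incomplete rows."""
    glossary = {}
    for row in csv.DictReader(handle):
        # The column names are assumptions; adjust them to the real header.
        source = (row.get(source_col) or "").strip()
        target = (row.get(target_col) or "").strip()
        if source and target:
            glossary[source] = target
    return glossary

glossary = load_glossary(io.StringIO(SAMPLE_CSV))
print(glossary["page"])
```

For a real file, the `io.StringIO` handle would be replaced by `open(path, newline="", encoding="utf-8")`, with the column arguments set to whatever the header row actually contains.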