Over the past few weeks I experimented with various Python libraries for parsing HTML and XML, such as BeautifulSoup and lxml. Using a parser worked well overall, but it missed some elements. I found that using regular expressions was more reliable for replacing text inside elements, as the approach is more generic.
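As a rough illustration of the regex approach (a hypothetical sketch, not the script's actual pattern), the text content between tags can be matched and transformed while the markup itself is left intact:

```python
import re

def replace_element_text(markup, replacer):
    """Apply `replacer` to the text between tags, leaving the tags intact.

    A hypothetical sketch: the pattern matches any run of characters
    that sits between a '>' and a '<', i.e. an element's text content.
    """
    return re.sub(r'>([^<>]+)<',
                  lambda m: '>' + replacer(m.group(1)) + '<',
                  markup)

snippet = '<p>Hello <b>world</b></p>'
print(replace_element_text(snippet, str.upper))  # <p>HELLO <b>WORLD</b></p>
```

The upside of this approach is that it does not depend on the parser recognising every element; the downside is the usual fragility of regexes applied to markup.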
This week I backtracked to one of the project’s earlier objectives and focused on extracting terminology from the TMX files in Transvision. After researching NLTK and reading the Natural Language Processing with Python book (http://www.nltk.org/book/), I decided to tackle this problem again.
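Pulling sentence pairs out of a TMX file can be done with Python's standard library alone. This is a sketch assuming the standard TMX layout, where each `<tu>` element holds one `<tuv>` per language and each `<tuv>` wraps its text in a `<seg>` child:

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml:lang attribute to this qualified name.
XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

def parse_tmx(tmx_string, src_lang, trg_lang):
    """Return (source, target) sentence pairs from a TMX document."""
    root = ET.fromstring(tmx_string)
    pairs = []
    for tu in root.iter('tu'):
        # Map each translation unit variant's language to its segment text.
        segs = {tuv.get(XML_LANG): tuv.findtext('seg') for tuv in tu.iter('tuv')}
        if src_lang in segs and trg_lang in segs:
            pairs.append((segs[src_lang], segs[trg_lang]))
    return pairs

sample = """<tmx version="1.4"><body>
  <tu>
    <tuv xml:lang="en-US"><seg>New Tab</seg></tuv>
    <tuv xml:lang="fr"><seg>Nouvel onglet</seg></tuv>
  </tu>
</body></tmx>"""
print(parse_tmx(sample, 'en-US', 'fr'))  # [('New Tab', 'Nouvel onglet')]
```

The language codes and sample strings above are illustrative, not taken from the Transvision data.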
I started working on a Python script to do bilingual term extraction with the help of NLTK. NLTK has a module called align, which uses statistical methods to predict translated pairs. NLTK provides a couple of algorithms for term extraction from aligned sentences (https://github.com/nltk/nltk/wiki/Machine-Translation); I decided to use the IBMModel for now. The script parses the TMX file and builds a list of tuples of aligned pairs of translated sentences. While iterating through these pairs, each one can be cleaned up using NLTK before being added to a corpus of AlignedSent objects (http://www.nltk.org/howto/align.html). This corpus of aligned sentences is then passed to the IBMModel, which trains on it and works out which words most often occur aligned with their translated counterparts. Each match is given a rank indicating how confident the model is that the pair is correct.
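To show what the training step is doing under the hood, here is a minimal pure-Python sketch of IBM Model 1's EM estimation, the simplest of the algorithms behind NLTK's IBMModel (this is not the NLTK code itself, and it omits the NULL word for brevity):

```python
from collections import defaultdict

def train_ibm_model1(bitext, iterations=10):
    """Estimate translation probabilities t(target_word | source_word).

    bitext is a list of (source_tokens, target_tokens) pairs, analogous
    to a corpus of aligned sentences.
    """
    src_vocab = {w for src, _ in bitext for w in src}
    # Start from uniform translation probabilities.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # expected counts per source word
        # E-step: distribute each target word's probability mass
        # over the source words it could align to.
        for src, trg in bitext:
            for tw in trg:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: re-estimate the translation probabilities.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t

# Toy bitext; with more data, words that consistently co-occur with
# their translations end up with high probabilities.
bitext = [
    (['the', 'house'], ['la', 'maison']),
    (['the', 'book'], ['le', 'livre']),
    (['a', 'book'], ['un', 'livre']),
]
t = train_ibm_model1(bitext)
print(t[('livre', 'book')])  # close to 1: 'livre' aligns with 'book'
```

In the actual script this corresponds to handing the corpus of AlignedSent objects to NLTK's IBMModel, which computes these statistics (and more) internally.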
I continued optimising the script: lowercasing all the tokens, eliminating tokens shorter than two characters, eliminating words that are known stop words in each respective language, and keeping only tokens made up entirely of letters.
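The cleanup steps above can be sketched as a single filter function (the stop-word set here is a stand-in; the real script would use per-language lists such as NLTK's stopwords corpus):

```python
def clean_tokens(tokens, stopwords):
    """Lowercase every token, then keep only purely alphabetic tokens
    of at least two characters that are not stop words."""
    lowered = [tok.lower() for tok in tokens]
    return [tok for tok in lowered
            if len(tok) >= 2 and tok.isalpha() and tok not in stopwords]

stopwords_en = {'the', 'and', 'for'}  # illustrative; not a full stop-word list
print(clean_tokens(['The', 'New', 'Tab', 'is', '42', '&amp;'], stopwords_en))
# ['new', 'tab', 'is']
```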
Throughout these changes, I saved the output of the model to CSV files to track the script’s improvements.
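Saving the ranked pairs is straightforward with the standard csv module. The column layout and file name here are assumptions for illustration; the post does not describe the exact format used:

```python
import csv

def write_ranked_pairs(path, ranked_pairs):
    """Write (source_term, target_term, score) rows to a CSV file so
    that the output of successive runs can be compared."""
    with open(path, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow(['source', 'target', 'score'])
        writer.writerows(ranked_pairs)

# Hypothetical output row from the model.
write_ranked_pairs('terms.csv', [('bookmark', 'marque-page', 0.93)])
```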