Preprocessing of the corpora
The first step is to retrieve the corpora. We've already seen how to do this, but let's now formalize it in a function. To make it generic enough, let's enclose these functions in a file named corpora_tools.py.
- Let's do some imports that we will use later on:
import pickle
import re
from collections import Counter
from nltk.corpus import comtrans
- Now, let's create the function to retrieve the corpora:
def retrieve_corpora(translated_sentences_l1_l2='alignment-de-en.txt'):
    print("Retrieving corpora: {}".format(translated_sentences_l1_l2))
    als = comtrans.aligned_sents(translated_sentences_l1_l2)
    sentences_l1 = [sent.words for sent in als]
    sentences_l2 = [sent.mots for sent in als]
    return sentences_l1, sentences_l2
This function takes one argument: the file containing the aligned sentences from the NLTK Comtrans corpus. It returns two lists of sentences (each sentence is itself a list of tokens), one for the source language (in our case, German), the other...
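To get a feel for the shape of the two returned lists without downloading the corpus, here is a minimal sketch. It assumes only that each aligned sentence exposes a words attribute and a mots attribute, as used in the function above. FakeAlignedSent, split_aligned, and the sample sentences are hypothetical names and data made up for illustration; they are not part of NLTK.

```python
from collections import namedtuple

# Hypothetical stand-in for the objects returned by comtrans.aligned_sents();
# only the two attributes used by retrieve_corpora are modeled here.
FakeAlignedSent = namedtuple("FakeAlignedSent", ["words", "mots"])


def split_aligned(aligned_sents):
    """Split aligned sentence pairs into two parallel lists of token lists."""
    sentences_l1 = [sent.words for sent in aligned_sents]
    sentences_l2 = [sent.mots for sent in aligned_sents]
    return sentences_l1, sentences_l2


# Made-up sample data, not taken from the Comtrans corpus.
sample = [
    FakeAlignedSent(["Guten", "Morgen"], ["Good", "morning"]),
    FakeAlignedSent(["Danke"], ["Thank", "you"]),
]

l1, l2 = split_aligned(sample)
print(l1[0], "->", l2[0])
```

The two lists stay index-aligned: the sentence at position i in the first list is the translation of the sentence at position i in the second.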