Multilingual Corpus Pipeline


This tutorial details how to set up the tools needed to build a multilingual text processing pipeline.

We first show how to set up Singularity containers for SyntaxNet and TreeTagger so that they can be run on servers, even without root access.

Install singularity and debootstrap as described here.

Syntaxnet Container

Download the syntaxnet.def file and build a Singularity container for SyntaxNet using the following commands. Note that you will need 20 GB of free space on your machine.

singularity create --size 20000 syntaxnet.img

sudo singularity bootstrap syntaxnet.img syntaxnet.def

This installs syntaxnet inside the container in the /opt directory and also downloads parameter files for some languages.
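Because the image needs about 20 GB, it can help to check the available disk space before creating it. The following is a minimal sketch, not part of Singularity itself; has_free_space is a hypothetical helper name, and the example at the end assumes the image is created in the current directory.

```shell
# Sketch: succeed only if the filesystem holding DIR has at least NEEDED_MB free.
# has_free_space is a hypothetical helper name, not part of Singularity.
has_free_space() {
    dir=$1
    needed_mb=$2
    # df -P prints one POSIX-format line per filesystem; -k reports 1024-byte blocks.
    # Column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((needed_mb * 1024)) ]
}

# Example: only create the image when about 20 GB is free in the current directory.
# if has_free_space . 20000; then
#     singularity create --size 20000 syntaxnet.img
# fi
```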

You can enter the container using:

sudo singularity shell -w --cleanenv syntaxnet.img   # for write access
singularity shell --cleanenv syntaxnet.img   # for read access and testing without elevated user rights

You should now be able to run the different SyntaxNet models after unzipping them. To unzip a model, go to the directory where it was downloaded (/opt/models/syntaxnet/syntaxnet/models/other_language_models in our case) and run unzip <filename>.
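When several language models were downloaded, unzipping them one by one gets tedious. The loop below is a sketch under the assumption that the models are .zip archives sitting directly in the model directory; unzip_models is a hypothetical helper name.

```shell
# Sketch: unzip every .zip archive found directly inside a model directory.
# unzip_models is a hypothetical helper; the .zip extension is an assumption.
unzip_models() {
    model_root=$1
    for archive in "$model_root"/*.zip; do
        # If nothing matches, the pattern stays literal; skip it.
        [ -e "$archive" ] || continue
        # -o overwrites existing files without prompting, -d sets the target directory.
        unzip -o -d "$model_root" "$archive" || return 1
    done
}

# Example, run inside the container opened with write access:
# unzip_models /opt/models/syntaxnet/syntaxnet/models/other_language_models
```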

You should tokenize your text before passing it to the parser. Tokenization separates punctuation marks from words, which improves parser accuracy. SyntaxNet provides tokenizers for some languages; these can be found on the website. If a tokenizer is available for your language, run it as follows:

cd /opt/models/syntaxnet


cat sentences.txt | syntaxnet/models/parsey_universal/tokenize.sh $MODEL_DIRECTORY > output.txt

Here, sentences.txt is the input file and the output of the tokenizer will be in output.txt. MODEL_DIRECTORY in our case was: /opt/models/syntaxnet/syntaxnet/models/other_language_models/<insert-language-name>

An already tokenized file can be parsed as follows:

cd /opt/models/syntaxnet


cat sentences.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.txt

You should now have the parsed output from syntaxnet in output.txt.
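The two steps can also be chained so that no intermediate file is needed. This is a sketch that assumes the tokenizer's output is in the form parse.sh expects and that the language has both a tokenizer and a parser model; model_dir_for and tokenize_and_parse are hypothetical helper names, and MODELS_ROOT is simply the model path used in this tutorial.

```shell
# Directory holding the unzipped language models, as used in this tutorial.
MODELS_ROOT=/opt/models/syntaxnet/syntaxnet/models/other_language_models

# Print the model directory for one language (hypothetical helper).
model_dir_for() {
    printf '%s/%s\n' "$MODELS_ROOT" "$1"
}

# Tokenize, then parse, raw sentences read from stdin; the parse goes to stdout.
# Must be run from /opt/models/syntaxnet inside the container.
tokenize_and_parse() {
    model_dir=$(model_dir_for "$1")
    syntaxnet/models/parsey_universal/tokenize.sh "$model_dir" |
        syntaxnet/models/parsey_universal/parse.sh "$model_dir"
}

# Example:
# cd /opt/models/syntaxnet
# tokenize_and_parse German < sentences.txt > output.txt
```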

Treetagger Container

Here we describe how to build a Singularity container for TreeTagger. TreeTagger is a part-of-speech tagger that also provides lemmatization (reducing words to their root forms), which SyntaxNet does not. The process is similar to the one used for SyntaxNet.

Download the treetagger.def file and build the container as follows:

singularity create --size 5000 treetagger.img

sudo singularity bootstrap treetagger.img treetagger.def

Enter the container using:

singularity shell -w --cleanenv treetagger.img

The treetagger.def file already contains scripts to download parameter files for a few languages. Parameter files for other languages can be downloaded using the corresponding link from the website. Note that you have to run install-tagger.sh after downloading new parameter files to be able to use them.

To run the tagger, go to the directory where TreeTagger was installed (/opt in our case) and run:

cat input.txt | cmd/tree-tagger-<insert language name>

This outputs the tagged text with the corresponding lemmas.

The tree-tagger-<language> files contain the commands that take the input text, tokenize it and tag it. The output sometimes contains “<unknown>” in the lemma column for words the tagger doesn’t recognize. To output the word itself in the lemma column instead, add the “-no-unknown” option to OPTIONS in the corresponding tree-tagger file in the cmd directory:

OPTIONS="-no-unknown -token -lemma -sgml -pt-with-lemma"
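If editing the wrapper script is not convenient, a similar result can be obtained by post-processing the output. The filter below is a sketch that assumes the usual tab-separated output with three columns (token, POS tag, lemma); fill_unknown_lemmas is a hypothetical helper name. Lines with a different number of columns, such as SGML tags, pass through unchanged.

```shell
# Sketch: replace "<unknown>" in the lemma column with the token itself.
# Assumes tab-separated TreeTagger output: token, POS tag, lemma.
fill_unknown_lemmas() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF == 3 && $3 == "<unknown>" { $3 = $1 }
         { print }'
}

# Example: cat input.txt | cmd/tree-tagger-<insert language name> | fill_unknown_lemmas
```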

Note for Portuguese: While running the script on Portuguese text, we noticed that both tree-tagger-portuguese and tree-tagger-portuguese-finegrained stop at the first special character, producing neither further output nor an error. The script contains a 'grep' for removing blank lines, which was somehow discarding the text after a special character. We found that this can be avoided by commenting out line 23 of the file:

#grep -v '^$' |
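Since the line number may differ between TreeTagger versions, the same change can be made by matching the line's content instead. The following is a sketch only; comment_out_blank_line_filter is a hypothetical helper name, and it assumes the filter appears on its own line starting with grep -v '^$'.

```shell
# Sketch: comment out the blank-line filter (grep -v '^$' |) in a TreeTagger wrapper script.
# comment_out_blank_line_filter is a hypothetical helper name.
comment_out_blank_line_filter() {
    script=$1
    tmp=$(mktemp) || return 1
    # \047 is a single quote; index(...) == 1 means the trimmed line starts with the grep call.
    awk '{
        line = $0
        sub(/^[ \t]+/, "", line)
        if (index(line, "grep -v \047^$\047") == 1) print "#" $0
        else print
    }' "$script" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Copy the contents back so the script keeps its permissions.
    cat "$tmp" > "$script"
    rm -f "$tmp"
}

# Example, run from the TreeTagger install directory with write access:
# comment_out_blank_line_filter cmd/tree-tagger-portuguese
```

Running the helper a second time leaves the file unchanged, because the line then starts with # and no longer matches.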

Pipelines for different languages

The aim of this pipeline is to take one of the files in the NewsScape dataset as input and output an XML-style file with sentence splits, plus lemmas, POS tags and dependency information for each word. The pipeline can be summarized in five major steps:
  • Extracting useful text from the input file - using a custom python script
  • Sentence splitting - using Pragmatic Segmenter
  • Tokenization - using Syntaxnet for supported languages (German, Portuguese, Polish, Swedish) and Treetagger for some others (Russian)
  • POS Tagging and Dependency Parsing - using Syntaxnet
  • Lemmatizing - using CST's Lemmatizer for supported languages (German, Portuguese, Polish, Russian) and Treetagger for some others (Swedish)
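The per-language tool choices in the list above could be captured in a small lookup that a driver script consults. This is only a sketch: the function names and the returned labels (syntaxnet, treetagger, cst) are hypothetical, and only the five languages named above are covered.

```shell
# Sketch: map a language to the tokenizer and lemmatizer named in the list above.
# Function names and the returned labels are hypothetical.
tokenizer_for() {
    case $1 in
        German|Portuguese|Polish|Swedish) echo syntaxnet ;;
        Russian) echo treetagger ;;
        *) echo "unsupported language: $1" >&2; return 1 ;;
    esac
}

lemmatizer_for() {
    case $1 in
        German|Portuguese|Polish|Russian) echo cst ;;
        Swedish) echo treetagger ;;
        *) echo "unsupported language: $1" >&2; return 1 ;;
    esac
}

# Example: tokenizer_for Russian
```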
The following figure shows the sequence of operations in the pipeline from input to output.

Prannoy Mupparaju,
Jul 27, 2017, 1:56 AM