Introduction

This tutorial details how to set up the tools needed to build a multilingual text processing pipeline. We first show how to set up Singularity containers for SyntaxNet and TreeTagger, so that they can be run on servers even without root access. Install Singularity and debootstrap as described here.

Syntaxnet Container

Download the syntaxnet.def file and build the container as follows:
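The exact build commands depend on your Singularity version; a minimal sketch for the 2.x series (an assumption based on the .def/.img workflow used throughout this tutorial; the image size is also an assumption) looks like:

sudo singularity create --size 8192 syntaxnet.img
sudo singularity bootstrap syntaxnet.img syntaxnet.def

On Singularity 3.x the single command sudo singularity build syntaxnet.img syntaxnet.def achieves the same.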
This installs SyntaxNet inside the container in the /opt directory and also downloads parameter files for some languages. You can enter the container using:

sudo singularity shell -w --cleanenv syntaxnet.img # -w for write access
You should now be able to run the different SyntaxNet models after unzipping them. To unzip the files, go to the directory where they were downloaded (/opt/models/syntaxnet/syntaxnet/models/other_language_models in our case) and run unzip on each of the downloaded archives.

You should tokenize your text before passing it to the parser. This separates the punctuation marks from the words, thereby increasing the accuracy of the parsers. SyntaxNet provides tokenizers for some languages; these can be found on the website. If a tokenizer is available for your language, tokenization can be done as follows:
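A minimal sketch of the tokenizer invocation, assuming the standard Parsey Universal helper scripts that ship with SyntaxNet, run from the model repository root; the MODEL_DIRECTORY value is an assumption based on where the models were unzipped above:

cd /opt/models/syntaxnet
# point MODEL_DIRECTORY at the unzipped model for your language (path is an assumption)
MODEL_DIRECTORY=/opt/models/syntaxnet/syntaxnet/models/other_language_models/<Language>
cat sentences.txt | syntaxnet/models/parsey_universal/tokenize.sh $MODEL_DIRECTORY > output.txt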
Here, sentences.txt is the input file and the output of the tokenizer will be in output.txt. An already tokenized file can be parsed as follows:
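A sketch of the parse step under the same assumptions; parse.sh is the standard Parsey Universal driver, and the input filename tokenized.txt is a placeholder for the tokenizer output from the previous step:

cd /opt/models/syntaxnet
cat tokenized.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > output.txt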
You should now have the parsed output from SyntaxNet in output.txt.

Treetagger Container

Here we describe how to make a Singularity container for TreeTagger. TreeTagger is another parser and, unlike SyntaxNet, also provides lemmatization (reducing words to their root forms). The process is similar to what we did with SyntaxNet. Download the treetagger.def file and build the container as follows:
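As with SyntaxNet, the build commands depend on your Singularity version; a sketch for the 2.x series (the image size is an assumption):

sudo singularity create --size 4096 treetagger.img
sudo singularity bootstrap treetagger.img treetagger.def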
Enter the container using:

singularity shell -w --cleanenv treetagger.img

To run the parser, go to the directory where TreeTagger was installed (/opt in our case) and run:

cat input.txt | cmd/tree-tagger-<insert language name>

This will give the parsed text with the corresponding lemmas as output. The output format is controlled by the OPTIONS variable inside each cmd/tree-tagger-<language> script; we used:

OPTIONS="-no-unknown -token -lemma -sgml -pt-with-lemma"

Note for Portuguese: while running the script on Portuguese, we had to comment out the grep -v '^$' | line in the script, which otherwise strips the empty lines from the output.
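As a concrete, hypothetical example for French (TreeTagger installs one wrapper script per language under cmd/; the language and filenames here are assumptions):

cd /opt
cat input.txt | cmd/tree-tagger-french > output.txt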
Pipelines for different languages

The aim of this pipeline is to take as input one of the files in the NewsScape dataset and output an XML-style file with sentence splits, lemmas, POS tags, and dependency information for each word. The pipeline can be summarized in five major steps.
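As a rough, hypothetical sketch of how the stages described above could be chained, assuming the SyntaxNet and TreeTagger setups from the earlier sections; all filenames, the language choice, and the final merge step are assumptions rather than the actual pipeline scripts:

# hypothetical end-to-end driver
cd /opt/models/syntaxnet
MODEL_DIRECTORY=/opt/models/syntaxnet/syntaxnet/models/other_language_models/<Language>
cat newsscape_input.txt | syntaxnet/models/parsey_universal/tokenize.sh $MODEL_DIRECTORY > tokenized.txt
cat tokenized.txt | syntaxnet/models/parsey_universal/parse.sh $MODEL_DIRECTORY > parsed.conll
cat tokenized.txt | /opt/cmd/tree-tagger-<language> > lemmas.txt # run inside the treetagger container
# merging parsed.conll and lemmas.txt into the XML-style output is left to a custom script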
The following figure shows the sequence of operations in the pipeline from input to output.