TreeTagger - a part-of-speech tagger for many languages

The TreeTagger is a tool for annotating text with part-of-speech and lemma information. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart. The TreeTagger has been successfully used to tag German, English, French, Italian, Danish, Swedish, Norwegian, Dutch, Spanish, Bulgarian, Russian, Portuguese, Belarusian, Ukrainian, Galician, Greek, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish, Persian, Romanian, Czech, Albanian, Coptic and Old French texts. It can be adapted to other languages if a lexicon and a manually tagged training corpus are available.

Sample output:

word	pos	lemma
The	DT	the
TreeTagger	NP	TreeTagger
is	VBZ	be
easy	JJ	easy
to	TO	to
use	VB	use
.	SENT	.
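Because each output line holds one token together with its tag and lemma, the result can be post-processed with standard text tools. The following is a minimal sketch and not part of the TreeTagger distribution: the function name lemmas_of is hypothetical, and the sketch assumes the tab-separated three-column layout shown above, without a header row.

```shell
#!/bin/sh
# Hypothetical helper (not part of TreeTagger): reads tagger output
# (word<TAB>pos<TAB>lemma, one token per line) on standard input and
# prints the lemma of every token whose tag starts with the given prefix.
lemmas_of() {
    awk -F '\t' -v prefix="$1" 'index($2, prefix) == 1 { print $3 }'
}

# Rows copied from the sample output above (assumed to have no header row).
printf 'The\tDT\tthe\nTreeTagger\tNP\tTreeTagger\nis\tVBZ\tbe\neasy\tJJ\teasy\nto\tTO\tto\nuse\tVB\tuse\n.\tSENT\t.\n' |
    lemmas_of VB
```

Matching on a tag prefix rather than the full tag selects related tags in one call, e.g. both VB and VBZ for the prefix VB.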

The TreeTagger can also be used as a chunker for English, German, French, and Spanish.

The tagger is described in the following two papers:

Helmut Schmid (1995): Improvements in Part-of-Speech Tagging with an Application to German. Proceedings of the ACL SIGDAT-Workshop. Dublin, Ireland.
Helmut Schmid (1994): Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing, Manchester, UK.

Download

Executables for PC-Linux, Windows, Mac-OS, and ARM, as well as parameter files for various languages, can be downloaded via the links below.

This software is freely available for research, education and evaluation. For commercial and other licenses, please contact the developer via the email address at the bottom of the page.

Please read the license terms before you download the software! By downloading the software, you agree to the terms stated there.

The following steps are necessary to install the TreeTagger (see below for the Windows version). Download the files by right-clicking on the link. Then select "save file as". All files should be stored in the same directory.

Download the tagger package for your system (PC-Linux, Mac OS-X (Intel), Mac OS-X (M1), ARM64, ARMHF, ARM-Android, PPC64le-Linux).
If you have problems with your Linux kernel version, download this older Linux version and rename it to tree-tagger-linux-3.2.5.tar.gz.
Download the tagging scripts into the same directory.
Download the installation script install-tagger.sh.
Download the parameter files for the languages you want to process.
Open a terminal window and run the installation script in the directory where you have downloaded the files:

sh install-tagger.sh

Run a test, e.g.

echo 'Hello world!' | cmd/tree-tagger-english

echo 'Das ist ein Test.' | cmd/tagger-chunker-german
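The same scripts read from standard input, so a whole file can be tagged and the result summarized. The sketch below is an illustration under assumptions: the helper name tag_counts and the file name input.txt are hypothetical, and the helper only assumes that the tag is the second tab-separated column of each output line.

```shell
#!/bin/sh
# Hypothetical helper (not part of TreeTagger): counts how often each
# POS tag (second tab-separated column) occurs in tagger output read
# from standard input, most frequent tag first.
tag_counts() {
    awk -F '\t' 'NF >= 2 { n[$2]++ } END { for (t in n) print n[t] "\t" t }' |
        LC_ALL=C sort -k1,1nr -k2,2
}

# Intended use after installation (input.txt is a hypothetical file):
#   cmd/tree-tagger-english < input.txt | tag_counts

# Stand-alone demonstration on rows in the tagger's output layout.
printf 'The\tDT\tthe\nTreeTagger\tNP\tTreeTagger\nis\tVBZ\tbe\neasy\tJJ\teasy\nto\tTO\tto\nuse\tVB\tuse\n.\tSENT\t.\n' |
    tag_counts
```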

You also might want to have a look at my new part-of-speech tagger RNNTagger.

Make sure that the installation path contains no blanks and that the files are not automatically unzipped, i.e. that the file extension .gz is still present. If you have difficulties with the installation, have a look at the installation hints (kindly provided by Joachim Wagner).
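The two conditions above can be checked before running the installation script. This is a minimal sketch, not part of the TreeTagger distribution: the function name check_install_dir is hypothetical, and it only tests for spaces in the path and for the presence of at least one .gz file.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of TreeTagger).
# Succeeds if the directory path contains no space character and the
# directory holds at least one file whose name still ends in .gz.
check_install_dir() {
    dir=$1
    case $dir in
        *' '*)
            echo "path contains blanks: $dir" >&2
            return 1
            ;;
    esac
    # If no .gz file exists, the pattern stays unexpanded and -e fails.
    for f in "$dir"/*.gz; do
        if [ -e "$f" ]; then
            return 0
        fi
    done
    echo "no .gz files in $dir (were they unzipped automatically?)" >&2
    return 1
}

# Check the current directory, where the downloads are assumed to be.
check_install_dir "$PWD" || echo "resolve the problem above before running install-tagger.sh"
```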

Parameter files

Albanian parameter file (gzip compressed, UTF-8, trained on Albanian POS)
Belarusian parameter file (gzip compressed, UTF-8, trained on the UD Treebank)
Bulgarian parameter file (gzip compressed, UTF-8, tagset documentation, trained on the Bulgarian Treebank)
Catalan parameter file (gzip compressed, UTF8, tagset documentation)
A Chinese parameter file and tokenizer created by Serge Sharoff are available here
A Coptic parameter file created by Amir Zeldes is available here
Czech parameter file (gzip compressed, UTF-8, trained on the Czech Academic Corpus)
Danish parameter file trained on the ePAROLE corpus (gzip compressed, UTF-8, tagset documentation)
Dutch parameter file (gzip compressed, UTF-8, tagset documentation)
Another Dutch parameter file (gzip compressed, UTF8, trained on the Eindhoven corpus, tagset documentation (starts on page 9))
English parameter file (PENN tagset) (gzip compressed, UTF8, tagset documentation, trained on the Penn treebank)
English parameter file (BNC tagset) (gzip compressed, UTF8, tagset documentation, trained on the British National Corpus)
Estonian parameter file (gzip compressed, UTF-8, tagset documentation)
Finnish parameter file trained on the Finnish Treebank (gzip compressed, UTF-8, tagset documentation).
French parameter file (gzip compressed, UTF-8, tagset documentation) trained on data kindly provided by Prof. Achim Stein
Spoken French parameter file (gzip compressed, UTF-8, tagset documentation) trained on the Perceo corpus
A parameter file for spoken French texts can be found here
Old French parameter file (gzip compressed, UTF-8, tagset documentation) trained on the Base de Français Médiéval
Galician parameter file (gzip compressed, UTF8, tagset documentation)
German parameter file (gzip compressed, UTF-8, tagset documentation)
Spoken German parameter file (gzip compressed, Latin-1, tagset documentation, FOLK corpus)
Middle High German parameter file trained by Sarah Schulz on the Middle High German Conceptual Database (gzip compressed, UTF-8, paper (in German))
Greek parameter file trained on the INTERA corpus (gzip compressed, UTF8, tagset documentation)
Ancient Greek parameter file (UTF8 encoding or beta encoding) trained on the PROIEL and Perseus treebanks and kindly provided by Alessandro Vatri and Barbara McGillivray (gzip compressed, no lemmas, tagset documentation)
A Hausa parameter file created by Amir Zeldes is available here
Hungarian parameter file (gzip compressed, UTF8, trained on data annotated with magyarlanc)
The Indonesian parameter file (gzip compressed, UTF8, tagset documentation) has been trained by Prihantoro on the UI corpus using lexical information from the Kateglo dictionary.
Italian parameter file (gzip compressed, UTF8, tagset documentation) trained on data kindly provided by Prof. Achim Stein
Marco Baroni's Italian parameter file (gzip compressed, Latin1, tagset documentation)
Korean parameter file (gzip compressed, UTF8, tagset documentation). This parameter file was created in joint work with Prof. Lee Minhaeng on data kindly provided by the KLPLAB headed by Prof. Ock Cheol-Young.
Latin parameter file (gzip compressed, tagset info in Italian, various resources)
Another Latin parameter file (gzip compressed, tagset info) which has been trained on the Index Thomisticus Treebank which was kindly provided by Marco Passarotti.
Mongolian parameter file (gzip compressed) created from a small Mongolian corpus by Khuder Altangerel.
Norwegian (Bokmaal) parameter file trained on the Norwegian Dependency Treebank (gzip compressed, UTF-8) with tags mapped to the universal dependency tagset
Persian (Farsi) parameter file trained on the Persian Dependency Treebank (gzip compressed, UTF8, tagset description).
Persian (Farsi) parameter file with coarse tagset trained on the Persian Dependency Treebank (gzip compressed, UTF8, tagset description).
Polish parameter file trained on the Polish National Corpus (gzip compressed, UTF8, tagset description).
Portuguese parameter file provided by Pablo Gamallo (gzip compressed, UTF8, tagset description).
Portuguese parameter file with fine-grained tagset provided by Pablo Gamallo (gzip compressed, UTF8, tagset description).
Another Portuguese parameter file trained on the Floresta Sintá(c)tica corpus and the Unitex lexicon (gzip compressed, UTF8).
Romanian parameter file (gzip compressed, UTF8, tagset, created with the help of Cristian Chirita using a MULTEXT-East corpus and lexicon)
Russian parameter file (gzip compressed, UTF8, tagset, trained on a corpus created by Serge Sharoff)
Slovak parameter file (gzip compressed, UTF8, Slovak National Corpus, tagset)
Slovak parameter file (full tags) (gzip compressed, UTF-8, Slovak National Corpus, tagset, Vladimir Benko)
Slovenian parameter file (gzip compressed, UTF-8, ssj500k 1.3)
Spanish parameter file (gzip compressed, UTF-8, tagset documentation)
Spanish parameter file trained on the Ancora corpus (gzip compressed, UTF-8, tagset documentation)
Swahili parameter file (gzip compressed, Helsinki Corpus of Swahili, SALAMA tagset)
Swedish parameter file (gzip compressed, UTF-8, tagset documentation) trained on the Talbanken corpus and the Stockholm University Strindberg Corpus
Ukrainian parameter file (gzip compressed, UTF-8, trained on the UD Treebank)

Chunker parameter files for PC (Linux, Windows, and Mac-Intel)

English chunker parameter file (gzip compressed, UTF8, tagset info)
Note: The English tagger parameter file is needed, as well.
French chunker parameter file (gzip compressed, UTF-8)
The chunker was trained on the French treebank whose annotation guidelines are documented here.
Note: The French tagger parameter file is needed, as well.
German chunker parameter file (gzip compressed, UTF-8, tagset info)
Note: The German tagger parameter file is needed, as well.
Spanish chunker parameter file (gzip compressed, UTF-8)
Note: The Spanish tagger parameter file is needed, as well.

Windows version

Download the Windows TreeTagger package. Unpack the zip file and follow the instructions in the INSTALL.txt file. The parameter files have to be downloaded separately. The tagger has to be invoked from a (Windows, cygwin, msys) shell. Therefore, you might want to install the graphical interface kindly provided by Ciarán Ó Duibhín.

Acknowledgments

The Russian parameter file was created on a corpus provided by Serge Sharoff. He has a webpage with various resources for Russian NLP.

The French and the Italian parameter files are provided by Achim Stein.

The parameter file for the French chunker was created by Michel Généreux.

The second Italian parameter file was provided by Marco Baroni.

The English parameter file was trained on the PENN treebank and uses the English morphological database created by Karp, Schabes, Zaidel and Egedi.

The Spanish parameter file was trained on the Spanish CRATER corpus and uses the Spanish lexicon of the CALLHOME corpus of the LDC.

The Spanish chunker was trained on the IULA Spanish treebank.

The Galician parameter file was trained on the Xiada corpus provided by the Centro Ramón Piñeiro para a Investigación en Humanidades.

The Bulgarian parameter file was created by Julien Nioche on the Bulgarian Treebank. It uses UTF-8 encoding and the BulTreeBank tagset.

The Estonian parameter file was trained on the Tartu Morphologically disambiguated corpus. Thanks to Mark Fishel for pointing me to this data!

Many thanks to Marco Baroni, Pablo Gamallo, Julien Nioche, Serge Sharoff, Michel Généreux, and Achim Stein for making their parameter files publicly available! Also thanks to Holger Wunsch and Cassio Binkowski for compiling the TreeTagger on MacOS and to Florian Bemmann for compiling it for Android systems!
