TreeTagger - a part-of-speech tagger for many languages
The TreeTagger is a tool for annotating text with part-of-speech and
lemma information. It was developed by Helmut Schmid in the TC project
at the Institute for Computational Linguistics of the University of
Stuttgart. The TreeTagger has been successfully used to tag German,
English, French, Italian, Danish, Swedish, Norwegian, Dutch, Spanish,
Bulgarian, Russian, Portuguese, Belarusian, Ukrainian, Galician,
Greek, Chinese, Swahili, Slovak, Slovenian, Latin, Estonian, Polish,
Persian, Romanian, Czech, Albanian, Coptic and Old French texts and is adaptable
to other languages if a lexicon and a manually tagged training corpus
are available.
Sample output:
word       | pos  | lemma
The        | DT   | the
TreeTagger | NP   | TreeTagger
is         | VBZ  | be
easy       | JJ   | easy
to         | TO   | to
use        | VB   | use
.          | SENT | .
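For downstream processing, the three columns of the sample output can be read into simple records. The following is a minimal sketch, assuming the tagger writes one token per line with tab-separated word, tag and lemma fields; the names `TaggedToken` and `parse_tagger_output` are illustrative and not part of the TreeTagger distribution.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    """One output line: surface form, part-of-speech tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagger_output(output: str) -> List[TaggedToken]:
    """Parse tab-separated word/pos/lemma lines into TaggedToken records.

    Assumption: each token occupies one line with exactly three
    tab-separated fields. Lines with a different shape are skipped.
    """
    tokens = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # blank line or a line that is not a word/pos/lemma triple
        tokens.append(TaggedToken(*fields))
    return tokens


# The sample sentence from the table above, written as tab-separated lines.
sample = (
    "The\tDT\tthe\n"
    "TreeTagger\tNP\tTreeTagger\n"
    "is\tVBZ\tbe\n"
    "easy\tJJ\teasy\n"
    "to\tTO\tto\n"
    "use\tVB\tuse\n"
    ".\tSENT\t.\n"
)
tokens = parse_tagger_output(sample)
print([t.lemma for t in tokens])
```

Skipping lines that do not have three fields is a design choice for this sketch; a stricter parser might raise an error instead.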
The TreeTagger can also be used as a chunker for English, German,
French, and Spanish.
The tagger is described in the following two papers:
Download
Executable code for PC-Linux, Windows, Mac-OS, and ARM
and parameter files for various languages can be downloaded
via the links below.
This software is freely available for research, education and
evaluation. For commercial and other licenses, please contact the developer via the email
address at the bottom of the page.
Please read
the license
terms before you download the software! By downloading the
software, you agree to the terms stated there.
The following steps are necessary to install the TreeTagger (see
below for the Windows version). Download the files by
right-clicking on the link. Then select "save file as". All files should be
stored in the same directory.
-
Download the tagger package for your system
(PC-Linux,
Mac OS-X (Intel),
Mac OS-X (M1),
ARM64,
ARMHF,
ARM-Android,
PPC64le-Linux).
If you have problems with your Linux kernel version, download this
older Linux version and
rename it to tree-tagger-linux-3.2.5.tar.gz.
-
Download the tagging
scripts into the same directory.
-
Download the installation script install-tagger.sh.
-
Download the parameter files for the languages you want to
process.
-
Open a terminal window and run the installation script in the
directory where you have downloaded the files:
sh install-tagger.sh
-
Run a test, e.g.
echo 'Hello world!' | cmd/tree-tagger-english
or
echo 'Das ist ein Test.' | cmd/tagger-chunker-german
- You also might want to have a look at my new part-of-speech tagger RNNTagger.
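The test commands in the steps above can also be driven from a script. The following is a minimal sketch, assuming it is run from the installation directory so that the relative path `cmd/tree-tagger-english` resolves, and assuming UTF-8 input and output; `run_tagger` is a hypothetical helper name, not something shipped with the TreeTagger.

```python
import subprocess
from typing import List, Sequence


def run_tagger(
    text: str,
    command: Sequence[str] = ("cmd/tree-tagger-english",),
) -> List[str]:
    """Pipe `text` into a tagging script and return its output lines.

    Equivalent in spirit to: echo 'Hello world!' | cmd/tree-tagger-english
    Assumption: the script reads from standard input and writes the
    tagged tokens to standard output.
    """
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if the script fails
    )
    return result.stdout.splitlines()


# Example call (requires an installed TreeTagger in the current directory):
# for line in run_tagger("Hello world!"):
#     print(line)
```

Passing a different `command`, for example `("cmd/tagger-chunker-german",)`, runs the German tagger and chunker script from the second test instead.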
Make sure that the installation path contains no blanks and that the files are not automatically unzipped, i.e. that the
file ending .gz is still present. If you have difficulties with the
installation, have a look at
the installation hints (kindly
provided by Joachim Wagner).
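The two conditions named above (no blanks in the installation path, downloaded files still ending in .gz) can be checked before running the installation script. The following is a rough sketch; the function name `check_download_dir` is hypothetical, and the assumption that an unzipped download ends in `.tar` or `.par` is an illustration rather than documented behaviour.

```python
from typing import Iterable, List


def check_download_dir(directory: str, filenames: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # The installation path must not contain blanks.
    if any(ch.isspace() for ch in directory):
        problems.append(f"installation path contains blanks: {directory!r}")
    # Downloaded archives must keep their .gz ending.
    # Assumption: a browser that unzipped a download leaves a file
    # ending in .tar (tagger package, scripts) or .par (parameter file).
    for name in filenames:
        if name.endswith((".tar", ".par")):
            problems.append(f"{name} looks unzipped; expected {name}.gz")
    return problems


# Example with the package name mentioned in the installation steps.
print(check_download_dir("/home/user/treetagger", ["tree-tagger-linux-3.2.5.tar.gz"]))
```

In practice the file names would come from something like `os.listdir(directory)`.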
Parameter files
-
Albanian
parameter file (gzip compressed, UTF-8, trained on
Albanian POS)
-
Belarusian
parameter file (gzip compressed, UTF-8, trained on
the UD Treebank)
-
Bulgarian
parameter file (gzip compressed, UTF-8, tagset documentation, trained on
the Bulgarian
Treebank)
-
Catalan
parameter file (gzip compressed, UTF8, tagset documentation)
-
A Chinese parameter file and tokenizer created by Serge Sharoff are available here
-
A Coptic parameter file created by Amir Zeldes is available here
-
Czech
parameter file (gzip compressed, UTF-8, trained on
the Czech Academic Corpus)
-
Danish
parameter file trained on the ePAROLE corpus (gzip compressed, UTF-8, tagset documentation)
-
Dutch
parameter file (gzip compressed, UTF-8, tagset documentation)
-
Another Dutch
parameter file (gzip compressed, UTF8, trained on the
Eindhoven corpus, tagset
documentation (starts on page 9))
-
English
parameter file (PENN tagset) (gzip compressed,
UTF8, tagset documentation, trained
on the Penn treebank)
-
English
parameter file (BNC tagset) (gzip compressed,
UTF8, tagset documentation, trained
on the British National Corpus)
-
Estonian
parameter file (gzip compressed, UTF-8, tagset documentation)
-
Finnish parameter file
trained on
the Finnish
Treebank (gzip compressed,
UTF-8, tagset
documentation).
-
French
parameter file (gzip compressed, UTF-8, tagset documentation) trained on data kindly provided by Prof. Achim Stein
-
Spoken French
parameter file (gzip compressed, UTF-8, tagset documentation) trained on the Perceo corpus
-
A parameter file for spoken French texts can be
found here
-
Old French
parameter file (gzip compressed, UTF-8, tagset documentation) trained on the Base de Français Médiéval
-
Galician
parameter file (gzip compressed, UTF8, tagset documentation)
-
German
parameter file (gzip compressed, UTF-8, tagset documentation)
-
Spoken German
parameter file (gzip compressed, Latin-1, tagset documentation)
trained on the FOLK corpus provided by the Institut für Deutsche Sprache (IDS) Mannheim
-
Middle High German
parameter file trained by Sarah Schulz on
the Middle High German
Conceptual Database (gzip compressed, UTF-8, paper (in German))
-
Greek
parameter file trained on the INTERA corpus (gzip compressed, UTF8, tagset documentation)
-
Ancient Greek parameter file
(UTF8 encoding
or beta
encoding) trained on the PROIEL
and Perseus
treebanks and kindly provided by Alessandro
Vatri and Barbara McGillivray (gzip compressed, no lemmas, tagset documentation)
-
A Hausa parameter file created by Amir Zeldes is available here
-
Hungarian
parameter file (gzip compressed, UTF8, trained on data annotated with magyarlanc)
-
The Indonesian
parameter file (gzip compressed, UTF8, tagset documentation) has been trained by Prihantoro on the UI corpus using lexical information from the Kateglo dictionary.
-
Italian
parameter file (gzip compressed, UTF8, tagset documentation) trained on data kindly provided by Prof. Achim Stein
-
Marco Baroni's Italian
parameter file (gzip compressed, Latin1, tagset documentation)
-
Korean
parameter file (gzip compressed,
UTF8, tagset documentation). This
parameter file was created in joint work with Prof. Lee Minhaeng on data
kindly provided by the KLPLAB
headed by Prof. Ock Cheol-Young.
-
Latin
parameter file (gzip compressed, tagset info in Italian)
The corpus and
lexicon for training the Latin parameter file have been compiled by
Gabriele Brandolini from
various resources
-
Another Latin
parameter file (gzip compressed, tagset
info) which has been trained on
the Index
Thomisticus Treebank which was kindly provided by Marco Passarotti.
-
Mongolian
parameter file (gzip compressed) created from a small Mongolian corpus by Khuder Altangerel.
-
Norwegian (Bokmaal)
parameter file trained on the Norwegian Dependency Treebank (gzip compressed, UTF-8) with tags mapped to the universal dependency tagset
-
Persian (Farsi) parameter file
trained on the Persian Dependency Treebank
(gzip compressed, UTF8, tagset description).
-
Persian (Farsi) parameter file with coarse tagset
trained on the Persian Dependency Treebank
(gzip compressed, UTF8, tagset description).
-
Polish parameter file
trained on the Polish National Corpus
(gzip compressed, UTF8, tagset description).
-
Portuguese parameter file
provided by Pablo Gamallo
(gzip compressed, UTF8, tagset description).
-
Portuguese
parameter file with fine-grained tagset
provided by Pablo Gamallo
(gzip compressed, UTF8, tagset description).
-
Another Portuguese parameter file
trained on
the Floresta
Sintá(c)tica corpus and the Unitex lexicon (gzip compressed, UTF8).
-
Romanian
parameter file (gzip compressed, UTF8, tagset,
created with the help of Cristian Chirita using a
MULTEXT-East corpus
and lexicon)
-
Russian
parameter file (gzip compressed, UTF8, tagset,
trained on a corpus created
by Serge Sharoff)
-
Slovak
parameter file (gzip compressed, UTF8)
The Slovak parameter file was trained on the Slovak National
Corpus. The tagset was simplified.
-
Slovak
parameter file (full tags) (gzip compressed, UTF-8)
The Slovak parameter file was trained on the Slovak National
Corpus. The tagset was not simplified (just a marker for typos was
removed). Many thanks to Vladimir Benko for suggesting training on the full
tagset and also for his bug reports.
-
Slovenian
parameter file (gzip compressed, UTF-8)
The Slovenian parameter file was trained on the ssj500k 1.3
training corpus. The tagset is documented here.
-
Spanish
parameter file (gzip compressed, UTF-8, tagset documentation)
-
Spanish
parameter file trained on the Ancora corpus (gzip compressed, UTF-8, tagset documentation)
-
Swahili
parameter file (gzip compressed)
The Swahili parameter file was trained on the Helsinki Corpus of
Swahili (HCS) and uses a simplified version of the SALAMA tagset. The HCS
was created by Prof. Arvi Hurskainen by means of his Swahili Language Manager
(SALAMA) which uses Lingsoft's TWOL compiler for constructing morphological
analysers and Connexor's CG2 parser for syntactic disambiguation. The creation
of the parameter file was joint work with Gabriele Brandolini.
-
Swedish
parameter file (gzip compressed, UTF-8, tagset documentation) trained on the Talbanken corpus and the Stockholm University Strindberg Corpus
-
Ukrainian
parameter file (gzip compressed, UTF-8, trained on
the UD Treebank)
Chunker parameter files for PC (Linux, Windows, and Mac-Intel)
-
English
chunker parameter file (gzip compressed, UTF8, tagset info)
Note: The English tagger parameter file is needed, as well.
-
French
chunker parameter file (gzip compressed, UTF-8)
The chunker was trained on the French treebank whose annotation guidelines
are documented here.
Note: The French tagger parameter file is needed, as well.
-
German
chunker parameter file (gzip compressed, UTF-8, tagset info)
Note: The German tagger parameter file is needed, as well.
-
Spanish
chunker parameter file (gzip compressed, UTF-8)
Note: The Spanish tagger parameter file is needed, as well.
Windows version
Download the Windows
TreeTagger package. Unpack the zip file and follow the
instructions in the INSTALL.txt file. The parameter files have to be
downloaded separately. The tagger has to be invoked from a (Windows,
Cygwin, MSYS) shell. Therefore, you might want to install
the graphical interface kindly provided by
Ciarán Ó Duibhín.
Acknowledgments
The Russian parameter file was created on a corpus provided by Serge
Sharoff. He has a webpage with various
resources for Russian NLP.
The French and the Italian parameter files are provided by Achim
Stein.
The parameter file for the French chunker was created by Michel Généreux.
The second Italian parameter file was provided by Marco Baroni.
The English parameter file was trained on
the PENN
treebank and uses the English morphological database created by Karp,
Schabes, Zaidel and Egedi.
The Spanish parameter file was trained on
the Spanish CRATER corpus and uses the Spanish lexicon
of the CALLHOME corpus of
the LDC.
The Spanish chunker was trained on
the IULA Spanish treebank.
The Galician parameter file was trained on
the Xiada corpus provided by the Centro Ramón Piñeiro para a Investigación en Humanidades.
The Bulgarian parameter file was created by Julien Nioche on
the Bulgarian
Treebank. It uses UTF-8 encoding and
the BulTreeBank tagset.
The Estonian parameter file was trained on
the Tartu Morphologically disambiguated corpus. Thanks
to Mark Fishel for pointing me to this data!
Many thanks to Marco Baroni, Pablo Gamallo,
Julien Nioche, Serge Sharoff, Michel Généreux, and Achim
Stein for making their parameter files publicly available! Also thanks
to Holger Wunsch and Cassio Binkowski for compiling the TreeTagger on MacOS and to Florian Bemmann for compiling it for Android systems!
Please send questions, comments, suggestions and bug reports to Helmut
Schmid at LastName@cis.lmu.de.