Project spotlight: Slovenian part-of-speech tagger for Python

Jernej Virag

July 12, 2014

The Slovenian language has a terrible lack of widely available text-processing tools, so about a year ago I had to build my own part-of-speech tagger, based on the freely available IJS JOS-1M corpus, with the help of the NLTK library. Afterwards I published it on my GitHub account as a collection of scripts for training a model, which wasn't really useful to most people.

So now I've fixed the situation by publishing a pre-built version of the POS tagger on PyPI, with updated usage instructions and documentation.

To use it, just install it with pip:

pip install slopos

This will install the tagger and its dependencies, NLTK and PyYAML.

To use it, just call the tag function on the slopos module:

import slopos
tags = slopos.tag("Jaz sem iz okolice Ljubljane")

print tags
[('Jaz', 'ZOP-EI'),
 ('sem', 'GP-SPE-N'),
 ('iz', 'DR'),
 ('okolice', 'SOZER'),
 ('Ljubljane', 'SLZER.')]

Note that import slopos will take a while, because it has to load and unpack the tagger.
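If the slow import is a problem at program startup, one option is to defer it until the first sentence actually needs tagging. This is only a minimal sketch: tag_sentence is a hypothetical helper name, not part of slopos, and it relies only on the slopos.tag call shown above.

```python
def tag_sentence(sentence):
    """Tag a sentence, loading slopos only when first needed.

    tag_sentence is a hypothetical wrapper, not part of the slopos package.
    """
    # Importing inside the function moves the slow load from program
    # startup to the first call. Python caches imported modules, so
    # later calls do not pay the cost again.
    import slopos

    return slopos.tag(sentence)
```

The first call is still slow; the wrapper only changes when the cost is paid.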

The tagger tokenizes the sentence automatically and returns the result in the form of (word, tag) tuples. Tags consist of a sequence of letters, where each subsequent letter gives a more detailed classification of the word. The first letter always denotes the general word class (e.g. S for noun, G for verb). Words that cannot be classified are marked with the tag -None-.
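Since the first letter carries the general word class, grouping words by that letter is a quick way to pull out, say, all the nouns. The sketch below works on a hard-coded copy of the sample output shown earlier, so it runs without slopos installed. The helper names word_class and group_by_class are made up for illustration, and it assumes unclassified words arrive either as the string "-None-" or as None.

```python
from collections import defaultdict

# Assumption: unclassified words carry this literal string as their tag.
UNCLASSIFIED = "-None-"


def word_class(tag):
    """Return the first letter of a tag, or None for unclassified words."""
    if not tag or tag == UNCLASSIFIED:
        return None
    return tag[0]


def group_by_class(tagged):
    """Group words from (word, tag) tuples by their general word class."""
    groups = defaultdict(list)
    for word, tag in tagged:
        groups[word_class(tag)].append(word)
    return dict(groups)


# Copy of the sample output from the example above.
tagged = [
    ("Jaz", "ZOP-EI"),
    ("sem", "GP-SPE-N"),
    ("iz", "DR"),
    ("okolice", "SOZER"),
    ("Ljubljane", "SLZER."),
]

groups = group_by_class(tagged)
nouns = groups.get("S", [])  # S is the noun class
verbs = groups.get("G", [])  # G is the verb class
```

Here nouns holds the two words tagged with S and verbs holds the one tagged with G. The other class letters are listed in the tag reference.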

The full tag reference is available in the tag_reference-sl.txt file on GitHub.

The project is still available on GitHub under the LGPLv2.1 license. Contributions and bug fixes in the form of pull requests are very much appreciated.