The Slovenian language has a terrible shortage of readily available text-processing tools, so about a year ago I had to build my own part-of-speech tagger, based on the widely available IJS JOS-1M corpus, with the help of the NLTK library. Afterwards I published it on my GitHub account as a collection of scripts for training a model, which wasn’t very useful to most people. I’ve now fixed that by publishing a pre-built version of the POS tagger on PyPI, with updated usage instructions and documentation.
This week, on April 26th, I gave a talk about Solr basics at WebCamp Ljubljana. Here I’m listing the slides, tips, and relevant links from the slide deck, which are a good starting point for anyone getting started with a Solr deployment.
There are numerous Python/Solr libraries out there, each offering a different subset of functionality. Obviously, as per Murphy’s law, none of them had the set of features I required. So I rolled my own: PySolarized! I wrote PySolarized because I needed a Solr connector that could dispatch documents to, and query, multiple cores.
I’ve just uploaded the 1.1 update of the Lemmagen lemmatizer for Solr, which is now a pure Java JAR library and does not require installing any additional files on your server. The new version also updates the package name and configuration attribute to be more consistent.
Over the last few days I’ve finished wrestling with Python’s awful packaging systems and managed to publish the Lemmagen lemmatizer to the Python PyPI repository. To install it, just run pip install Lemmagen inside your favorite Python/virtualenv environment. Note that installation requires a working C++ compiler for your platform. Then, to use it, instantiate the Lemmatizer class and call lemmatize() on it. By default the lemmatizer is instantiated with the Slovenian dictionary; others are available via the dictionary keyword argument to the constructor.
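The usage described above can be sketched roughly as follows. The Lemmatizer class and its lemmatize() method come from the description above; the import path lemmagen.lemmatizer, the helper lemmatize_words and the sample words are assumptions made for illustration, so check the package documentation before relying on them.

```python
def lemmatize_words(words, lemmatizer):
    """Return the lemma of each word, using any object with a lemmatize() method.

    lemmatize_words is a hypothetical helper, not part of the Lemmagen package.
    """
    return [lemmatizer.lemmatize(word) for word in words]


if __name__ == "__main__":
    try:
        # Assumed import path; verify it against the package documentation.
        from lemmagen.lemmatizer import Lemmatizer
    except ImportError:
        print("Lemmagen is not installed; run: pip install Lemmagen")
    else:
        # With no arguments the Slovenian dictionary is used. Other
        # dictionaries are selected with the dictionary keyword argument.
        lemmatizer = Lemmatizer()
        print(lemmatize_words(["hodimo", "mize"], lemmatizer))
```

Keeping the helper independent of the import means it works with any object that exposes lemmatize(), which also makes it easy to try out without a compiled install.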
I’ve been spending my time hacking on my Slovenian news parser called news-bus buddy (source on Bitbucket). Since the “recent news” page looked a little empty with only news titles, I needed an algorithm to get a summary of an article from the database.
Apache Solr is a popular full-text search engine with a RESTful interface, which makes it a perfect search engine for most types of websites. However, the quality of search results depends on language filters, with a good lemmatizer being the most essential. That’s why I’ve created a Solr module that uses JSI’s LemmaGen lemmatizer for Solr index building and search queries. The source for the lemmatizer and the module is available in the slovene_lemmatizer repository on Bitbucket.