Apache Solr is a popular full-text search engine with RESTful interface, which makes it perfect search engine with most type of web sites. However, the quality of search results is dependent on language filters, with a good lemmatizer being the most essential.
That’s why I’ve created a Solr module, which uses JSIs LemmaGen lemmatizer for Solr index building and search queries. The source to the lemmatizer and module is available on Bitbucket slovene_lemmatizer repository.
1. Download lemmatizer files
Compile the files yourself of grab pre-compiled binaries from downloads section of the repository:
Solr 4.x / Linux x64
2. Install lemmatizer native library
libLemmatizer.so to a path in your application servers
java.library.path. You can either copy it to a default search directory (e.g.
/usr/local/lib) or pass
-Djava.library.path=/path/to/libLemmatizer.so to your application server startup script.
For Tomcat 6 on Debian/Ubuntu edit /etc/default/tomcat6 and upadte JAVA_OPTS to:
JAVA_OPTS="-Djava.awt.headless=true -Xmx128m -XX:+UseConcMarkSweepGC -Djava.library.path=/path/to/folder/with/liblemmatizer/so"
3. Install lemmatizer Solr module
sl_lemmatizer-1.0.jar to your cores lib directory. E.g. if your core is located in
/var/solr/core, create a
lib folder next to
data folders of the core and copy the lemmatizer there.
4. Copy lemmatizer dictionary to folder accessible by Tomcat
Copy .bin file for your selected language from dictionaries folder to a folder readable by Tomcat. I usually put it into Solrs Catalina home directory. Dictionary file for slovenian language is named
Lemmatizer is a Solr filter, so you use it by passing a
<filter> parameter into index and query
<analyzer> parts in
schema.xml. The lemmatizer requires path to dictionary file as a parameter:
<filter class="si.virag.solr.RdrLemmatizerFactory" dictionary="/path/to/lem-me-sl.bin" />
The full Solr configuration example with schema.xml is available on my GitHub in solr_example repository. More information can also be found on the slides for my Solr talk in Kiberpipa.
Contributions to the library or examples welcome :)