The text indexer currently used in Gjallar is Swish-e. Generally it:
- is quite fast
- creates small index files
- scales to at least a few million documents
- is available for the main platforms (Linux/Unix, Win32, MacOS)
- is available under a suitable license (GPL v2 with a special rule)
- and has plenty of features
It also has some downsides: it does not (yet) support multibyte character encodings such as UTF-8, and it is a bit awkward when it comes to changing or deleting documents, in other words incremental updates.
Gjallar interfaces with Swish-e using OSProcess/CommandShell on Unix/Linux and a few bat files on Win32. Swish-e is used in unmodified form. Ideally we could later use FFI or a plugin and call the C API directly, since an addendum to the license allows using this API even though Swish-e itself is under the GPL.
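To illustrate what driving an unmodified Swish-e binary from the outside looks like, here is a minimal sketch in Python (not Gjallar's actual Smalltalk code). It assumes the `swish-e` executable is on the PATH, uses the standard `-f` (index file), `-w` (query words) and `-m` (max results) switches, and assumes the default result line format of rank, path, quoted title and size, with `.` ending the output. The names `Hit`, `build_search_command`, `parse_search_output` and `search` are made up for this example.

```python
import re
import subprocess
from typing import List, NamedTuple


class Hit(NamedTuple):
    rank: int
    path: str
    title: str
    size: int


# Assumed default result line: <rank> <path> "<title>" <size>
_RESULT_LINE = re.compile(r'^(\d+) (.+) "(.*)" (\d+)$')


def build_search_command(index_file: str, query: str,
                         max_results: int = 20,
                         binary: str = "swish-e") -> List[str]:
    """Build the argument list for a search against one index file."""
    return [binary, "-f", index_file, "-w", query, "-m", str(max_results)]


def parse_search_output(text: str) -> List[Hit]:
    """Turn raw Swish-e search output into a list of hits."""
    hits = []
    for line in text.splitlines():
        if line.startswith("err: no results"):
            return []  # assumption: this message just means an empty result
        if line.startswith("err:"):
            raise RuntimeError(line)
        if line == ".":
            break  # end-of-results marker
        if line.startswith("#"):
            continue  # header lines
        match = _RESULT_LINE.match(line)
        if match:
            rank, path, title, size = match.groups()
            hits.append(Hit(int(rank), path, title, int(size)))
    return hits


def search(index_file: str, query: str, max_results: int = 20) -> List[Hit]:
    """Run swish-e as a child process and parse what it prints."""
    proc = subprocess.run(
        build_search_command(index_file, query, max_results),
        capture_output=True, text=True, check=False,
    )
    return parse_search_output(proc.stdout)
```

A call such as `search("index.swish-e", "smalltalk")` would then return the parsed hits. Passing the arguments as a list avoids shell quoting problems with user-supplied query strings, which is one of the things a direct C API binding would make unnecessary.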
We might also want to look into other suitable alternatives. I believe Xapian is the main contender, especially since the developers of Swish3 (Swish-e version 3) point to it as the "wheel" they see no point in reinventing.
Note: Currently we use neither the aggregator nor any of the filters of Swish-e (but filters may be worth looking into in order to index attachments too).