
Word Search is Useless for Japanese

As mentioned on moongift, Hatta's indexed search doesn't seem to work for the Japanese language. Actually it does work, but in a completely useless way.

It seems that Japanese text doesn't usually use spaces to separate words: the grammar lets readers see where words end without them, similar to how it worked in ancient Latin. But Hatta's word indexer uses spaces and punctuation to separate words for indexing, so Japanese sentences, or even whole paragraphs, get indexed as single words. This is obviously pretty useless.
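
To make the problem concrete, here is a minimal sketch of a splitter that breaks text on spaces and punctuation. It is not Hatta's actual indexer: the names `WORD_RE` and `split_words` are made up for illustration, and the real code may differ in its details. It only shows why this kind of splitting falls over on text without spaces.

```python
import re

# Hypothetical stand-in for a space-and-punctuation word splitter.
# This is an illustration, not Hatta's actual indexing code.
WORD_RE = re.compile(r"\w+")


def split_words(text):
    """Return the lowercased runs of word characters found in text."""
    return [match.group(0).lower() for match in WORD_RE.finditer(text)]


# English text comes apart into separate, searchable words.
english = split_words("Hatta indexes words, one by one.")

# A Japanese sentence has no spaces, so everything up to the final
# full stop comes back as one long "word".
japanese = split_words("日本語の文章は単語の間に空白を入れません。")

print(english)
print(len(japanese))
```

With a splitter like this, the English sentence yields six index entries, while the whole Japanese sentence becomes a single entry that only an exact search for the full sentence would match.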

I can think of several possible solutions, but none of them seem to be within my reach without help from someone who knows Japanese or similar languages.

I would be very interested in any information about how it is usually done in Japanese and other languages/scripts that don't use spaces for separating words. – Radomir Dopieralski


The ejSplitter was very efficient for COREblog (a blog engine for Zope). It should be trivial to port ejSplitter to Hatta. – Klaus Alexander Seistrup


Thank you, this is exactly what I was looking for. Unfortunately, the whole ejSplitter required some Zope modules, so I just cut out the code I needed and put it in a SplitJapanese.py file. Whenever Hatta finds this file in its import path, it enables indexing of Japanese text. – Radomir Dopieralski
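
The "enable it if the file is importable" behaviour described above can be sketched roughly like this. This is only an illustration of the optional-import pattern, not Hatta's real code: the helper names `load_optional_splitter` and `split_words` are invented here, and the function name `split` inside the module is an assumption, since the actual interface of SplitJapanese.py is not described on this page.

```python
import importlib
import re


def load_optional_splitter(module_name="SplitJapanese", function_name="split"):
    """Return a splitter function from an optional module, or None.

    Both default names are assumptions made for this sketch; the real
    module may expose a different interface.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module is not on the import path: the feature stays off.
        return None
    return getattr(module, function_name, None)


def split_words(text, splitter=None):
    """Split text with the optional splitter, else by word characters."""
    if splitter is not None:
        return list(splitter(text))
    return re.findall(r"\w+", text)


splitter = load_optional_splitter()
print(split_words("plain words still work", splitter=None))
```

The point of this design is that the wiki keeps working without the extra file, and dropping the file into the import path is all it takes to switch the Japanese splitting on.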

