Note: This combination of indexer and search form works on the file system.
The indexer doesn't crawl! Consequently, it needs to run on the system it indexes.
I use Linklint as an internal
link checker. It looks for broken links on my own website. One of the reports
it produces is a file called 'file.txt', which contains a list of all the
files linked on my website. I use a shell script to generate a list containing
all plain-text, HTML and PDF files on my site. Pdftotext is used to generate
text versions of PDF files. There is a file with the extension '.title' for each
The file-list file has the following format:
|/var/www/time/T4224.txt||/time/T4224.pdf||Temic U4224B Time Code Receiver|
|/home/rob/WWW/index.html||/~rob/||Rob's home page|
Other shell scripts use this file to generate the
sitemap.html and sitemap.xml.
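To illustrate the file-list format shown above, here is a minimal C sketch that splits one line into its three fields. It is not the code used on this site. The struct, the field names (path, url, title), the function names and the 512-byte field limit are all assumptions made for this example; the meaning of the fields is inferred from the two sample lines.

```c
#include <stddef.h>
#include <string.h>

#define FIELD_MAX 512  /* assumed upper bound on the length of one field */

/* Hypothetical record type; the field names are guesses based on the samples. */
struct filelist_entry {
    char path[FIELD_MAX];   /* file-system path of the text or HTML file */
    char url[FIELD_MAX];    /* URL path of the document on the website   */
    char title[FIELD_MAX];  /* document title                            */
};

/* Copy len bytes of src into dst and terminate it; fails if it does not fit. */
static int copy_field(char *dst, const char *src, size_t len)
{
    if (len >= FIELD_MAX)
        return -1;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 0;
}

/* Parse one line of the form |path||url||title|.
 * Returns 0 on success, -1 if the line does not have that shape. */
int parse_filelist_line(const char *line, struct filelist_entry *e)
{
    const char *p = line;
    const char *sep;
    size_t len;

    if (*p != '|')                      /* every line starts with a pipe */
        return -1;
    p++;

    sep = strstr(p, "||");              /* end of the path field */
    if (sep == NULL || copy_field(e->path, p, (size_t)(sep - p)) != 0)
        return -1;
    p = sep + 2;

    sep = strstr(p, "||");              /* end of the URL field */
    if (sep == NULL || copy_field(e->url, p, (size_t)(sep - p)) != 0)
        return -1;
    p = sep + 2;

    len = strcspn(p, "\r\n");           /* ignore a trailing newline */
    if (len == 0 || p[len - 1] != '|')  /* the title ends at the closing pipe */
        return -1;
    return copy_field(e->title, p, len - 1);
}
```

A caller would typically read the file-list with fgets() and pass each line to parse_filelist_line(). The sketch assumes that neither the path nor the URL contains the sequence '||'.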
A combination of a shell script and some C programs is used to index my site. The indexer creates a word list, a word-to-document index and an abstracts file. These are used by the web form.
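As a rough illustration of the first step in building a word list, the following C sketch splits a text into lower-cased words. It is only a sketch under stated assumptions, not the tokenizer used by the indexer: the function names are hypothetical, and the rule that every byte of a multi-byte UTF-8 sequence counts as part of a word is a simplification chosen for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* ASCII letters and digits, plus every byte of a multi-byte UTF-8
 * sequence (0x80 and above), are treated as part of a word. */
static int is_word_byte(unsigned char c)
{
    return c >= 0x80 || isalnum(c);
}

/* Copy the next word starting at *cursor into word, lower-casing ASCII
 * letters, and advance *cursor past it. size must be at least 1.
 * Words longer than size - 1 bytes are truncated.
 * Returns 1 if a word was found, 0 at the end of the text. */
int next_word(const char **cursor, char *word, size_t size)
{
    const unsigned char *p = (const unsigned char *)*cursor;
    size_t n = 0;

    while (*p != '\0' && !is_word_byte(*p))   /* skip separators */
        p++;
    if (*p == '\0') {
        *cursor = (const char *)p;
        return 0;
    }
    while (*p != '\0' && is_word_byte(*p)) {  /* collect one word */
        if (n + 1 < size)
            word[n++] = (*p < 0x80) ? (char)tolower(*p) : (char)*p;
        p++;
    }
    word[n] = '\0';
    *cursor = (const char *)p;
    return 1;
}
```

Calling next_word() in a loop over the text of one document yields the words that a word list and a word-to-document index could be built from. Note that truncating a very long word may cut a multi-byte character in half; a real indexer would have to deal with that.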
The software assumes that all HTML files have the '.html' extension.
If this is not the case, you need to modify the shell scripts and C sources
to include other extensions.
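The kind of change this involves on the C side might look like the sketch below. It is purely illustrative: the table and the function name are hypothetical and do not come from the actual sources. Supporting another extension, such as '.htm', would then mean adding one entry to the table.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table of extensions that get indexed; NULL ends the list. */
static const char *const indexed_ext[] = { ".html", ".txt", NULL };

/* Return 1 if filename ends in one of the listed extensions, else 0. */
int has_indexed_extension(const char *filename)
{
    const char *dot = strrchr(filename, '.');
    const char *slash = strrchr(filename, '/');
    size_t i;

    /* No dot at all, or the last dot belongs to a directory name. */
    if (dot == NULL || (slash != NULL && dot < slash))
        return 0;

    for (i = 0; indexed_ext[i] != NULL; i++) {
        if (strcmp(dot, indexed_ext[i]) == 0)
            return 1;
    }
    return 0;
}
```

The comparison is case-sensitive, so 'INDEX.HTML' would not match in this sketch.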
Furthermore, the software assumes that the charset is UTF-8.
There is no need for a database server. The software maintains the files on its own.
The functionality of this software is quite limited, but it's also very fast: if I run the indexer from the prompt, the prompt returns right away, having indexed some 16000 words from ca. 190 documents.