Rob's search thingy

Description

Note: This combination of indexer and search form works on the file system.
The indexer doesn't crawl, so it needs to run on the system it indexes.

I use Linklint as an internal link checker; it looks for broken links on my own website. One of the reports it produces is a file called 'file.txt', which contains a list of all the files linked on my website. I use a shell script to generate a list of all plain-text, HTML and PDF files on my site. Pdftotext is used to generate text versions of the PDF files. There is a file with the extension '.title' for each PDF file.
The file-list file has the following format:

/Path/File<Tab>URL<Tab>Title<LF>

E.g.:

File                      URL               Title
/var/www/index.html       /                 Rob's server
/var/www/time/T4224.txt   /time/T4224.pdf   Temic U4224B Time Code Receiver
/home/rob/WWW/index.html  /~rob/            Rob's home page
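
To make the record layout concrete, here is a minimal sketch of how one line of the file list could be split into its three fields. It is not taken from the actual sources; the names `list_entry` and `parse_list_line` are made up for illustration, and the sketch assumes exactly the Tab-separated, LF-terminated layout described above.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record type; the field names are illustrative only. */
struct list_entry {
    char *path;   /* file-system path of the file to index */
    char *url;    /* URL under which the file is published */
    char *title;  /* document title */
};

/*
 * Split one line of the file list in place.
 * The buffer is modified: the first two Tabs and the trailing LF
 * are overwritten with '\0', and the fields point into the buffer.
 * Returns 0 on success, -1 if the line has fewer than three fields.
 */
int parse_list_line(char *line, struct list_entry *out)
{
    char *tab1, *tab2;

    assert(line != NULL && out != NULL);

    line[strcspn(line, "\n")] = '\0';   /* drop the trailing LF, if any */

    tab1 = strchr(line, '\t');
    if (tab1 == NULL)
        return -1;
    tab2 = strchr(tab1 + 1, '\t');
    if (tab2 == NULL)
        return -1;

    *tab1 = '\0';
    *tab2 = '\0';
    out->path  = line;
    out->url   = tab1 + 1;
    out->title = tab2 + 1;
    return 0;
}
```

Because the fields point into the caller's buffer, nothing is allocated; the buffer has to stay alive for as long as the entry is used.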

Other shell scripts use this file to generate sitemap.html and sitemap.xml.
A combination of a shell script and some C programs indexes my site. The indexer creates a word list, a word-to-document index and an abstracts file. These are used by the web form.
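
As a rough illustration of the first step in building a word list, the sketch below pulls words out of a piece of text one at a time. It is an assumption about how such a tokenizer might look, not the indexer's real code: the function name `next_word` and the rule for what counts as a word (ASCII letters and digits, plus any byte of a UTF-8 multi-byte sequence) are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed rule: ASCII letters/digits and all UTF-8 multi-byte bytes
 * (values >= 0x80) belong to a word; everything else separates words. */
static int is_word_byte(unsigned char c)
{
    return c >= 0x80 || isalnum(c);
}

/*
 * Copy the next word from *cursor into word (at most size-1 bytes,
 * longer words are truncated) and advance *cursor past it.
 * ASCII letters are lower-cased; non-ASCII bytes are copied unchanged.
 * Returns 1 if a word was found, 0 at the end of the text.
 */
int next_word(const char **cursor, char *word, size_t size)
{
    const unsigned char *p;
    size_t n = 0;

    assert(cursor != NULL && *cursor != NULL && word != NULL && size > 1);
    p = (const unsigned char *)*cursor;

    while (*p != '\0' && !is_word_byte(*p))   /* skip separators */
        p++;
    while (*p != '\0' && is_word_byte(*p)) {  /* collect one word */
        if (n + 1 < size)
            word[n++] = (*p < 0x80) ? (char)tolower(*p) : (char)*p;
        p++;
    }
    word[n] = '\0';
    *cursor = (const char *)p;
    return n > 0;
}
```

Calling this in a loop over each document, and recording the document number next to every word it returns, would give the raw material for a word-to-document index. Note that truncating an over-long word can cut a UTF-8 sequence in the middle; a real indexer would want to handle that case.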

The software assumes that all HTML files have the '.html' extension. If this is not the case, you need to modify the shell scripts and C sources to include other extensions. Furthermore, the software assumes that the character set is UTF-8!
There is no need for a database server. The software maintains the files on its own.
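
If your HTML files use other extensions, the change in the C sources might look something like the sketch below. This is only an illustration of the idea; `has_extension`, `is_html_file` and the extension table are hypothetical names, and the real sources may test the extension in a different way.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if name ends with ext (e.g. ".html"), 0 otherwise.
 * The comparison is case-sensitive. */
int has_extension(const char *name, const char *ext)
{
    size_t nlen, elen;

    assert(name != NULL && ext != NULL);
    nlen = strlen(name);
    elen = strlen(ext);
    return nlen >= elen && strcmp(name + nlen - elen, ext) == 0;
}

/* Hypothetical list of extensions treated as HTML; extend as needed. */
static const char *const html_extensions[] = { ".html", ".htm", ".xhtml" };

/* Return 1 if name ends with one of the extensions listed above. */
int is_html_file(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof html_extensions / sizeof html_extensions[0]; i++)
        if (has_extension(name, html_extensions[i]))
            return 1;
    return 0;
}
```

Keeping the extensions in one table means a new extension is a one-line change on the C side; the shell scripts still need the matching change.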

The functionality of this software is quite limited, but it's also very fast: if I run the indexer from the prompt, the prompt returns right away, having indexed some 16000 words from ca. 190 documents.

File formats

Description of file formats.

Header files

Some notes on the contents of some of the header files

Download and install

Options
