Rob's search thingy

Note: This combination of indexer and search form works on the file system.
The indexer doesn't crawl! Consequently, it needs to run on the system it indexes.
The search form may produce abstracts; to do this, it needs access to the files.

I use Linklint as an internal link checker; it looks for broken links on my own website. One of the reports it produces is a file called 'file.txt', which contains a list of all the files linked on my website. I use a shell script to generate a list containing all plain-text, HTML and PDF files on my site. Pdftotext is used to generate text versions of the PDF files. There is a file with the extension '.title' for each PDF file.
The file-list file has the following format:

/Path/File<Tab>URL<Tab>Title<LF>

E.g.:

 File                      URL              Title
 /var/www/index.html       /                Rob's server
 /var/www/time/T4224.txt   /time/T4224.pdf  Temic U4224B Time Code Receiver
 /home/rob/WWW/index.html  /~rob/           Rob's home page
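
As an illustration of the format, here is a minimal sketch of how one such line could be produced for an HTML file. It is not the script from websearch.tar.gz: the function name list_line and the DOCROOT variable are hypothetical, it assumes the <title> element sits on a single line in lowercase, and it only handles files below the document root (not home directories such as /~rob/).

```shell
#!/bin/sh
# Sketch only: emit one file-list line (Path<Tab>URL<Tab>Title) for an HTML file.
# DOCROOT and the function name are assumptions, not part of the original scripts.
DOCROOT=${DOCROOT:-/var/www}

list_line() {
    path=$1
    # Derive the URL by stripping the document root from the path.
    url=${path#"$DOCROOT"}
    # Treat a directory's index.html as the directory URL itself.
    case $url in
        */index.html) url=${url%index.html} ;;
    esac
    # Take the first <title>...</title> that fits on one line (case-sensitive).
    title=$(sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' "$path" | head -n 1)
    printf '%s\t%s\t%s\n' "$path" "$url" "$title"
}
```

Called as `list_line /var/www/index.html`, this would print the path, the URL `/` and the page title, separated by tabs.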

Other shell scripts use this file to generate the sitemap.html and sitemap.xml.
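
A rough sketch of how the file list could be turned into a sitemap.xml follows. The function name make_sitemap and the BASE variable (the site's base URL) are assumptions made for this example; the real scripts may work differently. Only the '&' character is escaped here.

```shell
#!/bin/sh
# Sketch only: turn a file list (Path<Tab>URL<Tab>Title) into sitemap.xml.
# BASE (the site's base URL) and the function name are assumptions.
BASE=${BASE:-http://www.example.com}

make_sitemap() {
    # $1: the file list
    echo '<?xml version="1.0" encoding="UTF-8"?>'
    echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    # Field 2 is the URL; escape '&' so the output stays valid XML.
    awk -F '\t' -v base="$BASE" 'NF >= 2 {
        url = $2
        gsub(/&/, "\\&amp;", url)
        printf "  <url><loc>%s%s</loc></url>\n", base, url
    }' "$1"
    echo '</urlset>'
}
```

Something like `make_sitemap filelist.txt > sitemap.xml` would then write the result, where filelist.txt stands for whatever the file list is actually called.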
A combination of a shell script and some C programs is used to index my site. The indexer creates a word list and a word-to-document index. These are used by the search form.
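
To make the idea of a word list and a word-to-document index concrete, here is a small shell-only sketch. It is not the C indexer from the tarball: the function names and the "word<Tab>path" output format are made up for this example, and it indexes the raw file content, so HTML tags are not stripped.

```shell
#!/bin/sh
# Sketch only: build a word list and a word-to-document index from a file list.
# This is not the author's C indexer; names and output format are assumptions.

# Print "word<Tab>path" pairs, one per distinct word per document.
word_doc_pairs() {
    tab=$(printf '\t')
    while IFS="$tab" read -r path url title; do
        [ -f "$path" ] || continue
        # Split on non-alphanumerics (one word per line), then lowercase.
        tr -cs '[:alnum:]' '\n' < "$path" | tr '[:upper:]' '[:lower:]' |
            sort -u | awk -v doc="$path" 'length($0) > 0 { print $0 "\t" doc }'
    done < "$1"
}

# Word list: every distinct word on the site, sorted.
word_list() {
    word_doc_pairs "$1" | cut -f 1 | sort -u
}

# Look up the documents that contain a given (lowercase) word.
lookup() {
    # Appending "" forces a string comparison, even for numeric-looking words.
    word_doc_pairs "$1" | awk -F '\t' -v w="$2" '$1 == (w "") { print $2 }'
}
```

With a file list at hand, `lookup filelist.txt time` would print the paths of all documents containing the word "time". A real indexer would store the pairs once instead of recomputing them for every query.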

The software assumes that all HTML files have the '.html' extension. If this is not the case, you need to modify the shell scripts and C sources to include other extensions.
There is no need for a database server. The software maintains the files on its own.
The functionality of this software is quite limited, but it's also very fast: if I run the indexer from the prompt, the prompt returns right away, having indexed some 16000 words from ca. 190 documents.

Source: websearch.tar.gz