Rob's search thingy

Description

Note: This indexer and search-form combination works on the file system.
The indexer doesn't crawl! Consequently, it needs to run on the system it indexes.
If you want a crawler instead, look here.

Indexer

I use Linklint as an internal link-checker. It looks for broken links on my own website. One of the reports it produces is a file called 'file.txt', which contains a list of all the files linked on my website. I use a shell script to generate a list containing all plain-text, HTML and PDF files on my site. Pdftotext is used to generate text versions of PDF files. There is a file with the extension '.title' for each PDF file.
The file-list file has the following format:

/Path/File<Tab>URL<Tab>Title<LF>

E.g.:

File                      URL              Title
/var/www/index.html       /                Rob's server
/var/www/time/T4224.txt   /time/T4224.pdf  Temic U4224B Time Code Receiver
/home/rob/WWW/index.html  /~rob/           Rob's home page
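
For anyone who wants to read the file-list from their own scripts, here is a minimal sketch in Python (not one of my shell scripts or C programs). It assumes exactly three tab-separated fields per line, as in the format above. The names FileListEntry and parse_file_list_line are made up for this example.

```python
from typing import NamedTuple


class FileListEntry(NamedTuple):
    path: str   # first field: /Path/File
    url: str    # second field: URL
    title: str  # third field: Title


def parse_file_list_line(line: str) -> FileListEntry:
    """Split one '/Path/File<Tab>URL<Tab>Title<LF>' record into its fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    # Stripping surrounding blanks is an assumption of this sketch.
    path, url, title = (field.strip() for field in fields)
    return FileListEntry(path, url, title)


entry = parse_file_list_line(
    "/var/www/time/T4224.txt\t/time/T4224.pdf\tTemic U4224B Time Code Receiver\n"
)
print(entry.url)
```

Note that the path points at the text version of the PDF, while the URL points at the PDF itself, just as in the second example line.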

Other shell scripts use this file to generate the sitemap.html and sitemap.xml.
A combination of a shell script and some C programs is used to index my site. The indexer creates a word-list, a word-to-document index and an abstracts file. These are used by the web-form.
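
To give an idea of what a word-list and a word-to-document index are, here is a minimal sketch in Python. It is not the actual C code and says nothing about the real file formats. It assumes words are separated by whitespace and folded to lower case (the lower-casing is my assumption for the example), and the name build_index is hypothetical. The abstracts file is left out.

```python
from collections import defaultdict


def build_index(documents: list[str]) -> tuple[list[str], dict[str, list[int]]]:
    """Return a sorted word-list and a word -> document-numbers index."""
    postings: dict[str, set[int]] = defaultdict(set)
    for doc_number, text in enumerate(documents):
        # Assumption: whitespace-separated words, compared in lower case.
        for word in text.lower().split():
            postings[word].add(doc_number)
    word_list = sorted(postings)
    index = {word: sorted(postings[word]) for word in word_list}
    return word_list, index


word_list, index = build_index(["time code receiver", "home page", "time page"])
print(word_list)
print(index["time"])
```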

The software assumes that all HTML files have the '.html' extension. If this is not the case, you need to modify the shell scripts and C sources to include other extensions. Furthermore, the software assumes that the charset is UTF-8!
There is no need for a database server. The software maintains the files on its own.

Tag handling

Except for text in alt attributes, names and titles, the software ignores all text inside HTML tags. Text in alt attributes, names and iframe titles needs to be in quotes (alt="Some text").
The software considers tags to be word delimiters. This does not apply to '<a href="Url">', '</a>', bold, italic and underline. So if only part of a word is clickable, it still gets indexed as one word.
Numeric and SGML entities are converted to UTF-8 before indexing. Some 2400 HTML, SGML and XML entities are supported.
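
The delimiter rule can be sketched as follows. This is a rough Python illustration, not the C code the indexer uses. It assumes that "bold, italic and underline" means the tags b, i and u, it skips the alt, name and title text entirely, and it relies on Python's own entity table (html.unescape) rather than the list of entities the software supports. The name words_from_html is made up.

```python
import html
import re

# Tags that do NOT split a word: <a ...>, </a>, and (assumed) <b>, <i>, <u>.
INLINE_TAG = re.compile(r"</?(?:a|b|i|u)(?:\s[^>]*)?>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]*>")


def words_from_html(markup: str) -> list[str]:
    text = INLINE_TAG.sub("", markup)  # join the pieces of a partly clickable word
    text = ANY_TAG.sub(" ", text)      # every other tag acts as a word delimiter
    text = html.unescape(text)         # numeric and named entities -> characters
    return text.split()


print(words_from_html('<p>Ger&auml;t <a href="Url">click</a>able</p><p>next</p>'))
```

Entities are decoded after the tags are removed, so a decoded '&lt;' is never mistaken for the start of a tag.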

Non ASCII

When indexing 'garät' it indexes 'garat' as well. This way, if search words are entered without accents, both accented and non-accented forms are found.
Note: This only works for Latin scripts.
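
One way to get such an unaccented form is Unicode decomposition. The sketch below is Python and only shows the idea; the software itself uses its own header files for this, and I am not claiming they work the same way. The names strip_accents and index_forms are hypothetical.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Decompose the word (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_forms(word: str) -> set[str]:
    """Return the forms to index: the word itself plus its unaccented form."""
    return {word, strip_accents(word)}


print(sorted(index_forms("garät")))
```

Letters without a decomposition (such as 'ø' or 'ß') pass through this sketch unchanged, so it is cruder than a real mapping table.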

Limits

Internal arithmetic is done with 32-bit signed integers, which limits the size of files and number of unique search words.
Furthermore, document numbers are unsigned 16-bit integers, which limits the number of indexed documents.
Data is processed on a per-line basis. Lines may not be longer than 4095 bytes, including newline.
When indexing non-ASCII, only UTF-8 is supported.
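
To put numbers on these limits, here is a small Python sketch. The constants follow directly from the integer sizes named above; how the software applies them exactly (for instance, whether document number 0 is used) is not covered, and the function names are made up.

```python
INT32_MAX = 2**31 - 1    # largest 32-bit signed integer: 2147483647
UINT16_MAX = 2**16 - 1   # largest unsigned 16-bit integer: 65535
MAX_LINE_BYTES = 4095    # per line, newline included


def line_fits(line: str) -> bool:
    """Check the line length in bytes (not characters) after UTF-8 encoding."""
    return len(line.encode("utf-8")) <= MAX_LINE_BYTES


def document_number_fits(number: int) -> bool:
    """Check that a document number can be stored in 16 unsigned bits."""
    return 0 <= number <= UINT16_MAX


print(INT32_MAX, UINT16_MAX)
print(line_fits("ä" * 2047 + "\n"))
```

Because the limit is in bytes, a line of accented text reaches it with fewer characters than a plain ASCII line.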

The functionality of this software is quite limited, but it's also very fast: if I run the indexer from the prompt, the prompt returns right away, having indexed some 20000 words from ca. 200 documents.

Search

The search-form is very simple too. It produces links to all the pages which contain all of the searched words. All on one page!
When more than one word is entered, it produces abstracts too. Each abstract contains the first 94 words of the document.
Searched words are highlighted in the abstracts.
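
The behaviour described here (all words must match, abstracts of 94 words, highlighting) can be sketched in Python like this. It is an illustration only, not the CGI program. The in-memory index, the exact-match comparison and the '<b>' markup used for highlighting are assumptions of the example, and the function names are hypothetical.

```python
ABSTRACT_WORDS = 94  # each abstract holds the first 94 words of the document


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return the documents that contain every word of the query."""
    words = query.split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))  # copy, so the index stays intact
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def abstract(text: str, query: str) -> str:
    """Return the first 94 words, with the searched words highlighted."""
    wanted = set(query.split())
    first_words = text.split()[:ABSTRACT_WORDS]
    # Assumption: highlighting is shown here with <b>...</b>.
    return " ".join(f"<b>{w}</b>" if w in wanted else w for w in first_words)


index = {"time": {0, 2}, "page": {1, 2}}
print(search(index, "time page"))
print(abstract("time code receiver", "code"))
```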

File formats

Description of file formats.

Header files

Some notes on the contents of some of the header files.

Download and install

Options

Options

Version

Search:  2022-01-20 19:15:56 UTC
Indexer: 2022-01-20 19:01:41 UTC

Updates

Date        Change
2021-03-08  Max word size from 23 to 31 bytes.
2021-09-20  Updated no-alnum.h and wc2num.h in order to support Unicode 14.
2021-10-31  Minor change to cgi-index man page.
            Minor changes to cgi-index help.
2021-11-06  cgi-index now checks '-f' option.
2021-11-08  cgi-index now checks 64k and 2G boundaries.
2021-11-09  Increased max path from 255 to 2047.
2021-11-12  Minor cosmetic fix of cgi-index.
2021-11-30  Lots of changes to cgi-index:
            Renamed wc2asc.h to wc2str.h
            Renamed cgi-index.h to wc2asc.h
            Moved stuff from cgi-index.c to new cgi-index.h
            Commented out unused stuff in cgi-search.h
            Minor cosmetic fix.
            Tar contains wc2num.h for glyphs > 64k. I accidentally packaged the small version on 2021-09-20.
2021-12-11  Reduced max path from 2047 to 2031 bytes.
            Cleaner tag handling.
2021-12-13  Fixed minor abstract bug.
2021-12-14  Added some decomposed support.
2021-12-19  Fixed typos in the file section of man pages.
2021-12-24  Better tag handling.
2021-12-25  Better tag handling.
2021-12-26  Fixed typo.
2022-01-06  Tar now extracts in its own directory 'index'. I should have done this a long time ago.
2022-01-10  Better tag handling.
2022-01-20  Compact words-list file format.