Rob's search thingy File formats

Binary formats

Binary formats are used by the search form.

word-list

Alphabetical list of words.
Each record is a 32-bit (4-byte) word number followed by the word. The word is at most 31 bytes followed by a terminating NUL byte (before 2021-03-08 the maximum was 23 bytes).

              1        2        3
    ┌────────┬────────┬────────┬────────┐
  0 │ Word number                       │
    ├────────┼────────┼────────┼────────┤
  4 │ Word                              │
    ├────────┼────────┼────────┼────────┤
  8 │                                   │
    ├────────┼────────┼────────┼────────┤
 12 │                                   │
    ├────────┼────────┼────────┼────────┤
 16 │                                   │
    ├────────┼────────┼────────┼────────┤
 20 │                                   │
    ├────────┼────────┼────────┼────────┤
 24 │                                   │
    ├────────┼────────┼────────┼────────┤
 28 │                                   │
    ├────────┼────────┼────────┼────────┤
 32 │                            0      │
    ├────────┼────────┼────────┼────────┤
 36 │ 0        0        0        0      │
    └────────┴────────┴────────┴────────┘

Records are fixed length (40 bytes) with zero padding. The word field is followed by an additional 4-byte pad, which is zero. Word numbers start at one, not zero. Charset is UTF-8. The sort order is ascending unsigned byte value (which for UTF-8 coincides with Unicode code point order), not a locale-aware collation.
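
The record layout above maps onto a fixed 40-byte structure, so a lookup can be sketched as a binary search over the file. The following Python is a minimal illustration, not the search form's actual code; the little-endian byte order and all function names are assumptions, since the description does not state them.

```python
import struct
from typing import Optional, Tuple

# Assumption: little-endian integers (the byte order is not documented).
# 4-byte word number, 32-byte word field (31 bytes + NUL), 4 zero pad bytes.
WORD_RECORD = struct.Struct("<I32s4x")


def pack_word(number: int, word: str) -> bytes:
    raw = word.encode("utf-8")
    if len(raw) > 31:
        raise ValueError("word exceeds 31 bytes")
    return WORD_RECORD.pack(number, raw)  # struct zero-pads the word field


def unpack_word(record: bytes) -> Tuple[int, str]:
    number, raw = WORD_RECORD.unpack(record)
    return number, raw.split(b"\0", 1)[0].decode("utf-8")


def find_word(data: bytes, word: str) -> Optional[int]:
    """Binary search over records sorted by unsigned byte value."""
    key = word.encode("utf-8")
    lo, hi = 0, len(data) // WORD_RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        number, raw = WORD_RECORD.unpack_from(data, mid * WORD_RECORD.size)
        candidate = raw.split(b"\0", 1)[0]
        if candidate == key:
            return number
        if candidate < key:  # bytes compare as unsigned values
            lo = mid + 1
        else:
            hi = mid
    return None
```

Comparing `bytes` objects in Python already uses unsigned byte values, which matches the sort order described above.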

index-list

Lists, for each word, the documents in which it occurs. E.G.: Word number 3 is present in documents 6, 7 and 8.
Each record is the lower 16 bits of the word number followed by one or more 16-bit document numbers, terminated with a 16-bit zero.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
    │ Word number     │ Document number │
    ├────────┼────────┼────────┼────────┤
    │ Document number │ Document number │
    ├────────┼────────┼────────┼────────┤
     ...
    ├────────┼────────┼────────┼────────┤
    │ Document number │ 0               │ 
    └────────┴────────┴────────┴────────┘

Records are variable length. Document numbers start at one, not zero. The word number can be used for debugging and file consistency checks.
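
A variable-length record of this kind can be read by consuming 16-bit values until the zero terminator. The sketch below assumes little-endian byte order, which the description does not specify; `read_index_record` is a hypothetical name.

```python
import struct


def read_index_record(data: bytes, offset: int = 0):
    """Return (low 16 bits of word number, document numbers, next offset)."""
    (word_low,) = struct.unpack_from("<H", data, offset)
    offset += 2
    docs = []
    while True:
        (doc,) = struct.unpack_from("<H", data, offset)
        offset += 2
        if doc == 0:  # 16-bit zero terminates the record
            break
        docs.append(doc)
    return word_low, docs, offset
```

Because document numbers start at one, a zero can never be mistaken for a valid document number.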

index.idx

Index to index-list.
Array of structs: 32-bit word number, followed by 32-bit offset to the record, followed by 32-bit record length.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
  0 │ Word number                       │
    ├────────┼────────┼────────┼────────┤
  4 │ Pointer / Offset (bytes)          │
    ├────────┼────────┼────────┼────────┤
  8 │ Length (bytes)                    │
    └────────┴────────┴────────┴────────┘

The length does not include the terminating 16-bit zero. The word number can be used for debugging and file consistency checks. Sort order is word number.
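
Since the entries are sorted by word number, a lookup can be sketched as a binary search followed by one read from index-list. This is an illustration only: the little-endian byte order, the function name, and the reading that the length covers the 16-bit word number plus the document numbers are all assumptions.

```python
import struct
from typing import List, Optional

# Assumption: little-endian; word number, offset (bytes), length (bytes).
IDX_ENTRY = struct.Struct("<III")


def lookup_documents(idx: bytes, index_list: bytes,
                     word_number: int) -> Optional[List[int]]:
    lo, hi = 0, len(idx) // IDX_ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        number, offset, length = IDX_ENTRY.unpack_from(idx, mid * IDX_ENTRY.size)
        if number == word_number:
            # Assumption: length spans the word number and document numbers.
            values = struct.unpack_from("<%dH" % (length // 2), index_list, offset)
            # Consistency check against the word number stored in the record.
            if values[0] != (word_number & 0xFFFF):
                raise ValueError("index.idx and index-list disagree")
            return list(values[1:])
        if number < word_number:
            lo = mid + 1
        else:
            hi = mid
    return None
```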

abstr-list

Lists, for each document, its first 94 words. Generated by 'cgi-index -a'.
Each record is a 32-bit document number followed by one or more 32-bit word numbers, terminated with a 32-bit zero.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
    │ Document number                   │
    ├────────┼────────┼────────┼────────┤
    │ Word number                       │
    ├────────┼────────┼────────┼────────┤
     ...
    ├────────┼────────┼────────┼────────┤
    │ Word number                       │
    ├────────┼────────┼────────┼────────┤
    │ 0                                 │
    └────────┴────────┴────────┴────────┘

Records are fixed length with zero padding. Word numbers start at one, not zero. The document number can be used for debugging and file consistency checks. Sort order is document number.
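
Fixed-length records can be read in one pass. The sketch below assumes a record holds 96 32-bit values (one document number, up to 94 word numbers, and a terminating zero) in little-endian order; neither the exact record size nor the byte order is stated above, so treat both as assumptions to adjust.

```python
import struct
from itertools import takewhile

# Assumption: 1 document number + 94 word numbers + 1 terminating zero.
ABSTR_VALUES = 1 + 94 + 1
ABSTR_RECORD = struct.Struct("<%dI" % ABSTR_VALUES)


def iter_abstracts(data: bytes):
    """Yield (document number, word numbers) for each fixed-length record."""
    for values in ABSTR_RECORD.iter_unpack(data):
        # Word numbers start at one, so the first zero marks the end.
        words = list(takewhile(lambda w: w != 0, values[1:]))
        yield values[0], words
```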

links-list

Contains URLs and their titles in the form <a href="$URL">$TITLE</a>.
Each record is a NUL-terminated UTF-8 character string.

    ┌────────┬──────     ──────┬────────┐
    │ Link           ...         0      │
    └────────┴──────     ──────┴────────┘

Records are variable length.

links.idx

Index to links-list.
Array of structs: 32-bit document number, followed by 32-bit offset to the record, followed by 32-bit record length.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
  0 │ Document number                   │
    ├────────┼────────┼────────┼────────┤
  4 │ Pointer / Offset (bytes)          │
    ├────────┼────────┼────────┼────────┤
  8 │ Length (bytes)                    │
    └────────┴────────┴────────┴────────┘

The length does not include the terminating NUL byte. The document number can be used for debugging and file consistency checks. Sort order is document number.

Text formats

Text formats are used for debugging or to generate binary files.

num-words.list

Generated by 'cgi-index -t'. Alphabetical list of words.
Hex word number followed by a space and the word.

  XXXX<SP>Word<LF>

E.G.:

  102B caffeine

With the '-l' option:

  0000102B caffeine
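
Lines in this format are easy to produce and parse. In the sketch below the function names and the `long_format` parameter (standing in for the '-l' option) are hypothetical, and upper-case hex digits are assumed because the examples use them.

```python
def format_word_line(number: int, word: str, long_format: bool = False) -> str:
    """Format one line: hex word number, a space, the word, a line feed."""
    width = 8 if long_format else 4
    return "%0*X %s\n" % (width, number, word)


def parse_word_line(line: str):
    """Split one line into (word number, word)."""
    number, word = line.rstrip("\n").split(" ", 1)
    return int(number, 16), word
```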

index.list

Generated by 'cgi-index -t'. Lists, for each word, the documents in which it occurs. E.G.: Word number 3 is present in documents 6, 7 and 8.
Hex word number followed by one or more hex document numbers separated by spaces.

  XXXX<SP>XXXX<SP>XXXX ... XXXX<LF>

E.G.:

  0003 0006 0007 0008
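
To illustrate how such a line relates to the binary index-list record, the sketch below parses one line and packs it as 16-bit values with a zero terminator. The function names are hypothetical and little-endian byte order is an assumption.

```python
import struct


def parse_number_line(line: str):
    """Return (first number, remaining numbers) from a line of hex fields."""
    first, *rest = [int(field, 16) for field in line.split()]
    return first, rest


def pack_index_record(word_number: int, docs) -> bytes:
    """Pack the low 16 bits of the word number, the documents, and a zero."""
    values = [word_number & 0xFFFF, *docs, 0]
    return struct.pack("<%dH" % len(values), *values)
```

For the example line above, `parse_number_line` returns word number 3 with documents 6, 7 and 8.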

num-abstr.list

Generated by 'cgi-index -a -t'. Lists, for each document, its first 94 words.
Hex document number followed by one or more hex word numbers separated by spaces.

  XXXX<SP>XXXX<SP>XXXX ... XXXX<LF>

E.G.:

  0001 3029 31DA 3BAD

With the '-l' option:

  00000001 00003029 000031DA 00003BAD

num-links.list

Used to generate links-list and links.idx.
Hex document number followed by a tab or a single space, followed by the link.

  XXXX<Tab_Or_Space>Link<LF>

E.G.:

  0001	<a href="/">Rob's server</a>
  0002	<a href="/~g%C3%BCnter/">Günter's homepage</a>

'gen-num-index $NAME' converts num-$NAME.list to $NAME-list and $NAME.idx.
Document numbers start at one, not zero.
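
The conversion for the links files can be sketched as follows. This is not the gen-num-index source, only an illustration of the formats described in this document; `build_links` is a hypothetical name and little-endian byte order is an assumption.

```python
import re
import struct


def build_links(lines):
    """Build (links-list bytes, links.idx bytes) from num-links.list lines."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # The hex document number is separated from the link by a tab or space.
        number, link = re.split(r"[\t ]", line, maxsplit=1)
        entries.append((int(number, 16), link.encode("utf-8")))
    entries.sort()  # links.idx is sorted by document number

    links_list = bytearray()
    links_idx = bytearray()
    for doc, raw in entries:
        # Offset into links-list; the length excludes the terminating NUL.
        links_idx += struct.pack("<III", doc, len(links_list), len(raw))
        links_list += raw + b"\0"
    return bytes(links_list), bytes(links_idx)
```

Reading a link back is then a matter of slicing links-list at the stored offset for the stored length and decoding the bytes as UTF-8.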

Character set and file names

The charset of the links file is UTF-8. This applies to the file system as well!
Note how the u-umlaut / u-diaeresis ('ü') is escaped in the above example ('%C3%BC'): each byte in the UTF-8 multi-byte sequence is replaced by a percent sign followed by its hex value.
Charsets other than UTF-8 will not work!
Furthermore, do not use shell metacharacters (E.G.: space) in file names. This will not work!
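
The escaping described above can be reproduced with Python's standard library, which may help when preparing link entries. This is only a sketch; the path is the one from the earlier example.

```python
from urllib.parse import quote, unquote

path = "/~günter/"

# Each byte of the UTF-8 sequence for 'ü' (0xC3 0xBC) becomes %XX.
encoded = quote(path, safe="/~", encoding="utf-8")
print(encoded)  # /~g%C3%BCnter/

# Decoding restores the original UTF-8 path.
decoded = unquote(encoded, encoding="utf-8")
```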