Rob's search thingy File formats

General

Word numbers are signed 32-bit. Only word numbers > zero are valid.
Document numbers are 1 to 65530 (unsigned 16-bit).

Binary formats

Binary formats are used by the search form.

words-list

This is either a compact or a non-compact format. The compact format is probably more efficient in case of a large words-list. The compact format needs an additional index file.

Non-compact file format

Alphabetical list of words.
32-bit (4-byte) word number followed by word. Word is max 31 bytes followed by a terminating NULL (before 2021-03-08 this used to be max 23 bytes).

              1        2        3
    ┌────────┬────────┬────────┬────────┐
  0 │ Word number                       │
    ├────────┼────────┼────────┼────────┤
  4 │ Word                              │
    ├────────┼────────┼────────┼────────┤
  8 │                                   │
    ├────────┼────────┼────────┼────────┤
 12 │                                   │
    ├────────┼────────┼────────┼────────┤
 16 │                                   │
    ├────────┼────────┼────────┼────────┤
 20 │                                   │
    ├────────┼────────┼────────┼────────┤
 24 │                                   │
    ├────────┼────────┼────────┼────────┤
 28 │                                   │
    ├────────┼────────┼────────┼────────┤
 32 │                            0      │
    ├────────┼────────┼────────┼────────┤
 36 │ 0        0        0        0      │
    └────────┴────────┴────────┴────────┘

Records are fixed length with zero padding. There is an additional 4 byte pad, which is zero. Word numbers start at one, not zero. Charset is UTF-8. The sort order is ascending unsigned byte value, not Unicode code points.
Just over 26 million words are supported (1073741820 / 40). With the '-k' option this may be higher.
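
A minimal sketch of reading the 40-byte records described above. The byte order is not stated in this document; little-endian is an assumption here, and 'pack_word' / 'read_words' are hypothetical helper names, not part of the software.

```python
import struct

# Assumed layout: little-endian ("<"); the byte order is NOT stated in the spec.
# i = signed 32-bit word number, 32s = word field (max 31 bytes + NULL),
# 4x = the additional 4-byte zero pad. Total: 40 bytes.
RECORD = struct.Struct("<i32s4x")

def pack_word(number: int, word: str) -> bytes:
    """Build one 40-byte record (hypothetical helper, for illustration only)."""
    raw = word.encode("utf-8")
    if number <= 0 or not 0 < len(raw) <= 31:
        raise ValueError("word number must be > 0 and word must be 1..31 bytes")
    return RECORD.pack(number, raw)  # struct fills the rest of the field with zeros

def read_words(data: bytes):
    """Yield (word number, word) for every fixed-length record in data."""
    for number, field in RECORD.iter_unpack(data):
        # The word ends at the first NULL; the remainder is zero padding.
        yield number, field.split(b"\0", 1)[0].decode("utf-8")
```

Because the sort order is ascending unsigned byte value, comparing the encoded bytes (not the decoded strings) is the safe way to check or search such a list.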

Compact file format

Alphabetical list of words. Generated by 'cgi-index -k'.
UTF-8 NULL terminated character strings.

    ┌────────┬──────     ──────┬────────┐
    │ Word           ...         0      │
    └────────┴──────     ──────┴────────┘

Records are variable length. The sort order is ascending unsigned byte value.

words.idx

Index to words-list.
Array of structs: 32-bit word number, followed by offset to word, followed by word length.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
  0 │ Word number                       │
    ├────────┼────────┼────────┼────────┤
  4 │ Pointer / Offset (bytes)          │
    ├────────┼────────┼────────┼────────┤
  8 │ Length (bytes)                    │
    └────────┴────────┴────────┴────────┘

The length does not include the terminating NULL.
Note: This file is only present when the words-list is in the compact format, and it must not exist otherwise. When words.idx exists, the software will ASSUME that the words-list is compact! The compact file format is completely incompatible with the non-compact file format!
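
A rough sketch of how words.idx and a compact words-list fit together. It assumes little-endian byte order and unsigned 32-bit offset and length fields, neither of which is stated here; 'read_compact_words' is a hypothetical name.

```python
import struct

# Assumed: little-endian, signed word number, unsigned offset and length.
ENTRY = struct.Struct("<iII")  # 12 bytes per words.idx entry

def read_compact_words(words_list: bytes, words_idx: bytes):
    """Yield (word number, word) for every entry in words.idx."""
    for number, offset, length in ENTRY.iter_unpack(words_idx):
        end = offset + length
        # The length excludes the terminating NULL, so it must follow the word.
        if words_list[end:end + 1] != b"\0":
            raise ValueError(f"entry for word {number} has no terminating NULL")
        yield number, words_list[offset:end].decode("utf-8")
```

The NULL check doubles as a cheap consistency test between the two files.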

index-list

Lists per word in which documents these words occur. E.G.: Word number 3 is present in documents 6, 7 and 8.
Lower 16 bits of word number followed by one or more unsigned 16-bit document numbers. Terminated with a 16-bit NULL.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
    │ Word number     │ Document number │
    ├────────┼────────┼────────┼────────┤
    │ Document number │ Document number │
    ├────────┼────────┼────────┼────────┤
     ...
    ├────────┼────────┼────────┼────────┤
    │ Document number │ 0               │ 
    └────────┴────────┴────────┴────────┘

Records are variable length. Document numbers start at one, not zero. The word number can be used for debugging and file consistency checks.
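
The record layout above can be walked like this. This is a sketch only: little-endian byte order is an assumption, and 'parse_index_record' is a hypothetical name.

```python
import struct

def parse_index_record(data: bytes, offset: int = 0):
    """Return (low 16 bits of word number, [document numbers], next offset)."""
    (word_low,) = struct.unpack_from("<H", data, offset)  # "<" is an assumption
    offset += 2
    docs = []
    while True:
        (doc,) = struct.unpack_from("<H", data, offset)
        offset += 2
        if doc == 0:  # 16-bit NULL terminates the record
            break
        docs.append(doc)
    return word_low, docs, offset
```

The returned offset points at the next record, so the function can be called in a loop to scan the whole file.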

index.idx

Index to index-list.
Array of structs: 32-bit word number, followed by offset to record, followed by record length.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
  0 │ Word number                       │
    ├────────┼────────┼────────┼────────┤
  4 │ Pointer / Offset (bytes)          │
    ├────────┼────────┼────────┼────────┤
  8 │ Length (bytes)                    │
    └────────┴────────┴────────┴────────┘

The length does not include the terminating NULL. The word number can be used for debugging and file consistency checks. Sort order is word number.
Note: Offset and length are bytes, not number of data elements.
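
Since the entries are sorted by word number, a lookup can use a binary search. The sketch below assumes little-endian byte order and unsigned offset and length fields; 'find_entry' is a hypothetical name and not necessarily how the search form does it.

```python
import struct

ENTRY = struct.Struct("<iII")  # assumed: little-endian; 12 bytes per entry

def find_entry(index_idx: bytes, word_number: int):
    """Binary search index.idx; return (offset, length) in bytes, or None."""
    lo, hi = 0, len(index_idx) // ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        number, offset, length = ENTRY.unpack_from(index_idx, mid * ENTRY.size)
        if number == word_number:
            return offset, length
        if number < word_number:
            lo = mid + 1
        else:
            hi = mid
    return None
```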

abstr-list

Lists per document the first 94 words. Generated by 'cgi-index -a'.
32-bit document number, followed by one or more word numbers. Terminated with a 32-bit NULL.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
    │ Document number                   │
    ├────────┼────────┼────────┼────────┤
    │ Word number                       │
    ├────────┼────────┼────────┼────────┤
     ...
    ├────────┼────────┼────────┼────────┤
    │ Word number                       │
    ├────────┼────────┼────────┼────────┤
    │ 0                                 │
    └────────┴────────┴────────┴────────┘

Records are fixed length with zero padding. Word numbers start at one, not zero. The document number can be used for debugging and file consistency checks. Sort order is document number.
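
A small sketch of decoding one abstr-list record that has already been cut out of the file. Little-endian byte order is an assumption, and 'parse_abstr_record' is a hypothetical name.

```python
import struct

def parse_abstr_record(record: bytes):
    """Return (document number, [word numbers]) from one zero-padded record."""
    values = [v for (v,) in struct.iter_unpack("<i", record)]  # "<" is assumed
    document, words = values[0], []
    for word in values[1:]:
        if word == 0:  # 32-bit NULL: the rest of the record is padding
            break
        words.append(word)
    return document, words
```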

links-list

Contains URLs and their titles; <a href="$URL">$TITLE</a>
UTF-8 NULL terminated character strings.

    ┌────────┬──────     ──────┬────────┐
    │ Link           ...         0      │
    └────────┴──────     ──────┴────────┘

Records are variable length.

links.idx

Index to links-list.
Array of structs: 32-bit document number, followed by offset to record, followed by record length.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
  0 │ Document number                   │
    ├────────┼────────┼────────┼────────┤
  4 │ Pointer / Offset (bytes)          │
    ├────────┼────────┼────────┼────────┤
  8 │ Length (bytes)                    │
    └────────┴────────┴────────┴────────┘

The length does not include the terminating NULL. The document number can be used for debugging and file consistency checks. Sort order is document number.

Text formats

Text formats are used for debugging or to generate binary files.

num-words.list

Generated by 'cgi-index -t'. Alphabetical list of words.
Hex word number followed by word.

  XXXX<SP>Word<LF>

E.G.:

  102B caffeine

With the '-l' option;

  0000102B caffeine
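
The two line shapes above can be read and written as follows. This is only an illustration; the function names and the 'long' parameter (standing in for the '-l' option) are hypothetical.

```python
def parse_num_word(line: str):
    """Split 'XXXX<SP>Word<LF>' into (word number, word)."""
    number, word = line.rstrip("\n").split(" ", 1)
    return int(number, 16), word  # accepts both 4 and 8 hex digits

def format_num_word(number: int, word: str, long: bool = False) -> str:
    """Format one line; long=True mimics the 8-digit form shown for '-l'."""
    width = 8 if long else 4
    return f"{number:0{width}X} {word}\n"
```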

index.list

Generated by 'cgi-index -t'. Lists per word in which documents these words occur. E.G.: Word number 3 is present in documents 6, 7 and 8.
Hex word number followed by one or more hex document numbers separated by spaces.

  XXXX<SP>XXXX<SP>XXXX ... XXXX<LF>

E.G.:

  0003 0006 0007 0008
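
As an illustration of how such a line relates to the binary index-list record, here is a sketch. Little-endian byte order is an assumption, the function names are hypothetical, and this is not claimed to be what the actual tools do.

```python
import struct

def parse_index_line(line: str):
    """Split a line of hex numbers into (word number, [document numbers])."""
    fields = [int(field, 16) for field in line.split()]
    if len(fields) < 2:
        raise ValueError("need a word number and at least one document number")
    return fields[0], fields[1:]

def to_index_record(word: int, docs) -> bytes:
    """Pack low 16 bits of the word number, the documents, and a 16-bit NULL."""
    return struct.pack(f"<{len(docs) + 2}H", word & 0xFFFF, *docs, 0)
```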

num-abstr.list

Generated by 'cgi-index -a -t'. Lists per document the first 94 words.
Hex document number followed by one or more hex word numbers separated by spaces.

  XXXX<SP>XXXX<SP>XXXX ... XXXX<LF>

E.G.:

  0001 3029 31DA 3BAD

With the '-l' option;

  00000001 00003029 000031DA 00003BAD

num-links.list

Used to generate links-list and links.idx.
Hex document number followed by tab or a single space followed by link.

  XXXX<Tab_Or_Space>Link<LF>

Example

  0001	<a href="/">Rob's server</a>
  0002	<a href="/~g%C3%BCnter/">Günter's homepage</a>

'gen-num-index $NAME' converts num-$NAME.list to $NAME-list and $NAME.idx
Document numbers start at one, not zero.
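
To make the relation between the text file and the two binary files concrete, here is a rough sketch of such a conversion for links. It is not the actual gen-num-index: the byte order (little-endian), the unsigned offset and length fields, and the name 'convert_links' are all assumptions.

```python
import re
import struct

ENTRY = struct.Struct("<iII")  # assumed: document number, offset, length
LINE = re.compile(r"([0-9A-Fa-f]+)[\t ](.*)")  # hex number, tab or space, link

def convert_links(lines):
    """Return (links-list bytes, links.idx bytes) built from num-links lines."""
    entries = []
    for line in lines:
        match = LINE.fullmatch(line.rstrip("\n"))
        if match is None:
            raise ValueError(f"malformed line: {line!r}")
        entries.append((int(match.group(1), 16), match.group(2).encode("utf-8")))
    blob, idx = bytearray(), bytearray()
    for document, raw in sorted(entries):  # links.idx is sorted by document number
        idx += ENTRY.pack(document, len(blob), len(raw))  # length excludes the NULL
        blob += raw + b"\0"
    return bytes(blob), bytes(idx)
```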

Character set and file names

The charset of the links file is UTF-8. This applies to the file system as well!
Note how the u-umlaut / u-diaeresis ('ü') is escaped in the above example ('%C3%BC'): each byte in the UTF-8 multi-byte sequence is replaced by its percent-hex value.
Charsets other than UTF-8 will not work!
Furthermore, do not use shell meta characters (E.G.: space) in file names. This will not work!
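
The escaping shown in the num-links.list example ('ü' becomes '%C3%BC') can be reproduced with Python's standard library. 'escape_path' is a hypothetical helper name; keeping '/' and '~' unescaped is a choice made here to match that example.

```python
from urllib.parse import quote

def escape_path(path: str) -> str:
    """Percent-encode each byte of the UTF-8 form, keeping '/' and '~' as-is."""
    return quote(path, safe="/~", encoding="utf-8")
```

Note that a space would come out as '%20' in the URL, but the file name itself should still not contain one.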

Synonym file formats

This is actually a US <-> GB conversion; a US spelling search will also look up GB spelled words. A GB spelling search will also look up US spelled words. The current lookup table is based on an Aspell dump;
gb2us.tsv.gz is a gzipped GB to US spelling TSV file. It contains more than 2500 GB - US word pairs.
Switching the columns yields a US to GB conversion. And combining the two does both.
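
A minimal sketch of combining both directions from GB to US TSV lines. It assumes one tab-separated 'GB<Tab>US' pair per line and takes "alphabetical" to mean ascending byte order; 'both_directions' is a hypothetical name.

```python
def both_directions(tsv_lines):
    """Return sorted (word, synonym) pairs covering GB -> US and US -> GB."""
    pairs = set()
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        gb, us = line.split("\t")
        pairs.add((gb, us))  # original column order
        pairs.add((us, gb))  # switched columns
    # Assumption: alphabetical order means ascending UTF-8 byte values.
    return sorted(pairs, key=lambda p: (p[0].encode("utf-8"), p[1].encode("utf-8")))
```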

synonyms.list

List of words and their synonyms. One pair per line;

  Word<Single Space or Tab>Synonym<LF>

Example:

center	centre
centre	center
color	colour
colour	color
fiber	fibre
fibre	fiber

This file is in alphabetical order.
It's used by 'gensynontab' to generate the files 'synonyms-list' and 'synonyms.idx'.
synonyms-list and synonyms.idx are used by the indexer. When synonyms are enabled (-o option), it will index synonyms as if they were part of the text.
Note: Synonyms do not show up in abstracts. Only the words that are actually in the text do.

synonyms-list

List of words and their synonyms.
Lower case NULL-terminated UTF-8 strings without padding. So records have no fixed length.

    ┌────────┬──────     ──────┬────────┐
    │ Word or synonym  ...       0      │
    └────────┴──────     ──────┴────────┘

Example:

center
centre
color
colour
fiber
fibre

This file is in alphabetical order.

synonyms.idx

Index to synonyms-list.
Word - synonym pair lookup table. Each record contains two 32-bit signed integers. The first points to a word, the second to its synonym.

              1        2        3
    ┌────────┬────────┬────────┬────────┐
  0 │ Word offset (bytes)               │
    ├────────┼────────┼────────┼────────┤
  4 │ Synonym offset (bytes)            │
    └────────┴────────┴────────┴────────┘

Note that there is just one synonym per word.
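
A sketch of resolving the offset pairs against synonyms-list. Little-endian byte order is an assumption; 'cstring' and 'load_synonyms' are hypothetical names.

```python
import struct

PAIR = struct.Struct("<ii")  # assumed little-endian: word offset, synonym offset

def cstring(blob: bytes, offset: int) -> str:
    """Read the NULL-terminated UTF-8 string that starts at offset."""
    end = blob.index(b"\0", offset)
    return blob[offset:end].decode("utf-8")

def load_synonyms(synonyms_list: bytes, synonyms_idx: bytes) -> dict:
    """Map each word to its single synonym."""
    return {cstring(synonyms_list, word): cstring(synonyms_list, synonym)
            for word, synonym in PAIR.iter_unpack(synonyms_idx)}
```

A dict suits the one-synonym-per-word rule: a word can only map to one value.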