Rob's search thingy: Crawler options
- Generate abstract file.
- Crawl instead of index.
This also enables charset conversion.
- Broken server: Do a combined HEAD and GET.
Some broken servers respond to a HEAD request as if it were a GET.
Others respond with a 403 or 500 error.
The default behaviour of the software is to do a HEAD, and if the
content-type turns out to be text, HTML or PDF, then a GET.
With the '-b' option, the software will do a GET and abort the download if
the content-type isn't text, HTML or PDF.
These aborted downloads are reported as: 'Operation was aborted by an
application callback'.
The file may be downloaded completely anyway, especially a small file
over a fast link.
The abort works best when downloading large files over slow links.
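The two strategies described above can be sketched as follows. This is a minimal Python sketch; the helper names (`is_wanted`, `fetch_default`, `fetch_b_option`) are hypothetical and not part of the software:

```python
# Hypothetical sketch of the content-type check described above.
# The crawler only wants text, HTML or PDF documents.

def is_wanted(content_type):
    """Return True for text (including HTML) and PDF content types."""
    ct = content_type.split(";")[0].strip().lower()
    return ct.startswith("text/") or ct == "application/pdf"

def fetch_default(head, get, url):
    """Default behaviour: HEAD first; GET only if the type is wanted."""
    if is_wanted(head(url)):
        return get(url)
    return None

def fetch_b_option(get_with_abort, url):
    """'-b' behaviour: GET immediately; abort as soon as the response
    headers show an unwanted type. Small files over fast links may
    finish downloading before the abort takes effect."""
    return get_with_abort(url, keep_going=is_wanted)
```

`head`, `get` and `get_with_abort` stand in for the real HTTP transfer functions, which are not shown in this document.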
- Enable debug.
- Enable external link report.
- -f List_Of_Sites
List of files to be indexed.
With '-c': list of websites to be crawled.
- Allow more than 64k words.
- Index PDF files.
pdftotext needs to be installed for this.
- Re-use old wordlist.
The list is updated and becomes the new wordlist.
- Print word stats.
Warning: Long list!
- Text output.
Can be used for debugging.
- Index non-ASCII.
This assumes UTF-8.
Note: Without this option the indexer will treat all non-ASCII
characters as word separators.
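The effect of this option can be illustrated with two hypothetical tokenizers (not the indexer's actual code): one that splits words on anything outside ASCII letters and digits, and one that treats UTF-8 letters as word characters.

```python
import re

def tokenize_ascii(text):
    # Without the option: any non-ASCII character acts as a word separator.
    return [w for w in re.split(r"[^A-Za-z0-9]+", text) if w]

def tokenize_utf8(text):
    # With the option: non-ASCII letters (assumed to be UTF-8 text)
    # count as word characters.
    return [w for w in re.split(r"[\W_]+", text) if w]
```

For the input `"café au lait"`, the ASCII tokenizer splits at the `é` while the UTF-8 tokenizer keeps the word intact.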
- Print version and exit.
- -w Wait_Time
Wait time between documents, in seconds; millisecond resolution.
Default: 1 s.
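The wait can be sketched as a fractional-seconds sleep, rounded to whole milliseconds. This is a hypothetical helper, not the crawler's actual code:

```python
import time

def wait_between_docs(wait_s=1.0):
    """Sleep between two document downloads.

    The wait time is given in seconds with millisecond resolution,
    so it is first rounded to whole milliseconds. The default of
    1 s matches the documented default. Returns the rounded value
    in milliseconds.
    """
    ms = round(wait_s * 1000)
    time.sleep(ms / 1000.0)
    return ms
```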