Rob's search thingy Crawler

Description

This crawler is based on my indexer. This page primarily describes the differences from the indexer, so you may want to read that page first.

This software both crawls and indexes, so you don't need any additional binaries or scripts (though the supplied script can be handy).
You just need a text file with a list of websites to crawl:

http://www.example.org/
https://www.example.com/

You need a trailing slash ('/') for each site!
Furthermore, the software doesn't understand that http://www.example.org/ and https://www.example.org/ are the same thing. Each website must be crawled entirely over HTTP or entirely over HTTPS. Of course, one site being HTTP and another HTTPS is OK.
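The rules above can be checked mechanically before starting a crawl. This is a minimal sketch, not part of the crawler; the function name is illustrative.

```python
# Sketch: validate a crawler URL list against the rules above
# (trailing slash, http/https only, no host listed with both schemes).
from urllib.parse import urlparse

def check_url_list(lines):
    """Return (url, problem) pairs for entries that break the rules."""
    problems = []
    hosts_seen = {}                      # host -> scheme it was first seen with
    for line in lines:
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        if not url.endswith('/'):
            problems.append((url, 'missing trailing slash'))
        if parsed.scheme not in ('http', 'https'):
            problems.append((url, 'scheme must be http or https'))
        elif hosts_seen.setdefault(parsed.netloc, parsed.scheme) != parsed.scheme:
            problems.append((url, 'same host listed with both http and https'))
    return problems

print(check_url_list(['http://www.example.org/',
                      'https://www.example.org/',    # same host, other scheme
                      'https://www.example.com']))   # no trailing slash
```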

Limits

In addition to the limits of the indexer, this crawler has the following limitations:

Charsets

The software will try to convert legacy charsets to UTF-8. This may reduce the maximum line length to as little as 1023 bytes and the maximum URL and title sizes to as little as 507 bytes.
A lot of people get their charset wrong. So in case of ISO-8859-1, Windows-1252 is assumed. And in case of ISO-8859-9, Windows-1254 is assumed.
Initially the charset is derived from the HTTP response header. If that doesn't specify a charset, the software will try to get it from an HTML HEAD meta http-equiv or meta charset element. The meta element needs to appear before the title: on reading the title, the charset is 'locked'.
If the charset isn't specified, UTF-8 is assumed.
Note: The software will only convert charsets while crawling!
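The resolution order described above can be sketched as follows. This is an illustration, not the crawler's code (the crawler is not written in Python); the function names are made up.

```python
# Sketch of the charset resolution order described above:
# HTTP header first, then HTML meta, else UTF-8, with the documented
# ISO-8859-1 -> Windows-1252 and ISO-8859-9 -> Windows-1254 substitutions.
CHARSET_ALIASES = {'iso-8859-1': 'windows-1252', 'iso-8859-9': 'windows-1254'}

def resolve_charset(http_charset=None, meta_charset=None):
    """Pick the charset to decode with, per the rules above."""
    charset = (http_charset or meta_charset or 'utf-8').lower()
    return CHARSET_ALIASES.get(charset, charset)

def to_utf8(raw_bytes, http_charset=None, meta_charset=None):
    """Convert a fetched document to UTF-8, as the crawler does while crawling."""
    charset = resolve_charset(http_charset, meta_charset)
    return raw_bytes.decode(charset, errors='replace').encode('utf-8')

print(resolve_charset('ISO-8859-1'))      # windows-1252
print(to_utf8(b'caf\xe9', 'iso-8859-1'))  # b'caf\xc3\xa9'
```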

PDF

With the '-p' option the software will index PDF files as well. To this end, PDF files are temporarily saved and then converted to text by pdftotext. The output of pdftotext is then indexed. Unless debug (the '-d' option) is used, PDF files are removed after indexing.
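The save-convert-index-remove cycle looks roughly like this. It's a sketch only: it assumes the poppler 'pdftotext' binary is on the PATH, and what happens with the returned text is up to the indexer.

```python
# Sketch of the PDF path: save the file, run pdftotext, return the text
# for indexing, and remove the temporary file unless debugging.
import os
import subprocess
import tempfile

def pdf_to_text(pdf_bytes, debug=False):
    """Extract text from a fetched PDF via pdftotext (hypothetical helper)."""
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as f:
        f.write(pdf_bytes)
        pdf_path = f.name
    try:
        # 'pdftotext file.pdf -' writes the extracted text to stdout.
        result = subprocess.run(['pdftotext', pdf_path, '-'],
                                capture_output=True, check=True)
        return result.stdout         # hand this to the indexer
    finally:
        if not debug:                # like the crawler without '-d'
            os.remove(pdf_path)
```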

Files

The software generates all the files the web-form needs. It tries to find the titles of the indexed web-pages. If this fails, it uses the full URL instead.
The default behaviour is to do an HTTP HEAD before an HTTP GET. If the document isn't text, HTML or PDF, no HTTP GET is done. This means that all the websites crawled this way must support HEAD requests.
With the '-b' option it will do a GET straight away, analyse the HTTP response header and abort the download if the Content-Type isn't text, HTML or PDF.
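The two fetch strategies can be sketched with the standard library. This is an illustration under assumptions, not the crawler's code; wanted_type() mirrors the text/HTML/PDF rule above.

```python
# Sketch of HEAD-before-GET (default) versus '-b' (GET and abort on an
# unwanted Content-Type before reading the body).
import urllib.request

WANTED_PREFIXES = ('text/', 'application/pdf')

def wanted_type(content_type):
    """True for text/*, HTML and PDF documents."""
    return content_type.lower().startswith(WANTED_PREFIXES)

def fetch(url, body_probe=False):
    """Return document bytes, or None if the Content-Type isn't wanted."""
    if not body_probe:
        head = urllib.request.Request(url, method='HEAD')
        with urllib.request.urlopen(head) as r:
            if not wanted_type(r.headers.get('Content-Type', '')):
                return None          # default: skip the GET entirely
    with urllib.request.urlopen(url) as r:
        if body_probe and not wanted_type(r.headers.get('Content-Type', '')):
            return None              # '-b': abort before downloading the body
        return r.read()
```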

Non-fetched documents get a 'fake title' in the file 'links-list' (or num-links.list in case of a text dump):

<a href="Url">XXXX XXXX DDD Type</a>

The 'Type' field is one of the four content-types the software keeps track of: Text, Html, Pdf and Other.
Text is text/* except text/html. So that's sources, diffs, scripts, makefiles, man pages, anything text.
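The four-way split can be sketched as a small classifier. The function is hypothetical; the crawler's internals may differ.

```python
# Sketch of the content-type buckets described above:
# Text (text/* except text/html), Html, Pdf, and Other.
def classify(mime):
    """Map a Content-Type value to one of the crawler's four buckets."""
    mime = mime.split(';')[0].strip().lower()   # drop any charset parameter
    if mime == 'text/html':
        return 'Html'
    if mime == 'application/pdf':
        return 'Pdf'
    if mime.startswith('text/'):
        return 'Text'    # sources, diffs, scripts, man pages, ...
    return 'Other'

print(classify('text/x-diff'))                # Text
print(classify('text/html; charset=utf-8'))   # Html
```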

These 'fake titles' are for debugging purposes and should never appear in URL lists produced by the search form; if they do, something is wrong.
Sites which are not in the URL list file shouldn't be in links-list at all.

Abstracts

With '-a' abstracts are produced. Each abstract is 384 bytes: a four-byte document number, 94 four-byte word numbers and a four-byte terminating null. Abstracts are zero-padded.
Web-documents that aren't indexed (e.g. images) have an abstract consisting of 384 null bytes. In text dumps these show up as "0000".
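The layout adds up to 96 four-byte slots (1 + 94 + 1 = 96, and 96 × 4 = 384). A sketch of packing one abstract, assuming native byte order (the byte order isn't stated above):

```python
# Sketch of the 384-byte abstract: document number, up to 94 word numbers,
# a terminating null, then zero padding to 96 four-byte slots.
import struct

def pack_abstract(doc_num, word_nums):
    """Pack one abstract (hypothetical helper mirroring the layout above)."""
    word_nums = list(word_nums)[:94]           # at most 94 word numbers
    slots = [doc_num] + word_nums + [0]        # four-byte terminating null
    slots += [0] * (96 - len(slots))           # zero padding
    return struct.pack('=96I', *slots)         # 96 * 4 = 384 bytes

print(len(pack_abstract(7, [1, 2, 3])))        # 384
print(pack_abstract(0, []) == b'\x00' * 384)   # True: a non-indexed document
```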

Reports

Internal link report

Alphabetical list of all internal links.
Tab-delimited file of: URL number, Redirect number, HTTP response code, Content-type, Charset and URL.
When the charset is not specified or the charset is UTF-8 it lists 'Default'. In case of ISO-8859-1 it lists 'Windows-1252' and in case of ISO-8859-9 'Windows-1254'.
Dead internal links show up with HTTP response code 404.
Redirects to external URLs have redirect number 0000.
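Because the report is tab-delimited, it's easy to post-process. A sketch that pulls out the dead internal links, assuming the field order listed above (the example rows are made up):

```python
# Sketch: scan the tab-delimited internal link report for dead links
# (HTTP response code 404), using the field order described above.
def dead_links(report_lines):
    """Return the URLs of rows whose HTTP response code is 404."""
    dead = []
    for line in report_lines:
        url_no, redir_no, status, ctype, charset, url = \
            line.rstrip('\n').split('\t')
        if status == '404':
            dead.append(url)
    return dead

report = ['0001\t0000\t200\tHtml\tDefault\thttp://www.example.org/',
          '0002\t0000\t404\tOther\tDefault\thttp://www.example.org/gone']
print(dead_links(report))   # ['http://www.example.org/gone']
```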

External link report

Alphabetical list of all external links.
Enabled with '-e' option.
The file can be fed to an external link checker. This way you can keep track of dead links on your website. The script 'chk-rem-lnk.sh' can be used for this purpose. You need both Lynx and Curl for this script.

Download and install

Options


Version

Search:  2021-04-16 08:26:11 UTC
Crawler: 2022-01-10 14:47:39 UTC

Updates

Date        Change
2021-12-19  Fixed typos in the file section of man pages.
2021-12-20  Added PDF support.
2021-12-22  More relaxed URL handling.
2021-12-24  Better tag handling.
            Added charset conversion.
2021-12-25  Better tag handling.
            Added internal and external link reports.
2021-12-26  Fixed typo.
            Added meta http-equiv charset support.
2021-12-27  Added meta charset support.
2021-12-28  Improved redirection support.
2021-12-31  Fixed help typo.
2022-01-05  Added support for HEAD-less servers.
            Added external link check script.
2022-01-06  Fixed chk-rem-lnk.sh dependency bug.
            Tar now extracts in dir 'crawl'.
2022-01-10  Better tag handling.
            Fixed a bug when crawling more than one site.