Rob's search thingy Crawler

Description

This crawler is based on my indexer. This page primarily describes the differences from the indexer, so you may want to read that page first.

This software both crawls and indexes, so you don't need any additional binaries or scripts (though the supplied script can be handy).
You just need a text file with a list of websites to crawl:

http://www.example.org/
https://www.example.com/

You need a trailing slash ('/') for each site!
Furthermore, the software doesn't understand that http://www.example.org/ and https://www.example.org/ are the same thing. Each website must be crawled entirely over HTTP or entirely over HTTPS. Of course, one site being HTTP and another HTTPS is OK.
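The rules above can be checked mechanically before starting a crawl. This is a minimal sketch, not part of the crawler; the function name is illustrative.

```python
# Sketch: validate a crawler URL list against the rules above
# (trailing slash, http/https only, no host listed with both schemes).
from urllib.parse import urlparse

def check_url_list(lines):
    """Return (url, problem) pairs for entries that break the rules."""
    problems = []
    hosts_seen = {}                      # host -> scheme it was first seen with
    for line in lines:
        url = line.strip()
        if not url:
            continue
        parsed = urlparse(url)
        if not url.endswith('/'):
            problems.append((url, 'missing trailing slash'))
        if parsed.scheme not in ('http', 'https'):
            problems.append((url, 'scheme must be http or https'))
        elif hosts_seen.setdefault(parsed.netloc, parsed.scheme) != parsed.scheme:
            problems.append((url, 'same host listed with both http and https'))
    return problems

print(check_url_list(['http://www.example.org/',
                      'https://www.example.org/',    # same host, other scheme
                      'https://www.example.com']))   # no trailing slash
```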

Limits

In addition to the limits of the indexer, this crawler has the following limitations:

Charsets

The software will try to convert legacy charsets to UTF-8. This may reduce the maximum line length to as little as 1023 bytes and the maximum URL and title sizes to as little as 507 bytes.
A lot of people get their charset wrong. So in case of ISO-8859-1, Windows-1252 is assumed. And in case of ISO-8859-9, Windows-1254 is assumed.
Initially the charset is derived from the HTTP response header. If that doesn't specify a charset, the software will try to get it from an HTML HEAD meta http-equiv or meta charset element. The meta element needs to appear before the title: on reading the title, the charset is 'locked'.
If the charset isn't specified, UTF-8 is assumed.
Note: The software will only convert charsets while crawling!
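The resolution order described above can be sketched as follows. This is an illustration, not the crawler's code (the crawler is not written in Python); the function names are made up.

```python
# Sketch of the charset resolution order described above:
# HTTP header first, then HTML meta, else UTF-8, with the documented
# ISO-8859-1 -> Windows-1252 and ISO-8859-9 -> Windows-1254 substitutions.
CHARSET_ALIASES = {'iso-8859-1': 'windows-1252', 'iso-8859-9': 'windows-1254'}

def resolve_charset(http_charset=None, meta_charset=None):
    """Pick the charset to decode with, per the rules above."""
    charset = (http_charset or meta_charset or 'utf-8').lower()
    return CHARSET_ALIASES.get(charset, charset)

def to_utf8(raw_bytes, http_charset=None, meta_charset=None):
    """Convert a fetched document to UTF-8, as the crawler does while crawling."""
    charset = resolve_charset(http_charset, meta_charset)
    return raw_bytes.decode(charset, errors='replace').encode('utf-8')

print(resolve_charset('ISO-8859-1'))      # windows-1252
print(to_utf8(b'caf\xe9', 'iso-8859-1'))  # b'caf\xc3\xa9'
```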

PDF

With the '-p' option the software will index PDF files as well. To this end, PDF files are temporarily saved and then converted to text by pdftotext. The output of pdftotext is then indexed. Unless debug (the '-d' option) is used, PDF files are removed after indexing.
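The save-convert-index-remove cycle looks roughly like this. It's a sketch only: it assumes the poppler 'pdftotext' binary is on the PATH, and what happens with the returned text is up to the indexer.

```python
# Sketch of the PDF path: save the file, run pdftotext, return the text
# for indexing, and remove the temporary file unless debugging.
import os
import subprocess
import tempfile

def pdf_to_text(pdf_bytes, debug=False):
    """Extract text from a fetched PDF via pdftotext (hypothetical helper)."""
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as f:
        f.write(pdf_bytes)
        pdf_path = f.name
    try:
        # 'pdftotext file.pdf -' writes the extracted text to stdout.
        result = subprocess.run(['pdftotext', pdf_path, '-'],
                                capture_output=True, check=True)
        return result.stdout         # hand this to the indexer
    finally:
        if not debug:                # like the crawler without '-d'
            os.remove(pdf_path)
```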

Files

The software generates all the files the web-form needs. It tries to find the titles of the indexed web-pages. If this fails, it uses the full URL instead.
The default behaviour is to do an HTTP HEAD before an HTTP GET. If the document isn't text, HTML or PDF, no HTTP GET is done. This means that all the websites crawled this way must support HEAD requests.
With the '-b' option it will do a GET straight away, analyse the HTTP response header and abort the download if the Content-Type isn't text, HTML or PDF.
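The two fetch strategies can be sketched with the standard library. This is an illustration under assumptions, not the crawler's code; wanted_type() mirrors the text/HTML/PDF rule above.

```python
# Sketch of HEAD-before-GET (default) versus '-b' (GET and abort on an
# unwanted Content-Type before reading the body).
import urllib.request

WANTED_PREFIXES = ('text/', 'application/pdf')

def wanted_type(content_type):
    """True for text/*, HTML and PDF documents."""
    return content_type.lower().startswith(WANTED_PREFIXES)

def fetch(url, body_probe=False):
    """Return document bytes, or None if the Content-Type isn't wanted."""
    if not body_probe:
        head = urllib.request.Request(url, method='HEAD')
        with urllib.request.urlopen(head) as r:
            if not wanted_type(r.headers.get('Content-Type', '')):
                return None          # default: skip the GET entirely
    with urllib.request.urlopen(url) as r:
        if body_probe and not wanted_type(r.headers.get('Content-Type', '')):
            return None              # '-b': abort before downloading the body
        return r.read()
```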

Non-fetched documents get a 'fake title' in the file 'links-list' (or num-links.list in case of a text dump):

<a href="Url">XXXX XXXX DDD Type</a>

The 'Type' field is one of the four content-types the software keeps track of: Text, Html, Pdf and Other.
Text is text/* except text/html. So that's sources, diffs, scripts, makefiles, man pages, anything text.
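The four-way split can be sketched as a small classifier. The function is hypothetical; the crawler's internals may differ.

```python
# Sketch of the content-type buckets described above:
# Text (text/* except text/html), Html, Pdf, and Other.
def classify(mime):
    """Map a Content-Type value to one of the crawler's four buckets."""
    mime = mime.split(';')[0].strip().lower()   # drop any charset parameter
    if mime == 'text/html':
        return 'Html'
    if mime == 'application/pdf':
        return 'Pdf'
    if mime.startswith('text/'):
        return 'Text'    # sources, diffs, scripts, man pages, ...
    return 'Other'

print(classify('text/x-diff'))                # Text
print(classify('text/html; charset=utf-8'))   # Html
```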

These 'fake titles' are for debugging purposes and should never appear in URL lists produced by the search form; if they do, something is wrong.
Sites which are not in the URL list file shouldn't be in links-list at all.

Abstracts

With '-a' abstracts are produced. Each abstract is 384 bytes: a four-byte document number, 94 four-byte word numbers and a four-byte terminating null. Abstracts are zero-padded.
Web-documents that aren't indexed (e.g. images) have an abstract consisting of 384 null bytes. In text dumps these show up as "0000".
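The layout adds up to 96 four-byte slots (1 + 94 + 1 = 96, and 96 × 4 = 384). A sketch of packing one abstract, assuming native byte order (the byte order isn't stated above):

```python
# Sketch of the 384-byte abstract: document number, up to 94 word numbers,
# a terminating null, then zero padding to 96 four-byte slots.
import struct

def pack_abstract(doc_num, word_nums):
    """Pack one abstract (hypothetical helper mirroring the layout above)."""
    word_nums = list(word_nums)[:94]           # at most 94 word numbers
    slots = [doc_num] + word_nums + [0]        # four-byte terminating null
    slots += [0] * (96 - len(slots))           # zero padding
    return struct.pack('=96I', *slots)         # 96 * 4 = 384 bytes

print(len(pack_abstract(7, [1, 2, 3])))        # 384
print(pack_abstract(0, []) == b'\x00' * 384)   # True: a non-indexed document
```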

Reports

Internal link report

Alphabetical list of all internal links.
Tab-delimited file of: URL number, Redirect number, HTTP response code, Content-type, Charset and URL.
When the charset is not specified or the charset is UTF-8 it lists 'Default'. In case of ISO-8859-1 it lists 'Windows-1252' and in case of ISO-8859-9 'Windows-1254'.
Dead internal links show up with HTTP response code 404.
Redirects to external URLs have redirect number 0000.
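Because the report is tab-delimited, it's easy to post-process. A sketch that pulls out the dead internal links, assuming the field order listed above (the example rows are made up):

```python
# Sketch: scan the tab-delimited internal link report for dead links
# (HTTP response code 404), using the field order described above.
def dead_links(report_lines):
    """Return the URLs of rows whose HTTP response code is 404."""
    dead = []
    for line in report_lines:
        url_no, redir_no, status, ctype, charset, url = \
            line.rstrip('\n').split('\t')
        if status == '404':
            dead.append(url)
    return dead

report = ['0001\t0000\t200\tHtml\tDefault\thttp://www.example.org/',
          '0002\t0000\t404\tOther\tDefault\thttp://www.example.org/gone']
print(dead_links(report))   # ['http://www.example.org/gone']
```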

External link report

Alphabetical list of all external links.
Enabled with '-e' option.
The file can be fed to an external link checker. This way you can keep track of dead links on your website. The script 'chk-rem-lnk.sh' can be used for this purpose. You need both Lynx and Curl for this script.

Download and install

Options


Version

Search:  2021-04-16 08:26:11 UTC
Crawler: 2022-01-10 14:47:39 UTC

Updates

Date        Change
2021-12-19  Fixed typos in the file section of man pages.
2021-12-20  Added PDF support.
2021-12-22  More relaxed URL handling.
2021-12-24  Better tag handling.
            Added charset conversion.
2021-12-25  Better tag handling.
            Added internal and external link reports.
2021-12-26  Fixed typo.
            Added meta http-equiv charset support.
2021-12-27  Added meta charset support.
2021-12-28  Improved redirection support.
2021-12-31  Fixed help typo.
2022-01-05  Added support for HEAD-less servers.
            Added external link check script.
2022-01-06  Fixed chk-rem-lnk.sh dependency bug.
            Tar now extracts in dir 'crawl'.
2022-01-10  Better tag handling.
            Fixed a bug when crawling more than one site.