Rob's search thingy - Crawler installation

Download

Download the source archive: webcrawl.tar.gz
Extract it with 'tar xvfz webcrawl.tar.gz'.
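If you want to inspect the archive before (or instead of) unpacking it, 'tar tvfz' lists its contents:

tar tvfz webcrawl.tar.gz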

Compilation

Crawler

You can compile the source into a mere indexer;

cc -O2 -Wall -o cgi-index cgi-crawl.c

Using '-D' to set a define, you get a crawler instead;

cc -O2 -Wall -DCGS_WITH_HTTP -o cgi-crawl cgi-crawl.c -lcurl

For this to work, you need the libcurl development package (called lib-curl-devel or similar, depending on your distribution) and all of its dependencies installed.
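On most distributions the development files come straight from the package manager; the package names below are the common ones, verify the right one for your system:

apt-get install libcurl4-gnutls-dev    # Debian/Ubuntu
dnf install libcurl-devel              # Fedora/RHEL

The two ldd listings below clearly show the difference between the indexer and the crawler;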

~$ ldd cgi-index
	linux-vdso.so.1 (0x00007ffdb9108000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3b8d254000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f3b8d823000)

~$ ldd cgi-crawl
	linux-vdso.so.1 (0x00007ffd733e5000)
	libcurl-gnutls.so.4 => /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4 (0x00007f9993956000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f99935b7000)
	libnghttp2.so.14 => /usr/lib/x86_64-linux-gnu/libnghttp2.so.14 (0x00007f9993390000)
	libidn2.so.0 => /usr/lib/x86_64-linux-gnu/libidn2.so.0 (0x00007f999316e000)
	librtmp.so.1 => /usr/lib/x86_64-linux-gnu/librtmp.so.1 (0x00007f9992f51000)
	libssh2.so.1 => /usr/lib/x86_64-linux-gnu/libssh2.so.1 (0x00007f9992d24000)
	libpsl.so.5 => /usr/lib/x86_64-linux-gnu/libpsl.so.5 (0x00007f9992b16000)
	libnettle.so.6 => /usr/lib/x86_64-linux-gnu/libnettle.so.6 (0x00007f99928df000)
	libgnutls.so.30 => /usr/lib/x86_64-linux-gnu/libgnutls.so.30 (0x00007f9992546000)
	libgssapi_krb5.so.2 => /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007f99922fb000)
	libkrb5.so.3 => /usr/lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007f9992021000)
	libk5crypto.so.3 => /usr/lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007f9991dee000)
	libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007f9991bea000)
	liblber-2.4.so.2 => /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2 (0x00007f99919db000)
	libldap_r-2.4.so.2 => /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2 (0x00007f999178a000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9991570000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f9991353000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f9993e4b000)
	libunistring.so.0 => /usr/lib/x86_64-linux-gnu/libunistring.so.0 (0x00007f999103c000)
	libhogweed.so.4 => /usr/lib/x86_64-linux-gnu/libhogweed.so.4 (0x00007f9990e07000)
	libgmp.so.10 => /usr/lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f9990b84000)
	libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x00007f9990874000)
	libp11-kit.so.0 => /usr/lib/x86_64-linux-gnu/libp11-kit.so.0 (0x00007f999060f000)
	libidn.so.11 => /lib/x86_64-linux-gnu/libidn.so.11 (0x00007f99903db000)
	libtasn1.so.6 => /usr/lib/x86_64-linux-gnu/libtasn1.so.6 (0x00007f99901c8000)
	libkrb5support.so.0 => /usr/lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007f998ffbc000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f998fdb8000)
	libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007f998fbb4000)
	libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007f998f99d000)
	libsasl2.so.2 => /usr/lib/x86_64-linux-gnu/libsasl2.so.2 (0x00007f998f782000)
	libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x00007f998f56e000)
	libffi.so.6 => /usr/lib/x86_64-linux-gnu/libffi.so.6 (0x00007f998f365000)

The crawler has an additional '-c' option, which makes it crawl instead of just index.
If you want to index PDFs (the '-p' option), you also need pdftotext, which is part of poppler-utils.
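pdftotext is packaged as part of poppler-utils on the common distributions; installing it is typically a one-liner:

apt-get install poppler-utils    # Debian/Ubuntu
dnf install poppler-utils        # Fedora/RHEL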

Search

cc -O2 -Wall -o cgi-search cgi-search.c

Util

cc -O2 -Wall -o findwebpath findwebpath.c

Installation

Do this as root.
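If you prefer not to open a root shell, prefixing each command below with sudo works just as well, for example:

sudo cp cgi-crawl /usr/local/bin/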

Binaries

cp cgi-crawl /usr/local/bin/
cp cgi-search /usr/lib/cgi-bin/
cp findwebpath /usr/local/bin/
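If you want the mode set explicitly while copying, install(1) does the same job as cp. Note that /usr/lib/cgi-bin is the Debian default CGI directory for Apache; adjust the path if your web server's ScriptAlias points elsewhere.

install -m 755 cgi-crawl /usr/local/bin/
install -m 755 cgi-search /usr/lib/cgi-bin/
install -m 755 findwebpath /usr/local/bin/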

Scripts

cp chk-rem-lnk.sh /usr/local/bin/
cp gen-cgi-crawl.sh /usr/local/bin/

The remote link check script needs both Lynx and Curl.
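Both are packaged everywhere; the usual install commands are:

apt-get install lynx curl    # Debian/Ubuntu
dnf install lynx curl        # Fedora/RHEL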

Man pages

cp cgi-crawl.1 /usr/local/share/man/man1/
cp cgi-search.1 /usr/local/share/man/man1/
cp findwebpath.1 /usr/local/share/man/man1/
gzip /usr/local/share/man/man1/cgi-crawl.1
gzip /usr/local/share/man/man1/cgi-search.1
gzip /usr/local/share/man/man1/findwebpath.1
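If /usr/local/share/man/man1 does not exist yet, create it before copying, and refresh the manual page index afterwards (the command is mandb on man-db based systems):

mkdir -p /usr/local/share/man/man1
mandb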

Documentation

cp file-formats /usr/local/share/doc/websearch/
cp headers /usr/local/share/doc/websearch/
cp README /usr/local/share/doc/websearch/
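The documentation directory is not created by anything else; if it is missing, create it first:

mkdir -p /usr/local/share/doc/websearch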

Example webform

cp index.html /var/www/search/

The directory needs to exist.
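It is not created automatically; if it is missing, create it before copying (the web server only needs read access, so default root ownership is fine):

mkdir -p /var/www/search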

Use

The following directories need to exist;

They have to be (group) writable by the crawler process owner.
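As an illustration only, assuming the crawler runs in a group called 'websearch' and one of those directories is /var/lib/websearch (both names are made up here, substitute the real ones):

chgrp websearch /var/lib/websearch
chmod g+w /var/lib/websearch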

Put the sites you want to crawl in a list file ('urls.list' in the example script);

http://www.example.org/
https://www.example.com/

And run the crawl script.
Don't run the script as root.
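A run could then look like this, assuming gen-cgi-crawl.sh is the crawl script meant above and that it reads 'urls.list' from the current directory (check the script and adjust its paths if needed):

~$ cat urls.list
http://www.example.org/
https://www.example.com/
~$ gen-cgi-crawl.sh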

Indexer

The installation and use of the indexer are the same as before.