SputNlBot installation

Download

Download the source: webcrawl.tar.gz
Extract with 'tar xvfz webcrawl.tar.gz'.

Compilation

Crawler vs indexer

You can compile the crawler source into a mere indexer;

~$ cc -O2 -Wall -o cgi-index cgi-crawl.c

Using '-D' to set a define, you get a crawler instead;

~$ cc -O2 -Wall -DCGS_WITH_HTTP -lcurl -o cgi-crawl cgi-crawl.c
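
The define works through plain conditional compilation: with CGS_WITH_HTTP set, the HTTP code (and with it the libcurl dependency) is compiled in; without it, that code is left out entirely. The snippet below is only a sketch of the idea, not the actual cgi-crawl.c source; fetch_url() and its behaviour are made up for illustration;

#include <stdio.h>

#ifdef CGS_WITH_HTTP
#include <curl/curl.h>	/* only pulled in for the crawler build */
#endif

/* Illustrative helper: with CGS_WITH_HTTP the page is fetched over
 * HTTP via libcurl, otherwise there is no network code at all. */
static int fetch_url(const char *url)
{
#ifdef CGS_WITH_HTTP
	CURL *curl = curl_easy_init();
	int rc = -1;

	if (curl) {
		curl_easy_setopt(curl, CURLOPT_URL, url);
		if (curl_easy_perform(curl) == CURLE_OK)
			rc = 0;
		curl_easy_cleanup(curl);
	}
	return rc;
#else
	fprintf(stderr, "no HTTP support compiled in, skipping %s\n", url);
	return -1;
#endif
}

int main(int argc, char **argv)
{
	return (argc > 1 && fetch_url(argv[1]) == 0) ? 0 : 1;
}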

To build the crawler, you need the libcurl development package and all of its dependencies installed. The ldd output below clearly shows the difference;

~$ ldd cgi-index
	linux-vdso.so.1 (0x00007ffdb9108000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3b8d254000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f3b8d823000)

~$ ldd cgi-crawl
	linux-vdso.so.1 (0x00007ffd733e5000)
	libcurl-gnutls.so.4 => /usr/lib/x86_64-linux-gnu/libcurl-gnutls.so.4 (0x00007f9993956000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f99935b7000)
	libnghttp2.so.14 => /usr/lib/x86_64-linux-gnu/libnghttp2.so.14 (0x00007f9993390000)
	libidn2.so.0 => /usr/lib/x86_64-linux-gnu/libidn2.so.0 (0x00007f999316e000)
	librtmp.so.1 => /usr/lib/x86_64-linux-gnu/librtmp.so.1 (0x00007f9992f51000)
	libssh2.so.1 => /usr/lib/x86_64-linux-gnu/libssh2.so.1 (0x00007f9992d24000)
	libpsl.so.5 => /usr/lib/x86_64-linux-gnu/libpsl.so.5 (0x00007f9992b16000)
	libnettle.so.6 => /usr/lib/x86_64-linux-gnu/libnettle.so.6 (0x00007f99928df000)
	libgnutls.so.30 => /usr/lib/x86_64-linux-gnu/libgnutls.so.30 (0x00007f9992546000)
	libgssapi_krb5.so.2 => /usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2 (0x00007f99922fb000)
	libkrb5.so.3 => /usr/lib/x86_64-linux-gnu/libkrb5.so.3 (0x00007f9992021000)
	libk5crypto.so.3 => /usr/lib/x86_64-linux-gnu/libk5crypto.so.3 (0x00007f9991dee000)
	libcom_err.so.2 => /lib/x86_64-linux-gnu/libcom_err.so.2 (0x00007f9991bea000)
	liblber-2.4.so.2 => /usr/lib/x86_64-linux-gnu/liblber-2.4.so.2 (0x00007f99919db000)
	libldap_r-2.4.so.2 => /usr/lib/x86_64-linux-gnu/libldap_r-2.4.so.2 (0x00007f999178a000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f9991570000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f9991353000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f9993e4b000)
	libunistring.so.0 => /usr/lib/x86_64-linux-gnu/libunistring.so.0 (0x00007f999103c000)
	libhogweed.so.4 => /usr/lib/x86_64-linux-gnu/libhogweed.so.4 (0x00007f9990e07000)
	libgmp.so.10 => /usr/lib/x86_64-linux-gnu/libgmp.so.10 (0x00007f9990b84000)
	libgcrypt.so.20 => /lib/x86_64-linux-gnu/libgcrypt.so.20 (0x00007f9990874000)
	libp11-kit.so.0 => /usr/lib/x86_64-linux-gnu/libp11-kit.so.0 (0x00007f999060f000)
	libidn.so.11 => /lib/x86_64-linux-gnu/libidn.so.11 (0x00007f99903db000)
	libtasn1.so.6 => /usr/lib/x86_64-linux-gnu/libtasn1.so.6 (0x00007f99901c8000)
	libkrb5support.so.0 => /usr/lib/x86_64-linux-gnu/libkrb5support.so.0 (0x00007f998ffbc000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f998fdb8000)
	libkeyutils.so.1 => /lib/x86_64-linux-gnu/libkeyutils.so.1 (0x00007f998fbb4000)
	libresolv.so.2 => /lib/x86_64-linux-gnu/libresolv.so.2 (0x00007f998f99d000)
	libsasl2.so.2 => /usr/lib/x86_64-linux-gnu/libsasl2.so.2 (0x00007f998f782000)
	libgpg-error.so.0 => /lib/x86_64-linux-gnu/libgpg-error.so.0 (0x00007f998f56e000)
	libffi.so.6 => /usr/lib/x86_64-linux-gnu/libffi.so.6 (0x00007f998f365000)
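
The package name for the libcurl development files differs per distribution. On a Debian or Ubuntu system (which is what the library paths above suggest), the GnuTLS flavour shown in the ldd output is provided by, for example;

~# apt install libcurl4-gnutls-dev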

The crawler has an additional '-c' option, which makes it crawl instead of just index.
If you want to index PDF files (the '-p' option), you also need pdftotext, which is part of poppler-utils.
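
pdftotext is a separate program rather than a linked library (it does not appear in the ldd output above), so it is only needed at run time when '-p' is used. On a Debian-based system it can be installed with, for example;

~# apt install poppler-utils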

Using make

The software comes with a small Makefile. A 'make' will compile all of the binaries;

~$ make
cc -O2 -Wall -o cgi-index cgi-crawl.c
cc -O2 -Wall -DCGS_WITH_HTTP -lcurl -o cgi-crawl cgi-crawl.c
cc -O2 -Wall -o cgi-search cgi-search.c
cc -O2 -Wall -o findwebpath findwebpath.c
cc -O2 -Wall -o fndtitle fndtitle.c
cc -O2 -Wall -o gen-num-index gen-num-index.c
cc -O2 -Wall -o gensynontab gensynontab.c
cc -O2 -Wall -o url2file url2file.c

If the compilation of cgi-search causes problems, see: custom.h.

Installation

A 'make install' will run the install script. Do this as root;

~# make install

Files

Binaries and scripts are installed in '/usr/local/bin/'.
Man pages are installed in '/usr/local/share/man/man1/'.
Documentation is installed in '/usr/local/share/doc/websearch/'.
The search form is installed in '/var/www/search/'.

If any of these directories do not exist, the install script will create them for you.
Files will only be copied if they do not already exist in the target directory or if the version in the target directory is older.

Directories

The install script creates the data directory '/var/local/lib/websearch/' and its subdirectories.

You need to give ownership of these directories to the user the indexer process runs as: if the indexer runs as user 'foo' in group 'bar', set the following permissions;

~# cd /var/local/lib/
~# chmod g+w websearch
~# chown :bar websearch
~# cd websearch/
~# chown -R foo:bar *
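
A quick check that the changes took effect: the directory should now be group-writable and owned by group 'bar';

~# ls -ld /var/local/lib/websearch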

Use

Put the sites you want to crawl in a list file ('urls.list' in the example script);

http://www.example.org/
https://www.example.com/

Then run the crawl script.
Do not run the script as root.
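
For example, if the indexer user from the previous section is 'foo' and the example script is saved as 'crawl.sh' (the actual name of the script in your copy may differ), it can be started as that user with;

~$ sudo -u foo ./crawl.sh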

Indexer

Installation and use of the indexer are the same as before.