Indexing Web with Head-r
Head-r is a free Perl program that recursively follows links found on (HTML) Web pages hosted on an HTTP server, and performs HEAD requests on those links that are of interest to the user.
The intended use for this program is to create URI lists for later selective mirroring of file-hosting sites.
Synopsis
head-r [-v|--verbose] [-j|--bzip2|-z|--gzip] [--include-re=RE] [--exclude-re=RE] [--depth=N] [--info-re=RE] [--descend-re=RE] [-i|--input=FILE]... [-o|--output=FILE] [-P|--no-proxy] [-U|--user-agent=USER-AGENT] [-w|--wait=DELAY] [--] [URI]...
Basic usage
Arguably, the most important Head-r options are --info-re= and --descend-re=, which determine (by means of regular expressions) which URIs will be considered for mere HEAD requests, and which ones Head-r will try to get more URIs from.
Simplistic, no-recursion example
For the following example, we'll use . (a regular expression that matches any non-empty string) to allow Head-r to make HEAD requests to both of the URIs given.
$ head-r --info-re=. \
      -- http://example.org/ http://example.net/
http://example.org/	1381334900	1	1270	200
http://example.net/	1381334903	1	1270	200
The fields are delimited with ASCII HT (also known as TAB) codes, and are as follows:
- URI;
- timestamp (in seconds since the system-dependent epoch; see also Unix time);
- recursion depth used when considering this URI;
- the length of the response in octets (as per the Content-Length: HTTP reply header);
- HTTP status code of the reply.
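Since the output is plain TAB-separated text, such lists are easy to post-process with standard tools. As a minimal sketch (assuming the list shown above was redirected to a hypothetical file named list.txt), the following Perl one-liner prints the length and URI of every record that came back with a 200 status:
$ perl -F'\t' -lane 'print "$F[3]\t$F[0]" if @F >= 5 && $F[4] == 200' \
      < list.txt
Any other tool that understands TAB-separated fields, such as cut(1) or AWK, would work just as well.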
Recurse once example
For the following example, we'll also enable actual recursion (still at a maximum depth of 1), by using the --descend-re=/\$ option.
$ head-r --info-re=. --descend-re=/\$ \
      -- http://example.org/ http://example.net/
http://example.org/	1381337824	1	1270	200
http://www.iana.org/domains/example	1381337829	0	200
http://example.net/	1381337830	1	1270	200
As can be seen, at http://example.org/ Head-r found another URI to consider: http://www.iana.org/domains/example, which it followed and issued a HEAD request for.
It's easy to check that http://example.net/ actually references the same URI as well. However, as Head-r remembers the URIs it processes (along with the recursion depth at that point), no other request was issued.
Limiting HEAD requests
Consider now that the resource we're to recurse through references URIs that are of no interest to us. For the following example, we'll use a more selective regular expression than the . we've used above.
$ head-r --{info,descend}-re=wikipedia\\.org/wiki/ \
      -- http://en.wikipedia.org/wiki/Main_Page
http://en.wikipedia.org/wiki/Main_Page	1381339589	1	61499	200
. . .
http://en.wikipedia.org/w/api.php?action=rsd
http://creativecommons.org/licenses/by-sa/3.0/
. . .
http://meta.wikimedia.org/
http://en.wikipedia.org/wiki/Wikipedia	1381339589	0	609859	200
http://en.wikipedia.org/wiki/Free_content	1381339589	0	124407	200
. . .
(Please note that we've just used the Bash {,} expansion to pass the same regular expression to both --info-re= and --descend-re=; that is, the command above is equivalent to spelling out --info-re=wikipedia\\.org/wiki/ --descend-re=wikipedia\\.org/wiki/ explicitly. Be sure to adjust to the command line interpreter actually in use.)
In the output above, a number of URIs came without any of the usual information. These URIs were found by Head-r, but as they matched neither the "info" (--info-re=) nor the "descend" (--descend-re=) regular expression specified, no action was taken on them. The URIs are still output, however, just in case we decide to adjust the regular expressions themselves.
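Such bare URIs are easy to tell apart from the rest, as they are the only output lines that carry no TAB-delimited fields. As a sketch (again assuming the list was saved to a hypothetical list.txt file), the following Perl one-liner extracts them for review:
$ perl -lne 'print unless /\t/' < list.txt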
Skipping unwanted URIs altogether
The --include-re= and --exclude-re= regular expressions are considered before all the other ones, and currently have the following semantics:
- the inclusion regular expression is applied first; the URI will be considered if it matches one;
- unless decided at the step above, the exclusion regular expression is then applied; the URI will not be considered if it matches one;
- unless decided by the rules above, the URI will be considered.
If none of these options are given, any URI will be considered by Head-r.
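The following Perl subroutine is not Head-r's actual code, but a minimal sketch of the same decision order; the $include_re and $exclude_re variables are assumed to hold the respective regular expressions, or to be undefined when the corresponding option was not given:
sub uri_considered_p {
    my ($uri, $include_re, $exclude_re) = @_;
    # 1. the inclusion regular expression is applied first:
    #    a match means the URI is considered
    return 1 if defined ($include_re) && $uri =~ /$include_re/;
    # 2. otherwise, the exclusion regular expression:
    #    a match means the URI is not considered
    return 0 if defined ($exclude_re) && $uri =~ /$exclude_re/;
    # 3. otherwise (including when neither option is given),
    #    the URI is considered
    return 1;
}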
The following example exploits these options to further limit the output of Head-r for the case above.
$ head-r --{include,descend}-re=wikipedia\\.org/wiki/ \
      --{info,exclude}-re=. \
      -- http://en.wikipedia.org/wiki/Main_Page
http://en.wikipedia.org/wiki/Main_Page	1381341336	1	61499	200
http://en.wikipedia.org/wiki/Wikipedia	1381341337	0	609859	200
http://en.wikipedia.org/wiki/Free_content	1381341337	0	124407	200
http://en.wikipedia.org/wiki/Encyclopedia	1381341337	0	151164	200
http://en.wikipedia.org/wiki/Wikipedia:Introduction	1381341337	0	50687	200
. . .
Saving state between sessions
Head-r is capable of reading its own output, so as to avoid issuing duplicate HEAD requests, and also to discover the URIs of the resources to recurse into.
Restoring what was saved
Let us revisit one of our previous examples, which we'll now alter to only issue a HEAD request to a couple of pages:
$ head-r --output=state.a \
      --info-re='/(Free_content|Wikipedia)$' \
      --descend-re=wikipedia\\.org/wiki/ \
      -- http://en.wikipedia.org/wiki/Main_Page
$ grep -E \\s < state.a
http://en.wikipedia.org/wiki/Main_Page	1381417546	1	61499	200
http://en.wikipedia.org/wiki/Wikipedia	1381417546	0	609859	200
http://en.wikipedia.org/wiki/Free_content	1381417546	0	124407	200
$
Now, why not include a few more pages, such as all the pages with names starting with F?
$ head-r \
      --input=state.a --output=state.b \
      --info-re=/wiki/F \
      --descend-re=wikipedia\\.org/wiki/
$ grep -E \\s < state.b
http://en.wikipedia.org/wiki/File:Diary_of_a_Nobody_first.jpg	1381417906	0	34344	200
http://en.wikipedia.org/wiki/File:Progradungula_otwayensis_cropped.png	1381417906	0	30604	200
http://en.wikipedia.org/wiki/File:AW_TW_PS.jpg	1381417907	0	33297	200
http://en.wikipedia.org/wiki/Fran%C3%A7ois_Englert	1381417907	0	87860	200
http://en.wikipedia.org/wiki/File:Washington_Monument_Dusk_Jan_2006.jpg	1381417907	0	83137	200
http://en.wikipedia.org/wiki/File:Walt_Disney_Concert_Hall,_LA,_CA,_jjron_22.03.2012.jpg	1381417907	0	67225	200
http://en.wikipedia.org/wiki/Frank_Gehry	1381417907	0	152838	200
$
Note that while our --info-re= has obviously covered http://en.wikipedia.org/wiki/Free_content, no HEAD request was made to that page, as our --input=state.a file already had the relevant information.
Also, as all the URIs we wanted Head-r to consider were already listed in state.a, it was unnecessary to specify any URIs on the command line. When the URIs come from both command line arguments and --input= files, those coming from the command line are considered first.
Compression
As recursing through large Web sites may result in large output lists, Head-r provides support for compression of output data.
The --bzip2 (-j) and --gzip (-z) options select the compression method to use for the output file (either specified with --output=, or the standard output). Head-r, however, will exit with an error if compression is enabled and the output goes to a terminal device.
Head-r transparently decompresses the files given as inputs (--input=), thanks to the IO::Uncompress::AnyUncompress library.
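The same module makes it easy to read such lists back in one's own scripts, whether compressed or not. The snippet below is only a sketch (the state.gz file name is an example, not anything Head-r mandates):
use IO::Uncompress::AnyUncompress qw($AnyUncompressError);

# open the (possibly compressed) Head-r list for reading
my $z = IO::Uncompress::AnyUncompress->new ("state.gz")
    or die ("AnyUncompress failed: $AnyUncompressError\n");
while (my $line = <$z>) {
    chomp ($line);
    # split the record into its TAB-delimited fields
    my ($uri, $time, $depth, $length, $status) = split (/\t/, $line);
    # ... process the record here ...
}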
Adjusting HTTP client behavior
There are two options which influence the behavior of the HTTP client used by Head-r: --wait= (-w) and --user-agent= (-U).
The --wait= option specifies the amount of time, in seconds, to wait between two consecutive HTTP requests. The default is about 2.7 seconds.
The --user-agent= option specifies the value of the User-Agent: header to use in HTTP requests, and may come in handy should the target server block access based on this header's data. The default is composed of the string HEAD-R-Bot/, Head-r's own version, and the identity of the libwww-perl library used; for example: HEAD-R-Bot/0.1 libwww-perl/6.05.
Bugs
Please consider reporting any bugs in the Head-r software not listed below via the CPAN RT, https://rt.cpan.org/Public/Dist/Display.html?Name=head-r. Bugs in this documentation should be reported to the respective Wikibooks Talk page, or you may actually fix them yourself!
As with any other automatic retrieval tool, it isn't impossible to abuse Head-r to cause excessive load on third-party servers. The user is advised to consider the network environment when using the tool, especially when lowering the --wait= setting or raising the maximum recursion --depth= beyond reasonable values.
There's currently no way to disable the /robots.txt file processing.
The code only tries to retrieve URIs from content marked with the text/html media type, even though it seems as if the support for application/xhtml+xml (and perhaps several other XML-based types, such as SVG) could be implemented rather easily.
The resource to retrieve URIs from is first loaded into memory as a whole, while it should be possible to process it on the fly.
The handling of recursion depths retrieved from --input= files may be somewhat unintuitive, and out of the user's control. (Although it's still possible to edit such files using third-party tools, such as AWK.)
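For instance, the following one-liner (a Perl counterpart to the AWK approach just mentioned; the state.b file name, the /wiki/ filter, and the target depth of 2 are mere assumptions) rewrites the third, recursion depth, field of selected records in place; whether the rewritten values then have the desired effect is still subject to the handling described above:
$ perl -i.orig -F'\t' -lane \
      'BEGIN { $, = "\t" } $F[2] = 2 if @F >= 5 && $F[0] =~ m{/wiki/}; print @F' \
      state.b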
The code implements a trivial work-around for the long-standing Net::HTTP bug #29468.
Availability
The latest stable version of the code is available from CPAN. Check, for instance, the respective Metacpan page at https://metacpan.org/release/head-r.
The latest development version can be downloaded from the Git repository, like so:
$ git clone -- \
      http://am-1.org/~ivan/archives/git/head-r-2013.git/ head-r
A Gitweb interface is available at http://am-1.org/~ivan/archives/git/gitweb.cgi?p=head-r-2013.git.
Author
Head-r is written by Ivan Shmakov.
Head-r is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This documentation is a free collaborative project going on at Wikibooks, and is available under the Creative Commons Attribution/Share-Alike License (CC BY-SA) version 3.0.