Jump to content

A Link Rot Bestiary/Chapter 4 : Redirects

From Wikibooks, open books for an open world

This chapter describes redirects and variations.

Redirect

[edit | edit source]

A redirect is when a link redirects from one page to another. The destination of the redirect is ideally a working page ie. the content we are looking for. The most commonly encountered redirect is from http:// to https:// .. redirects are created and maintained by the website owners.

If the destination is not a working page, it can (also) be described as a dead link, soft 404, etc.. depending on what the final page is.

Missing redirect

[edit | edit source]

Also known as a potential redirect. A URL that is inoperable (404), but might exist on the live web at a different URL. Normally webmasters who change a site will leave redirects for old links to reach the new page. This does not always happen.

Mapped redirect

[edit | edit source]

A missing redirect that has been successfully mapped to a destination URL. Techniques for mapping redirects are discussed below.

Ruled mapped redirect

[edit | edit source]

A ruled mapped redirect is one in which the new URL is determined through a transformation rule. For example a rule that transforms this:

http://www.foxnews.com/science/2013/03/25/what-is-nothing-physicists-debate/

to this:

https://www.foxnews.com/science/what-is-nothing-physicists-debate/

Is a hard coded rule: remove the date

Inferred mapped redirect

[edit | edit source]

Inferred mapped redirects infer ("guess") what the new URL might be by parsing information from other sources, such as the |title= or |date= of a citation. Inferred mapped redirects might have multiple guesses that are added to an inference table which are checked until one is found to work.

Given this citation:

{{cite web | url = http://www.foxnews.com/entertainment/2014/07/18/weird-al-yankovic-adapint-to-digital-age/ | title = Weird Al Yankovic adapting to digital age }}

Notice the word -adapint- in the URL. This is probably an error. Looking at the |title= field, we can infer it might be "adapting". Thus a new URL is created by converting the title string into a URL (ie. transform spaces into "-", convert to lower-case, and remove punctuation). The new URL is added to the inference table, and each entry in that table is checked for status 200. Titles can also be retrieved from a Wayback Machine version of the original URL. In this way multiple possibilities can be added to the inference table.

Ruled inferred mapped redirect

[edit | edit source]

Sometimes mapped redirects can be transformed via rules, except for one pesky number. Take for example:

old: https://www.avclub.com/articles/come-fly-with-me,33816/
new: https://www.avclub.com/come-fly-with-me-1798168639

The first rule is remove "/articles/". But what about converting 33816 to 1798168639? There is no rule for that, we need a mapping. Where to find a map?

The Wayback Machine has an API service called CDX which allows for wildcard searching of URLs. It is a map of the territory.

Try:

https://web.archive.org/cdx/search/cdx?url=avclub.com/come-fly-with-me&matchType=prefix

The &matchType=prefix will match all URLs that start with 'avclub.com/come-fly-with-me' which produces output:

com,avclub)/come-fly-with-me-1798168639 20171204023942 http://www.avclub.com/come-fly-with-me-1798168639 unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 938
com,avclub)/come-fly-with-me-1798168639 20210731041101 https://www.avclub.com/come-fly-with-me-1798168639 text/html 200 GX2ISK7QB4KFDBRF4LXQGCMP4MBKRVIH 40487

Of interest in these two lines are the third and fifth fields, which indicate the URL, and its status code (we are looking for "200" ie. the second line). This is how to find 798168639.

A note of caution: because it's a wildcard search, there can be multiple results. Try:

https://web.archive.org/cdx/search/cdx?url=avclub.com/stephne-colbert&matchType=prefix

Thus it is important to check for multiple different URL matches, and if so, either skip or log for manual review, because multiple different URLs can not be automatically determined.

The above contains domain-specific rules, thus it is called "ruled" mapped redirects. The rules are specific to avclub.com - And it's an "inference" (guess) because of the matchType=prefix wildcard search. Not every domain will have rules like this example, and not every domain will have inferred matches. Nevertheless with some checking and testing beforehand, one can build code for ruled inferred mapped redirect matching.