A Link Rot Bestiary/Chapter 3 : Soft 404
A soft 404 is a URL that serves content different from the original. For example, http://www.foxnews.com/us/2009/09/13/tennis-great-jack-kramer-dead/ redirects to https://www.foxnews.com/us instead of the original article.
Soft 404s most commonly manifest as redirects, as in the foxnews.com example. However, they can also be static pages whose content has changed; this is called content drift. The classic example is a weather report. Other examples are sports scores and financial prices.
Soft 404s can also be domain name squatters, blank pages, content management changes, spam sites, bot blockers, or rate limiters; the possibilities are endless. Conceptually, the page returns a status of 200 but does not return the intended content. It is in effect a 404, and thus "soft".
Detection methods
Soft 404s are notoriously difficult to detect. This section describes some methods.
Key phrases
One method is to download the HTML content of a page and search it for known key phrases such as "No page found". This method is limited by the wide variety of possible phrases in English, let alone in the world's many other languages.
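The key phrase search can be sketched in a few lines of Python. This is only an illustration, not an established tool: the function name contains_key_phrase and the phrase list KEY_PHRASES are hypothetical, the tag stripping is deliberately crude, and the HTML is assumed to have been downloaded already.

```python
import re

# Hypothetical phrase list; a real one would be far longer and multilingual.
KEY_PHRASES = ("no page found", "page not found", "404 not found")

def contains_key_phrase(html, phrases=KEY_PHRASES):
    """Return True if the page text contains any known error phrase."""
    text = re.sub(r"<[^>]+>", " ", html)      # crude removal of HTML tags
    text = re.sub(r"\s+", " ", text).lower()  # collapse whitespace, ignore case
    return any(phrase in text for phrase in phrases)

print(contains_key_phrase("<h1>No page found</h1>"))
print(contains_key_phrase("<p>Tennis great Jack Kramer dead</p>"))
```

A page that merely discusses "page not found" errors would also match, so a hit is better treated as a signal to review than as proof.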
URL analysis
URLs can be analyzed for soft 404s. In the foxnews.com example above, the redirect URL has a much shorter path than the original URL. Likewise, a URL that itself contains "404", such as .com/404.htm, is a signal.
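Both URL signals can be expressed as a small heuristic. The sketch below is one possible reading: the function names and the min_drop threshold (how many path segments must disappear before the redirect counts as "much shorter") are assumptions chosen for illustration, and the example.com URL is a placeholder.

```python
from urllib.parse import urlsplit

def path_segments(url):
    """Split the path of a URL into its non-empty segments."""
    return [s for s in urlsplit(url).path.split("/") if s]

def url_suggests_soft_404(source_url, final_url, min_drop=2):
    """Heuristic check of the URL a request finally landed on."""
    src = path_segments(source_url)
    dst = path_segments(final_url)
    # Signal 1: the final URL names an error page, e.g. /404.htm
    if dst and "404" in dst[-1]:
        return True
    # Signal 2: the final path lost at least min_drop segments (assumed threshold)
    return len(src) - len(dst) >= min_drop

print(url_suggests_soft_404(
    "http://www.foxnews.com/us/2009/09/13/tennis-great-jack-kramer-dead/",
    "https://www.foxnews.com/us",
))
```

The "404" check can misfire on paths that happen to contain those digits, such as an article ID, so it works best combined with other signals.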
Logging rules
Soft 404s are most often the result of redirects. This is because websites make changes but fail to leave a redirect to each page's new location, instead defaulting everything to a home page (as in the foxnews.com example above). Knowing this, it is possible to query a large number of URLs within a single domain and record the source URL and redirect URL in a two-column table. It might look like:
- <source URL 1> <redirect URL 1>
- <source URL 2> <redirect URL 2>
- <source URL 3> <redirect URL 1>
- <source URL 4> <redirect URL 4>
- <source URL 5> <redirect URL 1>
Here we see that in column 2, "redirect URL 1" repeats three times. This is a strong signal of a soft 404. Once the soft 404 table is generated, rules can be added so that the next time the process runs, it treats any URL redirecting to that target as a dead link.
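The table analysis above can be sketched as a count of how often each redirect target appears. The function names, the min_count threshold of 3, and the placeholder strings standing in for real URLs are all assumptions made for this example.

```python
from collections import Counter

def repeated_redirect_targets(pairs, min_count=3):
    """Return redirect targets shared by at least min_count source URLs."""
    counts = Counter(target for source, target in pairs if source != target)
    return {target for target, n in counts.items() if n >= min_count}

def dead_sources(pairs, min_count=3):
    """Return source URLs whose redirect target looks like a soft 404."""
    suspects = repeated_redirect_targets(pairs, min_count)
    return [s for s, t in pairs if s != t and t in suspects]

# Placeholder rows mirroring the two-column table above.
log = [
    ("source URL 1", "redirect URL 1"),
    ("source URL 2", "redirect URL 2"),
    ("source URL 3", "redirect URL 1"),
    ("source URL 4", "redirect URL 4"),
    ("source URL 5", "redirect URL 1"),
]
print(repeated_redirect_targets(log))
print(dead_sources(log))
```

The set returned by repeated_redirect_targets can be saved as the rule list, so that later runs treat any redirect to one of those targets as a dead link without repeating the analysis.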
Third party packages
Third party packages for soft 404 detection:
- "soft404: a classifier for detecting soft 404 pages", machine learning