This task combats link rot by using the Internet Archive Wayback Machine to provide archived copies of now-dead links in references and citations, or by marking them with {{dead link}} if a suitable archived copy is unavailable.
The bot currently processes only citation templates that have both |url= and |accessdate= set. The recognized citation templates are: {{Citation}}, {{Cite news}}, {{Cite web}}, {{Cite journal}}, {{Cite book}}, {{Cite mailing list}}, {{Cite video}}, {{Vcite web}}, {{Vcite book}}, {{Vcite news}}, and {{Vcite journal}}. The bot attempts to retrieve an archived copy from Wayback and adds |archiveurl= and |archivedate= to the citation, respecting the citation's whitespace formatting. It also adds an <!-- Added by H3llBot --> comment, so that bot-added archives can be tracked. Failing that, it marks the dead link with {{dead link|bot=H3llBot}}; for a citation that was preemptively archived with |deadurl=no, it sets |deadurl=yes.
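The parameter insertion described above can be sketched as follows. This is a minimal illustration, not the bot's actual code: the function name `add_archive_params`, the placement of the tracking comment directly after the |archivedate= value, and the idea of copying the spacing of the existing |url= parameter are all assumptions, and nested templates inside the citation are not handled.

```python
import re

BOT_COMMENT = "<!-- Added by H3llBot -->"  # tracking comment named in the text


def add_archive_params(citation: str, archive_url: str, archive_date: str) -> str:
    """Append |archiveurl= and |archivedate= before the closing braces.

    Hypothetical helper: it copies the whitespace used around the existing
    |url= parameter so the new parameters match the template's layout.
    """
    stripped = citation.rstrip()
    if not stripped.endswith("}}"):
        raise ValueError("not a complete template")
    if re.search(r"\|\s*archiveurl\s*=", citation):
        return citation  # already has an archive; leave untouched

    m = re.search(r"(\s*)\|(\s*)url(\s*)=(\s*)", citation)
    if m is None:
        raise ValueError("citation has no |url= parameter")
    lead, pre, mid, post = m.groups()

    def param(name: str, value: str) -> str:
        # Rebuild the separator exactly as it appears around |url=
        return f"{lead}|{pre}{name}{mid}={post}{value}"

    additions = (
        param("archiveurl", archive_url)
        + param("archivedate", archive_date)
        + BOT_COMMENT  # assumed placement
    )

    inner = stripped[:-2]                # everything before the final "}}"
    body = inner.rstrip()
    tail_ws = inner[len(body):]          # e.g. the newline before "}}"
    trailing = citation[len(stripped):]  # whitespace after "}}", if any
    return body + additions + tail_ws + "}}" + trailing
```

For a single-line citation the new parameters are appended on the same line; for a citation written with one parameter per line, each new parameter gets its own line with the same indentation.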
The Wayback archive chosen is either (1) the closest archived copy before the citation's |accessdate=, within a 3-month range, or (2) the first archived copy after the access date, within a 1-month range (the range used to be ±6 months). The date format is derived either from the {{Use dmy dates}} or {{Use mdy dates}} template, or from the citation's |accessdate= or |date= field.
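The selection rule and the date-format derivation can be illustrated with a short sketch. Several details here are assumptions rather than documented behaviour: the months are approximated as 90 and 30 days, a copy made on the access date itself counts as "before", rule (1) takes priority over rule (2), and the function names `pick_snapshot`, `detect_style` and `format_date` are hypothetical.

```python
import re
from datetime import date, timedelta
from typing import Iterable, Optional

BEFORE_WINDOW = timedelta(days=90)  # "3 month range", approximated
AFTER_WINDOW = timedelta(days=30)   # "1 month range", approximated

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def pick_snapshot(access: date, snapshots: Iterable[date]) -> Optional[date]:
    """Pick the archive date to use for a citation, or None if out of range."""
    snaps = sorted(snapshots)
    # Rule (1): closest copy at or before the access date, within the window.
    before = [s for s in snaps if access - BEFORE_WINDOW <= s <= access]
    if before:
        return before[-1]
    # Rule (2): first copy after the access date, within the window.
    after = [s for s in snaps if access < s <= access + AFTER_WINDOW]
    return after[0] if after else None


def detect_style(page_text: str, sample_date: str) -> str:
    """Return 'dmy', 'mdy' or 'iso' from the page templates or a sample date."""
    if re.search(r"\{\{\s*use dmy dates", page_text, re.IGNORECASE):
        return "dmy"
    if re.search(r"\{\{\s*use mdy dates", page_text, re.IGNORECASE):
        return "mdy"
    sample = sample_date.strip()
    if re.fullmatch(r"\d{1,2} [A-Za-z]+ \d{4}", sample):
        return "dmy"
    if re.fullmatch(r"[A-Za-z]+ \d{1,2}, \d{4}", sample):
        return "mdy"
    return "iso"


def format_date(d: date, style: str) -> str:
    """Format an archive date in the detected style."""
    if style == "dmy":
        return f"{d.day} {MONTHS[d.month - 1]} {d.year}"
    if style == "mdy":
        return f"{MONTHS[d.month - 1]} {d.day}, {d.year}"
    return d.isoformat()
```

With an access date of 1 May 2010, a copy from 15 March 2010 would be chosen over one from 20 May 2010, and a copy from 1 July 2010 alone would be rejected as out of range.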
Dead links are URLs whose HTTP status responses are 404 or 301. Other error codes and failed connections are ignored. The 404 check is carried out twice within 3 days (it used to be 1 day) to make sure the link is really dead and not just down for maintenance. GET (as opposed to HEAD) requests are used and redirects are followed, as some servers redirect to both 404 and 200 pages.
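A rough sketch of this check, using only the Python standard library, might look like the following. It is an assumption-laden illustration: the names `fetch_status` and `is_confirmed_dead` and the User-Agent string are made up, and "twice within 3 days" is read here as two dead results whose timestamps are at most 3 days apart.

```python
import http.client
import urllib.error
import urllib.request
from datetime import datetime, timedelta
from typing import Optional

DEAD_STATUSES = {404, 301}           # statuses the text treats as dead
RECHECK_WINDOW = timedelta(days=3)   # both checks must fall within this span


def fetch_status(url: str, timeout: float = 30.0) -> Optional[int]:
    """GET the URL (urlopen follows redirects) and return the final status.

    Returns None for failed connections, which the text says are ignored.
    """
    req = urllib.request.Request(
        url,
        method="GET",
        headers={"User-Agent": "Mozilla/5.0 (compatible; example-checker)"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return None


def is_confirmed_dead(first_status: Optional[int], first_time: datetime,
                      second_status: Optional[int], second_time: datetime) -> bool:
    """True only if two separate checks within the window both looked dead."""
    if first_status not in DEAD_STATUSES or second_status not in DEAD_STATUSES:
        return False
    gap = second_time - first_time
    return timedelta(0) < gap <= RECHECK_WINDOW
```

Keeping the decision in `is_confirmed_dead` separate from the network call makes the rule easy to test without touching any live site.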
FAQ
Q: You marked a link as {{Dead link}}, but there is a copy on Wayback!
A: Usually, the available copies fall outside the date range the bot is comfortable using. Also, Wayback is not always reliable: the bot uses multiple retries and delays when the Internet Archive returns connection errors, but even that sometimes fails.
Q: How many times will your bot keep coming back to the same page and making changes? Can't you do them all at once?
A: Wayback is not always reliable (in fact, it is quite unreliable much of the time, with frequent timeouts). Retrieval for a link often fails at one time and succeeds at another. Even the implemented retries and delays do not always work. Hopefully, return visits will mean fixing more links.
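The "retries and delays" mentioned in these answers could take a form like the sketch below. The attempt count, the delay values, the doubling back-off and the name `with_retries` are all assumptions chosen for illustration; the text does not say which values the bot uses.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], attempts: int = 3, delay: float = 5.0,
                 backoff: float = 2.0,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call fetch() up to `attempts` times, waiting longer after each failure.

    Connection problems (OSError and its subclasses, including timeouts)
    trigger a retry; None is returned once every attempt has failed.
    """
    wait = delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == attempts:
                return None  # give up; a later visit may succeed
            sleep(wait)
            wait *= backoff
    return None
```

Returning None instead of raising leaves the link untouched for this visit, which matches the idea that a return visit may succeed where this one failed.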
Q: The link isn't dead! You marked it as dead.
A: Some websites don't like bots and use various ways of detecting automated processes, the simplest being a check of the user agent and referrer. The bot fakes these, but even then some sites may wrongly return a 404 Not Found page instead of 403 Forbidden, as they should.
Q: The link isn't dead! You added archive parameters to it anyway.
A: Sometimes websites are temporarily down and wrongly return 404 instead of 503. Even though the bot rechecks every dead link, its visits may fall within the maintenance window. Also, preserving archive copies for live links is not actually wrong, though it is misleading without |deadurl=no.
Q: The linked Wayback page says there is no archive available! Why did the bot add bad URLs?
A: Make sure it is not a temporary problem; individual Wayback servers are often down. Otherwise, the page was available when it was added. The Internet Archive respects robots.txt and requests for content removal, so any copyright holder can contact them and ask for pages to be removed. This doesn't happen often and is very time-consuming to verify reliably.
Q: The bot only marked 1 or 2 links, but there are more dead, even from the same domain.
A: This is probably because the bot had seen that link with that access date before in another article, but has not yet checked all the links in this one. It should get back to this article eventually and mark the rest. This happens rarely.
Codes
This task covers several "sub-tasks", marked with a code (in the edit summary or page link redirect) as follows:
- ADL – archive dead links: adds |archiveurl= to citation(s) with a successfully retrieved archived copy
- MCD – mark citation dead: adds {{dead link}} to citation(s) for which no archived copy could be retrieved
- MDY – mark citation expired: sets |deadurl=yes for citation(s) that are now dead but have a preemptively archived copy
- RDT – remove deadlink tag: removes {{dead link}} from citation(s) with a successfully retrieved/added archived copy
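As a rough illustration of how these codes relate to one another, the sketch below maps the state of a single citation whose link was found dead to the codes that would apply. The function `subtask_codes` and its three flags are hypothetical, and the way the cases combine (for example, ADL together with RDT) is an assumption drawn from the descriptions above rather than from the bot's source.

```python
from typing import List


def subtask_codes(archive_found: bool, preemptively_archived: bool,
                  has_dead_tag: bool) -> List[str]:
    """Return the sub-task codes for one citation whose link is dead."""
    if preemptively_archived:
        # Citation already carries an archive with |deadurl=no.
        return ["MDY"]
    if archive_found:
        # Archive added; an existing {{dead link}} tag is no longer needed.
        return ["ADL", "RDT"] if has_dead_tag else ["ADL"]
    if not has_dead_tag:
        # Nothing usable on Wayback, so the citation gets tagged.
        return ["MCD"]
    return []  # already tagged and still no archive: nothing to change
```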
TODO
- Check bare external links
- Check manually written references
- Parse revision history to find url insertion dates when accessdate is missing
- Use WebCite as alternative to Wayback
BRFA
The bot request for approval is available at WP:BRFA/H3llBot 2. Addenda: H3llBot 2b, H3llBot 2c
Relevant links
- WP:DEADLINK, WP:DEADREF
- Wikipedia:WikiProject_External_links/Webcitebot2 - task force of WP:EL dedicated to link repair
- Dead link related: 1 2 3 4 5
- WebCiteBOT related: 1 2 3 4 5 6
- Access dates by bots: VP 1, VP 2, no consensus
- Other similar bot BRFAs
- Tim1357's DASHBot 11, a bot for the same purpose (description)
- Ocolon's Ocobot for finding dead links
- Anomie's AnomieBOT 60 for replacing and archiving certain domains
- ThaddeusB's DeadLinkBOT for replacing certain domains
- Merlissimo's MerlLinkBot for replacing certain domains
- ThaddeusB's WebCiteBOT for preemptively archiving new links
- Blevintron's BlevintronBot -- withdrawn
- Emijrp's BOTijo 10 -- expired