Wikipedia:Bots/Requests for approval/RotlinkBot
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Withdrawn by operator.
Operator: Rotlink (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 21:04, Sunday August 18, 2013 (UTC)
Automatic, Supervised, or Manual: Supervised
Programming language(s): Scala, Sweble, Wiki.java
Source code available: No. There are actually only a few lines of code (and half of them are Wiki urls and passwords), because two powerful frameworks do all the work.
Function overview: Find dead links (mostly by looking for {{dead link}} marks next to them) and try to recover them by searching web archives using the Memento protocol.
Links to relevant discussions (where appropriate): User_talk:RotlinkBot, Wikipedia:Bot_owners'_noticeboard#RotlinkBot_approved.3F
Edit period(s): Daily
Estimated number of pages affected: 1000/day (perhaps a bit more in the first few days)
Exclusion compliant (Yes/No): No. It was not exclusion compliant initially, and so far nobody has undone any of its changes or complained about them. It can easily be made exclusion compliant.
Already has a bot flag (Yes/No): No
Function details: Find dead links (mostly by looking for {{dead link}} marks next to them) and try to recover them by searching web archives using the Memento protocol.
The current version of the bot software does not work with other, non-Memento-compatible archives (WebCite, WikiWix, Archive.pt, ...).
During the test run, about 3/4 of the recovered links were found on the Internet Archive (because it has the biggest and oldest database), about 1/4 on Archive.is (because of its proactive archiving of new links appearing on the wikis), and only a few links on the other archives (because of their smaller size and regional focus).
Discussion
- Comment - I have a concern with this bot in that it has a possibly unintended side effect of replacing valid HTML entities in articles with characters that result in undefined browser behavior. RotlinkBot converted "&gt;" to ">" here and "&amp;" to "&" here. --Marc Kupper|talk 01:41, 19 August 2013 (UTC)[reply]
- Hi.
- Yes, it was already noticed. User_talk:A930913/BracketBotArchives/_4#BracketBot_-_RotlinkBot. Anyway, although this optimisation of entities, which the Wiki.java framework performs on each save, is harmless and does not break the layout (both rendered pages, before [1] and after [2] the change, contain ">"), the resulting diffs are confusing for people. I am going to fix it to avoid weird diffs in the future.
- Also, the rendered HTML (which the browser sees and deals with) has "&amp;" even if the Wiki source has a bare "&" without an entity name. This seems to be the reasoning of the framework authors: if the resulting HTML is the same, it may make sense to emit the shortest possible Wiki source. But this nice intention results in weird diffs if the Wiki source was not optimal before the edit.
- Again, to be clear: the bot makes changes to the Wiki source. The browser does not deal with the Wiki source and cannot hit the "undefined behavior" case. The browser deals with the HTML which is produced from the Wiki source on the server, and single "&"s are converted to "&amp;" at that stage. The only drawback I see here is that the bot makes edits out of the intended scope, and this will be fixed. Rotlink (talk) 03:49, 19 August 2013 (UTC)[reply]
I will consider the many edits already made as a long trial. I'll look through more edits as time permits.
One immediate issue is that the bot needs a much better edit summary, with a link to some page explaining the bot's process.
For more info on how this was handled before, see H3llBot, DASHBot and BlevintronBot and their talk pages, so you are aware of the many issues that can come up.
Just to clarify, does "supervised" mean you will review every edit the bot makes? Because that is what it means here. If so, I can be more lenient on corner cases, as you will manually fix them all. ~~Hellknowz
- I look through the combined diff of all edits (it is usually 1-2 lines per article - only the editing point and a few characters around it). Also, the number of unique introduced urls is smaller than the number of edits. ~~Rotlink
For example, how do you determine links are actually dead? What exactly does "mostly by looking for {{dead link}}" cover? For example, it is consensus that the link should be revisited at least twice to confirm it is not just server downtime or an incorrect temporary 404 message. This has been an issue, as there are false positives and many corner cases. ~~Hellknowz
- You are absolutely right, detecting dead links is not trivial.
- That is why I started with the most trivial cases:
- sites which have definitely been dead for a long time (such as btinternet.{com|co.uk}, the {linuxdevices|linuxfordevices|windowsfordevices}.com family, ...)
- Google Cache, which Wikipedia has a lot of links to and where it is easy to check whether an entry has been removed from the cache.
- This job is almost finished, so it seems needless to submit another BRFA for it.
- I mentioned {{dead link}} as a future way to find dead links. Something like {{dead link|date=April 2012}} would be a stronger signal of a dead link than a 404 status of the url (which can be temporary). The idea was to collect a list of all urls marked as {{dead link}}, to find dead domains with a lot of urls, and to perform on them the same fixes which had been done on btinternet.com (or, generally, not dead domains but dead prefixes; for example, all urls starting with http://www.af.mil/information/bios/bio.asp?bioID= are dead while the domain itself is not). I see no good way to do this fully automatically. The scripts can help to find such hot spots and prioritize them by the number of dead links, so that a single manual check can kick off a mass replacement (a rough sketch of this grouping step follows below). This also simplifies manual post-checking. ~~Rotlink
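A minimal sketch of that grouping step, in Scala (the bot's stated implementation language); the input list, the "host plus first path segment" prefix depth and the threshold are illustrative assumptions, not the bot's actual code:
<syntaxhighlight lang="scala">
import java.net.URI
import scala.util.Try

object DeadLinkHotSpots {
  // Group urls tagged with {{dead link}} by host plus first path segment, and rank
  // the groups by size, so that one manual check can approve a mass fix per group.
  def hotSpots(deadUrls: Seq[String], minGroupSize: Int = 50): Seq[(String, Int)] =
    deadUrls
      .flatMap { u =>
        for {
          uri  <- Try(new URI(u)).toOption   // skip unparsable urls
          host <- Option(uri.getHost)        // skip urls without a host
        } yield {
          val firstSegment = Option(uri.getPath).getOrElse("")
            .split("/").filter(_.nonEmpty).headOption.getOrElse("")
          s"$host/$firstSegment"
        }
      }
      .groupBy(identity)
      .map { case (prefix, hits) => prefix -> hits.size }
      .toSeq
      .filter { case (_, count) => count >= minGroupSize }
      .sortBy { case (_, count) => -count }
}
</syntaxhighlight>
The output is simply a ranked list of candidate dead prefixes for a human to approve or reject before any mass replacement starts.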
- So the bot does not currently browse the webpages itself? You manually specify which domains to work on, based on a human judgment that they are all dead? That sounds fine.
- As a sidenote, {{dead link|date=April 2012}} is usually a human-added tag, and humans don't double-check websites (say, if it was just downtime). In fact, a tag added by a bot, say {{dead link|date=April 2012|bot=DASHBot}}, might even be more reliable, as the site was checked 2+ times over a period of time. At the same time, the bot could make a mistake because the remote website is broken, while a human can tag a site which appears live (200) to the bot. Just something to consider. — HELLKNOWZ ▎TALK
One issue is that the bot does not respect existing date formats [3]. |archivedate= should be consistent, usually with |accessdate=, and definitely if other archive dates are present. They are exempt from {{use dmy dates}} and {{use mdy dates}}, although date format is a contentious issue. ~~Hellknowz
- Ok. Guessing the date format was implemented but soon disabled because of mistakes in guessing when a page has many different format examples to reuse. It seems that the bot will not create new {{citeweb}} templates, and the only point where it will need to know the date format is when adding |archivedate= to templates which already have |accessdate= as the example. ~~Rotlink
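A minimal sketch of reusing an existing |accessdate= value as the example when formatting |archivedate=, in Scala; the recognised patterns and the ISO fallback are illustrative assumptions rather than the bot's actual rules:
<syntaxhighlight lang="scala">
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

object ArchiveDateFormat {
  private val dmy = DateTimeFormatter.ofPattern("d MMMM yyyy", Locale.ENGLISH)   // 19 August 2013
  private val mdy = DateTimeFormatter.ofPattern("MMMM d, yyyy", Locale.ENGLISH)  // August 19, 2013
  private val iso = DateTimeFormatter.ISO_LOCAL_DATE                             // 2013-08-19

  // Pick the formatter whose style matches the citation's existing |accessdate= value,
  // falling back to ISO if the format cannot be recognised.
  def formatterFor(accessdate: String): DateTimeFormatter = accessdate.trim match {
    case s if s.matches("""\d{1,2} \w+ \d{4}""")  => dmy
    case s if s.matches("""\w+ \d{1,2}, \d{4}""") => mdy
    case _                                        => iso
  }

  def formatArchiveDate(accessdate: String, snapshot: LocalDate): String =
    snapshot.format(formatterFor(accessdate))
}
</syntaxhighlight>
For example, formatArchiveDate("19 August 2013", LocalDate.of(2002, 11, 13)) would yield "13 November 2002", matching the citation's dmy style.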
Further, {{Wayback}} (or similar) is the preferred way of archiving bare external links; you should not just replace the original url without any extra info [4], as this creates extra issues: the url now permanently points to the archive instead of the original. You can create other service-specific templates for your needs, probably for Archive.is. ~~Hellknowz
- Currently, I try to preserve the original url within the archive url when the url is replaced (like in the diff you pointed to), and to use the shortest form of the archive url when it is added next to the dead url. Which way is better can be a discussion topic. Original urls were preserved inside Google Cache urls, and it was very easy to recover them. ~~Rotlink
- I would say that prepending something like http://web.archive.org/web/20021113221928/ produces something less cryptic than two urls wrapped in a template. Also, {{Wayback}} renders extra information besides the title. This can break narrow columns in tables [5], etc. ~~Rotlink
- I admit this isn't something I have considered, because I didn't work on anything other than external links inside reference tags, mainly because of all the silly cases like this, where beneficial changes actually break formatting. Yet another reason to use proper references/citations. But that isn't English wiki, and I cannot speak for them. Here we would convert all that into actual references. I won't push for you to necessarily implement {{Wayback}} and such, but if this comes up later, I did warn you :) — HELLKNOWZ ▎TALK
Here [6] you change an external url to a reference, which runs afoul of WP:CITEVAR; bots should never change styles (unless that's their task). I'm guessing this is because the url sometimes becomes mangled as per the previous paragraph? ~~Hellknowz
- It was already a reference; it is inside <ref name=nowlebanon></ref>.
- It may look inconsistent with many other bot edits [7].
- The former adds an archive url next to an existing dead url. It tries to preserve the original url, which means (a rough sketch follows after this list):
- if the dead url is within something like {{citeweb}}, the bot adds archiveurl= to it (the shortest form of the archive url is used)
- if the dead url is inside bare <ref></ref> tags, the bot forms {{citeweb}} with archiveurl= (the shortest form of the archive url is used)
- otherwise, it replaces the dead url with the archive url (using the form of the archive url with the dead url inside it)
- The latter just replaces one archive url (Google Cache) with another archive url, so it does not depend on the content. ~~Rotlink
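A rough Scala sketch of that decision tree; the insideCiteWeb/insideRefTag checks and the two archive-url forms are hypothetical stand-ins for whatever the Sweble parse tree and the archive lookup actually provide:
<syntaxhighlight lang="scala">
object ArchiveFix {
  sealed trait Fix
  case class AddArchiveUrlParam(archiveUrl: String) extends Fix              // dead url already inside {{citeweb}}
  case class WrapInCiteWeb(deadUrl: String, archiveUrl: String) extends Fix  // bare url inside <ref></ref>
  case class ReplaceWithArchiveUrl(archiveUrl: String) extends Fix           // plain external link elsewhere

  // insideCiteWeb / insideRefTag stand in for checks against the parsed wikitext;
  // shortArchiveUrl / archiveUrlWithOriginal stand in for the two forms of archive url.
  def chooseFix(deadUrl: String,
                insideCiteWeb: Boolean,
                insideRefTag: Boolean,
                shortArchiveUrl: String,
                archiveUrlWithOriginal: String): Fix =
    if (insideCiteWeb)     AddArchiveUrlParam(shortArchiveUrl)
    else if (insideRefTag) WrapInCiteWeb(deadUrl, shortArchiveUrl)
    else                   ReplaceWithArchiveUrl(archiveUrlWithOriginal)
}
</syntaxhighlight>
The point of the third branch is that the original url survives inside the archive url itself, so it can still be recovered later.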
- Let me try to clarify. A reference is basically anything that points to an external source. The most common way is to use the <ref>...</ref> tags, but it doesn't have to be. The most common citation syntax is citation style 1, i.e. the {{cite web}} template family, but it doesn't have to be.
<ref>Peter A. [http://www.weather.com/day.html ''What a day!''] 2002. Weather Publishing.</ref> is a valid reference using manual citation style.
<ref>{{cite web |author=Peter A. |url=http://www.weather.com/day.html |title=What a day! |year=2002 |publisher=Weather Publishing.}}</ref> is a valid reference using CS1.
- So if all the references in the article are manual (#1), but a bot adds a {{cite web}} template, that is modifying/changing the citation style. Even changing only bare external urls to citations is sometimes contentious, especially if a bot does that. This is what WP:CITEVAR and WP:CONTEXTBOT mean. — HELLKNOWZ ▎TALK
- Is <ref>[http://www.weather.com/day.html ''What a day!'']</ref> interchangeable with <ref>{{cite web |url=http://www.weather.com/day.html |title=''What a day!''}}</ref>? They render identically. ~~Rotlink
- No, they cannot be interchanged in code if the article already uses one style or the other (unless you get consensus or have a good reason). Human editors have to follow WP:CITEVAR, let alone bots. You could, for example, convert them into CS1 citations if most other references are CS1 citations. I admit, I much prefer {{cite xxx}} templates and they make a bot's job easy, and I'd convert everything into these, especially for archive parameters. But we have no house style, and the accepted style is whatever the article uses. That's why we even have {{Wayback}} and such. — HELLKNOWZ ▎TALK
Other potential issues: a {{Wayback}} template already next to a citation, various possible locations of {{dead link}} (inside or outside the ref), archive parameters already present in citations or partially missing. ~~Hellknowz
To clarify, does the bot use |accessdate=, then |date=, for deciding what date the page snapshot should come from? If no date is specified, does the bot actually parse the page history to find when the link was added and thus accessed? This is how previous bot(s) handled this. Unless consensus changes, we can't yet assume any date/copy will suffice. ~~Hellknowz
- I have experimented with parsing the page history, and this tactic showed bad results. The article (or part of the article) can be translated from a regional wiki (or another site), and the url can already be dead by the time it appears in the English Wikipedia. I implemented some heuristics to pick the right date, but only manual checking can be 100% accurate. Or we could consider it a good deal to fix a definitely dead link with a definitely live link which in some cases might be inaccurate (archived a bit before or after the time range when the url had the cited content), while preserving the original dead link either within the url or as a template parameter. ~~Rotlink
- It's still more accurate than assuming the current date to be the one to use. What I mean is that any date before today where the citation already exists is closer to the original date, even if not the original. Translated/pasted text can probably be considered accessed on that day. Pasted text in the first revision can be skipped. I won't push this logic onto you, as I think the community is becoming much more lenient with archive retrieval due to the sheer number of dead links, but it's something to consider. — HELLKNOWZ ▎TALK
- The current heuristic is to pick the oldest snapshot of the exact url. By "exact" I mean string equality of the archived link and the dead link in the Wiki article (they are not always equal, because of the presence or absence of the www. prefix; see the urls in the Memento answer below). Parsing the page history adds knowledge of an upper bound (the date by which the link was already present in the article). Knowledge of the upper bound wouldn't help a heuristic which simply picks the oldest snapshot. Anyway, we need more ideas here. Perhaps all the snapshots between the oldest and the upper bound have to be downloaded and analyzed (if they are similar, then the bot can pick any one; otherwise the decision must be made by a human)... ~~Rotlink
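A minimal Scala sketch of this heuristic, assuming the TimeMap has already been parsed into memento records; the record fields and names are illustrative:
<syntaxhighlight lang="scala">
import java.time.ZonedDateTime

object SnapshotHeuristic {
  case class Memento(datetime: ZonedDateTime, archivedUrl: String, snapshotUrl: String)

  // Oldest snapshot whose archived url is string-equal to the dead url from the article,
  // so that a www./non-www. variant is not silently substituted.
  def pickSnapshot(deadUrl: String, mementos: Seq[Memento]): Option[Memento] =
    mementos
      .filter(_.archivedUrl == deadUrl)
      .sortBy(_.datetime.toEpochSecond)
      .headOption
}
</syntaxhighlight>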
- You didn't hear me say this, but we probably don't need such precision. It would be an interesting exercise to compare old revisions. But I think we can just use the oldest date that was found in article history and that would work for 99%+ of cases (at least I did this and I have not received any false positive reports in the past). In fact, humans hardly ever bother to do this and previous bots weren't actually required to. I personally just ignored any cases where I couldn't find the date within a few months, but later bots have pretty much extended this period to any date. — HELLKNOWZ ▎TALK
The bot does need to be exclusion compliant due to the nature of the task and the number of pages edited. You should also respect {{inuse}} templates, although that's secondary. ~~Hellknowz
Can you please give more details on how Memento actually retrieves the archived copy? What guarantees are there that it is a match, and what are their time ranges? I am going through their specs, but it is important that you yourself clarify enough detail for the BRFA, as we cannot easily approve a bot solely on third-party specs that may change. While technically an outside project, you are fully responsible for the correctness of the changes. ~~Hellknowz
- Actually, the bot does not depend on the services provided by the Memento project. Memento just defines a unified API to retrieve older versions of web pages. Many archives (and wikis as well) support it. Without Memento, specific code needs to be written to work with each archive (that's what I will do anyway to support WebCite and others in the future).
- The protocol is very simple, and I think one example can explain it much better than the long spec.
<pre>
C:\>curl -i "http://web.archive.org/web/timemap/http://www.reuben.org/NewEngland/news.html"
HTTP/1.1 200 OK
Server: Tengine/1.4.6
Date: Mon, 19 Aug 2013 12:18:06 GMT
Content-Type: application/link-format
Transfer-Encoding: chunked
Connection: keep-alive
set-cookie: wayback_server=36; Domain=archive.org; Path=/; Expires=Wed, 18-Sep-13 12:18:06 GMT;
X-Archive-Wayback-Perf: [IndexLoad: 140, IndexQueryTotal: 140, , RobotsFetchTotal: 0, , RobotsRedis: 0, RobotsTotal: 0, Total: 144, ]
X-Archive-Playback: 0
X-Page-Cache: MISS

<http:///www.reuben.org/NewEngland/news.html>; rel="original",
<http://web.archive.org/web/timemap/link/http:///www.reuben.org/NewEngland/news.html>; rel="self"; type="application/link-format"; from="Wed, 13 Nov 2002 22:19:28 GMT"; until="Thu, 10 Feb 2005 17:57:37 GMT",
<http://web.archive.org/web/http:///www.reuben.org/NewEngland/news.html>; rel="timegate",
<http://web.archive.org/web/20021113221928/http://www.reuben.org/NewEngland/news.html>; rel="first memento"; datetime="Wed, 13 Nov 2002 22:19:28 GMT",
<http://web.archive.org/web/20021212233113/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Thu, 12 Dec 2002 23:31:13 GMT",
<http://web.archive.org/web/20030130034640/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Thu, 30 Jan 2003 03:46:40 GMT",
<http://web.archive.org/web/20030322113257/http://reuben.org/newengland/news.html>; rel="memento"; datetime="Sat, 22 Mar 2003 11:32:57 GMT",
<http://web.archive.org/web/20030325210902/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Tue, 25 Mar 2003 21:09:02 GMT",
<http://web.archive.org/web/20030903030855/http://reuben.org/newengland/news.html>; rel="memento"; datetime="Wed, 03 Sep 2003 03:08:55 GMT",
<http://web.archive.org/web/20040107081335/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Wed, 07 Jan 2004 08:13:35 GMT",
<http://web.archive.org/web/20040319134618/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Fri, 19 Mar 2004 13:46:18 GMT",
<http://web.archive.org/web/20040704184155/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Sun, 04 Jul 2004 18:41:55 GMT",
<http://web.archive.org/web/20040904163424/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Sat, 04 Sep 2004 16:34:24 GMT",
<http://web.archive.org/web/20041027085716/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Wed, 27 Oct 2004 08:57:16 GMT",
<http://web.archive.org/web/20050116115009/http://www.reuben.org/NewEngland/news.html>; rel="memento"; datetime="Sun, 16 Jan 2005 11:50:09 GMT",
<http://web.archive.org/web/20050210175737/http://www.reuben.org/NewEngland/news.html>; rel="last memento"; datetime="Thu, 10 Feb 2005 17:57:37 GMT"
</pre>
- ~~Rotlink
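A minimal Scala sketch of the same TimeMap request done programmatically instead of with curl; the regex-based parsing of application/link-format below is a simplification and an assumption, not the bot's actual code:
<syntaxhighlight lang="scala">
import scala.io.Source

object TimeMapClient {
  case class Entry(url: String, rel: String, datetime: Option[String])

  // One link entry looks like:
  //   <http://...>; rel="memento"; datetime="Wed, 13 Nov 2002 22:19:28 GMT"
  private val entryPattern =
    """<([^>]+)>;\s*rel="([^"]+)"(?:;[^,]*datetime="([^"]+)")?""".r

  def fetchTimeMap(originalUrl: String): Seq[Entry] = {
    val timemapUrl = s"http://web.archive.org/web/timemap/link/$originalUrl"
    val body = Source.fromURL(timemapUrl, "UTF-8").mkString
    entryPattern.findAllMatchIn(body).map { m =>
      Entry(m.group(1), m.group(2), Option(m.group(3)))
    }.toSeq
  }
}

// Example: the oldest snapshot is the entry whose rel contains "first memento".
// TimeMapClient.fetchTimeMap("http://www.reuben.org/NewEngland/news.html")
//   .find(_.rel.contains("first memento"))
</syntaxhighlight>
The rel and datetime attributes give enough information to choose a snapshot without downloading any of the archived pages themselves.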
- This is very cool, I might consider migrating my bot to this service, as manual parsing is... well, let's just say none of the archiving bots are running anymore. — HELLKNOWZ ▎TALK 13:49, 19 August 2013 (UTC)[reply]
Finally, what about replacing google cache with archive links? Do you intend to submit another BRFA for this? — HELLKNOWZ ▎TALK 09:34, 19 August 2013 (UTC)[reply]
- Hi.
- Thank you for such a detailed comment!
- Some of the questions I am able to answer immediately, but for others I need some time to think, and I will answer later (in a few hours or days), so they are skipped for now. ~~Rotlink
- I moved your comment inline with mine, so that I can further reply without copying everything. I hope you don't mind. — HELLKNOWZ ▎TALK 13:18, 19 August 2013 (UTC)[reply]
- About "another BRFA for Google Cache": I united this question with another one and answered both together. But after the refactoring of the discussion tree it appears again (you can search for "another BRFA" on the page). Rotlink (talk) 16:34, 19 August 2013 (UTC)[reply]
- Oh yeah, my bad. I also thought it's a wholly separate task, whereas you are both replacing those urls same way as others and replacing IPs with a domain name. The latter one is what I am asking further clarification on. Especially, since it can produce dead links, which should at least be marked so. Of course, that implies being able to detect dead urls or reliably assume them to be. — HELLKNOWZ ▎TALK 16:46, 19 August 2013 (UTC)[reply]
- The issue of replacing dead links with other dead links when replacing IPs with a domain name (webcache.googleusercontent.com) was fixed about a week ago. New links are checked to be live. If they are live, the IP is replaced with webcache.googleusercontent.com. Otherwise the search of the archives begins. If nothing is found, the links with the dead IP remain.
- So these had been two tasks. They were then joined together so that the bot does not save the intermediate state of a page (which contains a dead link) and does not confuse the people who see it (a rough sketch of the combined flow follows below). Rotlink (talk) 17:04, 19 August 2013 (UTC)[reply]
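A rough Scala sketch of that combined flow (rewrite the IP to webcache.googleusercontent.com, keep the new url if it is live, otherwise fall back to an archive search, otherwise leave the link alone); isLive, the rewrite regex and searchArchives are illustrative stand-ins, not the bot's actual code:
<syntaxhighlight lang="scala">
import java.net.{HttpURLConnection, URL}

object GoogleCacheFix {
  // HEAD request; treat 2xx as live. A production bot would recheck later
  // to rule out temporary downtime, as discussed earlier in this BRFA.
  def isLive(url: String): Boolean =
    try {
      val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("HEAD")
      conn.setConnectTimeout(10000)
      conn.setReadTimeout(10000)
      val code = conn.getResponseCode
      code >= 200 && code < 300
    } catch { case _: Exception => false }

  // searchArchives stands in for the Memento lookup sketched earlier.
  def fixGoogleCacheLink(ipBasedUrl: String,
                         searchArchives: String => Option[String]): String = {
    val domainUrl = ipBasedUrl.replaceFirst("""^http://\d{1,3}(\.\d{1,3}){3}""",
                                            "http://webcache.googleusercontent.com")
    if (isLive(domainUrl)) domainUrl
    else searchArchives(ipBasedUrl).getOrElse(ipBasedUrl) // nothing found: leave the link as-is
  }
}
</syntaxhighlight>
Because both checks happen before anything is saved, no intermediate page revision containing a dead link is ever written.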
- Because the bot operator has removed this request on the main page, I am hereby marking this as Withdrawn by operator.—cyberpower ChatOnline 19:11, 20 August 2013 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.