Wikipedia:Bots/Requests for approval/Fluxbot 6
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Operator: Xaosflux (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 03:39, Friday, July 22, 2016 (UTC)
Automatic, Supervised, or Manual: Supervised
Programming language(s): n/a
Source code available: AWB
Function overview: HTML Fixes that are causing pages to be identified as Category:Pages using invalid self-closed HTML tags.
Links to relevant discussions (where appropriate): VPT#New maintenance category
Edit period(s): Ad-hoc batch runs
Estimated number of pages affected: open-ended :thousands of pages, edits to hundreds per run
Exclusion compliant (Yes/No): Yes
Already has a bot flag (Yes/No): Yes
Function details:
I've been working on cleaning up Category:Pages using invalid self-closed HTML tags in advance of the upcoming code changes, mostly running from my own account. I would like to use my bot account primarily so that edits to User talk: can be made quietly using nominornewtalk. This is primarily fixing the most common HTML errors:
- Self-closing div tags (<div id="a" />)
- Self-closing span tags (<span id="a" />)
- Syntax errors with s,small,big,center tags (e.g. <small>..<small/>)
- This task seems prone to false positives in 1-5% of edits, so will need to be run supervised. I am open to running or not running AWB genfixes on articles if anyone has a preference. Thank you, — xaosflux Talk 03:39, 22 July 2016 (UTC)
Discussion
Comment (and Support): I have fixed a few hundred of these and have found a similar percentage of false positives in editing with an AutoEd script, no matter how well I write my regexes. I agree that a supervised run, done carefully, should work well. Would you be willing to share your proposed regexes?
Just for clarity, I would like to see this task approved for all namespaces, not just User Talk. There is a lot of work to be done in Talk, Wikipedia Talk, and Wikipedia.
Feel free to crib from my AutoEd script at User:Jonesey95/AutoEd/month.js.
Also, you might look at Wikipedia:CHECKWIKI/WPC 002 dump for examples of pathological patterns that might be seen as problems or opportunities, e.g. <div id='Myerson'/> and </blockquote/>. – Jonesey95 (talk) 05:31, 22 July 2016 (UTC)
- Pinging Tom.Reding, who has been doing this work quite effectively. – Jonesey95 (talk) 05:38, 22 July 2016 (UTC)
- These definitely need to be corrected. I have been running some fixes for these and also have noticed the false positives, so supervision is necessary. I don't think running with genfixes should be a problem. I assume span will be corrected using {{anchor}} and/or {{subst:anchor}} (or equivalent) and similarly for div. How will you handle unquoted and unbalanced quotes for id= in span and div? — JJMC89 (T·C) 06:10, 22 July 2016 (UTC)
- JJMC89 so far I have not been making that assumption, and simply closing the tag as is (e.g. <span id='Myerson'/> becomes <span id='Myerson'></span>). — xaosflux Talk 11:59, 22 July 2016 (UTC)
- Regexes such as:
- (<span id=".*?" ?)\/> *to* $1></span>
- (<div style=".*?" ?)\/> *to* $1></div>
- — xaosflux Talk 12:04, 22 July 2016 (UTC)
- That is equivalent. ({{subst:anchor|Myerson}} gives <span id="Myerson"></span>.) — JJMC89 (T·C) 14:36, 22 July 2016 (UTC)
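For readers following the regexes quoted above, here is a minimal Python sketch of the two find/replace rules as they are written in the discussion. It is an illustration only: the bot was run through AWB, not Python, and the function name `close_self_closed` is hypothetical. AWB's `$1` backreference is written as `\1` here.

```python
import re

# The two rules as quoted in the discussion. Both only match
# double-quoted attribute values, exactly as the patterns are written.
SPAN_RULE = (re.compile(r'(<span id=".*?" ?)\/>'), r'\1></span>')
DIV_RULE = (re.compile(r'(<div style=".*?" ?)\/>'), r'\1></div>')


def close_self_closed(wikitext: str) -> str:
    """Apply both rules in turn and return the modified text (hypothetical helper)."""
    for pattern, replacement in (SPAN_RULE, DIV_RULE):
        wikitext = pattern.sub(replacement, wikitext)
    return wikitext


print(close_self_closed('<span id="Myerson"/>'))
print(close_self_closed('<div style="clear:both"/>'))
```

As written, these patterns leave single-quoted forms such as `<span id='Myerson'/>` untouched, which is consistent with the operator's remark that some replacements were plain literal strings handled under supervision.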
- Some of my find/replaces aren't even regexes, just literal string replacements, as this is being run supervised (e.g. changing <small/> to </small>). — xaosflux Talk 12:05, 22 July 2016 (UTC)
- Jonesey95 I'm fine running in any namespace, I've been doing cleanups anyway - the bot request is so I can basically do the same work I've been doing with my normal account without triggering the new messages warning. — xaosflux Talk 11:57, 22 July 2016 (UTC)
Support: I'm glad someone is taking on the talk space portion of the error category, which makes up ~1167/2185 entries. ~623/1167 are archived talk pages though (Will touching those raise concerns? If so, can the MediaWiki software be made to ignore archives?). Non-archived User talk: pages only comprise 228/2185, so I definitely support expanding to more/all talk space. ~ Tom.Reding (talk ⋅dgaf) 12:23, 22 July 2016 (UTC)
- MediaWiki doesn't really have a concept of "archive pages", they are just pages. That being said, editing user_talk/subpages does not trigger the new messages indicator, so isn't really the worry. — xaosflux Talk 14:29, 22 July 2016 (UTC)
- Archived pages should be fixed. These deprecated tags, if they are not fixed, will presumably cause pages to display improperly at some point, and we don't want archived pages to suddenly appear different (and broken). – Jonesey95 (talk) 16:22, 22 July 2016 (UTC)
- Oh I agree, I mean they are not a worry for the "new message indicator" - they can be fixed at any time without needing the nominornewtalk flag that only bots have. — xaosflux Talk 16:27, 22 July 2016 (UTC)
- {{BAGAssistanceNeeded}} What would BAG like to see to move forward? — xaosflux Talk 12:24, 23 July 2016 (UTC)
- Safe to test this, I think, on all types of pages. Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. — Earwig talk 20:54, 23 July 2016 (UTC)
- Trial complete. The trial went about as expected: of the first 50 pages where edits would have been needed, 4 needed to be skipped as they were complex. The 25 user talk edits went off very well, while the 25 article edits all appear to be good edits. In many cases they did not solve the overall page problem, leaving some bad HTML tags behind - some additional regexes may help reduce the number of passes a page may need to be solved; however some of the pages are just really messy. I've got a feeling many of the "easy" ones have been cleaned up by hand already. Anything else you would like to see, The Earwig? Thanks, — xaosflux Talk 22:26, 23 July 2016 (UTC)
- Please don't change <ref name=... /> to <ref name=...></ref> like in [1] and many other edits. The former is both allowed and recommended to invoke a reference which is defined elsewhere. See for example Wikipedia:Citing sources#Repeated citations. It's an undocumented (as far as I know) feature that an empty <ref name=...></ref> has the same effect. It will probably confuse most editors and I recommend reverting those changes. ref isn't even an HTML tag but is defined by mw:Extension:Cite. Can you post a complete list of the self-closed tags the bot is coded to change? PrimeHunter (talk) 23:27, 23 July 2016 (UTC)
- Another thing, [2] does not have a helpful edit summary: "replaced: <div style="margin: 0; font-family: sans-serif; font-weight: normal; font-size: 100%; border-top: 1px solid #a3b0bf; text-align: center; color: #000; margin-top: 2em; margin-bot..."
- Could it be something showing the important part and using "..." for unimportant details like: "replaced: <div style=... /> by <div style=...></div>"? PrimeHunter (talk) 23:39, 23 July 2016 (UTC)
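The shortened edit summary suggested above could be produced along these lines. This is a hedged sketch, not how AWB builds its summaries: the function name `short_summary`, the pattern `SELF_CLOSED`, and the fallback text are all hypothetical and chosen only to illustrate the idea of keeping the tag and attribute name while eliding the value.

```python
import re

# Hypothetical pattern: a self-closed tag with at least one attribute.
# Captures the tag name and the name of its first attribute only.
SELF_CLOSED = re.compile(
    r'<(?P<tag>[a-z]+)\s+(?P<attr>[a-z-]+)=[^>]*?/>', re.IGNORECASE
)


def short_summary(original: str) -> str:
    """Build a compact edit summary that elides attribute values."""
    match = SELF_CLOSED.search(original)
    if match is None:
        # Assumed generic fallback when no attribute-bearing tag is found.
        return "fixed invalid self-closed HTML tag"
    tag, attr = match.group("tag"), match.group("attr")
    return f"replaced: <{tag} {attr}=... /> by <{tag} {attr}=...></{tag}>"


print(short_summary('<div style="margin: 0; font-family: sans-serif" />'))
```

For the long `div` example quoted above, a summary of this shape stays readable in page histories regardless of how long the style attribute is.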
- PrimeHunter Thank you for the feedback, I'm taking ref out of my lists, and will revert any of those. — xaosflux Talk 00:22, 24 July 2016 (UTC)
- User talk run had 1 "ref", reverted. — xaosflux Talk 00:27, 24 July 2016 (UTC)
- Article run rolled back as well - the "ref" problem was mostly here, let me know if I can re-trial with the corrections. — xaosflux Talk 00:45, 24 July 2016 (UTC)
- Thanks. I noticed an error in an earlier AWB edit [3]. A span tag in the last change was correctly closed before reaching a self-closed ref tag later on the same line. If we wanted a closing tag for the ref it shouldn't be a span. Is the current code designed to avoid such errors? PrimeHunter (talk) 00:51, 24 July 2016 (UTC)
- Yes, the spans should only strictly match the spans now, I've removed everything about refs completely. — xaosflux Talk 00:58, 24 July 2016 (UTC)
- Except for <ref name="name" /> → <ref name="name"></ref>, the edits look good. You may want to adjust (<span id=".*?" ?)\/> to (<span id="[^">]+?") ?\/> so that the regex isn't too greedy. Also, consider not adding the replacements to the edit summary; the long example PrimeHunter pointed out isn't really helpful. — JJMC89 (T·C) 01:18, 24 July 2016 (UTC)
- Agree. — xaosflux Talk 01:55, 24 July 2016 (UTC)
- I don't object to a new trial run. I suspect the article skip rate will be large based on the number of edits that only changed ref tags. Could you post a list of skipped pages so we can examine whether their problem is transclusions, something missed by your regex, or maybe something else? PrimeHunter (talk) 01:30, 24 July 2016 (UTC)
- I never expect these to be 100% solving the problem - most of the "big wins" I've gotten so far were all in templates. The primary use of this bot for the runs is for user_talk: so that the new message flag doesn't get set. — xaosflux Talk 01:55, 24 July 2016 (UTC)
- {{BAGAssistanceNeeded}} BAG'ers, let me know when it is OK to run another trial to validate that the issues above are resolved please. (50 edits should be fine). — xaosflux Talk 01:57, 24 July 2016 (UTC)
2nd trial
- Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. All right, same as before. — Earwig talk 18:00, 24 July 2016 (UTC)
- User_talk: 25 User talk trial - appears to be successful. For 25 edits, 3 selected pages had to be skipped due to (likely invalid) tag ordering that the regexes didn't like. Of pages that I had no regex for and were automatically skipped, the most common invalid tag is <b />; it seems to be about 60%/40% a typo that should be <br /> or a pointless tag that can be deleted. — xaosflux Talk 03:19, 25 July 2016 (UTC)
- Other namespaces:
- 9 (main) edits - with all the "ref" stuff removed from trial one these look OK now. No manual skips were needed, many automated skips - many pages have one-off errors such as "/tag/" or "tag//" that I didn't attempt to script. Note, there are many issues of self-closing <cite /> tags - used almost the same way as the bad "refs" in trial one - I didn't attempt to repair these as I'm not exactly sure what our "best practice" for this tag is. — xaosflux Talk 03:32, 25 July 2016 (UTC)
- 16 User: to round out the 50 edits. Only 1 manual skip to get to 16 - complex page layout bad for regex; lots of automated skips - User: pages have lots of odd issues such as tags out of order that also include bad tags; fixing the one bad tag won't really fix the page so they will need more attention. — xaosflux Talk 03:39, 25 July 2016 (UTC)
- Trial complete. Please let me know if you see any errors. — xaosflux Talk 03:39, 25 July 2016 (UTC)
- This looks like a very good result. Any conservative script built to deal with these errors should skip 10 to 30% of pages with errors, depending on the namespace, since some of the errors require human eyes to scan them and others require more elaborate fixes than a simple script can provide. If the script-assisted edits can clear 70% or more articles from the error category, humans will be able to work on the more interesting cases. I recommend approval. – Jonesey95 (talk) 04:57, 25 July 2016 (UTC)
- Thanks Jonesey95, even when running this, there are no plans to run "automatically" - the FP rate is too high. — xaosflux Talk 13:50, 25 July 2016 (UTC)
- Trial edits look good. <cite ... /> should be handled manually. In HTML 4 it is used for citations, but in HTML5 it is used to indicate the title of a work. In some cases <cite id="id" /> is being used like <span id="id" /> and should be converted to <span id="id"></span>. Articles using <cite id="id" /> to indicate a reference should have the citation style converted to use <ref name="name" /> (list-defined references) or a shortened footnote style depending on the use. — JJMC89 (T·C) 05:20, 25 July 2016 (UTC)
- That's what I was thinking, these need to be manually evaluated and repaired depending on the usage. — xaosflux Talk 13:50, 25 July 2016 (UTC)
- Agree with both of the above re cite tags. I have also seen <cite name="Foo" />, where an editor typed "cite" instead of "ref". – Jonesey95 (talk) 15:09, 26 July 2016 (UTC)
- {{BAGAssistanceNeeded}} Hi BAG, anything else you want to see? — xaosflux Talk 00:46, 27 July 2016 (UTC)
- Approved. Looks good to me. — Earwig talk 19:00, 29 July 2016 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.