User:John of Reading/Typo fixing with AutoWikiBrowser
This user page may contain an excessive amount of intricate detail that may only interest a specific audience. (February 2014)
If anyone's interested, here is how I do bulk typo fixing:
Creating the list
I download and uncompress a 100GB database dump every couple of months, and use the AWB database scanner to create my main work lists. This is the only method I know that:
- Allows precise control over the search using regular expressions
- Returns the full list of results
- Does not try to guess what I really meant
I am happy to be asked to create article lists for other editors. My latest download is enwiki-20241120-pages-articles.xml.bz2; I watch for a new download.
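For readers without AWB, the idea of a full regular-expression scan over a dump can be sketched in Python. This is only an illustration, not how the AWB database scanner is implemented: the function names `scan_dump` and `local_name` and the tiny `SAMPLE` document are made up for the example, and I am assuming the usual page/title/revision/text layout of a MediaWiki export. Real dumps declare an XML namespace on every tag, which `local_name` strips.

```python
import io
import re
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace: '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def scan_dump(stream, pattern):
    """Yield the title of each page whose wikitext matches `pattern`."""
    regex = re.compile(pattern)
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None and regex.search(text):
                yield title
            title, text = None, ""
            elem.clear()  # release the page's children once it has been checked


# Tiny stand-in for a dump; a real one would be opened with open(path, "rb").
SAMPLE = b"""<mediawiki>
<page><title>Alpha</title><revision><text>They greeted eachother.</text></revision></page>
<page><title>Beta</title><revision><text>Nothing to fix here.</text></revision></page>
</mediawiki>"""

titles = list(scan_dump(io.BytesIO(SAMPLE), r"\beachother"))
print(titles)
```

Because the pages are streamed one at a time, the whole file never has to fit in memory, which matters for a dump of this size.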
Using the Database Scanner
I use a regular expression that describes several typos at once, so that I get a long list of articles that need a variety of fixes.
- I like to search for a prefix rather than a whole word, so that I find occurrences where editors have mangled the ending of a word as well as the beginning.
- If the list is going to be a long one, I'll run the first 1 or 2 percent of the scan and see if I can tweak the regular expression to eliminate some of the false positives.
- When working on the Lists of common misspellings, I use a regular expression that searches for all the words in an entire list. For example,
\be(?:ached|achother|...(several hundred more omitted)...|yasr)
from the "E" list.
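An expression of that shape does not need to be typed by hand. The sketch below shows one way it could be generated from a word list; `build_prefix_regex` is a hypothetical helper, and the three words are just the ones visible in the fragment above.

```python
import re


def build_prefix_regex(misspellings):
    """Combine misspellings sharing a first letter into one prefix regex."""
    first_letters = {word[0] for word in misspellings}
    if len(first_letters) != 1:
        raise ValueError("all words must start with the same letter")
    letter = first_letters.pop()
    # Sort for a stable result; escape in case a word contains regex syntax.
    stems = sorted(re.escape(word[1:]) for word in set(misspellings))
    return r"\b" + re.escape(letter) + "(?:" + "|".join(stems) + ")"


pattern = build_prefix_regex(["eached", "eachother", "eyasr"])
print(pattern)
```

There is deliberately no trailing `\b`, so the expression matches a prefix and still finds words whose endings have been mangled.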
To use the AWB database scanner, select "Database dump" from the drop-down control just above the article list, then press "Make list". Within the scanner dialog, I choose these options before clicking "Start":
- On the "Database" tab, I browse to my download folder and select the uncompressed database dump.
- On the "Namespace" tab, I tick the first "Content" checkbox, which ticks everything in the first list box, then untick "Draft". These database dumps do not include any user pages, so that tick is not relevant.
- On the "Title" tab, I skip blocks of titles within the Wikipedia namespace. I tick "Not contains", "Regex" and "Case sensitive", and paste the long regular expression, below, into the second text box.
- On the "Text" tab, I tick "Contains", "Regex" and "Ignore comments", and paste my current search targets into the first text box.
Titles skipped
|
---|
(?:~~ARTICLES~~|Charles Magauran|Commonly misspelled English words|Cut Spelling|Date and time notation in the United Kingdom|Drexel\s+4\d\d\d|Early Cornish texts|English orthography|Henry Marshall Furman|Interspel|List of On Cinema episodes|List of the Dead Daisies members|Nairai\b|Otte Rud|SoundSpel|Transposed letter effect|~~OTHERS~~|Abuse reports|Abuse response/|Academic studies of Wikipedia|ACF Regionals answers/|Administrators' noticeboard|AMA IRC Meeting log|Adopt-a-typo|Arbitration Committee Elections|Arbitration/|Archived deletion|articles by quality log|Articles for|Articles with UK Geocodes|Attached KML/List of power stations in New Zealand|AutoWikiBrowser/Typos|BillboardEncode/|BillboardID/|Categories for|Catholic Encyclopedia topics/|Centralized discussion/|Changing username/|CHECKWIKI/|Contributor copyright investigations/|Copyright problems/|Correct typos in one click|Coverage of Mathworld topics/|Database reports/|Deleted articles with freaky titles|Deletion log/|Deletion log archive|Deletion review|Did you know nominations/|Disambiguation pages with links/|Editor review/|Featured article|Featured list|Featured picture|Featured portal|Featured topic|Files for|Find a Grave famous people/|GLAM/NHMandSM|GLAM/Your paintings|Goings-on/|Good article reassessment|In the news/|India Education Program/Courses/|Jewish Encyclopedia topics/|Jimbo Wales discussion|List of encyclopedia topics/|List of Wikipedians by|Lists of common misspellings|Main Page history/|Mediation Cabal/|Meetup/|Miscellany for|Move review/|New user log/|Pfam2pdb|Pfam2PDBsum|Picture peer review|Possibly unfree|Recent additions|Redirects for|Reference desk archive|Requested articles|Requests for|Sandbox/|School and university projects/|Shortcut table/|Sockpuppet investigations/|Stub types for deletion|Suspected copyright violations/|Suspected sock puppets|Templates for|Templates with red links|Tyop Contest|Typo Team|Unwanted Cinema cover.png|Upload log archive|Votes for deletion|Wiki 
Ed/|Wiki Guides/|Wikipedia Signpost/2|Wikipedia Signpost/Special|WikiProject Academic Journals/|WikiProject Chemicals/Log/|WikiProject Chemistry/IRC|WikiProject Directory/Description|WikiProject Editor Retention/|WikiProject Fix common mistakes/|WikiProject History Merge/|WikiProject Intertranswiki/|WikiProject Languages/|WikiProject London Transport/The Metropolitan/|WikiProject Missing encyclopedic articles/|WikiProject Pharmacology/Log/|WikiProject Red Link Recovery/|WikiProject Short descriptions/wd/|WikiProject Spam/|~~SLASH~~|/All discussions|/[Aa]rchive|/Article alerts|/Article list|/Article Talk list|/Articles|/Assessment|/Cleanup listing|/CurrentTranscriptions|/[Dd]ata|/Deletion archive|/Did you know|/Discussions?|/DYK|/Encyclopedic articles|/Example generated lists|/[Ff]eedback|/Fundraising|/ICC valuations|/Internet Relay Chat|/IRC|/List of all portals|/List of biographies|/List of mountains|/Listeria|/Listing by project|/Lists of pages|/Members|/Metrics/|/Newsletter|/Participants|/Peer review|/Popular pages|/Prospectus|/[Pp]ublicwatchlist|/Recent changes|/Recognized [Cc]ontent|/[Rr]edlinks|/Rename template parameters|/[Ss]andbox|/Settings/|/Stale drafts|/Stats|/Statistics|/Talk|/Translation task force|/Unpatrolled|/Watchall|/[Ww]atchlist) Yes, there are a few article titles in this list. Some of these contain many false positives, others are where I don't wish to repeat a mistake, others are where I am avoiding a slow-motion edit war. |
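The combined effect of the "Title" and "Text" tabs can be sketched as a single filter. This is my reading of those options, not AWB's code: `keep_page` is a hypothetical name, "Ignore comments" is modelled by deleting HTML comments before searching, and `TITLE_SKIP` is a three-entry excerpt of the long expression above.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

# Short excerpt of the title expression, for illustration only.
TITLE_SKIP = r"(?:Typo Team|/[Aa]rchive|/[Ss]andbox)"


def keep_page(title, text, title_skip, text_target):
    """Return True if the page belongs in the work list.

    title_skip  -- regex for titles to leave out ("Not contains", case sensitive)
    text_target -- regex for the typos being hunted ("Contains", "Ignore comments")
    """
    if re.search(title_skip, title):
        return False
    visible = COMMENT.sub("", text)  # drop <!-- hidden comments --> first
    return re.search(text_target, visible) is not None
```

For example, `keep_page("Exercise", "an exercice in futility", TITLE_SKIP, r"\bexercic")` keeps the page, while the same text under the title "Typo Team/Archive 3" is left out.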
Settings within AutoWikiBrowser
- I tick "Find & Replace"; and within the configuration dialog:
- I do not tick the "ignore" checkboxes, so that I force the correction of the current misspelling in the entire page.
- I tick "Add replacements to edit summary" so that the edit summary is as helpful as possible. For this to work properly, the "Find" strings must match all four brackets of a [[Link]].
- I usually start with my accumulated list of over 4,000 spelling rules.
- Below the spelling rules, I start with a dummy "Find & Replace" rule that finds the exact regular expression that I used for the database scan and replaces it with "INVESTIGATE".
- I tick "Skip if no replacement".
- I tick "Skip if only minor replacements made"; within the "Find & Replace" dialog I use the "Minor" checkbox to mark rules that make a change that is valid but, I think, not worth saving by itself.
I'm currently running with General Fixes turned off because this discussion has not reached a conclusion.
I gave up on RegExpTypoFix some years ago. Although there are lots of good spelling rules there, I prefer to leave MOS fixes to editors who are prepared to defend them.
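The interplay between the spelling rules, the dummy "INVESTIGATE" rule and the two "Skip" options can be modelled in a few lines. This is a sketch of how I understand the settings to behave, not AWB's implementation; the `Rule` class, `apply_rules`, the scan expression and the "recieve" rule are all invented for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    find: str            # regular expression
    replace: str         # replacement text
    minor: bool = False  # True: not worth saving on its own


def apply_rules(text, rules):
    """Apply each rule in order; return (new_text, decision)."""
    major_changes = minor_changes = 0
    for rule in rules:
        text, count = re.subn(rule.find, rule.replace, text)
        if rule.minor:
            minor_changes += count
        else:
            major_changes += count
    if major_changes == 0 and minor_changes == 0:
        return text, "skip: no replacement"
    if major_changes == 0:
        return text, "skip: only minor replacements"
    return text, "review"


SCAN_REGEX = r"\b(?:exercic|recieve)\w*"  # hypothetical scan expression
RULES = [
    Rule(r"\brecieve", "receive"),                        # ordinary spelling rule
    Rule(r"\bexercices(?= spirituels)", "FALSE", True),   # known false positive, minor
    Rule(SCAN_REGEX, "INVESTIGATE"),                      # dummy rule, placed last
]
```

Because the dummy rule runs last, it only fires on text that no spelling rule has already dealt with, so "INVESTIGATE" in the diff means a rule is missing or needs adjusting.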
Checking each proposed edit
Then it's up to me to check each proposed edit.
- If the text looks like vandalism or a WP:BLP violation, then I jump out to look at the article history.
- If I can't understand what the text is trying to say, I don't try to fix it.
- If my "Find & Replace" has damaged correct text, then I may pause to think about changing the re-spelling rules to avoid the false positive.
- If my "Find & Replace" has identified incorrect text by changing it to the word "INVESTIGATE", then I'll either add or adjust a re-spelling rule and try again, or make a one-off edit to the article text.
- If my "Find & Replace" has made an incorrect fix, I'll either adjust the settings and try again, or make a one-off edit to the article text.
- If the changes are part of quoted text or something like a book title, I'll jump out to another window to try to check the source.
I may make other edits in the AWB edit box, fixing additional typos or correcting syntax errors that AWB has identified but not fixed automatically.
I pick one of a handful of pre-configured edit summaries, and then modify it if necessary to describe the edits I actually made.
I try to remember to clear the "Minor edit" checkbox if I've done anything more than simple typo-fixing or if the diff seems very long; the danger is that I forget to tick it again afterwards.
Then it's "Save" and on to the next article.
There is a danger that I'll accidentally save the word "INVESTIGATE" in an article. I check for this kind of error by running this search every day or two.
Editing quotes, book titles and such like
My regular expressions run on the whole page including quotations, book titles and so on. If I edit these, I try to leave a helpful edit summary:
- replaced: foo → bar per source
- I found the source and was able to verify that the version at Wikipedia was incorrect. Either an earlier editor miscopied it, or, perhaps, the source has been corrected after it was copied.
- replaced: foo → bar per book cover image at Amazon/Abebooks/etc.
- I found an image of the book with enough pixels for the words to be read clearly.
- replaced: foo → bar per a search at Amazon/Abebooks/etc.
- I didn't find a usable cover image, but these external sites support the correction.
- replaced: foo → bar - MOS:QUOTE recommends fixing "insignificant" errors in quoted text
- I found the source also has the error, but I've made an editorial decision to apply MOS:QUOTE.
- replaced: foo → bar - In a quote, but I'm assuming this was a copying error
- I haven't found the source, but to me it looks like a copying error. Or perhaps MOS:QUOTE might apply.
- replaced: foo → bar for legibility
- I didn't bother to check the source, as the change is small and the incorrect version is hard to read - something like WIlliam > William
- replaced: foo → bar
- Oh dear. Perhaps I didn't spot that I was about to edit a quote, or I neglected to adjust the edit summary. Please revert if necessary, but it's possible that MOS:QUOTE might apply.
Skipping false positives
The best way to skip false positives is to use regular expressions with lookahead/lookbehind. This method is especially useful when doing the initial database scan, since it means the articles don't even appear in the list.
I've developed a few standard suffixes that arrange for some common false positives to be skipped. I'll tack some of them on to the long regular expression when doing the initial search, and sometimes tack them on to individual find+replace rules when needed.
Suffix | Skip matches... |
---|---|
(?(?<![\.\-]\w*)|(?!\w*[\.\-]))
|
...inside hyphens or dots, probably part of a URL or domain name |
(?(?<!"\w*)|(?!\w*"))
|
...where a single word is inside double quotes |
(?(?<!\[\[\w*)|(?!\w*\]\]))
|
...where a single word is inside a wikilink |
(?![ \(\)\.\,\;\-\'\"\+\&\%\w\d]*\.(?i:(?:gif|jpe?g|ogg|ogv|pdf|png|svg|tiff?|webm))\b)
|
...inside an image file name |
(?!(?:<sup>|[|</?nowiki>|\W)+(?i:Sic)\b)(?<!{{(?:[Aa]s\s+written|[Nn]at|[Nn]ot\s+typo|[Nn]ot\s*a\s*typo|[Pp]roper\s*name|[Ss]ic\??|[Ss]IC|[Tt]ypo)\|[^{}]+)
|
...inside a {{Sic}} template, or closely followed by the word "sic" |
(?<!\<\s*ref\s+name\s*=\s*(?:"|'|)[\w\s\:\-\.\/]{0,99})
|
...inside a reference name |
(?<!https?://[^ \|\{\}\[\]\<\>]*)
|
...inside a URL |
(?<!\b(?<!trans-)title\d*\s*=[^\|\{\}]{0,255})
|
...inside a title parameter, but not a trans-title |
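Two of these suffixes can be tried out in Python, with one caveat: the suffixes are written for the .NET regex engine that AWB uses, and Python's `re` module rejects variable-width lookbehind. In the sketch below the image-file lookahead is copied as-is, while the URL lookbehind is emulated by testing the text in front of each match. `find_typos` and `URL_BEFORE` are names made up for this example.

```python
import re

# Lookahead copied from the table above: skip matches inside an image file name.
IMAGE_SUFFIX = (
    r"(?![ \(\)\.\,\;\-\'\"\+\&\%\w\d]*"
    r"\.(?i:(?:gif|jpe?g|ogg|ogv|pdf|png|svg|tiff?|webm))\b)"
)

# Emulation of the URL lookbehind: does the text before the match end in a URL?
URL_BEFORE = re.compile(r"https?://[^ \|\{\}\[\]\<\>]*$")


def find_typos(target, text):
    """Return match offsets for `target`, skipping file names and URLs."""
    regex = re.compile(target + IMAGE_SUFFIX)
    return [
        m.start()
        for m in regex.finditer(text)
        if not URL_BEFORE.search(text[: m.start()])
    ]
```

Given a page containing "[[File:Morning exercice.jpg|thumb]]", a link to "http://example.org/exercice" and the phrase "a daily exercice routine", a search for `\bexercic` reports only the last of the three.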
I'll typically save edits to around 40% of the articles that turn up in my list, so it is important that the other 60% are skipped efficiently.
For example, the "E" list says that "exercice" may be a misspelling of "exercise". I actually searched for \bexercic
so that I found "exerciced", "exercicing" and so on. As I worked through the list I gradually expanded the rule to
exercic(?!(i|io|ios|is|o)\b)(?!es?\s+(anarchistes|au|comme|commun|d|dans|de|des|divertissants?|du|en|et|journaliers|modulé|ou|par|participatif|phénoménologique|pour|pratiques|préparatoires|prepar[eé]es|progressifs|spirituels?|sur|terminé)\b)(?<!\b(d|l)['’]exercic)(?<!\b(avec|ces|cet|des|douze|en|en\s+\d+|et|les|mes|ou|plein|son|un)\s+exercic)
Fragment | Meaning |
---|---|
(?!(i|io|ios|is|o)\b)
|
Skip if the word is "exercici" (Latin) or other foreign words |
(?!es?\s+(anarchistes|...|phénoménologique)\b)
|
Skip if the word is "exercice" or "exercices" followed by something indicating we're in French-language text |
(?<!\b(d|l)['’]exercic)
|
Skip if the word is immediately preceded by d' or l', again indicating French-language text |
(?<!\b(avec|...|un)\s+exercic)
|
Skip if the previous word tells us we're in French-language text |
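Three of the four fragments can be checked in Python. This is a cut-down adaptation, not the full rule: the list of following French words is shortened to a handful taken from the rule above, and the last fragment is left out because Python's `re` module does not accept a lookbehind of variable width. The name `EXERCIC` is just for the example.

```python
import re

EXERCIC = re.compile(
    r"exercic"
    r"(?!(i|io|ios|is|o)\b)"                     # exercici, exercicio and similar
    r"(?!es?\s+(de|des|du|en|et|pour|sur)\b)"    # shortened list of French followers
    r"(?<!\b(d|l)['’]exercic)"                   # d'exercice, l'exercice
)

samples = [
    "a daily exercice routine",   # genuine typo
    "he exerciced daily",         # genuine typo with a mangled ending
    "Llibre d'hores: exercici",   # foreign word, skipped
    "un exercice de style",       # French text, skipped
    "l'exercice est fini",        # French text, skipped
]
hits = [s for s in samples if EXERCIC.search(s)]
print(hits)
```

Only the first two samples should be reported; the other three are each caught by a different fragment.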
Alternatively I use the "Minor replacement" feature. If I can write a regular expression that describes a set of false positives, I'll add a respelling rule to change that to "FALSE" and mark the rule as "minor".
Namespaces
I will happily fix typos in most non-talk namespaces.
Namespace | Comment |
---|---|
Draft | I don't touch these |
File | I consider file descriptions and fair use rationales to be part of the encyclopedia, and fair game for typo fixing. However, I try not to fix descriptions written in the first person. Some files contain lists of old edit summaries (example) inside <nowiki>...</nowiki> tags. I skip those efficiently by ticking the first "Ignore" checkbox at the top left of the "Find & Replace" dialog while working on the File namespace.
|
Module | I'll have a look at them, but I'm most unlikely to make any edits |
Portal | With care; some portal pages are used for discussion, and shouldn't be fixed; others are archives and probably shouldn't be fixed |
Template | I'll fix typos in template documentation, with care; I'll even fix typos in templates sometimes, with great care |
User | I don't touch these. They are not included in the database dumps so don't turn up in my lists. |
Wikipedia | I aim to fix typos only on pages which are still being used. On the "Skip" tab, I have a regular expression (\{\{([Ff]ailed|[Hh]istorical|[Rr]ejected)(\||\}\})|\[\[User:|\[\[User\s+talk:|^(?s:.{499999})) which I turn on while working on the Wikipedia namespace. This skips most discussion pages, some other inactive pages, and huge pages that can cause AWB to hang.
|
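The "Skip" expression used for the Wikipedia namespace can be tested outside AWB. As far as I can tell Python's `re` module accepts it unchanged, although AWB itself uses the .NET engine; `should_skip` is a hypothetical wrapper.

```python
import re

SKIP = re.compile(
    r"(\{\{([Ff]ailed|[Hh]istorical|[Rr]ejected)(\||\}\})"  # inactive-page banners
    r"|\[\[User:|\[\[User\s+talk:"                          # signatures, so probably a discussion
    r"|^(?s:.{499999}))"                                    # at least 499,999 characters long
)


def should_skip(text):
    """Return True if the page looks inactive, discussion-like or huge."""
    return SKIP.search(text) is not None
```

The last alternative is a length test written as a regex: it can only match when the page has at least 499,999 characters.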
Common misspellings
May 2023: In terms of effort per fix, this approach is no longer efficient.
Each of these lists contains a mixture of spellings. Some are easy, in the sense that most articles that contain that spelling need to be fixed. Others are not easy, because although they are incorrect spellings of English words, they are valid foreign-language words, surnames, brand names, and so on. Back in 2012 there was a backlog of easy errors which I was able to fix. Nowadays I find that other editors keep on top of the easy errors, and, despite my efforts to eliminate the false positives automatically, I'm looking through a list where most of the matches shouldn't be fixed but cannot easily be skipped automatically.
List | Start date | Time | Edits | Notes |
---|---|---|---|---|
A | March 2012 | 2 months | 5500 | My (somewhat naive) database scans created lists totalling 60,000 articles |
A | February 2015 | 3 weeks | 4400 | |
A | November 2016 | 3 weeks | 2800 | |
A | October 2018 | 3 weeks | 2700 | |
A | November 2020 | 3 weeks | 1900 | |
A | November 2022 | 3 weeks | 2270 | |
B | May 2012 | 3 weeks | 1600 | |
B | September 2014 | 6 days | 500 | Reassuringly faster with improved regular expressions |
B | December 2016 | 6 days | 600 | |
B | March 2019 | 1 week | 620 | |
B | December 2020 | 1 week | 400 | |
B | December 2022 | 4 days | 360 | |
C | June 2012 | 5 weeks | 5600 | |
C | October 2014 | 2 weeks | 1900 | |
C | January 2017 | 3 weeks | 2500 | |
C | March 2019 | 4 weeks | 2400 | |
C | January 2021 | 3 weeks | 2000 | |
C | January 2023 | 3 weeks | 1850 | |
D | October 2012 | 3 weeks | 3500 | |
D | December 2014 | 1 week | 850 | |
D | February 2017 | 1 week | 1000 | |
D | May 2019 | 3 weeks | 1300 | |
D | March 2021 | 1 week | 800 | |
D | January 2023 | 1 week | 800 | |
E | November 2012 | 5 weeks | 4650 | |
E | March 2015 | 2 weeks | 1400 | |
E | March 2017 | 1 week | 1300 | |
E | June 2019 | 2 weeks | 1400 | |
E | May 2021 | 1 week | 800 | |
E | March 2023 | 1 week | 700 | |
F | January 2013 | 3 weeks | 2750 | |
F | June 2015 | 2 weeks | 1500 | |
F | April 2017 | 1 week | 900 | |
F | July 2019 | 2 weeks | 1050 | |
F | July 2021 | 2 weeks | 950 | |
F | April 2023 | 2 weeks | 590 | |
G | February 2013 | 8 days | 1050 | |
G | August 2015 | 2 weeks | 500 | |
G | May 2017 | 5 days | 300 | |
G | August 2019 | 1 week | 500 | |
G | August 2021 | 1 week | 350 | |
G | May 2023 | 1 week | 290 | |
H | March 2013 | 11 days | 1100 | |
H | September 2015 | 1 week | 550 | |
H | June 2017 | 4 days | 500 | |
H | September 2019 | 11 days | 650 | Delayed by phab:T232491 |
H | September 2021 | 1 week | 500 | |
H | July 2023 | 5 days | 380 | |
I | April 2013 | 3 weeks | 3500 | |
I | September 2015 | 2 weeks | 1250 | |
I | July 2017 | 2 weeks | 1300 | |
I | November 2019 | 3 weeks | 1700 | |
I | October 2021 | 2 weeks | 800 | |
I | September 2023 | 1 week | 900 | |
J | June 2013 | 1 day | 150 | |
J | October 2015 | 1 day | 60 | |
J | July 2017 | 1 day | 50 | |
J | December 2019 | 1 day | 60 | |
J | November 2021 | 1 day | 100 | |
J | September 2023 | 1 day | 90 | |
K | June 2013 | 1 day | 160 | |
K | October 2015 | 1 day | 70 | |
K | August 2017 | 1 day | 110 | |
K | December 2019 | 1 day | 60 | |
K | November 2021 | 1 day | 60 | |
K | November 2023 | 1 day | 40 | |
L | June 2013 | 3 weeks | 2000 | |
L | October 2015 | 10 days | 1100 | An overall 90% false positive rate despite last time's regex |
L | September 2017 | 2 weeks | 1000 | |
L | January 2019 | 2 weeks | 1000 | |
L | November 2021 | 1 week | 550 | |
L | November 2023 | 2 weeks | 430 | Slowed by off-wiki distractions |
M | September 2013 | 5 weeks | 4100 | I didn't tackle "manouver" and its variants very thoroughly. In many cases it is hard to decide whether to correct to the British or American spelling; and the British spelling is frankly silly. I'll wait for the next edition of the Concise Oxford. |
M | December 2015 | 3 weeks | 3600 | |
M | September 2017 | 3 weeks | 1900 | Still very slow, and only a 20% hit rate, thanks to... |
M | January 2020 | 4 weeks | 2050 | Did I mention that this one is slow? |
M | January 2022 | 3 weeks | 1550 | How about omitting Malcom, Michael and Millenium next time? |
M | January 2024 | 2 weeks | 1200 | |
N | December 2013 | 4 days | 500 | |
N | February 2016 | 2 days | 300 | |
N | October 2017 | 3 days | 480 | |
N | March 2020 | 5 days | 340 | |
N | March 2022 | 3 days | 390 | |
N | March 2024 | 3 days | 290 | |
O | January 2014 | 8 days | 1300 | |
O | February 2016 | 6 days | 840 | |
O | November 2017 | 6 days | 700 | |
O | March 2020 | 5 days | 580 | |
O | March 2022 | 1 week | 740 | |
O | March 2024 | 1 week | 660 | |
P | March 2014 | 4 weeks | 4000 | |
P | March 2016 | 3 weeks | 1900 | |
P | November 2017 | 3 weeks | 2200 | |
P | March 2020 | 3 weeks | 1600 | |
P | March 2022 | 2 weeks | 1400 | |
P | June 2024 | 2 weeks | 1350 | |
Q | August 2013 | 3 days | 300 | |
Q | April 2016 | 1 day | 120 | |
Q | December 2017 | 2 days | 200 | |
Q | April 2020 | 1 day | 100 | |
Q | April 2022 | 1 day | 90 | |
Q | June 2024 | 1 day | 80 | |
R | April 2014 | 2 weeks | 1700 | My thanks to Arjayay for regularly tackling most of the entries in this list. |
R | May 2016 | 1 week | 750 | |
R | January 2018 | 1 week | 650 | |
R | May 2020 | 5 days | 660 | |
R | May 2022 | 1 week | 700 | |
R | May 2024 | 1 week | 580 | |
S | June 2014 | 4 weeks | 4300 | You'd be amazed how many ways there are to spell "specification" |
S | June 2016 | 2 weeks | 1250 | |
S | February 2018 | 4 weeks | 2500 | Hindsight says I didn't build the list correctly last time |
S | May 2020 | 2 weeks | 1500 | |
S | May 2022 | 3 weeks | 1600 | |
S | August 2024 | 3 weeks | 1550 | Ridiculously inefficient |
T | March 2010 | 3 weeks | 2100 | I began with "T" because I assumed most editors would begin with "A". |
T | September 2014 | 3 weeks | 2900 | |
T | August 2016 | 10 days | 1300 | |
T | April 2018 | 2 weeks | 1300 | |
T | June 2020 | 2 weeks | 1100 | |
T | July 2022 | 11 days | 950 | |
T | November 2024 | 12 days | 620 | |
U | March 2010 | 2 weeks | 1200 | |
U | December 2014 | 1 week | 1250 | |
U | October 2016 | 5 days | 700 | |
U | June 2018 | 1 week | 840 | |
U | September 2020 | 4 days | 500 | |
U | August 2022 | 3 days | 350 | |
V | April 2010 | 10 days | 850 | |
V | December 2014 | 1 week | 1750 | |
V | October 2016 | 4 days | 620 | |
V | June 2018 | 1 week | 735 | |
V | September 2020 | 1 week | 500 | |
V | September 2022 | 1 week | 580 | |
W | April 2010 | 2 weeks | 1200 | |
W | January 2015 | 2 weeks | 2100 | |
W | November 2016 | 1 week | 820 | |
W | August 2018 | 1 week | 940 | |
W | September 2020 | 1 week | 700 | |
W | September 2022 | 1 week | 800 | |
X | April 2010 | 1 day | 60 | |
X | February 2015 | 1 day | 175 | |
X | November 2016 | 1 day | 75 | |
X | August 2018 | 1 day | 90 | |
X | September 2020 | 1 day | 50 | |
X | September 2022 | 1 day | 40 | |
Y | May 2010 | 1 day | 60 | |
Y | February 2015 | 1 day | 75 | |
Y | November 2016 | 1 day | 50 | |
Y | August 2018 | 1 day | 50 | |
Y | September 2020 | 1 day | 50 | |
Y | September 2022 | 1 day | 30 | |
Z | May 2010 | 1 day | 40 | |
Z | February 2015 | 1 day | 20 | |
Z | November 2016 | 1 day | 2 | |
Z | September 2018 | 1 day | 8 | |
Z | September 2020 | 1 day | 8 | |
Z | September 2022 | 1 day | 8 | |
Repetitions | May 2010 | 1 month | 3500 | Using only the Google search, so I must have missed many. |
Grammar and Misc | June 2010 | 19 months | 70000 | |
Repeated words
The the
- The settings file is here
I like to scan for "the the" errors whenever I download a new database dump. My regular expression searches for
- Either "The" or "the", followed by "the"
- Perhaps with quotes or apostrophes in between
- ...said it was the "the greatest thing since sliced bread"
- ...announced the the sale of the century
- Perhaps where the second "the" is an article title or in a piped link
- ...the worst outrage since the [[the Troubles]]
- ...did well in the [[1969–70 NBA season|the previous season]]
I don't search for "The The" or "the The" where the second "The" is uppercase. I used to, but after a while I couldn't decide whether "The The Who tour..." or "...the The Times reporter..." looked wrong or not.
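The cases listed above can be covered by a fairly small pattern. The one below is my own simplified guess written for Python, not the expression in the settings file; `THE_THE` is a made-up name.

```python
import re

# Hypothetical pattern, not the one in the author's settings file.
THE_THE = re.compile(
    r"\b[Tt]he"                    # "The" or "the"
    r"(?:\s|[\"'’]"                # whitespace, quotes or apostrophes ...
    r"|\[\[(?:[^\[\]|]*\|)?)+"     # ... or the start of a plain or piped wikilink
    r"the\b"                       # a second, lowercase "the"
)

examples = [
    '...said it was the "the greatest thing since sliced bread"',
    "...announced the the sale of the century",
    "...the worst outrage since the [[the Troubles]]",
    "...did well in the [[1969–70 NBA season|the previous season]]",
]
found = [bool(THE_THE.search(line)) for line in examples]
print(found)
```

Requiring a lowercase second "the" means that "The The Who tour" and "the The Times reporter" are left alone, in line with the decision described above.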
More generally
After each "List of common misspellings", I've been scanning for repeated words beginning with that letter. Here is the main part of the regexp for the letter "U"...
\b[Uu](?<!https?://[\w\.\,\:\/\?\&\%\+\=\-\#_]+)([a-z]+)(\s|’|'|`|"|\]\]|\[\[(?!Category:)[^\[\]\|]*\||\[\[(?!Category:)(?=[^\[\]\|]*\]\]))+u\1\b
...which searches for a word beginning with "U" or "u", followed by the same word beginning with "u". I found that a search for two uppercase words found too many false positives in book/film/song titles. The words may appear inside wikilinks and may be separated by various kinds of quote mark.
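The same expression can be generated for any letter. The sketch below is a Python adaptation with one deliberate omission: the URL lookbehind is dropped, because Python's `re` module does not support lookbehind of variable width. `repeated_word_regex` is a hypothetical helper and expects a single ASCII letter.

```python
import re


def repeated_word_regex(letter):
    """Build the repeated-word pattern for one initial letter, e.g. "u"."""
    lower, upper = letter.lower(), letter.upper()
    return re.compile("".join([
        r"\b[", upper, lower, r"]",                  # first word: either case
        r"([a-z]+)",                                 # rest of the first word
        r"(\s|’|'|`|\"|\]\]",                        # spaces, quote marks, link end
        r"|\[\[(?!Category:)[^\[\]\|]*\|",           # start of a piped link
        r"|\[\[(?!Category:)(?=[^\[\]\|]*\]\]))+",   # start of a plain link
        lower, r"\1\b",                              # same word, lowercase initial
    ]))


rx = repeated_word_regex("u")
print(bool(rx.search("held until until further notice")))
```

The backreference `\1` is what forces the second word to repeat the first, so "unit unity" is not reported, and the lowercase second initial keeps titles such as "Upper Upper" out of the list.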
The main false positives are species names. I began by telling the database scanner to skip any article containing a {{Taxobox}} or {{Automatic taxobox}}, and added a rule that guessed that any Latin-like word ending was a false positive. I later decided this was a mistake, and now deal with these more thoroughly.
Many other false positives turn up, so I add additional rules as needed.