Wikipedia:Bots/Requests for approval/SVnaGBot1
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Automatic or Manually Assisted: Automatic
Programming Language(s):
- Bash
- Python (pywikipedia)
Function Overview:
- Scans this page to find frequently used images that originate on Commons and carry {{vector version available}} tags, finds the pages on the English Wikipedia that use those images, and then posts a new section to the talk pages of those articles reminding editors to substitute the SVG where suitable
Edit period(s): Single slow runs that are manually initiated. Predicted run rate would be monthly or less.
Already has a bot flag (Y/N): No
Function Details:
- Download this page of frequently used vector graphic images, then make a list of the image pages that have been tagged with {{vector version available}}. The bot ignores any image whose full-scale raster is below a configurable X-by-Y pixel size, currently set to 180x180
- Visit each of these image pages and read its "what links here" listing, compiling a list of all linking pages whose titles do not contain a ":" character (which excludes talk pages, pages in other namespaces, etc.)
- Go to the talk page of each of these articles and download it
- If the bot decides it has not visited the talk page before for this image (by checking a local list of pages it has visited and by parsing the talk page), it posts a nag notice saying that the raster in use should be replaced with the SVG if suitable. The bot does not make the change itself, because some SVG images are not equivalent to their raster counterparts (e.g. Image:Human_body_features.png is adaptable, but not automatically replaceable; in theory one could do image comparisons, but that is outside the scope of this bot). It marks the edited talk page as visited for this particular image and will ignore the page if the same page-image pair is met in future.
- The bot will sleep for a delay after each get request (currently 6 seconds)
- The bot will sleep for a delay after each write (currently 100 seconds)
- All cURL calls are checked by return code; the bot aborts if any non-zero exit code is returned from any request (not just writes)
- The bot checks the return code from the Python page-editing script, which in turn traps pywikipedia errors in page.put and page.get. Non-existent talk pages are skipped. (A rough sketch of the per-image loop appears after this list.)
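A minimal sketch of the per-image loop described above, using the pywikipedia (compat) page interface that the discussion below mentions. The helper name, notice text and in-memory visited set are illustrative assumptions only; the actual bot reads with cURL, parses with sed and grep, and only routes the final write through a small pywikipedia script.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only -- not the bot's actual code.
import time
import wikipedia  # the pywikipedia "compat" framework

SLEEP_READ = 6     # seconds to sleep after each read (as stated above)
SLEEP_WRITE = 100  # seconds to sleep after each write (as stated above)

site = wikipedia.getSite('en', 'wikipedia')
visited = set()    # the real bot persists (image, article) pairs locally

def nag_talk_page(image_title, article_title):
    """Post a reminder on the article's talk page, once per image-article pair."""
    key = (image_title, article_title)
    if key in visited:
        return
    talk = wikipedia.Page(site, 'Talk:' + article_title)
    try:
        text = talk.get()
    except wikipedia.NoPage:
        return  # non-existent talk pages are currently skipped
    finally:
        time.sleep(SLEEP_READ)
    if image_title in text:
        visited.add(key)
        return  # the talk page already mentions this image
    notice = ('\n\n== Vector version available ==\n'
              'An SVG replacement exists for [[:%s]]; please consider '
              'substituting it if suitable. ~~~~' % image_title)
    talk.put(text + notice,
             comment='Noting that a vector version of %s is available' % image_title)
    visited.add(key)
    time.sleep(SLEEP_WRITE)
```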
Discussion
- The bot uses the pywikipedia framework to make writes to Wikipedia. Reading is done with cURL and writing with pywikipedia; the majority of the bot's parsing is done with sed and grep. User A1 (talk) 15:23, 22 April 2009 (UTC)
- Sounds good to me. – Quadell (talk) 19:40, 22 April 2009 (UTC)
- While I don't doubt this would work absolutely fine, I'm surprised by your using a combination of bash and Python. Purely out of curiosity, can I ask why not do it entirely in Python using the libcurl bindings?
- Other comments...
- Why skip non-existent talk pages? People might be watching them from having edited the article.
- You can do a better job of getting main-namespace pages only than skipping any with colons in the title; there are many articles in the main namespace with colons in their titles. You could filter them either using Special:WhatLinksHere with the namespace filter (e.g. http://en.wikipedia.org/w/index.php?title=Special%3AWhatLinksHere&target=File%3AF1+driver+template.gif&namespace=0) or using the API (e.g. http://en.wikipedia.org/w/api.php?action=query&list=imageusage&iutitle=File:F1%20driver%20template.gif&iunamespace=0). (A rough example of the API call is sketched after this comment.)
- Sleeping for 100 seconds is, I think, unnecessary! Most bots will make multiple edits per minute, implementing maxlag to look after the servers. Is there a particular reason for the slow speed?
- [[Sam Korn]] (smoddy) 20:03, 22 April 2009 (UTC)
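As a point of reference, the API query suggested above can be fetched from a script in a few lines. A rough Python 2 sketch using the sample file from the example URL (result-set continuation is omitted):

```python
# Illustrative only: list main-namespace (ns 0) articles that use a file
# via list=imageusage, rather than grepping "what links here" for colons.
import json
import urllib

params = urllib.urlencode({
    'action': 'query',
    'list': 'imageusage',
    'iutitle': 'File:F1 driver template.gif',
    'iunamespace': 0,      # article namespace only
    'iulimit': 500,
    'format': 'json',
})
url = 'http://en.wikipedia.org/w/api.php?' + params
data = json.load(urllib.urlopen(url))
articles = [page['title'] for page in data['query']['imageusage']]
```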
- I used Python because I have only recently started learning it; I am far more familiar with C++ and bash than with Python. I couldn't get curl -F working, so I tried pywikipedia, which worked first time and has better error handling. The Python part is fewer than 200 lines of my own in total -- it is a script that creates a section, taking a text file, a page title and an edit summary on the command line. The low update rate is simply because 100 seconds is a round number; pywikipedia does implement maxlag, which is in addition to the programmed sleeps. The sleep delay is a parameter, so I can change it, but there is also no rush :) A low update rate lets me manually check the bot's work for the first few edits, even when it is running in automatic mode. Once I am happier that it is tested, I'll speed it up. I will implement the improved namespace filtering method -- thanks for the hints.
- There is no good reason to skip non-existent talk pages, in truth, but most articles get a talk page pretty quickly. I can disable the talk page check; I'll write back here when I do (probably not for 48 hours). User A1 (talk) 00:29, 23 April 2009 (UTC)
Approved for trial (25 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. There's really no need to sleep more than a second between reads, or more than ten seconds between writes, unless you just want to for your own reasons. – Quadell (talk) 18:11, 23 April 2009 (UTC)
- Just to make life complex, I nuked the file with the bot password and cannot remember it for the life of me. User:SVnaGBot1 is the replacement. Same code, different channel... User A1 (talk)
- The trial run has been completed; please let me know what you think of the results. In truth the bot mostly completed its run of the 200 images on that page, as many of them do not have replacements. If accepted, I would like to expand the bot to examine more than the top 200 images using Inkwina's code (say the top 1000 or so on en.wiki). User A1 (talk) 10:56, 26 April 2009 (UTC)
- Some more comments:
- Initially the bot trusted "what links here" as authoritative. Now it also scans for the filename in the wikitext, in case the "what links here" data is out of date. User A1 (talk) 11:06, 26 April 2009 (UTC)
- Scanning for the filename's use is as simple as a grep. I truncate the initial File: or Image: prefix to allow matches for template usage, e.g. {{template image|MyImage.png}} (see the sketch below). This can produce a false positive in theory, but the author would have to put "MyImage.png" somewhere in the wikitext where it is not being used as an image. The chance of anyone doing this in the article namespace seems sufficiently low that I think it is safe to accept the occasional false nag as a result. User A1 (talk) 11:07, 26 April 2009 (UTC)
- Having another look at the results, it seems that I have missed any file that uses the shorthand {{vva}} template. I will include this in the next run, when approved. User A1 (talk) 08:45, 27 April 2009 (UTC)
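A rough sketch of the two checks discussed in this thread: the bare-filename scan of the article wikitext (so template-style uses are also caught) and detection of the {{vector version available}} / {{vva}} tag on the image description page. The actual bot does this with grep; the function names here are hypothetical, for illustration only.

```python
# Illustrative helpers only -- the actual bot implements this with sed/grep.
import re

# Matches {{vector version available ...}} and the {{vva}} shorthand.
VVA_RE = re.compile(r'\{\{\s*(?:vector[ _]version[ _]available|vva)\s*[|}]',
                    re.IGNORECASE)

def has_vector_version(image_page_text):
    """True if the image description page carries a 'vector version available' tag."""
    return VVA_RE.search(image_page_text) is not None

def image_is_used(image_title, article_wikitext):
    """Substring check for the bare file name, without the File:/Image: prefix,
    so uses inside templates such as {{template image|MyImage.png}} also match."""
    bare = re.sub(r'^(?:File|Image):', '', image_title)
    # MediaWiki treats spaces and underscores in file names as equivalent.
    normalise = lambda s: s.replace('_', ' ')
    return normalise(bare) in normalise(article_wikitext)
```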
- Should this add a category? Something like Category:Articles that should use SVG images instead of PNG images? – Quadell (talk) 13:38, 27 April 2009 (UTC)
- I think the "Hello!" at the beginning isn't useful. – Quadell (talk) 13:38, 27 April 2009 (UTC)
- I can remove the "Hello!" if it's not wanted -- dead easy :) What is the purpose of the category? The bot's progress can be tracked via its contribs, and adding a category just means that users on talk pages have to navigate around it at the end of the page (or at the front, which is probably easier). I guess I am not clear as to the purpose of categorising the talk page. User A1 (talk) 13:50, 27 April 2009 (UTC)
- Oh, just to make it easier for human volunteers to go through those. Anyone else have an opinion on this? – Quadell (talk) 14:56, 27 April 2009 (UTC)
- Guess not. User A1, you can add a category, or not, at your own discretion. – Quadell (talk) 13:29, 29 April 2009 (UTC)
Approved. Looks good. – Quadell (talk) 13:29, 29 April 2009 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.