User:Faebot/Geograph

Faebot has a scope that covers categorization and fixes to Geograph uploads (a UK and Ireland project). As projects come up that may require more explanation or a consensus for changes, I will include a summary here.

If you have suggestions for improvement or would like to raise issues, please do so on my talk page or by email rather than here. -- (talk) 10:37, 10 October 2012 (UTC)

Project A: Geograph user categorization

The following table shows the hidden user categories into which Faebot is sorting Geograph images. These categories have several benefits: they make it possible to follow an interesting photographer for related images (such as postboxes, train stations or wildlife), which someone else categorizing images may find useful to add to, and they make it easier to fix licensing problems that affect the same Geograph uploader.

As a semi-manual check, the categories are created manually as Faebot is generating them. This means they may be created a few days in advance or shortly after the categories start getting added to images. See below for an analysis of potential duplicates. -- (talk) 09:51, 7 October 2012 (UTC)

User categories for Geograph images

Analysis of potential duplicate categories

As there is no firm consensus of when to use the words "Files", "Images" or "Photographs" in a user category of images, some duplication may occur. The following analysis was done to compare possible duplication when creating "Images by" hidden categories for Geograph uploaders. -- (talk) 09:51, 7 October 2012 (UTC)

Analysis of duplicate and potential duplicate categories:

Project B: Categorizing by decade and year

Faebot is running a Python script to populate Category:Geograph images by year, which has child categories for each decade and then each year. The benefits include:

  • ensuring that pre-1940s images say a bit more than just the name of the Geograph uploader, who is unlikely to be the original photographer and so the license may require context.
  • the ability for viewers to follow a particular photographer by year, including showing their photographs on a map. Particularly useful to identify high quality images in certain locations compared to the 'norm' on Geograph.
  • the ability to find images by decade or year to assist with other more detailed categorization such as transport by year, or county/place by year.
  • the ability to find images of places before redevelopment or for images of buildings now demolished (and potentially work out which year they were demolished in).
Script detail

The script uses a local dump of Commons pages to avoid strain on the Wikimedia servers. The python script gets tweaked to fit, but here is an example populating the 1950s, 1960s and 1980s ("o989455922912834566234o" is just a freakishly unlikely random string to avoid problems with back-references in the regex):


import subprocess

# Unlikely random string placed between a back-reference and the year digits,
# so that "\2" followed by e.g. "1950" is not read as a different group number.
marker = r'o989455922912834566234o'
# Base call to replace.py, reading File: pages (namespace 6) from a local dump.
omycall = ["python", 'replace.py', '-xml://Volumes/<local location>/commonswiki.xml', '-namespace:6', '-dotall', '-regex']
for decade in ['5', '6', '8']:
	mycall = omycall[:]
	for y in '0123456789':
		year = '19' + decade + y
		cat = r"ategory:" + year + r" Geograph images"
		# Source field before Date field: tag the year and append the year category.
		mycall.append(r'([Ss]ource\s*=.*?www\.geograph\.org\.uk.*?[Dd]ate\s*=)\s*([^\n]*?)\b' + year + r'(\b.*$)')
		mycall.append(r"\1\2" + marker + year + r"\3\n[[Category:" + year + r" Geograph images]]")
		# Date field before Source field.
		mycall.append(r'([Dd]ate\s*=)\s*([^\n]*?)\b' + year + r'(\b.*?[Ss]ource\s*=.*?www\.geograph\.org\.uk.*$)')
		mycall.append(r"\1\2" + marker + year + r"\3\n[[Category:" + year + r" Geograph images]]")
		# Drop a second copy of the same year category.
		mycall.append(r'(\[\[[Cc]' + cat + r'\]\].*)\n\[\[[Cc]' + cat + r'\]\]')
		mycall.append(r'\1')
	# Finally strip the marker again.
	mycall.append(marker + r'(.)')
	mycall.append(r'\1')
	mycall.append('-summary:Add to [[Category:19' + decade + '0s Geograph images]]')
	subprocess.call(mycall)
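
To see what one pattern pair from the script does to a single page, here is a minimal sketch using Python's own re module rather than replace.py. The function name add_year_category and the sample page text are made up for illustration; only the pattern, the replacement and the marker string come from the script. The marker also ends in a letter, so a year that has already been tagged no longer sits on a word boundary and should not be picked up again by the second pattern.

```python
import re

MARKER = "o989455922912834566234o"

def add_year_category(text, year):
    """Append a 'YYYY Geograph images' category if the Date field names that year."""
    pattern = (r'([Ss]ource\s*=.*?www\.geograph\.org\.uk.*?[Dd]ate\s*=)\s*([^\n]*?)\b'
               + year + r'(\b.*$)')
    # Without MARKER, r'\2' followed by '1957' would be read as group 21.
    replacement = (r'\1\2' + MARKER + year + r'\3\n[[Category:' + year
                   + r' Geograph images]]')
    text = re.sub(pattern, replacement, text, flags=re.DOTALL)
    # Final pass: strip the marker again, as the last pair in the script does.
    return re.sub(MARKER + r'(.)', r'\1', text)

# Hypothetical page text, not taken from a real file page.
sample = ("|Source=http://www.geograph.org.uk/photo/123\n"
          "|Date=12 June 1957\n|Author=Example\n}}\n[[Category:Trains]]")
result = add_year_category(sample, "1957")
print(result)
```

A page without a geograph.org.uk source is returned unchanged, because the pattern does not match.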

Progress
  • 1930s to 1980s ✓ Done
  • 1990s ✓ Done
  • 2000s - will consider the option of further breaking into month subcats. 2000 and 2001 under way rather than doing the whole decade in one mouthful as there are likely to be large numbers involved. Working

Project C: Geograph regional categorization (London borough / Ireland county / Scotland council area)

For background discussion see Commons:Bots/Work_requests/Archive 7#Project C: Adding UK counties/district categories and test reports at User:Faebot/SandboxG (general sample of regional categorization from Google Maps data) and User:Faebot/SandboxL (London related tests).

Stage 1 will be to test out the concepts and then run categorization for Geograph images geotagged in Greater London (as defined by all London boroughs). It is then thought that the rest of England & Wales, Scotland and Ireland will follow as separate later stages. The project is expected to take several months.

Example Python source code can be found at User:Faebot/Geograph/Code.

Benefits

‡ Note that by "county" I mean the "second level administrative area", which is the most appropriate regional breakdown after country. In England this is (often) called a county, in Wales there are principal areas, in Scotland this is the council area and in Ireland this is the county. As these are political boundaries, they are subject to change but have reasonable stability over time. The boundary definitions used for this project on Commons will be the most pragmatic ones available from the on-line databases. Where there are sufficiently good reasons to do so, a breakdown to the "third level" may be done, as has been done down to borough level for London.

Issues
  • On 2 November 2012 the source website for OS data using xml queries, http://www.uk-postcodes.com, was closed down, apparently indefinitely [a day later it was up again, but I have lost confidence in it as an available source]. This will mean testing out another site or re-writing the scripts, so further updates will be delayed; however, they will be able to re-start where they left off.
Alternative of using MapIt

I will probably switch over to MapIt, run by mySociety, which is free to use and is in turn underpinned by Ordnance Survey open data. I have some tests working (I was trying to avoid using JSON calls as I find these harder to work with in Python than xml) but need to revise and test out the scripts properly. Example test results:

  • The categorization for Dumfries and Galloway has been restarted with mysociety.org JSON data rather than uk-postcodes.com xml data. The OS fields being used are Unitary Authority and UK Parliament constituency, which appear to be the most fitting, though this may vary; in particular, special fields would need to be chosen for handling London.
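
As a rough illustration of reading the wanted field out of a MapIt-style JSON reply, here is a minimal sketch. It assumes the point lookup URL takes longitude then latitude for SRID 4326 and that the reply is a dictionary of areas, each with "name" and "type" keys; the sample reply, its ids, the "WMC" type code and the coordinates are invented for the example, and no network call is made.

```python
import json

# Assumed URL shape for a MapIt point lookup (x = longitude, y = latitude).
MAPIT_POINT_URL = "https://mapit.mysociety.org/point/4326/{lon},{lat}"

def mapit_url(lat, lon):
    """Build the lookup URL for a coordinate pair."""
    return MAPIT_POINT_URL.format(lon=lon, lat=lat)

def area_name(areas, wanted_type):
    """Return the name of the first area with the wanted type code, else None."""
    for area in areas.values():
        if area.get("type") == wanted_type:
            return area.get("name")
    return None

# Hand-written sample in the assumed reply shape; ids are placeholders.
sample_response = json.loads("""
{
  "1": {"id": 1, "name": "Dumfries and Galloway Council",
        "type": "UTA", "type_name": "Unitary Authority"},
  "2": {"id": 2, "name": "Dumfries and Galloway",
        "type": "WMC", "type_name": "UK Parliament constituency"}
}
""")

print(mapit_url(54.9, -5.0))
print(area_name(sample_response, "UTA"))
```

Returning None for a missing type leaves it to the caller to fall back on another source or skip the image.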

Summary

Country | Region | Project | Status
England | Greater London | #C1 | ✓ Done
England | Cornwall | #C2b | ✓ Done
England | Devon | #C2b | ✓ Done
England | Dorset | #C2b | ✓ Done
England | Hampshire | #C3c | ✓ Done
England | Somerset | #C2b | ✓ Done
England | West Midlands (county) | #C3b | ✓ Done
England | Isle of Wight | #C3c | ✓ Done
England | South East: Kent, Medway, East Sussex, Brighton and Hove, West Sussex, Surrey | #C4a | Working
Wales | (all) | #C3d | Working
Scotland | Orkney Islands | #C2a | ✓ Done
Scotland | Shetland Islands | #C2a | ✓ Done
Scotland | Dumfries and Galloway | #C3a | ✓ Done

Pause for retesting on 1 December 2012 (restart on 20 December)

I have paused the categorization, hopefully for just a few days, to enable some retesting and a quality check. Open Street Map seems to be introducing most of the faulty matches, whilst the Ordnance Survey data underpinning MapIt seems far more reliable, probably giving an error rate below 0.01%. A simple change to the logic, so that we use MapIt first with OSM and Google Maps as the backup, may reduce the current error rate significantly; from <0.15% to <0.01% perhaps? The remaining errors may come from the way Geograph images have been processed rather than from the OS data being used (if someone could work this out at some point, it might be a useful improvement to Geograph).

The test sample is to cover the region around Monmouthshire/Newport/Torfaen and something of Gloucestershire which might also recategorize/improve some of the Wales work already done and on pause.

Step 1
Calculate bounding box and create source file
  • Bounding box for -3.197,51.479,-2.373,51.944
  • Matching regex to create source file: python replace.py -xml:"//Volumes/Fae_32GB/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*(51\.[5-8]|51\.479|51\.48|51\.9[0-4])\d*\s*\|\s*(-2\.[4-9]|-2\.37[3-9]|-3\.[0-1]).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*(51\.[5-8]|51\.479|51\.48|51\.9[0-4])\d*\s*\|\s*(-2\.[4-9]|-2\.37[3-9]|-3\.[0-1]))' '\1FAEBOT-marker-FAEBOT' -nocase -savenew:"//Volumes/Fae_32GB/Geograph/GeoboxTestMonmouth.txt" -ns:6
  • 25,300 images were found inside the bounding box and titles saved to a file.
Step 2
Test wanted fields using a soak test

The soak test ran over 4,000 images and provided the following counties in results:

  • [['Herefordshire', 567], ['Gloucestershire', 1326], ['Monmouthshire', 1280], ['Newport', 199], ['Cardiff', 149], ['Torfaen', 80], ['Blaenau Gwent', 24], ['Powys', 117], ['Caerphilly', 103], ['North Somerset', 37], ['Bristol', 115]]

"Bristol City" appeared in the source data and was mapped to "Bristol".

Step 3
Categorize region

The wanted list limits results to be added to:

Any "Geograph images in" categories will be swapped if they exist currently, on the basis that the new logic is going to be more accurate. This should be noticeable for Newport where images have been incorrectly categorized under Monmouthshire, for example this change.
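
A minimal sketch of the swap described above, assuming the hidden categories all follow the "Geograph images in <region>" naming pattern. The function name swap_region_category and the sample page text are illustrative only, and "Newport" is used here simply as the region label.

```python
import re

# Matches any existing "Geograph images in ..." category, with an optional sort key.
GEOGRAPH_REGION_CAT = re.compile(
    r'\[\[[Cc]ategory:Geograph images in [^\]|]+(?:\|[^\]]*)?\]\]\n?')

def swap_region_category(text, region):
    """Remove existing 'Geograph images in' categories and add the new one."""
    stripped = GEOGRAPH_REGION_CAT.sub('', text).rstrip('\n')
    return stripped + '\n[[Category:Geograph images in ' + region + ']]'

# Hypothetical page text.
page = ("{{Geograph|1|Example}}\n"
        "[[Category:Geograph images in Monmouthshire]]\n"
        "[[Category:Bridges]]")
print(swap_region_category(page, "Newport"))
```

Visible categories such as Category:Bridges are left alone; only the hidden regional category is replaced.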

When there are no visible categories, the images are added to (need to be hand checked due to inconsistent naming and potential conflicts with disambiguation pages):

✓ Done 25,300 images checked, with the test set being processed from 8 to 17 December 2012.

Step 4
Analysis of results

Success. The error rate seems to be running at 0.03%, which is only constrained by the accuracy of Ordnance Survey data and unavoidable issues such as where the camera is (giving us the GPS data) and what is being photographed. Refer to User_talk:Fæ/2013#Geograph_again.

Stage 1: London boroughs

✓ Done · Populate London

The purpose of this project is to add Geograph hidden categories identifying images in all 32 London boroughs (plus the City of London) using hidden child categories of Category:Geograph images in London. There will be a test report stage, then a beta test on a few thousand images to demonstrate the concept. The categorization process is expected to run slowly, probably with fewer than 1,000 images being changed per day (Geograph has more than 2,000,000 images; it is not known how many are geotagged in London). Any future re-runs to update the categorization would be rare; no more than once a year would be expected.

The London boroughs are fairly easily defined and relatively neutral in terms of the regional politics of naming, consequently this seemed a good choice for a first stage if the principles and scripts used to run this categorization are to apply to other UK regions. Note, the "County of London" was replaced by "Greater London" in 1965, with the London boroughs being the next sub-division of the region.

Update: Beta test complete on 2,000+ images, and it appears that at least 46,000 images are in the London bounding box (the nearest rectangle that can cover London on the map). Using their given coordinates, these are being checked for borough names against Open Street Map, double checked on http://www.uk-postcodes.com (a front end for OS OpenData), and (where a third opinion is needed) on Google Maps. Images that are not found to be in a named London borough, or where there are too many discrepancies, are left uncategorized.

Bug—26 Oct 2012—Fixed
  • I have noticed one incorrectly categorized file. I have halted the routine that might have been responsible and am checking how many files are affected. I suspect this may have come from an earlier version of the script which I ran on my own image uploads, so I doubt many files have this problem, but I'm being cautious. The example failure is File:St Pauls through Olympic rings.jpg, which has been put under a location category but is not a Geograph file (please don't fix it by hand, I would like the script to recover itself). An initial search seems to show that only a very small percentage of files could be affected; my guess is that this is the result of a missing double check when using a file dump to decide which files to change (having an internet connection that drops out several times an hour is not helping!).
  • All Geograph categorization by Faebot is halted for the moment as a precaution. I believe only a list of files based on Category:Listed buildings in London was affected and these will be easily checked and fixed over the weekend (when I have a bit of time). I do not want to rush a fix without checking the facts and ensuring a fix is tested. 23:23, 26 October 2012 (UTC)
  • I have a fix script doing a trial run without committing changes. If it looks okay, I'll run it and commit changes tomorrow (so long as my modem does not flake out again - spontaneously lost all settings today, so I expect it is suffering from old age hardware problems). 22:32, 27 October 2012 (UTC)
  • ✓ Done If any similar errors are spotted, please raise a note immediately on my talk page. -- (talk) 14:48, 28 October 2012 (UTC)
Sources
Pseudocode
  1. Find likely candidate images from a breakdown of Geograph
  2. Get categories for candidate image using API call [on error: wait, try again using increasingly longer periods]
  3. If candidate image is already categorized against a Geograph borough then next
  4. Get image page text and extract data from Object location dec or Location dec templates [not found: error log, next†]
  5. For each image test if within OSM bounding box [if not: next]
  6. Get OSM address data [on error: wait, retry then add to error log]
  7. Test if the OSM data gives the county as London [if not: next]‡
  8. Map OSM given borough (=locality) to existing Commons category in Category:London boroughs
  9. If the number of visible non-Geograph categories on the image is 0, then
    1. Add an existing, visible, Commons London borough category
    2. If template exists, then remove Uncategorized-Geograph
    3. Add Check categories-Geograph
  10. Add hidden Geograph by London borough category
  11. Write updated image page to Commons
  12. Write record to local log
† - all Geograph images are supposed to be imported as geotagged
‡ - the borough name is checked against another site and if a mismatch a third is then used to create a poll. The resulting borough name should therefore be highly reliable, certainly more than OSM data can provide alone (which may sometimes return a blank, a higher region name - "London", or may appear incorrect compared to the postcode)
  • Note, I have removed dealing with Uncategorized-Geograph templates for a separate exercise.
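
The "wait, try again using increasingly longer periods" handling in step 2 of the pseudocode can be sketched as below. This is only an illustration: the function name call_with_backoff, the delay values and the choice of IOError as the retryable error are assumptions, not taken from the actual script.

```python
import time

def call_with_backoff(fetch, delays=(1, 5, 30), sleep=time.sleep):
    """Call fetch(); on IOError wait for each delay in turn and try again."""
    last_error = None
    # First attempt is immediate, later attempts wait progressively longer.
    for delay in (0,) + tuple(delays):
        if delay:
            sleep(delay)
        try:
            return fetch()
        except IOError as error:
            last_error = error
    # Every attempt failed: hand the last error to the caller (e.g. to log it).
    raise last_error

# Stand-in for an API call that fails twice before succeeding.
attempts = []
def flaky_category_lookup():
    attempts.append(1)
    if len(attempts) < 3:
        raise IOError("temporary failure")
    return ["Category:Example"]

waits = []  # record the waits instead of really sleeping
print(call_with_backoff(flaky_category_lookup, sleep=waits.append), waits)
```

Passing the sleep function in as a parameter keeps the sketch testable without real delays.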

Generating candidate images

The following call quickly generated a file of 2,950 images that were categorized under any category with "Geograph" in the name, and appeared to be inside a bounding box for London using the coordinates in {{Location dec}}.

London search 1
// Find images in geo box: lat > 51.28676 and lat < 51.69188 and lon > -0.51104 and lon < 0.33402
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '[Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3]).*ategory[^\]]+Geograph |ategory[^\]]+Geograph .*[Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3])' 'doesnotmatter' -savenew:"//Volumes/<local>/GeoboxLondonList.txt" -ns:6

This call picks up on the use of the {{Geograph}} template, many of which are not listed in other Geograph categories. This generated 80,538 image file names inside the same London bounding box, representing about 4% of all Geograph images on Commons:

London search 2
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall 'Location dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3]).*\{\{[Gg]eograph\||\{\{[Gg]eograph\| .*[Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3])' 'doesnotmatter' -nocase -savenew:"//Volumes/<local>/GeoboxLondonList.txt" -ns:6

My current view, after some experimentation, is that it is necessary to search the image page both for "geograph.org.uk" and for the template {{Geograph}}; if either one matches, the file can be assumed to be a Geograph project photograph. There are examples of (mostly older) images where no source link is quoted but there is a valid link to the Geograph user page via the template, and there are examples of pages with no Geograph categories where the images are linked back to Geograph as a source. Similarly, to avoid bugs like the one identified above, one or other of these features must be double-checked as being on an image page before a bot makes any assumption about the image being suitable for a Geograph category.

A third search including the Geograph URL and template resulted in 80,746 matches. I have restarted the categorization script based on this new source file.

python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3]).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3]))' '\1FAEBOT-marker-FAEBOT' -nocase -savenew:"//Volumes/<local>/Geograph/Stage1/GeoboxLondonList2.txt" -ns:6
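
The searches above approximate the London bounding box by matching the leading digits of the coordinates, so they are slightly more generous than the box itself. For individual pages, an exact check can be sketched as follows, assuming the coordinates are the first two parameters of {{Location dec}} or {{Object location dec}}. The function names and sample template calls are illustrative; the box limits are the ones quoted for London search 1.

```python
import re

# lat_min, lat_max, lon_min, lon_max as quoted for London search 1.
LONDON_BOX = (51.28676, 51.69188, -0.51104, 0.33402)

LOCATION_DEC = re.compile(
    r'\{\{\s*(?:[Oo]bject )?[Ll]ocation dec\s*\|\s*(-?\d+(?:\.\d+)?)'
    r'\s*\|\s*(-?\d+(?:\.\d+)?)')

def coordinates(text):
    """Return (lat, lon) from the first location template, or None."""
    match = LOCATION_DEC.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

def in_box(lat, lon, box=LONDON_BOX):
    """True if the point lies strictly inside the bounding box."""
    lat_min, lat_max, lon_min, lon_max = box
    return lat_min < lat < lat_max and lon_min < lon < lon_max

# Hypothetical template calls.
print(in_box(*coordinates("{{Location dec|51.5138|-0.0984}}")))
print(in_box(*coordinates("{{Object location dec|51.9|-0.1}}")))
```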

Second run, February 2013

After retesting, there have been several important improvements to the code, so I'm re-running from scratch. This included re-doing the bounding box; the new figure is 81,089 images needing to be checked.

The default categories are:

  1. Category:City of London
  2. Category:City of Westminster
  3. Category:Royal Borough of Kensington and Chelsea
  4. Category:London Borough of Hammersmith and Fulham
  5. Category:London Borough of Wandsworth
  6. Category:London Borough of Lambeth
  7. Category:London Borough of Southwark
  8. Category:London Borough of Tower Hamlets
  9. Category:London Borough of Hackney
  10. Category:London Borough of Islington
  11. Category:London Borough of Camden
  12. Category:London Borough of Brent
  13. Category:London Borough of Ealing
  14. Category:London Borough of Hounslow
  15. Category:London Borough of Richmond upon Thames
  16. Category:Royal Borough of Kingston upon Thames
  17. Category:London Borough of Merton
  18. Category:London Borough of Sutton
  19. Category:London Borough of Croydon
  20. Category:London Borough of Bromley
  21. Category:London Borough of Lewisham
  22. Category:Royal Borough of Greenwich
  23. Category:London Borough of Bexley
  24. Category:London Borough of Havering
  25. Category:London Borough of Barking and Dagenham
  26. Category:London Borough of Redbridge
  27. Category:London Borough of Newham
  28. Category:London Borough of Waltham Forest
  29. Category:London Borough of Haringey
  30. Category:London Borough of Enfield
  31. Category:London Borough of Barnet
  32. Category:London Borough of Harrow
  33. Category:London Borough of Hillingdon

Stage 2: Easy boxes

2a: Orkney Islands and Shetland Islands

✓ Done · Populate Orkney Islands and Shetland Islands

Using the query below for a latitude above 58.7, I get 12,628 images matching (~0.6% of all uploaded Geograph images). There are only the two regions, nicely bounded by the sea, so no issues with complaints about observer vs. the observed object location.

Interestingly, an automated count of the files in Category:Orkney Islands shows 3,146 files and Category:Shetland shows 4,878; a total of 8,024 (this may double count some images and does count files that are not Geograph). This means significantly more than 4,603 from the above 12,600+ Geograph files are not categorized at all under their region; a clear benefit from this categorization project.

Python regex detail for finding photographs in the Orkneys and Shetlands
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '(Location dec\s*\|\s*(58\.[7-9]|59\.|6[01]\.).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*(58\.[7-9]|59\.|6[01]\.))' '\1FAEBOT-MARKER-FAEBOT' -nocase -savenew:"//Volumes/<local>/Geograph/GeoboxVeryNorth.txt" -ns:6


2b: Cornwall, Devon, Somerset, Dorset

Status Working · Populate Cornwall, Devon, Somerset, North Somerset and Dorset ... plus Isles of Scilly

Using this bounding box I generously hit these South West England counties with some overlap with other counties - in fact my matches back from OSM queries look like ['Devon' , 'Somerset' , 'Cornwall' , 'Dorset' , 'Wiltshire' , 'Vale of Glamorgan' , 'South Glamorgan' ] (still checking, will test another 1,000 images). To keep things simple, I'll probably throw away any matches outside of the four completely covered counties. I was hoping to hit Jersey and Guernsey, but these do not seem to be under Geograph. The search (regex) matched exactly 54,000 images, so I may need to run again as the number seems too rounded to be true.

Python regex detail for finding "geograph.org.uk" in South West England
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*(49\.[1-9]|50\.|51\.[0-3])\d*\s*\|\s*(-1\.[8-9]|-[2-6]\.|-7\.0).*[Ss]ource\s*=[^\n]+geograph\.org\.uk|[Ss]ource\s*=[^\n]+geograph\.org\.uk.*[Ll]ocation dec\s*\|\s*(49\.[1-9]|50\.|51\.[0-3])\d*\s*\|\s*(-1\.[8-9]|-[2-6]\.|-7\.0))' 'FAEBOT-MARKER-FAEBOT' -nocase -query:120 -savenew:"//Volumes/<local>/GeoboxCornwallList.txt" -ns:6

A re-run found 147,956 (note, this may not be all potential matching images, I still need to run a query using "{{Geograph|" rather than just looking for pages with reference directly to "geograph.org.uk"). At my current rate of progress (assuming half get matched for saving), I think testing c.148,000 images would take 37 days of continuous processing to complete. This seems doable, though I may need to break up the source list into smaller runnable tranches to avoid slowing down Python by having it all in memory at the same time (running as one large batch, Python was fine :-) ). As the number is large, I'll consider how sub-region categories might be useful, though it might well be pragmatic to do the top level categorization before breaking regions such as Cornwall into something debatable like its parliamentary constituencies.
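
For reference, the throughput implied by that estimate can be worked out as below, assuming the processing really is continuous and evenly spread over the 37 days.

```python
# Figures quoted above: roughly 148,000 images tested over 37 days.
images = 148_000
days = 37

per_day = images / days                      # images tested per day
seconds_per_image = 24 * 60 * 60 / per_day   # average time budget per image

print(per_day, seconds_per_image)
```

That is about 4,000 images a day, or one image roughly every 21.6 seconds.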

I have started a re-run after finding that the Isles of Scilly tend to be unmatched in both the OSM data and the OS data, and are not matched as Cornwall in the Google Maps data (being identified with "Isles of Scilly" as administrative level 2, which is theoretically correct). Politically the Isles currently fall under Cornwall; however, as they are a special case among ceremonial counties, I have separated them out into Category:Geograph images in the Isles of Scilly.

When there are no visible categories, the following apply: Category:Devon, Category:Cornwall, Category:Somerset, Category:North Somerset, Category:Dorset, Category:Isles of Scilly.

Case: Lundy

I noticed this example of odd map data - File:North West Lundy - geograph.org.uk - 15444.jpg. Lundy is shown in an xml query to Open Street Map as being in Pembrokeshire (Wales) while Google Maps and MapIt show it as being in Devon. Oddly when I go directly to OSM and look up Lundy, it is shown as in Devon correctly.

Stage 3: Priority areas

3a: Dumfries and Galloway

✓ Done · Populated Dumfries and Galloway

Werespielchequers raised the problem of photographs in Wigtownshire being categorized in Northern Ireland. Prioritizing Dumfries and Galloway Council would provide a handy category to do a check for images both geo-located in Scotland and with contradicting categories in Northern Ireland.

This is a Council and the source data breaks this into 3 historic counties: Wigtownshire, Kirkcudbrightshire and Dumfriesshire. For the time being these are being mapped back to the Council level for the purposes of the hidden Geograph categorization. Note, Openstreetmap has county identified but the Ordnance Survey query I am currently using is limited to the Westminster Constituency of Dumfries and Galloway. With the OS data appearing more accurate than OSM, there seems good reason to stick to this level.

Python regex detail for Dumfries and Galloway
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*(54\.[5-9]|55\.[0-5])\d*\s*\|\s*(-2\.[6-9]|-[3-4]\.|-5\.[0-5]).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*(54\.[5-9]|55\.[0-5])\d*\s*\|\s*(-2\.[6-9]|-[3-4]\.|-5\.[0-5]))' '\1FAEBOT-marker-FAEBOT' -nocase -savenew:"//Volumes/<local>/Geograph/Stage3/GeoboxDumfriesAndGalloway.txt" -ns:6

The search provided 61,073 images in the bounding box, though this includes a large number of matches with Cumbria, Scottish Borders, South Lanarkshire, East Ayrshire and South Ayrshire. With an initial 2,000+ images checked, positive matches to Dumfries and Galloway look to be around 1/4 (15,000) and time to process these is likely to be at least ten days (taking into account that the Python thread is deliberately slowed down to reduce the number of transactions per day on the source data websites, and is running in parallel with other tasks).

As an example of the benefits of this categorization, with only a few hundred images in the category, 5 were shown as both in Dumfries and Galloway and South Lanarkshire at the same time and after 6,000 images were categorized this had risen to 42 incorrectly categorized images; see Cat Scan.

Case: OpenStreetMap and data reliability

The checks in Dumfries and Galloway yielded a good example case of the OSM reliability problem. Using the uploaded image geodata from File:The scar - loch ryan.jpg, here are the relevant comparisons:

  1. OpenStreetMap gives "county" as South Ayrshire (the neighbouring council area).
  2. Google Maps gives the address as Dumfries and Galloway (and the related JSON query gives "county" as Dumfries and Galloway).
  3. UK-postcodes (from OS data) gives "constituency" as Dumfries and Galloway.
  4. MapIT (from OS data again) gives Dumfries and Galloway Council as the "Unitary Authority (UTA), Scotland".

To be fair, the boundary between Dumfries and Galloway and South Ayrshire is close to this coordinate, however the distance from the boundary is certainly more than 500m (see GMap's boundary line). Consequently OSM must be considered questionable for county boundaries and I have already seen a pattern of significant unreliability for postcode districts.

My script spotted the inconsistency between OSM and UK-postcodes, went to Google Maps for a third opinion (this only happens for inconsistencies), and has consequently correctly categorised the image under Category:Geograph images in Dumfries and Galloway based on the majority.
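
The decision logic in this case can be sketched roughly as follows. This is an illustration of the majority idea only, not the script itself: the function name resolve_region is made up, and the third source is passed as a callable so that it is only queried when the first two disagree.

```python
from collections import Counter

def resolve_region(osm, ordnance_survey, third_opinion):
    """Return the agreed region name, or None if no two sources agree.

    third_opinion is a callable (e.g. a Google Maps lookup) that is only
    invoked when the first two sources are inconsistent.
    """
    if osm and osm == ordnance_survey:
        return osm
    # Blank answers do not get a vote.
    votes = Counter(v for v in (osm, ordnance_survey, third_opinion()) if v)
    if not votes:
        return None
    region, count = votes.most_common(1)[0]
    return region if count >= 2 else None

# The case described above: OSM disagrees with the OS-based source.
print(resolve_region("South Ayrshire", "Dumfries and Galloway",
                     lambda: "Dumfries and Galloway"))
```

When all three sources differ the function returns None, which corresponds to leaving the image uncategorized.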

3b: West Midlands (county)

✓ Done · Populated West Midlands (county)

The West Midlands region is composed of Herefordshire, Shropshire, Staffordshire, Warwickshire and Worcestershire along with the city conurbation. Photographs in the region are often mis-categorized.

Python regex detail
python replace.py -xml:"//Volumes/Fae_32GB/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*(52\.3[4-9]|52\.[4-5]|52\.6[0-7])\d*\s*\|\s*(-2\.2[2-9]|-2\.[0-1]|-1\.[5-9]|-1\.4[2-9]|-1\.41[8-9]).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*(52\.3[4-9]|52\.[4-5]|52\.6[0-7])\d*\s*\|\s*(-2\.2[2-9]|-2\.[0-1]|-1\.[5-9]|-1\.4[2-9]|-1\.41[8-9]))' '\1FAEBOT-marker-FAEBOT' -nocase -savenew:"//Volumes/Fae_32GB/Geograph/Stage3/GeoboxWestMidlands2.txt" -ns:6

The regex above gives 24,510 matching images in the bounding box (just over half are likely to be matches based on the difficult boundary shape).

Getting a match to the ceremonial county of "West Midlands" is a bit tricky. At first I tried the following fields:

  • (gmapWanted, mapitWanted, osmWanted)=("administrative_area_level_2", "European region", "state_district")

However this gave matches of only East Midlands versus West Midlands and considering the bounding box, this means the matches were for the West Midlands region rather than county. I then examined:

  • (gmapWanted, mapitWanted, osmWanted)=("administrative_area_level_3","Middle Layer Super Output Area (Generalised)","city")

This gives the next level down which would need to be mapped to the ceremonial county. I would expect to map Birmingham, Coventry, Wolverhampton, Dudley, Sandwell, Solihull, and Walsall to the West Midlands (county) but I'm running a soak test to see what the data actually provides.

The matches in the bounding box for the first 1,000 images [<region>,<number of matches>] are:

[['Cannock Chase', 6], ['South Staffordshire', 20], ['Solihull', 82], ['Wolverhampton', 35], ['Sandwell', 37], ['Birmingham', 154], ['North Warwickshire', 91], ['Dudley', 58], ['Walsall', 42], ['Bromsgrove', 135], ['Nuneaton and Bedworth', 16], ['Warwick District', 60], ['Lichfield', 35], ['Coventry', 63], ['Tamworth', 25], ['Wyre Forest', 72], ['North Warwickshire District', 2], ['Stratford-on-Avon District', 5], ['Hinckley and Bosworth', 17], ['Lichfield District', 4], ['Warwick', 6], ['Tamworth District', 1], ['Solihull District', 3], ['North West Leicestershire', 3], ['Rugby District', 24], ['Wychavon District', 1], ['South Staffordshire District', 1], ['Nuneaton and Bedworth District', 1]]

This routine will match all images with region names in bold and categorize them under Category:Geograph images in the West Midlands (county). The oddity of Solihull District appears to be another name for the Metropolitan Borough of Solihull, though this should have no bearing on the validity of the mapping to West Midlands (county).
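
A minimal sketch of mapping the returned names to the county category, assuming the wanted names are the seven boroughs listed earlier and that a trailing " District" can be dropped before comparing (as the Solihull District case suggests). The helper names are illustrative and the rows are a few entries copied from the soak test results above.

```python
WEST_MIDLANDS_BOROUGHS = {
    "Birmingham", "Coventry", "Dudley", "Sandwell",
    "Solihull", "Walsall", "Wolverhampton",
}
COUNTY_CATEGORY = "Category:Geograph images in the West Midlands (county)"

def normalise(name):
    """Drop a trailing ' District' so that name variants compare equal."""
    suffix = " District"
    return name[:-len(suffix)] if name.endswith(suffix) else name

def county_category(name):
    """Return the county category for a wanted borough, else None (discard)."""
    if normalise(name) in WEST_MIDLANDS_BOROUGHS:
        return COUNTY_CATEGORY
    return None

# A few rows from the soak test results above.
soak = [["Solihull", 82], ["Birmingham", 154], ["Bromsgrove", 135],
        ["Solihull District", 3], ["Warwick District", 60]]
matched = sum(count for name, count in soak if county_category(name))
print(matched, "of", sum(count for _, count in soak), "sample images would be categorized")
```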

3c: Hampshire and Isle of Wight

✓ Done · Populate Hampshire and the Isle of Wight

This bounding box gets 104,210 Geograph image matches on Commons.

For mapping, I'm using the fields from Google Maps, MapIt and Open Street Map of:

  • gmapWanted, mapitWanted, osmWanted = "administrative_area_level_2", "County council", "county"

A sample of 2,000 images from the bounding box gives results: [['Hampshire', 1288], ['West Sussex', 226], ['Surrey', 122], ['Berkshire', 74], ['Isle of Wight', 117], ['Wiltshire', 54], ['Dorset', 113], ['Wokingham', 5], ['Royal Borough of Windsor and Maidenhead', 2]]; which seems good enough to separate Hampshire from the Isle of Wight at least. As previously, unwanted places will be discarded rather than categorizing partial areas, to avoid later confusion.

If these are a good pattern, this mapping of fields might be okay for all other English counties where the common use of 'county' matches the ceremonial county, which seems the case with Hampshire.

3d: Wales

✓ Done · Populate subcategories of Category:Geograph images in Wales
Breakdown of unitary authorities of Wales.

I'm going for a much larger region, the whole of Wales, in one chunk. A preliminary test of 2,000+ (out of 178,000) images, with a few mapping tweaks for name variations, gives me matches for:

Anglesey, Blaenau Gwent, Bridgend, Caerphilly, Cardiff, Carmarthenshire, Ceredigion, Conwy, Denbighshire, Flintshire, Gwynedd, Merthyr Tydfil, Monmouthshire, Neath Port Talbot, Newport, Pembrokeshire, Powys, Rhondda Cynon Taf, Swansea, Torfaen, Vale of Glamorgan, Wrexham

If there are some poor name choices here, behaviour should be consistent, so it should be straightforward to move images to better-named categories, or merge them, should corrections be needed.

Based on advice from Nilfanion, I'm not taking this direction; parked.
Breakdown of the 13 Historic Counties

I am rethinking these categories to roll up to the 13 historic counties of Wales:

  1. Monmouthshire
  2. Glamorganshire
  3. Carmarthenshire
  4. Pembrokeshire
  5. Cardiganshire
  6. Brecknockshire
  7. Radnorshire
  8. Montgomeryshire
  9. Denbighshire
  10. Flintshire
  11. Merionethshire
  12. Caernarfonshire
  13. Anglesey

Now re-running. Based on comments raised during testing, the additional mapping of categories was added:

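# Extra name mappings used when choosing a visible category:
# (matched name, Commons category name)
mappings2 = [
    ("Caerphilly", "Caerphilly County Borough"),
    ("Newport", "Newport, Wales"),
]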

This does not change the (hidden) Geograph category but is used to choose a general visible category where no current categories exist on an image.
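As a minimal sketch of that behaviour, assuming the mapping is only consulted when an image has no visible categories yet: the function name `categories_to_add` and its arguments are hypothetical and not taken from Faebot's script.

```python
# Illustrative only; names other than mappings2 are assumptions.
mappings2 = [
    ("Caerphilly", "Caerphilly County Borough"),
    ("Newport", "Newport, Wales"),
]


def categories_to_add(region, existing_visible_cats, mappings=mappings2):
    """Return the category titles to add for an image matched to `region`."""
    # The hidden Geograph category is not affected by the extra mapping.
    cats = ["Category:Geograph images in " + region]
    # A general visible category is only chosen when the image has none yet.
    if not existing_visible_cats:
        visible = dict(mappings).get(region, region)
        cats.append("Category:" + visible)
    return cats
```

For example, an uncategorized Newport image would receive Category:Newport, Wales as its visible category, while an image that already has visible categories would only gain the hidden Geograph category.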

The new run starts from scratch, so old images are being moved to (hopefully) more accurate categories where previously Open Street Map or Google Maps were overriding Ordnance Survey (in practice the most accurate source) or causing the image to be skipped. Examples [1], [2], [3] and [4]; this last one is a classic case of a photograph at a county boundary.

I have been forced to add Category:Geograph images in Wirral West as an extra option as these were being forced into Flintshire by Google Maps if the OS data was ignored. Example
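The source priority described above, and the reason for the Wirral West addition, can be illustrated with a small sketch. The function `pick_region` is hypothetical, and the fallback order after Ordnance Survey (Open Street Map, then Google Maps) is an assumption, not something stated on this page.

```python
def pick_region(os_name=None, osm_name=None, gmap_name=None, wanted=()):
    """Return the first wanted region name, trying Ordnance Survey first.

    The order of the two fallbacks is an assumption for illustration.
    """
    for name in (os_name, osm_name, gmap_name):
        if name and name in wanted:
            return name
    return None  # no usable match, so the image would be skipped


# Without Wirral West in the wanted set, the OS answer is ignored and the
# Google Maps answer wins; adding it lets the OS answer take effect.
before = pick_region(os_name="Wirral West", gmap_name="Flintshire",
                     wanted={"Flintshire"})
after = pick_region(os_name="Wirral West", gmap_name="Flintshire",
                    wanted={"Flintshire", "Wirral West"})
```

Here `before` is "Flintshire" and `after` is "Wirral West", which mirrors the problem described for images on the Wirral side of the boundary.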

Stage 4: England


Refer to Metropolitan and non-metropolitan counties of England.

4a: South East

Status ✓ Done · Populate Category:Geograph images in Medway, Category:Geograph images in Kent, Category:Geograph images in East Sussex, Category:Geograph images in Brighton and Hove, Category:Geograph images in West Sussex and Category:Geograph images in Surrey

With London and Hampshire done, the South East is a logical next step.

OSM bounding box: -1.077,50.716,1.524,51.5

This bounding box yielded over 218,400 image matches. However, this includes most of South London, which will be excluded, probably something on the order of 30,000 photos. On starting the process I note that chunks of Hampshire and even Oxfordshire are also covered; these will be skipped, so they will also reduce the final total categorized.
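A bounding box test of this kind might look like the sketch below. It assumes the four numbers are in the usual OSM order of minimum longitude, minimum latitude, maximum longitude, maximum latitude, which is consistent with the values given; `BBox` and `parse_bbox` are hypothetical names, and the coordinates in the comments are approximate.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """A rectangular 'net' in decimal degrees (assumed order: W, S, E, N)."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def contains(self, lat, lon):
        """True if the point lies inside or on the edge of the box."""
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


def parse_bbox(text):
    """Parse a 'min_lon,min_lat,max_lon,max_lat' string into a BBox."""
    return BBox(*(float(part) for part in text.split(",")))


south_east = parse_bbox("-1.077,50.716,1.524,51.5")

in_kent = south_east.contains(51.28, 1.08)            # roughly Canterbury
in_south_london = south_east.contains(51.37, -0.10)   # roughly Croydon
in_scotland = south_east.contains(55.95, -3.19)       # roughly Edinburgh
```

The South London point falls inside the box, which is why those images turn up in the match count even though they are later skipped.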

The counties are:

  1. Category:Medway
  2. Category:Kent
  3. Category:East Sussex
  4. Category:Brighton and Hove
  5. Category:West Sussex
  6. Category:Surrey

4b: Middle England

Status Working · Category:Geograph images in England

This is a wide swathe of the middle of England, taking a generalized approach to running a wider net. One advantage (to the bot operator) is that Faebot can be left to churn through this large net for several months without needing intervention.

Category mapping test:

OS name | Commons name
The City of Brighton and Hove | Brighton and Hove
Bath & North East Somerset | Bath and North East Somerset
West Berkshire | Berkshire
Newbury | Berkshire
South Gloucestershire | Gloucestershire
Cheshire East | Cheshire
Cheshire West and Chester | Cheshire
Warrington | Cheshire
Sheffield | South Yorkshire
Medway | Kent
Blackburn with Darwen | Lancashire
Telford and Wrekin | Shropshire
Knowsley | Merseyside
Stoke-on-Trent | Staffordshire
Southend-on-Sea | Essex
Halton-with-Aughton | Lancashire
Halton East | North Yorkshire
Halton Gill | North Yorkshire
Halton Holegate | Lincolnshire
Halton Lea Gate | Northumberland
Halton West | North Yorkshire
Halton | Cheshire
Swindon | Wiltshire
Yorkshire and the Humber | West Yorkshire
Derby | Derbyshire

# Geograph cat Visible cat (optional)
1 Category:Geograph images in Berkshire Category:Berkshire
2 Category:Geograph images in Kent Category:Kent
3 Category:Geograph images in North Somerset Category:North Somerset
4 Category:Geograph images in Somerset Category:Somerset
5 Category:Geograph images in Staffordshire Category:Staffordshire
6 Category:Geograph images in Gloucestershire Category:Gloucestershire
7 Category:Geograph images in Lancashire Category:Lancashire
8 Category:Geograph images in Shropshire Category:Shropshire
9 Category:Geograph images in Herefordshire Category:Herefordshire
10 Category:Geograph images in Worcestershire Category:Worcestershire
11 Category:Geograph images in Nottinghamshire Category:Nottinghamshire
12 Category:Geograph images in Warwickshire Category:Warwickshire
13 Category:Geograph images in Warrington Category:Warrington
14 Category:Geograph images in Merseyside Category:Merseyside
15 Category:Geograph images in West Midlands Category:West Midlands
16 Category:Geograph images in West Yorkshire Category:West Yorkshire
17 Category:Geograph images in Yorkshire and the Humber Category:Yorkshire and the Humber
18 Category:Geograph images in North Somerset Category:North Somerset
19 Category:Geograph images in Cheshire Category:Cheshire
20 Category:Geograph images in Oxfordshire Category:Oxfordshire
21 Category:Geograph images in Essex Category:Essex
22 Category:Geograph images in Cambridgeshire Category:Cambridgeshire
23 Category:Geograph images in Derbyshire Category:Derbyshire
24 Category:Geograph images in Swansea Category:Swansea
25 Category:Geograph images in South Yorkshire Category:South Yorkshire
26 Category:Geograph images in Hampshire Category:Hampshire
27 Category:Geograph images in Denbighshire Category:Denbighshire
28 Category:Geograph images in North Yorkshire Category:North Yorkshire
29 Category:Geograph images in East Sussex Category:East Sussex
30 Category:Geograph images in Norfolk Category:Norfolk
31 Category:Geograph images in Greater Manchester Category:Greater Manchester
32 Category:Geograph images in Leicestershire Category:Leicestershire
33 Category:Geograph images in Wiltshire Category:Wiltshire
34 Category:Geograph images in Dorset Category:Dorset
35 Category:Geograph images in Blackburn with Darwen Category:Blackburn with Darwen
36 Category:Geograph images in Telford and Wrekin Category:Telford and Wrekin
37 Category:Geograph images in Northamptonshire Category:Northamptonshire
38 Category:Geograph images in Stoke-on-Trent Category:Stoke-on-Trent
39 Category:Geograph images in Buckinghamshire Category:Buckinghamshire
40 Category:Geograph images in Southend-on-Sea Category:Southend-on-Sea
41 Category:Geograph images in Suffolk Category:Suffolk
42 Category:Geograph images in Halton-with-Aughton Category:Halton-with-Aughton
43 Category:Geograph images in Halton East Category:Halton East
44 Category:Geograph images in Halton Gill Category:Halton Gill
45 Category:Geograph images in Halton Holegate Category:Halton Holegate
46 Category:Geograph images in Halton Lea Gate Category:Halton Lea Gate
47 Category:Geograph images in Halton West Category:Halton West
48 Category:Geograph images in Halton Category:Halton
49 Category:Geograph images in Swindon Category:Swindon
50 Category:Geograph images in Hertfordshire Category:Hertfordshire
51 Category:Geograph images in Bath and North East Somerset Category:Bath and North East Somerset
52 Category:Geograph images in the East Riding of Yorkshire Category:East Riding of Yorkshire
53 Category:Geograph images in Bristol Category:Bristol
54 Category:Geograph images in York Category:York
55 Category:Geograph images in Bury Category:Bury
56 Category:Geograph images in Knowsley Category:Knowsley
57 Category:Geograph images in Sheffield Category:Sheffield
58 Category:Geograph images in Derby Category:Derby
59 Category:Geograph images in Newbury Category:Newbury
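In the OS name mapping above, the more specific Halton names are listed before the plain "Halton" entry. One plausible reason is that names are tested in order and the first match wins. The sketch below assumes first-match prefix matching, which this page does not confirm; `OS_TO_COMMONS` and `commons_name` are hypothetical names and only a subset of the table is included.

```python
# A subset of the mapping table above, kept in the same order.
OS_TO_COMMONS = [
    ("The City of Brighton and Hove", "Brighton and Hove"),
    ("Bath & North East Somerset", "Bath and North East Somerset"),
    ("Halton-with-Aughton", "Lancashire"),
    ("Halton East", "North Yorkshire"),
    ("Halton Lea Gate", "Northumberland"),
    ("Halton", "Cheshire"),
    ("Medway", "Kent"),
]


def commons_name(os_name, table=OS_TO_COMMONS):
    """Map an OS region name to a Commons name; the first matching prefix wins."""
    for prefix, target in table:
        if os_name.startswith(prefix):
            return target
    return os_name  # names without a mapping are used unchanged


# If the plain "Halton" entry came first, it would shadow the specific ones.
wrong_order = [("Halton", "Cheshire"), ("Halton East", "North Yorkshire")]
shadowed = commons_name("Halton East", wrong_order)
```

Under this assumption the order of the rows matters: with "Halton" listed first, Halton East would be wrongly mapped to Cheshire.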

Stage 5: Scotland

Status Working · Category:Geograph images in Scotland

In the same way as Wales, I have decided to try doing Scotland in one big gulp. There are 541,081 titles matched in a bounding box around Scotland, so other areas are partly covered. I would estimate more than 450,000 of these are likely to be within Scotland after analysis.

Here's the mini milestone plan:

  • Step 1 - grab list of candidate images. ✓ Done
  • Step 2 - sample data to check for OS region naming, small test runs. ✓ Done
  • Step 3 - review list of regions to be matched. ✓ Done
  • Step 4 - review large test run. ✓ 4,500 completed. Apart from the 'net' capturing a fair chunk of Northern Ireland (those images are noted in my terminal window but no changes are made on-wiki), this seems to have run without any issues.
  • Step 5 - monitor run and fix any bugs (likely to take 1 or 2 months to complete). Working

Region names to test for. Note that a few bordering regions are listed towards the end as wanted; this is not a mistake, just a convenience that overlaps with the England categorization work:

OS name | Commons name
City of Edinburgh | Edinburgh
Dundee City | Dundee
Glasgow City | Glasgow
Highland | Highland (council area)
Aberdeen City | Aberdeen
Stirling | Stirling council area
Shetland Islands | Shetland
Borough Edinburgh | Edinburgh
Sunderland | Tyne and Wear

# Geograph cat Visible cat (optional)
1 Category:Geograph images in Dumfries and Galloway Category:Dumfries and Galloway
2 Category:Geograph images in West Lothian Category:West Lothian
3 Category:Geograph images in Edinburgh Category:Edinburgh
4 Category:Geograph images in Dundee Category:Dundee
5 Category:Geograph images in Glasgow Category:Glasgow
6 Category:Geograph images in the Highland (council area) Category:Highland (council area)
7 Category:Geograph images in Aberdeenshire Category:Aberdeenshire
8 Category:Geograph images in Aberdeen Category:Aberdeen
9 Category:Geograph images in Na h-Eileanan an Iar Category:Na h-Eileanan an Iar
10 Category:Geograph images in the Scottish Borders Category:Scottish Borders
11 Category:Geograph images in the Stirling council area Category:Stirling council area
12 Category:Geograph images in Fife Category:Fife
13 Category:Geograph images in Perth and Kinross Category:Perth and Kinross
14 Category:Geograph images in Angus Category:Angus
15 Category:Geograph images in East Renfrewshire Category:East Renfrewshire
16 Category:Geograph images in Renfrewshire Category:Renfrewshire
17 Category:Geograph images in South Lanarkshire Category:South Lanarkshire
18 Category:Geograph images in West Dunbartonshire Category:West Dunbartonshire
19 Category:Geograph images in East Dunbartonshire Category:East Dunbartonshire
20 Category:Geograph images in Argyll and Bute Category:Argyll and Bute
21 Category:Geograph images in Moray Category:Moray
22 Category:Geograph images in South Ayrshire Category:South Ayrshire
23 Category:Geograph images in North Ayrshire Category:North Ayrshire
24 Category:Geograph images in East Ayrshire Category:East Ayrshire
25 Category:Geograph images in Shetland Category:Shetland
26 Category:Geograph images in Midlothian Category:Midlothian
27 Category:Geograph images in East Lothian Category:East Lothian
28 Category:Geograph images in the Orkney Islands Category:Orkney Islands
29 Category:Geograph images in Inverclyde Category:Inverclyde
30 Category:Geograph images in North Lanarkshire Category:North Lanarkshire
31 Category:Geograph images in Clackmannanshire Category:Clackmannanshire
32 Category:Geograph images in Falkirk Category:Falkirk
1 Category:Geograph images in Cumbria Category:Cumbria
2 Category:Geograph images in Northumberland Category:Northumberland
3 Category:Geograph images in North Yorkshire Category:North Yorkshire
4 Category:Geograph images in the East Riding of Yorkshire Category:East Riding of Yorkshire
5 Category:Geograph images in Tyne and Wear Category:Tyne and Wear
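Some of the Geograph categories above take a definite article ("in the Scottish Borders") while the matching visible category does not. A minimal sketch of building both titles from an OS region name follows; the names `OS_TO_COMMONS`, `TAKES_ARTICLE` and `category_pair` are hypothetical, and the entries are copied from the tables above.

```python
# OS name -> Commons name, from the mapping table above (Scottish rows only).
OS_TO_COMMONS = {
    "City of Edinburgh": "Edinburgh",
    "Dundee City": "Dundee",
    "Glasgow City": "Glasgow",
    "Highland": "Highland (council area)",
    "Aberdeen City": "Aberdeen",
    "Stirling": "Stirling council area",
    "Shetland Islands": "Shetland",
}

# Commons names whose Geograph category is written with "the" in the list above.
TAKES_ARTICLE = {
    "Highland (council area)",
    "Scottish Borders",
    "Stirling council area",
    "Orkney Islands",
    "East Riding of Yorkshire",
}


def category_pair(os_name):
    """Return (hidden Geograph category, visible category) for an OS region name."""
    name = OS_TO_COMMONS.get(os_name, os_name)
    article = "the " if name in TAKES_ARTICLE else ""
    return ("Category:Geograph images in " + article + name,
            "Category:" + name)
```

For instance, `category_pair("Highland")` gives the pair shown in row 6 of the list, and unmapped names such as "Fife" pass through unchanged.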

Stage 6: Northern Ireland

Status Working · Category:Geograph images in Northern Ireland by Council

There is currently a set of Geograph categories available of NI Counties, but the boundaries available within the OS Open Data appear to be limited to NI Councils (which superseded the NI Counties in 1973). The following proposed categorization reflects that, but may need to map the visible categories to parents such as those in Category:Districts in Northern Ireland.

Implementation notes:

The net caught 66,704 images. After seeing the initial results, I realized that the rectangle was slightly off, lying east of the intended area, and so re-cast it.

# Geograph cat Visible cat (optional)
1 Category:Geograph images in Antrim Borough Council Category:County Antrim
2 Category:Geograph images in Ards Borough Council Category:Ards Borough Council
3 Category:Geograph images in Armagh District Council Category:Armagh
4 Category:Geograph images in Ballymena Borough Council Category:Ballymena Borough Council
5 Category:Geograph images in Ballymoney Borough Council Category:Ballymoney
6 Category:Geograph images in Banbridge District Council Category:Banbridge
7 Category:Geograph images in Belfast City Council Category:Belfast City Council
8 Category:Geograph images in Carrickfergus Borough Council Category:Carrickfergus Borough Council
9 Category:Geograph images in Castlereagh Borough Council Category:Castlereagh
10 Category:Geograph images in Coleraine Borough Council Category:Coleraine
11 Category:Geograph images in Cookstown District Council Category:Cookstown
12 Category:Geograph images in Craigavon Borough Council Category:Craigavon Borough Council
13 Category:Geograph images in Derry City Council Category:Derry City Council
14 Category:Geograph images in Down District Council Category:Down District Council
15 Category:Geograph images in Dungannon and South Tyrone Borough Council Category:Dungannon and South Tyrone Borough Council
16 Category:Geograph images in Fermanagh District Council Category:County Fermanagh
17 Category:Geograph images in Larne Borough Council Category:Larne
18 Category:Geograph images in Limavady Borough Council Category:Limavady
19 Category:Geograph images in Lisburn Borough Council Category:Lisburn
20 Category:Geograph images in Magherafelt District Council Category:Magherafelt
21 Category:Geograph images in Moyle District Council Category:Moyle District Council
22 Category:Geograph images in Newry and Mourne District Council Category:Newry and Mourne District Council
23 Category:Geograph images in Newtownabbey Borough Council Category:Newtownabbey Borough Council
24 Category:Geograph images in North Down Borough Council Category:North Down Borough Council
25 Category:Geograph images in Omagh District Council Category:Omagh District Council
26 Category:Geograph images in Strabane District Council Category:Strabane District Council
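To review which councils are paired with a differently named visible category (for example a county or town rather than the council itself), the rows of the list above can be parsed mechanically. This is a rough sketch rather than part of Faebot; `ROW`, `parse_row` and `differs` are hypothetical helpers, and it relies on each row having the form "number, Geograph category, visible category".

```python
import re

# number, hidden Geograph category, visible category
ROW = re.compile(r"^(\d+)\s+(Category:.+?)\s+(Category:.+)$")
PREFIX = "Category:Geograph images in "


def parse_row(line):
    """Split one row of the list into (hidden category, visible category)."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError("unrecognised row: " + line)
    _, hidden, visible = match.groups()
    return hidden, visible


def differs(hidden, visible):
    """True if the visible category is not simply named after the council."""
    return visible != "Category:" + hidden[len(PREFIX):]


rows = [
    "1 Category:Geograph images in Antrim Borough Council Category:County Antrim",
    "2 Category:Geograph images in Ards Borough Council Category:Ards Borough Council",
]
flagged = [parse_row(r) for r in rows if differs(*parse_row(r))]
```

Rows flagged this way are the ones where the visible category may need checking against parents such as those in Category:Districts in Northern Ireland.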