User:Faebot/Geograph
Faebot has a scope that covers categorization and fixes to Geograph uploads (a UK and Ireland project). As projects come up that may require more explanation or a consensus for changes, I will include a summary here.
If you have suggestions for improvement or would like to raise issues, please do so on my talk page or by email rather than here. --Fæ (talk) 10:37, 10 October 2012 (UTC)
Project A: Geograph user categorization
The following table shows the hidden user categories that Faebot is categorizing Geograph images into. These categories have several benefits, from making it possible to follow an interesting photographer for related images (such as postboxes, train stations or wildlife) that someone else categorizing images would find useful to add to, through to fixing licensing problems apparent from the same Geograph uploader.
As a semi-manual check, the categories are created manually as Faebot is generating them. This means they may be created a few days in advance or shortly after the categories start getting added to images. See below for an analysis of potential duplicates. --Fæ (talk) 09:51, 7 October 2012 (UTC)
Analysis of potential duplicate categories
As there is no firm consensus on when to use the words "Files", "Images" or "Photographs" in a user category of images, some duplication may occur. The following analysis was done to compare possible duplication when creating "Images by" hidden categories for Geograph uploaders. --Fæ (talk) 09:51, 7 October 2012 (UTC)
Analysis of duplicate and potential duplicate categories:
- Category:Photographs by Mark Anderson / Category:Images by Mark Anderson - Resolution: Images takes priority as it has 1000+ images while Photographs had 3 and is not linked from anywhere. Done
- Category:Photographs by Dr Neil Clifton / Category:Images by Dr Neil Clifton - Resolved by discussion. Done
- The following 9 categories default to Photographs as pre-existing: Done
  - Category:Photographs by Simon Carey / Category:Images by Simon Carey
  - Category:Photographs by Patrick Mackie / Category:Images by Patrick Mackie
  - Category:Photographs by Nigel Cox / Category:Images by Nigel Cox
  - Category:Photographs by Michael Patterson / Category:Images by Michael Patterson
  - Category:Photographs by Thomas Nugent / Category:Images by Thomas Nugent
  - Category:Photographs by David Howard / Category:Images by David Howard
  - Category:Photographs by Stephen Craven / Category:Images by Stephen Craven
  - Category:Photographs by Roger Cornfoot / Category:Images by Roger Cornfoot
  - Category:Photographs by Stephen Sweeney / Category:Images by Stephen Sweeney
- Category:Photographs by Mike Quinn / Category:Images by Mike Quinn - Resolution: Images takes priority as it has 7,000+ images and Photographs has 12 and is not linked from anywhere. Done
- Category:Photographs by David Lally / Category:Images by David Lally - Resolution: Images takes priority as it had 4,000+ images while Photographs had 7. Done
- Category:Photographs by Ben Brooksbank / Category:Images by Ben Brooksbank - Resolution: Photographs takes priority as pre-existing and 4,000+ images. Done
Project B: Categorizing by decade and year
Faebot is running a Python script to populate Category:Geograph images by year, which has child categories for each decade and then each year. This has value such as:
- ensuring that pre-1940s images say a bit more than just the name of the Geograph uploader, who is unlikely to be the original photographer and so the license may require context.
- the ability for viewers to follow a particular photographer by year, including showing their photographs on a map. Particularly useful to identify high quality images in certain locations compared to the 'norm' on Geograph.
- the ability to find images by decade or year to assist with other more detailed categorization such as transport by year, or county/place by year.
- the ability to find images of places before redevelopment or for images of buildings now demolished (and potentially work out which year they were demolished in).
Script detail
|
---|
The script uses a local dump of Commons pages to avoid strain on the Wikimedia servers. The Python script gets tweaked to fit, but here is an example populating the 1950s, 1960s and 1980s ("o989455922912834566234o" is just a freakishly unlikely random string to avoid problems with back-references in the regex):
 import subprocess
 # Arguments shared by every run: read the local XML dump, File namespace only
 omycall = ["python", 'replace.py', '-xml://Volumes/<local location>/commonswiki.xml', '-namespace:6', '-dotall', '-regex']
 for decade in [r'5', r'6', r'8']:
     mycall = omycall[:]
     for y in [r'0', r'1', r'2', r'3', r'4', r'5', r'6', r'7', r'8', r'9']:
         year = r'19' + decade + y
         cat = r"ategory:" + year + r" Geograph images"
         # Case 1: the Source field comes before the Date field
         mycall.append(r'([Ss]ource\s*=.*?www.geograph.org.uk.*?[Dd]ate\s*=)\s*([^\n]*?)\b' + year + r'(\b.*$)')
         mycall.append(r"\1\2o989455922912834566234o" + year + r"\3\n[[Category:" + year + r" Geograph images]]")
         # Case 2: the Date field comes before the Source field
         mycall.append(r'([Dd]ate\s*=)\s*([^\n]*?)\b' + year + r'(\b.*?[Ss]ource\s*=.*?www.geograph.org.uk.*$)')
         mycall.append(r"\1\2o989455922912834566234o" + year + r"\3\n[[Category:" + year + r" Geograph images]]")
         # Remove a second copy of the year category if one was already present
         mycall.append(r'(\[\[[Cc]' + cat + r'\]\].*)\n\[\[[Cc]' + cat + r'\]\]')
         mycall.append(r'\1')
         # Strip the random marker string again
         mycall.append(r'o989455922912834566234o(.)')
         mycall.append(r'\1')
     # One replace.py run per decade, covering all ten years
     mycall.append('-summary:Add to [[Category:19' + decade + '0s Geograph images]]')
     subprocess.call(mycall)
|
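The marker string is needed because a group reference such as `\2` followed directly by the digits of a year is no longer read as a reference to group 2. As a minimal sketch only, the same year-tagging step can be written against Python's `re` module using the `\g<2>` form, which needs no marker. The helper name `add_year_category` and the sample page text are assumptions for illustration and are not part of Faebot's script; the sketch covers only the "Date" half of the match.

```python
import re

def add_year_category(page_text, year):
    """Append '[[Category:<year> Geograph images]]' if the Date field mentions <year>.

    Hypothetical helper: it assumes the page has already been identified
    as a Geograph upload and only looks at the Date field.
    """
    category = "[[Category:" + year + " Geograph images]]"
    if category in page_text:
        return page_text  # already categorized, nothing to do
    pattern = re.compile(r"([Dd]ate\s*=)\s*([^\n]*?)\b" + year + r"(\b.*$)", re.DOTALL)
    # \g<2> stays unambiguous even though the digits of the year follow it directly.
    replacement = r"\g<1>\g<2>" + year + r"\g<3>" + "\n" + category
    return pattern.sub(replacement, page_text, count=1)

# Invented sample page text
sample = "|Source=http://www.geograph.org.uk/photo/1\n|Date=12 May 1955\n|Author=Example"
tagged = add_year_category(sample, "1955")
```

Calling the helper a second time on `tagged` leaves it unchanged, which plays the same role as the duplicate-category clean-up in the replace.py call.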
- Progress
- 1930s to 1980s Done
- 1990s Done
- 2000s - will consider the option of further breaking into month subcats. 2000 and 2001 under way rather than doing the whole decade in one mouthful as there are likely to be large numbers involved. Working
Project C: Geograph regional categorization (London borough / Ireland county / Scotland council area)
For background discussion see Commons:Bots/Work_requests/Archive 7#Project C: Adding UK counties/district categories and test reports at User:Faebot/SandboxG (general sample of regional categorization from Google Maps data) and User:Faebot/SandboxL (London related tests).
Stage 1 will be to test out the concepts and then run categorization for Geograph images geotagged in Greater London (as defined by all London boroughs). It is then thought that the rest of England & Wales, Scotland and Ireland will follow as separate later stages. The project is expected to take several months.
Example Python source code can be found at User:Faebot/Geograph/Code.
- Benefits
- Act as a double check on the basic accuracy of geo-coordinate data
- Aid categorizers to correctly identify the county‡ for naming other categories such as Category:Shops in the Royal Borough of Kensington and Chelsea
- Encourage greater use of the large Geograph photo collection
‡ Note that by "county" I mean the "second level administrative area", which is the most appropriate regional breakdown after country. In England this is (often) called a county, in Wales there are principal areas, in Scotland this is the council area and in Ireland this is the county. As these are political boundaries, they are subject to change but have reasonable stability over time. The definition of boundaries used for this project on Commons will be the most pragmatic one based on the online databases available. Where there are sufficiently good reasons to do so, a breakdown to the "third level" may be done, as has been done to borough level for London.
- Issues
- On 2 November 2012 the source website for OS data using xml queries, http://www.uk-postcodes.com, was closed down, apparently indefinitely [a day later it was up again, but I have lost confidence in it as an available source]. This will mean testing out another site or re-writing the scripts and further updates will be delayed as a result, however they will be able to re-start where they left off.
Alternative of using MapIt
|
---|
I will probably switch over to Mapit run by mySociety which is free to use and in turn is a service underpinned by Ordnance Survey open data. I have some tests working (I was trying to avoid using JSON calls as I find these harder to work with in Python than xml) but need to revise and test out the scripts properly. Example test results:
|
- The categorization for Dumfries and Galloway has been restarted with mysociety.org JSON data rather than uk-postcodes.com xml data. The OS data being used are the fields giving Unitary Authority and UK Parliament constituency, which appear to be the most fitting, though this may vary; in particular, there are special fields that would need to be chosen for handling London.
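As a rough sketch of how a wanted field might be picked out of a MapIt reply, the function below assumes the point lookup returns a JSON object keyed by area id, with each area carrying a `name` and a `type_name`. The function name `pick_area`, the area ids and the sample reply are invented for illustration; no network call is made.

```python
import json

def pick_area(mapit_response, wanted_type_name):
    """Return the name of the first area whose 'type_name' matches, else None."""
    areas = json.loads(mapit_response)
    for area in areas.values():
        if area.get("type_name") == wanted_type_name:
            return area.get("name")
    return None

# Made-up sample shaped like a point-lookup reply (ids are invented).
sample_response = json.dumps({
    "1001": {"name": "Dumfries and Galloway", "type_name": "Unitary Authority"},
    "1002": {"name": "Dumfries and Galloway", "type_name": "UK Parliament constituency"},
})

unitary = pick_area(sample_response, "Unitary Authority")
```

Returning `None` for a missing field lets the caller fall back to another source rather than guessing a region.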
Summary
Country | Region | Project | Status |
---|---|---|---|
England | Greater London | #C1 | Done |
England | Cornwall | #C2b | Done |
England | Devon | #C2b | Done |
England | Dorset | #C2b | Done |
England | Hampshire | #C3c | Done |
England | Somerset | #C2b | Done |
England | West Midlands (county) | #C3b | Done |
England | Isle of Wight | #C3c | Done |
England | South East: Kent, Medway, East Sussex, Brighton and Hove, West Sussex, Surrey | #C4a | Working |
Wales | (all) | #C3d | Working |
Scotland | Orkney Islands | #C2a | Done |
Scotland | Shetland Islands | #C2a | Done |
Scotland | Dumfries and Galloway | #C3a | Done |
Pause for retesting on 1 December 2012 (restart on 20 December)
I have paused the categorization, hopefully for just a few days, to enable some retesting and a quality check. Open Street Map seems to be introducing most of the faulty matches, whilst the Ordnance Survey data underpinning MapIt seems far more reliable, probably giving an error rate below 0.01%. A simple change to the logic of how we might use MapIt, with OSM+GMaps as the backup, may reduce the current error rate significantly; from <0.15% to <0.01% perhaps? The remaining error rate may lie with the way Geograph images have been processed rather than with the OS data being used (if someone could work this out at some point, it might be a useful improvement to Geograph).
The test sample is to cover the region around Monmouthshire/Newport/Torfaen and something of Gloucestershire which might also recategorize/improve some of the Wales work already done and on pause.
- Step 1
- Calculate bounding box and create source file
- Bounding box for -3.197,51.479,-2.373,51.944
- Matching regex to create source file:
python replace.py -xml:"//Volumes/Fae_32GB/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*(51\.[5-8]|51\.479|51\.48|51\.9[0-4])\d*\s*\|\s*(-2\.[4-9]|-2\.37[3-9]|-3\.[0-1]).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*(51\.[5-8]|51\.479|51\.48|51\.9[0-4])\d*\s*\|\s*(-2\.[4-9]|-2\.37[3-9]|-3\.[0-1]))' '\1FAEBOT-marker-FAEBOT' -nocase -savenew:"//Volumes/Fae_32GB/Geograph/GeoboxTestMonmouth.txt" -ns:6
- 25,300 images were found inside the bounding box and titles saved to a file.
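For comparison with the digit-prefix regex used in Step 1, here is a minimal sketch that reads the coordinates out of a Location dec template and tests them numerically against the same bounding box. A numeric test also covers values that a hand-written alternation can miss (a latitude beginning 51.49 does not appear to be covered by the pattern above, for instance). The names `coordinates`, `in_box` and `MONMOUTH_BOX` are assumptions for illustration, and the box is assumed to be ordered west, south, east, north.

```python
import re

# Bounding box from Step 1, assumed to be (west, south, east, north) in decimal degrees.
MONMOUTH_BOX = (-3.197, 51.479, -2.373, 51.944)

# Matches {{Location dec|lat|lon...}} and {{Object location dec|lat|lon...}}
LOCATION_RE = re.compile(
    r"\{\{\s*(?:[Oo]bject )?[Ll]ocation dec\s*\|\s*(-?\d+(?:\.\d+)?)\s*\|\s*(-?\d+(?:\.\d+)?)")

def coordinates(page_text):
    """Return (lat, lon) from the first location template, or None if absent."""
    match = LOCATION_RE.search(page_text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

def in_box(lat, lon, box):
    """True when the point lies inside the box, edges included."""
    west, south, east, north = box
    return south <= lat <= north and west <= lon <= east
```

This is slower than a single regex pass over the dump, so it is better suited to re-checking a candidate list than to replacing the initial search.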
- Step 2
- Test wanted fields using a soak test
The soak test ran over 4,000 images and provided the following counties in results:
[['Herefordshire', 567], ['Gloucestershire', 1326], ['Monmouthshire', 1280], ['Newport', 199], ['Cardiff', 149], ['Torfaen', 80], ['Blaenau Gwent', 24], ['Powys', 117], ['Caerphilly', 103], ['North Somerset', 37], ['Bristol', 115]]
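A tally in the same [name, count] shape as the result above can be produced with `collections.Counter`. This is only a sketch: the function name `tally_regions`, the `NAME_FIXES` table (based on the "Bristol City" remark in this step) and the sample input are assumptions, not the code used for the soak test.

```python
from collections import Counter

# Source names that should be folded into an existing region name.
NAME_FIXES = {"Bristol City": "Bristol"}

def tally_regions(region_names):
    """Count region names after applying name fixes; most frequent first."""
    counts = Counter(NAME_FIXES.get(name, name) for name in region_names)
    return [[name, count] for name, count in counts.most_common()]

# Invented sample of lookup results
sample = ["Gloucestershire", "Bristol City", "Bristol", "Newport", "Gloucestershire"]
result = tally_regions(sample)
```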
"Bristol City" appeared in the source data and was mapped to "Bristol".
- Step 3
- Categorize region
The wanted list limits results to be added to:
- Geograph images in Monmouthshire
- Geograph images in Blaenau Gwent
- Geograph images in Caerphilly
- Geograph images in Cardiff
- Geograph images in Monmouthshire
- Geograph images in Newport
- Geograph images in Powys
- Geograph images in Torfaen
- Geograph images in Gloucestershire
- Geograph images in Herefordshire
- Geograph images in North Somerset
- Geograph images in Bristol
- Note, the last 4 categories in England were empty at the start of this test run.
Any "Geograph images in" categories will be swapped if they exist currently, on the basis that the new logic is going to be more accurate. This should be noticeable for Newport where images have been incorrectly categorized under Monmouthshire, for example this change.
When there are no visible categories, the images are added to (need to be hand checked due to inconsistent naming and potential conflicts with disambiguation pages):
- Monmouthshire
- Blaenau Gwent
- Caerphilly County Borough
- Cardiff
- Monmouthshire
- Newport, Wales
- Powys
- Torfaen
- Gloucestershire
- Herefordshire
- North Somerset
- Bristol
Done 25,300 images checked, with the test set being processed from 8—17 December 2012.
- Step 4
- Analysis of results
Success. The error rate seems to be running at 0.03%, which is only constrained by the accuracy of Ordnance Survey data and unavoidable issues such as where the camera is (giving us the GPS data) and what is being photographed. Refer to User_talk:Fæ/2013#Geograph_again.
Stage 1: London boroughs
- Done · Populate London
The purpose of this project is to add Geograph hidden categories identifying images in all 32 London boroughs (plus the City of London) using hidden child categories of Category:Geograph images in London. There will be a test report stage, then a beta test on a few thousand images to demonstrate the concept. This categorization process is expected to run slowly, probably with fewer than 1,000 images being changed per day (Geograph has more than 2,000,000 images; it is not known how many are geotagged in London). If there are future re-runs to update the categorization these would be rare; no more than once a year would be expected.
The London boroughs are fairly easily defined and relatively neutral in terms of the regional politics of naming, consequently this seemed a good choice for a first stage if the principles and scripts used to run this categorization are to apply to other UK regions. Note, the "County of London" was replaced by "Greater London" in 1965, with the London boroughs being the next sub-division of the region.
Update Beta test complete on 2,000+ images, and it appears that at least 46,000+ images are in the London bounding box (the nearest rectangle that can cover London on the map). Using their given coordinates, these are being checked for borough names against Open Street Map, double checked on http://www.uk-postcodes.com (a front end for OS OpenData), and (where a third opinion is needed) on Google Maps. Images found not to be named as in a London borough or where there are too many discrepancies are left uncategorized.
Bug—26 Oct 2012—Fixed
- Sources
- Borough definitions: List of London boroughs, Category:London boroughs
- OSM bounding box for London: OSM
- Example lat/lon to address lookup using OSM in XML format: nominatim.openstreetmap.org
- Nominatim usage policy
- OSM copyright/free reuse policy
- Pseudocode
- Find likely candidate images from a breakdown of Geograph
- Get categories for candidate image using API call [on error: wait, try again using increasingly longer periods]
- If candidate image is already categorized against a Geograph borough then next
- Get image page text and extract data from Object location dec or Location dec templates [not found: error log, next†]
- For each image test if within OSM bounding box [if not: next]
- Get OSM address data [on error: wait, retry then add to error log]
- Test if the OSM data gives the county as London [if not: next]‡
- Map OSM given borough (=locality) to existing Commons category in Category:London boroughs
- If the number of visible non-Geograph categories on the image are 0, then
- Add an existing, visible, Commons London borough category
- If template exists, then remove Uncategorized-Geograph
- Add Check categories-Geograph
- Add hidden Geograph by London borough category
- Write updated image page to Commons
- Write record to local log
- † - all Geograph images are supposed to be imported as geotagged
- ‡ - the borough name is checked against another site and if a mismatch a third is then used to create a poll. The resulting borough name should therefore be highly reliable, certainly more than OSM data can provide alone (which may sometimes return a blank, a higher region name - "London", or may appear incorrect compared to the postcode)
- Note, I have removed dealing with Uncategorized-Geograph templates for a separate exercise.
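The poll described in the ‡ note can be sketched as follows. This is an illustration under stated assumptions rather than Faebot's code: `resolve_borough` is a hypothetical name, and `third_opinion` stands for any zero-argument callable (for example a Google Maps lookup) that is only invoked when the first two sources disagree.

```python
from collections import Counter

def resolve_borough(osm_name, os_name, third_opinion):
    """Return the agreed borough name, or None to leave the image uncategorized.

    third_opinion is a hypothetical zero-argument callable; it is only
    called when the first two sources do not agree.
    """
    if osm_name and osm_name == os_name:
        return osm_name
    # Blank or missing answers do not get a vote.
    votes = Counter(name for name in (osm_name, os_name, third_opinion()) if name)
    if not votes:
        return None
    name, count = votes.most_common(1)[0]
    return name if count >= 2 else None
```

For example, `resolve_borough("London", "Camden", lambda: "Camden")` gives "Camden", while three different answers give `None`, so the image is skipped.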
Generating candidate images
The following call quickly generated a file of 2,950 images that were categorized under any category with "Geograph" in the name, and appeared to be inside a bounding box for London using the coordinates in {{Location dec}}.
London search 1
|
---|
// Find images in geo box: lat > 51.28676 and lat < 51.69188 and lon > -0.51104 and lon < 0.33402 python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '[Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3]).*ategory[^\]]+Geograph |ategory[^\]]+Geograph .*[Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3])' 'doesnotmatter' -savenew:"//Volumes/<local>/GeoboxLondonList.txt" -ns:6 |
This call picks up on the use of the {{Geograph}} template, many of which are not listed in other Geograph categories. This generated 80,538 image file names inside the same London bounding box, representing about 4% of all Geograph images on Commons:
London search 2
|
---|
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall 'Location dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3]).*\{\{[Gg]eograph\||\{\{[Gg]eograph\| .*[Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3])' 'doesnotmatter' -nocase -savenew:"//Volumes/<local>/GeoboxLondonList.txt" -ns:6 |
My current view, after some experimentation, is that both searching the image page for "geograph.org.uk" and for the template {{Geograph}} is necessary — if either one matches, then this can be assumed to be a Geograph project photograph. There are examples of (mostly older) images where no source link is quoted but there is a valid link to the Geograph user page via template, and there are examples of pages with no Geograph categories but the images are linked back to Geograph as a source. Similarly, to avoid bugs like the one identified above, it is necessary that one or other of these features are double-checked as being on an image page before a bot makes any assumption about the image being suitable for a Geograph category.
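The either-or check described here can be sketched as a small predicate. The name `is_geograph_page` is an assumption, and allowing optional whitespace inside the template braces is a slight loosening of the patterns used in the searches.

```python
import re

# Either a geograph.org.uk link or a {{Geograph|...}} template marks a Geograph page.
GEOGRAPH_RE = re.compile(r"geograph\.org\.uk|\{\{\s*geograph\s*\|", re.IGNORECASE)

def is_geograph_page(page_text):
    """True when the page text carries a Geograph source link or template."""
    return GEOGRAPH_RE.search(page_text) is not None
```

A page that merely sits in a category with "Geograph" in its name does not pass this check, which is the kind of false positive the double check is meant to prevent.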
A third search including the Geograph URL and template resulted in 80,746 matches. I have restarted the categorization script based on this new source file.
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3]).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*51\.[2-6]\d*\s*\|\s*(-0\.[0-4]|-0\.51[0-1]|-0\.50|-0\.[0-4]|0\.[0-2]|0\.3[0-3]))' '\1FAEBOT-marker-FAEBOT' -nocase -savenew:"//Volumes/<local>/Geograph/Stage1/GeoboxLondonList2.txt" -ns:6
Second run, February 2013
After retesting, there have been several important improvements to the code, so I'm re-running from scratch. This included redoing the bounding box; the new figure is that 81,089 images need checking.
The default categories are:
- Category:City of London
- Category:City of Westminster
- Category:Royal Borough of Kensington and Chelsea
- Category:London Borough of Hammersmith and Fulham
- Category:London Borough of Wandsworth
- Category:London Borough of Lambeth
- Category:London Borough of Southwark
- Category:London Borough of Tower Hamlets
- Category:London Borough of Hackney
- Category:London Borough of Islington
- Category:London Borough of Camden
- Category:London Borough of Brent
- Category:London Borough of Ealing
- Category:London Borough of Hounslow
- Category:London Borough of Richmond upon Thames
- Category:Royal Borough of Kingston upon Thames
- Category:London Borough of Merton
- Category:London Borough of Sutton
- Category:London Borough of Croydon
- Category:London Borough of Bromley
- Category:London Borough of Lewisham
- Category:Royal Borough of Greenwich
- Category:London Borough of Bexley
- Category:London Borough of Havering
- Category:London Borough of Barking and Dagenham
- Category:London Borough of Redbridge
- Category:London Borough of Newham
- Category:London Borough of Waltham Forest
- Category:London Borough of Haringey
- Category:London Borough of Enfield
- Category:London Borough of Barnet
- Category:London Borough of Harrow
- Category:London Borough of Hillingdon
Stage 2: Easy boxes
2a: Orkney Islands and Shetland Islands
Using the query below for a latitude above 58.7, I get 12,628 images matching (~0.6% of all uploaded Geograph images). There are only the two regions, nicely bounded by the sea, so no issues with complaints about observer vs. the observed object location. Interestingly, an automated count of the files in Category:Orkney Islands shows 3,146 files and Category:Shetland shows 4,878; a total of 8,024 (this may double count some images and does count files that are not Geograph). This means significantly more than 4,603 from the above 12,600+ Geograph files are not categorized at all under their region; a clear benefit from this categorization project.
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '(Location dec\s*\|\s*(58\.[7-9]|59\.|6[01]\.).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*(58\.[7-9]|59\.|6[01]\.))' '\1FAEBOT-MARKER-FAEBOT' -nocase -savenew:"//Volumes/<local>/Geograph/GeoboxVeryNorth.txt" -ns:6
2b: Cornwall, Devon, Somerset, Dorset
- Status Working · Populate Cornwall, Devon, Somerset, North Somerset and Dorset ... plus Isles of Scilly
Using this bounding box I generously hit these South West England counties with some overlap with other counties - in fact my matches back from OSM queries look like ['Devon' , 'Somerset' , 'Cornwall' , 'Dorset' , 'Wiltshire' , 'Vale of Glamorgan' , 'South Glamorgan' ] (still checking, will test another 1,000 images). To keep things simple, I'll probably throw away any matches outside of the four completely covered counties. I was hoping to hit Jersey and Guernsey, but these do not seem to be under Geograph. The search (regex) matched exactly 54,000 images, so I may need to run again as the number seems too rounded to be true.
Python regex detail for finding "geograph.org.uk" in South West England
|
---|
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*(49\.[1-9]|50\.|51\.[0-3])\d*\s*\|\s*(-1\.[8-9]|-[2-6]\.|-7\.0).*[Ss]ource\s*=[^\n]+geograph\.org\.uk|[Ss]ource\s*=[^\n]+geograph\.org\.uk.*[Ll]ocation dec\s*\|\s*(49\.[1-9]|50\.|51\.[0-3])\d*\s*\|\s*(-1\.[8-9]|-[2-6]\.|-7\.0))' 'FAEBOT-MARKER-FAEBOT' -nocase -query:120 -savenew:"//Volumes/<local>/GeoboxCornwallList.txt" -ns:6 |
A re-run found 147,956 (note, this may not be all potential matching images, I still need to run a query using "{{Geograph|" rather than just looking for pages with reference directly to "geograph.org.uk"). At my current rate of progress (assuming half get matched for saving), I think testing c.148,000 images would take 37 days of continuous processing to complete. Seems do-able, though I may need to break up the source list into smaller runnable tranches to avoid slowing down Python by having it all in memory at the same time (running as one large batch, Python was fine :-) ) Considering the number is large, I'll consider how sub-region categories might be useful, though it might well be pragmatic to do the top level categorization before breaking regions such as Cornwall into something debatable like its parliamentary constituencies.
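Splitting the source list into tranches, and the arithmetic behind the 37-day estimate, can be sketched like this. The names `tranches` and `days_needed` are hypothetical, and the rate of 4,000 images a day is simply what 148,000 images in 37 days implies.

```python
from itertools import islice

def tranches(titles, size):
    """Yield successive lists of at most `size` titles from any iterable."""
    iterator = iter(titles)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

def days_needed(total_images, images_per_day):
    """Whole days of continuous processing, rounded up."""
    return -(-total_images // images_per_day)  # ceiling division

estimate = days_needed(148000, 4000)
```

Because `tranches` accepts any iterable, it can be fed the lines of the saved title file one at a time, so the whole list never has to sit in memory.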
I have started a re-run after finding that the Isles of Scilly tend to be unmatched in both the OSM data and the OS data, and are not matched as Cornwall in the Google Maps data (being identified with "Isles of Scilly" as administrative level 2, which is theoretically correct). Politically the Isles currently fall under Cornwall; however, as they are a special case among ceremonial counties, I have separated them out into Category:Geograph images in the Isles of Scilly.
When there are no visible categories, the following apply: Category:Devon, Category:Cornwall, Category:Somerset, Category:North Somerset, Category:Dorset, Category:Isles of Scilly.
I noticed this example of odd map data - File:North West Lundy - geograph.org.uk - 15444.jpg. Lundy is shown in an xml query to Open Street Map as being in Pembrokeshire (Wales) while Google Maps and MapIt show it as being in Devon. Oddly when I go directly to OSM and look up Lundy, it is shown as in Devon correctly.
Stage 3: Priority areas
3a: Dumfries and Galloway
Done · Populated Dumfries and Galloway
|
---|
Werespielchequers raised the problem of photographs in Wigtownshire being categorized in Northern Ireland. Prioritizing Dumfries and Galloway Council would provide a handy category to do a check for images both geo-located in Scotland and with contradicting categories in Northern Ireland. This is a Council and the source data breaks this into 3 historic counties: Wigtownshire, Kirkcudbrightshire and Dumfriesshire. For the time being these are being mapped back to the Council level for the purposes of the hidden Geograph categorization. Note, Openstreetmap has county identified but the Ordnance Survey query I am currently using is limited to the Westminster Constituency of Dumfries and Galloway. With the OS data appearing more accurate than OSM, there seems good reason to stick to this level.
python replace.py -xml:"//Volumes/<local>/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*(54\.[5-9]|55\.[0-5])\d*\s*\|\s*(-2\.[6-9]|-[3-4]\.|-5\.[0-5]).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*(54\.[5-9]|55\.[0-5])\d*\s*\|\s*(-2\.[6-9]|-[3-4]\.|-5\.[0-5]))' '\1FAEBOT-marker-FAEBOT' -nocase -savenew:"//Volumes/<local>/Geograph/Stage3/GeoboxDumfriesAndGalloway.txt" -ns:6
The search provided 61,073 images in the bounding box, though this includes a large number of matches with Cumbria, Scottish Borders, South Lanarkshire, East Ayrshire and South Ayrshire. With an initial 2,000+ images checked, positive matches to Dumfries and Galloway look to be around 1/4 (15,000) and time to process these is likely to be at least ten days (taking into account that the Python thread is deliberately slowed down to reduce the number of transactions per day on the source data websites, and is running in parallel with other tasks).
As an example of the benefits of this categorization, with only a few hundred images in the category, 5 were shown as both in Dumfries and Galloway and South Lanarkshire at the same time and after 6,000 images were categorized this had risen to 42 incorrectly categorized images; see Cat Scan.
Case: OpenStreetMap and data reliability
The checks in Dumfries and Galloway yielded a good example case of the OSM reliability problem. Using the uploaded image geodata from File:The scar - loch ryan.jpg, here are the relevant comparisons:
To be fair, the boundary between Dumfries and Galloway and South Ayrshire is close to this coordinate, however the distance from the boundary is certainly more than 500m (see GMap's boundary line). Consequently OSM must be considered questionable for county boundaries and I have already seen a pattern of significant unreliability for postcode districts. My script spotted the inconsistency between OSM and UK-postcodes, went to Google Maps for a third opinion (this only happens for inconsistencies), and has consequently correctly categorised the image under Category:Geograph images in Dumfries and Galloway based on the majority. |
3b: West Midlands (county)
- Done · Populated West Midlands (county)
- OSM bounding box 52.344 -2.22 52.667 -1.418
The West Midlands region is composed of Herefordshire, Shropshire, Staffordshire, Warwickshire and Worcestershire along with the city conurbation. Photographs in the region are often mis-categorized.
Python regex detail
|
---|
python replace.py -xml:"//Volumes/Fae_32GB/commonswiki.xml" -regex -dotall '([Ll]ocation dec\s*\|\s*(52\.3[4-9]|52\.[4-5]|52\.6[0-7])\d*\s*\|\s*(-2\.2[2-9]|-2\.[0-1]|-1\.[5-9]|-1\.4[2-9]|-1\.41[8-9]).*([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|)|([Gg]eograph\.org\.uk|\{\{[Gg]eograph\|).*[Ll]ocation dec\s*\|\s*(52\.3[4-9]|52\.[4-5]|52\.6[0-7])\d*\s*\|\s*(-2\.2[2-9]|-2\.[0-1]|-1\.[5-9]|-1\.4[2-9]|-1\.41[8-9]))' '\1FAEBOT-marker-FAEBOT' -nocase -savenew:"//Volumes/Fae_32GB/Geograph/Stage3/GeoboxWestMidlands2.txt" -ns:6 |
The regex above gives 24,510 matching images in the bounding box (just over half are likely to be matches based on the difficult boundary shape).
Getting a match to the ceremonial county of "West Midlands" is a bit tricky. At first I tried the following fields:
- (gmapWanted, mapitWanted, osmWanted)=("administrative_area_level_2", "European region", "state_district")
However this gave matches of only East Midlands versus West Midlands and considering the bounding box, this means the matches were for the West Midlands region rather than county. I then examined:
- (gmapWanted, mapitWanted, osmWanted)=("administrative_area_level_3","Middle Layer Super Output Area (Generalised)","city")
This gives the next level down which would need to be mapped to the ceremonial county. I would expect to map Birmingham, Coventry, Wolverhampton, Dudley, Sandwell, Solihull, and Walsall to the West Midlands (county) but I'm running a soak test to see what the data actually provides.
The matches in the bounding box for the first 1,000 images [<region>,<number of matches>] are:
- [['Cannock Chase', 6], ['South Staffordshire', 20], ['Solihull', 82], ['Wolverhampton', 35], ['Sandwell', 37], ['Birmingham', 154], ['North Warwickshire', 91], ['Dudley', 58], ['Walsall', 42], ['Bromsgrove', 135], ['Nuneaton and Bedworth', 16], ['Warwick District', 60], ['Lichfield', 35], ['Coventry', 63], ['Tamworth', 25], ['Wyre Forest', 72], ['North Warwickshire District', 2], ['Stratford-on-Avon District', 5], ['Hinckley and Bosworth', 17], ['Lichfield District', 4], ['Warwick', 6], ['Tamworth District', 1], ['Solihull District', 3], ['North West Leicestershire', 3], ['Rugby District', 24], ['Wychavon District', 1], ['South Staffordshire District', 1], ['Nuneaton and Bedworth District', 1]]
This routine will match all images with region names in bold and categorize them under Category:Geograph images in the West Midlands (county). The oddity of Solihull District appears to be another name for the Metropolitan Borough of Solihull, though this should have no bearing on the validity of the mapping to West Midlands (county).
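The mapping from district names to the county category might look like the sketch below. It assumes the matched names are the seven metropolitan boroughs listed earlier in this section and that a trailing " District" can be dropped, as with Solihull District; both are assumptions, and the names `normalise` and `west_midlands_category` are hypothetical.

```python
# The seven metropolitan boroughs named in the text above.
WEST_MIDLANDS_BOROUGHS = {
    "Birmingham", "Coventry", "Wolverhampton", "Dudley",
    "Sandwell", "Solihull", "Walsall",
}
TARGET = "Category:Geograph images in the West Midlands (county)"

def normalise(name):
    """Drop a trailing ' District' so 'Solihull District' matches 'Solihull'."""
    suffix = " District"
    return name[:-len(suffix)] if name.endswith(suffix) else name

def west_midlands_category(region_name):
    """Return the county category for a matching borough, else None."""
    return TARGET if normalise(region_name) in WEST_MIDLANDS_BOROUGHS else None
```

Names outside the set, such as Bromsgrove or Warwick District, return `None` and would be discarded rather than categorized.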
3c: Hampshire and Isle of Wight
- Done · Populate Hampshire and the Isle of Wight
- OSM box -1.94,50.4,-0.61,51.38
This bounding box gets 104,210 Geograph image matches on Commons.
For mapping, I'm using the fields from Google Maps, MapIt and Open Street Map of:
- gmapWanted, mapitWanted, osmWanted = "administrative_area_level_2", "County council", "county"
A sample of 2,000 images from the bounding box gives results: [['Hampshire', 1288], ['West Sussex', 226], ['Surrey', 122], ['Berkshire', 74], ['Isle of Wight', 117], ['Wiltshire', 54], ['Dorset', 113], ['Wokingham', 5], ['Royal Borough of Windsor and Maidenhead', 2]]; which seems good enough to separate Hampshire from the Isle of Wight at least. As previously, unwanted places will be discarded rather than categorizing partial areas, to avoid later confusion.
If these are a good pattern, this mapping of fields might be okay for all other English counties where the common use of 'county' matches the ceremonial county, which seems the case with Hampshire.
3d: Wales
- Done · Populate subcategories of Category:Geograph images in Wales
I'm going for a much larger region, the whole of Wales, in one chunk. A preliminary test of 2,000+ (out of 178,000) images, with a few mapping tweaks for name variations, gives me matches for:
- Anglesey, Blaenau Gwent, Bridgend, Caerphilly, Cardiff, Carmarthenshire, Ceredigion, Conwy, Denbighshire, Flintshire, Gwynedd, Merthyr Tydfil, Monmouthshire, Neath Port Talbot, Newport, Pembrokeshire, Powys, Rhondda Cynon Taf, Swansea, Torfaen, Vale of Glamorgan, Wrexham
If there are some poor name choices here, behaviour should be consistent, so it should be straightforward to move images to better-named categories, or merge them, should corrections be needed.
Based on advice from Nilfanion, I'm not taking this direction; parked.
|
---|
I am rethinking these categories to roll up to the 13 historic counties of Wales: |
Now re-running. Based on comments raised during testing, the following additional category mapping was added:
- mappings2=[
- ("Caerphilly","Caerphilly County Borough"),
- ("Newport","Newport, Wales")
- ]
This does not change the (hidden) Geograph category but is used to choose a general visible category where no current categories exist on an image.
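A minimal sketch of how the mapping might be applied, assuming a visible category is only proposed when the image has none. The function name `visible_category` and the fallback of using the region name unchanged when it has no entry in `mappings2` are assumptions for illustration.

```python
mappings2 = [
    ("Caerphilly", "Caerphilly County Borough"),
    ("Newport", "Newport, Wales"),
]
VISIBLE_NAMES = dict(mappings2)


def visible_category(region, existing_visible):
    """Return a visible category to add, or None if the image already has one.

    Assumption: regions without an entry in mappings2 keep their own name.
    """
    if existing_visible:
        return None
    return "Category:" + VISIBLE_NAMES.get(region, region)


print(visible_category("Newport", []))
print(visible_category("Newport", ["Category:Bridges in Wales"]))
```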
The new run is going from scratch, so old images are being swapped over to (hopefully) more accurate categories, where previously Open Street Map or Google Maps were overriding Ordnance Survey (in practice the most accurate) or resulting in the image being skipped. Examples: [1], [2], [3] and [4]; the last is a classic case of a photograph taken at a county boundary.
I have been forced to add Category:Geograph images in Wirral West as an extra option, because these images were being forced into Flintshire by Google Maps when the OS data was ignored. Example
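The change in precedence can be pictured as follows. This is a sketch only: the function name `pick_region` is hypothetical, and the fallback order between Open Street Map and Google Maps is an assumption, since the text only says that Ordnance Survey data should win.

```python
def pick_region(os_name=None, osm_name=None, gmap_name=None):
    """Return the first available region name, preferring Ordnance Survey.

    Assumption: OSM is tried before Google Maps when OS data is missing.
    """
    for candidate in (os_name, osm_name, gmap_name):
        if candidate:
            return candidate
    return None


# Illustrative values: the OS answer is kept even though Google Maps disagrees.
print(pick_region(os_name="Wirral West", gmap_name="Flintshire"))
```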
Stage 4: England
Refer to Metropolitan and non-metropolitan counties of England.
4a: South East
- Status Done · Populate Category:Geograph images in Medway, Category:Geograph images in Kent, Category:Geograph images in East Sussex, Category:Geograph images in Brighton and Hove, Category:Geograph images in West Sussex and Category:Geograph images in Surrey
With London and Hampshire done, the South East is a logical next step.
OSM bounding box: -1.077,50.716,1.524,51.5
- This bounding box yielded over 218,400 image matches. However, this includes most of South London, which will be excluded, probably on the order of 30,000 photos. On starting the process I note that chunks of Hampshire and even Oxfordshire are also covered; as these will be skipped, they too will reduce the final total categorized.
The counties are:
- Category:Medway
- Category:Kent
- Category:East Sussex
- Category:Brighton and Hove
- Category:West Sussex
- Category:Surrey
4b: Middle England
- Status Working · Category:Geograph images in England
This is a wide swathe of the middle of England, taking a generalized approach by casting a wider net. One advantage (to the bot operator) is that Faebot can be left to churn through this large net for several months without needing intervention.
- OSM box -3,51.0,2.999,53.999
Category mapping test:
OS name | Commons name |
---|---|
The City of Brighton and Hove | Brighton and Hove |
Bath & North East Somerset | Bath and North East Somerset |
West Berkshire | Berkshire |
Newbury | Berkshire |
South Gloucestershire | Gloucestershire |
Cheshire East | Cheshire |
Cheshire West and Chester | Cheshire |
Warrington | Cheshire |
Sheffield | South Yorkshire |
Medway | Kent |
Blackburn with Darwen | Lancashire |
Telford and Wrekin | Shropshire |
Knowsley | Merseyside |
Stoke-on-Trent | Staffordshire |
Southend-on-Sea | Essex |
Halton-with-Aughton | Lancashire |
Halton East | North Yorkshire |
Halton Gill | North Yorkshire |
Halton Holegate | Lincolnshire |
Halton Lea Gate | Northumberland |
Halton West | North Yorkshire |
Halton | Cheshire |
Swindon | Wiltshire |
Yorkshire and the Humber | West Yorkshire |
Derby | Derbyshire |
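One plausible reading of the Halton rows is that names are matched in order, with the more specific names listed before the bare "Halton" so that they are not swallowed by it. The sketch below assumes prefix matching where the first match wins; that matching rule, the function name `commons_name`, and the fallback of returning the OS name unchanged are all assumptions, and only a subset of the table is reproduced.

```python
# Ordered (OS name, Commons name) pairs; a subset of the table above.
OS_TO_COMMONS = [
    ("The City of Brighton and Hove", "Brighton and Hove"),
    ("West Berkshire", "Berkshire"),
    ("Halton-with-Aughton", "Lancashire"),
    ("Halton East", "North Yorkshire"),
    ("Halton Holegate", "Lincolnshire"),
    ("Halton", "Cheshire"),
    ("Derby", "Derbyshire"),
]


def commons_name(os_name):
    """Return the Commons name for an OS region name.

    Assumption: the first entry whose OS name is a prefix wins, so order
    matters; unmatched names are returned unchanged.
    """
    for prefix, target in OS_TO_COMMONS:
        if os_name.startswith(prefix):
            return target
    return os_name


for name in ("Halton East", "Halton", "Hampshire"):
    print(name, "->", "Category:Geograph images in " + commons_name(name))
```

With exact matching instead of prefix matching the order would not matter, and a plain dictionary lookup would do the same job.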
Stage 5: Scotland
- Status Working · Category:Geograph images in Scotland
In the same way as Wales, I have decided to try doing Scotland in one big gulp. There are 541,081 titles matched in a bounding box around Scotland, so other areas are partly covered. I would estimate more than 450,000 of these are likely to be within Scotland after analysis.
Here's the mini milestone plan:
- Step 1 - grab list of candidate images. Done
- Step 2 - sample data to check for OS region naming, small test runs. Done
- Step 3 - review list of regions to be matched. Done
- Step 4 - review large test run. 4,500 completed. The 'net' captured a fair chunk of Northern Ireland; those images get noted in my terminal window but result in no changes on-wiki. Otherwise this seems to have run without any issues.
- Step 5 - monitor run and fix any bugs (likely to take 1 or 2 months to complete). Working
Region names to test for. Note that a few bordering regions are listed towards the end as wanted; this is not a mistake, just a convenience that overlaps with the England categorization work:
OS name | Commons name |
---|---|
City of Edinburgh | Edinburgh |
Dundee City | Dundee |
Glasgow City | Glasgow |
Highland | Highland (council area) |
Aberdeen City | Aberdeen |
Stirling | Stirling council area |
Shetland Islands | Shetland |
Borough Edinburgh | Edinburgh |
Sunderland | Tyne and Wear |
Stage 6: Northern Ireland
- Status Working · Category:Geograph images in Northern Ireland by Council
There is currently a set of Geograph categories available of NI Counties, but the boundaries available within the OS Open Data appear to be limited to NI Councils (which superseded the NI Counties in 1973). The following proposed categorization reflects that, but may need to map the visible categories to parents such as those in Category:Districts in Northern Ireland.
Implementation notes:
- Adding a Geograph Council category does not replace a Geograph County category, it only supplements it.
- A visible category is added if there are no visible categories currently.
- OSM box http://www.openstreetmap.org/?box=yes&bbox=-8.2999,54.0,-5.3,55.4399
The net caught 66,704. After seeing the initial results, I realized that the rectangle was slightly off, lying to the east of the intended area, and so re-cast it.
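The first two implementation notes could be sketched like this. It is not Faebot's actual code: the function name `plan_additions`, the category names in the example, and the idea of passing in a set of known hidden categories are all hypothetical.

```python
def plan_additions(current, council, visible_parent, hidden):
    """Return the categories to add to an image; nothing is ever removed.

    current: category names already on the image
    council: council name from the boundary data
    visible_parent: visible category to fall back on
    hidden: set of category names known to be hidden
    """
    additions = []
    council_cat = f"Geograph images in {council}"  # hypothetical naming
    if council_cat not in current:
        additions.append(council_cat)
    # Add a visible category only if the image has none at present.
    if not any(cat not in hidden for cat in current):
        additions.append(visible_parent)
    return additions


hidden = {"Geograph images in County Antrim", "Geograph images in Belfast"}
print(plan_additions(["Geograph images in County Antrim"],
                     "Belfast", "Belfast", hidden))
```

Because the function only returns additions, an existing Geograph County category is left in place and the Council category supplements it.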