We would like to create a feed, potentially an EventStream (https://wikitech.wikimedia.org/wiki/EventStreams), of all additions and removals of external links on Wikimedia projects.
Currently, the only way to investigate external links on Wikimedia sites is to view which links are present at the time a query is made. No information is collected on when links were added, who added them, or how the number of external links to a domain varies over time. This presents a number of difficulties for community members and WMF teams who would benefit from such data.
For the first stage of this, we should have a feed which posts an event when:
- An edit is made which adds a URL
- An edit is made which removes a URL
- A page containing one or more URLs is created
- A page containing one or more URLs is deleted
At a minimum, each event should contain:
- Information on whether the URL was added or removed*
- The URL which was modified
- Timestamp
- Recent changes ID (see open question below)
*A modified URL would be logged as both a removal (of the old URL) and an addition (of the new URL)
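The four triggers and the minimum fields above can be illustrated with a minimal sketch. It assumes the set of external links before and after an edit is already available; `LinkEvent`, `diff_links`, and the field names are hypothetical and do not reflect any existing MediaWiki or EventStreams schema. Page creation is modelled as an empty "before" set and page deletion as an empty "after" set.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class LinkEvent:
    """Hypothetical event shape holding only the minimum fields listed above."""
    action: str            # "added" or "removed"
    url: str               # the URL that was added or removed
    timestamp: str         # time of the edit, e.g. ISO 8601
    rc_id: Optional[int]   # recent changes ID, if one exists for the edit


def diff_links(before: Iterable[str], after: Iterable[str],
               timestamp: str, rc_id: Optional[int]) -> List[LinkEvent]:
    """Compare the links present before and after an edit.

    A modified URL shows up as one "added" event (new URL) and one
    "removed" event (old URL), because the two strings differ.
    """
    before_set, after_set = set(before), set(after)
    events = [LinkEvent("added", url, timestamp, rc_id)
              for url in sorted(after_set - before_set)]
    events += [LinkEvent("removed", url, timestamp, rc_id)
               for url in sorted(before_set - after_set)]
    return events
```

For example, `diff_links([], urls, ts, rc_id)` yields one "added" event per URL on a newly created page, and `diff_links(urls, [], ts, rc_id)` yields one "removed" event per URL on a deleted page.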
Open questions
- Would we have one event per edit, containing lists of URLs added/removed in this edit, or one event per URL, such that there are multiple events for the same edit?
- Do the above definitions make any sense for Wikidata?
- Should each event additionally contain all the same information as the recentchanges feed (e.g. username, edit summary, namespace - https://www.mediawiki.org/wiki/Manual:RCFeed), or should we just include the recentchanges ID for post-processing?
- Are EventStreams stored in a database too, or is that a separate process? Is an EventStream sufficient for the use cases below, or do we also need a data store?
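To make the first open question concrete, the sketch below shows the two candidate shapes side by side: one event per edit carrying lists of URLs, and one event per URL. Both functions and all key names are hypothetical and exist only for comparison; neither is a proposed schema.

```python
from typing import Dict, Iterable, List


def per_edit_event(rc_id: int, timestamp: str,
                   added: Iterable[str], removed: Iterable[str]) -> Dict:
    """Option A: a single event per edit, listing every URL added or removed."""
    return {
        "rc_id": rc_id,
        "timestamp": timestamp,
        "added": sorted(added),
        "removed": sorted(removed),
    }


def per_url_events(rc_id: int, timestamp: str,
                   added: Iterable[str], removed: Iterable[str]) -> List[Dict]:
    """Option B: one event per URL, so a single edit may produce several events."""
    events = [{"rc_id": rc_id, "timestamp": timestamp, "action": "added", "url": url}
              for url in sorted(added)]
    events += [{"rc_id": rc_id, "timestamp": timestamp, "action": "removed", "url": url}
               for url in sorted(removed)]
    return events
```

Option A keeps all changes from an edit together, while Option B lets consumers filter by URL or domain without unpacking lists; in both cases the shared recent changes ID would allow the events to be joined back to the edit.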
Use cases
For the Wikimedia community, the primary use case would be anti-spam efforts. When a user finds that spam links have been added to a particular page, they can currently turn to Special:LinkSearch or tools such as Hay’s Tool (http://tools.wmflabs.org/hay/exturl/index.php), which will show them where links are currently present on that Wikimedia project. They can then investigate the history of each page containing the domain to track down which user added the link. Any links that have already been removed from other pages will remain invisible to the user, as will links placed on other Wikimedia projects. A tool which could present the editor with a list of links added to Wikimedia projects, and the users who added them, would be extremely beneficial and time-saving.
This data is also extremely valuable to the GLAM and Wikipedia Library teams at the WMF. One of the primary metrics by which The Wikipedia Library (TWL) can measure the success of its publisher access donation program is the number of citations being added to Wikipedia by contributors who gained access to donated content. At present, the only data available to the team is the overall number of links present, manually collected and monitored over time, along with self-reporting surveys. This is a time-consuming and inexact process. Access to data which could be used to track those citations more precisely would be invaluable. TWL partners are requesting reports on this data in order to evaluate continuing partnerships, and the TWL team need it to track their performance and value to the community.
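As a rough illustration of how such a feed could replace manual link counts, the sketch below tallies the net change in links per domain from a sequence of (action, URL) pairs. The input format and the function name `net_links_by_domain` are assumptions made for this example; only `urllib.parse.urlsplit` and `collections.Counter` from the Python standard library are used.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple
from urllib.parse import urlsplit


def net_links_by_domain(events: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Sum +1 for each "added" URL and -1 for each "removed" URL, per hostname."""
    totals: Counter = Counter()
    for action, url in events:
        # hostname is None for URLs without a network location; group those under "".
        host = urlsplit(url).hostname or ""
        totals[host] += 1 if action == "added" else -1
    return dict(totals)
```

Grouping the same tally by time window, or by username if the feed includes it, would give the per-domain trends and per-contributor citation counts described above.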
Lastly, external partners such as the Internet Archive (IA) could use this feed to monitor links added to Wikimedia projects. For IA, this would enable immediate archiving of every citation, in conjunction with InternetArchiveBot (T199193).