Create a feed or log of changed links on Wikimedia projects
Closed, Resolved · Public

Description

We would like to create a feed - potentially an EventStream (https://wikitech.wikimedia.org/wiki/EventStreams) - which provides a feed of all additions and removals of external links on Wikimedia projects.

Investigation of the history of external links on Wikimedia sites is currently only possible through a view of which links are present at the time a query is made. No information is collected on when links were added, who added them, or how the number of external links to a domain varies with time. This presents a number of difficulties for community members and WMF teams who would benefit from such data.

For the first stage of this, we should have a feed which posts an event when:

  • An edit is made which adds a URL
  • An edit is made which removes a URL
  • A page containing one or more URLs is created
  • A page containing one or more URLs is deleted

Each event should definitely contain:

  • Information on whether the URL was added or removed*
  • The URL which was modified
  • Timestamp
  • Recent changes ID (see open question below)

*A modified URL would naturally log both an addition and a removal
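As a rough sketch of the event described above, assuming one event per URL (which the open questions below leave undecided) and hypothetical field names that do not come from any deployed schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class LinkChangeEvent:
    """One URL addition or removal. All field names are hypothetical."""
    action: str            # "added" or "removed"
    url: str               # the URL which was modified
    timestamp: datetime    # when the edit, creation or deletion happened
    rc_id: Optional[int]   # recent changes ID, if one exists for the change


def events_for_modified_url(old_url, new_url, timestamp, rc_id=None):
    """A modified URL logs both a removal (old URL) and an addition (new URL)."""
    return [
        LinkChangeEvent("removed", old_url, timestamp, rc_id),
        LinkChangeEvent("added", new_url, timestamp, rc_id),
    ]


# Illustrative values only.
when = datetime(2019, 1, 25, 13, 10, tzinfo=timezone.utc)
events = events_for_modified_url(
    "http://example.org/old", "https://example.org/new", when, rc_id=12345
)
for event in events:
    print(event.action, event.url)
```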

Open questions

  • Would we have one event per edit, containing lists of URLs added/removed in this edit, or one event per URL, such that there are multiple events for the same edit?
  • Do the above definitions make any sense for Wikidata?
  • Should each event additionally contain all the same information as the recentchanges feed (e.g. username, edit summary, namespace - https://www.mediawiki.org/wiki/Manual:RCFeed), or should we just include the recentchanges ID for post-processing?
  • Are EventStreams stored in a database too, or is that a separate process? Is an EventStream sufficient for the use cases below, or do we also need a data store?
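To make the first open question concrete, here is a minimal sketch of the two candidate shapes. The dictionary keys are invented for illustration and are not a proposal for the final schema:

```python
def per_edit_event(rc_id, added, removed):
    """Option A: a single event per edit, carrying lists of URLs."""
    return {"rc_id": rc_id, "added": sorted(added), "removed": sorted(removed)}


def per_url_events(rc_id, added, removed):
    """Option B: one event per URL, so one edit can yield several events."""
    events = [{"rc_id": rc_id, "action": "added", "url": u} for u in sorted(added)]
    events += [{"rc_id": rc_id, "action": "removed", "url": u} for u in sorted(removed)]
    return events


# Hypothetical edit that adds two URLs and removes one.
added = {"https://example.org/a", "https://example.org/b"}
removed = {"https://example.org/c"}

single = per_edit_event(1, added, removed)
many = per_url_events(1, added, removed)
print(len(single["added"]) + len(single["removed"]), len(many))
```

Option A keeps an edit's changes together, while Option B is simpler to filter by URL or domain; either can be derived from the other as long as the recent changes ID is carried along.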

Use cases
For the Wikimedia community, the primary use case would be anti-spam efforts. When a user finds that spam links have been added to a particular page, they can currently turn to Special:LinkSearch or tools such as Hay’s Tool (http://tools.wmflabs.org/hay/exturl/index.php), which will show them where links are currently present on that Wikimedia project. They can then investigate the histories of each page containing the domain to track down which user added the link. If any links have already been removed from other pages, they will remain invisible to the user, along with links placed on other Wikimedia projects. A tool which could present the editor with a list of links added to Wikimedia projects, and the users who added them, would be extremely beneficial and time-saving.

This data is also extremely valuable to the GLAM and Wikipedia Library teams at the WMF. One of the primary metrics by which TWL can measure the success of its publisher access donation program is the number of citations being added to Wikipedia by contributors who gained access to donated content. At present, the only data available to the team is the overall number of links present, manually collected and monitored over time, along with self-reporting surveys. This is a time-consuming and inexact process. Access to data which could be used to track those citations more precisely would be invaluable. TWL partners are requesting reports on this data in order to evaluate continuing partnerships, and the TWL team need it to track their performance and value to the community.

Lastly, external partners such as the Internet Archive could make use of this feed to monitor links added to Wikimedia projects. In the case of IA, this would enable immediate archiving of every citation in conjunction with InternetArchiveBot (T199193).

Event Timeline

@Samwalton9: are you taking into account the wiki size when it comes to number of events being sent?

That was on enwiki, so should be the place this gets most events. Not sure yet how many wikis we would be enabling this on. To clarify - I looked at one minute's worth of Special:RecentChanges and just counted how many links were added or removed to get that value.

Are your events being sampled?

I don't believe so.

Before enabling enwiki please calculate how many events you might be sending per second.

It's important to note, however, that the normal rate and the maximum rate will be very different in this case. If someone adds a link to a popular template, you may get hundreds of events per second. Normally though, the rate should be very slow.
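A back-of-the-envelope version of that calculation might look like the sketch below. The count is a made-up placeholder, not the figure measured on enwiki, and it only estimates the normal rate, not bursts:

```python
def mean_events_per_second(link_changes, window_seconds=60):
    """Average event rate from a count of link changes seen in a window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return link_changes / window_seconds


# Hypothetical count from one minute of Special:RecentChanges.
sample_count = 30
rate = mean_events_per_second(sample_count)
per_day = rate * 24 * 60 * 60
print(f"{rate:.2f} events/s, about {per_day:.0f} events/day")
```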

Change 313065 had a related patch set uploaded (by Legoktm):
Fix Schema:ExternalLinksChange logging if no links are left on page

https://gerrit.wikimedia.org/r/313065

> If someone adds a link to a popular template, you may get hundreds of events per second. Normally though, the rate should be very slow.

Well, we only log on actual edits, so links changed via template update aren't recorded.

Change 313110 had a related patch set uploaded (by Legoktm):
Trigger Schema:ExternalLinksChange logging on page deletion

https://gerrit.wikimedia.org/r/313110

\o/ thanks Lego, I'm gonna check out that code, maybe I can give these fixes a shot next time. Let us know when it's deployed to testwiki or tell us how to deploy it?

@Milimetric: the changes will be deployed with the next MediaWiki deployment.

Change 313065 merged by jenkins-bot:
Fix Schema:ExternalLinksChange logging if no links are left on page

https://gerrit.wikimedia.org/r/313065

Change 313110 merged by jenkins-bot:
Trigger Schema:ExternalLinksChange logging on page deletion

https://gerrit.wikimedia.org/r/313110

Looks like we're running into some problems, but it's hard to pinpoint why. I've been trying to run test edits (the same I was running last time), before and after the latest patches, and my edits aren't showing up in analytics-store or analytics-slave at all. Other edits are coming up, but from looking at some examples, there should be more than I'm seeing, so I'm not sure what the issue is. I can't see any discernible pattern to the logged or not logged edits.

I can try to debug if someone will sit with me and show me how. I've never really done mediawiki development and I think fumbling through the setup would be inefficient.

@Legoktm Could you please look into the above issue? I've pinged you on IRC and sent a couple of emails but haven't heard anything back.

Not having this data is starting to impact the library's progress; we're waiting on it for the purposes of research, tools, and showing program success to partners.

My only hunch is that this is caused by the edit stashing optimization skipping early and preventing it from being logged.

		// Skip blacklist checks if nothing matched during edit stashing...
		$knownNonMatchAsOf = $cache->get( $key );
		if ( $mode === 'check' ) {
			if ( $knownNonMatchAsOf ) {
				$statsd->increment( 'spamblacklist.check-stash.hit' );

				return false;
			} else {
				$statsd->increment( 'spamblacklist.check-stash.miss' );
			}
		} elseif ( $mode === 'stash' ) {
			if ( $knownNonMatchAsOf && ( time() - $knownNonMatchAsOf ) < self::STASH_AGE_DYING ) {
				return false; // OK; not about to expire soon
			}
		}
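To illustrate the hunch only: the toy model below is not the SpamBlacklist code, and where the logging call sits is an assumption. It shows how an early return on a cached "no match" would skip any logging placed after it:

```python
class ToyLinkFilter:
    """Toy model of the suspected failure mode; every name here is hypothetical."""

    def __init__(self):
        self.known_non_match = False  # stands in for the cached $knownNonMatchAsOf
        self.logged = []              # stands in for ExternalLinksChange events

    def stash(self, links):
        # Edit stashing runs the check ahead of the save and caches "no match".
        self.known_non_match = True

    def check(self, links):
        if self.known_non_match:
            return False              # early return: the logging below never runs
        self.logged.extend(links)     # assumed placement of the logging call
        return False


links = ["https://example.org/"]

cold = ToyLinkFilter()
cold.check(links)                     # no stash beforehand, so the edit is logged

warm = ToyLinkFilter()
warm.stash(links)                     # stashing populated the cache first
warm.check(links)                     # the save hits the cache and skips logging

print(cold.logged, warm.logged)
```

If this is what happens, whether an edit gets logged would depend on whether stashing ran first, which would look random from the outside.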

@Legoktm Interesting! Any idea how this could be resolved?

@Legoktm @kaldari @Milimetric @Krenair Hi folks. We're having a hard time moving forward with our 60+ publisher partnerships in the Wikipedia Library. We're implementing a digital library platform which will give up to 25,000 editors access to up to 80,000 unique journals--but we're not able to move forward until we have this basic metrics infrastructure in place. I know Kunal is busy; is there any way to set aside 1-2 hours to debug what's wrong and do a quick sprint on it? We flagged this issue over a year ago and our partners have been very patient, but it's becoming an obstacle to our growth and to providing editors with access to research.

@Ocaasi_WMF: Analytics can help troubleshoot EventLogging issues as needed; it is really easy to test this on beta.

Now, someone needs to own the MediaWiki side of publishing these metrics and the code that does so. @Legoktm? @kaldari? It seems it has been hard to get a MediaWiki developer to own this code, and that is why this ticket has been idle.

@Nuria the challenge has been that @Legoktm made a good start, but the capacity is not there for sustained, focused work. If you can help connect us with someone who can solve the problem, that would be awesome.

@Sadads: Sorry I cannot be of more help, but the Analytics team does not instrument MediaWiki code; normally the teams that own the code and understand the functionality instrument it. I would talk to the community liaisons, who may be able to help you secure time from a MediaWiki developer who can follow through until the project is complete.

@bd808 @MSchottlender-WMF @awight Hi! We're a bit stuck but very close to implementing an important feature for us. Is there any chance you could spare a little time to debug the issue above? This is really close to being ready and we would appreciate any assistance you can give here. Thank you, Jake (Wikipedia Library)

This has probably all been tried before, but for what it's worth: nothing showed up after 30+ minutes of edits in a number of namespaces (user, user talk, article), as both a logged-in and a logged-out user.

EDIT: However, around an hour or so later my edits appeared in the EventLogging data.

@Samtar: that delay is most likely due to event logging replication catching up with production inserts. So that's normal. Anything beyond a few hours means maybe some maintenance is going on. And anything beyond a day or so might indicate a problem.

If there is an rc-feed of edits for testwiki, I could set up LiWa3 to feed added links to a channel on freenode (I'd have to figure that out; it is a matter of changing on-wiki settings and some killing on the server, and it is a long time since I added a feed manually). I would say that if the rc-feed processed an edit, then the db should also be updated. In my experience, external link searches on wikis are quickly updated (as fast as a diff gets saved and reported to rc), and as far as I understand that search is based on a separate table that gets updated after every diff. I presume that the same hook would update the list of added and removed links.

Change 342913 had a related patch set uploaded (by Milimetric):
[mediawiki/extensions/SpamBlacklist] Fix improper index access in event logging code

https://gerrit.wikimedia.org/r/342913

Change 342913 merged by Ottomata:
[mediawiki/extensions/SpamBlacklist@master] Fix improper index access in event logging code

https://gerrit.wikimedia.org/r/342913

Good news is that some (not all) of my edits from the 15th showed up in the database at some point since then. Bad news is that none of the ones I made 20 minutes ago have.

@Samwalton9, what do you mean by 'at some point'? Do you mean that this has an enormous lag? We do see some effect in deterring spammers by acting in real time (within minutes); many are hit-and-run editors, and I have seen 'good faith spammers' with many warnings on many IPs complain that they were never contacted.

@Beetstra Good question. @Milimetric thinks that if/when this is working correctly the delay should be no more than a few minutes, but from the testing above the time is currently somewhere between 20 minutes (when I stopped actively monitoring) and 2 hours (when I went back to look later on). Something's still wrong though, per the odd 30% success rate above.

The data is in a database replica. It does not appear there in real time (it needs to be replicated). The "event" is sent in real time and stored in the master at that time.

Sam knows about the replication, that's why he knows there's a delay. But Sam, I don't think that adding the same exact google link to a page that already has that link will trigger an ExternalLinksChange. I have to read the code more closely, but I thought it did a diff on the unique links before and after. Can you please test adding 10 different links one at a time and then removing some and adding them back? Then we can do a query on that session. Are you sure the new code is deployed though? Is that why you're testing?
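If the code really does diff the unique links before and after an edit, as guessed above (not yet confirmed from the source), the behaviour would resemble this sketch, in which re-adding a link the page already has produces no change:

```python
def diff_unique_links(before, after):
    """Return (added, removed) based on the sets of unique URLs."""
    before_set, after_set = set(before), set(after)
    return sorted(after_set - before_set), sorted(before_set - after_set)


google = "https://www.google.com"

# Adding a second copy of a link already on the page: nothing to report.
print(diff_unique_links([google], [google, google]))

# Adding a new link and removing the old one: one addition, one removal.
print(diff_unique_links([google], ["https://example.org/"]))
```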

> Sam knows about the replication, that's why he knows there's a delay. But Sam, I don't think that adding the same exact google link to a page that already has that link will trigger an ExternalLinksChange.

The data doesn't quite seem consistent with that - it has before (including in one diff from the tests I tabulated above) recorded multiple additions/removals to the same URL, but it's worth testing.

> I have to read the code more closely, but I thought it did a diff on the unique links before and after. Can you please test adding 10 different links one at a time and then removing some and adding them back? Then we can do a query on that session.

Will do.

> Are you sure the new code is deployed though? Is that why you're testing?

I believe so - it was deployed with 2017-03-28_1.29.0-wmf.18 which is what was live on test.wiki when I made those tests.

On a related note, any idea why we now have two tables containing this data? (ExternalLinksChange_15716074 and ExternalLinksChange_15716074_15423246) The former currently holds 50 logs, all from today, and the other seems to hold the data from before that.

The two tables are the result of maintenance on EventLogging: we needed to delete data from old tables without new data being inserted, so we duplicated the tables. We announce things like this on the analytics public list in case you want to stay in touch.

Thanks for the other info, look forward to your test results.

thanks @Samwalton9, I'll take a look and replicate in vagrant.

I've replicated the bug in vagrant. Annoyingly, there's a parameter called "preventLog" and it does what it says randomly. After a while, it seems to always be on. Something to do with caching, so I have to read the rest of the code more carefully now. I know what the problem is but I'll need a minute to fix it and I have to attend to three other more immediate tasks. I'll be able to get to it by the end of the week, please do ping me if I fail to do so.

Samwalton9-WMF renamed this task from Implement Schema:ExternalLinksChange to Create an event stream for a feed of changed links on Wikimedia projects. Jan 25 2019, 1:10 PM
Samwalton9-WMF assigned this task to bmansurov.
Samwalton9-WMF triaged this task as Medium priority.
Samwalton9-WMF updated the task description.
Samwalton9-WMF added a subscriber: bmansurov.
Restricted Application added a subscriber: Cyberpower678.

I've rescoped this task to represent the work taking place in Knowledge-Integrity (T199189).

Samwalton9-WMF renamed this task from Create an event stream for a feed of changed links on Wikimedia projects to Create a feed or log of changed links on Wikimedia projects. Jan 25 2019, 4:58 PM
Samwalton9-WMF updated the task description.

Change 301432 abandoned by Milimetric:
[WIP] Analyze external link insertion and deletion

Reason:
This is irrelevant now, Bahodir is working on a different and much better approach based on EventLogging I think.

https://gerrit.wikimedia.org/r/301432

@Samwalton9 Can someone own the cleanup of code for the old eventlogging schema ExternalLinksChange? cc @bmansurov We will not be using that data anymore so there is no reason to run javascript/maintain schema or send events.

@Nuria a quick note that bmansurov and I discussed ExternalLinksChange and we won't be using it in Research. We see the new event stream that bmansurov is working on as a substitute for the old one (again, only considering our use cases; I understand if others have other use cases for the old event stream schema). With that, we won't be taking on its maintenance, and we leave the decision of whether to maintain it or not to your team. I hope this helps.

The maintainers are the team/people/volunteers that instrumented the code. I am just noting on this ticket that if we do not use the data, the instrumentation should be cleaned up, and cc-ing @Samwalton9 to please coordinate that cleanup.

Sure thing - no problem.

The code was implemented by @Legoktm with some later assistance from @Milimetric. Could either of you shed some light on the steps necessary to get the previous work on ExternalLinksChange cleaned up?

The instrumentation for ExternalLinksChange was kind of jammed into the SpamBlacklist extension. It never really worked, so it could just be pulled out entirely. It seemed hard to force it to work, but it doesn't seem hard to undo it and clean up SpamBlacklist, as I don't think there are problems with the extension itself.

This is now live at https://stream.wikimedia.org/v2/stream/page-links-change

From some initial testing it looks to be working well for links added and removed, including page creations. However, it doesn't catch links that are removed when the page containing them is deleted. Undeleting the page counts as a page creation and tracks the link addition(s).

Another possible error, in addition to the two existing subtasks, that I haven't quite been able to reproduce yet:

I have a console open logging anything that comes in with User:Samwalton9 as the performer (link_event['performer']['user_text'] == 'Samwalton9') so that I can test various edits and behaviours. Just now, an event came in that didn't correspond to an edit that I, or seemingly anyone else, made at that time:

event: message
id: [{"topic":"eqiad.mediawiki.page-links-change","partition":0,"timestamp":1550573806001},{"offset":-1,"partition":0,"topic":"codfw.mediawiki.page-links-change"}]
data: {"added_links":[{"external":false,"link":"/wiki/Edge_(video_game)"},{"external":false,"link":"/wiki/Talk:Edge_(video_game)/GA1"}],"database":"enwiki","meta":{"domain":"en.wikipedia.org","dt":"2019-02-19T10:56:46+00:00","id":"0c8d2b30-3435-11e9-b0e4-1866da993d2e","request_id":"XGl-pQpAAEIAAA4B8rcAAACJ","schema_uri":"mediawiki/page/links-change/1","topic":"eqiad.mediawiki.page-links-change","uri":"https://en.wikipedia.org/wiki/Talk:Tampon_Run","partition":0,"offset":5705026},"page_id":45403409,"page_is_redirect":false,"page_namespace":1,"page_title":"Talk:Tampon_Run","performer":{"user_edit_count":20912,"user_groups":["abusefilter","sysop","*","user","autoconfirmed"],"user_id":15991542,"user_is_bot":false,"user_registration_dt":"2011-12-29T02:44:39Z","user_text":"Samwalton9"},"removed_links":[{"external":false,"link":"/wiki/Amy_Rose"},{"external":false,"link":"/wiki/Deus_Ex:_Mankind_Divided"},{"external":false,"link":"/wiki/List_of_Sonic_the_Hedgehog_characters"},{"external":false,"link":"/wiki/Talk:Deus_Ex:_Mankind_Divided/GA1"},{"external":false,"link":"/wiki/Talk:List_of_Sonic_the_Hedgehog_characters"}],"rev_id":648195620}

This event relates to a page I created (https://en.wikipedia.org/wiki/Talk:Tampon_Run), but it has received no edits since 2015. The rev_id relates to the last edit I made there (https://en.wikipedia.org/w/index.php?diff=648195620). There are no RecentChangesLinked for that timestamp, and a cursory glance through RecentChanges didn't find anything around that timestamp that would correspond to the edit. I take it, therefore, that this has something to do with the cache of the WikiProject:Video games template on that page. Indeed, some of the data in that event corresponds to a recent edit to Template:WPVG announcements, which is transcluded in the WP:VG banner (https://en.wikipedia.org/w/index.php?title=Template%3AWPVG_announcements&type=revision&diff=883776243&oldid=882982540).

I haven't yet been able to reproduce an event like this in my sandbox (it doesn't seem to be as simple as a transcluded template being updated).

EDIT: Just had another one of those events show up. I purged the cache on this page a few minutes ago attempting to reproduce the event above on a different (similarly low traffic) page. Again, the rev_id corresponds to my last edit to the template (https://en.wikipedia.org/w/index.php?diff=655548868):

{'added_links': [{'external': False, 'link': '/wiki/Edge_(video_game)'}, {'external': False, 'link': '/wiki/Talk:Edge_(video_game)/GA1'}], 'database': 'enwiki', 'meta': {'domain': 'en.wikipedia.org', 'dt': '2019-02-19T11:20:48+00:00', 'id': '680cfb1a-3438-11e9-b6ea-1866da994975', 'request_id': 'XGl-pQpAAEIAAA4B8rcAAACJ', 'schema_uri': 'mediawiki/page/links-change/1', 'topic': 'eqiad.mediawiki.page-links-change', 'uri': 'https://en.wikipedia.org/wiki/Talk:Robot_Roller-Derby_Disco_Dodgeball', 'partition': 0, 'offset': 5723202}, 'page_id': 46354530, 'page_is_redirect': False, 'page_namespace': 1, 'page_title': 'Talk:Robot_Roller-Derby_Disco_Dodgeball', 'performer': {'user_edit_count': 20917, 'user_groups': ['abusefilter', 'sysop', '*', 'user', 'autoconfirmed'], 'user_id': 15991542, 'user_is_bot': False, 'user_registration_dt': '2011-12-29T02:44:39Z', 'user_text': 'Samwalton9'}, 'removed_links': [{'external': False, 'link': '/wiki/Amy_Rose'}, {'external': False, 'link': '/wiki/Deus_Ex:_Mankind_Divided'}, {'external': False, 'link': '/wiki/List_of_Sonic_the_Hedgehog_characters'}, {'external': False, 'link': '/wiki/Talk:Deus_Ex:_Mankind_Divided/GA1'}, {'external': False, 'link': '/wiki/Talk:List_of_Sonic_the_Hedgehog_characters'}], 'rev_id': 655548868}
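For anyone wanting to repeat this kind of test, here is a minimal sketch of the filtering described above. It assumes each event arrives as a single `data:` line of JSON, as in the raw event quoted earlier; the helper names are made up, and the sample lines are trimmed stand-ins rather than real stream output:

```python
import json


def parse_sse_data(lines):
    """Yield decoded JSON payloads from the `data:` lines of an SSE stream."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                yield json.loads(payload)


def by_performer(events, user_text):
    """Keep only the events performed by the given user."""
    for event in events:
        if event.get("performer", {}).get("user_text") == user_text:
            yield event


def external_changes(event):
    """Return (added, removed) URLs, dropping internal /wiki/ links."""
    added = [l["link"] for l in event.get("added_links", []) if l.get("external")]
    removed = [l["link"] for l in event.get("removed_links", []) if l.get("external")]
    return added, removed


# Trimmed, partly invented sample lines for illustration.
sample = [
    'event: message',
    'data: {"added_links":[{"external":false,"link":"/wiki/Edge_(video_game)"}],'
    '"removed_links":[],"performer":{"user_text":"Samwalton9"},"rev_id":648195620}',
    'data: {"added_links":[{"external":true,"link":"https://example.org/"}],'
    '"removed_links":[],"performer":{"user_text":"SomeoneElse"},"rev_id":1}',
]

mine = list(by_performer(parse_sse_data(sample), "Samwalton9"))
print(len(mine), external_changes(mine[0]))
```

Against the live stream, the same functions could in principle be fed the lines of an open HTTP response (for example from `urllib.request.urlopen` on the stream URL above), though that has not been tested here. Filtering on `external` would also hide template-driven events like the two above, since they contain only internal links.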

@Samwalton9 thanks for the above comment. It would be best if we create sub-tasks for individual bugs. That way we can focus on a single problem at a time.

Yep, fair point - I was making notes while trying to debug. Will create a task now.

Since we discussed this at the last check-in: reusing a named ref doesn't fire a second event for the links in that reference tag, which is the desired behaviour.

@Samwalton9 now that the main bug has been fixed, maybe we can close this task? Subtasks seem low priority and I certainly don't have bandwidth to work on them any time soon.

Yep, I think that makes sense. Thanks for your work on this!