Page MenuHomePhabricator

Restbase IDs should not conflict with other Parsoid IDs
Closed, InvalidPublic0 Estimated Story Points

Description

Currently a RESTBase ID is mw+Base64, so /mw[A-Za-z0-9_-]+/, however this conflicts with mw-reference-contents IDs and possibly others.

For T146054 we are required to strip IDs for imported content, but to do that we need to know what a RESTBase ID is.

Event Timeline

From IRC that can help clarify this better:

<subbu> edsanders, another thought ... but, won't the id assigned to the <ref> conflict with a similar <ref> id already in the doc?
<subbu> i.e. original doc "foo <ref>x</ref>" and I paste "bar <ref>y</ref>" .. won't the ids on both the refs be identical?
<edsanders> probably
<subbu> right. so, maybe this needs a different solution instead of changing the id scheme or stripping?
<edsanders> I think out document merge code might handle that
<subbu> and also look at this: <script id="mw-pagebundle" type="application/x-mw-pagebundle">{"parsoid":{"counter":2,"ids":{"mwAA":{"dsr":[0,35,0,0]},"mwAQ":{"dsr":[0,19,0,0]},"cite_ref-1":{"dsr":[5,19,5,6]},"mwAg":{"dsr":[20,34,2,2]},"mw-reference-text-cite_note-1":{}}},"mw":{"ids":{}}}</script>
<subbu> so, all ids get assigned data-parsoid (even non-parsoid generated ones), so, you need to re-assign ids everywhere.
<subbu> parsoid just reuses existing ids if they are already present.
<subbu> so, maybe you need to just disambiguate the ids by adding an unique suffix/prefix?
<edsanders> We inline the reference content, so it isn't actually a problem
<subbu> edsanders, i am confused now .. so, are you then stripping ids everywhere (even refs) and we don't need to worry about the id scheme?
<edsanders> we can't strip the ids from refs
<edsanders> you provide the reference content in a separate list - we inline it, so we can't strip those IDs before converting to linmod as they link the reference content to the reference
<edsanders> the ID stripping is done on the parsoid/restbase response - before it goes into the converter
<subbu> i see .. the id stripping is happening in 2 phases, and ref ids cannot be stripped before it goes into linmod but you are saying you can be tripped up by the regexp you are using.
<subbu> and hence you are asking for a different id scheme

The offending lines in Parsoid:
uid = 'mw' + JSUtils.counterToBase64(docDp.counter);

...

counterToBase64: function(n) {
	/* jshint bitwise: false */
	var arr = [];
	do {
		arr.unshift(n & 0xff);
		n >>= 8;
	} while (n > 0);
	return (new Buffer(arr))
		.toString("base64")
		.replace(/=/g, "")
		.replace(/\//g, "_")
		.replace(/\+/g, "-");
},

My best guess for a regex to match Parsoid IDs at the moment is

/^mw[a-zA-Z0-9\-_]{2,6}$/

This restricts the match to 6 characters so IDs like 'mw-reference-...' aren't picked up. The document would need to be astronomically long to require 6 'digits' in base 64. (64^5? IDs)

Change 315339 had a related patch set uploaded (by Esanders):
Define RESTBase ID pattern in platform and fix slightly

https://gerrit.wikimedia.org/r/315339

My best guess for a regex to match Parsoid IDs at the moment is

/^mw[a-zA-Z0-9\-_]{2,6}$/

This restricts the match to 6 characters so IDs like 'mw-reference-...' aren't picked up. The document would need to be astronomically long to require 6 'digits' in base 64. (64^5? IDs)

Sounds reasonable to me and looks like it addresses the issue you were raising.

The issue you are trying to avoid is clashes between ids in new, pasted content & existing content, as well as the false data-parsoid and data-mw reuse that would come with that. These clashes can be triggered by any ID -- both those prefixed with 'mw', and those that aren't. So, I don't think stripping only mw* IDs is safe. Instead, I think it would make more sense to check for conflicts & rename / strip IDs if any are found.

Change 315339 merged by jenkins-bot:
Update VE core submodule to master (90cbd62)

https://gerrit.wikimedia.org/r/315339

@Esanders: Is this still an issue / does it need addressing? As far as I can tell, it is not, but, please reopen and update if any action is needed on our end.