Page MenuHomePhabricator

Parsoid needs a core-compatible LinkTarget interface
Closed, ResolvedPublic

Description

A number of methods of ParserOutput currently take a Title or Title-related class, like LinkTarget, as arguments. This leads to two problems:

  1. Parsoid can not directly define an API with these methods, because the Title class is part of core and Parsoid can't reference it without introducing a circular dependency between core and Parsoid. Further the Title class is too enmeshed in core to allow easy refactoring into an independent library.
  2. Eager resolution of strings into Titles causes inefficient database churn. It is preferable to use types which either avoid requiring database queries entirely, or to the extent they are required (for example, to look up the article ID associated with a title), perform such queries in batches.

'''OLD PLAN:'''
To solve both of these issues we propose to introduce a LazyLinkTarget class, and accept LazyLinkTarget|LinkTarget as arguments to ParserOutput. We will attempt to defer resolution of LazyLinkTarget as long as possible, so that it is possible to do so in a batch only as needed.

We'll also introduce a "LazyLinkTarget" factory type in Parsoid, which turns strings (or ints, see comment below) into "LazyLinkTarget" objects. This can be a marker interface in standalone Parsoid, since Parsoid never needs to be able to interrogate a LazyLinkTarget in any way. In core, the LazyLinkTarget factory would be instantiated in such a way that title resolution occurs in batches.

Note that, for instance, ParserOutput::addLink() ultimately does:

		if ( $id === null ) {
			$page = MediaWikiServices::getInstance()->getPageStore()->getPageForLink( $link );
			$id = $page->getId();
		}
		$this->mLinks[$ns][$dbk] = $id;

Part of this task is changing mLinks so that it stores LazyLinkTargets as well, so that we do the getPageForLink()->getId() only at the last possible moment and with a full batch of link targets at once.

(comments re the above code in particular: "The whole thing with optional $id is weird. It can be 0 since not all links can have a page ID, but if it's skipped we do a DB lookup later. it's a foot-gun - use this in any kind of a loop without the ID provided and you end up killing performance completely.")

'''NEW PLAN''': We'll just move the LinkTarget interface from core to Parsoid. Parsoid can then pass LinkTargets into all the ContentMetadataCollector/ParserOutput methods. We still need to lazily evaluate them to pageids in many cases, but that will be handled by ParserOutput and hidden from Parsoid.

Related Objects

Event Timeline

Note that LinkTarget knows nothing about page IDs. LinkTarget is a parsed full title string. Why exactly page IDs are involved is probably an optimization - Parser already resolved the page IDs for redlink rendering so we set them in ParserOutput to not waste good batch database result lookup.

I could easily imagine getLinks also returning array of LinkTarget objects, and callers resolving page IDs themselves. If it's done via LinkBatch, the IDs will be cached already.

Scott noted on https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/738019/ that it may be beneficial to include the variant link resolution mechanism currently implemented in the legacy parser's LinkHolderArray into this proposed new resolver.

@Krinkle noted in a review of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/763322 that, because PHP will transparently convert "numeric strings" to integers in certain cases (eg given function foo($b) { $a = []; $a[$b] = true; return array_keys($a)[0]; } then foo("0") == 0) it's worth being careful to permit int|string as the parameter type in any "convert a string to a link target" method, and to carefully cast to string where needed.

cscott updated the task description. (Show Details)
cscott renamed this task from Parsoid needs a LazyLinkTarget factory to Parsoid needs a core-compatible LinkTarget interface.Nov 2 2023, 5:22 PM
cscott updated the task description. (Show Details)

Change 971249 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Move LinkTarget from core to Parsoid

https://gerrit.wikimedia.org/r/971249

Change 971274 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] LinkTarget: extend LinkTarget interface from Parsoid and use LinkTargetTrait

https://gerrit.wikimedia.org/r/971274

Change 971507 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] LinkTarget: alias LinkTarget interface from Parsoid and use LinkTargetTrait

https://gerrit.wikimedia.org/r/971507

Change 971510 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Move LinkTarget from core to Parsoid, part 1/2

https://gerrit.wikimedia.org/r/971510

Change 971510 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Move LinkTarget from core to Parsoid, part 1/2

https://gerrit.wikimedia.org/r/971510

Change 971249 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Move LinkTarget from core to Parsoid, part 2/2

https://gerrit.wikimedia.org/r/971249

Change 972001 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a5

https://gerrit.wikimedia.org/r/972001

Change 972001 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a5

https://gerrit.wikimedia.org/r/972001

Change 971507 abandoned by C. Scott Ananian:

[mediawiki/core@master] LinkTarget: alias LinkTarget interface from Parsoid and use LinkTargetTrait

Reason:

In favor of I9adc349b230834827b002f412dcc697b44773abe

https://gerrit.wikimedia.org/r/971507

Change 971274 merged by jenkins-bot:

[mediawiki/core@master] LinkTarget: extend LinkTarget interface from Parsoid and use LinkTargetTrait

https://gerrit.wikimedia.org/r/971274

Change 981357 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] ParserOutput: Allow passing LinkTarget to title-related methods

https://gerrit.wikimedia.org/r/981357

Change 981364 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] [CMC] Add ::addLink, ::addImage, ::addLanguageLink methods

https://gerrit.wikimedia.org/r/981364

Change 981357 merged by jenkins-bot:

[mediawiki/core@master] ParserOutput: Allow passing LinkTarget to title-related methods

https://gerrit.wikimedia.org/r/981357

Change 981364 merged by jenkins-bot:

[mediawiki/services/parsoid@master] [CMC] Add ::addLink, ::addImage, ::addLanguageLink methods

https://gerrit.wikimedia.org/r/981364

Change 983902 had a related patch set uploaded (by Subramanya Sastry; author: Subramanya Sastry):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a9

https://gerrit.wikimedia.org/r/983902

Change 983902 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.19.0-a9

https://gerrit.wikimedia.org/r/983902

Change 987430 had a related patch set uploaded (by Thiemo Kreuz (WMDE); author: Thiemo Kreuz (WMDE)):

[mediawiki/extensions/Kartographer@master] Pass tracking category directly to ParserOutput::addCategory

https://gerrit.wikimedia.org/r/987430

Change 987430 merged by jenkins-bot:

[mediawiki/extensions/Kartographer@master] Pass tracking category directly to ParserOutput::addCategory

https://gerrit.wikimedia.org/r/987430

This is mostly done now. The one remaining piece is ParserOutput::addLink, which currently kills performance because it wants to know the page_id. We need to batch the lookup of these to be performance-equivalent to core. I've opened T357048: Parsoid needs to call ParserOutput::addLink with a link batch for this task, so it doesn't get lost here.