
Non-existent URLS are not removed from the externallinks table
Closed, Resolved · Public · BUG REPORT

Description

SELECT page_title, el_to_domain_index, el_to_path , count(*)
FROM externallinks
JOIN page ON page_id = el_from
WHERE page.page_title='Rustem_Umjerow' and page.page_namespace=0 and el_to_domain_index like '%'
GROUP BY page_title, el_to_domain_index, el_to_path

shows 120 links with el_to_domain_index https://.com.qha. and 1 with https://..com.qha.

dewiki: WP:FZW https://de.wikipedia.org/wiki/Wikipedia:Fragen_zur_Wikipedia#Link_auf_qha.com_im_Artikel_Rustem_Umjerow

What happens?:

The emptied version of the page still shows 121 external links to qha.com.

What should have happened instead?:

The URLs should be removed when they are no longer linked from the wikitext.

Event Timeline

Reedy renamed this task from "Non-existent URLS re not removed from the externallinks table" to "Non-existent URLS are not removed from the externallinks table". Nov 3 2023, 2:03 PM
Reedy updated the task description.
Umherirrender subscribed.

There is no unique index on the externallinks table, so duplicate rows are possible at the database level, but the delta logic of LinksUpdate should avoid creating duplicates. I would not expect many pages with duplicates, but there are some:

MariaDB [dewiki_p]> select count(*) from ( select 1 from externallinks group by el_from, el_to_domain_index, el_to_path having count(*) > 1 ) as t;
+----------+
| count(*) |
+----------+
|    37302 |
+----------+
1 row in set (5 min 10.756 sec)
MariaDB [dewiki_p]> select count(*) from externallinks;
+----------+
| count(*) |
+----------+
| 33489989 |
+----------+
1 row in set (7.205 sec)
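
To look at concrete cases, here is a variant of the count above that lists a sample of the duplicated rows (illustrative only; like the count, the full-table GROUP BY is slow on the replicas):

SELECT el_from, el_to_domain_index, el_to_path, count(*) AS copies
FROM externallinks
GROUP BY el_from, el_to_domain_index, el_to_path
HAVING count(*) > 1
ORDER BY copies DESC
LIMIT 20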

The domain index https://.com.qha. gets reversed back to https://.qha.com, which would be an invalid URL and is internally built as https://qha.com, so the code compares the wrong values and cannot remove the invalid row.
The only way to also handle invalid values from the database seems to be changing how the ExternalLinksTable class computes the delta: it must not modify the existing values before comparing, which means working with a 2-dimensional array and calling LinkFilter::makeIndexes earlier.
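
Illustrative only (not part of the patch): a rough query to spot such invalid rows, assuming the broken domain indexes all start with an empty host label right after the scheme, as in https://.com.qha. above:

SELECT el_from, el_to_domain_index, left(el_to_path, 60) AS path60
FROM externallinks
WHERE el_to_domain_index LIKE 'http://.%'
   OR el_to_domain_index LIKE 'https://.%'
LIMIT 50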

Change 972024 had a related patch set uploaded (by Umherirrender; author: Umherirrender):

[mediawiki/core@master] LinksUpdate: Compare raw domain and path for externallinks table

https://gerrit.wikimedia.org/r/972024

Usenet links could be test cases: [[:en:Linux]]

SELECT page_title, el_to_domain_index, left(el_to_path,60) as path60, count(*)
FROM externallinks
JOIN page ON page_id = el_from
WHERE  page.page_namespace=0 and length(el_to_path)<60
GROUP BY page_title, el_to_domain_index, left(el_to_path,60)
HAVING count(*)>1
ORDER BY count(*) DESC

In the article [[de:Ustym_Karmaljuk]] there is a non-existent URL http://. in the table.

After the article content was deleted, the URL http://. remained as the only entry in the table.

https://de.wikipedia.org/w/index.php?title=Ustym_Karmaljuk&diff=prev&oldid=239124367

It is not necessary to empty article pages; the deletion of external links is done via a delta between the database and the new parse result, and because of the existing bug some links remain even on an empty page. That is fixable in the software.
Once the fix is merged and deployed, the database can also be brought back in sync with the parse result by a null edit (an edit that is not visible in the history because no new revision is created; there is no need to change whitespace on the page to achieve this).

Please do not change the wiki pages just to prove that the software is buggy here.

Change 972024 merged by jenkins-bot:

[mediawiki/core@master] LinksUpdate: Compare raw domain and path for externallinks table

https://gerrit.wikimedia.org/r/972024

The patch fixed MediaWiki so that it deletes the invalid rows on the next edit, on a null edit, on an edit to a transcluded template, or on a forced links update via the API's action=purge.
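
For example, an illustrative purge request for the page from the task description (the forcelinkupdate parameter triggers the links update; depending on the MediaWiki version the request may have to be sent as a POST):

https://de.wikipedia.org/w/api.php?action=purge&titles=Rustem_Umjerow&forcelinkupdate=1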

It does not deduplicate valid rows. However, when an invalid row exists with duplicates (as in the task description), all of them get deleted.
Hopefully no new duplicates will be created, but that was not tested or explicitly enforced.
Closing, as the duplicates are not the main reason for this task; some of them are invalid rows and are gone after the fix.

See the roadmap for the next deployment (should be 1.42.0-wmf.7) at https://www.mediawiki.org/wiki/MediaWiki_1.42/Roadmap and verify with https://versions.toolforge.org/