Page MenuHomePhabricator

Image links from #ifexist:Media:... are not being registered properly on tawiktionary
Closed, ResolvedPublic

Description

From this list post: https://lists.wikimedia.org/pipermail/wikitech-l/2020-February/093098.html

If you do something like {{#ifexist:Media:Foo.ogg}}, this should register a link both in the local imagelinks table, and the commons globalimagelinks table. This is used for cache clearing. Currently the link only seems to be added if the file exists.

Consider: https://ta.wiktionary.org/w/api.php?action=parse&text={{%23ifexist:Media:DoesNotExist.png}}{{%23ifexist:Media:Example.png}}&formatversion=2&prop=images|links

Expected behaviour: Both DoesNotExist.png and Example.png are registered as image links
Actual Behaviour: Only Example.png is registered.

I think the logic error is due to the early return:

$file = wfFindFile( $title );
if ( !$file ) { 
        return $else;
}
$parser->mOutput->addImage(
        $file->getName(), $file->getTimestamp(), $file->getSha1() );
return $file->exists() ? $then : $else;

This will exit before registering the dependency if the file does not exist as wfFindFile will return false.

Getting started steps

Event Timeline

Anyways, a suggested fix might be, in the !$file case, to add $parser->mOutput->addImage( $title->getDBKey() )

Hi , @Bawolff I have installed the extension.

localhost_8080_wiki_Main_Page (3).png (638×1 px, 88 KB)

What to do next ?

Hi , @Bawolff I have installed the extension.

localhost_8080_wiki_Main_Page (3).png (638×1 px, 88 KB)

What to do next ?

Actually I want to know how to produce this error and I don't know how to create a page.

How to produce is covered in the description ("If you do something like {{#ifexist:Media:Foo.ogg}}"); for editing see https://www.mediawiki.org/wiki/Help:Editing

Change 702740 had a related patch set uploaded (by TChin; author: TChin):

[mediawiki/extensions/ParserFunctions@master] Record ifexists in imagelinks table

https://gerrit.wikimedia.org/r/702740

Change 702740 merged by jenkins-bot:

[mediawiki/extensions/ParserFunctions@master] Record #ifexist media in imagelinks table

https://gerrit.wikimedia.org/r/702740

HI!

But it must be necessary (maybe by a property in imagelinks table) to find out, that the match of an imagelink is caused by an #ifexist clause. Otherwise I get by an SQL query also all the images listed that are not really missing but missing by #ifexist

Please undo this change and let's look for a proper solution of this problem.

Thank you

The fix provided here is correct according to a spec of image links table - it contains all links to all files that affected ParserOutput for the page. Its used for dependency tracking and cache invalidation. If you have #ifexist on your page, and you created or deleted the media, you want the page to be reparsed to reflect the changes. If we revert the patch - this will become broken again.

@doctaxon could you tell a bit more about your use-case to help find out the solution to your problem?

@Pchelolo Thank you for your reply.

There's a bot running listing all pages having an image link to a not existing file. With this patch there are also listed pages with not existing #ifexist images. But the image link that is filtered by #ifexist must not be returned, because it's even filtered. If it's necessary that these #ifexist image links have to be parsed, then let's add a property to the imagelinks table, that this image link is coming from an #ifexist clause to have an opportunity to exempt these image links from the bot list.

I know that @Ladsgroup is thinking about ideas on how to improve the *links tables, so I would double check with him before adding anything to imagelinks which is already in a very delicate position from a database point of view.

@Marostegui "delicate position"? Let me know about problems, if there're any ...

In general the *links tables are reaching quite considerable on-disk sizes, example on enwiki:

-rw-rw---- 1 mysql mysql  46G Aug  3 05:59 categorylinks.ibd
-rw-rw---- 1 mysql mysql  66G Aug  3 05:59 externallinks.ibd
-rw-rw---- 1 mysql mysql  14G Aug  3 05:59 imagelinks.ibd
-rw-rw---- 1 mysql mysql 160G Aug  3 05:59 pagelinks.ibd
-rw-rw---- 1 mysql mysql 104G Aug  3 05:59 templatelinks.ibd

We are trying to see how to work out this out and improve those tables to try to make them more scalable, that's why I would suggest to talk to @Ladsgroup before adding more things to them, to avoid having to work twice or having to re-do stuff in case things are changed.

imagelinks table (along rest of *links table) are basically a ticking bomb. There are plans to to improve them (T222224) but we haven't got to get it done (attending to other time bombs) but even if links tables were not exploding. I would be very careful in adding a column meaning a lot of data for this non-critical usecase. There are a couple of ways I can think of for this to move forward. I don't know the details of this ticket so I'm sorry if I'm misunderstanding your usecase.

  • Add a new table: Similar to what has happened to watchlist and wtachlist_expiry. For example, if you have a ifexist for an image, it would add a new row to that table to keep track.
  • Make a join with page for local images (for checking existence, e.g. using a left join) and for global, you can join with commons page table (maybe? it can also get bad really quickly)

I should spend some time to properly study this usecase and see what we can do. Changing schema should be done with lots of thought as it's hard to fix later.

Umherirrender subscribed.

@Pchelolo Thank you for your reply.

There's a bot running listing all pages having an image link to a not existing file. With this patch there are also listed pages with not existing #ifexist images. But the image link that is filtered by #ifexist must not be returned, because it's even filtered. If it's necessary that these #ifexist image links have to be parsed, then let's add a property to the imagelinks table, that this image link is coming from an #ifexist clause to have an opportunity to exempt these image links from the bot list.

That is handled on T14019 along with the other link tables.