Implement a SiteLookup based on a nested array structure.
Open, High, Public

Description

As per T113034 (RFC: Overhaul Interwiki map, unify with Sites and WikiMap), we want to move towards maintaining meta-information about other sites (aka interwiki info) in files, using the structure outlined in P3044. This structure would be stored in JSON or PHP files and represented internally as nested arrays. The SiteLookup should:

  • provide access to the Site objects represented by the nested arrays (see T135147: Make the domain model implemented by Site/SiteLookup/SiteStore more flexible)
  • be able to load such data from JSON or PHP files (this could live in a separate class, if we want)
  • combine multiple such data structures (deep-merge)
  • build indexes for efficient access by id or group. If the provided data already includes such an index, it should be used.

NOTE: FileBasedSiteLookup exists. Perhaps it can be adopted. We would probably need compatibility code so it can continue to support the current/legacy JSON structure (see docs/sitescache.txt). However, the SiteLookup should perhaps not know about files at all, only about nested arrays.
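
The deep-merge requirement could look roughly like the following. This is only an illustrative Python sketch, not the MediaWiki implementation: the names deep_merge and merge_all are hypothetical, and the rule that scalars and lists from a later source replace earlier values is an assumption that the task does not specify.

```python
from copy import deepcopy


def deep_merge(base, override):
    """Return a new dict with `override` merged into `base`, recursing into nested dicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Assumption: scalars and lists from the later source replace earlier ones.
            merged[key] = deepcopy(value)
    return merged


def merge_all(*sources):
    """Deep-merge any number of site definitions; later sources take precedence."""
    result = {}
    for source in sources:
        result = deep_merge(result, source)
    return result


# Two hypothetical sources describing the same site.
core = {"enwiki": {"type": "mediawiki",
                   "paths": {"article": "//en.wikipedia.org/wiki/$1"}}}
extra = {"enwiki": {"paths": {"api": "https://en.wikipedia.org/w/api.php"},
                    "props": {"language": "en"}}}

sites = merge_all(core, extra)
```

Here sites["enwiki"]["paths"] ends up with both the article and the api entry, while core and extra are left unmodified. Whether lists such as multi-valued ids should be replaced or concatenated on merge is an open design question.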

The proposed structure, for reference:

{
    "enwiki": {
        "type": "mediawiki",
        "ids": {
            "global": [ "enwiki", "some-old-alias" ],
            "interwiki": "en",
            "domain": "en.wikipedia.org"
        },
        "groups": {
            "language": "en",
            "size": "big",
            "family": "wikipedia",
            "db-cluster": "s1"
        },
        "paths": {
            "article": "//en.wikipedia.org/wiki/$1",
            "api": "https://en.wikipedia.org/w/api.php"
        },
        "props": {
            "database": "enwiki",
            "language": "en"
        }
    },
    "enwiktionary": {
        "type": "mediawiki",
        "ids": {
            "global": "enwiktionary",
            "interwiki": [ "wiktionary", "wikt" ],
            "domain": "en.wiktionary.org"
        },
        "groups": {
            "language": "en",
            "family": "wiktionary",
            "db-cluster": "s2"
        },
        "paths": {
            "article": "//en.wiktionary.org/wiki/$1",
            "api": "https://en.wiktionary.org/w/api.php"
        },
        "props": {
            "database": "enwiktionary",
            "language": "en",
            "capital-links": false
        }
    },
    "commonswiki": {
        "type": "mediawiki",
        "ids": {
            "global": [ "commonswiki", "commons" ],
            "interwiki": [ "commons", "c" ],
            "domain": "commons.wikimedia.org"
        },
        "groups": {
            "language": "en",
            "family": "commons",
            "db-cluster": "s2"
        },
        "paths": {
            "article": "//commons.wikimedia.org/wiki/$1",
            "api": "https://commons.wikimedia.org/w/api.php"
        },
        "props": {
            "database": "commonswiki",
            "multilingual": true,
            "transcludable": true
        }
    },
    "bb": {
        "type": "unknown",
        "ids": {
            "global": [ "bb", "boingboing" ],
            "interwiki": "bb",
            "domain": "boingboing.net"
        },
        "groups": {
            "language": "en"
        },
        "paths": {
            "article": "https://boingboing.net/$1.html"
        },
        "props": {
            "language": "en"
        }
    },
    "_by_ids": {
        "global": {
            "some-old-alias": "enwiki",
            "enwiki": "enwiki",
            "enwiktionary": "enwiktionary",
            "commonswiki": "commonswiki",
            "commons": "commonswiki",
            "bb": "bb",
            "boingboing": "bb"
        },
        "interwiki": {
            "en": "enwiki",
            "wiktionary": "enwiktionary",
            "wikt": "enwiktionary",
            "c": "commonswiki",
            "commons": "commonswiki",
            "bb": "bb"
        },
        "domain": {
            "en.wikipedia.org": "enwiki",
            "en.wiktionary.org": "enwiktionary",
            "commons.wikimedia.org": "commonswiki",
            "boingboing.net": "bb"
        }
    },
    "_by_groups": {
        "language": {
            "en": [ "enwiki", "enwiktionary", "bb" ],
            "mul": [ "commonswiki" ]
        },
        "family": {
            "wikipedia": [ "enwiki" ],
            "wiktionary": [ "enwiktionary" ],
            "commons": [ "commonswiki" ]
        },
        "db-cluster": {
            "s1": [ "enwiki" ],
            "s2": [ "enwiktionary", "commonswiki" ]
        }
    }
}
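
As a rough illustration of how the _by_ids and _by_groups indexes could be derived from the per-site entries, here is a minimal Python sketch. It is not the actual SiteLookup code: the function names are hypothetical, and it assumes that top-level keys starting with "_" are reserved for indexes and that an id may be either a single string or a list of strings, as in the sample.

```python
def as_list(value):
    """Ids may be a single string or a list of strings; normalize to a list."""
    return value if isinstance(value, list) else [value]


def build_indexes(data):
    """Return (by_ids, by_groups), reusing any index already present in `data`."""
    # Assumption: keys starting with "_" hold indexes, everything else is a site.
    sites = {key: site for key, site in data.items() if not key.startswith("_")}

    by_ids = data.get("_by_ids")
    if by_ids is None:
        by_ids = {}
        for key, site in sites.items():
            for id_type, ids in site.get("ids", {}).items():
                for site_id in as_list(ids):
                    by_ids.setdefault(id_type, {})[site_id] = key

    by_groups = data.get("_by_groups")
    if by_groups is None:
        by_groups = {}
        for key, site in sites.items():
            for group_type, group in site.get("groups", {}).items():
                by_groups.setdefault(group_type, {}).setdefault(group, []).append(key)

    return by_ids, by_groups


def get_site(data, by_ids, id_type, site_id):
    """Look up a site's nested array by one of its ids, or return None."""
    key = by_ids.get(id_type, {}).get(site_id)
    return data.get(key) if key is not None else None


# A trimmed-down version of the sample, without precomputed indexes.
data = {
    "enwiki": {
        "ids": {"global": ["enwiki", "some-old-alias"], "interwiki": "en"},
        "groups": {"family": "wikipedia", "db-cluster": "s1"},
    },
    "enwiktionary": {
        "ids": {"global": "enwiktionary", "interwiki": ["wiktionary", "wikt"]},
        "groups": {"family": "wiktionary", "db-cluster": "s2"},
    },
}

by_ids, by_groups = build_indexes(data)
site = get_site(data, by_ids, "global", "some-old-alias")
```

An index derived this way follows each site's groups entry. For the full sample above it would therefore list commonswiki under language "en" rather than "mul", and it would also contain a "size" group, so the sample's hand-written index and its site entries are not fully consistent with each other.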

Event Timeline

We should probably have some way to indicate whether a MediaWiki site is local or not.

Or maybe that's just based on the presence of groups db-cluster or props database?
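
If the second option were chosen, the check could be as small as the following Python sketch. The function name is_local is hypothetical, and treating either marker as sufficient is an assumption; nothing in the task settles this.

```python
def is_local(site):
    """Heuristic: a site is local if it names a database or belongs to a db-cluster."""
    return ("database" in site.get("props", {})
            or "db-cluster" in site.get("groups", {}))


# Trimmed-down entries from the proposed structure.
enwiki = {"groups": {"db-cluster": "s1"}, "props": {"database": "enwiki"}}
bb = {"groups": {"language": "en"}, "props": {"language": "en"}}

local_flags = {"enwiki": is_local(enwiki), "bb": is_local(bb)}
```

With these two entries, enwiki counts as local and bb does not.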

Why is language in both groups and props?

Also, with the current format this would have to be per-wiki: en: is the interwiki prefix for enwiki on Wikipedias, but on Wikibooks (for example) it should point to enwikibooks.
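
To make the problem concrete, one conceivable shape is an interwiki index keyed by the family of the wiki doing the lookup. This Python sketch is purely illustrative: the layout and the names interwiki_by_family and resolve_interwiki are hypothetical and not part of the proposed structure.

```python
# Hypothetical layout: one interwiki index per local wiki family.
interwiki_by_family = {
    "wikipedia": {"en": "enwiki"},
    "wikibooks": {"en": "enwikibooks"},
}


def resolve_interwiki(local_family, prefix):
    """Return the global id a prefix points to, as seen from a wiki of `local_family`."""
    return interwiki_by_family.get(local_family, {}).get(prefix)


from_wikipedia = resolve_interwiki("wikipedia", "en")
from_wikibooks = resolve_interwiki("wikibooks", "en")
```

The same prefix resolves to enwiki from a Wikipedia and to enwikibooks from a Wikibooks wiki, which a single global "_by_ids" interwiki map cannot express.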

This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly more realistic overview of plans. Please feel free to assign this task to yourself again if you still realistically work or plan to work on this task - it would be welcome!

For tips on how to manage individual work in Phabricator (noisy notifications, lists of tasks, etc.), see https://phabricator.wikimedia.org/T228575#6237124 for available options.
(For the record, two emails were sent to assignee addresses before resetting assignees. See T228575 for more info and for potential feedback. Thanks!)