Page Previews/API Specification
Note: For documentation on the completed API, see Page Content Service and the live API spec (https://en.wikipedia.org/api/rest_v1/#/).
Revision as of 00:10, 17 March 2018
This article outlines the specification for a new Node.js based API to generate summaries for MediaWiki based wikis that replaces the existing TextExtracts API.
Background & Motivation
Up until now, we've mostly gotten away with using the prop=extracts MediaWiki API provided by TextExtracts and RESTBase to allow us to scale out Page Previews to a couple of large Wikipedias without issue.
However, the requirement that certain classes of pages should be handled differently means that TextExtracts is no longer the most appropriate place to house the notion of what a page preview is. We should aim to keep TextExtracts as simple and as general as possible. It may be that we compose the prop=extracts API and the new Page Preview API rather than integrating them, but this is not a goal of this work.
To be clear, the primary goal of this work is to minimise the amount of text/HTML processing in the Page Previews client: the less work the client has to do to display a preview, the better.
The specification
Intros
The API returns well-formed HTML representing the introductory elements of a page, which are defined as follows:
- The first paragraph from the introductory section.
- The first ordered, unordered, or definition list that is the next sibling of the first paragraph.
Herein we'll refer to these elements as an "intro".
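The definition above can be sketched as a small selection function. This is only an illustration: the `LeadNode` shape and the `selectIntro` name are hypothetical, and the lead section is assumed to have already been parsed into a flat list of sibling elements.

```typescript
// Hypothetical, simplified model of one child element of the lead section.
interface LeadNode {
  tag: string;  // lower-case tag name, e.g. "p", "ul", "table"
  html: string; // outer HTML of the element
}

const LIST_TAGS: readonly string[] = ["ol", "ul", "dl"];

// Returns the elements that make up the intro: the first paragraph and,
// if its next sibling is an ordered, unordered, or definition list, that list.
function selectIntro(leadSection: LeadNode[]): LeadNode[] {
  for (let i = 0; i < leadSection.length; i++) {
    const node = leadSection[i];
    if (node === undefined || node.tag !== "p") {
      continue; // skip anything before the first paragraph
    }
    const next = leadSection[i + 1];
    if (next !== undefined && LIST_TAGS.indexOf(next.tag) !== -1) {
      return [node, next];
    }
    return [node];
  }
  return []; // no paragraph found, so there is no intro
}
```

A list that is separated from the first paragraph by any other element is not part of the intro under this reading.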
Plaintext intros
Certain clients will not be able to handle HTML intros yet, e.g. the Wikipedia apps. To maintain compatibility with these clients, the API will also return a plaintext representation of the introductory elements of a page.
https://gerrit.wikimedia.org/r/370694
Empty intros
After the HTML intro has been processed (see below), it may not contain text content but still contain HTML, e.g. <p><b></b></p>. Any processed intro that doesn't contain text content must be considered empty.
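One way to test for an empty intro is to discard the markup and check whether anything other than whitespace remains. The sketch below assumes a regex-based tag strip is good enough for already-processed, well-formed HTML; the `isEmptyIntro` name is hypothetical.

```typescript
// Returns true when the processed intro has no text content,
// even if it still contains markup such as <p><b></b></p>.
function isEmptyIntro(html: string): boolean {
  const text = html.replace(/<[^>]*>/g, ""); // drop tags, keep the text between them
  return text.trim().length === 0;
}
```

A real implementation would more likely inspect the text content of a parsed DOM, which also copes with comments and character references.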
Markup allowed in an intro
By default, the Page Preview API (herein "the API") must remove any tag that doesn't fall into one of the following cases.
Emphasis
The API must retain any bolded or italicised text in the intro, i.e. the Page Preview API must not remove b, i, and em tags.
Formulae/MathML
In order to support browsers that don't support MathML, the API:
- Must remove math tags; and
- Must not remove either the inline or block layout fallback images generated by Math while parsing the page.
Super- and subscript
The API must retain all sup and sub tags except those generated by Cite, i.e. <sup class="reference"> elements.
Stripping of parenthetical statements
The API must remove all content enclosed within balanced parentheses. Parentheses will be defined as the following characters: () and ( )
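A minimal sketch of balanced-parenthesis stripping follows. It handles only the ASCII characters ( and ) and works on plain text; the `stripParentheticals` name, the decision to leave unbalanced parentheses untouched, and the decision not to collapse leftover whitespace are all assumptions rather than part of the specification.

```typescript
// Removes every balanced "(...)" group, including nested groups.
// Unmatched "(" or ")" characters are kept as they are.
function stripParentheticals(text: string): string {
  let out = "";
  const openPositions: number[] = []; // offsets in `out` of still-unmatched "("
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i);
    if (ch === "(") {
      openPositions.push(out.length);
      out += ch;
    } else if (ch === ")" && openPositions.length > 0) {
      // Close the innermost open group: cut everything back to its "(".
      const start = openPositions.pop();
      out = out.slice(0, start === undefined ? 0 : start);
    } else {
      out += ch;
    }
  }
  return out;
}
```

For example, `stripParentheticals("Foo (bar (baz)) qux")` yields `"Foo  qux"` with two spaces, so a caller may want to normalise whitespace afterwards.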
Flattening inline elements
The API must replace all span and a tags with their text content, e.g. <span>Foo</span> should be flattened to Foo and <a href="/foo">Foo</a> should be flattened to Foo.
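The flattening rule above can be approximated by deleting the opening and closing span and a tags while leaving their contents in place. This regex-based sketch assumes well-formed input with no ">" inside attribute values; `flattenInline` is a hypothetical name.

```typescript
// Replaces span and a elements with their contents by removing only the tags.
// Other elements whose names merely start with "a" (e.g. abbr) are untouched.
function flattenInline(html: string): string {
  return html.replace(/<\/?(?:span|a)(?:\s[^>]*)?>/gi, "");
}
```

Because only the tags are removed, emphasis nested inside a link, such as `<a href="/foo"><b>Foo</b></a>`, survives as `<b>Foo</b>`.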
noexcerpts
The API must remove any element with the noexcerpts class to replicate the current behaviour of TextExtracts.
Line breaks
It is assumed that any line breaks in the summary are necessary for the display of the content. We thus do not remove any instance of a line break that appears in the lead paragraph of a summary.
Request
Parameters
Name | Type | Description |
---|---|---|
title | String | The title of the page to get the intro for. |
Responses
A successful response from the Page Preview API, like those of all existing endpoints, must have the following properties:
Name | Type | Description |
---|---|---|
titles | Titles | The various titles of the page. |
lang | String | The two- or three-character ISO 639-1 or ISO 639-3 code of the language of the intro. This should be the site content language or the page content language. |
dir | Enum | The direction of the script used to render the language of the intro. One of "ltr" or "rtl". |
last_modified | String | The time at which the page was last modified in ISO 8601 format. |
thumbnail | ?Image | The thumbnail of the image associated with the page. The thumbnail's largest side must not exceed 320px. By default, this property should not be present. |
original | ?Image | The original of the image associated with the page as determined by PageImages. By default, this property should not be present. |
wikidata_label | ?String | The label of the Wikidata item. By default, this property should not be present. |
wikidata_description | ?String | The description of the Wikidata item. |
The new summary endpoint will hydrate these properties with the additional fields specific to summaries:
Name | Type | Description |
---|---|---|
type | Enum | The notional type of the intro. One of "disambiguation", "wikidata", or "standard". |
intro | String | The intro of the page represented as well-formed HTML5. |
plaintext_intro | String | The intro of the page represented as plaintext. This property supersedes the extract property of the current RESTBase Page Summary endpoint. |
disambiguation_links | ?Titles[] | The titles of the first N links from the disambiguation page. By default, this property should not be present. |
Done
A property of type Image must have the following properties:
Name | Type | Description |
---|---|---|
source | String | The URL of the image. |
width | Integer | The width of the image in px. |
height | Integer | The height of the image in px. |
A property of type Titles must have the following properties:
Name | Type | Description |
---|---|---|
denormalized | String | The title of the page, e.g. File:Igorrr_(band). |
normalized | String | The normalized title of the page, e.g. Igorrr (band).jpg |
display | String | The editor-formatted title of the page (see https://www.mediawiki.org/wiki/Help:Magic_words#Displaytitle), e.g. <strong>Igorrr (band).jpg</strong>. |
namespace_id | Integer | The ID of the namespace that the page is in on the wiki. |
namespace_name | String | The localized name of the namespace, e.g. User, Usuario, etc. |
page_id | Integer | The internal ID of the page. |
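The tables above translate fairly directly into type declarations. The sketch below is illustrative only: the interface names `PreviewImage`, `PreviewTitles`, and `PageSummary` are hypothetical (chosen to avoid clashing with built-in names), properties marked with ? in the tables are modelled as optional, and the sample values are placeholders.

```typescript
// Corresponds to the Image type in the tables above.
interface PreviewImage {
  source: string; // URL of the image
  width: number;  // px
  height: number; // px
}

// Corresponds to the Titles type in the tables above.
interface PreviewTitles {
  denormalized: string;
  normalized: string;
  display: string;
  namespace_id: number;
  namespace_name: string;
  page_id: number;
}

// Common properties plus the summary-specific ones.
interface PageSummary {
  titles: PreviewTitles;
  lang: string;
  dir: "ltr" | "rtl";
  last_modified: string; // ISO 8601
  thumbnail?: PreviewImage;
  original?: PreviewImage;
  wikidata_label?: string;
  wikidata_description?: string;
  // Values as listed in the table; the per-case sections of this
  // specification also mention other values.
  type: "disambiguation" | "wikidata" | "standard";
  intro: string;
  plaintext_intro: string;
  disambiguation_links?: PreviewTitles[];
}

// Placeholder response for an ordinary page without image or Wikidata data.
const exampleSummary: PageSummary = {
  titles: {
    denormalized: "Example_page",
    normalized: "Example page",
    display: "Example page",
    namespace_id: 0,
    namespace_name: "",
    page_id: 1,
  },
  lang: "en",
  dir: "ltr",
  last_modified: "2018-03-17T00:10:00Z",
  type: "standard",
  intro: "<p><b>Example page</b> is a placeholder.</p>",
  plaintext_intro: "Example page is a placeholder.",
};
```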
For a page in the wiki's content namespace(s)
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "standard".
If the page has a corresponding Wikidata item, then the wikidata_description property must be set to the item's description.
For a page outside of the wiki's content namespaces
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "no-extract".
The extract property must be set to "".
The extract_html property must be set to "".
For a page that doesn't use the wikitext, wikibase-item, or wikibase-property content model
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "no-extract".
The extract property must be set to "".
The extract_html property must be set to "".
For a disambiguation page
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "disambiguation".
The disambiguation_links property of the response must be set to the first N links from the disambiguation page.
The intro property of the response should be set to the intro of the page so that the client may display it if appropriate.
Blocked
For a page that doesn't exist
The Page Preview API must respond with 404 Not Found.
The response body must be empty.
For a page that doesn't have a lead section
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "standard".
The intro property of the response must be set to "".
Examples
For a page that has an empty intro
The response must be the same as the "For a page that doesn't have a lead section" case.
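The cases listed so far can be summarised as a small decision function. This is a sketch under assumptions: the `PageState` and `Outcome` shapes and the `decideOutcome` name are hypothetical, the order in which the checks are applied is not stated by the specification, and the redirect and Wikidata cases are left out.

```typescript
// Hypothetical description of what the service knows about the requested page.
interface PageState {
  exists: boolean;
  inContentNamespace: boolean;
  contentModel: string;
  isDisambiguation: boolean;
}

// HTTP status and, for 200 responses, the value of the type property.
interface Outcome {
  httpStatus: number;
  type?: string;
}

const SUPPORTED_MODELS: readonly string[] = [
  "wikitext",
  "wikibase-item",
  "wikibase-property",
];

function decideOutcome(page: PageState): Outcome {
  if (!page.exists) {
    return { httpStatus: 404 }; // empty response body
  }
  if (
    !page.inContentNamespace ||
    SUPPORTED_MODELS.indexOf(page.contentModel) === -1
  ) {
    return { httpStatus: 200, type: "no-extract" };
  }
  if (page.isDisambiguation) {
    return { httpStatus: 200, type: "disambiguation" };
  }
  // Covers ordinary pages as well as pages with no lead section or an
  // empty intro; in the latter two cases intro is set to "".
  return { httpStatus: 200, type: "standard" };
}
```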
For a page that redirects to another page
The Page Preview API must respond with 302 Found.
The Location HTTP header must be set to the URL that will get the intro for the target page.
Note: RESTBase handles redirects transparently to the underlying service (see T176517#3634838).
The Page Preview API must respond with 200 OK.
The type property of the response must be set to "no-extract".
The extract property must be set to "".
The extract_html property must be set to "".
Responses for Wikidata (from T111231: Page previews for Wikidata)
For a Wikidata item
This overrides the "For a page in the wiki's content namespace" case above.
The type property of the response must be set to "wikidata_preview".
The wikidata_label property of the response must be set to the item's label.
If the item has the image property set (to I):
- The image property of the response must be set to the Image object that represents the Wikimedia Commons file referenced by I.
- The thumbnail property of the response must be set to the Image object that represents the corresponding thumbnail.
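The response properties table requires that the thumbnail's largest side not exceed 320px. A possible way to derive the thumbnail dimensions from the original image is sketched below; preserving the aspect ratio and never upscaling are assumptions, and `thumbnailSize` is a hypothetical name.

```typescript
const MAX_THUMBNAIL_SIDE = 320; // px, from the response properties table

// Scales the dimensions down so that the largest side is at most 320px.
// Images that already fit are returned unchanged.
function thumbnailSize(
  width: number,
  height: number
): { width: number; height: number } {
  const largest = Math.max(width, height);
  if (largest <= MAX_THUMBNAIL_SIDE) {
    return { width: width, height: height };
  }
  const scale = MAX_THUMBNAIL_SIDE / largest;
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
  };
}
```

With these assumptions, a 1280 by 960 original gives a 320 by 240 thumbnail.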
Notes
The item's description should be in the user's language. If the description isn't available in the user's language, then the API must follow the language fallback chain until one is available.
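The fallback rule in the note above might look like the following sketch. The `pickDescription` name, the shape of the `descriptions` map, and the idea that the fallback chain is supplied by the caller as an ordered list of language codes are assumptions; the returned "" matches the value this specification prescribes for a Wikidata item with no description.

```typescript
// Returns the description in the user's language, or in the first language
// of the fallback chain that has one, or "" if none is available.
function pickDescription(
  descriptions: { [lang: string]: string | undefined },
  userLang: string,
  fallbackChain: string[]
): string {
  const candidates = [userLang].concat(fallbackChain);
  for (let i = 0; i < candidates.length; i++) {
    const lang = candidates[i];
    if (lang === undefined) {
      continue;
    }
    const description = descriptions[lang];
    if (description !== undefined && description !== "") {
      return description;
    }
  }
  return "";
}
```

For example, with descriptions available in "de" and "en", a user language of "gsw", and a hypothetical chain of ["de", "en"], the German description is returned.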
For a Wikidata item with no description
The response should be the same as the "For a Wikidata item" case apart from the following:
The wikidata_description property of the response must be set to "".