User:Envlh/Denelezh/WHO
This page is a draft of a project grant.
Project idea
What is the problem you're trying to solve?
Wikimedia projects hold several million biographies, which form an incredible dataset about whom the projects find important. However, without further analysis and tools, it is difficult to answer some prominent questions about those biographies, such as:
- how diverse are they?
- which diversity gaps would be easy to fill?
- how are these statistics changing over time?
Two actively used tools from the proposers of this grant, WHGI (Wikidata Human Gender Indicators, a previous WMF grantee) and Denelezh, provide some of this data to evaluate aspects of the diversity in biographies. The data that they create has been used to make lists for Wikiprojects, and has served as a barometer of Wikimedia's gender gap when discussed by the BBC and Bloomberg[1]. The Wikimedia Grant Team's Annual Report has said of them:
"Grants for research and tools (such as WHGI) - which minimally contribute to the targets of people or articles - have been extremely valuable in improving our understanding of the gender gap and how or why it manifests."
Yet these tools each have their own limitations:
- they are mostly dedicated to the gender gap, while other biases could be explored,
- they only provide some statistics and don't directly help editors to close gaps,
- the tools are only available in English (despite presenting data about many languages),
- they have technical limitations that make them less extensible for new queries,
- some of their features overlap, and maintaining two tools means twice the maintenance work.
We want to take this community data effort to the next level.
What is your solution to this problem?
A combined tool with feature improvements
Our solution is to merge WHGI and Denelezh to build a new tool, called WHO (Wikimedia Human Observatory, name is subject to change), relying on the experience gained with the two previous ones. The new tool will have a new architecture allowing:
- backlog features that the community has already asked for:
- a new "comparison view" making lists helping editors to find subjects they can improve
- e.g. women with articles in French but not English,
- e.g. people who have books written about them but no Wikipedia biography
- a new updated "evolutions view" that will be able to show the history of the data
- e.g. the representation of women's biographies on Malayalam Wikipedia has increased by 2.5% between 2015 and now.
- a new focus on other statistics of humans that aren't gender-related
- e.g. geographical or occupational biases
- a more dynamic, data exploration front-end website
- an API to provide data to 3rd-party tools
- We have already discussed providing data to the proposed Gender Campaigns Tool.
- new features elicited from the community during the grant.
Architectural Changes
One important change to facilitate our new features would be the architecture of the tool. WHGI and Denelezh rely on weekly Wikidata JSON dumps, directly generating reports from them. The idea is to introduce an intermediate step to store raw data about humans in a database. Reports will then be generated using this data. This presents several advantages:
- A way to generate the "evolution" view of data.
- A generalized schema that many applications can read from (e.g. our own front-end and the Gender Campaigns Tool).
- A generalized schema that many applications can contribute to (e.g. Wikidata & other Linked Data sources).
- Space savings over storing Wikidata JSON dumps.
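As an illustration of the intermediate step described above, here is a minimal sketch of what such a database could look like, using SQLite from the Python standard library. The table names, column names, and the `female_share_by_snapshot` function are assumptions made for this example, not the schema the project will adopt; the rows are made-up demo data, not real Wikidata items. Storing one row per human per dump date is what makes an "evolution" report a simple query.

```python
import sqlite3

# Hypothetical schema: one row per (snapshot, human) and one per sitelink.
SCHEMA = """
CREATE TABLE human (
    snapshot TEXT NOT NULL,   -- date of the Wikidata dump
    qid      TEXT NOT NULL,   -- Wikidata item ID
    gender   TEXT,            -- item ID of the gender value, may be NULL
    PRIMARY KEY (snapshot, qid)
);
CREATE TABLE sitelink (
    snapshot TEXT NOT NULL,
    qid      TEXT NOT NULL,
    project  TEXT NOT NULL,   -- e.g. 'enwiki', 'frwiki'
    PRIMARY KEY (snapshot, qid, project)
);
"""


def female_share_by_snapshot(conn, project, female_qid="Q6581072"):
    """Return [(snapshot, share)] of biographies in `project` with gender `female_qid`."""
    return conn.execute(
        """
        SELECT s.snapshot,
               AVG(CASE WHEN h.gender = ? THEN 1.0 ELSE 0.0 END)
        FROM sitelink AS s
        JOIN human AS h ON h.snapshot = s.snapshot AND h.qid = s.qid
        WHERE s.project = ?
        GROUP BY s.snapshot
        ORDER BY s.snapshot
        """,
        (female_qid, project),
    ).fetchall()


# Made-up demo rows for two snapshots (Q6581072 = female, Q6581097 = male).
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
humans = [
    ("2019-01-07", "demo-1", "Q6581072"),
    ("2019-01-07", "demo-2", "Q6581097"),
    ("2020-01-06", "demo-1", "Q6581072"),
    ("2020-01-06", "demo-2", "Q6581097"),
    ("2020-01-06", "demo-3", "Q6581072"),
    ("2020-01-06", "demo-4", "Q6581072"),
]
conn.executemany("INSERT INTO human VALUES (?, ?, ?)", humans)
conn.executemany(
    "INSERT INTO sitelink VALUES (?, ?, 'frwiki')",
    [(snapshot, qid) for snapshot, qid, _ in humans],
)

for snapshot, share in female_share_by_snapshot(conn, "frwiki"):
    print(snapshot, share)
```

A real schema would need more dimensions (year of birth, country of citizenship, occupation, external IDs), but the same pattern of grouping by snapshot would apply.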
Project goals
- 1) Engineer reliable, polished diversity data collection for the future.
- Five years after they came on the scene, Wikiprojects and communities still rely on WHGI and Denelezh for diversity measurement, but their ongoing maintenance is hard. We can alleviate this problem by merging two overlapping single-maintainer tools into one comprehensive multi-maintainer tool. This will benefit Wikimedia's Wikiprojects and communities by solidifying diversity data collection for years to come. In addition, the result will be a more polished, more capable data portal for diversity data that will raise the public profile of this corner of feminist data enthusiasts.
- 2) Create more detailed, actionable statistical outputs for editors.
- While interesting, and enabling the first views into Wikipedia's biography composition as a whole, not all the data produced by WHGI and Denelezh is actionable. Communities have long pointed out that, beyond providing these statistics, the tools aren't delivering on their potential to help users focus their editing.
- Enable comparison features that can answer questions like:
- "Which women are represented in French Wikipedia but not English Wikipedia?"
- "Which occupations have seen the fewest biographies created in the last 2 years?"
- "Which biographies are being created in my historical interest areas, so that I can find like-minded editors?"
- "Have there been any Wikipedias that have seen systematic anti-diversity editing?"
- "Along which non-gender dimensions are the humans of two wikis similar or different?"
- 3) Enable future community features through an API.
- The data that is produced through WHGI and Denelezh is not easily reusable outside of their own websites. We want to make sure the data is usable across the full tools and data ecosystem. For instance we want to:
- Enable data ingest for the Gender Campaigns Tool (https://docs.google.com/document/d/1LtBujK6kARbwxUzDyF6hpv445rJwk7Ka7rAlf7e9USg/edit#) being built.
- Easily enable bot-updating of stats or lists onto wiki pages, or embeds into any website. E.g. weekly reports delivered to Wikiprojects.
- There are even moon-shot ideas that could be unlocked with an open API:
- Unlock A/B testing of editing efforts between languages.
- Unlock AI training data to know what predicts biased editing patterns.
Project impact
How will you know if you have met your goals?
- The WHO tool is a complete replacement for WHGI and Denelezh; therefore we will want:
- 1 new repository with 2+ contributors. This project aims to consolidate projects, so having just one active software repository is better.
- 3+ backlog features. Over the years we have amassed a backlog of feature requests on these tools (comparisons, evolutions, lists, etc.), and we aim to implement at least the 3 most desired.
- 1,000 unique visitors to our web front-end within the first month.
- 1 year of successful running. The tool can run for 1 year without breaking. (For what it's worth, the current developers have been maintaining the older tools for nearly 5 years.)
- For the community features we will elicit during this process
- Elicitation outputs. Written summary of the elicitation process, including sketches and mocks of what the community requested features will look like.
- 2 new community features. We aim to solicit new ideas from the community for the tool and implement at least two of them.
- Software acceptance outputs. We will follow a "software acceptance" process. This will let us know whether our community features are satisfactory to our users, and we will record this process.
- 1 new Wikiproject user. Our tools are currently in use by Wikiprojects; it would be fantastic to support new projects.
- 1 new API user. We aim to have 1 API user using our data (the Gender Campaigns Tool or another).
Do you have any goals around participation or content?
We will be conducting user-centered design throughout our development, which means there will be rounds of elicitation, prototyping, user-testing, and user-acceptance. We will also create blog and video updates and announcements.
Project plan
Activities
- A) Review WHGI and Denelezh with Community
- We will synthesize the past discussions and bug reports defining our "backlog features".
- Host a virtual focus group with previous users of WHGI and Denelezh (asynchronously and synchronously), creating "community features".
- B) Core application
This step creates the underlying tech stack. It has two sub-goals:
- Define the schema of the intermediate database (using section 2.2 from preliminary work).
- Design and implement the core of the application, with the ability to:
- download Wikidata JSON dumps,
- create intermediate statistics and reports in a local database,
- serve statistics via a generic API, to our own front-end and other services.
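To make the dump-processing step above more concrete, here is a minimal sketch of extracting humans from a Wikidata JSON dump. It relies on the dump layout, in which the file is one JSON array with one entity per line, and on the Wikidata identifiers Q5 (human), P31 (instance of) and P21 (sex or gender). The function names, the shape of the yielded records, and the demo entities are assumptions made for this example, not the project's actual code.

```python
import json

HUMAN = "Q5"            # Wikidata item "human"
INSTANCE_OF = "P31"     # Wikidata property "instance of"
SEX_OR_GENDER = "P21"   # Wikidata property "sex or gender"


def item_values(entity, prop):
    """Return the item IDs used as values of `prop`; statements without a value are skipped."""
    ids = []
    for statement in entity.get("claims", {}).get(prop, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") == "value":
            ids.append(snak["datavalue"]["value"]["id"])
    return ids


def iter_humans(lines):
    """Yield a small record for each human found in an iterable of dump lines."""
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # array brackets and blank lines carry no entity
        entity = json.loads(line)
        if HUMAN not in item_values(entity, INSTANCE_OF):
            continue
        yield {
            "qid": entity["id"],
            "genders": item_values(entity, SEX_OR_GENDER),
            "projects": sorted(entity.get("sitelinks", {})),
        }


def _statement(item_id):
    """Build a minimal statement for the made-up demo entities below."""
    return {"mainsnak": {"snaktype": "value", "datavalue": {"value": {"id": item_id}}}}


# Two made-up entities in dump layout: one human, one non-human.
sample_lines = [
    "[",
    json.dumps({
        "id": "demo-1",
        "claims": {"P31": [_statement("Q5")], "P21": [_statement("Q6581072")]},
        "sitelinks": {"frwiki": {"site": "frwiki", "title": "Demo"}},
    }) + ",",
    json.dumps({"id": "demo-2", "claims": {"P31": [_statement("demo-city")]}}),
    "]",
]
humans = list(iter_humans(sample_lines))
print(humans)
```

Because the function only needs an iterable of lines, a compressed dump could be streamed through it with `gzip.open(path, "rt", encoding="utf-8")` without loading the whole file into memory.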
- C) Re-implement the features of WHGI and Denelezh
- "Backfill" snapshot data from WHGI into the new architecture.
- Re-implement the statistical reports of WHGI and Denelezh in the new architecture
- Create a skeleton dynamic web front-end
- D) Implement new reports from Community Input
Here we will implement our "backlog" and "community" features:
- We will prototype and wire-frame features first, allowing for community input before building.
- We will conduct user-testing midway, and software acceptance testing at the end.
- Taskforce and list-making support:
- Evolution view.
- External ID support.
- Comparison of two subsets (with the ability to explore subsets by external ID, in addition to date, gender, year of birth, country of citizenship, occupation, Wikimedia project).
- Hierarchy of occupations.
- Data quality:
- Show data availability of specific Wikidata properties describing humans (with filtering by date, gender, year of birth, country of citizenship, occupation, Wikimedia project, external ID).
- List data that are probably mistakes that can be fixed on Wikidata (e.g. a nationality like French instead of France).
See Mockup for some ideas.
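The "comparison of two subsets" feature listed above can be pictured as a filter over per-human records. The following is a minimal sketch, assuming each record carries the item ID, its gender values, and the Wikimedia projects that have an article about it; this record shape, the function name `in_first_not_second`, and the demo data are all hypothetical.

```python
def in_first_not_second(records, first, second, gender=None):
    """Return sorted IDs of humans with an article in `first` but none in `second`.

    If `gender` is given, only humans with that gender value are considered.
    """
    result = []
    for record in records:
        if gender is not None and gender not in record["genders"]:
            continue
        if first in record["projects"] and second not in record["projects"]:
            result.append(record["qid"])
    return sorted(result)


# Made-up demo records (Q6581072 = female, Q6581097 = male).
records = [
    {"qid": "demo-1", "genders": ["Q6581072"], "projects": ["frwiki"]},
    {"qid": "demo-2", "genders": ["Q6581072"], "projects": ["enwiki", "frwiki"]},
    {"qid": "demo-3", "genders": ["Q6581097"], "projects": ["frwiki"]},
]

# Women with an article in French but not in English.
print(in_first_not_second(records, "frwiki", "enwiki", gender="Q6581072"))
```

The same filter could be extended with the other dimensions named above (year of birth, country of citizenship, occupation, external ID) to build worklists for editors.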
- E) API & Exports
Statistics and data generated by the project will be reusable and published under the CC0 license (public domain).
- Create an API that allows 3rd party applications (like the Gender Campaigns Tool) to access any of the data available to the front-end via HTTPS.
- Export data from reports in CSV files.
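As a small illustration of the CSV export, a report held as a list of rows could be serialized with the `csv` module from the Python standard library. The function name `report_to_csv`, the column names, and the figures are made up for this sketch.

```python
import csv
import io


def report_to_csv(rows, fieldnames):
    """Serialize a list of dict rows to CSV text, header line first."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up report rows, for illustration only.
report = [
    {"project": "frwiki", "total": 4, "female": 3},
    {"project": "enwiki", "total": 2, "female": 1},
]
print(report_to_csv(report, ["project", "total", "female"]))
```

The same rows could be returned as JSON by the API, so that the front-end, the CSV export, and 3rd-party tools all read from one representation of a report.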
- F) Internationalization and localization
As the application should be available in as many languages as possible, two things will be internationalized:
- the user interface, relying on translatewiki.net,
- the content, relying on Wikidata translations.
The application will be available in English and in French. Other translations are out of scope and will be made by the community.
- G) Documentation
The project will provide:
- end-user documentation:
- directly in the tool,
- slides for hands-on workshops presenting the tool and its features to end-users.
- technical documentation, at least:
- architecture description (on Meta),
- database schema,
- code under AGPLv3 license.
- H) Future Maintenance Plan
- Have a plan for how errors will be reported.
- Have a way to monitor data processing.
- Agree on a way to collect bug reports.
- I) Project Management
- We will conduct weekly meetings to keep the project alive.
- We will conduct bi-weekly (and special case) meetings with our Counselor/Advisor to make sure the project is ecologically on track.
Budget
- Human resources
- Software engineering at $50/hour
- 480 hours (12 weeks) of engineering.
- $24,000
- Project management at $50/hour
- 30 hours Community Interaction
- 20 hours Meetings and writing
- $2,000
- Counselor/advisor at $50/hour (we would like help finding a domain expert in Wikipedia and feminist technology).
- 20 hours advising.
- $1,000
- Travel
- Wikimania
- 3 x $2000 stipends for each team member
- $6,000
- Wikimedia Hackathon
- 2 x $2000 stipends for each team member
- $4,000
- Total
- $37,000
- Server hosting
WHGI was funded by an IEG grant in 2015. The Wikimedia Foundation still provides a server to WHGI, which will be reused for this project. Denelezh relies on a server provided by Wikimédia France; this server will be decommissioned once goals A and B are achieved, saving Wikimédia France some money.
Community engagement
Online workshops will be organized with end-users at some milestones:
- User feedback and feature wishlist making (at stage A)
- User prototype testing (at stage C).
- User acceptance testing (at stage E) until the end of the project.
Get involved
Participants
- Maximilian Klein, Data scientist, founder of WHGI
- Envel Le Hir, Data engineer, founder of Denelezh
- UX/UI designer (yet to be found, maybe a subcontract with Wikimedia Foundation or Wikimedia Deutschland?)
Community notification
The project was presented on several occasions:
- August 2019: Phabricator task T230184
- September 2019: lightning talk at WikiConvention francophone (slides in French, in English)
- October 2019: poster at WikidataCon (poster)
- November 2019: poster at WikiConference North America (poster)
- TODO: on relevant communities (on-wiki: Women in Red, Les sans pagEs, WikiDonne, Wikidata project chat, ..., on mailing-lists: wiki-research-l, analytics, wikidata) at milestones.