Approved license
I assert that this Outreachy internship project will be released under either an OSI-approved open source license that is also identified by the FSF as a free software license, OR a Creative Commons license approved for free cultural works.
- Yes
No proprietary software
I assert that this Outreachy internship project will forward the interests of free and open source software, not proprietary software.
- Yes
How long has your team been accepting publicly submitted contributions?
- 1 year
How many regular contributors does your team have?
- 1-2 people
Brief summary
Names are really complex. Which part is the first name? Which is the middle name? How do you define your surname? What happens if you have multiple family names? How do names work across multiple languages and cultures?
Accurately recording this information is important for scientific references that are used in Wikipedia articles and Wikidata items: if it is wrong, it is easy to misattribute publications or to miss connections between different works by the same author. It is also very difficult to get right, because naming conventions differ widely between languages and cultures.
This project will focus on understanding what makes a name, and how it can be recorded in structured data, across many languages and conventions. The project focuses on Wikidata, which is the structured data repository linked to Wikipedia and the other Wikimedia projects. Wikidata holds records of millions of scientific publications as part of WikiCite. However, identifying individual author names and linking between their different publications is still in its early stages.
In this project, you will use currently available BibTeX author information to split author names into 'first' and 'last' names, and you will add this information to thousands of Wikidata items using Pywikibot. You will also explore other approaches, potentially including machine learning, to see how reliably first and last names can be identified.
This project is mentored by Mike Peel and Andy Mabbett. Knowledge of scientific references and of Python is useful, although both can be learnt during the project.
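As a rough illustration of the first/last split, here is a minimal Python sketch that relies only on the common BibTeX conventions: authors are separated by " and ", and each name is written either as "Last, First" or as "First Last", with a lowercase particle such as "van" treated as the start of the last name. The function names `split_authors` and `split_name` are made up for this example, and the rules are a simplification, not the project's actual method.

```python
import re


def split_authors(author_field):
    """Split a BibTeX author field into individual names on ' and '."""
    parts = re.split(r"\s+and\s+", author_field.strip())
    return [p.strip() for p in parts if p.strip()]


def split_name(name):
    """Return a (first, last) tuple for one BibTeX-style name.

    Simplified rules (an assumption for this sketch):
    - "Last, First" or "Last, Jr, First": text before the first comma is
      the last name, text after the final comma is the first name.
    - "First Last": the final token is the last name, unless a lowercase
      token (e.g. "van", "de") appears earlier, in which case the last
      name starts at that token.
    """
    name = " ".join(name.split())  # collapse repeated whitespace
    if "," in name:
        parts = [p.strip() for p in name.split(",")]
        return parts[-1], parts[0]
    tokens = name.split()
    if len(tokens) == 1:
        return "", tokens[0]
    for i, token in enumerate(tokens[:-1]):
        if token[0].islower():
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    return " ".join(tokens[:-1]), tokens[-1]


if __name__ == "__main__":
    for author in split_authors("Peel, Mike and Andy Mabbett"):
        print(split_name(author))
```

A sketch like this silently gets names wrong when the family name is written first without a comma, or when there are several family names, which is the kind of case the project needs to investigate.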
Minimum system requirements
You will need a computer with a working Python 3 installation; you can install pywikibot and other useful modules using standard package systems.
How can applicants make a contribution to your project?
You will start by learning how scientific references in Wikidata are structured, particularly with their author names. You will then investigate how author names are described, and how to identify first and last names of the authors. You will then write code that automatically identifies first and last names, and adds them to Wikidata.
You will need to create an account on Wikipedia (if you don't already have one), and install the pywikibot package (https://www.mediawiki.org/wiki/Manual:Pywikibot). I can provide guidance for each specific starting task, and in general please feel free to ask questions through Outreachy, by email, or at https://www.wikidata.org/wiki/User_talk:Mike_Peel .
Repository
https://github.com/mpeel/wikicode/
Issue tracker
N/A
Tasks
There are three 'starter' tasks that can be done as Outreachy contributions. These aim to guide you through how Wikipedia and Wikidata are structured, and how Pywikibot interacts with them. They get progressively harder, and you should do them in order. You don't have to do all of them, but it's recommended that you try. These tasks also form the start of the main project.
- T301733 Look at existing Wikidata items for scientific articles. Document how author names have been recorded in them, and how they could be improved.
- T301735 Set up pywikibot on your computer, and understand how it interacts with Wikidata.
- T301737 Take a specific item (specified by Mike or Andy), and try to identify the first and last names of the authors (using bibtex/other means).
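To get a feel for what the first task involves, the sketch below reads author names out of a Wikidata item's JSON. It assumes the standard Wikibase entity JSON layout and two Wikidata properties: P2093 ("author name string") and its P1545 ("series ordinal") qualifier. The `SAMPLE_ENTITY` dictionary and the function name `author_name_strings` are invented for illustration; a real item would be fetched through Pywikibot or the Wikidata API instead.

```python
# Hand-written stand-in for an item's JSON (hypothetical data, not a real item).
SAMPLE_ENTITY = {
    "claims": {
        "P2093": [
            {
                "mainsnak": {"snaktype": "value",
                             "datavalue": {"value": "B. Author", "type": "string"}},
                "qualifiers": {"P1545": [{"snaktype": "value",
                                          "datavalue": {"value": "2", "type": "string"}}]},
            },
            {
                "mainsnak": {"snaktype": "value",
                             "datavalue": {"value": "A. Author", "type": "string"}},
                "qualifiers": {"P1545": [{"snaktype": "value",
                                          "datavalue": {"value": "1", "type": "string"}}]},
            },
            # A statement with no value is skipped.
            {"mainsnak": {"snaktype": "novalue"}},
        ]
    }
}


def author_name_strings(entity):
    """Return (ordinal, name) pairs from P2093 statements, in author order."""
    results = []
    for statement in entity.get("claims", {}).get("P2093", []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue
        name = snak["datavalue"]["value"]
        ordinal = None
        for qualifier in statement.get("qualifiers", {}).get("P1545", []):
            if qualifier.get("snaktype") == "value":
                ordinal = qualifier["datavalue"]["value"]
                break
        results.append((ordinal, name))

    def sort_key(pair):
        # Numeric ordinals come first; anything else keeps its original order.
        ordinal = pair[0]
        if ordinal is not None and ordinal.isdigit():
            return (0, int(ordinal))
        return (1, 0)

    return sorted(results, key=sort_key)


if __name__ == "__main__":
    for ordinal, name in author_name_strings(SAMPLE_ENTITY):
        print(ordinal, name)
```

Each name string returned here is unsplit free text, so it is the natural input for whatever first/last name identification you try in the third task.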
Application and timeline
The Outreachy positions are assessed solely on the contributions and the application you submit for the project; the best things you can do are to do well with the contributions, and include all relevant information in your application. Contributions are evaluated based on their completeness, coding style, understanding of the tasks, and any additional work beyond the core task. I generally look for applicants who have demonstrated that they understand the tasks and the Wikimedia community.
When filling in your application, you will be asked about a timeline for the work during the project. I encourage you to draft a rough timeline yourself, bearing in mind:
- You should split the timeline into periods, e.g., weekly or two-weekly, and write a short summary of what you expect to be doing in that period.
- The aim of the project is to add first and last name information for authors to Wikidata items, but this will be done in stages (e.g., different languages and naming conventions; drafting vs. testing on different items vs. running code)
- Large runs to edit Wikidata items will need bot approval (can be 2 weeks, can be longer if controversial), and you should include time for that (waiting for approval while working on other parts!)
- Be realistic about what you think you will be able to achieve during the internship - you won't be able to do everything!
- If you are accepted, we will work together to revise the timeline as the work progresses - it doesn't have to be perfect!
There are no community-specific questions to answer in your application for this project. If you can demonstrate general knowledge of the Wikimedia community (e.g., past editing of Wikipedia/Wikidata) or previous Python coding activities in your application, that will be really helpful, but it is not essential.
Also, please bear in mind that we will only accept one intern for this project, so we strongly recommend contributing to multiple Outreachy projects (particularly those with few applicants) to increase your chances of getting an internship.
You are also encouraged to attend the Wikimedia Hackathon on 20-22 May:
Benefits
You will learn, or improve your knowledge of, Python coding. You will gain familiarity with how structured data is maintained on Wikidata and in other scientific databases.
Community benefits
Better metadata for WikiCite items, and the ability to sort references by surname on Wikipedia.
Questions?
Please feel free to ask questions in this Phabricator task, or in the subtasks. You can also email me if you want (my address is available via Outreachy).