What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata
Closed, Resolved, Public

Assigned To

Authored By

	Mike_Peel
	Jan 26 2022, 10:24 PM

Description

IMPORTANT: Make sure to read the Outreachy participant instructions and communication guidelines thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on Zulip first!

Approved license

I assert that this Outreachy internship project will be released under either an OSI-approved open source license that is also identified by the FSF as a free software license, OR a Creative Commons license approved for free cultural works.

No proprietary software:

I assert that this Outreachy internship project will forward the interests of free and open source software, not proprietary software.

How long has your team been accepting publicly submitted contributions?

1 year

How many regular contributors does your team have?

1-2 people

Brief summary

Names are really complex. Which part is the first name? Which is the middle name? How do you define your surname? What happens if you have multiple family names? How do names work across multiple languages and cultures?

Accurately recording this information is important for scientific references that are used in Wikipedia articles and Wikidata items. If it is wrong, it's easy to misattribute publications, or to miss connections between different works by the same author. It's also very difficult to get right, since naming conventions are complex and differ between languages.

This project will focus on understanding what makes a name, and how it can be recorded in structured data, across many languages and conventions. The project focuses on Wikidata, which is the structured data repository linked to Wikipedia and the other Wikimedia projects. Wikidata holds records of millions of scientific publications as part of WikiCite. However, identifying individual author names and linking between their different publications is still in its early stages.

In this project, you will use currently available BibTeX author information to split author names into 'first' and 'last' names, and you will add this information to thousands of Wikidata items using Pywikibot. You will explore other approaches to identifying first and last names, potentially including machine learning, to see how reliably you can identify first/last names.
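As a rough illustration of the splitting step, here is a minimal sketch of a rule-based splitter that follows the two common BibTeX name forms ("Last, First" and "First von Last"). The function name `split_bibtex_name` is hypothetical and not taken from the project repository, and the rules are a simplification: braces, suffixes such as "Jr", and most non-Western conventions are not handled properly.

```python
def split_bibtex_name(name):
    """Split one author name into (first, last) using simplified BibTeX rules.

    Assumption: this is an illustrative sketch, not the project's code.
    """
    # Collapse runs of whitespace so the token logic below is predictable.
    name = " ".join(name.split())

    # Form 1: "Last, First" (or "Last, Jr, First"): the surname comes first.
    if "," in name:
        parts = [part.strip() for part in name.split(",")]
        return parts[-1], parts[0]

    # Form 2: "First von Last": the final token always belongs to the
    # surname, and a lowercase particle ("van", "de", ...) starts it earlier.
    tokens = name.split(" ")
    start = len(tokens) - 1
    for index, token in enumerate(tokens[:-1]):
        if token[0].islower():
            start = index
            break
    return " ".join(tokens[:start]), " ".join(tokens[start:])


print(split_bibtex_name("Peel, Mike"))              # ('Mike', 'Peel')
print(split_bibtex_name("Ludwig van Beethoven"))    # ('Ludwig', 'van Beethoven')
print(split_bibtex_name("Gabriel García Márquez"))  # ('Gabriel García', 'Márquez')
```

The last call shows why simple rules are not enough: without a comma, a double family name is split in the wrong place, whereas "García Márquez, Gabriel" would be handled correctly.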

This project is mentored by Mike Peel and Andy Mabbett. Knowledge of scientific references and Python is useful, although both can be learnt during the project.

Minimum system requirements

You will need a computer with a working Python 3 installation; you can install pywikibot and other useful modules using standard package systems.

How can applicants make a contribution to your project?

You will start by learning how scientific references in Wikidata are structured, particularly with their author names. You will then investigate how author names are described, and how to identify first and last names of the authors. You will then write code that automatically identifies first and last names, and adds them to Wikidata.
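To get a feel for the first step, the sketch below lists the author name strings stored on a Wikidata item, in author order. It assumes that author names are held as "author name string" statements (property P2093) with a "series ordinal" qualifier (P1545); these property IDs are stated from general knowledge of Wikidata rather than from this task, so please check them on Wikidata. The function names and the item ID passed to `load_claims` are hypothetical.

```python
AUTHOR_NAME_STRING = "P2093"  # assumed: "author name string"
SERIES_ORDINAL = "P1545"      # assumed: "series ordinal" qualifier


def author_name_strings(claims):
    """Return (ordinal, name) pairs from an item's claims, in author order."""
    rows = []
    for claim in claims.get(AUTHOR_NAME_STRING, []):
        ordinals = claim.qualifiers.get(SERIES_ORDINAL, [])
        ordinal = ordinals[0].getTarget() if ordinals else None
        rows.append((ordinal, claim.getTarget()))

    def sort_key(row):
        ordinal = row[0]
        # Numbered authors first, in numeric order; unnumbered ones last.
        if isinstance(ordinal, str) and ordinal.isdigit():
            return (0, int(ordinal))
        return (1, 0)

    return sorted(rows, key=sort_key)


def load_claims(item_id):
    """Fetch an item's claims with Pywikibot (needs a configured install)."""
    import pywikibot  # imported lazily so the helper above works without it

    site = pywikibot.Site("wikidata", "wikidata")
    item = pywikibot.ItemPage(site.data_repository(), item_id)
    item.get()
    return item.claims
```

With Pywikibot configured, `author_name_strings(load_claims("Q..."))` (using the ID of a real scientific article item) would give the list of names to split. This sketch only reads data; it does not edit Wikidata.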

You will need to create an account on Wikipedia (if you don't already have one), and install the pywikibot package (https://www.mediawiki.org/wiki/Manual:Pywikibot). I can provide guidance for each specific starting task, and in general please feel free to ask questions through Outreachy, by email, or at https://www.wikidata.org/wiki/User_talk:Mike_Peel .

Repository

https://github.com/mpeel/wikicode/

Issue tracker

N/A

Tasks

There are three 'starter' tasks that can be done as Outreachy contributions. These aim to guide you through how Wikipedia and Wikidata are structured, and how Pywikibot interacts with them. They get progressively harder, and you should do them in order. You don't have to do all of them, but it's recommended to try to do so. These tasks also form the start of the main project.

T301733 Look at existing Wikidata items for scientific articles. Document how author names have been recorded in them, and how they could be improved.
T301735 Set up Pywikibot on your computer, and understand how it interacts with Wikidata.
T301737 Take a specific item (specified by Mike or Andy), and try to identify the first and last names of the authors (using BibTeX/other means).
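For the third task, a BibTeX record usually lists all authors in a single `author` field, separated by the word "and". Before any first/last identification, that field has to be cut into individual names. The following is a minimal sketch of that step, assuming the usual BibTeX convention that an "and" inside braces is part of a name and not a separator. The function name `split_author_field` is hypothetical, and the example names are made up.

```python
def split_author_field(field):
    """Split a BibTeX author field into individual name strings.

    Assumption: only a lowercase " and " outside braces separates names.
    """
    text = " ".join(field.split())  # normalise line breaks and extra spaces
    names, current, depth = [], [], 0
    i = 0
    while i < len(text):
        char = text[i]
        if char == "{":
            depth += 1
        elif char == "}":
            depth = max(depth - 1, 0)
        # A separator only counts when we are not inside any braces.
        if depth == 0 and text[i:i + 5] == " and ":
            names.append("".join(current).strip())
            current = []
            i += 5
            continue
        current.append(char)
        i += 1
    last = "".join(current).strip()
    if last:
        names.append(last)
    return names


print(split_author_field("Doe, Jane and {Barnes and Noble} and John Smith"))
# ['Doe, Jane', '{Barnes and Noble}', 'John Smith']
```

Each resulting string can then be passed to whatever first/last identification approach is being tested. Braced entries are typically organisations or collaborations, which have no first or last name at all.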

Application and timeline

The Outreachy positions are assessed solely on the contributions and the application you submit for the project; the best things you can do are to do well with the contributions, and include all relevant information in your application. Contributions are evaluated based on their completeness, coding style, understanding of the tasks, and any additional work beyond the core task. I generally look for applicants who have demonstrated that they understand the tasks and the Wikimedia community.

When filling in your application, you will be asked about a timeline for the work during the project. I encourage you to draft a rough timeline yourself, bearing in mind:

You should split the timeline into periods, e.g., weekly or two-weekly, and write a short summary of what you expect to be doing in that period.
The aim of the project is to identify the first and last names of authors and add them to Wikidata items, but this will be done in stages (e.g., drafting code vs. testing on specific items vs. running code on thousands of items).
Large runs to edit Wikidata items will need bot approval (can be 2 weeks, can be longer if controversial), and you should include time for that (waiting for approval while working on other parts!)
Be realistic about what you think you will be able to achieve during the internship - you won't be able to do everything!
If you are accepted, we will work together to revise the timeline as the work progresses - it doesn't have to be perfect!

There are no community-specific questions to answer in your application for this project. If you can demonstrate general knowledge of the Wikimedia community (e.g., past editing of Wikipedia/Wikidata), or previous Python coding activities in your application, that will be really helpful, but not essential.

Also, please bear in mind that we will only accept one intern for this project, so we strongly recommend contributing to multiple Outreachy projects (particularly those with few applicants) to increase your chances of getting an internship.

You are also encouraged to attend the Wikimedia Hackathon on 20-22 May:

https://www.mediawiki.org/wiki/Wikimedia_Hackathon_2022

Benefits

You will learn, or improve your knowledge of, Python coding. You will gain familiarity with how structured data is maintained on Wikidata and in other scientific databases.

Community benefits

Better metadata for WikiCite items, and being able to sort references by surname on Wikipedia.

Questions?

Please feel free to ask questions in this Phabricator task, or in the subtasks. You can also email me if you want (my address is available via Outreachy).

Related Objects

Status	Assigned	Task
Resolved	Mike_Peel	T300207 What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata
Resolved	Feliciss	T309766 Summarize understanding of the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata
Resolved	Feliciss	T309840 Wikimania 2022 Draft for the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata
Resolved	Feliciss	T310361 Approach to Names for the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata
Resolved	PangolinMexico	T311301 What's in a name? - AuthorBot: Process and Progress
Resolved	PangolinMexico	T314795 Wikimania Hackathon 2022: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata
Resolved	Feliciss	T315660 Manual for Running ADSBot English Paper on Toolforge
Resolved	Feliciss	T316089 Instructions for continuing contributing to ADSBot English Statement

Event Timeline

Mike_Peel created this task. Jan 26 2022, 10:24 PM

Restricted Application added a subscriber: Aklapper. Jan 26 2022, 10:24 PM

Mike_Peel updated the task description. Feb 14 2022, 10:04 PM

Mike_Peel mentioned this in T301733: What's in a name? Task 1. Feb 14 2022, 10:16 PM

Mike_Peel updated the task description.

Mike_Peel mentioned this in T301735: What's in a name? Task 2. Feb 14 2022, 10:22 PM

Mike_Peel updated the task description.

Mike_Peel mentioned this in T299453: Coordinate Wikimedia's participation in GSoC 2022 and Outreachy Round 24. Feb 14 2022, 10:33 PM

Mike_Peel updated the task description. Feb 14 2022, 10:36 PM

Mike_Peel mentioned this in T301737: What's in a name? Task 3. Feb 14 2022, 10:40 PM

Mike_Peel updated the task description.

Pigsonthewing awarded a token. Feb 15 2022, 7:44 PM

srishakatux moved this task from Backlog to Featured Projects on the Outreachy (Round 24) board. Feb 16 2022, 3:24 AM

@Mike_Peel Also, upload the project on the Outreachy site whenever you feel ready, and I will then approve. Thank you!

srishakatux changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)". Feb 22 2022, 8:13 PM

In T300207#7729925, @srishakatux wrote:

@Mike_Peel Also, upload the project on the Outreachy site whenever you feel ready, and I will then approve. Thank you!

Sure, could you remind me where to do that please?

In T300207#7729942, @Mike_Peel wrote:

Sure, could you remind me where to do that please?

Yes, step 4 here https://www.mediawiki.org/wiki/Outreachy/Mentors#_Before_the_program

Mike_Peel updated the task description. Feb 23 2022, 4:55 PM

In T300207#7729949, @srishakatux wrote:

In T300207#7729942, @Mike_Peel wrote:

Sure, could you remind me where to do that please?

Yes, step 4 here https://www.mediawiki.org/wiki/Outreachy/Mentors#_Before_the_program

Thanks - done!

@srishakatux Please could you include @Pigsonthewing as a mentor for the proposal on Outreachy (he's now registered on Outreachy under that username), and also give him view permissions for the Outreachy tickets?

In T300207#7734752, @Mike_Peel wrote:

@srishakatux Please could you include @Pigsonthewing as a mentor for the proposal on Outreachy (he's now registered on Outreachy under that username), and also give him view permissions for the Outreachy tickets?

Done done!

Mike_Peel updated the task description. Mar 5 2022, 8:33 PM

Mike_Peel updated the task description.

srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Mar 25 2022, 5:33 PM

PangolinMexico subscribed. Mar 25 2022, 5:45 PM

Jiehui_Ma subscribed. Mar 25 2022, 6:13 PM

abhigya_pandey subscribed. Mar 25 2022, 6:43 PM

Feliciss subscribed. Mar 26 2022, 5:53 AM

Aryan-kaushik21 subscribed. Mar 26 2022, 12:43 PM

Robot_Jelly subscribed. Mar 26 2022, 2:06 PM

Esther.Osayande subscribed. Mar 27 2022, 10:16 AM

Emmanuel-Wachukwu subscribed. Mar 27 2022, 12:58 PM

Siddharth628 subscribed. Mar 27 2022, 6:12 PM

Antima_Dwivedi subscribed. Mar 28 2022, 7:44 PM

Bisola7 subscribed. Apr 1 2022, 10:27 AM

Appledora subscribed. Apr 2 2022, 5:10 PM

MSGJ updated the task description. Apr 4 2022, 7:04 AM

Akandoria subscribed. Apr 4 2022, 1:05 PM

Arfat2396 subscribed. Apr 4 2022, 3:29 PM

Addydo subscribed. Apr 4 2022, 4:51 PM

Hello Mentors.
I found this project very interesting and I have started doing the tasks provided by you.
I have 3 years of experience in Python. I have developed a lot of projects in Python and also did two research internships at a reputed university, which involved Python, machine learning and deep learning, and I have also done an internship at Microsoft. I really enjoy working on open source.
I will be contributing to this community for the next one or two years, as I really appreciate and love the work done by you all.
Looking forward to working with this community.
Regards

In T300207#7829839, @Tamandeep_singh wrote:

I found this project very interesting and I have started doing the tasks provided by you.

Great - please have a go at the three starter tasks, and let me know when you're ready for me to look at your work!

Hello.
I am Amitha, a third-year undergraduate at IIIT Bangalore. I found this project quite interesting. Looking forward to getting started and making contributions. I have good experience with Python and machine learning.

Greetings everyone!
I am Sonali Rastogi, a prospective Outreachy intern from India.
I am a junior undergrad student pursuing my bachelor's at NIT Agartala. I am an innovator and have been developing multiple tech projects ranging from core engineering to web platforms.

I am interested in "What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata", a very exciting project resonating with my interest in research and journal papers. I am looking forward to working with this spectacular community.

Hi @Amitha67 and @Sonali.Rastogi welcome! Please try the first microtask to get started, see T301733

I've marked this as 'Closed to new applicants' on Outreachy as there are now several people that have completed all three tasks. The project is still open for contributions, though, and will remain so until the contribution period ends. And if anyone new still wants to have a go at the tasks, that's also fine.

Hello everyone, I'm Pinalee, an Outreachy applicant from India. I want to contribute to this project. Looking forward to working with the Wikimedia Foundation. Thank you.

Emmanuel-Wachukwu unsubscribed. Apr 12 2022, 7:23 AM

In T300207#7847279, @PinRathod wrote:

Hello everyone, I'm Pinalee, an Outreachy applicant from India. I want to contribute to this project. Looking forward to working with the Wikimedia Foundation. Thank you.

Hi Pinalee, please try task 1: T301733

Aklapper mentioned this in T309766: Summarize understanding of the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata. Jun 2 2022, 9:55 AM

Aklapper added a subtask: T309766: Summarize understanding of the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata.

Feliciss closed subtask T309766: Summarize understanding of the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata as Resolved. Jun 2 2022, 2:30 PM

Feliciss added a subtask: T309840: Wikimania 2022 Draft for the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata. Jun 3 2022, 9:52 AM

Feliciss closed subtask T309840: Wikimania 2022 Draft for the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata as Resolved. Jun 6 2022, 1:31 PM

Feliciss added a subtask: T310361: Approach to Names for the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata. Jun 10 2022, 2:15 PM

Here is a blog post about the project: https://colonelsheep.github.io/outreachy24-blog/week-5 . It's a nice summary of what is spread out over this Phabricator ticket. While it does not have pointers to more detailed information, this wiki page does.

Feliciss closed subtask T310361: Approach to Names for the project: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata as Resolved. Aug 19 2022, 10:30 AM

Feliciss added a subtask: T315660: Manual for Running ADSBot English Paper on Toolforge. Aug 19 2022, 11:28 AM

Feliciss added a subtask: T316089: Instructions for continuing contributing to ADSBot English Statement. Aug 24 2022, 9:44 AM

Aklapper closed subtask T314795: Wikimania Hackathon 2022: What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata as Resolved. Aug 29 2022, 6:58 AM

To mentors monitoring this task - could you ensure all relevant project updates get added to https://www.mediawiki.org/wiki/Outreachy/Past_projects? If there isn't anything remaining to be resolved, please close this Phabricator task. Move any pending items to a separate task.

Mike_Peel closed subtask T311301: What's in a name? - AuthorBot: Process and Progress as Resolved. Sep 24 2022, 4:59 PM

Mike_Peel closed subtask T315660: Manual for Running ADSBot English Paper on Toolforge as Resolved.

Mike_Peel closed this task as Resolved. Sep 24 2022, 5:04 PM

Mike_Peel claimed this task.

Mike_Peel closed subtask T316089: Instructions for continuing contributing to ADSBot English Statement as Resolved. Oct 2 2022, 7:36 PM
