Property talk:P594


Documentation

Ensembl gene ID
identifier for a gene as per the Ensembl (European Bioinformatics Institute and the Wellcome Trust Sanger Institute) database
Represents: gene (Q7187)
Associated item: Ensembl genome database project (Q1344256)
Applicable "stated in" value: Ensembl genome database project (Q1344256)
Data type: External identifier
Template parameter: en:Template:GNF_Protein_box = Hs_Ensembl (humans), Mm_Ensembl (mouse)
Domain
According to this template: gene
According to statements in the property:
gene (Q7187)
When possible, data should only be stored as statements
Allowed values: (ENS(|MUS|RNO)G\d{11})|(Y[A-P][LR]\d{3}[CW](-[A-G])?) (string of species-specific uppercase prefix followed by 11 digits, or an ORF name)
Example: MB (Q14819296) → ENSG00000198125 (RDF)
Mb (Q14819298) → ENSMUSG00000018893 (RDF)
FN1 (Q14819473) → ENSG00000115414 (RDF)
Fn1 (Q14819475) → ENSMUSG00000026193 (RDF)
Fn1 (Q24394377) → ENSRNOG00000014288 (RDF)
Source: https://www.ensembl.org
Formatter URL: https://identifiers.org/ensembl/$1
Tracking: usage: Category:Pages using Wikidata property P594 (Q26250016)
See also: Ensembl protein ID (P705), Ensembl transcript ID (P704)
Lists
Proposal discussion
Current uses
Total: 363,321
Main statement: 156,412 (43.1% of uses)
Qualifier: 5 (<0.1% of uses)
Reference: 206,904 (56.9% of uses)
Single value: this property generally contains a single value. (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P594#Single value, SPARQL
Format “(ENS(|MUS|RNO|DAR)G\d{11})|(Y[A-P][LR]\d{3}[CW](-[A-G])?)|(WBGene\d{8})|(FBgn\d{7})|(Q\d{4})”: value must be formatted using this pattern (PCRE syntax). (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P594#Format, SPARQL
Type “gene (Q7187)”: item must contain property “instance of (P31), subclass of (P279)” with classes “gene (Q7187)” or their subclasses (defined using subclass of (P279)). (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P594#Type Q7187, SPARQL
Distinct values: this property likely contains a value that is different from all other items. (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P594#Unique value, SPARQL (every item), SPARQL (by value)
Scope is as main value (Q54828448), as reference (Q54828450): the property must be used only in the specified way. (Help)
List of violations of this constraint: Database reports/Constraint violations/P594#Scope, hourly updated report, SPARQL
Allowed entity types are Wikibase item (Q29934200): the property may only be used on a certain entity type (Help)
Exceptions are possible as rare values may exist. Exceptions can be specified using exception to constraint (P2303).
List of violations of this constraint: Database reports/Constraint violations/P594#Entity types
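
As a rough illustration of the format constraint listed above, the pattern can be tried out locally before adding a value. This is only a sketch: the function name `is_valid_ensembl_gene_id` is made up for this example, and it assumes the constraint is meant to match the whole value rather than a substring.

```python
import re

# Pattern copied from the format constraint on this page (PCRE syntax).
# Python's re module accepts this particular expression unchanged.
ENSEMBL_GENE_ID = re.compile(
    r"(ENS(|MUS|RNO|DAR)G\d{11})"
    r"|(Y[A-P][LR]\d{3}[CW](-[A-G])?)"
    r"|(WBGene\d{8})"
    r"|(FBgn\d{7})"
    r"|(Q\d{4})"
)


def is_valid_ensembl_gene_id(value: str) -> bool:
    """Return True if the whole string matches the constraint pattern.

    Hypothetical helper: full-string matching is an assumption here.
    """
    return ENSEMBL_GENE_ID.fullmatch(value) is not None


# Example values taken from the documentation box on this page.
for candidate in ("ENSG00000198125", "ENSMUSG00000018893", "ENSRNOG00000014288"):
    print(candidate, is_valid_ensembl_gene_id(candidate))
```

A value with the wrong number of digits, lowercase letters, or trailing whitespace is rejected by `fullmatch`, which is the kind of entry that would be expected to appear in the format violations report.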
Please notify projects that use this property before big changes (renaming, deletion, merge with another property, etc.)

Not quite unique but close

In some instances the same Ensembl ID can point to what other databases, such as Entrez, treat as separate things. See for example Q18048211 and Q18033903, which both might/can/should link to ENSG00000003096 but appear to be different entities. This can result in multiple records with the same Ensembl ID. In the vast majority of cases these IDs are in fact unique. For more information, see the thread: https://www.biostars.org/p/16505/#16604
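
To make the situation concrete, here is a small sketch of how shared values could be spotted in a list of (item, Ensembl ID) statements. The helper name `find_shared_ids` and the input list are assumptions for illustration only; the first two rows simply assume that both items mentioned above carry ENSG00000003096, and the third row is the MB example from the documentation box.

```python
from collections import defaultdict


def find_shared_ids(statements):
    """Group item IDs by Ensembl gene ID and keep IDs used on two or more items.

    `statements` is an iterable of (item_id, ensembl_id) pairs.
    Hypothetical helper, not part of any Wikidata tool.
    """
    items_by_id = defaultdict(set)
    for item_id, ensembl_id in statements:
        items_by_id[ensembl_id].add(item_id)
    return {
        ensembl_id: sorted(items)
        for ensembl_id, items in items_by_id.items()
        if len(items) > 1
    }


# Illustrative input, assumed for this sketch.
statements = [
    ("Q18048211", "ENSG00000003096"),
    ("Q18033903", "ENSG00000003096"),
    ("Q14819296", "ENSG00000198125"),
]

shared = find_shared_ids(statements)
print(shared)
```

Only the ID that appears on more than one item is reported; IDs that are unique, which is the vast majority, drop out.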


Added format ENSRNOG

There are about 19,000 format violations due to IDs starting with "ENSRNOG". Assuming these are correct, I am adding the RNO case to the format constraint. -- LaddΩ chat ;) 13:57, 13 June 2016 (UTC)

How to deal with "legitimate" constraint violations

There is currently a relatively high number of constraint violations reported. Most, if not all, are due to the ProteinBoxBot, which maintains genes on Wikidata.

The issue is that the ProteinBoxBot, which maintains gene annotations on Wikidata, uses the NCBI gene ID as its key. The bot's workings are straightforward: on every run, all NCBI gene annotations are extracted and updated in Wikidata, together with all known external mappings. The Ensembl gene ID (P594) is part of this set of mappings. Although efforts are in place to harmonize Ensembl and NCBI, there are still cases where one Ensembl ID maps to multiple NCBI Gene IDs. I checked the constraint violations being reported and they appear to be "legitimate". There are even cases where one Ensembl ID maps to 14 or 19 NCBI Gene IDs. Counts (NCBI Gene IDs per Ensembl ID: number of Ensembl IDs): {2: 486, 3: 48, 4: 12, 5: 3, 6: 3, 7: 1, 8: 1, 10: 1, 14: 1, 19: 1}. This is a known issue. The question is how to proceed. Could we remove this constraint from the property? Or do we allow the number of constraint violations to reach close to 1000? As more species get coverage in Wikidata, the number of violations of this type will likely increase. --Andrawaag (talk) 22:20, 4 July 2016 (UTC)
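
For anyone wanting to reproduce a tally like the counts above, the following is a minimal sketch, not the bot's actual code. It assumes the mappings are available as (NCBI Gene ID, Ensembl ID) pairs; the function name `mapping_size_histogram` and all IDs in the sample are placeholders invented for the example.

```python
from collections import Counter, defaultdict


def mapping_size_histogram(pairs):
    """Count how many Ensembl IDs map to 2, 3, 4, ... distinct NCBI Gene IDs.

    `pairs` is an iterable of (ncbi_gene_id, ensembl_id) tuples.
    Ensembl IDs that map to a single NCBI Gene ID are left out, since
    they do not violate the distinct-values constraint.
    """
    ncbi_by_ensembl = defaultdict(set)
    for ncbi_gene_id, ensembl_id in pairs:
        ncbi_by_ensembl[ensembl_id].add(ncbi_gene_id)
    sizes = Counter(len(ids) for ids in ncbi_by_ensembl.values())
    return {size: count for size, count in sorted(sizes.items()) if size > 1}


# Placeholder data: these IDs are invented and do not refer to real genes.
sample_pairs = [
    ("ncbi-1", "ensembl-A"),
    ("ncbi-2", "ensembl-A"),
    ("ncbi-3", "ensembl-B"),
    ("ncbi-4", "ensembl-C"),
    ("ncbi-5", "ensembl-C"),
    ("ncbi-6", "ensembl-C"),
]

histogram = mapping_size_histogram(sample_pairs)
print(histogram)
```

The result has the same shape as the counts quoted above: the key is the number of NCBI Gene IDs sharing one Ensembl ID, and the value is how many Ensembl IDs fall into that group.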