Wikidata:WikiProject Molecular biology/Properties

Goals

  • This page aims to organize a consensus view of the properties that describe molecular biology concepts. Please be bold and add your suggestions below! (For example, what property should we create to capture connections between genes and the categories defined by the Gene Ontology or the Disease Ontology?)

Rules of this page:

  • Feel free to add a new property for discussion in the tables. Set the "Creation level" to "Proposal".
  • Please use the talk page to discuss the creation or use of properties. To discuss a specific property, create a new section on the talk page, set the "Creation level" to "Discussion", and link the property in the table to that section.

Other relevant pages

Understanding properties: Properties link to particular datatypes. http://meta.wikimedia.org/wiki/Wikidata/Data_model#Datatypes_and_their_Values

See examples on the (currently much more complete) Wikidata:Chemistry task force/Properties.

Main classes and their canonical database


Before turning to the properties of the main classes, here are the classes and their canonical databases. Every new instance of one of these classes should have an ID in its canonical database. It is possible, for example, to add a protein without its UniProt identifier, but the entry will not be updated by the bot when changes occur, and curators might avoid it. It also means that if you want to import new items of the main classes, duplicates are likely, and you are responsible for avoiding them beforehand. Creating many duplicates is frowned upon, as they have to be found and merged, which wastes a lot of time.

| Main class | Canonical database | Corresponding property | Possible EntitySchema |
|---|---|---|---|
| gene (Q7187) | Entrez (Q1345229) | Entrez Gene ID (P351) | E37 (human only), E165 (virus only), E74, E252 |
| protein (Q8054) | UniProt (Q905695) | UniProt protein ID (P352) | E167 (general), E38 (human only), E169 (virus only) |
| protein superfamily (Q7251477) and protein family (Q417841), except enzymes and transporters [note 1][note 2] | InterPro (Q3047275) | InterPro ID (P2926) | E233 |
| group or class of enzymes (Q67015883) | Gene Ontology (Q135085) [note 3] | molecular function (P680) | E277 |
| group or class of transmembrane transport proteins (Q67101749), transmembrane transport protein superfamily (Q68461428) | Transporter Classification database (Q142667) | Transporter Classification Database ID (P7260) | E278 |
| protein domain (Q898273), structural motif (Q3273544), supersecondary structure (Q7644128), binding site (Q616005) | InterPro (Q3047275) | InterPro ID (P2926) | |
| protein family associated with domain (Q81505329) | InterPro (Q3047275) | InterPro ID (P2926) | |
| protein complex (Q420927) (human) | Reactome (Q2134522) | Reactome ID (P3937) | E186, E194 |
| biological process (Q2996394) (non-species specific) | Gene Ontology (Q135085) | Gene Ontology ID (P686) | |
| biological process (Q2996394) = biological pathway (Q4915012) (human) | Reactome (Q2134522) | Reactome ID (P3937) | |

  1. family of subunits of protein complexes (Q83343207) is a subset of protein family, mostly defined by InterPro families. Again, no dedicated database exists that associates these with species-independent complex families.
  2. group or class of proteins (Q84467700) should be used for everything outside the box, especially small sets or families of non-homologous proteins.
  3. EC is being replaced by GO for enzymes, as GO clearly separates the enzyme from its enzymatic activity. For example, EC has entries with multiple activities, and multiple entries with the same activity (for different taxa), which is not a clean approach.
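
The two checks implied above (does an item with this canonical ID already exist, and which items still lack the ID) can be sketched as queries for the Wikidata Query Service. This is a minimal sketch rather than a prescribed workflow: it reuses protein (Q8054), UniProt protein ID (P352) and the reelin accession P78509 from this page, the LIMIT value is an arbitrary choice, and the second query may well time out on the public endpoint given the number of protein items.

```sparql
# Before creating a protein item: does an item with this UniProt ID already exist?
# "P78509" (reelin) is the example accession used in the tables on this page.
SELECT ?item ?label WHERE {
  ?item wdt:P352 "P78509" .
  OPTIONAL {
    ?item rdfs:label ?label .
    FILTER(lang(?label) = 'en')
  }
}
```

```sparql
# Maintenance check: protein items that lack their canonical identifier.
# LIMIT 100 is an arbitrary cap for illustration.
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q8054 .
  FILTER NOT EXISTS { ?item wdt:P352 ?anyId . }
}
LIMIT 100
```

The same pattern should carry over to the other main classes by swapping in the class and property from the corresponding table row.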


General properties for genes and proteins


See the properties that the ProteinBoxBot understands.

Application of data


Identifier Properties


Human genes

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| Entrez Gene ID | P351 | External identifier | identifier for a gene per the NCBI Entrez database | CDK2 <Entrez Gene ID> 1017 | - |
| HGNC gene symbol | P353 | External identifier | the official gene symbol approved by the HGNC, which is typically a short form of the gene name | RELN <HGNC gene symbol> RELN | - |
| HGNC ID | P354 | External identifier | a unique ID provided by the HGNC for each gene with an approved symbol. HGNC IDs remain stable even if a name or symbol changes | RELN <HGNC ID> 9957 | - |
| OMIM ID | P492 | External identifier | disease, gene and phenotype: Online "Mendelian Inheritance in Man" catalogue codes for diseases, genes, or phenotypes | Huntington's disease <OMIM ID> 143100 | - |
| Ensembl gene ID | P594 | External identifier | gene: identifier for a gene as per the Ensembl (European Bioinformatics Institute and the Wellcome Trust Sanger Institute) database | MB <Ensembl gene ID> ENSG00000198125 | - |
| genomic start | P644 | String | biological sequence: genomic starting coordinate of the biological sequence (e.g. a gene) | RELN <genomic start> 103112231 | - |
| genomic end | P645 | String | biological sequence: genomic ending coordinate of the biological sequence (e.g. a gene) | RELN <genomic end> 103629963 | - |
| genomic assembly | P659 | Item | genome assembly: specifies the genome assembly on which the feature is placed | RELN <genomic assembly> genome assembly GRCh38 | - |
| HomoloGene ID | P593 | String | identifier in the HomoloGene database | rhodopsin <HomoloGene ID> 68068 | - |
| RefSeq genome ID | P2249 | External identifier | ID in the RefSeq Genome database | Chlamydia trachomatis D/UW-3/CX chromosome <RefSeq genome ID> NC_000117 | - |
| dbSNP Reference SNP number | P6861 | External identifier | identifier used in dbSNP to uniquely identify a genetic variant | BRAF V600E <dbSNP Reference SNP number> rs113488022 | - |

  • proposed: Alias (other gene symbols, e.g. retired ones, used to name this gene). Note that item labels also have aliases outside the property structure.
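
As an illustration of how the identifier properties in this table fit together, the following query looks up a gene by its HGNC gene symbol and collects its other identifiers. It is a sketch only: "RELN" is the example used in the table, the variable names are made up for illustration, and the optional identifiers are returned only where the item actually carries them.

```sparql
# Look up a human gene by HGNC gene symbol and collect its other identifiers.
SELECT ?gene ?entrez ?hgncId ?ensembl WHERE {
  ?gene wdt:P353 "RELN" ;               # HGNC gene symbol
        wdt:P351 ?entrez .              # Entrez Gene ID (canonical for genes)
  OPTIONAL { ?gene wdt:P354 ?hgncId . } # HGNC ID
  OPTIONAL { ?gene wdt:P594 ?ensembl . }# Ensembl gene ID
}
```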

Human proteins

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| UniProt protein ID | P352 | External identifier | identifier for a protein per the UniProt database | reelin <UniProt protein ID> P78509 | - |
| PDB structure ID | P638 | External identifier | identifier for 3D structural data as per the PDB (Protein Data Bank) database | hydroxysteroid 11-beta dehydrogenase 1 <PDB structure ID> 4P38 and 1XU7 | - |
| EC enzyme number | P591 | String | Enzyme Commission number: classification scheme for enzymes | Triacylglycerol lipase <EC enzyme number> 2.7.3.2 | - |
| RefSeq protein ID | P637 | External identifier | identifier for a protein | reelin <RefSeq protein ID> NP_005036 | - |
| Ensembl protein ID | P705 | External identifier | identifier for a protein issued by Ensembl database | reelin <Ensembl protein ID> ENSP00000392423 and ENSP00000345694 | - |
| Transporter Classification Database ID | P7260 | External identifier | classifies transport proteins similar to how EC classifies enzymes | P-type ATPase <Transporter Classification Database ID> 3.A.3 | - |

Mouse genes

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| Mouse Genome Informatics ID | P671 | External identifier | identifier for a gene in the Mouse Genome Informatics database | myoglobin <Mouse Genome Informatics ID> MGI:96922 | - |

Mouse proteins


Unsorted

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| RefSeq RNA ID | P639 | External identifier | RNA Identifier | RELN <RefSeq RNA ID> NM_005045 | - |
| chromosome | P1057 | Item | chromosome: chromosome on which an entity is localized | RELN <chromosome> human chromosome 7 | - |

Proposed Media Properties

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| chemical structure | P117 | Commons media file | chemical structure and structural formula: image of a representation of the structure for a chemical compound | methane <chemical structure> Methan Keilstrich.svg | - |
| Gene Atlas image | P692 | Commons media file | image showing the GeneAtlas expression pattern | RELN <Gene Atlas image> PBB GE RELN 205923 at tn.png | - |

Proposed properties linking genes to other biological concepts (cell components, processes, etc.)

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| found in taxon | P703 | Item | natural product: the taxon in which the item can be found | RELN <found in taxon> human | - |
| cell component | P681 | Item | cellular component: component of the cell in which this item is present | reelin <cell component> cytoplasm | - |
| biological process | P682 | Item | biological process: is involved in the biological process | Neurotrophin 3 <biological process> positive regulation of MAP kinase activity | - |
| molecular function | P680 | Item | molecular function: represents gene ontology function annotations | RELN <molecular function> metal ion binding | - |
| regulates (molecular biology) | P128 | Item | process regulated by a protein or RNA in molecular biology | reelin <regulates (molecular biology)> nervous system development | - |
| encodes | P688 | Item | the product of a gene (protein or RNA) | RELN <encodes> reelin | encoded by |
| encoded by | P702 | Item | the gene that encodes some gene product | reelin <encoded by> RELN | encodes |

Notes:

  • P682: As in, reelin is involved in the process of neuron migration. Use this property to represent Gene Ontology process annotations. A biological process is defined as "operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms"; see Gene Ontology. This biological process (Q2996394) property is a predicate that links a gene or protein subject such as BRCA1 (Q227339) with a specific biological process object such as DNA repair (Q210538). A typical reference for the statement would be a link to the subject's entry on the Gene Ontology website. For the BRCA1-biological process-DNA repair example above, the reference would be http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:C6YB45.
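
The subject-predicate-object pattern described in this note, together with its references, can be read back with a query along the following lines. This is a sketch under assumptions: it assumes the process statements sit directly on BRCA1 (Q227339) and that references are stored as reference URL (P854); statements referenced in another way (or stored on the protein item instead) would need a different pattern, and the query may return no rows.

```sparql
# Biological process (P682) statements on BRCA1 (Q227339), with reference URLs if present.
SELECT ?process ?processLabel ?refUrl WHERE {
  wd:Q227339 p:P682 ?statement .        # full statement node, not just the value
  ?statement ps:P682 ?process .         # the biological process object
  OPTIONAL {
    ?statement prov:wasDerivedFrom ?reference .
    ?reference pr:P854 ?refUrl .        # assumption: reference given as a URL
  }
  ?process rdfs:label ?processLabel .
  FILTER(lang(?processLabel) = 'en')
}
```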

| Property | Datatype | Creation level | Description | Links | Comments |
|---|---|---|---|---|---|
| Taxon | Item | Proposal | Taxon/species in which the gene or protein is encoded | | |
| contains_domain | Item | Proposal | As in, Reelin contains the domains "Reeler domain" and "BNR/Asp-box repeat" | | |

Proposed Properties linking genes to genes

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| physically interacts with | P129 | Item | physical contact: physical entity that the subject interacts with | track chain <physically interacts with> soil | - |
| ortholog | P684 | Item | orthology: orthologous gene in another species (use with 'species' qualifier) | RELN <ortholog> GUF1 | - |

| Property | Datatype | Creation level | Description | Links | Comments |
|---|---|---|---|---|---|
| Activates | Item | Proposal | The product of this gene activates the function of the target gene | | |
| Inhibits | Item | Proposal | The product of this gene inhibits the function of the target gene | | |
| Binds to | Item | Proposal | The product of this gene binds to the product of the target gene | | |
| Phenotype | Item | Proposal | See use in http://string-db.org | | |
| Catalysis | Item | Proposal | See use in String database | | |
| Post-translationally-modifies | Item | Proposal | See use in String database | | |
| Reaction | Item | Proposal | See use in String database | | |
| Expression | Item | Proposal | See use in String database | | |
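
For the existing ortholog (P684) property, which is meant to be used with a species qualifier, a query that reads the qualifier alongside the value might look like the sketch below. The choice of found in taxon (P703) as the qualifier is an assumption made for illustration; check actual statements for the qualifier in use. "RELN" is again the example gene from this page.

```sparql
# Orthologs of a gene, with the species qualifier where one is present.
SELECT ?ortholog ?orthologLabel ?taxon WHERE {
  ?gene wdt:P353 "RELN" ;               # find the gene by HGNC gene symbol
        p:P684 ?statement .             # ortholog statement node
  ?statement ps:P684 ?ortholog .
  OPTIONAL { ?statement pq:P703 ?taxon . }  # assumed qualifier: found in taxon
  OPTIONAL {
    ?ortholog rdfs:label ?orthologLabel .
    FILTER(lang(?orthologLabel) = 'en')
  }
}
```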


Proposed Properties linking proteins to proteins

| Property | Datatype | Creation level | Description | Links | Comments |
|---|---|---|---|---|---|
| Phosphorylates substrate | Item | Proposal | This kinase reportedly phosphorylates the target protein substrate | https://en.wikipedia.org/wiki/Protein_phosphorylation | As the most abundant post-translational modification, modelling this property separately is interesting |

General properties for genomics

| Property | Datatype | Creation level | Description | Links | Comments |
|---|---|---|---|---|---|
| Genome size (or Genome length) | Number | Proposal | The size (or length) of the genome for a given species | wikipedia:Genome_size | Currently being discussed here: Wikidata:Property_proposal/Natural_science#Genome_size |
| Number of genes | Number | Proposal | The number of genes for a given species | | |
| Nucleic acid type | String | Proposal | One of: ssDNA / dsDNA / ssRNA / dsRNA | | |
| Number of chromosomes | Number | Proposal | The number of chromosomes in a genome | | |

  • proposed: Genome assembly database identifiers. See [1]
  • proposed: ENA Sequence identifier.

General properties for pathways


Proposed identifier properties

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| KEGG ID | P665 | External identifier | identifier from databases dealing with genomes, enzymatic pathways, and biological chemicals | DL-ascorbic acid <KEGG ID> D00018 | - |

| Property | Datatype | Creation level | Description | Links | Comments |
|---|---|---|---|---|---|
| Wikipathways ID | String | Proposal | WikiPathways Identifier. | http://www.wikipathways.org | |

Drugs


Identifiers

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| Guide to Pharmacology Ligand ID | P595 | External identifier | ligand identifier of the Guide to Pharmacology database | cocaine <Guide to Pharmacology Ligand ID> 2286 | - |
| ChEMBL ID | P592 | External identifier | identifier from a chemical database of bioactive molecules with drug-like properties | tropicamide <ChEMBL ID> CHEMBL1200604 | - |
| HomoloGene ID | P593 | String | identifier in the HomoloGene database | rhodopsin <HomoloGene ID> 68068 | - |
| DrugBank ID | P715 | External identifier | identifier in the bioinformatics and cheminformatics database from the University of Alberta | vitamin C <DrugBank ID> DB00126 | - |
| ChemSpider ID | P661 | External identifier | identifier in a free chemical database, owned by the Royal Society of Chemistry | (RS)-methadone <ChemSpider ID> 3953 | - |

Interactions

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| significant drug interaction | P769 | Item | drug interaction: clinically significant interaction between two pharmacologically active substances (i.e., drugs and/or active metabolites) where concomitant intake can lead to altered effectiveness or adverse drug events. | (RS)-warfarin <significant drug interaction> lovastatin | - |



Notes: The following drug objects should serve as the unifying examples for drugs in Wikidata. To cover all major identifiers, several new properties will be requested shortly (e.g. WHO INN, USAN).

Taxa


Identifiers

| Title | ID | Data type | Description | Examples | Inverse |
|---|---|---|---|---|---|
| NCBI taxonomy ID | P685 | External identifier | identifier for a taxon in the Taxonomy Database by the National Center for Biotechnology Information | human <NCBI taxonomy ID> 9606 | - |
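
The taxon identifier combines naturally with found in taxon (P703) from the tables above. The following sketch resolves a taxon by its NCBI taxonomy ID and lists some genes found in it; "9606" is the example value from the table, and the LIMIT is an arbitrary cap, since the full result set for a well-annotated species is large.

```sparql
# Genes found in the taxon with NCBI taxonomy ID 9606 (human), with their Entrez Gene IDs.
SELECT ?gene ?entrez WHERE {
  ?taxon wdt:P685 "9606" .              # resolve the taxon item by NCBI taxonomy ID
  ?gene wdt:P703 ?taxon ;               # found in taxon
        wdt:P351 ?entrez .              # Entrez Gene ID
}
LIMIT 20
```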

Modeling questions


Why do we have both, e.g., "peptidase" (class of enzymes) and "peptidase activity" (molecular function)? Can't they be merged?


No. An enzyme can have multiple functions; see the instances and subclasses of multifunctional enzyme (Q67211934) returned by this query:

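# Instances and (transitive) subclasses of multifunctional enzyme (Q67211934)
SELECT ?item ?label WHERE {
  { ?item wdt:P31 wd:Q67211934. }      # direct instances
  UNION
  { ?item wdt:P279+ wd:Q67211934. }    # subclasses at any depth
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')
}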

In principle, enzymes also have binding functions for their substrates and products, e.g. ATP binding (Q14817981).

Why is the EC (Enzyme Commission) no longer used as normative?


We use exact/broad mappings to Gene Ontology function entities to build our enzyme hierarchy. EC was never consistent: it had both multifunctional entries and very narrow or species-specific sub-entries at the same level. In contrast, GO functions are never defined by gene product or taxon.
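
To inspect these mappings, one could list enzyme classes together with their molecular function (P680) statements and the mapping type recorded on each. In this sketch, the use of mapping relation type (P4390) as the qualifier that carries the exact/broad distinction is an assumption, and the LIMIT is arbitrary.

```sparql
# Enzyme classes with their GO molecular function and the mapping type, if recorded.
SELECT ?enzymeClass ?function ?mappingType WHERE {
  ?enzymeClass wdt:P31 wd:Q67015883 ;   # group or class of enzymes
               p:P680 ?statement .      # molecular function statement node
  ?statement ps:P680 ?function .
  OPTIONAL { ?statement pq:P4390 ?mappingType . }  # assumed qualifier: mapping relation type
}
LIMIT 50
```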

Why do we have both glutamine-tRNA synthetase (Q105722884) (class of enzymes) and glutamine-tRNA synthetase (Q24785187) (InterPro family)?


Because the first is an abstract concept and the second is a specific set of proteins defined by InterPro (which also keeps changing invisibly). In practice, if a new organism were discovered outside the known taxa, its glutaminyl-tRNA synthetase would automatically be a member of the first (open) set, but possibly not of the second. This also means that InterPro families are frequently subgroups of the protein classes suggested by their titles. (This is also the reason why function statements on InterPro families should never have the exact mapping type.)

Why do I get WD40 repeat (Q7948257) and WD40 repeat, protein family (Q95350717) if I search for IPR001680?


Because we also want a concept for the protein domain itself, so that we can make statements about it. Unfortunately, an InterPro domain entry stands both for the domain and for the set of proteins that have that domain (according to the computational rules InterPro applies), so we need two items, and we want to link both to the InterPro entry.
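
The effect can be reproduced with a query on the InterPro ID (P2926), which should return every item linked to the entry together with its class. A minimal sketch, using the IPR001680 example from the question above:

```sparql
# All items linked to InterPro entry IPR001680, with what each is an instance of.
SELECT ?item ?itemLabel ?class ?classLabel WHERE {
  ?item wdt:P2926 "IPR001680" ;         # InterPro ID
        wdt:P31 ?class .                # e.g. domain vs. family associated with the domain
  ?item rdfs:label ?itemLabel .
  ?class rdfs:label ?classLabel .
  FILTER(lang(?itemLabel) = 'en' && lang(?classLabel) = 'en')
}
```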

Why are (some) Gene Ontology protein complexes (cellular components) also instances of family of protein complexes (Q78155096)?


This is a temporary solution, but not far wrong. Unlike Complex Portal entries and Reactome protein complexes, Gene Ontology complexes are species-independent, so if the set of all complexes defined by such an entry is homologous, then it is a family. The condition, of course, is that the parts are always the same, i.e. they come from the same protein families. This can change and make sub-entries necessary later. At the moment, however, this slowly growing part of Wikidata is, to our knowledge, the only existing database of species-independent protein complexes that links them to the families of their parts.

For an overview, there are currently 2,579 complexes in GO. We have annotated the parts of 35 of them, and encourage you to add those you are interested in. Query:

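# GO complexes that are also families of protein complexes and have annotated parts
SELECT DISTINCT ?item ?label WHERE {
  ?item wdt:P31 wd:Q78155096.   # family of protein complexes
  ?item wdt:P31 wd:Q5058355.    # cellular component
  ?item wdt:P2670 [].           # at least one P2670 (parts) statement
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')
}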