Wikidata:WikiProject Molecular biology/Properties

Goals

This page aims to organize a consensus view of the properties that describe molecular biology concepts. Please be bold and add your suggestions below! (For example, what property should we create to capture connections between genes and the categories defined by the Gene Ontology or the Disease Ontology?)

Rules of this page:

Feel free to add a new property for discussion in the tables, and set its "Creation level" to "Proposal".
Please use the talk page to discuss the creation or use of properties. To discuss a single property, create a new section on the talk page, set the "Creation level" to "discussion", and link the property in the table to that section.

Other relevant pages

The general-purpose property discussions are happening at Wikidata:Property_proposal.

Property proposals section for molecular biology

The list of created properties is at Wikidata:List_of_properties.
Wikidata API documentation
API Sandbox (test your queries)

Understanding properties: Properties link to particular datatypes. http://meta.wikimedia.org/wiki/Wikidata/Data_model#Datatypes_and_their_Values

See examples on the (currently much more complete) Wikidata:Chemistry task force/Properties.

Main classes and their canonical database

Before turning to the properties of the main classes, here are the classes and their canonical databases. A new instance of a main class should always carry an ID from its canonical database. It is possible, e.g., to add a protein without its UniProt identifier, but the entry will then not be updated by the bot when changes occur, and curators might avoid it. It also means that if you want to import new items of the main classes, duplicates are likely, and you are responsible for avoiding them beforehand. Creating many duplicates is frowned upon, because they have to be found and merged, which wastes a lot of time.

Main class	Canonical database	Corresponding property	Possible EntitySchema
gene (Q7187)	Entrez (Q1345229)	Entrez Gene ID (P351)	E37 (human only), E165 (virus only), E74, E252
protein (Q8054)	UniProt (Q905695)	UniProt protein ID (P352)	E167 (general), E38 (human only), E169 (virus only)
protein superfamily (Q7251477) and protein family (Q417841), except enzymes and transporters [note 1][note 2]	InterPro (Q3047275)	InterPro ID (P2926)	E233
group or class of enzymes (Q67015883)	Gene Ontology (Q135085) [note 3]	molecular function (P680)	E277
group or class of transmembrane transport proteins (Q67101749) transmembrane transport protein superfamily (Q68461428)	Transporter Classification database (Q142667)	Transporter Classification Database ID (P7260)	E278
protein domain (Q898273), structural motif (Q3273544), supersecondary structure (Q7644128), binding site (Q616005)	InterPro (Q3047275)	InterPro ID (P2926)
protein family associated with domain (Q81505329)	InterPro (Q3047275)	InterPro ID (P2926)
protein complex (Q420927) (human)	Reactome (Q2134522)	Reactome ID (P3937)	E186, E194
biological process (Q2996394) (non-species specific)	Gene Ontology (Q135085)	Gene Ontology ID (P686)
biological process (Q2996394) = biological pathway (Q4915012) (human)	Reactome (Q2134522)	Reactome ID (P3937)

[note 1] family of subunits of protein complexes (Q83343207) are a subset of protein family, mostly defined by InterPro families. Again, no dedicated database exists that associates these with species-independent complex families.
[note 2] group or class of proteins (Q84467700) should be used for everything outside the box, especially small sets or families of non-homologous proteins.
[note 3] EC is being replaced by GO for enzymes, as GO clearly separates enzyme from enzymatic activity. For example, EC has entries with multiple activities, and multiple entries with the same activity (for different taxa); this is not a clean approach.

Also, family of protein complexes (Q78155096) are a subset of Gene Ontology's cellular component (Q5058355), which already contains many species-independent complex families. However, at the moment there is no dedicated database.
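The duplicate check that importers are asked to do before adding items of the main classes could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's tooling: the function name `partition_candidates`, the record layout, and the `canonical_id` key are hypothetical, and the set of IDs already present in Wikidata (for example the existing UniProt protein ID (P352) values) is assumed to have been fetched beforehand.

```python
def partition_candidates(candidates, existing_ids, id_key="canonical_id"):
    """Split import candidates into new records, duplicates, and records without an ID.

    candidates   -- iterable of dicts describing items to import (hypothetical layout)
    existing_ids -- canonical IDs already present in Wikidata, fetched beforehand
    id_key       -- dict key that holds the canonical ID (assumed name)
    """
    new, duplicates, missing_id = [], [], []
    seen = set(existing_ids)  # also catches repeats inside the batch itself
    for record in candidates:
        canonical_id = record.get(id_key)
        if not canonical_id:
            missing_id.append(record)   # cannot be matched against existing items
        elif canonical_id in seen:
            duplicates.append(record)   # already in Wikidata or earlier in the batch
        else:
            seen.add(canonical_id)
            new.append(record)
    return new, duplicates, missing_id
```

Under this sketch only the `new` records would be created; `duplicates` would be matched to the existing items, and `missing_id` records would need a canonical ID before import.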

General properties for genes and proteins

See the properties that the ProteinBoxBot understands.

Application of data

The 10,500+ articles that use en:Template:GNF_Protein_box

Identifier Properties

Human genes

Title	ID	Data type	Description	Examples	Inverse
Entrez Gene ID	P351	External identifier	identifier for a gene per the NCBI Entrez database	CDK2 <Entrez Gene ID> 1017	-
HGNC gene symbol	P353	External identifier	the official gene symbol approved by the HGNC, which is typically a short form of the gene name	RELN <HGNC gene symbol> RELN	-
HGNC ID	P354	External identifier	a unique ID provided by the HGNC for each gene with an approved symbol. HGNC IDs remain stable even if a name or symbol changes	RELN <HGNC ID> 9957	-
OMIM ID	P492	External identifier	disease, gene and phenotype: Online "Mendelian Inheritance in Man" catalogue codes for diseases, genes, or phenotypes	Huntington's disease <OMIM ID> 143100	-
Ensembl gene ID	P594	External identifier	gene: identifier for a gene as per the Ensembl (European Bioinformatics Institute and the Wellcome Trust Sanger Institute) database	MB <Ensembl gene ID> ENSG00000198125	-
genomic start	P644	String	biological sequence: genomic starting coordinate of the biological sequence (e.g. a gene)	RELN <genomic start> 103112231	-
genomic end	P645	String	biological sequence: genomic ending coordinate of the biological sequence (e.g. a gene)	RELN <genomic end> 103629963	-
genomic assembly	P659	Item	genome assembly: specifies the genome assembly on which the feature is placed	RELN <genomic assembly> genome assembly GRCh38	-
HomoloGene ID	P593	String	identifier in the HomoloGene database	rhodopsin <HomoloGene ID> 68068	-
RefSeq genome ID	P2249	External identifier	ID in the RefSeq Genome database	Chlamydia trachomatis D/UW-3/CX chromosome <RefSeq genome ID> NC_000117	-
dbSNP Reference SNP number	P6861	External identifier	identifier used in dbSNP to uniquely identify a genetic variant	BRAF V600E <dbSNP Reference SNP number> rs113488022	-

proposed:: Alias (other gene symbols, e.g. retired ones, used to name this gene). Note that item labels also have aliases outside the property structure.
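Because genomic start (P644) and genomic end (P645) are typed as String, any arithmetic on them needs an explicit conversion. The sketch below computes the span of a feature from the two values. It assumes that both coordinates refer to the same genomic assembly (P659) and chromosome and that they are 1-based and inclusive; that convention is an assumption here, not something this page states, and `feature_length` is a hypothetical helper name.

```python
def feature_length(genomic_start, genomic_end):
    """Return the length in base pairs spanned by two coordinate strings.

    Assumes 1-based, inclusive coordinates on the same assembly and chromosome.
    """
    start, end = int(genomic_start), int(genomic_end)
    low, high = min(start, end), max(start, end)  # tolerate reversed order
    return high - low + 1


# RELN example values from the table above
print(feature_length("103112231", "103629963"))  # 517733
```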

Human proteins

Title	ID	Data type	Description	Examples	Inverse
UniProt protein ID	P352	External identifier	identifier for a protein per the UniProt database	reelin <UniProt protein ID> P78509	-
PDB structure ID	P638	External identifier	identifier for 3D structural data as per the PDB (Protein Data Bank) database	hydroxysteroid 11-beta dehydrogenase 1 <PDB structure ID> 4P38 and 1XU7	-
EC enzyme number	P591	String	Enzyme Commission number: classification scheme for enzymes	Triacylglycerol lipase <EC enzyme number> 3.1.1.3	-
RefSeq protein ID	P637	External identifier	identifier for a protein	reelin <RefSeq protein ID> NP_005036	-
Ensembl protein ID	P705	External identifier	identifier for a protein issued by Ensembl database	reelin <Ensembl protein ID> ENSP00000392423 and ENSP00000345694	-
Transporter Classification Database ID	P7260	External identifier	classifies transport proteins similar to how EC classifies enzymes	P-type ATPase <Transporter Classification Database ID> 3.A.3	-
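Before adding identifier statements such as those in the table above, it can help to check that a value at least has the expected shape. The following is a minimal sketch: the UniProt pattern follows the accession format published in the UniProt help pages, while the PDB pattern only covers the classic four-character IDs and is an assumption about what this page's examples need. The helper names are hypothetical, and a matching format does not prove that the entry exists.

```python
import re

# Accession pattern as published in the UniProt help pages.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Classic four-character PDB IDs: a digit 1-9 followed by three alphanumerics.
PDB_ID = re.compile(r"[1-9][A-Za-z0-9]{3}")


def looks_like_uniprot_accession(value):
    """True if value has the shape of a UniProt accession (P352)."""
    return UNIPROT_ACCESSION.fullmatch(value) is not None


def looks_like_pdb_id(value):
    """True if value has the shape of a classic PDB structure ID (P638)."""
    return PDB_ID.fullmatch(value) is not None


print(looks_like_uniprot_accession("P78509"))     # True
print(looks_like_uniprot_accession("NP_005036"))  # False, this is a RefSeq protein ID
print(looks_like_pdb_id("4P38"))                  # True
```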

Mouse genes

Title	ID	Data type	Description	Examples	Inverse
Mouse Genome Informatics ID	P671	External identifier	identifier for a gene in the Mouse Genome Informatics database	myoglobin <Mouse Genome Informatics ID> MGI:96922	-

Mouse proteins

Unsorted

Title	ID	Data type	Description	Examples	Inverse
RefSeq RNA ID	P639	External identifier	RNA Identifier	RELN <RefSeq RNA ID> NM_005045	-
chromosome	P1057	Item	chromosome: chromosome on which an entity is localized	RELN <chromosome> human chromosome 7	-

Proposed Media Properties

Title	ID	Data type	Description	Examples	Inverse
chemical structure	P117	Commons media file	chemical structure and structural formula: image of a representation of the structure for a chemical compound	methane <chemical structure> Methan Keilstrich.svg	-
Gene Atlas image	P692	Commons media file	image showing the GeneAtlas expression pattern	RELN <Gene Atlas image> PBB GE RELN 205923 at tn.png	-

Proposed properties linking genes to other biological concepts (cell components, processes, etc.)

Title	ID	Data type	Description	Examples	Inverse
found in taxon	P703	Item	natural product: the taxon in which the item can be found	RELN <found in taxon> human	-
cell component	P681	Item	cellular component: component of the cell in which this item is present	reelin <cell component> cytoplasm	-
biological process	P682	Item	biological process: is involved in the biological process	Neurotrophin 3 <biological process> positive regulation of MAP kinase activity	-
molecular function	P680	Item	molecular function: represents gene ontology function annotations	RELN <molecular function> metal ion binding	-
regulates (molecular biology)	P128	Item	process regulated by a protein or RNA in molecular biology	reelin <regulates (molecular biology)> nervous system development	-
encodes	P688	Item	the product of a gene (protein or RNA)	RELN <encodes> reelin	encoded by
encoded by	P702	Item	the gene that encodes some gene product	reelin <encoded by> RELN	encodes

Notes:

P682: As in, reelin is involved in the process of neuron migration. Use it to represent Gene Ontology process annotations. Gene Ontology defines a biological process as "operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms"; see Gene Ontology. This biological process (Q2996394) property is a predicate that links a gene or protein subject such as BRCA1 (Q227339) with a specific biological process object such as DNA repair (Q210538). A typical reference for the statement is a link to the subject's entry on the Gene Ontology website. For the BRCA1-biological process-DNA repair example above, the reference would be http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:C6YB45.

Property	Datatype	Creation level	Description	Links	Comments
Taxon	Item	Proposal	Taxon / species in which the gene/protein is encoded
contains_domain	Item	Proposal	As in, Reelin contains the domains "Reeler domain" and "BNR/Asp-box repeat"
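The tables above list encodes (P688) and encoded by (P702) as inverses of each other, so a statement in one direction implies one in the other. A rough way to find missing inverse statements in a set of triples is sketched below. The triple layout and the function name are hypothetical, and the labels stand in for item IDs purely for readability.

```python
# Inverse pairs taken from the "Inverse" column of the table above.
INVERSE = {"P688": "P702", "P702": "P688"}


def missing_inverse_statements(triples):
    """Return the (subject, property, object) triples implied by inverses but absent.

    triples -- list of (subject, property, object) tuples
    """
    present = set(triples)
    missing = []
    for subject, prop, obj in triples:
        inverse = INVERSE.get(prop)
        if inverse is None:
            continue  # property has no declared inverse
        expected = (obj, inverse, subject)
        if expected not in present and expected not in missing:
            missing.append(expected)
    return missing


statements = [("RELN", "P688", "reelin")]
print(missing_inverse_statements(statements))  # [('reelin', 'P702', 'RELN')]
```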

Proposed Properties linking genes to genes

Title	ID	Data type	Description	Examples	Inverse
physically interacts with	P129	Item	physical contact: physical entity that the subject interacts with	track chain <physically interacts with> soil	-
ortholog	P684	Item	orthology: orthologous gene in another species (use with 'species' qualifier)	RELN <ortholog> GUF1	-

Property	Datatype	Creation level	Description
Activates	Item	Proposal	The product of this gene activates the function of the target gene
Inhibits	Item	Proposal	The product of this gene inhibits the function of the target gene
Binds to	Item	Proposal	The product of this gene binds to the product of the target gene
Phenotype	Item	Proposal	See use in http://string-db.org
Catalysis	Item	Proposal	See use in String database
Post-translationally-modifies	Item	Proposal	See use in String database
Reaction	Item	Proposal	See use in String database
Expression	Item	Proposal	See use in String database

Proposed Properties linking proteins to proteins

Property	Datatype	Creation level	Description	Links	Comments
Phosphorylates substrate	Item	Proposal	This kinase reportedly phosphorylates the target protein substrate	https://en.wikipedia.org/wiki/Protein_phosphorylation	As the most abundant post-translational modification, modelling this property separately is interesting

General properties for genomics

Property	Datatype	Creation level	Description	Links	Comments
Genome size (or Genome length)	Number	Proposal	The size (or length) of the genome for a given species	wikipedia:Genome_size	Currently being discussed here: Wikidata:Property_proposal/Natural_science#Genome_size
Number of genes	Number	Proposal	The number of genes for a given species
Nucleic acid type	String	Proposal	Is it: ssDNA / dsDNA / ssRNA / dsRNA
Number of chromosomes	Number	Proposal	The number of chromosomes in a genome

proposed:: Genomes assembly database identifiers. See [1]
proposed:: ENA Sequence identifier.

General properties for pathways

Proposed identifier properties

Title	ID	Data type	Description	Examples	Inverse
KEGG ID	P665	External identifier	identifier from databases dealing with genomes, enzymatic pathways, and biological chemicals	DL-ascorbic acid <KEGG ID> D00018	-

Property	Datatype	Creation level	Description	Links	Comments
Wikipathways ID	String	Proposal	WikiPathways Identifier.	http://www.wikipathways.org

Drugs

Identifiers

Title	ID	Data type	Description	Examples	Inverse
Guide to Pharmacology Ligand ID	P595	External identifier	ligand identifier of the Guide to Pharmacology database	cocaine <Guide to Pharmacology Ligand ID> 2286	-
ChEMBL ID	P592	External identifier	identifier from a chemical database of bioactive molecules with drug-like properties	tropicamide <ChEMBL ID> CHEMBL1200604	-
HomoloGene ID	P593	String	identifier in the HomoloGene database	rhodopsin <HomoloGene ID> 68068	-
DrugBank ID	P715	External identifier	identifier in the bioinformatics and cheminformatics database from the University of Alberta	vitamin C <DrugBank ID> DB00126	-
ChemSpider ID	P661	External identifier	identifier in a free chemical database, owned by the Royal Society of Chemistry	(RS)-methadone <ChemSpider ID> 3953	-

Interactions

Title	ID	Data type	Description	Examples	Inverse
significant drug interaction	P769	Item	drug interaction: clinically significant interaction between two pharmacologically active substances (i.e., drugs and/or active metabolites) where concomitant intake can lead to altered effectiveness or adverse drug events.	(RS)-warfarin <significant drug interaction> lovastatin	-

Notes: The following drug objects should serve as the unifying examples for drugs in Wikidata. In order to include all major identifiers, several new properties will be requested shortly (e.g. WHO INN, USAN).

Taxa

Identifiers

Title	ID	Data type	Description	Examples	Inverse
NCBI taxonomy ID	P685	External identifier	identifier for a taxon in the Taxonomy Database by the National Center for Biotechnology Information	human <NCBI taxonomy ID> 9606	-

Modeling questions

Why do we have both, e.g., "peptidase" (class of enzymes) and "peptidase activity" (molecular function)? Can't they be merged?

No. An enzyme can have multiple functions, see multifunctional enzyme (Q67211934) in the query:

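# Instances and (transitive) subclasses of multifunctional enzyme (Q67211934)
SELECT ?item ?label WHERE {
  { ?item wdt:P31 wd:Q67211934. }    # direct instances
  UNION
  { ?item wdt:P279+ wd:Q67211934. }  # subclasses at any depth
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')        # English labels only
}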

Try it!

In principle, enzymes also have binding functions for their substrates and products, e.g. ATP binding (Q14817981).
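To run a query like the one in this answer from a script rather than in the browser, the query text can be sent to the public Wikidata Query Service endpoint. The sketch below only builds the request URL and does not perform the HTTP call; `build_query_url` is a hypothetical helper name, and the query string is a compact copy of the one above.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

MULTIFUNCTIONAL_ENZYME_QUERY = """
SELECT ?item ?label WHERE {
  { ?item wdt:P31 wd:Q67211934. } UNION { ?item wdt:P279/wdt:P279* wd:Q67211934. }
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')
}
"""


def build_query_url(sparql, endpoint=WDQS_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results of the query."""
    return endpoint + "?" + urlencode({"query": sparql.strip(), "format": "json"})


print(build_query_url(MULTIFUNCTIONAL_ENZYME_QUERY))
```

Fetching the URL with any HTTP client, ideally with a descriptive User-Agent header, should return the bindings for `?item` and `?label` as JSON.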

Why is the EC (Enzyme Commission) no longer used as normative?

We use exact/broad mapping to Gene Ontology function entities to build our enzyme hierarchy. EC was never consistent: it had both multifunctional entries and very narrow or species-specific sub-entries at the same level. In contrast, GO functions are never defined by gene product or taxon.

Why do we have both glutamine-tRNA synthetase (Q105722884) (class of enzymes) and glutamine-tRNA synthetase (Q24785187) (InterPro family)?

Because the first is an abstract concept and the second is a specific set of proteins defined by InterPro (which also keeps changing invisibly). In practice, if a new organism is discovered outside of the known taxa, its glutaminyl-tRNA synthetase would automatically be a member of the first (open) set, but possibly not of the second set. This also means that InterPro families are frequently subgroups of the protein classes suggested by their titles. (This is also the reason why function statements on InterPro families should never have the exact mapping type.)

Why do I get WD40 repeat (Q7948257) and WD40 repeat, protein family (Q95350717) if I search for IPR001680?

Because we also want an item for the protein domain itself, so that we can make statements about it. Unfortunately, InterPro domain entries stand both for the domain and for the set of proteins that have that domain (according to the computational rules InterPro applies), so we need two items, and we want to link both to the InterPro entry.
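The search for IPR001680 described in this answer can be reproduced as a query for all items that carry a given InterPro ID (P2926) value. The sketch below only generates the SPARQL text; `items_with_external_id` is a hypothetical helper, and the label service clause is the one offered by the Wikidata Query Service.

```python
import re


def items_with_external_id(property_id, value):
    """Build a SPARQL query listing every item whose property has exactly this value."""
    if not re.fullmatch(r"P[1-9][0-9]*", property_id):
        raise ValueError("not a property ID: " + repr(property_id))
    # Escape backslashes and double quotes so the value stays one string literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f'  ?item wdt:{property_id} "{escaped}".\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


print(items_with_external_id("P2926", "IPR001680"))
```

For IPR001680, the result would be expected to include both the domain item and the protein family item named in the question, since both are linked to the same InterPro entry.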

Why are (some) Gene Ontology protein complexes (cellular components) also instances of family of protein complexes (Q78155096)?

This is a temporary solution, but not a very wrong one. Unlike Complex Portal entries and Reactome protein complexes, Gene Ontology complexes are species-independent, so if the set of all complexes defined by such an entry is homologous, then it is a family. Of course, the condition is that the parts are always the same, i.e. they come from the same protein families. This can change and make sub-entries necessary later. However, at the moment this slowly growing part of Wikidata is, to our knowledge, the only existing database of species-independent protein complexes that links them to the families of their parts.

For an overview, there are currently 2,579 complexes in GO. We have annotated the parts of 35 of them, and encourage you to add those you are interested in. Query:

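# GO cellular components that are also families of protein complexes and have annotated parts
SELECT DISTINCT ?item ?label WHERE {
  ?item wdt:P31 wd:Q78155096.  # family of protein complexes
  ?item wdt:P31 wd:Q5058355.   # cellular component (Gene Ontology)
  ?item wdt:P2670 [].          # at least one P2670 (has parts of the class) statement
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')  # English labels only
}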

Try it!
