Wikidata:WikiProject Molecular biology/Properties
Goals
- This page aims to organize a consensus view of the properties that describe molecular biology concepts. Please be bold and add your suggestions below! (For example, what property should we create to capture connections between genes and the categories defined by the Gene Ontology or the Disease Ontology?)
Rules of this page:
- Feel free to add a new property for discussion in the tables. Set the "Creation level" to Proposal.
- Please use the talk page to discuss the creation or use of properties. To discuss a specific property, create a new section on the talk page, set the "Creation level" to Discussion, and link the property in the table to that section.
Other relevant pages
- The general-purpose property discussions are happening at Wikidata:Property_proposal.
- The list of created properties is at Wikidata:List_of_properties.
- Wikidata API documentation
- API Sandbox (test your queries)
Understanding properties: Properties link to particular datatypes. http://meta.wikimedia.org/wiki/Wikidata/Data_model#Datatypes_and_their_Values
See examples on the (currently much more complete) Wikidata:Chemistry task force/Properties.
Main classes and their canonical database
Before going into the properties of the main classes, here are the classes and their canonical databases. "Canonical" means that every new instance of a class should have an ID in that class's canonical database. It is possible, for example, to add a protein without its UniProt identifier, but the entry will not be updated by the bot when changes occur, and curators might avoid it. It also means that if you want to import new items of the main classes, duplicates are likely, and you are responsible for avoiding them beforehand. Creating many duplicates is frowned upon, because they have to be found and merged, which wastes a lot of time.
- family of subunits of protein complexes (Q83343207) is a subset of protein family, mostly defined by InterPro families. Again, no dedicated database exists that associates these with species-independent complex families.
- group or class of proteins (Q84467700) should be used for everything outside the box, especially small sets or families of non-homologous proteins.
- EC is being replaced by GO for enzymes, as GO clearly separates the enzyme from the enzymatic activity. For example, EC has entries with multiple activities and multiple entries with the same activity (for different taxa), which is not a clean approach.
Also,
- family of protein complexes (Q78155096) is a subset of Gene Ontology's cellular component (Q5058355), which already has many species-independent complex families. However, at the moment there is no dedicated database.
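One way to avoid duplicates before an import is to look up each canonical identifier first. The following is a minimal sketch, not an established project tool: the helper name build_duplicate_check_query is hypothetical, and it only builds the query text (for example for pasting into the query service) rather than running it. It uses the UniProt protein ID property (P352) and the reelin example from the tables below.

```python
import re


def build_duplicate_check_query(property_id: str, external_id: str) -> str:
    """Return a SPARQL query that finds items already carrying this identifier.

    Hypothetical helper for illustration; it does not contact any server.
    """
    if not re.fullmatch(r"P[1-9]\d*", property_id):
        raise ValueError(f"not a property ID: {property_id!r}")
    # Escape backslashes and double quotes so the value stays a valid
    # SPARQL string literal.
    literal = external_id.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item WHERE {\n"
        f'  ?item wdt:{property_id} "{literal}" .\n'
        "}\n"
        "LIMIT 5"
    )


# Does any item already have UniProt protein ID (P352) P78509?
query = build_duplicate_check_query("P352", "P78509")
print(query)
```

If the query returns an item, the protein already exists and should be updated rather than created again.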
General properties for genes and proteins
See the properties that the ProteinBoxBot understands.
Application of data
- The 10,500+ articles that use en:Template:GNF_Protein_box
Identifier Properties
Human genes
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
Entrez Gene ID | P351 | External identifier | identifier for a gene per the NCBI Entrez database | CDK2 <Entrez Gene ID> 1017 | - |
HGNC gene symbol | P353 | External identifier | the official gene symbol approved by the HGNC, which is typically a short form of the gene name | RELN <HGNC gene symbol> RELN | - |
HGNC ID | P354 | External identifier | a unique ID provided by the HGNC for each gene with an approved symbol. HGNC IDs remain stable even if a name or symbol changes | RELN <HGNC ID> 9957 | - |
OMIM ID | P492 | External identifier | disease, gene and phenotype: Online "Mendelian Inheritance in Man" catalogue codes for diseases, genes, or phenotypes | Huntington's disease <OMIM ID> 143100 | - |
Ensembl gene ID | P594 | External identifier | gene: identifier for a gene as per the Ensembl (European Bioinformatics Institute and the Wellcome Trust Sanger Institute) database | MB <Ensembl gene ID> ENSG00000198125 | - |
genomic start | P644 | String | biological sequence: genomic starting coordinate of the biological sequence (e.g. a gene) | RELN <genomic start> 103112231 | - |
genomic end | P645 | String | biological sequence: genomic ending coordinate of the biological sequence (e.g. a gene) | RELN <genomic end> 103629963 | - |
genomic assembly | P659 | Item | genome assembly: specifies the genome assembly on which the feature is placed | RELN <genomic assembly> genome assembly GRCh38 | - |
HomoloGene ID | P593 | String | identifier in the HomoloGene database | rhodopsin <HomoloGene ID> 68068 | - |
RefSeq genome ID | P2249 | External identifier | ID in the RefSeq Genome database | Chlamydia trachomatis D/UW-3/CX chromosome <RefSeq genome ID> NC_000117 | - |
dbSNP Reference SNP number | P6861 | External identifier | identifier used in dbSNP to uniquely identify a genetic variant | BRAF V600E <dbSNP Reference SNP number> rs113488022 | - |
- proposed: Alias (other gene symbols, e.g. retired ones, used to name this gene). Note that item labels also have aliases outside the property structure.
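Because genomic start (P644) and genomic end (P645) have the String datatype, their values must be converted to integers before any arithmetic. The sketch below uses the RELN example values from the table. It assumes that both coordinates refer to the same assembly and that they are 1-based and inclusive, which this page does not state. The function name gene_span_length is hypothetical.

```python
def gene_span_length(genomic_start: str, genomic_end: str) -> int:
    """Length in base pairs of the span between two coordinate strings.

    Assumes 1-based, inclusive coordinates on the same assembly
    (an assumption made for this example).
    """
    start, end = int(genomic_start), int(genomic_end)
    # abs() keeps the result positive even if the two values are swapped.
    return abs(end - start) + 1


# RELN example values from the table above (P644 and P645 are strings).
reln_length = gene_span_length("103112231", "103629963")
print(reln_length)  # 517733
```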
Human proteins
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
UniProt protein ID | P352 | External identifier | identifier for a protein per the UniProt database | reelin <UniProt protein ID> P78509 | - |
PDB structure ID | P638 | External identifier | identifier for 3D structural data as per the PDB (Protein Data Bank) database | hydroxysteroid 11-beta dehydrogenase 1 <PDB structure ID> 4P38 and 1XU7 | - |
EC enzyme number | P591 | String | Enzyme Commission number: classification scheme for enzymes | Triacylglycerol lipase <EC enzyme number> 3.1.1.3 | - |
RefSeq protein ID | P637 | External identifier | identifier for a protein | reelin <RefSeq protein ID> NP_005036 | - |
Ensembl protein ID | P705 | External identifier | identifier for a protein issued by Ensembl database | reelin <Ensembl protein ID> ENSP00000392423 and ENSP00000345694 | - |
Transporter Classification Database ID | P7260 | External identifier | classifies transport proteins similar to how EC classifies enzymes | P-type ATPase <Transporter Classification Database ID> 3.A.3 | - |
Mouse genes
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
Mouse Genome Informatics ID | P671 | External identifier | identifier for a gene in the Mouse Genome Informatics database | myoglobin <Mouse Genome Informatics ID> MGI:96922 | - |
Mouse proteins
Unsorted
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
RefSeq RNA ID | P639 | External identifier | RNA Identifier | RELN <RefSeq RNA ID> NM_005045 | - |
chromosome | P1057 | Item | chromosome: chromosome on which an entity is localized | RELN <chromosome> human chromosome 7 | - |
Proposed Media Properties
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
chemical structure | P117 | Commons media file | chemical structure and structural formula: image of a representation of the structure for a chemical compound | methane <chemical structure> Methan Keilstrich.svg | - |
Gene Atlas image | P692 | Commons media file | image showing the GeneAtlas expression pattern | RELN <Gene Atlas image> PBB GE RELN 205923 at tn.png | - |
Proposed properties linking genes to other biological concepts (cell components, processes, etc.)
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
found in taxon | P703 | Item | natural product: the taxon in which the item can be found | RELN <found in taxon> human | - |
cell component | P681 | Item | cellular component: component of the cell in which this item is present | reelin <cell component> cytoplasm | - |
biological process | P682 | Item | biological process: is involved in the biological process | Neurotrophin 3 <biological process> positive regulation of MAP kinase activity | - |
molecular function | P680 | Item | molecular function: represents gene ontology function annotations | RELN <molecular function> metal ion binding | - |
regulates (molecular biology) | P128 | Item | process regulated by a protein or RNA in molecular biology | reelin <regulates (molecular biology)> nervous system development | - |
encodes | P688 | Item | the product of a gene (protein or RNA) | RELN <encodes> reelin | encoded by |
encoded by | P702 | Item | the gene that encodes some gene product | reelin <encoded by> RELN | encodes |
Notes:
- P682: As in, reelin is involved in the process of neuron migration. Use it to represent Gene Ontology process annotations. A biological process is defined as "operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms"; see Gene Ontology. This biological process (Q2996394) property is a predicate that links a gene or protein subject like BRCA1 (Q227339) with a specific biological process object like DNA repair (Q210538). A typical reference for the statement would be a link to the subject's entry on the Gene Ontology website. For the BRCA1-biological process-DNA repair example above, the reference would be http://amigo.geneontology.org/cgi-bin/amigo/gp-assoc.cgi?gp=UniProtKB:C6YB45.
Property | Datatype | Creation level | Description | Links | Comments |
---|---|---|---|---|---|
Taxon | Item | Proposal | Taxon / species in which the gene/protein is encoded | ||
contains_domain | Item | Proposal | As in, Reelin contains the domains "Reeler domain" and "BNR/Asp-box repeat" |
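The table above lists encodes (P688) and encoded by (P702) as inverses of each other, so every statement in one direction should have a counterpart in the other. The following sketch shows one way to find missing counterparts in a local list of statements. The tuple representation and the function name missing_inverses are assumptions made for this example, and GENE_X / protein_x are made-up placeholders.

```python
from typing import Iterable, List, Tuple

# (subject, property ID, object); labels stand in for item IDs here.
Statement = Tuple[str, str, str]

ENCODES, ENCODED_BY = "P688", "P702"
INVERSE_OF = {ENCODES: ENCODED_BY, ENCODED_BY: ENCODES}


def missing_inverses(statements: Iterable[Statement]) -> List[Statement]:
    """Return the inverse statements that should exist but do not."""
    known = set(statements)
    missing = []
    for subject, prop, obj in sorted(known):
        inverse = INVERSE_OF.get(prop)
        # Properties without a listed inverse (e.g. P703) are skipped.
        if inverse is not None and (obj, inverse, subject) not in known:
            missing.append((obj, inverse, subject))
    return missing


statements = [
    ("RELN", "P688", "reelin"),        # from the table above
    ("reelin", "P702", "RELN"),        # its inverse is present
    ("RELN", "P703", "human"),         # found in taxon: no inverse listed
    ("GENE_X", "P688", "protein_x"),   # placeholder with no inverse
]
to_add = missing_inverses(statements)
print(to_add)  # [('protein_x', 'P702', 'GENE_X')]
```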
Proposed Properties linking genes to genes
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
physically interacts with | P129 | Item | physical contact: physical entity that the subject interacts with | track chain <physically interacts with> soil | - |
ortholog | P684 | Item | orthology: orthologous gene in another species (use with 'species' qualifier) | RELN <ortholog> GUF1 | - |
Property | Datatype | Creation level | Description | Links | Comments |
---|---|---|---|---|---|
Activates | Item | Proposal | The product of this gene activates the function of the target gene | ||
Inhibits | Item | Proposal | The product of this gene inhibits the function of the target gene | ||
Binds to | Item | Proposal | The product of this gene binds to the product of the target gene | ||
Phenotype | Item | Proposal | See use in http://string-db.org | ||
Catalysis | Item | Proposal | See use in String database | ||
Post-translationally-modifies | Item | Proposal | See use in String database | ||
Reaction | Item | Proposal | See use in String database | ||
Expression | Item | Proposal | See use in String database |
Proposed Properties linking proteins to proteins
Property | Datatype | Creation level | Description | Links | Comments |
---|---|---|---|---|---|
Phosphorylates substrate | Item | Proposal | This kinase reportedly phosphorylates the target protein substrate | https://en.wikipedia.org/wiki/Protein_phosphorylation | As the most abundant post-translational modification, modelling this property separately is interesting |
General properties for genomics
Property | Datatype | Creation level | Description | Links | Comments |
---|---|---|---|---|---|
Genome size (or Genome length) | Number | Proposal | The size (or length) of the genome for a given species | wikipedia:Genome_size | Currently being discussed here: Wikidata:Property_proposal/Natural_science#Genome_size |
Number of genes | Number | Proposal | The number of genes for a given species | ||
Nucleic acid type | String | Proposal | Is it: ssDNA / dsDNA / ssRNA / dsRNA | ||
Number of chromosomes | Number | Proposal | The number of chromosomes in a genome |
- proposed: Genome assembly database identifiers. See [1]
- proposed: ENA sequence identifier.
General properties for pathways
Proposed identifier properties
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
KEGG ID | P665 | External identifier | identifier from databases dealing with genomes, enzymatic pathways, and biological chemicals | DL-ascorbic acid <KEGG ID> D00018 | - |
Property | Datatype | Creation level | Description | Links | Comments |
---|---|---|---|---|---|
WikiPathways ID | String | Proposal | WikiPathways identifier | http://www.wikipathways.org |
Drugs
Identifiers
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
Guide to Pharmacology Ligand ID | P595 | External identifier | ligand identifier of the Guide to Pharmacology database | cocaine <Guide to Pharmacology Ligand ID> 2286 | - |
ChEMBL ID | P592 | External identifier | identifier from a chemical database of bioactive molecules with drug-like properties | tropicamide <ChEMBL ID> CHEMBL1200604 | - |
HomoloGene ID | P593 | String | identifier in the HomoloGene database | rhodopsin <HomoloGene ID> 68068 | - |
DrugBank ID | P715 | External identifier | identifier in the bioinformatics and cheminformatics database from the University of Alberta | vitamin C <DrugBank ID> DB00126 | - |
ChemSpider ID | P661 | External identifier | identifier in a free chemical database, owned by the Royal Society of Chemistry | (RS)-methadone <ChemSpider ID> 3953 | - |
Interactions
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
significant drug interaction | P769 | Item | drug interaction: clinically significant interaction between two pharmacologically active substances (i.e., drugs and/or active metabolites) where concomitant intake can lead to altered effectiveness or adverse drug events. | (RS)-warfarin <significant drug interaction> lovastatin | - |
Notes: The following drug objects should serve as the unifying examples for drugs in Wikidata. To include all major identifiers, several new properties will be requested shortly (e.g. WHO INN, USAN).
Taxa
Identifiers
Title | ID | Data type | Description | Examples | Inverse |
---|---|---|---|---|---|
NCBI taxonomy ID | P685 | External identifier | identifier for a taxon in the Taxonomy Database by the National Center for Biotechnology Information | human <NCBI taxonomy ID> 9606 | - |
Modeling questions
Why do we have both e.g. "peptidase" (class of enzymes) and "peptidase activity" (molecular function)? Can't they be merged?
No. An enzyme can have multiple functions; see multifunctional enzyme (Q67211934) in the query:
# Instances and (transitive) subclasses of multifunctional enzyme (Q67211934)
SELECT ?item ?label WHERE {
  { ?item wdt:P31 wd:Q67211934. } UNION { ?item wdt:P279+ wd:Q67211934. }
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')   # English labels only
}
In principle, enzymes also have binding functions for their substrates and products, e.g. ATP binding (Q14817981).
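A query like the one above yields ?item and ?label bindings. When a SPARQL endpoint returns them in the standard SPARQL JSON results format, each row holds one object per variable with a "value" field. The following sketch flattens such a response into plain rows. The function name bindings_to_rows is hypothetical, and the sample response is made up for illustration (Q0 and its label are placeholders, not real query output).

```python
import json
from typing import Dict, List


def bindings_to_rows(response_text: str) -> List[Dict[str, str]]:
    """Flatten SPARQL JSON results into one plain dict per result row."""
    data = json.loads(response_text)
    rows = []
    for binding in data["results"]["bindings"]:
        row = {var: cell["value"] for var, cell in binding.items()}
        # Entity URIs end in the item ID, so keep only the last path segment.
        if "item" in row:
            row["item"] = row["item"].rsplit("/", 1)[-1]
        rows.append(row)
    return rows


# Made-up response for illustration only; Q0 and its label are placeholders.
sample_response = json.dumps({
    "head": {"vars": ["item", "label"]},
    "results": {"bindings": [
        {
            "item": {"type": "uri",
                     "value": "http://www.wikidata.org/entity/Q0"},
            "label": {"type": "literal", "xml:lang": "en",
                      "value": "placeholder enzyme"},
        }
    ]},
})
rows = bindings_to_rows(sample_response)
print(rows)  # [{'item': 'Q0', 'label': 'placeholder enzyme'}]
```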
Why is the EC (Enzyme Commission) no longer used as normative?
We use exact/broad mappings to Gene Ontology function entities to build our enzyme hierarchy. EC was never consistent: it had both multifunctional entries and very narrow or species-specific sub-entries at the same level. In contrast, GO functions are never defined by gene product or taxon.
Why do we have both glutamine-tRNA synthetase (Q105722884) (class of enzymes) and glutamine-tRNA synthetase (Q24785187) (InterPro family)?
Because the first is an abstract concept and the second is a specific set of proteins defined by InterPro (which also keeps changing invisibly). In practice, if a new organism were discovered outside of known taxa, its glutaminyl-tRNA synthetase would automatically be a member of the first (open) set, but possibly not of the second. This also means that InterPro families are frequently subgroups of the protein classes suggested by their titles. (This is also why function statements on InterPro families should never have the exact mapping type.)
Why do I get WD40 repeat (Q7948257) and WD40 repeat, protein family (Q95350717) if I search for IPR001680?
Because we also want an item for the protein domain itself, so that we can make statements about it. Unfortunately, an InterPro domain entry stands both for the domain and for the set of proteins that have that domain (according to the computational rules InterPro applies), so we need two items, and we want to link both to the InterPro entry.
Why are (some) Gene Ontology protein complexes (cellular components) also instances of family of protein complexes (Q78155096)?
This is a temporary solution, but not far wrong. In contrast to Complex Portal entries and Reactome protein complexes, Gene Ontology complexes are species-independent, so if the set of all complexes defined by such an entry is homologous, then it is a family. The condition, of course, is that the parts are always the same, i.e. they come from the same protein families. This can change and make sub-entries necessary later. However, at the moment this slowly growing part of Wikidata is, to our knowledge, the only existing database of species-independent protein complexes that links them to the families of their parts.
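The condition that "the parts are always the same" can be expressed as a simple check. This sketch assumes that each species-specific complex has already been mapped to the protein families of its parts. It ignores stoichiometry, and all names in the example data are made-up placeholders. The function name is_complex_family is hypothetical.

```python
from typing import Dict, List


def is_complex_family(parts_by_complex: Dict[str, List[str]]) -> bool:
    """True if every complex is built from the same set of protein families.

    parts_by_complex maps a species-specific complex to the families of its
    parts. Stoichiometry is ignored (an assumption of this sketch).
    """
    family_sets = {frozenset(parts) for parts in parts_by_complex.values()}
    # Exactly one distinct set means all complexes share their part families.
    return len(family_sets) == 1


# Placeholder data: the same two families in both species.
same_parts = {
    "complex in species A": ["family 1", "family 2"],
    "complex in species B": ["family 2", "family 1"],
}
# Placeholder data: species B uses a different family, so a sub-entry
# would be needed.
different_parts = {
    "complex in species A": ["family 1", "family 2"],
    "complex in species B": ["family 1", "family 3"],
}
print(is_complex_family(same_parts), is_complex_family(different_parts))
```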
For an overview, there are currently 2,579 complexes in GO. We have annotated the parts of 35 of them, and encourage you to add those you are interested in. Query:
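# GO complexes that are also families of protein complexes and have annotated parts
SELECT DISTINCT ?item ?label WHERE {
  ?item wdt:P31 wd:Q78155096.   # family of protein complexes
  ?item wdt:P31 wd:Q5058355.    # Gene Ontology cellular component
  ?item wdt:P2670 [].           # has at least one P2670 (parts) statement
  ?item rdfs:label ?label.
  FILTER(lang(?label) = 'en')   # English labels only
}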