Talk:Q8054
Autodescription — protein (Q8054)
- Useful links:
- View it! – Images depicting the item on Commons
- Report on constraint conformation of “protein” claims and statements. Constraints report for items data
- Parent classes (classes of items which contain this one item)
- protein (Q8054)
- biopolymer (Q422649)
- polypeptide (Q3084232)
- peptide (Q172847)
- carboxamides (Q355679)
- organic compound (Q174211) (℧)→
- →(§) macromolecule (Q178593)
- peptide (Q172847)
- gene product (Q424689)
- →(¤) biological macromolecule (Q66560214)
- protein (Q8054)
- Subclasses (classes which contain special kinds of items of this class)
- ⟨protein⟩ on wikidata tree visualisation (external tool) (depth=1)
- Generic queries for classes
- See also
- This documentation is generated using {{Item documentation}}.
Instantiation of this item is usually not precise.
I see that some classes of proteins (such as tumor protein p53 (Q283350)) are reported as instances of protein (Q8054). This seems imprecise to me, as these are not material entities, but classes of proteins. It would be more precise to represent them as subclasses of protein (Q8054). TiagoLubiana (talk) 18:52, 18 March 2020 (UTC)
- @TiagoLubiana: Your mistake is to take the label as defining the item, which comes from Wikipedia usage, where the title usually defines the topic. In Wikidata the statements are defining, especially if they have references. With tumor protein p53 (Q283350) you can see that every statement points to it being the specific human protein, in particular the external identifiers. By the way we have the p53 family as well: p53 tumour suppressor family (Q24738786) and tumor protein p53 (Q283350) is linked with it via part of (P361).
- However, there may be reasons to change tumor protein p53 (Q283350) to a subclass of protein (Q8054). In the real world an instance of a protein is a specific molecule that may have a unique distribution of element isotopes different from any other molecule of the same protein. In other words, what is an instance of (P31) of protein (Q8054) like tumor protein p53 (Q283350) is already a set of possibilities, including combinations of isotopes, mutational variants, post-translational modifications, splice variants, and, last but not least, structural conformations and homooligomers. --SCIdude (talk) 07:02, 4 May 2021 (UTC)
- @SCIdude: Nice reply. That last part is the point I was trying to make: a P53 protein is just a type of protein, and all real-world molecules that are instances of "P53 protein" are instances of protein. Using part of (P361) seems suboptimal, as protein families are really superclasses that encompass the species-specific "protein type". Maybe protein could be split into two items: "class of proteins" and "protein molecule". So P53 would be a subclass of (P279) "protein molecule" and an instance of (P31) "class of proteins". The PRO ontology uses a subclassing system of that sort. TiagoLubiana (talk) 13:53, 4 May 2021 (UTC)
- I also think that the IUPAC, which has proteins still as polypeptides, should go ahead and sort out the mess. --SCIdude (talk) 09:31, 28 June 2021 (UTC)
instance vs subclass
Here are recent stats of instances and subclasses of this class:
- P31: 1003460
- P279: 768979
- P279+: 798642
- P279+ but not P279: 29663
- both P31 and P279+: 760986
- P279+ but not P31: 37655
- P279 but not P31: 17581
- P31 but not P279+: 242406
- P31 but not P279: 251995
What distinguishes a subclass from an instance? It seems to me that it would be better to make some sort of determination that specific proteins are either instances or subclasses of this class, not both. Peter F. Patel-Schneider (talk) 17:00, 12 January 2024 (UTC)
- I can imagine that "subclass" is usually used for families while "instance" is used for specific proteins. How would you distinguish them using only P279? --Infovarius (talk) 21:31, 13 January 2024 (UTC)
- One could use instance of (P31) for specific proteins and subclass of (P279)+ for families, but that doesn't seem to be what is happening because so many are both. So what is the distinction between instance of (P31) only, subclass of (P279)+ only, and both? Peter F. Patel-Schneider (talk) 22:19, 13 January 2024 (UTC)
- All proteins should use P279. The distinction between specific proteins and classes of proteins can be easily achieved using metaclasses like protein family (Q417841) with P31. Wostr (talk) 22:31, 13 January 2024 (UTC)
- That makes sense, but then the 760986 items that are both instances and subclasses should only be one or the other. How can this be fixed? Peter F. Patel-Schneider (talk) 14:26, 14 January 2024 (UTC)
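For anyone who wants to see what the categories in the stats above mean, here is a minimal sketch that computes the same breakdown on a tiny made-up statement list. It is not the query that produced the numbers quoted in this section; the item names (`A`, `B`, `F`, ...) and the function names are hypothetical, and `P279+` is read as "reachable through one or more subclass of (P279) links".

```python
from collections import defaultdict

P31, P279 = "P31", "P279"


def subclass_closure(statements, root):
    """Return every item that reaches `root` via one or more P279 links (P279+)."""
    children = defaultdict(set)
    for subj, prop, obj in statements:
        if prop == P279:
            children[obj].add(subj)
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def conflation_stats(statements, root):
    """Count the same categories as the stats above for the class `root`."""
    inst = {s for s, p, o in statements if p == P31 and o == root}
    sub = {s for s, p, o in statements if p == P279 and o == root}
    sub_plus = subclass_closure(statements, root)
    return {
        "P31": len(inst),
        "P279": len(sub),
        "P279+": len(sub_plus),
        "P279+ but not P279": len(sub_plus - sub),
        "both P31 and P279+": len(inst & sub_plus),
        "P279+ but not P31": len(sub_plus - inst),
        "P279 but not P31": len(sub - inst),
        "P31 but not P279+": len(inst - sub_plus),
        "P31 but not P279": len(inst - sub),
    }


# Hypothetical toy data, not real Wikidata items.
TOY = [
    ("A", P31, "protein"), ("A", P279, "protein"),  # both instance and subclass
    ("B", P31, "protein"),                          # instance only
    ("F", P279, "protein"),                         # subclass only (a family, say)
    ("C", P279, "F"), ("C", P31, "protein"),        # indirect subclass and instance
    ("D", P279, "F"),                               # indirect subclass only
]

stats = conflation_stats(TOY, "protein")
for label, count in stats.items():
    print(f"{label}: {count}")
```

On the real data the "both P31 and P279+" line is the one under discussion; the sketch only makes the set arithmetic behind each line explicit.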
addressing instance/subclass conflation in protein
Notified participants of WikiProject Medicine
WikiProject Molecular biology has more than 50 participants and couldn't be pinged. Please post on the WikiProject's talk page instead.
I am again looking at protein and I'm still seeing about three-quarters of a million items that are both instance and subclass of protein. Is there any reason for this? If not, what is the correct relationship between a particular protein and protein? Peter F. Patel-Schneider (talk) 17:28, 30 September 2024 (UTC)
- @Andrawaag @Andrew Su @Sulhasan Do you have any comments on this? Peter F. Patel-Schneider (talk) 17:41, 30 September 2024 (UTC)
- If you think about this issue, the first step should be to think about the different concepts that we actually have. There's (1) a given amino acid chain. There is (2) that amino acid chain along with a few mutations that all exist within the same species. Then we have (3) related proteins across species that all have the same function, and (4) amino acid sequences with a common evolutionary origin. Finally there are combinations of different amino acid chains that together form one complex, for each of the preceding categories ((5), (6), (7) and (8)).
- Then we have a different problem about the length of those amino acid chains and which count as proteins. The word protein is commonly used only for "long" amino acid chains. In English the word peptide is only used for short amino acid chains while in German "Peptide" is used for all amino acid chains, whether they are short or long.
- We currently have protein (Q8054) subclass of (P279) polypeptide (Q3084232) subclass of (P279) peptide (Q172847), and peptide (Q172847) has the description "natural biological or artificially manufactured short chains of amino acid monomers linked by peptide (amide) bonds".
- Before you change the P31/P279 of individual proteins, it would make sense to clean up the ontology and probably have one second-order class for each of (1) to (8). I would say that things that are P31 (1) should subclass things that are P31 (2), which in turn subclass things that are P31 (3), which subclass things that are P31 (4).
- In practice (1) and (5), (2) and (6), (3) and (7), (4) and (8) should all have shared superclasses.
- If you use imprecise terms like "particular protein" which mean multiple different things, you can't clean up the ontology on that basis. We would first need a good ontology that distinguishes the different concepts and that doesn't say things like "long amino acid chain sequences" are "short amino acid chain sequences" before we can change the millions of existing items, and then the question of how to automatically tell which item should go where becomes more complex. ChristianKl ❪✉❫ 09:55, 1 October 2024 (UTC)
- @ChristianKl I don't think that it is necessary to fully populate the ontology related to proteins. But you do make a good point that some of the superclasses of protein (Q8054) do not appear to be correct and should be changed. Is that something that should be discussed in the Chemistry project or in the Molecular Biology project? Hopefully a quick determination can be done here.
- But my main concern is the items that are both subclasses and instances of protein (Q8054) as this situation has consequences in the chemistry domain, for products, and for the Wikidata ontology as a whole. I'm hoping to have a discussion with a group of people and come up with a solution that is acceptable to all parties, even if the solution is not ideal. Peter F. Patel-Schneider (talk) 12:34, 1 October 2024 (UTC)
- I don't think you get a situation that's acceptable to all parties if you don't focus on the underlying ontological categories.
- I did raise the issue with the term peptide over at https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#peptide_(Q172847) ChristianKl ❪✉❫ 12:42, 1 October 2024 (UTC)
- @ChristianKl I definitely agree that something needs to be done for protein/polypeptide/peptide and that probably involves creating at least one extra class and untangling the conflation between the English and German labels and descriptions. This also requires unlinking protein from polypeptide because a protein can have multiple "peptide chains". (I don't know whether proteins can have branching peptide chains so I don't know whether branching is a distinction that needs to be made.) But as far as I can tell, this doesn't require creating classes for all the different possibilities. Peter F. Patel-Schneider (talk) 14:10, 1 October 2024 (UTC)
- There's a general question of what the word "protein" is supposed to mean. insulin (Q50265665) is a good example. Currently, Reactome says that it's a "Complex" and not just a normal protein because it has two chains. Other sources see a protein complex as something that's made up of multiple proteins. We currently have a protein complex (Q420927) subclass of (P279) protein (Q8054) claim.
- amino acids, peptides, and proteins (Q77044953) does exist currently but is not marked as second-order class. Note that we also have items like eye proteins (Q76799283). ChristianKl ❪✉❫ 15:28, 1 October 2024 (UTC)
- @ChristianKl protein (Q8054) is linked to https://en.wikipedia.org/wiki/Protein which states "Proteins are large biomolecules and macromolecules that comprise one or more long chains of amino acid residues." That's pretty strong evidence that protein (Q8054) deserves to be a superclass of protein complex (Q420927) and not a subclass of polypeptide (Q3084232). As far as I know, some proteins, like myoglobin (Q192642) and hemoglobin (Q43041) include components that are not peptide chains, which is further evidence for protein (Q8054) to not be a subclass of polypeptide (Q3084232). (But maybe these sources are being a bit loose with their terminology. Do you have pointers to the sources you mention?) Peter F. Patel-Schneider (talk) 13:26, 2 October 2024 (UTC)
- Wikipedia pages tend to mix all relevant definitions of a term together. If you care about good ontology, focusing on them as "pretty strong evidence" is a bad idea.
- Protein Ontology defines Protein as "An amino acid chain that is produced de novo by ribosome-mediated translation of a genetically-encoded mRNA, and any derivatives thereof".
- This happens to be a definition that's quite practical from a bioinformatics perspective and under that definition hemoglobin is not a protein but a protein-containing complex. If you look at the item for hemoglobin (Q43041) you can see that it does not have a UniProtID. On the other hand, Hemoglobin subunit beta (Q424422) does have a UniProtID because it's what UniProt considers a protein.
- I would suspect that most of our imports of data about proteins come from sources with a bioinformatics perspective (note, here that I'm biased towards Bioinformatics as I studied bioinformatics). ChristianKl ❪✉❫ 15:07, 2 October 2024 (UTC)
- @ChristianKl OK, perhaps English Wikipedia is not a great source, but it is linked to from protein (Q8054). The English description of protein (Q8054) is "biomolecule consisting of chains of amino acid residues" and hemoglobin (Q43041) is a subclass of protein (Q8054), so there is internal evidence that protein (Q8054) covers more than just amino acid chains. Admittedly, there is a conflict here that should be resolved, but I think the best resolution of the conflict is to broaden the description by changing "consisting of" to "containing". Just because data from a source contributes to a Wikidata item does not imply that the Wikidata item exactly corresponds to the definitions in the source.
- The molecular biology community could decide that protein (Q8054) should use the narrow definition of protein but then they need to make the changes necessary to have protein (Q8054) correspond to that definition. And in any case, protein (Q8054) still wouldn't be a subclass of polypeptide (Q3084232) because its current instances make it be "protein or protein type". Peter F. Patel-Schneider (talk) 16:03, 2 October 2024 (UTC)
- If you want to know what's meant by protein (Q8054), you can look at the individual usages of it. Hemoglobin subunit beta (Q424422) instance of (P31) protein (Q8054) with a reference to UniProt seems to me a statement that's intended to mean that it's a protein in the sense of how Protein Ontology defines the term, while at the same time some molecular biologists might think about it as a subunit of a protein (maybe protein subunit (Q899781)).
- When we mostly source instance of (P31) protein (Q8054) from UniProt it makes little sense to have ontological concepts around protein (Q8054) that define terms differently. ChristianKl ❪✉❫ 17:07, 2 October 2024 (UTC)
a proposal to address the instance/subclass conflation in protein
@Andrawaag @Andrew Su @Sulhasan Here is a proposal for addressing the issues caused by the unusual setup for protein (Q8054).
- Remove protein (Q8054) subclass of (P279) polypeptide (Q3084232) because the current instances of protein (Q8054) are not correct as instances of polypeptide (Q3084232).
- Change the label of Q8054 to "protein or protein type" and adjust the description and aliases accordingly.
- Add a Wikidata usage instructions (P2559) value to Q8054 saying that the class has protein types as both instances and subclasses.
- Replace protein (Q8054) instance of (P31) second-order class (Q24017414) with protein (Q8054) instance of (P31) variable-order class (Q23958852).
This is not the ideal solution, which would be to have protein types only be subclasses of protein (Q8054), but it does alleviate the issues that protein (Q8054) is causing for the Wikidata ontology.
Comments please. Peter F. Patel-Schneider (talk) 13:36, 2 October 2024 (UTC)
- I don't think we should use variable-order class (Q23958852) here. Moving toward treating it as a second-order class (which is the current status) is better, even if the work to get there isn't trivial. ChristianKl ❪✉❫ 15:10, 2 October 2024 (UTC)
- @ChristianKl If protein (Q8054) is a second-order class (Q24017414) then all its instances are first-order classes, i.e., classes whose instances are all non-classes, which is not currently the case. Adjusting protein (Q8054) to fit this requirement involves changing the roughly three-quarters of a million items that are both instances and subclasses of protein (Q8054). I would love to have this happen, but will it, particularly as there has been discussion of this point going back to at least 2018? (See https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2018/02#Both_instance_and_subclass_of_the_same_item.)
- You could view my proposal as an interim step in the potentially long process of cleaning up protein (Q8054). Peter F. Patel-Schneider (talk) 16:11, 2 October 2024 (UTC)
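To make the second-order argument above concrete, here is a small sketch of the reasoning, assuming a strict fixed-order reading: a subclass has the same order as its superclass, and an instance sits exactly one order below its class. The item names and the `required_orders` helper are hypothetical and only illustrate the constraint; nothing here is an existing Wikidata tool.

```python
from collections import defaultdict


def required_orders(statements, root, root_order=2):
    """Propagate fixed-order requirements downward from `root`.

    Assumption: a P279 subject must have the same order as its object,
    and a P31 subject must be exactly one order lower than its object.
    """
    required = defaultdict(set)
    required[root].add(root_order)
    queue = [(root, root_order)]
    while queue:
        node, order = queue.pop()
        for subj, prop, obj in statements:
            if obj != node or prop not in ("P31", "P279"):
                continue
            new_order = order if prop == "P279" else order - 1
            if new_order not in required[subj]:
                required[subj].add(new_order)
                queue.append((subj, new_order))
    return required


# Hypothetical toy data, not real Wikidata items.
TOY = [
    ("toy_type", "P31", "protein"),     # instance of protein ...
    ("toy_type", "P279", "protein"),    # ... and also subclass of protein
    ("toy_family", "P279", "protein"),  # subclass only
    ("toy_member", "P31", "toy_family"),
]

orders = required_orders(TOY, "protein", root_order=2)
conflicts = sorted(item for item, needed in orders.items() if len(needed) > 1)
print(conflicts)
```

An item that is both an instance and a subclass of the same class ends up needing two different orders at once, which is the situation the roughly three-quarters of a million items are in under this reading.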
- I don't think it's a step forward to cleaning up protein (Q8054). It's rather a step to accept uncleanliness. ChristianKl ❪✉❫ 16:58, 2 October 2024 (UTC)
- Yeah, maybe, but is it possible to make the three-quarters of a million changes this year? If not, then maybe the best we can do is to accept the uncleanliness. With the changes at least the uncleanliness does not affect the rest of the Wikidata ontology, albeit at the cost of degrading the connections between molecular biology and the rest of Wikidata. Peter F. Patel-Schneider (talk) 18:04, 2 October 2024 (UTC)
- When it comes to doing well for the rest of the ontology of Wikidata, I would try to use variable-order class (Q23958852) as few times as possible. As far as changing hundreds of thousands of items goes, that's likely a job for ProteinBoxBot, so I would wait for an answer from the people who run it. ChristianKl ❪✉❫ 23:21, 2 October 2024 (UTC)
- According to https://www.wikidata.org/wiki/Special:Contributions/ProteinBoxBot, it hasn't made any changes for over 2 years, so I'm not sure whether it is active at all. Peter F. Patel-Schneider (talk) 01:14, 3 October 2024 (UTC)
- Correct, our funding to run ProteinBoxBot expired, and we thought halting the bot was better than letting it run unsupervised. I'm afraid that also means I'm a bit out of the loop on the data modeling best practices here in Wikidata, so I don't have an opinion on the proposed solution here... Best, Andrew Su (talk) 05:24, 3 October 2024 (UTC)
- @Andrew Su Thanks for the information. Since 2016 or so, it has been considered bad practice to have items that are both a subclass and an instance of the same class. It appears that over the years, ProteinBoxBot made changes that make about three-quarters of a million items be both subclasses and instances of protein (Q8054). This point has been brought up several times before. The issue is now whether to live with all these and make adjustments so that they do not degrade the rest of the Wikidata ontology or to perform a bulk change so that these items are brought in line with best practices. Both approaches probably require buy-in from several Wikidata communities. Peter F. Patel-Schneider (talk) 13:45, 3 October 2024 (UTC)
- Not sure; I am lost. I am standing by to comment on a proposal, but although I watch conversations like this, I do not edit significantly in this space and have no great insight. If some editors come to a consensus, then I can help recruit others to comment on best practice. Bluerasberry (talk) 15:32, 3 October 2024 (UTC)
OK, let me outline several other ways forward that I can see, including this one for contrast. I'll start a new topic for this. Peter F. Patel-Schneider (talk) 16:06, 7 October 2024 (UTC)
possible ways to address the instance/subclass issue in proteins
@Andrawaag @Andrew Su and others
I see several ways forward. I give here the main points of each of these ways and what I see as their advantages and disadvantages. Please comment on which ways you prefer.
1. Leave the instance and subclass links into protein (Q8054) alone, document the current situation, and adjust the relationships between protein (Q8054) and the rest of the Wikidata ontology.
   - Change the label of protein (Q8054) to "protein or protein complex or type thereof" to align it with its current status.
   - Change the description of protein (Q8054) to "biomolecule or biomolecule complex largely consisting of chains of amino acid residues or type thereof" to align it with its current status.
   - Add a Wikidata usage instructions (P2559) value further explaining the current situation.
   - Replace the outgoing instance of (P31) and subclass of (P279) links from protein (Q8054) with protein (Q8054) subclass of (P279) entity (Q35120) and protein (Q8054) instance of (P31) variable-order class (Q23958852).
   - Optionally create a new chemical metaclass "variable-order class of chemical entities" and add protein (Q8054) instance of (P31) variable-order class of chemical entities.
   - Investigate the 155 subclasses of protein (Q8054) that have instances to see which of them need the same treatment or should be modified in some other way.
   - Advantage: This has the least number of changes.
   - Advantage: All external access to instances and subclasses of protein (Q8054) is unaffected.
   - Disadvantage: The bad ontological status of protein (Q8054) is unchanged.
   - Disadvantage: protein (Q8054) is separated from a large part of the Wikidata ontology.
2. Create a new property for the instance of (P31) links into protein (Q8054).
   - Modify the instance of (P31) links into protein (Q8054) to use the new property.
   - The label of the property would be something like "(deprecated) multi-hop subclass of".
   - The description would be something like "(This property is deprecated and only to be used to fix legacy uses of P31 that were stand-ins for multi-hop P279 links.) subclass of and somehow special".
   - Advantage: The bad ontological status of protein (Q8054) is fixed.
   - Disadvantage: All existing external accesses that use instance of (P31) protein (Q8054) have to be adjusted to use the new property.
   - Disadvantage: The new property starts out as deprecated.
3. Split protein (Q8054) into two classes, one for molecules and one for types of molecules.
   - protein (Q8054) would be for types and the new class for molecules (probably, although this could be reversed).
   - The labels would be "protein or protein complex" and "type of protein or protein complex".
   - The descriptions would be "biomolecule or biomolecule complex largely consisting of chains of amino acid residues" and "type of biomolecule or biomolecule complex largely consisting of chains of amino acid residues".
   - Replace subclass of (P279) links into protein (Q8054) with links into the new class.
   - Repeat for any subclasses of protein (Q8054) that need the same treatment.
   - Advantage: The bad ontological status of protein (Q8054) is fixed.
   - Disadvantage: All existing external accesses that use subclass of (P279) protein (Q8054) have to be adjusted to use the new class.
   - Disadvantage: The two classes will need to be kept synchronized.
4. Combine the first way and the previous way by creating two new classes.
   - Further adjust the labels and descriptions of protein (Q8054) to say that it is deprecated.
   - Instead of replacing links, copy and modify them.
   - Advantage: There are protein classes with good ontological statuses.
   - Advantage: The new protein classes are correctly integrated with the Wikidata ontology.
   - Advantage: All external access to instances and subclasses of protein (Q8054) is unaffected.
   - Disadvantage: The three classes may need to be kept synchronized.
   - Disadvantage: The bad ontological status of protein (Q8054) is unchanged.
   - Disadvantage: There is a deprecated class that may end up being used instead of the correct classes.
5. Remove all the instance of (P31) links into protein (Q8054) and its subclasses.
   - If the item is not a subclass of the object of the removed link, add a subclass of (P279) link from the item to the object.
   - Advantage: The bad ontological status of protein (Q8054) is fixed.
   - Disadvantage: Any external access to instances of protein (Q8054) has to be changed in a non-trivial fashion.
   - Disadvantage: There is no way to distinguish between types of proteins and other kinds of protein classes.
Peter F. Patel-Schneider (talk) 16:18, 7 October 2024 (UTC)
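As a rough illustration of the last way above (removing instance of (P31) links into protein (Q8054) and its subclasses, and adding subclass of (P279) where needed), here is a sketch on a toy statement list. It assumes "is not a subclass of the object" means "not a direct or indirect subclass" and checks that against the original statements; the item names and the `apply_option_5` helper are hypothetical, and a real bulk edit would go through a bot and the Wikidata API rather than in-memory tuples.

```python
def transitive_superclasses(statements, item):
    """Return every class reachable from `item` via one or more P279 links."""
    parents = {}
    for subj, prop, obj in statements:
        if prop == "P279":
            parents.setdefault(subj, set()).add(obj)
    seen, stack = set(), [item]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def apply_option_5(statements, root):
    """Drop P31 links into `root` and its subclasses, adding P279 where missing."""
    subjects = {subj for subj, _, _ in statements}
    targets = {root} | {
        s for s in subjects if root in transitive_superclasses(statements, s)
    }
    result = []
    for subj, prop, obj in statements:
        if prop == "P31" and obj in targets:
            # The P31 link is removed; keep the relationship as P279 if the
            # item is not already a (direct or indirect) subclass of obj.
            if obj not in transitive_superclasses(statements, subj):
                replacement = (subj, "P279", obj)
                if replacement not in result:
                    result.append(replacement)
            continue
        if (subj, prop, obj) not in result:
            result.append((subj, prop, obj))
    return result


# Hypothetical toy data, not real Wikidata items.
TOY = [
    ("type_a", "P31", "protein"), ("type_a", "P279", "protein"),
    ("type_b", "P31", "protein"),
    ("family", "P279", "protein"),
    ("type_c", "P31", "family"),
    ("type_b", "P31", "other_class"),  # unrelated P31 link, left alone
]

cleaned = apply_option_5(TOY, "protein")
for statement in cleaned:
    print(statement)
```

In the toy output `type_c` ends up as a plain subclass of `family`, which shows the stated disadvantage: after the change, types of proteins and other kinds of protein classes look the same unless a metaclass is added with instance of (P31).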
- Regarding the 3rd option: are there any items in WD that could be classified as molecules, i.e. individual molecules, not types of molecules? I don't think there are any items like this, and the problem here is a result of the incorrect use of instance of (P31) (the same problem that was present for chemical compounds in the past). From my perspective the only way forward is the 5th option. All items linked to protein (Q8054) should use the subclass of (P279) relation only (as there are no individual protein molecules described in WD), and instance of (P31) may be used to add e.g. protein family (Q417841) and type of protein classes. I'm not sure, but differentiation between class and type might be possible to implement on the basis of external databases. Wostr (talk) 17:41, 7 October 2024 (UTC)
- None of these options depends on whether individual molecules are present in Wikidata or whether any individual molecules are eligible to be present in Wikidata, except that the absence of protein molecule items makes it easier to implement the various options. (The situation with proteins is very similar to the situation with other classes that do not have notable instances, for example brad (Q111366998).) In each of the options above the class for (only) protein molecules would not end up with any instances after the changes required to implement the option are performed. In options 1 and 4 protein (Q8054) would end up with instances, but that would be because it is a union of protein molecule and protein type.
- I do believe that the absence of items for individual protein molecules made it easier to set up protein (Q8054) the way that it is currently set up. Peter F. Patel-Schneider (talk) 18:03, 7 October 2024 (UTC)
- Any of the above options that move statements could instead deprecate them with a reason for deprecated status that says something like "legacy use of P31 as shortcut for multi-hop P279". Peter F. Patel-Schneider (talk) 21:30, 10 October 2024 (UTC)