Presentation for PIDapalooza 2016. PIDs need to be used to achieve their intended persistence. Our research (reported at WWW2016, see http://arxiv.org/1602.09102) found that a disturbing percentage of references to papers that have DOIs actually use the landing page HTTP URI instead of the DOI HTTP URI. The problem is likely related to tools used for collecting references such as bookmarks and reference managers. These select the landing page URI instead of the DOI URI because the former is what's available in the address bar. It can safely be assumed that the same problem exists for other types of PIDs. The net result is that the true potential of PIDs is not realized. In order to ameliorate this problem we propose a Signposting pattern for PIDs (http://signposting.org/identifier/). It consists of adding a Link header to HTTP HEAD/GET responses for all resources identified by a DOI, including the landing page and content resources such as "the PDF" and "the dataset". The Link header contains a link, which points with the "identifier" relation type to the DOI HTTP URI. When such a link is available, tools can automatically discover and use the DOI URI instead of the other URIs (landing page, PDF, dataset) associated with the DOI-identified object.
PID Signposting Pattern
1. Herbert Van de Sompel
PIDapalooza, Reykjavik, Iceland, 10 Nov 2016
Cartoon by Patrick Hochstenbach
Herbert Van de Sompel
LANL & DANS
@hvdsomp
http://orcid.org/0000-0002-0715-6126
Acknowledgments: Geoff Bilder, Shawn Jones, Martin Klein, Michael L. Nelson, David Rosenthal, Harihar Shankar, Simeon Warner, Karl Ward, Joe Wass
A Signposting Pattern for PIDs
http://signposting.org
Signposting is funded by the Andrew W. Mellon Foundation
2. Herbert Van de Sompel
• A Disconcerting Observation
• A Proposed Fix Using Signposting
• Signposting, The Bigger Picture
• Additional Signposting Patterns
Outline
3. Herbert Van de Sompel
Large Scale Study into Reference Rot for Links to Web-at-Large Resources Found in STM Articles
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. Under review.
4. Herbert Van de Sompel
STM Articles in the Study
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
STM articles published 1997-2012              | arXiv   | PMC     | total
Per corpus                                    | 707,667 | 479,194 | 1,186,861
With URI references to articles               | 51,574  | 240,857 | 292,431
With URI references to web-at-large resources | 142,134 | 156,160 | 298,294
5. Herbert Van de Sompel
Articles that Link to Articles & to Web At Large Resources (PMC)
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
6. Herbert Van de Sompel
URI References in the Study
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
URI References            | arXiv   | PMC       | total
Per corpus                | 781,895 | 1,653,567 | 2,435,462
Excluded                  | 1,555   | 428,036   | 429,591
To articles               | 434,163 | 744,678   | 1,178,841
To web-at-large resources | 346,177 | 480,853   | 827,030
7. Herbert Van de Sompel
URI References to Articles & to Web At Large Resources (PMC)
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
8. Herbert Van de Sompel
• When classifying URI references as linking to articles, we assumed that filtering on http://dx.doi.org/* would do the trick
• But we found many landing page URIs, e.g. http://link.springer.com/article/*
• For example:
  • http://link.springer.com/article/10.1007%2Fs00799-014-0108-0
• Instead of:
  • http://dx.doi.org/10.1007/s00799-014-0108-0
• We used CrossRef’s Reverse Domain Lookup to classify these URI references as linking to articles and went on with our reference rot research
A Disconcerting Observation
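The Springer example above carries the DOI inside the landing page URI in URL-encoded form. The following is a minimal sketch of how a tool might recover the DOI HTTP URI from such a URI. It is not the CrossRef Reverse Domain Lookup used in the study; the function name and the loose DOI pattern are assumptions made for illustration, and the approach only works when the DOI happens to appear in the URI path.

```python
import re
from urllib.parse import unquote, urlsplit

# Loose DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is a common heuristic, not an official grammar (assumption).
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#]+")


def doi_uri_from_landing_page(uri):
    """Return the DOI HTTP URI embedded in a landing page URI, or None."""
    # unquote() turns the encoded "%2F" back into "/".
    path = unquote(urlsplit(uri).path)
    match = DOI_PATTERN.search(path)
    if match is None:
        return None
    return "https://doi.org/" + match.group(0)


print(doi_uri_from_landing_page(
    "http://link.springer.com/article/10.1007%2Fs00799-014-0108-0"))
```

Landing page URIs that do not embed the DOI cannot be handled this way, which is part of the motivation for an explicit link from the resource to its PID.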
9. Herbert Van de Sompel
Hiberlink Results: Link Rot - arXiv
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
10. Herbert Van de Sompel
Hiberlink Results: Content Drift - arXiv
Shawn Jones, Herbert Van de Sompel, et al. (2016) Scholarly context adrift. Under review.
11. Herbert Van de Sompel
Hiberlink Results: Robust Links
http://robustlinks.mementoweb.org
12. Herbert Van de Sompel
A Closer Look at the Disconcerting Observation
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent.
In: WWW2016. http://arxiv.org/1602.09102
13. Herbert Van de Sompel
A Closer Look at the Disconcerting Observation - arXiv
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent.
In: WWW2016. http://arxiv.org/1602.09102
14. Herbert Van de Sompel
A Closer Look at the Disconcerting Observation - PMC
Herbert Van de Sompel, Martin Klein, and Shawn Jones (2016) Persistent URIs Must Be Used to Be Persistent.
In: WWW2016. http://arxiv.org/1602.09102
15. Herbert Van de Sompel
• CrossRef’s publisher baseURLs represent the state of the DOI resolver at the time of the research
  • Some shouldbeDOI references may have been classified as web-at-large because old publisher baseURLs are no longer in the resolver
• At the time of the research, no public information was available about when a publisher started to assign DOIs
  • Some references may have wrongly been classified as shouldbeDOI because the publisher was not yet assigning DOIs in earlier years
• Findings for recent years do not suffer from the above
Caveats Regarding the Disconcerting Observation
16. Herbert Van de Sompel
Content Types of "200 OK" shouldbeDOI Resources, Year 2012
Content Type    | arXiv  | PMC
text/html       | 19,649 | 63,769
application/pdf | 153    | 1,813
text/plain      | 7      | 3,924
image/jpeg      | 1      | 64
other           | 46     | 74
none provided   | 2,118  | 5,210
total           | 21,974 | 74,854
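To make the proportions in the content type table easier to read, the short sketch below computes the share of "200 OK" shouldbeDOI resources served as text/html in each corpus. The counts are copied from the table; the variable and function names are illustrative only.

```python
# Counts copied from the content type table (year 2012).
counts = {
    "arXiv": {"text/html": 19649, "total": 21974},
    "PMC": {"text/html": 63769, "total": 74854},
}


def html_share(corpus):
    """Fraction of resources in a corpus that were served as text/html."""
    row = counts[corpus]
    return row["text/html"] / row["total"]


for corpus in counts:
    print(f"{corpus}: {html_share(corpus):.1%} text/html")
```

In both corpora roughly 85 to 90 percent of these references resolve to HTML, which is consistent with references pointing at landing pages rather than at the DOI URI.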
17. Herbert Van de Sompel
Content Length of "200 OK" shouldbeDOI Resources, Year 2012
Content Length | arXiv  | PMC
1-50 k         | 6,084  | 7,215
50-100 k       | 772    | 12,804
100-150 k      | 225    | 4,835
150-200 k      | 33     | 7,885
200+ k         | 216    | 9,423
chunked        | 4,100  | 20,596
none provided  | 10,544 | 12,096
total          | 21,974 | 74,854
18. Herbert Van de Sompel
Top Target baseURLs for shouldbeDOI Resources, 1997-2012
arXiv              | PMC
ams.org            | biomedcentral.com
adsabs.harvard.edu | scripts.iucr.org
link.aps.org       | ncbi.nlm.nih.gov
stacks.aip.org     | frontiersin.org
link.aip.org       | ccforum.com
emis.de            | nar.oxfordjournals.org
springerlink.com   | nature.com
jstor.org          | elsevier.com
ncbi.nlm.nih.gov   | jcb.org
sciencemag.org     | jmir.org
19. Herbert Van de Sompel
• A Disconcerting Observation
• A Proposed Fix Using Signposting
• Signposting, The Bigger Picture
• Additional Signposting Patterns
Outline
20. Herbert Van de Sompel
• The PID URI is not in the browser’s address bar when at:
  • The landing page
  • The PDF
  • The dataset
  • Any web resource that is part of the PID-identified object
• Status quo:
  • Provide the PID URI in a copy/paste-able manner in the landing page
  • Provide the PID URI in a downloadable citation
  • Embed the PID URI in an XMP container
• Desired: the ability for tools to uniformly discover the PID URI when at any web resource that is part of a PID-identified object
Status Quo
21. Herbert Van de Sompel
HTTP Links
Mark Nottingham (2010) RFC5988: Web Linking.
http://tools.ietf.org/rfc/rfc5988.txt
22. Herbert Van de Sompel
HTTP Links
23. Herbert Van de Sompel
HTTP Links
24. Herbert Van de Sompel
HTTP Links Are Used
curl -I http://dbpedia.org/data/Reykjavik
HTTP/1.1 200 OK
Date: Thu, 27 Oct 2016 04:43:28 GMT
Content-Type: application/rdf+xml; charset=UTF-8
Content-Length: 1210
Link: <http://creativecommons.org/licenses/by-sa/3.0>; rel="license",
  <http://dbpedia.org/data/Reykjavik>; rel="alternate"; type="text/n3",
  <http://dbpedia.org/resource/Reykjavik>; rel="describes",
  <http://mementoarchive.lanl.gov/dbpedia/timegate/http://dbpedia.org/data/Reykjavik>; rel="timegate"
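A client can turn a Link header like the DBpedia one above into structured data with very little code. The following is a minimal sketch, not a full RFC 5988 parser: it assumes parameter values contain no commas or semicolons, and the function and constant names are illustrative.

```python
import re

# One link-value: "<target>" followed by zero or more ";param" parts.
# Simplifying assumption: parameter values contain no "," or ";".
LINK_VALUE = re.compile(r"<([^>]*)>((?:\s*;\s*[^;,]+)*)")

# The Link header value from the DBpedia response, joined onto one line.
DBPEDIA_LINK = (
    '<http://creativecommons.org/licenses/by-sa/3.0>; rel="license", '
    '<http://dbpedia.org/data/Reykjavik>; rel="alternate"; type="text/n3", '
    '<http://dbpedia.org/resource/Reykjavik>; rel="describes", '
    '<http://mementoarchive.lanl.gov/dbpedia/timegate/'
    'http://dbpedia.org/data/Reykjavik>; rel="timegate"'
)


def parse_link_header(value):
    """Parse a Link header value into a list of (target, params) pairs."""
    links = []
    for target, raw_params in LINK_VALUE.findall(value):
        params = {}
        for part in raw_params.split(";"):
            if "=" in part:
                name, _, val = part.partition("=")
                params[name.strip().lower()] = val.strip().strip('"')
        links.append((target, params))
    return links


for target, params in parse_link_header(DBPEDIA_LINK):
    print(params.get("rel"), "->", target)
```

Each link is reduced to its target URI and a dictionary of parameters, so a client can select links by relation type without inspecting the response body.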
29. Herbert Van de Sompel
• Registered in IANA registry
  • Strings, e.g. license, alternate, describes, timegate
  • Requires a formal specification, e.g. RFC
  • Typically used for common relationships, generically specified
  • Provides broad, coarse grained interoperability
• Minted by a community
  • URIs, e.g. http://xmlns.com/foaf/0.1/primaryTopic
  • Requires community agreement
  • Can be as specific as desired
  • Can provide community-specific, fine grained interoperability
HTTP Link Relation Types
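The two kinds of relation types can be told apart by their shape alone: registered types are short lowercase tokens, community-minted types are URIs. The sketch below classifies a relation type syntactically, following the token grammar in RFC 5988. It does not check whether a token is actually present in the IANA registry, and the function name and return labels are illustrative.

```python
import re

# RFC 5988 registered relation types: a lowercase letter followed by
# lowercase letters, digits, "." or "-".
REGISTERED_SHAPE = re.compile(r"^[a-z][a-z0-9.-]*$")

# A URI starts with a scheme followed by ":".
URI_SHAPE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def relation_kind(rel):
    """Classify a relation type by its syntactic shape only."""
    if REGISTERED_SHAPE.match(rel):
        return "registered-style token"
    if URI_SHAPE.match(rel):
        return "extension (URI)"
    return "invalid"


print(relation_kind("license"))
print(relation_kind("http://xmlns.com/foaf/0.1/primaryTopic"))
```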
30. Herbert Van de Sompel
Proposal: Use HTTP Link with identifier Relation Type
curl -I http://www.dlib.org/dlib/november15/vandesompel/11vandesompel.html
HTTP/1.1 200 OK
Date: Wed, 26 Oct 2016 12:36:37 GMT
Server: Apache/2.2.15 (CentOS)
Last-Modified: Thu, 19 Nov 2015 14:50:19 GMT
ETag: "205a5e-f5ef-524e5e0ab80c0"
Accept-Ranges: bytes
Content-Length: 62959
Content-Type: text/html; charset=UTF-8
Link: <https://doi.org/10.1045/november2015-vandesompel>; rel="identifier"
Michael Nelson and Herbert Van de Sompel (2016) Linking to Persistent Identifiers with rel=“identifier”
http://ws-dl.blogspot.nl/2016/11/2016-11-07-linking-to-persistent.html
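On the client side, a bookmarking tool or reference manager could look for the identifier link before storing a URI. The following is a rough sketch under stated assumptions: the function names are hypothetical, the Link parsing is simplified (parameter values without commas or semicolons), and the HEAD request raises an error on servers that reject HEAD, in which case a real tool would need to fall back to GET.

```python
import re
import urllib.request

# Simplified link-value and rel-parameter patterns (assumptions, see above).
LINK_VALUE = re.compile(r"<([^>]*)>((?:\s*;\s*[^;,]+)*)")
REL_PARAM = re.compile(r';\s*rel\s*=\s*"?([^";]+)"?', re.IGNORECASE)


def find_identifier(link_header):
    """Return the target of the first link whose rel includes "identifier"."""
    for target, params in LINK_VALUE.findall(link_header or ""):
        rel = REL_PARAM.search(params)
        # A rel value may hold several space-separated relation types.
        if rel and "identifier" in rel.group(1).lower().split():
            return target
    return None


def persistent_uri_for(uri):
    """HEAD a resource and prefer its advertised PID URI over the given URI."""
    request = urllib.request.Request(uri, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        # Join repeated Link header fields into one comma-separated value.
        link_header = ", ".join(response.headers.get_all("Link") or [])
    return find_identifier(link_header) or uri
```

If no identifier link is present, `persistent_uri_for` returns the URI it was given, so the tool behaves as it does today for resources that do not follow the pattern.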
31. Herbert Van de Sompel
Proposal: Use HTTP Link with identifier Relation Type
http://signposting.org/identifier/dryad/
32. Herbert Van de Sompel
• Can uniformly be used for all MIME types
• Accessible via HTTP HEAD (no content transfer):
  • Works for large resources
  • Can work for restricted content
  • Unbelievable but true: many publishers don’t support HEAD
• In many cases, HTTP identifier links can be implemented using simple URI rewrite rules in the web server
  • The URIs of web resources that are part of a PID-identified object often contain the PID
• Allows addressing many other patterns using basic technology
HTTP Links Are Pretty Neat
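When the PID is already embedded in a resource's URI, deriving the identifier link is a pure string rewrite. The sketch below shows the idea as a Python function rather than as actual web server rewrite rules. The `/article/<doi>/<part>` path layout is a hypothetical example, and the pattern assumes the DOI suffix contains no further slashes; real publisher URI layouts differ.

```python
import re

# Hypothetical URL layout: /article/<doi>/<part>, e.g. ".../fulltext.pdf".
# Assumes the DOI suffix itself contains no "/".
RESOURCE_PATH = re.compile(r"^/article/(10\.\d{4,9}/[^/]+)(?:/.*)?$")


def identifier_link_header(path):
    """Return a Link header value pointing at the DOI URI, or None."""
    match = RESOURCE_PATH.match(path)
    if match is None:
        return None
    return f'<https://doi.org/{match.group(1)}>; rel="identifier"'


print(identifier_link_header("/article/10.1007/s00799-014-0108-0/fulltext.pdf"))
```

The same function serves the landing page, the PDF, and any other resource under the article path, which is what makes the link uniformly discoverable across MIME types.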
33. Herbert Van de Sompel
• A Disconcerting Observation
• A Proposed Fix Using Signposting
• Signposting, The Bigger Picture
• Additional Signposting Patterns
Outline
34. Herbert Van de Sompel
Signposting the Scholarly Web
http://signposting.org
35. Herbert Van de Sompel
Herbert Van de Sompel and Michael L. Nelson (2015) Reminiscing about 15 years of interoperability efforts.
https://doi.org/10.1045/november2015-vandesompel
Reminiscing About Interoperability for Scholarly Communication
36. Herbert Van de Sompel
I Have Done My Fair Share
OAI-PMH
OAI-ORE
Memento
Shared Canvas
info URI
Open Annotation
ResourceSync
Robust Links
OpenURL
37. Herbert Van de Sompel
• A highly distributed activity
• Try turning this distributed activity from a gathering of silos into an
ecology of collaborating repositories
• In the web context, this seems like a rather unique challenge
• Most web enterprises want dominance, not collaboration
• Interoperability as an enabler to connect resources from distributed
repositories
• Repositories expose uniform behaviors
• Multiple parties can interact uniformly with (resources of) these
repositories to create added-value
Research Communication on the Web
38. Herbert Van de Sompel
Tools of the Web-Centric Interoperability Trade
• Resource
• URI
• HTTP as the API: HEAD/GET, POST, PUT, DELETE
• Representation
• Media Type
• Link
• Content Negotiation, e.g. for preferred Media Type
• Typed Link
• Controlled Vocabularies for Typed Links
W3C Architecture of the World Wide Web
39. Herbert Van de Sompel
Tools of the Web-Centric Interoperability Trade – RDF Stack
• Resource
• URI
• HTTP as the API: HEAD/GET, POST, PUT, DELETE
• Representation
• Media Type
• Link
• Content Negotiation, e.g. for preferred Media Type
• Typed Link
• Controlled Vocabularies for Typed Links
RDF, RDFS, OWL
W3C Architecture of the World Wide Web
40. Herbert Van de Sompel
Used by various interoperability efforts, e.g. OAI-ORE, Open Annotation,
W3C PROV, Research Objects, …
• Provides extensive expressiveness for description
• Typically based on publishing documents that adhere to a certain
“profile” and reveal relations, properties, …
• Non-trivial barrier to entry, as illustrated by slow adoption, likely related to an unfamiliar technology stack
Interoperability via RDF, RDFS, OWL Stack
41. Herbert Van de Sompel
Tools of the Web-Centric Interoperability Trade – HTTP Stack
• Resource
• URI
• HTTP as the API
• Representation
• Media Types
• Link
• Content Negotiation, e.g. for preferred Media Type
• Typed Link
• Controlled Vocabularies for Typed Links
HTTP Links, IANA link relation registry, community link relation types
HATEOAS – Hypermedia As The Engine Of Application State
http://en.wikipedia.org/wiki/HATEOAS
W3C Architecture of the World Wide Web
42. Herbert Van de Sompel
Used by Memento, ResourceSync, Signposting the Scholarly Web:
• Provides coarse expressiveness for navigation via IANA registered
relation types (expressed as reserved terms)
• Finer grained expressiveness via community-defined relation
types (expressed as HTTP URIs)
• Typically based on publishing typed links that support a client to
navigate among resources in an informed manner
• Low implementation barrier because of familiar technology stack
Interoperability via HTTP Links, IANA Link Relation Types
43. Herbert Van de Sompel
• A Disconcerting Observation
• A Proposed Fix Using Signposting
• Signposting, The Bigger Picture
• Additional Signposting Patterns
Outline
44. Herbert Van de Sompel
• Identifier pattern
• Publication boundary pattern
• Bibliographic metadata pattern
Currently at signposting.org
45. Herbert Van de Sompel
Publication Boundary Pattern
http://signposting.org/publication_boundary/oxford/
46. Herbert Van de Sompel
Bibliographic Metadata Pattern
http://signposting.org/bibliographic_metadata/springer/
47. Herbert Van de Sompel
Bibliographic Metadata Pattern
http://signposting.org/conventions/
48. Herbert Van de Sompel
Use Case: Resource Capture for Digital Preservation
Herbert Van de Sompel, David Rosenthal, and Michael L. Nelson (2015) Web Infrastructure to Support e-Journal
Preservation (and More). http://arxiv.org/abs/1605.06154
49. Herbert Van de Sompel
• Author pattern
• author link from DOI URI to ORCID URI
• author link from landing page to ORCID URI
• License pattern
• license link from web resources that are part of a scholarly
object to the appropriate license URI
• Resource type pattern
• type relation type on the web resource itself
• sem-type attribute on links to a web resource
• URIs to express resource types
• Which? How coarse/fine grained?
Expected at signposting.org
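As a rough illustration of how the author and license patterns could sit alongside the identifier pattern, the sketch below serializes several typed links into a single Link header value. The combination of URIs is illustrative only: the license URI is a placeholder on example.org, and the helper name is an assumption. The resource type pattern is left out because its details are still open.

```python
def format_link_header(links):
    """Serialize (target, relation type) pairs into one Link header value."""
    return ", ".join(f'<{target}>; rel="{rel}"' for target, rel in links)


# Illustrative links for a landing page; the license URI is a placeholder.
landing_page_links = [
    ("https://doi.org/10.1045/november2015-vandesompel", "identifier"),
    ("http://orcid.org/0000-0002-0715-6126", "author"),
    ("https://example.org/licenses/some-license", "license"),
]

print("Link: " + format_link_header(landing_page_links))
```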
50. Herbert Van de Sompel
Resource Type Pattern
51. Herbert Van de Sompel
Cartoon by: Patrick Hochstenbach
Herbert Van de Sompel
LANL & DANS
@hvdsomp
http://orcid.org/0000-0002-0715-6126
Acknowledgments: Geoff Bilder, Shawn Jones, Martin Klein, Michael L. Nelson, David Rosenthal, Harihar Shankar, Simeon Warner, Karl Ward, Joe Wass
A Signposting Pattern for PIDs
http://signposting.org
Signposting is funded by the Andrew W. Mellon Foundation
Editor's Notes
#21: Few formats support an XMP container
Few tools read the XMP container
#35: Under the umbrella “Signposting the Scholarly Web”, I have started to pursue this interoperability approach for very common patterns in scholarly communication.
#42: The principle is that a client interacts with a network application entirely through hypermedia provided dynamically by application servers.
A REST client needs no prior knowledge about how to interact with any particular application or server beyond a generic understanding of hypermedia.