Content-addressable RDF
11. June 2020 (v0.1)
Most existing RDF content is location-addressed. The IRIs are pointers to hosts that hold the content. If the host goes down the content is no longer available. This happens frequently enough to seriously undermine the robustness of systems relying on RDF. In a content-addressable storage, data is referred to by an identifier determined by the data itself. Data can be replicated naturally and robustness of the system does not rely on individual hosts.
We propose a scheme that allows RDF to be content-addressed by presenting a grouping of RDF statements and a canonical representation that enables the grouping of statements to be referred to by an unique identifier determined exactly by the content of the statements. The scheme is compatible with existing RDF data and ontologies, allows the usage of RDF in decentralized systems and can be used to solve the problem of authenticity of RDF data.
Table of Contents
1 Introduction
Must a name mean something?
—Lewis Caroll, Trough the Looking Glass
The Resource Description Framework (RDF) is a data model for structured content. The data is modeled as graph with nodes and edges labeled with unique identifiers that are at the same time references to further data (Uniform Resource Identifier - URI).
Together with foundational principles such as the open-world assumption (no single agent has complete knowledge) and the unique name assumption (the same thing has the same name regardless of context) this graph structure makes RDF well-suited for decentralized systems.
Nevertheless, RDF and the Semantic Web (the vision behind RDF) have failed to provide a robust and decentralized foundation. The reasons for this has been argued to include low availability of content and challenges in scaling systems querying large data sets [Polleres_Kamdar_Fernández_Tudorache_Musen_2020].
Previous work towards more robustness has mostly focused on developing custom distributed systems that enable semantic queries [Cai_2004,Marx_2018,Aebeloe_2019,Kaoudi_2010,Karnstedt_2007,Verborgh_2016].
To improve robustness, we propose a scheme that allows content-addressable RDF. RDF data sets can be addressed by an unique identifier that is deterministically computed from the data. Compared to location-based identifiers (URLs) the availability of data is no longer dependent on a single host. This makes RDF much more appealing for decentralized systems.
We introduce a grouping of RDF statements (Fragment Graph) that allows a simple and efficient canonical representation. The hash digest of this canonical representation can be used as a unique reference for the statements. This enables content-addressed RDF.
We show how this scheme is compatible with existing RDF data and ontologies as well as with existing tooling. Furthermore we demonstrate how content-addressed RDF can be used over existing decentralized protocols.
2 Grouping RDF statements
The core structure of RDF [Wood_2014] is a triple consisting of two nodes connected with a labeled edge. The node from where the edge goes out is called the subject, the edge is labeled with a predicate and the node with the in-going edge is called the object:
Figure 1: A triple
There are three kinds of nodes: IRIs (an internationalized URI), literals and blank nodes. IRIs are reusable identifiers for properties and things, literals are nodes holding content such as an integer or a string. See the section on blank nodes for more details about blank nodes.
A subject may be either an IRI or a blank node. A predicate is always an IRI and the object may be an IRI, a literal or a blank node. Note that subject, predicate, object only denotes the position of a node in a triple. We denote a triple with \((s,p,o)\).
A set of triples form a graph. For example a note with an attached image about a clown performance might be described as following graph:
Figure 2: Two clowns ride a bicycle trough a loop
Round nodes are IRIs and boxed nodes are literals. Note that we abbreviate the IRIs for the ActivityStreams and Schema vocabularies with as:
and schema:
.
The graph is a combination of statements (triples) that have three distinct origins:
- The actor Alice (
https://test.example/alice
) - The note attributed to Alice (
https://test.example/notes/1
) - The description of the circus performance (
https://circus.show/performance-in-funky-town
)
The graph structure allows a very natural composition of these three different objects and shows the relationships between them.
However, once the statements are combined there is no clear way of splitting the graph back into the sub-graphs that were combined. There is no way of expressing “ownership” of triples1.
2.1 Fragment Graph
We define a grouping of RDF triples based on the URL fragment:
- Fragment Graph
- Given a base subject \(s\) that is an IRI without a URL fragment part, the Fragment Graph of the base subject \(s\) is the union of:
- All triples of form \((s,p,o)\) where \(o\) is either an IRI or a literal
- All triples of form \((f,p,o)\) where \(f\) is an IRI that is a fragment of the base subject \(s\) and \(o\) is either an IRI or a literal
For example http://example.com/something
does not have a fragment part and could be the base subject of a Fragment Graph. https://example.com/something#that-thing
is a fragment of http://example.com/something
and would be included in its Fragment Graph.
In the example above there are three Fragment Graphs with base subjects:
https:/test.example/notes/1
https://test.example/alice
https://circus.show/performance-in-funky-town
Note that https://test.example/images/1.jpg
does not have a fragment part but is never in subject position.
Figure 3: Three Fragment Graphs with base subjects colored cyan
This grouping reflects how RDF content is commonly published on HTTP servers. A single request to the base subject will return the base subject and all fragments.
The concept of Fragment Graph is very similar to the Minimum Self-Contained Graph (MSG) introduced by Tummarello et. al. [Tummarello_2005]. However we follow URL fragments instead of blank nodes. Blank nodes are not permitted in an Fragment Graph.
2.2 On Blank Nodes
Blank Nodes are special nodes that act like existential variables in mathematical logic [Mallea_2011]. In practice they tend to make things complicated [Cyganiak_2011,Carroll_2003] and we follow previous suggestions to not use blank nodes [Bizer_2007].
In order to use existing RDF data that does contain blank nodes we use Skolemization - we replace blank nodes with fresh IRIs. We suggest replacing blank nodes with random UUID URNs [Leach_2005] that can be readily generated locally.
The problem is that existing content that contains blank nodes will not receive the same identifier as replacing blank nodes is not a deterministic process. See the section on caching existing RDF data on how this can be handled.
3 Serialization for Content-addressable RDF
We present a serialization of Fragment Graphs for content-addressing based on Canonical S-expressions [Rivest_1997].
S-Expressions are data structures for representing complex data. They are either symbols, strings or lists of simpler S-Expressions. A list is represented by using parenthesis to surround the elements of the list. A list with a symbol in the first position is called a form.
The Fragment Graph for the base subject https://test.example/notes/1
can be represented with a rdf
-form:
(rdf (s "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" "https://www.w3.org/ns/activitystreams#Note") (s "https://www.w3.org/ns/activitystreams#Note" (l "The clowns just went trough the loop!" "http://www.w3.org/1999/02/22-rdf-syntax-ns#langString" "en")) (s "https://www.w3.org/ns/activitystreams#image" (f "image")) (fs "image" "https://www.w3.org/ns/activitystreams#url" "https://test.example/images/1.jpg") (fs "image" "http://www.w3.org/1999/02/22-rdf-syntax-ns#type" "https://www.w3.org/ns/activitystreams#Image"))
Note that the base subject (https://test.example/notes/1
) is not part of the representation. If it were included it would not be possible to compute a deterministic identifier for the base subject from the representation.
3.1 Forms
The proposed serialization of a Fragment Graph has 5 different forms:
rdf
- Top-level form identifying RDF statement. Elements of an
rdf
-form ares
-forms orfs
-forms. There is only onerdf
form per representation of Fragment Graph. s
- A statement form represents a triple with the base subject of the Fragment Graph as subject. It contains exactly two elements:
- The predicate which may be an IRI as string or a
f
-form. - The object which may be an IRI, a
f
-form or al
-form.
- The predicate which may be an IRI as string or a
l
- A literal form represents a literal node. It contains either two or three elements:
- The canonical lexical form [Peterson_2012] of the value.
- The datatype IRI
- If an only if the datatype IRI is
http://www.w3.org/1999/02/22-rdf-syntax-ns#langString
a third element is present which is the language tag.
f
- A fragment reference form represents an IRI to a fragment. It contains exactly one element which is the fragment identifier.
fs
- A fragment-statement form represents a triple with a fragment as subject. It contains exactly three elements:
- The fragment identifier
- The predicate which may be an IRI as string or a
f
-form. - The object which may be an IRI, a
f
-form or al
-form.
Given a base subject \(s\), every s
and fs
form can be translated to a RDF triple. The set of RDF triples in the rdf
form is the Fragment Graph for \(s\).
3.2 Canonical S-Expression
The canonical representation of an rdf
form is the UTF-8 encoded Canonical S-Expression [Rivest_1997] where the s
and fs
forms in the rdf
form are sorted based on their (binary) Canonical S-expression representation.
The Canonical S-expression is a binary format. Strings and symbols (as strings) are encoded as sequences of UTF-8 bytes. Every sequence of bytes is pre-pended with the length of the sequence and a colon (this encoding is called Netstrings [Bernstein_1997]). Lists are encoded with surrounding parenthesis.
For readability purposes we will show the string representation in the following examples and not the UTF-8 encoded binary representation which is the canonical representation.
The string “Hello World” is encoded as following:
11:Hello World
An s
form is encoded as following:
(1:s43:https://www.w3.org/ns/activitystreams#image(1:f5:image))
Canonical S-Expressions are very easy to parse and for every S-Expression there is an uniquely defined Canonical S-Expression. The representation is not optimized for being pretty or human readable.
The Canonical S-expression of the Frament Graph for https://test.example/notes/1
is (as UTF-8 string):
(3:rdf(1:s10:as:content(1:l37:The clowns just went trough the loop!53:http://www.w3.org/1999/02/22-rdf-syntax-ns#langString2:en))(1:s10:as:context45:https://circus.show/performance-in-funky-town)(1:s15:as:attributedTo26:https://test.example/alice)(1:s8:as:image34:https://test.example/notes/1#image)(1:s8:rdf:type7:as:Note)(2:fs5:image6:as:url33:https://test.example/images/1.jpg)(2:fs5:image8:rdf:type8:as:Image))
To compute the Canonical S-Expression it is necessary to sort the Canonical S-expressions of the s
and fs
forms. Note that sorting is done on the UTF-8 encoded sequence of bytes.
3.3 Comparison to other RDF serialization formats
The presented serialization is optimized for easy parsing, having a unique canonical representation and being usable for content-addressable RDF.
To illustrate the advantage of an optimized serialization format consider the Turtle [Becket_2014] serialization of the Fragment Graph for the note:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix schema: <https://schema.org/> . @prefix as: <https://www.w3.org/ns/activitystreams> . <> a as:Note ; as:attributedTo <https://test.example/alice> ; as:content "The clowns just went trough the loop!"@en ; as:image <https://test.example/notes/1#image> ; as:context <https://circus.show/performance-in-funky-town> . <#image> a as:Image ; as:url <https://test.example/images/1.jpg> .
The serialization is readable and fairly easy to parse and understand for humans. Note also that the base subject IRI is not part of the serialization. When parsing the Turtle document it needs to be specified as “base IRI”.
However this is not the only representation of the same set of statements: The choice of prefixes is arbitrary, the nesting of statements and the order of statements can be changed.
It is possible to clearly define rules on how to do make these choices and define a canonical Turtle representation for a set of RDF statements [Carroll_2003,Longley_2019]. However, this is complicated. Furthermore existing implementations of Turtle encoders can not be used as they usually do not provide fine control for exactly tuning how the statements are encoded.
We believe that the serialization based on Canonical S-Expressions is simpler, efficient and can be straight-forwardly implemented.
An S-Expression based serialization of RDF has previously been proposed [Prodromou_2006].
4 Content-addressable RDF
Given a Fragment Graph (a grouping of RDF statements), the canonical representation can be found. We can compute the hash of the canonical representation. This hash can be used as an identifier for the statements in the Fragment Graph.
For example the Fragment Graph of the note can be hashed with the Blake2b cryptographic hash [aumasson2013blake2]. The hash value can be Base32 encoded and used as an Uniform Resource Name (URN):
urn:blake2b:FBRTEWJUSPW2EDMMITZKWAB5TDBOVT2VOLWWIRUFZFBR72YADB2Q
Note that this URN is a valid IRI and can be used in existing RDF data and tools.
For example the statements can be serialized as Turtle:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix schema: <https://schema.org/> . @prefix as: <https://www.w3.org/ns/activitystreams> . <urn:blake2b:FBRTEWJUSPW2EDMMITZKWAB5TDBOVT2VOLWWIRUFZFBR72YADB2Q> a as:Note ; as:attributedTo <https://test.example/alice> ; as:content "The clowns just went trough the loop!"@en ; as:image <https://test.example/notes/1#image> ; as:context <https://circus.show/performance-in-funky-town> . <urn:blake2b:FBRTEWJUSPW2EDMMITZKWAB5TDBOVT2VOLWWIRUFZFBR72YADB2Q#image> a as:Image ; as:url <https://test.example/images/1.jpg> .
The data can be transferred over HTTP, be queried on an SPARQL endpoint or be stored in a HDT [fernandez2013binary] file. Regardless of conversions to other serialization formats, transport over networks the URN remains same and can be readily recomputed given all the statements.
4.1 Caching existing RDF data
Plenty of content is already published using location-addressed identifiers (URLs). Such content can be grouped appropriately and made content-addressable (as shown with the ActivityStreams note).
However, it makes sense to keep track of the original source of the content. We propose doing this by adding some more information describing how and when the content was cached.
Consider following vocabulary for describing cached content that builds on the Dublin Core ProvenanceStatement:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix dcterms: <http://purl.org/dc/terms/> . @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix voaf: <http://purl.org/vocommons/voaf#> . <> dcterms:title "Cache Vocabulary"@en ; a voaf:Vocabulary . <#Cache> a rdfs:Class ; rdfs:label "Cache" ; rdfs:subClassOf dcterms:ProvenanceStatement . <#creator> a rdf:Property ; rdfs:label "Creator" ; rdfs:comment "The entity that created the cache" ; rdfs:subPropertyOf dcterms:creator . <#cachedAt> a rdf:Property ; rdfs:label "Date Cached" ; rdfs:subPropertyOf dcterms:created . <#source> a rdf:Property ; rdfs:label "Source" ; rdfs:comment "The source of the cache." ; rdfs:subPropertyOf dcterms:source .
The URN of the vocabulary is:
urn:erisx:AAAABDK5YTJKPC4O4AHTEDQBRLZXYCZO6E2NW6E5GCY4SRAQ5FLHI4BRZTILMJMLRA6VBDJ5XK7NGYSBJV47GKZTZXHTPV73THA2C4FLQJMA
Note that we used the ERIS encoding to compute an identifier instead of simply computing the hash.
We can add cache information to existing RDF data:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix schema: <https://schema.org/> . @prefix as: <https://www.w3.org/ns/activitystreams> . @prefix dcterms: <http://purl.org/dc/terms/> . @prefix cache: <urn:erisx:AAAABDK5YTJKPC4O4AHTEDQBRLZXYCZO6E2NW6E5GCY4SRAQ5FLHI4BRZTILMJMLRA6VBDJ5XK7NGYSBJV47GKZTZXHTPV73THA2C4FLQJMA#> . <> a as:Note ; as:attributedTo <https://test.example/alice> ; as:content "The clowns just went trough the loop!"@en ; as:image <https://test.example/notes/1#image> ; as:context <https://circus.show/performance-in-funky-town> ; dcterms:provenance <#cache> . <#image> a as:Image ; as:url <https://test.example/images/1.jpg> . <#cache> a cache:Cache ; cache:created "2020-06-07"^^<http://www.w3.org/2001/XMLSchema#date> ; cache:creator <https://inqlab.net/pukkamustard> ; cache:source <https://test.example/note/1> .
We can compute the hash of this Fragment Graph to get a URN:
urn:blake2b:L4X4DH7Q3UD5ZQIVES7CNNPL7KUPCIQU43SVX56G4RDPPG4GVJJQ
Note that the URN has changed because we added the information about how the content was cached. The URN will change again when we cache the content tomorrow. This reflects the fact that the original content (at https://test.example/note/1
) can change at any time and an immutable content-addressable version of it is only a cache that may be outdated.
Considering the fact that location-addressed content can never be brought into a single canonical form deterministically (it depends at least on when content is accessed) makes the problem that RDF data containing blank nodes cannot be deterministically skolemized (see On Blank Nodes) not so bad.
Location-addressed content that is made content-addressable is always a view of the original data at a certain time from a certain perspective - a cached version. Adding this information to the cached version seems to be a good idea.
4.2 Existing ontologies
The scheme presented can also be used for content-addressable ontologies (definition of classes and properties).
For example the Lingvoj Ontology defines classes and properties to describe languages and their use.
The entire vocabulary is a Fragment Graph (with base subject https://lingvoj.org/ontology
). We can compute the hash from the canonical representation to get a URN that identifies the vocabulary:
urn:blake2b:AKG576SYQJAY7RY3EGZ3KZQG37NDTCJMZTH7IT4YWXYTW36WVNWQ
This URN can be used instead of the URL http://www.lingvoj.org/ontology
:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix lingvoj: <urn:blake2b:AKG576SYQJAY7RY3EGZ3KZQG37NDTCJMZTH7IT4YWXYTW36WVNWQ> . @prefix as: <https://www.w3.org/ns/activitystreams> . <https://test.example/alice> a as:Person ; lingvoj:nativeLanguage <http://www.wikidata.org/entity/Q143> .
4.3 Example: Javascript web application
We have implemented a JavaScript web application demonstrating how existing RDF content can be made content-addressable: Encoding for Robust Immutable Storage (ERIS).
The example uses ERIS, an encoding of content that improves on naive content-addressing.
The demo also shows how existing decentralized networks (e.g. IPFS) can be used to robustly store content-addressed RDF.
4.4 Signing Fragment Graphs
Proving authenticity of content-addressed RDF reduces to signing the identifier. We propose a vocabulary for describing Ed25519 keys and signatures that can be used to sign Fragment Graphs.
5 Conclusion
We present a scheme for making RDF content-addressable. We believe that this may be used to improve robustness of RDF based systems as well as pave the way for decentralized applications to embrace RDF as data model.
Related current work includes Linkd Data Proofs by the W3C Digital Verification Community Group. In contrast to Linked Data Proofs we emphasize the usefulness of content-addressable RDF by itself. The ability to sign and prove authenticity of content follows very directly from content-addressability (see Signing Fragment Graphs).
We intend to implement the work presented in an ActivityPub Server enabling offline-first and decentralized applications in the context of the openEngiadina project.
This work has been supported by the NLNet Foundation trough the NGI0 Discovery Fund.
Bibliography
- [Polleres_Kamdar_Fernández_Tudorache_Musen_2020] Polleres, Kamdar, Fernández, Javier David, Tudorache & Musen, A more decentralized vision for Linked Data, Semantic Web, 11(1), 101–113 (2020). link. doi.
- [Cai_2004] Cai & Frank, RDFPeers, Proceedings of the 13th conference on World Wide Web - WWW ’04, (2004). link. doi.
- [Marx_2018] Marx, Saleem, Lytra, Ngomo & Axel-Cyrille Ngonga, A Decentralized Architecture for SPARQL Query Processing and RDF Sharing: A Position Paper, 2018 IEEE 12th International Conference on Semantic Computing (ICSC), (2018). link. doi.
- [Aebeloe_2019] Aebeloe, Montoya & Hose, A Decentralized Architecture for Sharing and Querying Semantic Data, Lecture Notes in Computer Science, 3–18 (2019). link. doi.
- [Kaoudi_2010] Kaoudi, Koubarakis, Kyzirakos, , Miliaraki, Magiridou, Papadakis-Pesaresi & Antonios, Atlas: Storing, updating and querying RDF(S) data on top of DHTs, Journal of Web Semantics, 8(4), 271–277 (2010). link. doi.
- [Karnstedt_2007] Karnstedt, Sattler, Richtarsky, , Muller, Hauswirth, Schmidt, & John, UniStore: Querying a DHT-based Universal Storage, 2007 IEEE 23rd International Conference on Data Engineering, (2007). link. doi.
- [Verborgh_2016] Verborgh, Vander Sande, Hartig, , Van Herwegen, De Vocht, De Meester, Ben, Haesendonck & Colpaert, Triple Pattern Fragments: A low-cost knowledge graph interface for the Web, Journal of Web Semantics, 37-38, 184–206 (2016). link. doi.
- [Wood_2014] Wood, Lanthaler & Cyganiak, RDF 1.1 Concepts and Abstract Syntax, W3C Recommendation, (2014). link.
- [Tummarello_2005] Tummarello, Morbidoni, Puliti, Paolo & Piazza, Signing individual fragments of an RDF graph, Special interest tracks and posters of the 14th international conference on World Wide Web - WWW ’05, (2005). link. doi.
- [Mallea_2011] Mallea, Arenas, Hogan, & Polleres, On blank nodes, 421-437, in in: International semantic web conference, edited by (2011)
- [Cyganiak_2011] Cyganiak, Blank nodes considered harmful (2001)
- [Carroll_2003] Carroll, Signing RDF Graphs, The Semantic Web - ISWC 2003, 369–384 (2003). link. doi.
- [Bizer_2007] Bizer & others, How to publish linked data on the web (2007)
- [Leach_2005] Leach, Mealling & Salz, RFC 4122: A universally unique identifier (UUID) URN namespace, IETF Proposed Standard, (2005). link.
- [Rivest_1997] Rivest, S-Expressions, IETF Internet-Draft, (1997). link.
- [Peterson_2012] Peterson, Gao, Malhotra, , Sperberg-McQueen, Thompson, Biron & PV, W3C XML schema definition language (XSD) 1.1 part 2: Datatypes, W3C Recommendation, (2012). link.
- [Bernstein_1997] Bernstein, Netstrings (1997)
- [Becket_2014] Beckett, Berners-Lee, Prud’hommeaux, & Carothers, RDF 1.1 Turtle, World Wide Web Consortium, (2014). link.
- [Longley_2019] Longley & Sporny, RDF Dataset Normalization, W3C Draft Community Group Report, (2019). link.
- [Prodromou_2006] Prodromou, RDF serialization to s-expressions, Blog post (via Internet Archive Wayback Machine), (2006). link.
- [aumasson2013blake2] Aumasson, Neves, Wilcox-O’Hearn, Zooko & Winnerlein, BLAKE2: simpler, smaller, fast as MD5, 119-135, in in: International Conference on Applied Cryptography and Network Security, edited by (2013)
- [fernandez2013binary] Fern\'andez, Mart\'\inez-Prieto, , Guti\'errez, Polleres & Arias, Binary RDF representation for publication and exchange (HDT), Journal of Web Semantics, 19, 22-41 (2013).
Content-addressable RDF by openEngiadina is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Footnotes:
This could be solved with quads and named graphs.