Linked-Data URI Strategy for Organisations

Typically, organisations have tended to publish data using bespoke APIs or XML schemas. Recipient systems then transform the data into their own bespoke schemas, and the unique identifiers of the key data elements are often lost.

There are huge efficiencies to be made if, instead, standard identifiers (URIs) for key data elements are published on the web. Wider use of a common identifier reduces the cost of integrating disparate data. In addition, much like a link to a document URL, a URI can be followed back to its source for more information or simply a definition.

Brand Trust & Provenance

Your URIs reflect your brand and provide an indication of trust for consumers of documents and data. Geonames is an open linked geographic dataset that draws on data sourced from providers around the world. It is becoming more widely adopted as organisations (the BBC and NYT, for example) look for sources of strong identifiers for geographic locations. This causes problems for well-trusted data providers like the Ordnance Survey (OS). OS data is used extensively within Geonames, but it is impossible to tell which data has come from OS and which has been added from another source. OS URIs exposed in the Geonames dataset would have provided a level of confidence in that subset of the Geonames data (an indication of provenance). Unfortunately, OS has been relatively slow to provide these, so at best they will have to be retrofitted into Geonames, at a cost that Geonames is unlikely to want to incur.

Relevance & Value

The more URIs are used, the more valuable they become. Companies House records all the registered companies in the UK. Accessing this data has in the past been a laborious and expensive process. By scraping Companies House data and combining it with similar data sources, OpenCorporates now provides a URI for every UK company (along with URIs for companies in many other jurisdictions). As a result, OpenCorporates has rapidly become the data hub for corporate information on the web and is in many respects more relevant than Companies House for identifying UK companies on the web. For example, a number of local councils include OpenCorporates URIs in their published spending data. Companies House has announced that it intends to publish URIs for companies, which may prove a vain attempt to displace OpenCorporates and regain its position as the central hub for corporate information.

Integration & Reach

Lowering the barrier to integration makes it more likely that a particular data service will be used or bought. MusicBrainz is an IMDB for music, and its standard offering is a free service that provides URIs. Because MusicBrainz provides URIs, it becomes increasingly cost-effective for its users, such as the BBC, to integrate new product features such as the LastFM service and Guardian music reviews through the common use of those URIs. In addition, the BBC pays to subscribe to the MusicBrainz premiere data service, as it is now dependent on MusicBrainz identifiers and data.
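
The integration benefit can be sketched in a few lines. Assume two independently published datasets that both key their records on the same artist URI. The URIs, field names and values below are invented for illustration and are not real MusicBrainz, BBC or Guardian data.

```python
# Hypothetical records from two independent publishers, keyed on a shared URI.
ARTIST = "http://example.org/artist/1234"  # invented identifier

playlist = [
    {"artist": ARTIST, "track": "Example Song"},
    {"artist": "http://example.org/artist/9999", "track": "Other Song"},
]
reviews = [
    {"artist": ARTIST, "review": "A strong debut."},
]


def join_on_uri(left, right, key="artist"):
    """Merge records from two sources wherever their URI values match exactly."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    # Rows on the left with no matching URI on the right are dropped.
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


merged = join_on_uri(playlist, reviews)
```

No mapping table or schema translation is needed here: the shared URI is the only thing the two sources have to agree on.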

Some Technical Considerations

Instance data URIs should be designed with only the minimum amount of embedded information required so that they:

  • encapsulate the primary key (or possibly an alternate key) of the business domain
  • can be logically separated into API endpoints sufficient to meet the requirements of your use cases
  • can be routed within your physical architecture to the correct destination (the physical data silo)

Verbose semantics should not be embedded in linked data URIs, because developers and consumers should not be encouraged to infer semantic information from the URI. The URI is just an identifier that must fulfil the three conditions described above.

The semantics are delivered through the underlying data referenced by the URI. If a URI has been minted with embedded information and the semantics of the underlying resource change over time, you may end up with a URI that no longer reflects the resource it refers to.
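
As a minimal sketch of this principle, the functions below mint an instance URI from nothing more than an endpoint segment and a primary key, and recover exactly those two values again. The domain, the "company" endpoint and the key are hypothetical; a real scheme would follow your own business keys.

```python
from typing import Tuple
from urllib.parse import quote, unquote, urlsplit

BASE = "http://data.example.org"  # hypothetical publisher domain


def mint_uri(collection: str, key: str) -> str:
    """Build an instance URI from an endpoint segment and the primary key only."""
    return f"{BASE}/{quote(collection, safe='')}/{quote(key, safe='')}"


def parse_uri(uri: str) -> Tuple[str, str]:
    """Recover (collection, key) from a URI minted above; nothing else is inferred."""
    parts = urlsplit(uri).path.strip("/").split("/")
    if len(parts) != 2:
        raise ValueError(f"not an instance URI: {uri}")
    collection, key = parts
    return unquote(collection), unquote(key)


# Attributes such as a company's name or status are deliberately left out:
# they can change, and they belong in the data the URI resolves to.
uri = mint_uri("company", "01234567")
```

Percent-encoding the key keeps the two-segment shape intact even when a key happens to contain a slash.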

Strictly speaking, the single requirement for a data instance URI is that it encapsulates the primary key (or possibly an alternate key) of the business domain. It is in essence a data modelling unique identifier in URI format. Adding anything else, for example to ensure that an HTTP request can be correctly routed within your physical architecture, is at best an informed deployment compromise and possibly even an anti-pattern: it creates an unintentional coupling of physical and logical architecture that will cause headaches further down the road when the physical architecture changes. Unless it is absolutely unavoidable, the physical architecture should be adapted to work with your URIs rather than the other way round.
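
One way to keep the physical architecture out of the URI is to hold the routing in configuration instead. The sketch below assumes a hypothetical lookup table that maps each logical endpoint to an internal store; the host names and ports are invented.

```python
from urllib.parse import urlsplit

# Hypothetical deployment configuration: which internal store serves each
# logical endpoint. Only this table changes when the architecture changes.
BACKENDS = {
    "company": "http://store-a.internal:8080",
    "officer": "http://store-b.internal:8080",
}


def resolve(public_uri: str, backends: dict = BACKENDS) -> str:
    """Map a public instance URI to the internal address that serves it."""
    path = urlsplit(public_uri).path
    endpoint = path.strip("/").split("/", 1)[0]
    if endpoint not in backends:
        raise LookupError(f"no backend configured for {endpoint!r}")
    return backends[endpoint] + path


public = "http://data.example.org/company/01234567"
before = resolve(public)
# Moving the company data to a new store is a configuration change only.
after = resolve(public, {**BACKENDS, "company": "http://store-c.internal:9000"})
```

The public URI is identical before and after the move, so consumers who have stored it are unaffected.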