At the Press Association we have been busy building out a semantic publishing platform over the last few months. This technology allows the Press Association to semantically annotate news assets (stories and images) accurately with linked data concepts (locations, people, organisations etc.), so that PA can provide APIs for its consumers based on rich semantic and geospatial aggregations. These semantic APIs let us answer consumer questions such as “Give me all the articles about Shale Gas Fracking within 30km of Blackpool”, or “Give me all the articles about The Arab Spring that mention Barack Obama and Bashar Assad”. Through relationships in the linked data concepts, we can provide more complex answers such as “Give me all the images of living, newsworthy people born in Liverpool who are involved in politics”.
The core architecture is based on a content store (MarkLogic) that persists news assets (stories etc.) as NewsML-G2, and a triple store (OWLIM) that persists semantic metadata (RDF) describing those assets. Assets are semantically annotated using a bespoke service built with and upon the GATE platform. This is a very similar core architecture to the one built at the BBC, which is now successfully powering the BBC Sport website. Whilst any type of content store could be used effectively for this (the choice should be pragmatic, based on the type of assets one is storing), the semantic components are key to a semantic publishing architecture. One of the fundamental pieces of the jigsaw puzzle is the use of ontologies and RDF throughout the tech stack. Here I will describe some of the main software and ontology engineering concepts, and how the ontology design and the software design work together.
The SNaP ontologies have been designed to allow the simple representation and rich annotation of news assets with linked data concepts in RDF. The Asset ontology is remarkably simple because it is not intended to model the asset content; that is left to the schema that defines the content (in this case NewsML-G2). Instead, the Asset ontology is designed simply to represent and identify assets in RDF, whilst maintaining some pertinent metadata about the asset, such as its title, summary, published dates etc. These properties are enough to describe (and render) the asset in aggregations and indexes.
The Stuff ontology is designed to provide a world-model of concepts that the news assets can be annotated with. Classes of tangible stuff (People, Places and Organisations, for example) and intangible stuff are modelled with simple relationships between them. These relationships have been designed to be simple yet highly useful. Temporal relationships have been purposefully left out in order to reduce the overhead of instance data management, whilst retaining enough meaning to deliver rich semantic aggregations of news assets annotated with these concepts. The relationships are designed and named so that (accurate) statements made with them will always be true and not time constrained.
The basis for these inter-Stuff relationships is the “notablyAssociatedWith” property, which is intended as a strong relationship between stuff, hence the use of the word notable. Notable association statements should only be made when the two concepts genuinely have a strong link. Specialist non-temporal relationships (for example placeOfBirth) then inherit from notablyAssociatedWith.
As we persist all our RDF into a triple store with OWL and RDFS inferencing, statements asserted using the more specialised properties also materialise statements using the parent predicate notablyAssociatedWith. This is important. It lets us bind core functionality to this parent relationship between instances of stuff, while retaining the ability to make richer, more specific queries via the more specialised sub-properties. Binding to the notablyAssociatedWith predicate is fundamental to the software engineering in the semantic annotation service.
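To make the inheritance concrete, here is a minimal sketch in plain Python of what the triple store's inferencing does for rdfs:subPropertyOf. It is not how OWLIM is implemented; the `ex:` URIs are made up for illustration, and triples are modelled as simple tuples of strings.

```python
SUB_PROPERTY_OF = "rdfs:subPropertyOf"

def materialise_sub_properties(triples):
    """Return the input triples plus those entailed via rdfs:subPropertyOf.

    If (child, rdfs:subPropertyOf, parent) and (s, child, o) are present,
    then (s, parent, o) is added. Repeating until nothing new appears also
    covers chains of sub-properties.
    """
    inferred = set(triples)
    while True:
        parents = [(s, o) for s, p, o in inferred if p == SUB_PROPERTY_OF]
        new = {
            (s, parent, o)
            for s, p, o in inferred
            for child, parent in parents
            if p == child
        }
        if new <= inferred:  # fixpoint reached, nothing left to add
            return inferred
        inferred |= new

# Hypothetical instance data: only the specialised property is asserted.
asserted = {
    ("pns:placeOfBirth", SUB_PROPERTY_OF, "pns:notablyAssociatedWith"),
    ("ex:some-politician", "pns:placeOfBirth", "ex:Liverpool"),
}
closure = materialise_sub_properties(asserted)
```

After this step the closure also contains the `pns:notablyAssociatedWith` statement between the person and Liverpool, which is the statement that generic code can bind to.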
At PA we have built full semantic annotation rather than tagging at the document level. This means we identify the location of a term in the text that corresponds to the underlying entity. This has been built using the GATE toolkit in partnership with a development team at Ontotext.
The semantic annotation service consists of a pool of GATE-built processing pipelines that service document annotation requests via a RESTful web service API. Having a pool of pipelines lets us farm out requests concurrently to available pipelines, whilst avoiding thread safety issues within an individual pipeline. Requests are then load balanced across a cluster of servers to add redundancy as well as performance. As the semantic annotation servers are independent and stateless, it is simple to add servers to the cluster to scale horizontally if required.
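The pooling idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not the actual service (which is built on GATE); `make_pipeline` is a hypothetical stand-in for a real annotation pipeline, and the class and function names are invented here.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

class PipelinePool:
    """Lends each pipeline to one request at a time, so a single
    pipeline instance is never used by two threads concurrently."""

    def __init__(self, pipelines):
        self._idle = queue.Queue()
        for pipeline in pipelines:
            self._idle.put(pipeline)

    def annotate(self, document):
        pipeline = self._idle.get()  # blocks until a pipeline is free
        try:
            return pipeline(document)
        finally:
            self._idle.put(pipeline)  # always hand the pipeline back

def make_pipeline(name):
    # Hypothetical stand-in for a real (non thread safe) pipeline.
    def run(document):
        return {"pipeline": name, "tokens": document.split()}
    return run

pool = PipelinePool([make_pipeline(f"pipeline-{i}") for i in range(3)])
documents = ["first story", "second story text", "third"]

# More request threads than pipelines: the extra requests simply wait.
with ThreadPoolExecutor(max_workers=8) as executor:
    results = list(executor.map(pool.annotate, documents))
```

Because all state lives inside the individual pipelines, any number of servers running such a pool can sit behind a load balancer.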
One of the key requirements for the semantic annotation service was that we wanted it to be client driven and thus self learning. We did not want it to be fundamentally rules-based, which would require indefinite ongoing maintenance of knowledge rules to ensure the F1 scores remain high (>90%) within the ever-changing context of news. To meet this requirement, outside of simple JAPE grammars to match dictionary terms in the text, the key entity disambiguation and text analysis processes in the semantic annotation pipeline are based around (1) ontological proximity and (2) statistical models.
Ontological proximity disambiguates entities in the text by looking at relationships between entities that have been matched in the gazetteers. Entities that have a close ontological relationship are deemed more likely to be correct. For example, if analysis of a given document identifies both David Cameron (Prime Minister) and David Cameron (football player/manager) and also Samantha Cameron, then David Cameron (Prime Minister) will be selected because of the close ontological relationship with his wife. This is a powerful tool, and is where the value of the pns:notablyAssociatedWith property becomes apparent. By binding the software that performs disambiguation by ontological proximity only to this relationship, we gain powerful disambiguation while retaining the ability to extend our ontology (and join to other cohesive public domain ontologies) without breaking the semantic annotation code.
We can extend the Stuff model with new properties, or join onto an existing ontology (see previous post). As long as we ensure the new properties (or predicates from the public domain ontology) inherit from pns:notablyAssociatedWith via rdfs:subPropertyOf, instances added to the dictionaries that conform to the new model immediately take part in disambiguation by ontological proximity, without having to change the semantic annotation code in any way or add any new rules.
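A much simplified sketch of disambiguation by ontological proximity might look like the following Python. The scoring rule (count the notable associations a candidate has with other entities matched in the same document) and the `ex:` URIs are assumptions made for illustration; the real pipeline also uses statistical models and is considerably more sophisticated.

```python
from collections import defaultdict

NOTABLY = "pns:notablyAssociatedWith"

def disambiguate(candidates, triples):
    """Pick one entity URI per surface form.

    candidates: dict mapping a matched term to its candidate entity URIs.
    triples: (s, p, o) tuples, with inferred statements already included.
    Only NOTABLY statements are inspected, so new sub-properties of it
    need no code change here.
    """
    linked = defaultdict(set)
    for s, p, o in triples:
        if p == NOTABLY:
            # Assumption: treat the association as symmetric for scoring.
            linked[s].add(o)
            linked[o].add(s)

    resolved = {}
    for surface, uris in candidates.items():
        # Every candidate matched for the *other* terms in the document.
        others = {
            uri
            for form, form_uris in candidates.items()
            if form != surface
            for uri in form_uris
        }
        # Ties (including all-zero scores) keep the first listed candidate.
        resolved[surface] = max(uris, key=lambda uri: len(linked[uri] & others))
    return resolved

candidates = {
    "David Cameron": ["ex:DavidCameronFootballer", "ex:DavidCameronPM"],
    "Samantha Cameron": ["ex:SamanthaCameron"],
}
triples = [("ex:DavidCameronPM", NOTABLY, "ex:SamanthaCameron")]
resolved = disambiguate(candidates, triples)
```

Here the Prime Minister candidate wins because it is the only one with a notable association to another entity found in the same document.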
As well as binding to the ontology at a low level with the semantic annotator, the architecture at the Press Association also calls for a set of RESTful web services to support (UIs for) instance data curation, semantic annotation (UIs and workflow), and the semantic aggregation of news assets. All of these web service APIs invoke SPARQL queries against the triple store. The APIs are based around the individual core SNaP ontologies: an Asset API that performs operations on news assets and binds to the Asset ontology, and a Stuff API that performs operations on Stuff RDF entities and binds to the Stuff ontology. These APIs return different flavours of RDF based on content negotiation. The Asset API will also return Atom feeds of asset aggregations, with the core Asset ontology properties transformed to the Atom schema. When querying for multiple RDF entities (via a search or get-collection style request), the APIs exploit SPARQL 1.1 under the covers to page result sets, using sub-selects with LIMIT and OFFSET clauses in the underlying query.
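As an illustration of the paging technique, the Python below builds a SPARQL 1.1 query in which the sub-select picks one page of assets and the outer query fetches their properties. One likely reason for paging inside a sub-select is that LIMIT then counts entities rather than result rows. The `pna:` prefix, its namespace URI and the property names are placeholders, not the real SNaP Asset ontology terms.

```python
def paged_asset_query(limit, offset):
    """Build a paged SPARQL 1.1 query string (placeholder vocabulary)."""
    if limit < 1 or offset < 0:
        raise ValueError("limit must be >= 1 and offset must be >= 0")
    # Doubled braces are literal braces inside an f-string.
    return f"""PREFIX pna: <http://example.org/ontology/asset/>
SELECT ?asset ?title ?published
WHERE {{
  {{
    # Inner query: choose exactly one page of assets.
    SELECT ?asset ?published
    WHERE {{ ?asset a pna:Asset ; pna:published ?published . }}
    ORDER BY DESC(?published)
    LIMIT {int(limit)}
    OFFSET {int(offset)}
  }}
  # Outer query: fetch display properties for that page only.
  ?asset pna:title ?title .
}}
ORDER BY DESC(?published)
"""

# Third page of twenty results.
query = paged_asset_query(limit=20, offset=40)
```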
The ontology is then used to filter search results in the APIs by rdf:type, passing the URIs of the ontology class types to define the filter to apply. For example, in the Asset ontology, search results can be limited to Video, Text, or Image assets only, or a combination thereof. Similarly, on the Stuff API, stuff can be filtered by any subclass or combination of subclasses of Stuff. By using the built-in Lucene full-text search index in OWLIM (indexing instance string literals and labels), the Stuff API combines full-text search terms with class types on Stuff entities. We can then provide a flexible search API that returns RDF or OpenSearch Suggestions JSON responses via content negotiation. Using the OpenSearch Suggestions format lets us quickly build rich and extremely functional type-ahead lookups into the CMS and semantic annotation UIs that can be filtered on any Stuff subclass, for example “Find me People instances containing the word ‘Rooney’”.
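The shape of such a type-ahead response can be sketched as follows. The OpenSearch Suggestions format is a JSON array holding the search term, the completions, their descriptions and their URLs. The in-memory `ENTITIES` list, the `pns:` class names and the `ex` URIs are invented sample data standing in for the typed full-text search against the triple store.

```python
import json

# Hypothetical stand-in for entities held in the triple store.
ENTITIES = [
    {"uri": "http://example.org/stuff/wayne-rooney",
     "label": "Wayne Rooney", "types": {"pns:Person"}},
    {"uri": "http://example.org/stuff/rooney-street",
     "label": "Rooney Street", "types": {"pns:Place"}},
    {"uri": "http://example.org/stuff/blackpool",
     "label": "Blackpool", "types": {"pns:Place"}},
]

def search(term, type_uris, entities=ENTITIES):
    """Case-insensitive label match, kept only if the entity has at
    least one of the requested class types."""
    wanted = set(type_uris)
    needle = term.lower()
    return [
        entity for entity in entities
        if needle in entity["label"].lower() and wanted & entity["types"]
    ]

def to_opensearch_suggestions(term, matches):
    """Serialise matches as [term, completions, descriptions, urls]."""
    return json.dumps([
        term,
        [m["label"] for m in matches],
        [", ".join(sorted(m["types"])) for m in matches],
        [m["uri"] for m in matches],
    ])

# "Find me People instances containing the word 'Rooney'"
people = search("Rooney", ["pns:Person"])
response = to_opensearch_suggestions("Rooney", people)
```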
Instance data curation UIs are built directly into the CMS. This allows content authors to add a new entity into the triple store via a CRUD API bound to the Stuff ontology. When a journalist is semantically annotating an article, if an entity in the text is not recognised in the system (for example a person who has just become newsworthy), they have the option of creating a new entity. The UI form that is presented is generated directly from the Stuff ontology, based on the type of entity the journalist needs to curate. Properties and their respective ranges from the ontology are used as the basis for the form fields in the UI. The journalist completes the form and submits it, and the RDF for the new entity is POSTed to the Stuff API. The supplied RDF is validated using Jena Eyeball to ensure it conforms to the correct ontologies for the endpoint, and the triple store and semantic annotation dictionaries are updated accordingly.
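The idea of generating a form from the ontology can be sketched like this in Python. The property list, the `superclasses` map and the widget names are all hypothetical; in the real system the properties, domains and ranges would be read from the Stuff ontology rather than hard-coded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OntologyProperty:
    uri: str
    label: str
    domain: str  # class the property applies to
    range: str   # datatype or class of the value

# Assumed mapping from literal ranges to form widgets; any other range
# is treated as a link to another entity and gets a type-ahead lookup.
WIDGET_FOR_RANGE = {"xsd:string": "text", "xsd:date": "date"}

def form_fields(entity_class, properties, superclasses):
    """Describe the form fields for creating an instance of entity_class."""
    applicable = {entity_class, *superclasses.get(entity_class, ())}
    return [
        {
            "name": prop.uri,
            "label": prop.label,
            "widget": WIDGET_FOR_RANGE.get(prop.range, "entity-lookup"),
            "range": prop.range,
        }
        for prop in properties
        if prop.domain in applicable
    ]

# Hypothetical slice of an ontology.
PROPERTIES = [
    OntologyProperty("rdfs:label", "Name", "pns:Stuff", "xsd:string"),
    OntologyProperty("pns:placeOfBirth", "Place of birth", "pns:Person", "pns:Place"),
    OntologyProperty("ex:foundedOn", "Founded on", "pns:Organisation", "xsd:date"),
]
SUPERCLASSES = {"pns:Person": ["pns:Stuff"], "pns:Organisation": ["pns:Stuff"]}

person_form = form_fields("pns:Person", PROPERTIES, SUPERCLASSES)
```

For a Person this yields a text field for the name and an entity lookup restricted to Places for the place of birth, while the Organisation-only property is left out.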
So why do we need another set of ontologies for this? Primarily the SNaP ontologies have let us build a robust yet efficient software architecture meeting the typical use-cases of news publication while embracing linked data.
What about rNews and Schema.org? A semantic publishing architecture based on full-blown RDF and the SNaP ontologies is by no means mutually exclusive with rNews, schema.org microdata or any other schemas for that matter (in fact the SNaP ontologies already inherit from many common public domain ontologies). In this case rNews metadata can be added to documents post publication (downstream), either via mapping to SNaP or via transformation. We have identified join points between SNaP and rNews to ensure this can happen.
The use of non-temporal relationships in the Stuff ontology reduces instance data maintenance overheads, reduces risk, and increases overall stability, while still letting us make rich relationships between newsworthy things. Binding to a specific parent property (notablyAssociatedWith) in the Stuff ontology lets us build a quality semantic annotation service that is easily extensible and maintainable.
The ontologies have been built with news use-cases in mind and allow us to build a robust and efficient enterprise software architecture. However, this doesn’t prevent us materialising more commonly used properties and types from the public domain ontologies that consumers of our content might already be familiar with. It also doesn’t prevent us from materialising rNews or Schema.org properties on published content. The use of SNaP is not mutually exclusive with these vocabularies.
The RDF/ontology model is pervasive throughout the technical stack. With a traditional relational database, the model is typically contained at the bottom of the stack; here, though, the ontology design is fundamental to the overall architecture. The design paradigm has thus shifted from concealing a backend (relational) model entirely to exposing the model all the way through the technical stack, and all the way to your consumers. This is a big architectural shift, but it doesn’t stop you adhering to good software engineering principles and the service-oriented architectures that work so well.