LUCERO is all about making university-wide resources available to everyone through an open, linked data approach. We are building the technical and organisational infrastructure for institutional repositories and research projects to expose their data on the Web as linked data. It is therefore natural for the interface to this data, the SPARQL endpoint and the server resolving the URIs in this data, to be hosted under http://data.open.ac.uk. The first version of the components underlying this site, as well as a small part of the data which will ultimately be exposed there, went live last week, with a certain level of excitement from all involved.
What is there? The data
The “launch” of data.open.ac.uk happened relatively shortly after the beginning of the LUCERO project. Indeed, we take the approach that the basic data exposure architecture has to be in place first, so that data can be incrementally integrated into it. As a first step, we developed extraction and update mechanisms (see the previous blog post about the LUCERO workflow) for two important repositories at the Open University: ORO, our publication repository, and podcast, the collection of podcasts produced by the Open University, including the ones being distributed through iTunes U.
ORO data concerns scientific publications with at least one member of the Open University as co-author. The source of the data is a repository based on the EPrints open source publication repository system. EPrints already integrates a function to export information as RDF, using the BIBO ontology. We of course used this function, post-processing its output to obtain a representation consistent with the other (future) datasets in data.open.ac.uk, in particular in terms of URI scheme. The ORO data currently represents 13,283 Articles and 12 Patents, in approximately 340,000 triples (see for example the article “Molecular parameters of post impact cooling in the Boltysh impact structure”).
Podcast data is extracted from the collection of RSS feeds obtained from podcast.open.ac.uk, using a variety of ontologies, including the W3C media ontology and FOAF (see for example the podcast “Great-circle distance”). An interesting element of this dataset is that it provides connections to other types of resources at the Open University, including courses (see for example the course MU120, which is referred to in a number of podcasts). Podcasts are also classified into categories, using the same topics used to classify courses at the Open University, as well as the iTunes U categories, which we represent in SKOS (see for example the category “Mathematics”).
While these datasets represent only a small fraction of the data we will ultimately expose through data.open.ac.uk, the new possibilities obtained by openly exposing them in RDF, with a SPARQL endpoint and resolvable URIs, are already very exciting. In a blog post, Tony Hirst has shown some initial examples and encouraged others to share their queries over the Open University’s linked data. Richard Cyganiak has also kindly created a CKAN description of our datasets, for others to find and exploit.
The technical aspects
In a previous blog post, we gave an overview of the technical workflow by which data from the original sources would end up being exposed as linked data. The current platform implements parts of this workflow, including updaters and extractors for the two considered datasets. At the centre of the platform is the triple store. After trying several options, including Sesame, Jena TDB and 4Store, we settled on SwiftOWLIM, which is free, scalable and efficient, and includes limited reasoning capabilities, which might end up being useful in the future.
The current platform also implements the mechanisms by which URIs in the http://data.open.ac.uk namespaces are resolved. Very simply, a URI such as http://data.open.ac.uk/course/a330 is redirected either to http://data.open.ac.uk/page/course/a330 or to http://data.open.ac.uk/resource/course/a330, depending on the content type requested by the client. http://data.open.ac.uk/page/course/a330 shows a browsable webpage linking the considered resource to related ones, while http://data.open.ac.uk/resource/course/a330 provides the RDF representation of this resource.
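The redirection rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the code running on data.open.ac.uk: the function name `redirect_target`, the list of media types treated as RDF, and the choice to fall back to the HTML page are all assumptions made for the example, and quality values in the Accept header are ignored for simplicity.

```python
BASE = "http://data.open.ac.uk/"

# Assumption: media types treated as a request for RDF. The real server
# may recognise a different set.
RDF_TYPES = ("application/rdf+xml", "text/turtle", "application/n-triples")
HTML_TYPES = ("text/html", "application/xhtml+xml")


def redirect_target(uri, accept):
    """Return the URL a resource URI would be redirected to.

    `accept` is the value of the HTTP Accept header. Media types are
    examined in the order they are listed; q-values are ignored.
    """
    if not uri.startswith(BASE):
        raise ValueError("not a data.open.ac.uk URI: " + uri)
    local = uri[len(BASE):]  # e.g. "course/a330"

    requested = [part.split(";")[0].strip().lower()
                 for part in accept.split(",")]
    for media_type in requested:
        if media_type in RDF_TYPES:
            return BASE + "resource/" + local
        if media_type in HTML_TYPES:
            return BASE + "page/" + local

    # Assumption: anything else (including "*/*") gets the browsable page.
    return BASE + "page/" + local
```

For instance, `redirect_target("http://data.open.ac.uk/course/a330", "application/rdf+xml")` yields the `/resource/` URL, while the same call with `"text/html"` yields the `/page/` URL.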
A SPARQL endpoint is also available, which allows querying the whole set of data, or individual datasets through their namespaces, http://data.open.ac.uk/context/oro and http://data.open.ac.uk/context/podcast.
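As a rough sketch of what restricting a query to one dataset might look like, the Python below builds a SPARQL query and the corresponding request URL. It rests on two assumptions that are not stated above: that the context URIs can be used as named graphs in a `GRAPH` clause, and that the endpoint lives at the placeholder address `ENDPOINT` (substitute the real one). The helper names are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Assumption: placeholder address, not confirmed as the actual endpoint.
ENDPOINT = "http://data.open.ac.uk/query"


def build_query(graph=None, limit=10):
    """Build a SELECT query over all triples, optionally within one graph."""
    pattern = "?s ?p ?o ."
    if graph is not None:
        # Assumption: the dataset context URI acts as a named graph.
        pattern = "GRAPH <%s> { %s }" % (graph, pattern)
    return "SELECT ?s ?p ?o WHERE { %s } LIMIT %d" % (pattern, limit)


def build_request_url(query, endpoint=ENDPOINT):
    """Encode the query as a GET request, as in the SPARQL protocol."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    oro_query = build_query("http://data.open.ac.uk/context/oro", limit=5)
    print(oro_query)
    print(build_request_url(oro_query))
```

Calling `build_query()` without a graph produces a query over the whole set of data; passing one of the two context URIs limits it to that dataset.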
Of course, this first version of data.open.ac.uk is only the beginning of the story. We are currently looking at ways to represent and extract information about courses and qualifications from the Study At the OU website, as well as information about places on the OU campus and in regional centres (buildings, car parks, etc.).
More ways to access the data will also be made available soon, including faceted search/browsing, and links to external datasets are being investigated. All this is going to be gradually integrated into the platform while the existing data is constantly updated.