What is going wrong with the Semantic Web?
The US Semantic Technologies Symposium was held at Wright State University a month ago, where I had great discussions with Craig Knoblock about SPARQL server reliability1, Eric Kansa about storing archeology data and Open Context, Eric Miller about the workings of the W3C, Midwest farmers and old bikes, Matthew Lange about tracking crops with LOD, and a 'fruitful' talk with Evan Wallace about farm data storage standards.
Thinking through these conversations, I decided to outline what I think are the troubling conclusions for our area, namely that a) Semantic Web adoption is lagging, b) we keep rehashing old problems without moving on, and c) we persistently fail to support our own projects, after which I'll suggest a few solutions.
Semantic Web adoption is not where we'd like it to be
Very, very few people care about data management2. Even fewer people understand data management. I'd go so far as to say that the majority of the IT community spends its time moving strings to the end-user's screen, focusing primarily on user communications and being told what to communicate. Other developers may worry about analysis, networking stacks or storage, but the number of them that care about the data itself is small.
That leaves the database developer, whose entire attention is on bread-and-butter issues. Why did we have the hubris to think that people would care about the Semantic Web when most of them have no data management problem to worry about? 31% of webpages are reported to contain schema.org markup3, primarily because web developers believe it will help them with SEO, not because it helps them with data management or interoperability.
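For readers who haven't looked at what that markup actually is, here is a minimal sketch (the product, price and values are invented, not taken from any real page): a typical schema.org JSON-LD snippet of the kind embedded for SEO, parsed with rdflib to show that it is ordinary RDF underneath.

```python
# A minimal sketch, not taken from any real page: the sort of schema.org
# JSON-LD a web developer embeds for SEO. Parsing it with rdflib (6+,
# which ships a JSON-LD parser) shows it is plain RDF underneath.
# An inline @vocab context is used so the example runs offline; real
# pages usually reference the remote https://schema.org context.
from rdflib import Graph

snippet = """
{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Product",
  "name": "Example Widget",
  "offers": {"@type": "Offer", "price": "19.99", "priceCurrency": "USD"}
}
"""

g = Graph()
g.parse(data=snippet, format="json-ld")

for s, p, o in g:
    print(s, p, o)
print(len(g), "triples")
```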
Reinventing the wheel. Again.
I regret I didn't note the speaker who said "If your developers care about JSON, I don't care about your developers", because it goes to the heart of the matter about poor Semantic Web training and education. At this stage, arguments about serializations are about as relevant as debating whether submarines can swim4. There was a lot of talk at the meeting about creating new JSON standards to handle corner cases, without knowledge of or regard for previous standards, because "it's not JSON and people want JSON". The Semantic Web stack translates the model to whatever serialization is needed, in most cases negotiated without programmer involvement: JSON-LD is really nice for web developers, RDF/XML for XPath, Turtle for authoring, N3 for throughput, and so on. David Booth also noted the panoply of standards and vocabularies. A number of them have been beautifully engineered by domain experts (GeoSPARQL5, OWL-Time, SOSA and PROV come to mind), and it's an outright waste of everyone's time not to reuse them.
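To make the serialization point concrete, here is a minimal rdflib sketch with an invented example resource: one graph, parsed from Turtle and written back out as JSON-LD and RDF/XML, with the model itself untouched.

```python
# A minimal sketch of the serialization argument: one RDF graph, three
# serializations, same triples. The example resource is invented; only
# the PROV vocabulary is real.
from rdflib import Graph

turtle_doc = """
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

ex:report42 a prov:Entity ;
    prov:wasAttributedTo ex:alice .
"""

g = Graph()
g.parse(data=turtle_doc, format="turtle")

# The model is the graph; Turtle, JSON-LD and RDF/XML are just views of it.
print(g.serialize(format="turtle"))
print(g.serialize(format="json-ld"))
print(g.serialize(format="xml"))
print(len(g), "triples in every serialization")
```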
We expect research groups to act like service providers
The lack of reliable services and exemplars was also noted: the curated New York Times RDF dataset is no longer answering, the BBC has cut back on outward-facing Semantic Web services, and DBpedia, at the heart of the LOD cloud, is still running on a borrowed virtual machine while the DBpedia Association has a hard time raising funds. I would like to echo Juan Sequeda's post that we should set aside some grant monies for resources such as Linked Open Vocabularies, a great vocabulary/ontology location tool6. Getting operational funding is always a slog, but we cannot advocate for a technology when the exemplars are not maintained and disappear overnight.
In the past we've gotten away with a lot by stuffing machines under graduate students' desks and getting them to write applications between course work and thesis submission. This is not sustainable, and we need to make a serious effort at long-term support.
What we should be doing
The Semantic Web stack is annoyingly complex, not because of the technology but because of the problems it is trying to solve. Its critics abound (even Hitler, apparently) but there is no real alternative for dealing with data at scale. Organizationally, it sits uncomfortably between two communities:
The first is the small group of developers that deal with web APIs, mostly independently from each other. Integrations are done on an ad hoc basis when a one-off business requirement presents itself. These are the people who came up with ideas like Swagger: simple documentation that focuses on programmatic operations, with little semantics about the transaction itself. Want it in orange? Set colour_id to 2. Why 2? Because that's the value some developer arbitrarily decided on at the time. Why is your self-evident use case not handled? Because no one has needed it before. Development is incremental. If an error occurs, file a ticket on GitHub, no harm done.
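As a purely illustrative contrast (the payloads, the colour table and the DBpedia choice are mine, not from any real API), the difference between an opaque magic number and a global identifier looks something like this:

```python
# A purely illustrative contrast, not any real API. In the ad-hoc style,
# "colour_id": 2 means whatever the original developer decided it means,
# and that mapping lives out of band in a wiki page or in the code. A
# URL-based identifier is self-describing: anyone can dereference it.
import json

ad_hoc_order = {
    "item": "widget",
    "colour_id": 2,  # 2 == orange, because someone said so at the time
}

linked_data_order = {
    "item": "widget",
    "colour": "http://dbpedia.org/resource/Orange_(colour)",  # dereferenceable identifier
}

print(json.dumps(ad_hoc_order, indent=2))
print(json.dumps(linked_data_order, indent=2))
```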
The second is the Enterprise Resource Planning crowd, which has been doing this for a very long time, albeit usually within a single organization and with massive amounts of corporate resources. Because they care deeply that orders of 5,000 sheets of 8.5x11 paper aren't interpreted as orders of 8,511 sheets of 5,000 in² paper, they tend to document everything (a single API document may run to hundreds of pages) and have a neurotic attention to change management. There have been spectacular failures when implementing these mammoth7 systems, but generally you can order something from across the world and it will show up on your doorstep next week.
The Semantic Web has a lot to offer both of these communities: a ready-made semantic modelling language8 that is reusable by web APIs, URL-based global identifiers, and a unified multilingual documentation framework that fits corporate needs. Bridges need to be built with application domain experts and with existing data ecosystems. Logistics systems such as the Global Trade Item Number are pushing the limits of what we can do with barcodes and relational databases. We want the Internet of Things, the Internet of Food, a smart power and transportation grid, and a bibliographic system that isn't going to split at its seams. The only way we can achieve all of this is to have the data being generated supported by content and the Semantic Web.
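As a hedged sketch of what such a bridge could look like (the GTIN, the URI pattern and the product are invented, loosely following the GS1 Digital Link style of putting the barcode value into a URL), an existing logistics key can become a URL-based global identifier described with schema.org terms:

```python
# A sketch under stated assumptions: the GTIN, the URI pattern and the
# product below are invented for illustration. The point is that a value
# already living in barcodes and relational databases can double as a
# global identifier and carry machine-readable descriptions.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

SCHEMA = Namespace("https://schema.org/")

gtin = "09506000134352"                                # illustrative GTIN
product = URIRef(f"https://id.example.org/01/{gtin}")  # Digital Link style URI

g = Graph()
g.bind("schema", SCHEMA)
g.add((product, RDF.type, SCHEMA.Product))
g.add((product, SCHEMA.gtin14, Literal(gtin)))
g.add((product, SCHEMA.name, Literal("Example canned tomatoes")))

print(g.serialize(format="turtle"))
```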
- 1. Short version: my experiences with the Muninn Project, CWRC, CLDI and Myra have been positive. Overall, SPARQL servers have had less engineering calendar time than other comparable software: Apache and MySQL have been worked on since 1995, PostgreSQL since 1986. In contrast, Virtuoso has had SPARQL since 2005, AllegroGraph since 2004 and ARC2 since 2010; 10+ extra years of development work helps. Furthermore, Mondeca's SPARQL endpoint monitor shows that SPARQL servers do have good uptime. The often-misquoted figure of 63% of endpoints being offline applies to every SPARQL endpoint ever seen since 2013. The statistic that should be worrisome is that only 13% of them have ever had a machine-readable description!
- 2. Data management is the simplest redux of the semantic web and ontology. I'm setting the bar low on purpose...
- 3. It would be interesting to see how many of these triples are well formed, sensical, and form a data structure that makes sense syntactically.
- 4. With apologies to Edsger Dijkstra.
- 5. The name is somewhat of a misnomer since the standard contains both an ontology to describe features and geometries, and SPARQL extensions meant to spatially reason over the data. It is based on previous OGC work and is rock solid.
- 6. Developed by Bernard Vatant and Pierre-Yves Vandenbussche.
- 7. Without putting Alessandro Oltramari on the spot, it took Robert Bosch over a decade to get everything running, and it is considered a case study-worthy installation.
- 8. Notwithstanding some early OWL missteps mentioned by Deborah McGuinness, the basic ontological framework underneath the semantic web is extremely powerful and a godsend for data integration.