By James Kobielus (@jameskobielus)
Big data demands a rich virtualization layer that can support the full range of searches and queries that might be made against any and all data: structured, unstructured, and all shades in between.
Semantic search, analysis, and categorization are key applications of unstructured data. These often rely on semantic-enabled big-data platforms with rich metadata repositories, computational linguistics libraries, ontology modeling, and interactive visualization tools.
Semantic query standards are an essential complement to the SQL query virtualization approaches that are taking root. The big-data cosmos is evolving away from a single master schema and toward data virtualization behind a semantic abstraction layer. Under this new paradigm, application developers require simplified access to the disparate schemas of the relational, dimensional, columnar, graph, and other repositories that together constitute a logically unified big-data resource.
Though big-data professionals may not be aware of it, there is already a semantic query standard available: SPARQL. Created under the World Wide Web Consortium’s Semantic Web Activity, SPARQL can play nicely in the burgeoning big-data arena. As I’ve noted elsewhere, SPARQL and its companion standard, RDF (Resource Description Framework), provide an open framework for expressing and querying the rich semantics of graph, document, and other stores.
RDF and SPARQL are the clear standards for an online world that is becoming dominated by unstructured data types. The defining characteristic of unstructured data is that its semantics are neither explicit nor captured in a schema. Hence, data scientists must rely on some combination of manual tagging, natural language processing, text mining, machine learning, and other approaches to extract those semantics. They also need a semantically rich metadata layer (built on RDF) and a semantically agile search/query layer (built on SPARQL).
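To make the two layers concrete, here is a minimal sketch in plain Python. It is a toy, not a SPARQL engine: it stores facts as subject-predicate-object triples (the RDF data model) and answers a query by matching triple patterns with variables, which is roughly what a basic SPARQL graph pattern does. Every name in it (the `ex:` identifiers, `match`, `query`, the sample facts) is hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Metadata layer: facts extracted from unstructured documents, held as
# subject-predicate-object triples. The "ex:" identifiers are made up.
TRIPLES: List[Triple] = [
    ("ex:doc1", "ex:mentions", "ex:Hadoop"),
    ("ex:doc1", "ex:sentiment", "positive"),
    ("ex:doc2", "ex:mentions", "ex:Hadoop"),
    ("ex:doc2", "ex:sentiment", "negative"),
    ("ex:doc3", "ex:mentions", "ex:SPARQL"),
]


def match(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must agree with any value it was bound to earlier.
            if result.get(term, value) != value:
                return None
            result[term] = value
        elif term != value:
            return None
    return result


def query(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Query layer: join the patterns one at a time over the triple set."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        next_bindings: List[Binding] = []
        for binding in bindings:
            for triple in triples:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Roughly the SPARQL query:
#   SELECT ?doc WHERE { ?doc ex:mentions ex:Hadoop . ?doc ex:sentiment "positive" }
results = query(
    [("?doc", "ex:mentions", "ex:Hadoop"), ("?doc", "ex:sentiment", "positive")],
    TRIPLES,
)
print([b["?doc"] for b in results])
```

The point of the sketch is that the query names no tables or columns. It describes a shape of facts, and that is what lets one query layer sit on top of stores with very different schemas.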
I’m glad I’m not the only one beating this drum. As the author of this recent article states: “SPARQL is SQL writ large, intended for working with web based federated triple stores built upon the open world assumption … It can be applied across a ‘database’ that spans dozens or even hundreds of service ‘end points’ across the Internet, but it can also be applied to an in-memory database on a web page in a browser … A SPARQL layer on top of HBASE, Hive or HDFS, or other map/reduce type architectures would be incredibly useful.”
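The federation idea in that quote can be sketched the same way. The plain-Python toy below assumes each "end point" is just a function that answers a triple pattern. A real deployment would call remote SPARQL services instead, and the names here (`make_endpoint`, `federated_lookup`, `endpoint_a`, `endpoint_b`) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
# None in a pattern position acts as a wildcard.
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]
Endpoint = Callable[[Pattern], List[Triple]]


def make_endpoint(triples: List[Triple]) -> Endpoint:
    """Wrap a local list of triples as a stand-in for a remote service."""
    def lookup(pattern: Pattern) -> List[Triple]:
        return [
            t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        ]
    return lookup


def federated_lookup(
    pattern: Pattern, endpoints: Dict[str, Endpoint]
) -> Dict[Triple, List[str]]:
    """Send one pattern to every endpoint and record who returned each triple."""
    sources: Dict[Triple, List[str]] = {}
    for name, endpoint in endpoints.items():
        for triple in endpoint(pattern):
            sources.setdefault(triple, []).append(name)
    return sources


# Two made-up stores that partly overlap.
endpoints = {
    "endpoint_a": make_endpoint([("ex:order42", "ex:placedBy", "ex:alice")]),
    "endpoint_b": make_endpoint([
        ("ex:alice", "ex:knows", "ex:bob"),
        ("ex:order42", "ex:placedBy", "ex:alice"),
    ]),
}

merged = federated_lookup((None, "ex:placedBy", None), endpoints)
for triple, names in merged.items():
    print(triple, "from", names)
```

Because every store speaks the same triple vocabulary, merging results is a simple union keyed on the triple itself. No per-store schema mapping is needed in the caller.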
What exactly is stopping the big-data industry from rallying around SPARQL and RDF? We need to push the industry agenda in the right direction to overcome this counterproductive inertia.