By James Kobielus (@jameskobielus)
Data management professionals know that how you model the data directly constrains how flexibly you can analyze it.
When you consolidate relational sources that embody divergent data schemas and definitions, you are inviting a world of pain. Rollup of those sources for unified drilldown can’t take place until you run it all through a gantlet of data integration, matching, merging, and cleansing. Even then, you generally have to make the resultant data set available in relational third-normal form.
And when you add unstructured sources to the mix, watch out! Querying across multi-structured sources might involve unstructured-data integration to transform the nonrelational data to relational schemas that support SQL access. Or it might involve keeping data in its source formats and offering agile query access through an abstraction that can do justice to the myriad semantics.
That’s where ontologies, taxonomies, and other data abstractions enter the picture. As multi-structured data moves into the mainstream, data scientists will increasingly require integration tools to help them analyze data within the semantic contexts that these domain-specific abstractions express. As noted in a recent article on ontologies, such abstractions have a clear analytic advantage over relational and other platform-specific models.
Ontologies, as author Malcolm Chisholm emphasizes, are principally oriented toward data’s analytical uses within and across disparate data-store implementations. Framed in Resource Description Framework (RDF) and other formats, ontologies are, he states, “analysis, not design, artifacts,” geared to semantic query and knowledge discovery. “An ontology is a view of the concepts, relations and rules for a particular area of business information, irrespective of how that information may be stored as data.”
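To make the "concepts, relations and rules" idea concrete, here is a minimal sketch in plain Python. It is not RDF and does not use any RDF library; it simply stores subject-predicate-object triples in a set and applies one rule (subclass membership is transitive). Every class, predicate, and identifier in it is a hypothetical illustration, not something taken from Chisholm's article.

```python
# Hypothetical mini-ontology held as (subject, predicate, object) triples.
# All names are illustrative assumptions; no storage format is implied.
TRIPLES = {
    ("SavingsAccount", "subClassOf", "Account"),
    ("Account", "subClassOf", "FinancialProduct"),
    ("Account", "heldBy", "Customer"),
    ("acct-001", "type", "SavingsAccount"),
}


def superclasses(cls, triples):
    """Return every class reachable from cls by following subClassOf links."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for subject, predicate, obj in triples:
            if subject == current and predicate == "subClassOf" and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


def instances_of(cls, triples):
    """Return individuals typed as cls or as any subclass of cls."""
    return {
        subject
        for subject, predicate, obj in triples
        if predicate == "type" and (obj == cls or cls in superclasses(obj, triples))
    }
```

Asking for `instances_of("FinancialProduct", TRIPLES)` finds `acct-001` even though no triple says so directly. The answer comes from the model's rules rather than from how any underlying store happens to lay out its tables.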
In the broader perspective of multi-structured analytics, ontologies support the following use cases:
- Building semantic models: Developers explicitly model semantics as RDF ontologies and/or related logical structures like taxonomies, thesauri, and topic maps. These ontologies are used to drive the creation of structured content that instantiates the entities, classes, relationships, attributes, and properties defined in the ontologies.
- Mediating between heterogeneous semantics: Developers use ontologies and other semantic models to drive the creation of mappings, transformations, and aggregations among existing, structured data sets.
- Mining the semantics implicit in unstructured formats: Developers use natural-language processing and pattern-recognition tools to extract the implicit semantics from unstructured text sources.
- Managing semantics in a consolidated repository: Application environments require repositories or libraries to manage ontologies and other semantic objects and maintain the rules, policies, service definitions, and other metadata to support the life-cycle management of application semantics.
- Governing semantics through comprehensive controls: Application environments require that various controls — on access, change, versioning, auditing, and so forth — be applied to ontologies; otherwise, it would be meaningless to refer to them as “controlled vocabularies.”
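The mediation use case is the easiest of these to sketch in code. The example below assumes two hypothetical sources ("crm" and "billing") whose field names differ, plus a hand-written mapping from each source's fields to shared terms. The source names, field names, and shared terms are all invented for illustration; a real deployment would more likely derive such mappings from a managed ontology than hard-code them.

```python
# Hypothetical mappings from source-specific field names to shared terms.
# Source names, field names, and shared terms are illustrative assumptions.
FIELD_MAPPINGS = {
    "crm": {"cust_id": "customerId", "cust_nm": "customerName"},
    "billing": {"client_no": "customerId", "balance_due": "amountOwed"},
}


def to_shared_terms(source, record):
    """Rename a record's fields to the shared terms; drop unmapped fields."""
    mapping = FIELD_MAPPINGS[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}


def merge_by_customer(tagged_records):
    """Combine (source, record) pairs into one view per shared customerId."""
    merged = {}
    for source, record in tagged_records:
        normalized = to_shared_terms(source, record)
        merged.setdefault(normalized["customerId"], {}).update(normalized)
    return merged


unified = merge_by_customer([
    ("crm", {"cust_id": "7", "cust_nm": "Ada", "region_cd": "NE"}),
    ("billing", {"client_no": "7", "balance_due": 120}),
])
```

Here `unified["7"]` holds the name from one source and the amount owed from the other under a single vocabulary. The unmapped `region_cd` field is silently dropped, which is a design choice a governed environment would want to make explicit rather than leave implicit.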
You might regard ontologies as metadata applicable to the deep analytic meaning of data. As such, ontologies are a key semantic stratum within which all data-driven insights are firmly rooted, and from which they all exude like liberated liquid energy.