Posted by: Sasirekha R
analytics, Apache, IBM, Internet, RDF, search features, Semantic Web, UIMA, unstructured data
Mining Unstructured Information using UIMA
A vast amount of knowledge is available as natural language text – web documents, reference books, encyclopedias, dictionaries, textbooks, technical reports, contracts, novels etc. Add to that the growing volumes of images, audio and video. Undisputedly, unstructured information is the largest, most current and fastest growing source of knowledge. It is “unstructured” because it lacks the explicit semantics (or structure) that computer applications typically need in order to process it.
Unstructured Information Management Architecture (UIMA) is a framework for finding latent meaning, relationships and relevant facts from unstructured text. UIMA is useful for building analytic applications that analyze large volumes of unstructured information to discover relevant knowledge.
Unstructured information must become “structured” so that the applications can interpret it correctly. A typical UIM application would take plain text as input and identify entities (persons, places, organizations etc.) and relationships. UIMA standardizes semantic search and content analytics, providing a common method for meaningfully accessing data contained in text such as e-mails, blog entries, news feeds, and notes, as well as in audio recordings, images, and video.
Originally developed by IBM, UIMA is now a top-level open source project at Apache. In March 2009, UIMA was approved as an OASIS standard. Hopefully these trends will translate into more UIMA compliance from third-party vendors.
UIMA’s objective of supporting interoperability among analytics is divided into four design goals:
• Data Representation - Support common representation of artifacts (the unstructured information) and artifact metadata (results from analysis).
• Data Modeling and Interchange - Support the platform-independent interchange of analysis data in a form that facilitates a formal modeling approach and alignment with existing standards.
• Discovery, Reuse and Composition - Support the discovery, reuse and composition of independently-developed analytics.
• Service-Level Interoperability – Support concrete interoperability of independently developed analytics (software for analysis) based on a common service description.
The seven elements of UIMA specification are:
1. Common Analysis Structure (CAS) – Common data structure shared by all UIMA analytics to represent the artifact and the artifact metadata. The CAS is an object graph. The CAS representation can be easily elaborated for specific domains.
2. Type System Model – A collection of inter-related type definitions. Every object in a CAS must be associated with a type. The UIMA Type-System is a declarative language for defining object models. Type Systems are user-defined. Each type definition declares the attributes of the type and describes valid fillers for its attributes. Attributes can be single-valued or multi-valued, or constrained to a legal range of values, depending on the needs of the application. UIMA adopts Ecore as the type system representation, due to its alignment with standards and the availability of EMF tooling.
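In the Apache UIMA implementation, type systems are declared in XML descriptors. A minimal sketch might look like the following (the `com.example.Email` type and its feature are hypothetical names, shown only to illustrate the descriptor shape):

```xml
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>ExampleTypeSystem</name>
  <description>Declares one custom annotation type.</description>
  <types>
    <typeDescription>
      <!-- A user-defined type deriving from the base Annotation type -->
      <name>com.example.Email</name>
      <description>An e-mail address found in the text.</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>normalizedAddress</name>
          <description>Lower-cased form of the address.</description>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>
```

The `supertypeName` is what ties a user-defined type into the Base Type System described next.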
3. Base Type System – Standard definition of commonly-used, domain-independent types. This establishes a basic level of interoperability. The most significant parts of the Base Type System are the Annotation and Sofa (Subject of Analysis) types.
4. Abstract Interfaces – Defines the standard component types and operations that UIMA services implement. Processing Element (PE) is the supertype of all UIMA components; the PE interface defines getMetadata() and setConfigurationParameters(). Analyzer, CAS Multiplier and Flow Controller are the subtypes. An Analyzer (the most common) processes a CAS and possibly updates its contents. A CAS Multiplier processes a CAS and possibly creates new CASes – for example, dividing a CAS into pieces or merging multiple CASes. A Flow Controller determines the route CASes take through multiple analytics.
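As a rough illustration only – this is plain Python, not the UIMA API, and the class and method names are simplified renderings of the interfaces just described – the PE hierarchy can be sketched as:

```python
# Conceptual sketch of the UIMA Processing Element hierarchy.
# NOT the real UIMA API; names are simplified for illustration.

class ProcessingElement:
    """Supertype of all components: exposes metadata and configuration."""
    def get_metadata(self):
        return {"name": type(self).__name__}

    def set_configuration_parameters(self, **params):
        self.params = params

class Analyzer(ProcessingElement):
    """Processes a CAS and possibly updates its contents."""
    def process(self, cas):
        raise NotImplementedError

class CasMultiplier(ProcessingElement):
    """Processes a CAS and possibly creates new CASes (split or merge)."""
    def process(self, cas):
        raise NotImplementedError  # would return an iterable of CASes

class FlowController(ProcessingElement):
    """Decides which analytic a CAS should visit next."""
    def next_step(self, cas):
        raise NotImplementedError
```

The point is the shape of the contract: every component, whatever it does to a CAS, publishes metadata and accepts configuration through the same supertype interface.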
5. Behavioural Metadata – Declaratively describes what the analytic does – for example, what types of CASes it can process, what elements in a CAS it analyzes, and what effects it may have on CAS contents. Analytics are not required to declare behavioural metadata, but without it an application using the analytic cannot assume anything about the analytic’s operation.
6. Processing Element Metadata – Defines the structure of processing element metadata and provides an XML schema in which PEs must publish this metadata. All PEs must publish metadata describing the analytic, to support discovery and composition. PE metadata comprises: identification information, configuration parameters, behavioural metadata, the type system, and extensions.
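In the Apache implementation, this metadata lives in an XML descriptor per component. A trimmed, hypothetical example (the implementation class and output type names are made up for illustration):

```xml
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>com.example.EmailAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <!-- Identification information -->
    <name>EmailAnnotator</name>
    <description>Detects e-mail addresses in plain text.</description>
    <version>1.0</version>
    <!-- Configuration parameters -->
    <configurationParameters>
      <configurationParameter>
        <name>Pattern</name>
        <type>String</type>
        <multiValued>false</multiValued>
        <mandatory>false</mandatory>
      </configurationParameter>
    </configurationParameters>
    <!-- Behavioural metadata: declared inputs and outputs -->
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type>com.example.Email</type>
        </outputs>
      </capability>
    </capabilities>
  </analysisEngineMetaData>
</analysisEngineDescription>
```

It is this published descriptor – not the component’s code – that lets a framework or application discover the component and decide where it fits in a composition.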
7. WSDL Service Descriptions – Specifies a WSDL description of the UIMA interfaces and a binding to a concrete SOAP interface that compliant frameworks and services MUST implement.
In UIMA, the original content is not affected by the analysis process. Instead, an object graph that stands off from and annotates the content is produced. Stand-off annotations allow multiple content interpretations, of arbitrary graph complexity, to be produced, co-exist, overlap and be retracted. Typically an application generates in-line XML, XMI or RDF documents from the UIMA representation.
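The stand-off idea can be shown with a toy example (plain Python, not the UIMA CAS API): annotations carry begin/end offsets into an immutable text, so several overlapping interpretations co-exist – or can be retracted – without the original content ever being modified.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: a typed span over the original text."""
    type: str
    begin: int
    end: int

text = "Dr. Ada Lovelace wrote the first program."

# Two analytics add overlapping interpretations; the text never changes.
annotations = [
    Annotation("Title",  0,  3),   # "Dr."
    Annotation("Person", 0, 16),   # "Dr. Ada Lovelace"
    Annotation("Person", 4, 16),   # "Ada Lovelace" (alternative reading)
]

# Each annotation can always recover the exact span it describes.
covered = [text[a.begin:a.end] for a in annotations]
print(covered)
```

Retracting an interpretation is just removing its annotation object; the other readings, and the text itself, are untouched.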
According to Apache, “UIMA is, by itself, an empty framework. Its purpose is to enable a world-wide, diverse community to develop inter-operable, often complex analytic components, and allow them to be combined and run together, with framework supplied scaled-out and remoting as needed”.
The Apache site (http://uima.apache.org/) provides the framework, components and infrastructure. The frameworks (available in Java as well as C++) run the components. The framework provides a common platform for unstructured analytics, enabling reuse of analysis components – annotators, parsers and consumers.
UIMA Annotators do the real work of extracting structured information from unstructured data. The Apache site itself provides a list of annotators – such as the Regular Expression Annotator, which detects entities like email addresses, URLs, phone numbers, zip codes or any other entity based on regular expressions and concepts, and the Dictionary Annotator, which creates annotations based on word lists. In addition, UIMA annotators, including Natural Language Processors from various vendors, can be downloaded from the web (some of them are listed at http://uima.apache.org/external-resources.html).
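The idea behind a regular-expression annotator can be sketched in a few lines of plain Python – a conceptual illustration, not the Apache Regular Expression Annotator itself, with a made-up pattern set:

```python
import re

# Hypothetical pattern set: annotation type -> regular expression.
PATTERNS = {
    "EmailAddress": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "Url":          re.compile(r"https?://\S+"),
}

def annotate(text):
    """Return stand-off (type, begin, end) tuples for every match."""
    spans = []
    for ann_type, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((ann_type, m.start(), m.end()))
    return sorted(spans, key=lambda span: span[1])

doc = "Contact sasirekha@example.com or see http://uima.apache.org/ for details."
for ann_type, begin, end in annotate(doc):
    print(ann_type, doc[begin:end])
```

Note that the output is stand-off spans, not edited text, matching the representation described above.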
A full analysis task for a search or intelligence application is a multi-stage process. As UIMA defines a common, standard interface, annotators from multiple vendors can be made to work together. A UIMA application can use annotators without knowing how they work internally; the UIMA framework takes care of integrating and orchestrating them.
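Conceptually, the framework runs each document through the chain of annotators in order, every stage reading and enriching the same shared analysis structure. A plain-Python sketch of that orchestration (again not the UIMA API; the two annotators are invented for illustration):

```python
def tokenizer(cas):
    """Stage 1: add token spans to the shared analysis structure."""
    pos = 0
    for word in cas["text"].split():
        begin = cas["text"].index(word, pos)
        cas["annotations"].append(("Token", begin, begin + len(word)))
        pos = begin + len(word)

def shouting_detector(cas):
    """Stage 2: reuse stage-1 results; flag all-uppercase tokens."""
    for ann_type, begin, end in list(cas["annotations"]):
        if ann_type == "Token" and cas["text"][begin:end].isupper():
            cas["annotations"].append(("Shout", begin, end))

def run_pipeline(text, annotators):
    """Orchestrate: each annotator sees and enriches the same structure."""
    cas = {"text": text, "annotations": []}
    for annotator in annotators:
        annotator(cas)
    return cas

cas = run_pipeline("UIMA makes analytics COMPOSABLE",
                   [tokenizer, shouting_detector])
```

The second stage never inspects the first stage’s code – only its declared output types – which is exactly what lets independently developed annotators be composed.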
Apache site also provides tools for either creating new interoperable text analytics modules or enabling existing text analytics investments to operate within the framework.
IBM has embedded UIMA in its products and services, creating a channel for third-party vendors to deploy their text and multi-modal analytics in larger integrated solutions. IBM OmniFind Enterprise Edition uses UIMA for building full-text and semantic search indexes, and the Analytics Edition deploys UIMA for information extraction and text analysis.
Semantic Search applications can benefit the most by using the UIMA framework and UIMA components for:
· Identifying the language of the specific document
· Language-dependent linguistic processing (tokenization, lemmatization, and even part-of-speech detection).
· Analyzing the text contents for entity and relation detection.
Business Intelligence or Government Intelligence is another major area which can use UIMA. Sample applications include:
- Defect Detection and Early Warning System (gain insight from service and maintenance records)
- Customer support and self-service (analyzing the call center logs, emails etc.)
- Public image monitoring (gauging a product’s or company’s public image from internet forums and discussions).
- Insurance Fraud analysis (identifying hidden relationships and patterns from claims documents).