Data management image via Shutterstock
By James Kobielus (@jameskobielus)
Data management professionals know that how you model the data directly constrains how flexibly you can analyze it.
When you consolidate relational sources that embody divergent data schemas and definitions, you are inviting a world of pain. Rollup of those sources for unified drilldown can’t take place until you run it all through a gantlet of data integration, matching, merging, and cleansing. Even then, you generally have to make the resultant data set available in relational third-normal form.
And when you add unstructured sources to the mix, watch out! Querying across multi-structured sources might involve unstructured-data integration to transform the nonrelational data to relational schemas that support SQL access. Or it might involve keeping data in its source formats and offering agile query access through an abstraction that can do justice to the myriad semantics.
That’s where ontologies, taxonomies, and other data abstractions enter the picture. As multi-structured data moves into the mainstream, data scientists will increasingly require integration tools to help them analyze data within the semantic contexts expressed in these and other domain-specific abstractions. As noted in this recent article on ontologies, these and other abstractions have a clear analytic advantage over relational and other platform-specific models.
Ontologies, as author Malcolm Chisholm emphasizes, are principally oriented toward data’s analytical uses within and across disparate data-store implementations. Framed in Resource Description Framework (RDF) and other formats, ontologies are, he states, “analysis, not design, artifacts,” geared to semantic query and knowledge discovery. “An ontology is a view of the concepts, relations and rules for a particular area of business information, irrespective of how that information may be stored as data.”
In the broader perspective of multistructured analytics, ontologies support the following use cases:
- Building semantic models: Developers explicitly model semantics as RDF ontologies and/or related logical structures like taxonomies, thesauri, and topic maps. These ontologies are used to drive the creation of structured content that instantiates the entities, classes, relationships, attributes, and properties defined in the ontologies.
- Mediating between heterogeneous semantics: Developers use ontologies and other semantic models to drive the creation of mappings, transformations, and aggregations among existing, structured data sets.
- Mining the semantics implicit in unstructured formats: Developers use natural-language processing and pattern-recognition tools to extract the implicit semantics from unstructured text sources.
- Managing semantics in a consolidated repository: Application environments require repositories or libraries to manage ontologies and other semantic objects and maintain the rules, policies, service definitions, and other metadata to support the life-cycle management of application semantics.
- Governing semantics through comprehensive controls: Application environments require that various controls — on access, change, versioning, auditing, and so forth — be applied to ontologies; otherwise, it would be meaningless to refer to them as “controlled vocabularies.”
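To make the first use case a little more concrete, here is a minimal sketch of an ontology held as subject-predicate-object triples, loosely in the spirit of RDF. It is not a real RDF toolkit: every identifier in it (`ex:Party`, `ex:Customer`, `match`, `instances_of`, and so on) is a hypothetical name invented for illustration, and the subclass reasoning is deliberately simplistic.

```python
# Toy ontology and instance data as (subject, predicate, object) triples.
# All "ex:" names are hypothetical; this is an illustration, not an RDF library.
ONTOLOGY = {
    ("ex:Customer", "rdfs:subClassOf", "ex:Party"),
    ("ex:Supplier", "rdfs:subClassOf", "ex:Party"),
    ("ex:Order", "ex:placedBy", "ex:Customer"),
}

INSTANCES = {
    ("ex:acme", "rdf:type", "ex:Customer"),
    ("ex:globex", "rdf:type", "ex:Supplier"),
    ("ex:order42", "rdf:type", "ex:Order"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def subclasses_of(cls):
    """Return cls plus every class that is transitively a subclass of it."""
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for child, _, _ in match(ONTOLOGY, p="rdfs:subClassOf", o=parent):
            if child not in found:
                found.add(child)
                frontier.append(child)
    return found


def instances_of(cls):
    """Return instances typed as cls or as any of its subclasses."""
    classes = subclasses_of(cls)
    return sorted(s for s, _, o in match(INSTANCES, p="rdf:type") if o in classes)


print(instances_of("ex:Party"))
```

The point of the sketch is that the query is phrased in terms of the concept `ex:Party`, not in terms of whichever tables or files happen to hold customers and suppliers.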
You might regard ontologies as metadata applicable to the deep analytic meaning of data. As such, ontologies are a key semantic stratum within which all data-driven insights are rooted firmly–and from which they all exude like liberated liquid energy.
VMware image via Shutterstock
Missed all of the VMworld 2014 news? Not to worry, it’s all here in this week’s roundup.
1. With EVO: RAIL, VMware turns VSAN into a franchise – Dave Raffo (SearchVirtualStorage)
VMware’s EVO: RAIL allows hardware vendors to build hyper-converged appliances running VSAN and other VMware software.
2. Software-defined data centers pique IT’s interest – Margie Semilof (SearchServerVirtualization)
IT pros’ interest in software-defined data centers continues to grow as tools such as VMware’s EVO: RAIL offer IT an effective small-business option.
3. Backoff point-of-sale malware hits over 1,000 businesses – Brandan Blevins (SearchSecurity)
In an advisory Friday, the U.S. government estimated that the Backoff point-of-sale malware campaign has struck over 1,000 businesses to date.
4. Apple and FBI launch iCloud hack investigation – Warwick Ashford (ComputerWeekly)
Apple and FBI investigate the breach of Apple’s iCloud causing fresh business concerns over cloud security.
5. Maxta Inc. develops MaxDeploy, seeks hardware partners – Garry Kranz (SearchVirtualStorage)
Like VMware, Maxta wants to sell its software-only, hyper-converged storage platform integrated on standard industry hardware.
Google image via Shutterstock
How will Google be able to combine IaaS and PaaS? Tune into this week’s roundup to find out.
1. Google fills the gap between IaaS and PaaS – Trevor Jones (SearchCloudComputing)
Google wants to merge the worlds of IaaS and PaaS to create a single continuum of services for customers. It’s likely a sign of things to come from all the major public cloud vendors as they look to cover their bases in the maturing market.
2. Two-year PC replacement saves cost, raises productivity – Diana Hwang (SearchEnterpriseDesktop)
IT pros debate whether companies should replace PCs every two years instead of following conventional wisdom of three to four years. In today’s world, one size doesn’t fit all, and a two-year cycle may work in some cases.
3. Community Health breach shows detecting Heartbleed exploits a struggle – Brandan Blevins (SearchSecurity)
The difficulty of detecting Heartbleed exploits means that the Community Health breach is unlikely to be the last incident linked to the OpenSSL flaw.
4. New partnerships, SLAs make Google Enterprise services a UC option – Gina Narcisi (SearchUnifiedCommunications)
Consumer Google services like Hangouts weren’t always an option for enterprises. New partnerships with UC providers are making Google Enterprise Solutions more appealing as UC tools.
5. FC Bayern Munich partners with SAP for help with sports analytics – Todd Morrison (SearchSAP)
In this roundup, SAP inks a deal with FC Bayern Munich that includes sports analytics, and an Austrian retailer looks for better inventory control.
Cloud Computing image via Shutterstock
Will Microsoft be able to make a dent in Amazon’s lead in the IaaS cloud market? Find out in this week’s roundup.
1. IaaS cloud race far from over – Adam Hughes (SearchCloudComputing)
Amazon Web Services remains the frontrunner in the IaaS cloud market, but Microsoft Azure has made strides to improve its cloud. Can Microsoft capitalize on its advantages and make a bigger dent?
2. Microsoft issues critical IE patch, introduces whitelisting – Jeremy Stanley (SearchWindowsServer)
Microsoft patched two publicly known vulnerabilities in the August Patch Tuesday update. The company also introduced plug-in whitelisting in IE.
3. OpenStack market size will cross $1.7bn by 2016, says 451 Research – Archana Venkatraman (ComputerWeekly)
Free and open-source cloud computing platform OpenStack could reach an estimated market size of $1.7bn by 2016.
4. Internet of Things security issues rise to the fore at Black Hat – Brandan Blevins (SearchSecurity)
This year’s Black Hat showed that the Internet of Things security issues are going to demand increased attention in the near future.
5. Data explosion poses storage challenges to universities – Carol Sliwa (SearchStorage)
The incoming Michigan State CIO discusses the data storage challenges universities have to deal with and how to address them with cloud storage.
Algorithm image via Shutterstock
By James Kobielus (@jameskobielus)
People have invested the word “algorithm” with some sort of mystic power. In the popular mind, that word seems to stand for the secret sauce–or evil spirit–that animates big data.
Attributing the power of big-data analytics to some magical resource called “algorithms” isn’t terribly enlightening. It takes much more than algorithms–which are as diverse, malleable, and promiscuous as molecules–to extract meaningful insights from big data.
More than mere algorithms, what you need are data scientists who get the data in shape for statistical analysis and exploratory visualization. As I noted in this blog from last year, every step of the data scientist’s working method involves selecting from diverse options: analytic problems, subject populations, sources, samples, model versions, predictive variables, visualizations, and so on.
And, oh yes, of course ... the right algorithms. Stepping through the standard methodology, as defined in the cited blog, is a sort of meta-algorithmic discipline at the heart of professional data science. If a data scientist makes the wrong choice at any step (including, but not limited to, the choice of algorithm), they may never find the underlying correlations they seek. Worse yet, they may “find” spurious correlations and thereby inadvertently deceive themselves and others regarding what’s actually going on in their problem space. There is no foolproof mental algorithm to steer statistical analysts in the right direction as they seek the baseline causal factors in any domain.
If you’re unfamiliar with statistical modeling best practices, you may think that the choice of algorithm is simple: just go with something that everybody talks about called “regression algorithms.” But you would be wrong. Not only are there other types of essential data-science algorithms (e.g., clustering and segmentation), depending on what you’re trying to accomplish, but as Vincent Granville states in this recent blog, even if you focus only on regression, there are hundreds of those algorithms to choose from. And you can blend them in countless permutations. You might even develop your own, if you have an especially astute mathematical mind.
The most enlightening aspect of Granville’s discussion is how he characterizes the statistical modeling scenarios within which each type of algorithm is best suited. For a working data scientist, the trade-offs and optimal blending of diverse algorithmic approaches must always be revisited in every new modeling exercise.
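As a small illustration of why the choice matters even within regression, the sketch below fits the same six points with two different slope estimators: ordinary least squares and the Theil-Sen estimator (the median of all pairwise slopes). Neither is taken from Granville's post; they are simply two well-known options, and the data set is made up, with one deliberate outlier.

```python
from itertools import combinations
from statistics import mean, median


def ols_slope(xs, ys):
    """Ordinary least squares slope: covariance of x and y over variance of x."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def theil_sen_slope(xs, ys):
    """Theil-Sen slope: the median of the slopes between every pair of points."""
    points = list(zip(xs, ys))
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1
    ]
    return median(slopes)


# Made-up data: the first five points lie exactly on y = 2x,
# and the last point is an outlier.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 40]

print("OLS slope:      ", ols_slope(xs, ys))
print("Theil-Sen slope:", theil_sen_slope(xs, ys))
```

On this data the least-squares slope comes out at about 6, while the Theil-Sen slope stays at 2. Which answer is "right" depends on whether the outlier is noise or signal, which is exactly the kind of judgment no algorithm makes for you.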
It’s clear that no one uber-algorithm will ever be suitable for illuminating the infinite range of statistical patterns that might inhere within real-world data.
VMware image via Shutterstock
What should you expect at VMworld 2014? Tune into this week’s roundup to find out.
1. Interests go beyond technology at VMworld 2014 – Tom Walat (SearchVMware)
The annual VMware event is expected to draw more than 20,000 but, for some, the company’s products aren’t the only selling point.
2. VMware-AirWatch integration details, new features revealed – Jake O’Donnell (SearchConsumerization)
VMware continues to drop details about its integration with AirWatch for end user computing. Among the new items coming is a mobile container that brings together many different technologies from both companies.
3. Study: Cloud app data sharing growth increases risks – Rob Wright (SearchCloudSecurity)
Netskope’s Cloud Report shows the average number of cloud apps used in the enterprise is growing — but the majority of those apps lack proper security and policy controls.
4. RackWare expands software into cloud disaster recovery – Sonia Lelii (SearchDisasterRecovery)
RackWare turns its cloud application migration software into an automated DR tool by adding failback, failover and building on migration.
5. Android vulnerability enables app impersonation, heightens BYOD risks – Sharon Shea (SearchSecurity)
News roundup: The ‘Fake ID’ flaw on Android devices allows malicious apps to impersonate trusted ones, putting confidential data at risk and reigniting BYOD security concerns.
Mobile image via Shutterstock
Between Android, iOS and Windows Phone, which is the best choice for you? Find out in this week’s roundup.
1. Android, iOS, Windows mobile OS war a positive for customers – Jake O’Donnell (SearchConsumerization)
Which mobile OS is best for your enterprise? IT pro Michael Thomason took a deep dive this week with the three leaders — Android, iOS and Windows Phone — and found pros and cons for all, which in the end means customers have real choices.
2. AWS expansion in Europe likely under data localization pressure – Beth Pariseau (SearchAWS)
An AWS Germany region is expected as part of the cloud behemoth’s expansion in Europe, along with stronger partnerships between local service providers, but IT pros say data localization is only one piece of the puzzle.
3. IBM SoftLayer a few pieces short of a finished puzzle – Ed Scannell (SearchCloudComputing)
IBM’s heavy investment in SoftLayer over the first 12 months got the attention of many IT shops. But some say IBM needs to deliver more before they can commit.
4. Are BlackBerry security features still an enterprise differentiator? – Brandan Blevins (SearchSecurity)
While BlackBerry’s CEO touts the mobile platform’s security features, experts question whether the advantage over iOS and Android still exists.
5. Cloud growth good but SAP should do more, says one analyst – Todd Morrison (SearchSAP)
In this SAP news roundup, one analyst says SAP has to do more to truly become a cloud company despite strong growth, and SAP launches a new effort to help SMBs.
VMware image via Shutterstock
What can we expect from VMworld 2014? Find out in this week’s roundup.
1. VMware Marvin speculation and VMworld expectations – Nick Martin (SearchServerVirtualization)
In this podcast, Nick Martin talks with Christian Mohn about the VMware Marvin speculation and what we’re expecting to see at VMworld 2014.
2. Microsoft disses DaaS with Azure RemoteApp – Bridget Botelho (SearchVirtualDesktop)
The upcoming Azure RemoteApp cloud service from Microsoft bypasses DaaS and delivers apps to mobile devices without Windows. In part one of this two-part story, we look at why Microsoft sidestepped Windows.
3. Windows 9 features may address unified apps and the cloud – Robert Sheldon (SearchEnterpriseDesktop)
Based on the Windows 8.1 update, it’s reasonable to expect Windows 9 features for universal apps and cloud integration. Will they entice enterprises?
4. July 2014 Oracle CPU: Java security problems persist – Brandan Blevins (SearchSecurity)
With another round of patches for several serious Java flaws, Oracle’s quarterly CPU showed that Java security problems are not receding.
5. Culture shock: Apple, IBM, Microsoft disrupt themselves – Francesca Sales (SearchCIO)
IBM and Apple’s pact to usher in analytics-enabled mobile apps to enterprises could be the start of a powerful friendship — and spell doom for rivals. Plus, Google Q2 earnings and Oracle tackles Hadoop, all in this week’s Searchlight.
Cloud storage image via Shutterstock
With Microsoft expected to make several cloud storage announcements in the near future, what does that mean for Azure? Find out in this week’s roundup.
1. Microsoft cloud storage may lift Azure skyward – Ed Scannell (SearchWindowsServer)
Microsoft will continue to blare its Azure cloud next week with several cloud storage announcements. Will users listen this time?
2. Amazon’s Dropbox answer leaves IT with big questions – Jake O’Donnell (SearchConsumerization)
Amazon introduced Zocalo into the hot file sync and share market. But questions about encryption keys might make it a tough sell in enterprises.
3. More Office 365 subscription plans, pricing changes ahead – Diana Hwang (SearchEnterpriseDesktop)
Microsoft will replace existing Office 365 SMB plans in October, increasing the user cap for all plans and cutting per-user monthly fees for some plans.
4. New VMware beta program aims to kill vSphere 2015 bugs – Colin Steele and Tom Walat (SearchServerVirtualization)
The vSphere 2015 beta party is no longer invite-only. VMware pros hope the new program will reduce the number of bugs in the next version of vSphere.
5. July 2014 Patch Tuesday fixes two dozen IE vulnerabilities – Brandan Blevins (SearchSecurity)
Microsoft’s July 2014 Patch Tuesday release addressed two dozen flaws in Internet Explorer. Adobe also provided a critical update for Flash.
Big data image via Shutterstock
By James Kobielus (@jameskobielus)
Hadoop isn’t just about big data. It’s also about big–as in rich, deep, sophisticated, and diverse–algorithm libraries that execute within Hadoop clusters.
Your choice of a Hadoop analytic-application development platform–aka “sandbox”–is an important factor in realizing the aims of your big-data projects. The sandbox is where most big-data application developers–aka data scientists–will spend most of their productive hours. If you fail to provide them with a common sandboxing platform with a rich library of algorithms and models, you’ll make it difficult for them to pool their expertise on common projects using shared tools.
Developer productivity depends on having rich algorithm libraries that can tap into petabytes of data in HDFS and other storage resources, as well as into the MapReduce, YARN, and other execution engines in Hadoop platforms. For example, IBM PureData System for Hadoop integrates our BigInsights Hadoop analytics software platform and tooling. Key among its features is an extensible, built-in library of machine learning, statistical modeling, data mining, predictive analytics, text analytics, and spatial analytics functions.
As Andrew Oliver notes in this recent post, machine learning libraries are essential to the success of many Hadoop projects. In particular, Apache Mahout is the principal machine-learning library that is optimized for Hadoop, and it has wide adoption. Mahout includes algorithms for K-means clustering, fuzzy K-means clustering, latent Dirichlet allocation, singular value decomposition, logistic regression, naive Bayes, random forests, and other popular machine-learning approaches.
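For readers who have not met K-means before, the following is a minimal single-machine sketch of the idea in plain Python. It is not Mahout's API and says nothing about how Mahout distributes the work; the function names, the naive "first k points" initialization, and the sample points are all assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def kmeans(points, k, iterations=10):
    """Plain K-means with a fixed number of iterations.

    Initialization is deliberately naive (the first k points), so the
    result is deterministic but not necessarily a good clustering.
    """
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster,
        # keeping the old centroid if the cluster ended up empty.
        centroids = [
            centroid(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up 2-D points forming two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

A library such as Mahout earns its keep where this sketch stops: smarter initialization, convergence checks, and running the assignment and update steps across data far too large for one machine.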
It’s important to note that Mahout algorithms don’t always need to be run in conjunction with MapReduce (or YARN, for that matter) on Hadoop clusters, so they can conceivably run faster and more efficiently. However, Mahout is by no means the only library that can work with Hadoop clusters or that has been optimized for this big-data platform. For example, you can also execute the algorithms in the IBM Netezza Analytics library directly on BigInsights without invoking the platform’s MapReduce engine.
Regardless of the merits of Mahout or alternatives, this discussion points to the fact that Hadoop is a versatile development platform that is not constrained to one library, one language, or one approach for doing machine learning or statistical modeling in general. As Apache Spark takes hold in the Hadoop arena, we can expect its principal machine-learning library, MLlib, to take residence alongside Mahout in many data scientists’ sandboxes.
As you evolve your big data environment toward Spark and other new approaches, you should be protecting your investments in big-data analytic libraries. If you implement new big-data platforms but can’t leverage the rich trove of algorithms and models that you’ve implemented on older platforms, you will have squandered intellectual property that may be the key to the success of future analytic initiatives.