Enterprise IT Watch Blog

Jul 7 2014   12:36PM GMT

Big algorithm libraries breathe life into big data

Michael Tidmarsh

Big Data

Big data image via Shutterstock

By James Kobielus (@jameskobielus)

Hadoop isn’t just about big data. It’s also about big–as in rich, deep, sophisticated, and diverse–algorithm libraries that execute within Hadoop clusters.

Your choice of a Hadoop analytic-application development platform–aka “sandbox”–is an important factor in realizing the aims of your big-data projects. The sandbox is where most big-data application developers–aka data scientists–will spend most of their productive hours. If you fail to provide them with a common sandboxing platform with a rich library of algorithms and models, you’ll make it difficult for them to pool their expertise on common projects using shared tools.

Developer productivity depends on having rich algorithm libraries that can tap into petabytes of data in HDFS and other storage resources, as well as into the MapReduce, YARN, and other execution engines in Hadoop platforms. For example, IBM PureData System for Hadoop integrates our BigInsights Hadoop analytics software platform and tooling. Key among its features is an extensible, built-in library of machine learning, statistical modeling, data mining, predictive analytics, text analytics, and spatial analytics functions.

As Andrew Oliver notes in this recent post, machine learning libraries are essential to the success of many Hadoop projects. In particular, Apache Mahout is the principal machine-learning library that is optimized for Hadoop, and it has wide adoption. Mahout includes algorithms for K-means clustering, fuzzy K-means clustering, latent Dirichlet allocation, singular value decomposition, logistic regression, naive Bayes, random forests, and other popular machine-learning approaches.
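To make the first item on that list concrete, here is a minimal single-machine sketch of K-means clustering (Lloyd's algorithm) in plain Python. It is not Mahout's API or implementation: the function names `nearest` and `kmeans`, their parameters, and the sample data are all assumptions made for illustration. A library such as Mahout applies the same assign-then-average idea across data distributed over a cluster.

```python
import random


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centroids[i])),
    )


def kmeans(points, k, iterations=20, seed=0):
    """Illustrative K-means: points is a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Start from k distinct points chosen at random (an assumed, simple initialization).
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break  # Converged: no centroid moved.
        centroids = updated
    # Label every point against the final centroids.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centroids, labels = kmeans(sample, k=2)
    print(centroids, labels)
```

The sketch holds every point in memory and recomputes all distances on each pass, which is exactly the part that a Hadoop-optimized library distributes across nodes.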

Mahout algorithms don’t always need to run on MapReduce (or YARN, for that matter) in a Hadoop cluster; freed from that overhead, they can conceivably run faster and more efficiently. However, Mahout is by no means the only library that can work with Hadoop clusters or that has been optimized for this big-data platform. For example, you can also execute the algorithms in the IBM Netezza Analytics library directly on BigInsights without invoking the platform’s MapReduce engine.

Regardless of the merits of Mahout or its alternatives, this discussion shows that Hadoop is a versatile development platform that is not constrained to one library, one language, or one approach for doing machine learning or statistical modeling in general. As Apache Spark takes hold in the Hadoop arena, we can expect its principal machine-learning library, MLlib, to take up residence alongside Mahout in many data scientists’ sandboxes.

As you evolve your big-data environment toward Spark and other new approaches, you should protect your investments in big-data analytic libraries. If you implement new big-data platforms but can’t leverage the rich trove of algorithms and models that you’ve implemented on older platforms, you will have squandered intellectual property that may be the key to the success of future analytic initiatives.
