Enterprise IT Watch Blog

Jun 25 2015   7:46AM GMT

Spark as a fit-for-purpose big data component

Michael Tidmarsh Michael Tidmarsh Profile: Michael Tidmarsh

Tags:
Big Data
Spark

shutterstock_196477304
Big data image via Shutterstock

By James Kobielus (@jameskobielus)

Big data analytics infrastructures are growing more hybridized than ever. Every new technology—such as Hadoop, in-memory databases, and graph databases—finds its specific niche in terms of use cases, deployment modes, and applications for which it is best suited.

Even as Apache Spark pushes more deeply into big-data environments, it won’t substantially change this trend. Yes, of course Spark is on the fast track to ubiquity in big-data analytics. This is especially true for the next generation of machine-learning applications that feed on growing in-memory pools and require low-latency distributed computations for streaming and graph analytics. But those use cases aren’t the sum total of big-data analytics and never will be.

As we all grow more infatuated with Spark, it’s important to continually remind ourselves of what it’s not suitable for. If, for example, one considers all the critical data management, integration, and preparation tasks that must be performed prior to modeling in Spark, it’s clear that these will not be executed in any of the Spark engines (Spark SQL, Spark Streaming, GraphX). Instead, they’ll be carried out in the data platforms and elastic clusters (HDFS, Cassandra, HBase, Mesos, cloud services, etc.) upon which those engines run. Likewise, you’d be hardpressed to find anyone who’s seriously considering Spark in isolation for data warehousing, data governance, master data management, or operational business intelligence.

Above all else, Spark is the new power tool for data scientists who are pushing boundaries in the emerging era of in-memory big data analytics in low-latency scenarios of all types. In a recent column, I commented on the likely sweet-spot deployment roles—fog, stream, and cloud–where Spark will prove its value as a development tool for the new generation of data scientists building the in-memory statistical models upon which it all will depend.

Let’s not fall into the delusion that everything is converging toward Spark, as if it were the ravenous maw that will devour every other big-data analytics tool and platform. Spark is just another approach that’s being fitted to and optimized for specific purposes.

And let’s resist the hype, such as in the headline of this recent article, that treats Spark as Hadoop’s “successor.” This implies that Hadoop and other big-data approaches are “legacy,” rather than what they are, which is foundational. For example, no one is seriously considering doing “data lakes,” “data reservoirs,” or “data refineries” on anything but Hadoop or NoSQL.

After all, you can’t spark an analytic combustion engine if there’s no data fuel in the tank.

 Comment on this Post

 
There was an error processing your information. Please try again later.
Thanks. We'll let you know when a new response is added.
Send me notifications when other members comment.

Forgot Password

No problem! Submit your e-mail address below. We'll send you an e-mail containing your password.

Your password has been sent to:

Share this item with your network: