The trend of SQL query engines appearing on Hadoop is just the start of a movement; SQL query engines running on data other than HDFS may follow. If these trends portend fitful change for users, they also affect vendors.
One vendor’s journey here is particularly telling. Starburst Data might be called a ‘re-start-up.’ The company was the brainchild of young data technologists who included Daniel Abadi, an academic researcher who helped advance the notion of column-store parallel databases in the early 2000s. In 2011, he helped form Hadapt — one of the first SQL-on-Hadoop providers.
In 2014, the company was purchased by Teradata. The timing proved a bit odd, as it nearly coincided with Facebook ceding much Presto development responsibility to Teradata. Presto is a SQL-on-Hadoop tool that the social media giant forged in-house, and it has subsequently been endorsed by no less than Amazon for its Athena SQL engine. The former Hadapt group within Teradata shifted its efforts to improving performance for a Presto-compatible SQL query engine.
At the end of 2017, Hadapt principals within Teradata spun out to form Starburst, with Teradata’s blessing. A Starburst goal is to bring SQL engine prowess to SMBs that are still outliers in Teradata’s more familiar big-player universe. An early effort for standalone Starburst has been a cost-based optimizer for Presto, built in collaboration with Facebook engineers. For the many lovers of SQL joins, the new optimizer supports join reordering and join distribution choice.
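For a sense of what those optimizations look like in practice, here is a minimal sketch, assuming the presto-python-client is installed, of running a join-heavy query with the cost-based settings switched on. The coordinator host, catalog and table names are placeholders, and the exact session property names and client arguments can vary across Presto versions.

```python
# Hypothetical example: enable Presto's cost-based join optimizations for a session.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com",   # placeholder coordinator host
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
    session_properties={
        "join_reordering_strategy": "AUTOMATIC",  # let the optimizer reorder joins
        "join_distribution_type": "AUTOMATIC",    # let it pick broadcast vs. partitioned joins
    },
)

cur = conn.cursor()
cur.execute("""
    SELECT c.region, SUM(o.total_price) AS revenue
    FROM orders o
    JOIN customer c ON o.custkey = c.custkey
    GROUP BY c.region
""")
for row in cur.fetchall():
    print(row)
```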
The picture emerging shows differences in use cases between plain-vanilla Hadoop and SQL on Hadoop. The difference is between Hadoop being fit for the purposes of small data science groups and skunk works, and Hadoop being useful for the interactive needs of wider groups of SQL business analytics users. We are also seeing HDFS, the file system at the base of Hadoop, giving way as more people choose to pursue these types of applications in the cloud rather than on premises.
Listen to the latest Talking Data podcast, which features Starburst Data CEO Justin Borgman. We left a noisy restaurant to record the interview and found a noisy Boston waterfront, with massively loud construction in full throat. Enjoy! – Jack Vaughan
It’s been said that Oracle leader Larry Ellison advises his troops to focus on one competitor at a time, and in recent years that competitor has been Amazon. What started out as an online bookstore eventually morphed into a general mega-store and then, surprisingly, into a mega IT outsourcer. In many ways, it created the cloud computing formula.
Like other leading lights of enterprise computing, Oracle is in the midst of efforts to shift focus from customers’ on-premises data centers to its own cloud computing centers, and to keep those customers in the Oracle camp. Oracle’s counter thrusts to Amazon are one of the defining aspects of technology today. But it is a balancing act.
The Collaborate 2018 conference at the Mandalay Bay Resort and Casino in Las Vegas would seem an apt place to take the measure of Oracle’s progress toward the cloud. Recorded as the event began, this remote edition of the Talking Data podcast sorts through challenges the Redwood City, Calif.-based IT giant faces on the road to cloud.
Underlying its cloud efforts are moves in both databases and applications. Those are key columns of Collaborate, which brings together IOUG Oracle database users, OAUG Oracle eBusiness applications users and Quest JD Edwards/PeopleSoft applications users.
These databases and applications suites are well entrenched in organizations, usually in very large enterprises. Moving these into cloud is a multiyear project in most cases. While complex enterprise applications stay home, new applications are driving to the cloud.
A state of steady cloud movement — but less than a startling shift — seemed to be borne out at Collaborate 2018.
At the event, we asked: “How is the Oracle database and applications cloud migration going?”
“We are not seeing enough large-scale movement to really tell. It’s really just one-off stuff,” Stephen Kost, CTO at Integrigy, a security software and services provider, told this reporter at the conference. “People are moving to small web applications.”
In Kost’s view, much of Oracle’s strength is in large companies that may have 1,000 databases. But in many cases, he said, only a handful of those databases have been moved to the cloud so far.
Up the Las Vegas strip from Collaborate this same week, perhaps not so coincidentally, another Oracle-related conference took place. NetSuite SuiteWorld 2018 was built around the cloud ERP offerings that became part of the Oracle portfolio via acquisition in 2016. As at Collaborate, much of the discussion was around embedding AI into applications.
Oracle’s purchase of NetSuite was a tacit admission that “cloud is different” and that it needed a wholly separate product line to attract small- and medium-size business customers to its applications.
It was also an admission that Oracle saw cloud migration as a multiyear effort that needed to be addressed from several directions. In a phone call after SuiteWorld, Holger Mueller of Constellation Research told us Oracle has avoided the temptation to roll NetSuite together with its incumbent applications suites. At the same time, he said, it has been expanding NetSuite globally and injecting elements of its AI and machine learning research and development.
That is also what Oracle has begun to do with the e-Business Suite, JD Edwards, and PeopleSoft portfolio. Still, for now, Oracle’s cloud application migration might be described as a delicate balancing act within a delicate balancing act. – Jack Vaughan
Click here to see a video version of this podcast.
Recent convocations of the Strata big data conference have seen a move away from sessions focused on data infrastructure and Hadoop, and toward analytical applications and data science tools. Where is Strata going? Strata, it seems, cannot contain itself when it comes to software containers.
News around Kubernetes and containers figured prominently in coverage at the recent Strata Data Conference in San Jose, Calif.
Containerized apps have significant benefits, as big data developers are discovering. Big data applications that sprouted up indiscriminately in organizations are now being folded into central data lakes and the like, and containers are increasingly seen as a flexible way to handle a procession of distinct and ever-changing analytics jobs. Putting jobs in self-contained units seems a way to bring order to the workloads.
A key appears to be container orchestration, which, in the form of the Kubernetes open source standard, seems poised to usher in a new way of handling big data work. That is according to TechTarget Senior Executive Editor Craig Stedman, speaking with Senior News Writer Jack Vaughan.
Vendors that showcased their container-related efforts at the conference included Blue Data, data Artisans, Intel, MapR Technologies and Pepperdata, among others. They almost universally cite recommendation engines, machine learning and AI as apt first targets.
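To make the orchestration idea concrete, here is a minimal sketch, using the official Kubernetes Python client, of submitting one self-contained analytics job (say, a nightly recommendation-model run) to a cluster. The image name, namespace and command are hypothetical placeholders, not any vendor's actual workload.

```python
# Hypothetical example: submit a containerized analytics job to Kubernetes.
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig with access to the cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="nightly-recommendations"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry a failed run at most twice
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="recommender",
                        image="registry.example.com/analytics/recommender:1.0",  # placeholder image
                        command=["python", "train.py", "--date", "2018-03-01"],
                    )
                ],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="analytics", body=job)
```

Each such job runs in its own container, which is the self-contained unit described above; Kubernetes handles the scheduling, retries and cleanup.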
Still, a message that arises in this podcast is that it is early days. While improved speed of development and deployment seems in the offing, a lot of preliminary work must be done before pushing that magic button of container automation.
Containers may be one of those unique cases where software engineering takes significant inspiration from work in other industries. That is because the container metaphor comes straight from logistics advances that, beginning in World War II, transformed global shipping.
In the world today, 90% of non-bulk cargo is conveyed by containers – perhaps, someday, something similar will be said about software containers for persistent data. There is much that needs to be developed when it comes to persistent containers for big data, but important efforts are now underway, as discussed in this edition of the Talking Data podcast. Listen, and learn more. – Jack Vaughan
C-suite folks and others have taken notice this week as Facebook finds itself in a sack of woe. Data privacy is at issue.
The Silicon Valley high flyer has gained the kind of publicity you don’t want, in the wake of news that its social media platform was used to gather up the Facebook profile information of thousands (or millions — the full details of the story are still coming in) of users for political consultancy Cambridge Analytica, for still-to-be-determined purposes.
Cambridge Analytica did its gathering via a third party that fielded a “thisisyourdigitallife” survey app that walked users through personality quizzes and, incidentally, scraped info from their and their friends’ profiles.
Facebook, in its responses to this news, at first stridently held that this was not actually a data breach, but rather a rogue third-party developer’s wandering from company data policy. Let’s call it a “data faux pas of the developer kind.” In the face of general uproar, that somewhat dismissive response was later dialed back.
Which brings us to GDPR. Facebook’s latest egg-on-its-face moment comes only a month after COO Sheryl Sandberg had assured a crowd in Brussels that the company was ready for the General Data Protection Regulation, or GDPR, the EU’s comprehensive data privacy edict, due to go into enforcement in late May.
GDPR is intended to give web users more control over their online data profiles – just the type of control that now seems to have been lacking in the Facebook-Cambridge Analytica episode.
Data professionals have been grappling of late with GDPR, which, by some measures, is meant to bring about a firm style of data governance — one that might have gone missing during the big data gold rush of which Facebook was a big part.
How ready data professionals are for GDPR is a matter of conjecture, and it is the topic of this edition of the Talking Data podcast, recorded after Sandberg’s Brussels pronouncement, but just before the news on Facebook and Cambridge Analytica broke.
The podcasters discuss GDPR and the different uptake rates it’s likely to exhibit in different industries and departments within companies. Marketing, they suggest, may be a little late to the GDPR party but is starting to get involved.
Whether you call the Facebook-Cambridge Analytica event a data breach, a data faux pas or what have you, it is fair to say the latest doings will quickly bring data privacy and GDPR to more people’s attention. Many observers expect Facebook will be among the first to hear the GDPR enforcers knocking. – Jack Vaughan
We’ve all heard the scare stories about AI eliminating jobs, but we’ve also heard the Pollyanna voices saying everything will be fine on the jobs front.
The reality may be somewhere in the middle, according to Goldman Sachs analyst Heath Terry. In a presentation at the AI World Conference & Expo in Boston, held in December, Terry talked about how the risk of people losing their jobs to AI in the short term is very real. Jobs may recover long-term, but people should expect some disruption soon, he said.
He also said that established tech players like Amazon and Google are not likely to be the dominant vendors when it comes to AI. Instead, he expects some new, as-yet-unknown vendor to drive the innovation.
Listen to this podcast to hear more about Terry’s predictions and how they may play out in the jobs and tech markets.
The scale of HDFS continues to soar upward. For large social media and cloud providers, the size of Hadoop clusters is such that it is hard to test out this basic component of classic Hadoop at scale before rollouts. That is another one of those niggling issues that slow Hadoop adoption.
At LinkedIn, the challenges of successfully making even small configuration changes across broad arrays of HDFS led a team to create Dynamometer. This load and stress test suite uses actual NameNodes combined with simulated DataNodes to prove out changes to settings across the large Hadoop data farms that help LinkedIn link folks together.
Today, many issues are impossible to test out without running a cluster that is similar in size to what is used in production, according to Carl Steinbach, senior staff software engineer at LinkedIn. He said that one of the goals of the project is to have a positive “upstream effect” on Apache community members’ releases and, in effect, to make testing at scale foremost in the effort.
Steinbach’s colleague, engineer Erik Krogen, adds that HDFS developers are looking toward the day when they can find bugs before new versions are committed, rather than six months later, when the new software lands on very large-scale clusters.
In this edition of the Talking Data Podcast, the crew speaks with analyst Mike Matchett, of the SmallWorldBigData consultancy, to get a better view into Dynamometer. Testing tools of this kind will only gain in importance going forward, he suggested.
“Whatever big data we have today, it’s going to be bigger tomorrow,” said Matchett.
Also on tap in this podcast is a discussion of TensorFlow from Google. The company has supported work on this machine learning framework on CPUs, GPUs and, most recently, TPUs – these being Tensor Processing Units built especially to accomplish highly iterative neural network computations. Word is that Google is preparing to open up TPU processing to outsiders that use its Google Cloud Platform. – Jack Vaughan
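As an aside for readers curious what targeting these different processors looks like from the framework side, here is a minimal sketch in the TensorFlow 1.x-era API of pinning a small computation to a specific device. Cloud TPU access involves additional cluster setup that is not shown here.

```python
# Minimal sketch: explicit device placement in TensorFlow (1.x-era API).
import tensorflow as tf

with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[5.0, 6.0], [7.0, 8.0]])

with tf.device("/gpu:0"):
    # The iterative matrix math at the heart of neural networks is what GPUs
    # and TPUs accelerate; here it is just a single matrix multiply.
    product = tf.matmul(a, b)

# allow_soft_placement falls back to CPU if no GPU is present.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(product))
```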
The inaugural edition of the Talking Data podcast for 2018 features Wikibon analyst James Kobielus, who helps us take the racing pulse of data today. AI, machine learning, deep learning and analytics all come in for consideration. Buckle your seat belt, listen to the podcast and get ready for another tumultuous ride down the big data slope. – Jack Vaughan
Machine learning was the big story of 2017, and we here at SearchBusinessAnalytics spent a lot of time talking with businesses that use the technology.
In this edition of the Talking Data podcast, we recap some of the best interviews we did on the topic. The interviews look at everything from the role of engineers to avoiding black box functionality in models.
Talking with people who actually use a given technology is generally one of the best ways to learn about its real importance, and we spend a lot of time trying to get this perspective from our stories. As we bring 2017 to a close, we hope you can learn something useful from our efforts to drive your machine learning initiatives into the new year. Happy holidays!
As 2017 winds down, we invite you to take a look behind the big data curtain. There, you will find data engineers, data scientists, end-users and others working to move a big data concept into production. It doesn’t take much digging to find that more self-service capabilities are needed at each stage in the data life cycle.
That is among the take-aways from this latest edition of the Talking Data Podcast. In this and a subsequent episode, Ed Burns and I discuss recent user stories that graced the editorial pages of SearchBusinessAnalytics.com and SearchDataManagement.com – ones that speak to some of the outstanding trends of the year just winding down.
One of the telling threads we found was self-service; that is, self-service as it relates to ETL, as it relates to interactive data queries, and as it relates to cluster configuration. In the latter case, one example is restaurateur Panera Bread. The chain is among the companies with particularly aggressive web initiatives underway.
More and more, when lunchtime arrives, incoming orders come in via cell phone. That can stress operational systems. Aware of this threat, Panera Bread built a Spark-Hadoop system to analyze computing needs for the processing involved in handling the lunchtime crush. It was the first in a series of Hadoop apps that Panera is spinning up quickly, after deciding to use automated container configuration software.
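As a rough illustration of the kind of Spark job described above, the sketch below aggregates mobile orders by hour to show where the lunchtime peak lands. The column names and data location are assumptions for illustration, not Panera’s actual schema.

```python
# Hypothetical example: measure the lunchtime crush in hourly order counts with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lunch-crush-load").getOrCreate()

orders = spark.read.parquet("hdfs:///data/orders/")  # assumed data location

hourly = (orders
          .filter(F.col("channel") == "mobile")   # assumed column names
          .withColumn("hour", F.hour("order_ts"))
          .groupBy("hour")
          .count()
          .orderBy("hour"))

hourly.show()
```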
Panera announced earlier this year that annual digital sales had gone past $1 billion, and projected that digital sales could double by 2019. The ability to let individuals spin up big data jobs at will should become even handier going forward, one of the company’s engineering leads said.
Self-service that empowers more individuals in the data pipeline is a fact of life that IT has generally come to accept. It seems now to be a big part of moving at the speed of innovation. Listen to this podcast and feel free to come back for seconds.
Tableau currently has a comfortable relationship with a number of data preparation vendors, most notably Alteryx. But that hasn’t stopped the popular data visualization vendor from developing its own self-service data preparation tool, known as Project Maestro, set to be released before the end of the year. So what does that mean for Tableau’s data prep partnerships?
We explore that question in this edition of the Talking Data podcast. We look behind the news to think about how it could ripple throughout the world of analytics software. Will customers still look for standalone data preparation software when they can access good-enough functionality in the higher level software they already own? Will the secondary features of data prep software, like reporting and predictive analytics, be enough to entice customers?
For now that all remains up in the air, but we look at the questions in this podcast to see how they may play out. There are still more questions than answers in the still-hot analytics software market, and much remains to shake out.