Posted by: CarlBrooks
big data, Hadoop, HPCC Systems, LexisNexis
Premier data analysis and records search firm LexisNexis has dropped a large rock into the next generation database pond: it’s releasing the core of its own data management and search technology as open source software (dual license: a free community edition and a paid-for pro edition).
HPCC Systems has three major components, according to Armando Escalante, CTO at LexisNexis. The data processing engine (“Thor”) that organizes and stores the data is a massively parallel batch processing server written in C++ that runs on Linux and commodity x86 servers.
“That gives it a big advantage at run time,” said Escalante, comparing it with Hadoop, the open source, Java-based offshoot of Google’s MapReduce platform. Escalante claims Thor is four times faster than Hadoop when running certain queries. It functions much as Hadoop does; it’s built on a distributed file system that requires several nodes, and it runs queries as many parallel jobs as possible.
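To make the "as many parallel jobs as possible" idea concrete, here is a minimal sketch in Python. It is not Thor or Hadoop code; the function names (`partition`, `count_matches`, `parallel_count`) and the sample records are made up for illustration, and threads on one machine stand in for separate cluster nodes. The pattern is the same, though: slice the data, run the same job on every slice at once, then combine the partial answers.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_nodes):
    """Split records into num_nodes slices, one per (hypothetical) node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def count_matches(chunk, term):
    """The per-node job: count records in this slice that contain the term."""
    return sum(1 for record in chunk if term in record.lower())


def parallel_count(records, term, num_nodes=4):
    """Run the same job on every slice in parallel and add up the results."""
    chunks = partition(records, num_nodes)
    # Threads stand in for cluster nodes here; a real system would ship
    # each job to the machine that already holds that slice of the data.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = pool.map(count_matches, chunks, [term] * num_nodes)
        return sum(partials)


# Made-up sample records, for illustration only.
records = [
    "Court filing: Smith v. Jones",
    "Property record, Dayton OH",
    "News article on court reform",
    "Patent application 2011",
    "Court docket, second circuit",
]

total = parallel_count(records, "court", num_nodes=3)
print(total)
```

Adding nodes shrinks each slice, which is why this kind of design scales by throwing more commodity boxes at the problem rather than buying a bigger one.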
“Roxie” is the data delivery engine; Escalante says it, like Thor, is a clustered architecture running on Linux, built for delivering transactions. Point your front end at it and connect with SOAP or JSON to interact. The third element is the interface language used to control these engines, called ECL.
Escalante says that there is no practical limit to how many nodes or how much data these tools can scale to, which makes sense; this kind of architecture is familiar territory for grid and HPC users doing massive data processing jobs. This is not for processing math problems and crunching datasets, however; it is for long-term storage of, and access to, a dynamic pool of unstructured data in very large amounts, amounts that would have seemed utterly ludicrous when LexisNexis began building out this platform a decade ago.
“Ten years ago we were doing big data. 18 terabytes online serving to our customers,” said Escalante. “Ten years ago that was big data. Not so much now.” Today firms like Google, LexisNexis and Microsoft casually talk about storing petabytes of data and the need to sort through it all. Large enterprises are also sitting on exponentially expanding stores of business data even if they aren’t in the information game or the online ad business.
Escalante says LexisNexis’ motivation for this was twofold: one was the desire to get free innovation from the scientific and database communities that need operations on this scale, and the other was to tap into the trend toward data management at very large scale and with unstructured data stores. “Three years ago we started going to the Hadoop conferences and said finally someone’s talking about this, and we’ve seen the growth [and] we believe our software is superior,” he said.
That’s not an absurd claim. While not the household word that Google is, LexisNexis is the leading research and data location service in the world, and it stores and searches truly vast empires of publications and data, including sources that are deliberately not searchable by Google and data stores that Google doesn’t bother to make available. LexisNexis is a serious research tool with serious performance, and it charges a pretty penny for the privilege too. Google makes its search results available for free because it sells ads around them, although both firms derive their value from correctly linking disparate kinds of data together.
Google is also famously secretive; whatever it’s using for MapReduce is years ahead of Hadoop. LexisNexis, also justly famous for tight lips, claims that the open source HPCC will be developed just as its internal platform develops. The pro edition gets you support plus the other management tools LexisNexis has developed around HPCC. Escalante said the original driver of HPCC was to get out from under the thumb of Oracle and other “shrink-wrap” software vendors, since the amount of money LexisNexis would have to pay to run its business on Oracle would probably buy Larry Ellison another couple of yachts. He said traditional relational databases could certainly get the job done for big data but the back pocket pain was extreme.
“You can buy 100, 200 big Oracle systems and maybe do it but it’ll cost you a fortune,” he said. Now LexisNexis thinks enterprises will look seriously at HPCC as an alternative to more Oracle in their data centers, although Escalante admits it’s going to be a tough sell, since enterprises are always interested in stuff that works and never interested in being someone’s science project.
Maybe having a legitimate information management firm backing HPCC Systems will make it easier to get in the door; maybe not. MySQL, another free database, got some entry to the enterprise when backed by a commercial firm, but MySQL mostly took off on the web, where there was a hole to be filled. IBM and Oracle didn’t exactly go down in flames because another free database showed up; they’re probably not quaking in their boots now, either.
This also means LexisNexis has decided that their infrastructure technology has minimal value to them as a trade secret (and they aren’t shy when it comes to revenue grabs, believe me) and more value to them as a service business, which is in its own way an interesting reflection on the cloud computing trends of today. It will also, of course, run on Amazon Web Services, and Escalante said there are plans to run HPCC as an online, pay-as-you-go data processing service at some point.
Will it be a “Hadoop killer”? Probably not; open source doesn’t work that way. Will it turn you into Google overnight? Probably not, but it’s nice to see another legit contender join the fray, and the possibilities are only positive for anyone dealing with large amounts of data and with a bent towards experimenting.
A DNA sequencer can generate a terabyte of raw data a day. Currently there’s no good way to deal with that data except to crunch it, look at the results and put it away. What if you could keep that data alive and do as many searches in as many ways as you like on it, on the same server hardware you’ve already got?
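For a sense of scale, here is a back-of-the-envelope sketch in Python using only the two figures quoted in this post: a terabyte a day from one sequencer, and the 18 terabytes Escalante said LexisNexis served online a decade ago. The constant and function names are made up for illustration, and a petabyte is assumed to be 1,000 terabytes (decimal units).

```python
import math

TB_PER_DAY = 1          # raw output of one sequencer, per the figure above
LEXISNEXIS_OLD_TB = 18  # what Escalante called "big data" ten years ago
TB_PER_PB = 1000        # assumption: decimal units, 1 PB = 1,000 TB


def accumulated_tb(days, tb_per_day=TB_PER_DAY):
    """Total raw data kept 'alive' after the given number of days."""
    return days * tb_per_day


def days_to_reach(target_tb, tb_per_day=TB_PER_DAY):
    """Whole days needed for one sequencer to produce target_tb of data."""
    return math.ceil(target_tb / tb_per_day)


one_year_tb = accumulated_tb(365)
print(one_year_tb)                       # terabytes after one year
print(days_to_reach(LEXISNEXIS_OLD_TB))  # days to match the old 18 TB store
print(days_to_reach(TB_PER_PB))          # days to reach a petabyte
```

At that rate a single instrument matches the old 18 TB store in under three weeks and reaches a petabyte in under three years, which is why "put it away" has been the default.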