This Hadoop tutorial provides a thorough introduction to Hadoop. It covers what Hadoop is, why it is needed, why it is so popular, the Hadoop architecture, data flow, Hadoop daemons, the different flavours, and an introduction to Hadoop components such as HDFS, MapReduce, and YARN.
Hadoop is an open source tool from the Apache Software Foundation (ASF). Being open source means that it is freely available and that its source code can be changed: if some functionality does not meet your requirements, you can modify it to suit your needs. Most of the Hadoop code has been written by Yahoo, IBM, Facebook, and Cloudera.
It provides an efficient framework for running jobs across the nodes of a cluster, where a cluster is a group of systems connected via a LAN. Because it works on multiple machines simultaneously, Hadoop processes data in parallel.
It was inspired by Google, which published papers describing the technologies it uses, such as the MapReduce programming model and its file system (GFS). Hadoop was originally written for the Nutch search engine project, which Doug Cutting and his team were working on, but it very soon became a top-level project because of its huge popularity.
Hadoop is an open source framework written in Java, but this does not mean you can code only in Java. You can also write jobs in C, C++, Perl, Python, Ruby, and other languages. Java is nevertheless recommended, because it gives you lower-level control over the code.
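One common way to write jobs in a language other than Java is Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output. The following is a minimal sketch of a word count in that style. The function names `map_lines` and `reduce_lines` are illustrative assumptions, and the functions take any iterable of lines so they can be tried without a cluster; in a real streaming script they would be fed from `sys.stdin`.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_records):
    """Reducer: sum the counts for each word.

    Assumes the records arrive sorted by key, which is what the
    framework's sort step provides between the map and reduce phases.
    """
    pairs = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

Locally, the sort step can be imitated with `sorted(...)` between the two calls, for example `reduce_lines(sorted(map_lines(lines)))`.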
It efficiently processes large volumes of data on a cluster of commodity hardware. Commodity hardware means low-end, inexpensive machines, which makes Hadoop very economical.
Hadoop can be set up on a single machine (pseudo-distributed mode), but its real power comes with a cluster of machines. A cluster can be scaled to thousands of nodes on the fly, that is, without any downtime: no system needs to be taken down in order to add more systems to the cluster.
Hadoop consists of three key parts: the Hadoop Distributed File System (HDFS), MapReduce, and YARN. HDFS is the storage layer, MapReduce is the processing layer, and YARN is the resource management layer.
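To give a feel for what the processing layer does, here is a small sketch of the map, shuffle, and reduce steps run inside a single Python process. It is only an illustration of the programming model, not Hadoop's API: the storage and resource management layers are not modelled, and the `city,temperature` record format and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records):
    """Map: turn each 'city,temperature' record into a (key, value) pair."""
    for record in records:
        city, temp = record.split(",")
        yield city.strip(), float(temp)


def shuffle_phase(pairs):
    """Shuffle: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into one result (here, the maximum)."""
    return {key: max(values) for key, values in grouped.items()}


def run_job(records):
    """Chain the three phases together for a small in-memory input."""
    return reduce_phase(shuffle_phase(map_phase(records)))
```

On a real cluster the map and reduce steps would run on many nodes at once, with the framework moving the grouped data between them.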
Let us now understand why Hadoop is so popular and why it has reportedly captured more than 90% of the big data market.
Hadoop is not only a storage system but a platform for both data storage and processing. It is scalable (more nodes can be added on the fly), fault tolerant (even if nodes go down, the data can be processed by another node), and open source (the source code can be modified if required).
Looking back at the DevOps Enterprise Summit (DOES) in San Francisco, there was a wealth of speakers representing a wide range of organizations, from vendors and enterprise users to subject matter experts. The varied panel of guests spoke about how DOES has evolved over the past few years, offered industry and technical insights into how DevOps is intersecting with the enterprise, and revealed what’s on the cutting edge of this concept. Here are some tidbits from four of the popular speakers at the conference.
Cloud and DevOps march forward together
Trace3 Principal Technologist George Kobari pointed out the rather obvious reason why DOES is becoming ever more popular. “A lot of enterprises today are realizing they must be in the DevOps Space or their businesses will not survive.” They are making this decision in concert with some other big changes. “From a technology standpoint, a lot of them are just getting to the point of using cloud. That’s relevant for DevOps because you have to deploy to a cloud. That’s the foundational layer they are sitting on. To capture growth, it’s necessary to take advantage of infrastructure on demand.”
Is DevOps impossible without the cloud? Not everyone agrees. Some would say that it is indeed feasible to implement DevOps on premises using many of the same tools, such as Puppet and Chef. But there’s certainly agreement that cloud enables the process in a way that’s difficult to achieve otherwise. For enterprises that often have a mix of on-premises and cloud resources, the goal should be to implement DevOps principles across the organization, leveraging the additional benefits of cloud where possible.
The database is the new application
Robert Reeves, CTO at Datical, brought up an interesting point about where DevOps stands to make the greatest strides in the next few years. “The application is the first place to implement DevOps since it involves the most people and gets the most attention. But once you automate that and bring DevOps to it and are moving the entire code from Dev to Test to Production, then you look for the next thing.”
According to Robert, that next thing is the database. “The database does become the bottleneck once you have brought DevOps to the application.” Ideally, it should be possible to bring automation and efficiency to the database using similar principles. However, applications don’t have state to worry about. With continuous deployment to an app server, it is fine to simply blow away the old version or roll back to a previous version as needed. It doesn’t matter so much what the app did yesterday, it matters that it is doing the job right now.
This approach isn’t possible with a database, since the consistency and accuracy of the data itself over time is critical. Datical aims to provide better tools for DB DevOps. These include a forecast feature that allows developers to preview a change without actually making it, a rules engine that enforces standards such as naming conventions automatically, without anyone having to watch over it, and a deployment packager.
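As a rough illustration of what automated enforcement of a naming convention could look like, the sketch below scans a database change script for `CREATE TABLE` statements and reports table names that break a rule. It is not Datical's product or API: the `tbl_` prefix rule, the regular expressions, and the function name `check_table_names` are all assumptions made up for this example.

```python
import re

# Hypothetical convention: lowercase snake_case table names with a "tbl_" prefix.
TABLE_NAME_RULE = re.compile(r"^tbl_[a-z][a-z0-9_]*$")

# Captures the table name that follows CREATE TABLE [IF NOT EXISTS].
CREATE_TABLE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([A-Za-z_]\w*)",
    re.IGNORECASE,
)


def check_table_names(sql_script):
    """Return the table names in a change script that break the naming rule."""
    names = CREATE_TABLE.findall(sql_script)
    return [name for name in names if not TABLE_NAME_RULE.match(name)]
```

A check like this could run in the pipeline before a change is deployed, so that a non-conforming script is rejected without a person having to review it by hand.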
Tooling for DevOps
Electric Cloud CEO Steve Brodie spoke about the increased interest of large enterprises in the latest approaches to development and deployment. “If you look at the enterprise, they have some legacy apps that are still monoliths and some things they are starting to do with microservices only—and hybrids that they are refactoring with some traditional architecture paired with microservices and containers.” They need plenty of flexibility in tooling to accomplish everything on this continuum.
To enable DevOps in this space, Electric Cloud seeks to model containers as first class citizens and orchestrate them through the pipeline on their own or with other components. Adding an abstraction layer also allows enterprises to deploy to Kubernetes, Amazon, or Docker Swarm with equal ease. Just as with other aspects of infrastructure, allowing Dev and Ops to focus solely on the app without worrying too much about configuration helps streamline DevOps for the enterprise.
Additional industries are showing interest in DevOps
Electric Cloud Author Chris Fulton mentioned financial services as one example of a vertical that is showing increased interest in DevOps. Requests for consultations from these prospective clients are leading to some interesting discussions. The scope of the conversation has to range far beyond software and into very specific business processes. “We haven’t really thought a lot before about how DevOps works with processes. When you’ve got all these legacy processes that you follow along with a bunch of government restrictions, how do you do DevOps in that environment?”
The speed of DevOps may never be as lightning fast in FinServ as it is in other, less regulated industries. But because the underlying principles and tooling promote better quality of code, easier rules enforcement, consistency in processes, and more visibility into what’s going on with code, it may well end up being an excellent match. In fact, next year’s summit may include some interesting stories and case studies from an even wider range of clients in unexpected industries.
Why do DevOps initiatives sometimes fail, and how can they be more successful? Gene Kim, author of The Phoenix Project, admitted that most of the stories that get told and retold about DevOps transitions are glowing successes. This survivor bias means that the fiascos don’t get all the attention they deserve. Glossing over the disasters makes it hard to assess the overall state of DevOps. Some degree of failure is actually very common on the road to DevOps. Kim admitted, “That may actually be a more telling indicator of the movement than the ones that actually survived.” Of course, there have been enough problems for experts to have a good idea of what typically goes wrong.
What makes DevOps implode?
Scott Wilson, Product Marketing Director of Release Automation, touched on why and how DevOps initiatives fail during his presentation. Despite the idealistic notion of Dev and Ops traipsing through a meadow hand in hand in a united utopia, the reality is quite different. The primary reason DevOps fails is that Ops gets left out. Dev is very Agile and has all the cool tools and processes. But they are still throwing code over the wall for deployment. According to Wilson, “We need to focus on making Ops more Agile.” Failing to respect and invest in the role of Ops is a fatal error. When Dev gets all the attention, “You reinforce the wall, especially if Ops is using different deployment tooling or automation mechanics.”
Open source addiction is also to blame
Another key reason that DevOps can fail in a typical enterprise is an enthrallment with open source. It’s easy to become dependent on open source in the DevOps world because it is such an integral part of the overall culture. But the DIY effort and security failings that come along with open source don’t always make it a good fit for every business model, even in an era where “every company is a software company.”
In practical terms, “If you are an insurance company, you generate revenue by selling insurance policies. Do you really want to install a bunch of open source software that you have to maintain, that you have to write the glue for and do the API plugwork and then make certain to update to the latest libraries to shield against vulnerabilities?” Scott advocated a balance of open source and vendor supplied code to relieve some of this unnecessary burden on internal DevOps teams.
Predictors of success in DevOps
Although getting teams to “buy in” and support this type of transition is important, popular enthusiasm is clearly not sufficient to effect change at the enterprise level. Wilson pointed to Heather Mickman’s account of transitioning to DevOps at Target. “The real progress was when the new CIO came in.” This senior executive had the clout and vision to roll out DevOps across the entire organization. This seems to be a typical story.
The IT Skeptic, Rob England, agreed. As a consultant, he has noticed that it usually takes some kind of moment of reckoning for upper management to step in and claim ownership of change. Then, Rob recommended pointing to the DevOps efforts that have been happening in small teams at the grassroots level as an example of how to do things better on the big stage. “You can use those quick wins to drive change.” For the enterprise, DevOps may start at the bottom, but it gets its staying power only when it gains support from the top. When an enterprise fully commits as an organization, that’s when things really start to work.
On 8 November 2016, when the honourable Prime Minister of India, Narendra Modi, began his first ever televised address to the nation, people were very curious to know what it was about. Then came a shock: the PM announced that, from that same day, the Rs. 500 and Rs. 1000 currency notes would be discontinued in order to track black marketers and the black money they hold. He also announced that anyone holding money in these denominations could exchange the notes or deposit them in their accounts, subject to some limits that were declared at the same time.
Now the question arises: how is the government or the Income Tax (IT) department going to track the black money that has been deposited, and how will it separate black money holders from genuine taxpayers? With a population of more than 1.25 billion and hundreds of millions of bank accounts, how the IT department will find discrepancies is a big question. Like the software industry, the Income Tax department is going to use the latest and hottest technology: Big Data.
If you ask Damon Edwards, Founder of SimplifyOps, and IT Skeptic Rob England, there’s a dirty little secret in the IT industry: a very high percentage of IT professionals hate their jobs. In fact, this is pretty well known in the technology field.
A woeful history of personal, professional, and organizational suffering
According to Edwards, “Life is not good for everyone in the IT industry.” Yet people continue working in this sector for a couple of reasons. First, it often pays better than whatever they would otherwise be doing for a career. Or, they are in it because they love the technology and want to be able to tinker with it. If getting a steady paycheck for doing intriguing work was the simple reality, IT professionals would, by and large, be a happy lot. “Unfortunately, it’s everything else that’s layered on top of that which makes things miserable.” IT professionals are overworked, given insufficient resources, and expected to keep up an insane level of context switching. They tough it out year after year, hoping to put away enough money for the kids’ college funds and maybe a decent retirement. In the meantime, they suffer from burnout, resentment, frustration, and fatigue.
While the impact on a personal level is troubling, an entire organization suffers from loss of productivity, effectiveness, and innovation when IT workers are stressed. Fortunately, some businesses are starting to take this matter seriously. Rob said, “I’ve noticed in a number of organizations, one of the KPIs for IT is sustainability. They’re not talking ecology, they’re talking about whether they are working in a sustainable way in terms of technical and cultural debt. Can they keep up the pace?” If not, the organization pays the price in competitive advantage, customer satisfaction, revenue, and market share. That’s the stark business reality.
Focusing on people comes first
Both DevOps speakers agreed that, as corny as it might sound, people are still the number one resource in any company. Developing people, creating an environment where they want to work, and retaining them over the long term is essential for success. England pointed out that, in the move to the information age, many businesses have kept a manufacturing mindset from the industrial era, seeing people as clerical cogs. “When you move to a knowledge worker model, you have to respect and empower them.” IT resources may be replaceable, but the cost of having to start over due to poor retention is far too high.
Of course, there are lessons IT can learn from modern manufacturing when it comes to best practices. Edwards noted that at a company like Toyota, executives are almost never on the list of most successful people in their field, yet this business is one of the most efficient and effective in its industry. Instead of giving glory to those at the top, the focus is on excellence throughout the organization, helping people and systems flourish. IT organizations could take a lesson from this model, determining whether executives are operating in a servant-leader capacity or simply as slash-and-burn artists in search of short-term wins that they can add to their resume before heading off to the next cushy job.
Improved processes are the pathway to a better professional experience
Above all, improvement in IT culture requires a clear understanding of the application lifecycle management (ALM) process and how things get done. Again, a manufacturing model can be handy, in Edwards’ view. “When you make the work visible, you can turn it into supply chain management. You don’t have to understand what goes on inside the boxes, if you can just see how things go from idea to cash, how long it takes, and how painful it is.”
One doesn’t even need to be a technology specialist to see ways to improve the ALM process. They just need to be able to visualize the workflow and make intelligent adjustments. Damon mentioned several choices that can make the system more tolerable for IT professionals: working in small batches, building slack into the system, and avoiding overloading people or any piece of the system.
DevOps provides a peek at a brighter ALM process
What kind of impetus does it take to shift culture for the better? Rob mentioned crises as frequent instigators of change. When upper tier executives finally realize that there is a serious problem and step in to make changes, it’s good to be able to point to at least some small pockets of a company where things are being done in a better way. “That’s the time to pull out DevOps.”
The grassroots movement taking shape in small teams can be used to demonstrate that the method works. “You can use those quick wins to drive change.” DevOps teams that are pioneering this new way of working should take heart, even if they are the minority in their organization. They may well be leading the way to a more humane IT culture once enough pressure builds up to force a full system overhaul.
Tools, culture, and other popular topics were the focus of much attention at the DevOps Enterprise Summit this year. Yet security was still the undercurrent of concern running just below the surface. Fortunately, a number of speakers addressed this issue and offered insights and best practices for large scale organizations that want to mobilize DevOps for teams without losing sight of security and risk management objectives.
One massive organization takes two big leaps—securely
Phil Lerner, Senior IT Executive of UnitedHealth Group’s Optum arm, offered a unique perspective on continuous security monitoring from down in the trenches. UHG recently made the decision to adopt cloud and DevOps simultaneously, a bold move that made sense because of the synergy between the platform and the methodology. As part of a highly regulated, compliance conscious industry, the organization put security first in both initiatives.
“We’re bringing good, solid security infrastructure practices into the cloud and surrounding the pipeline with as few tools as possible to make management easy. That brings security to the forefront where it’s transparent for DevOps folks. But we’re constantly looking for risks, determining what the threat levels are, logging, and monitoring. We have gates we’ve built between zones and really took a network security approach to what surrounds the pipeline.”
In Lerner’s view, the tendency to think about DevOps as a set of tools is not necessarily the best approach. Instead of trying to completely retool the enterprise, the UHG approach focuses on optimizing processes and adding specific capabilities as needed. “To me, it’s more about the culture and using the tools we know in the enterprise and leveraging them end to end. We know how to manage them very well. We innovate around them and push our vendors to build APIs to do things we would like to do to innovate in the virtual security space.” With a staff of about a thousand IT security specialists in a team of about ten thousand total IT professionals at UHG, it certainly makes sense to use DevOps with the tools that Dev, Ops, and Sec already know.
Some standards persist, but fresh challenges have appeared
Akamai’s Director of Engineering Matthew Barr alluded to some typical best practices that organizations of all sizes should adhere to. Architecting applications to prevent unauthorized access is a no-brainer. “We don’t send a password to anything that is not one of the active directory servers. You don’t want to use LDAP on the application side, because then you have to worry about having credentials that might be reused.” He spoke further about Atlassian’s newest options for SSO and how they enable greater security for the enterprise across the application stack.
But with the increasing popularity of virtual and remote teams across the enterprise, there are new concerns that sometimes fly under the radar. “Some people may not realize, when you look at the Git logs, you see the committer username and email which are actually set on the laptop. You can change that any time you like. The server doesn’t authenticate that information. Without using GPG keys to sign your commits, there’s no proof who actually wrote something.” This represents a change from svn or Perforce where it would be reasonably accurate to assume that the person committing the code is, indeed, the committer listed. Matthew painted a scenario in which a backdoor might be discovered in code. When security goes looking for the culprit, they will find a name in the Git repository—but they have no way to determine if that was actually the person who inserted the malicious code. It would be far too easy to set up a patsy to take the fall. This is just one of the ways risk management is changing as DevOps teams become more distributed.
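One way to act on this is to audit a repository for commits that carry no verifiable signature. Git can print a signature status for each commit through the `%G?` placeholder of `git log --format`, where `G` means a good signature and `N` means no signature. The sketch below parses output of that form and lists the commits that should not be trusted on the strength of the author name alone. The function name, the sample hashes and authors in the usage example, and the policy of trusting only status `G` are assumptions for illustration; a team might reasonably choose a different policy.

```python
# Policy assumption: only a good, valid signature counts as proof of authorship.
TRUSTED_STATUSES = {"G"}


def unverified_commits(log_output):
    """Return (hash, author) pairs for commits lacking a good signature.

    Expects one commit per line in the form '<hash> <status> <author>',
    as produced by: git log --format='%H %G? %an <%ae>'
    """
    flagged = []
    for line in log_output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        commit_hash, status, author = line.split(" ", 2)
        if status not in TRUSTED_STATUSES:
            flagged.append((commit_hash, author))
    return flagged
```

For example, given the made-up log lines `a1b2c3 G Alice <alice@example.com>` and `d4e5f6 N Bob <bob@example.com>`, only the second commit would be flagged, since nothing proves that Bob actually wrote it.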
Open source continues to pose problems for enterprise security
The Heartbleed incident will likely go down in history as one of the greatest open source debacles of all time. This massive security hole in the OpenSSL cryptographic software library went unnoticed for a couple of years, putting the lie to the idea that having many eyes on open source effectively alleviates the risk of serious vulnerabilities. This is one reason that Automic Product Marketing Director for Release Automation, Scott Wilson, argues that enterprises should not overuse open source.
“You have to ask yourself what you are really in business to do.” For most companies outside the technology space, from banks to healthcare, transportation, and insurance, the goal is not to create software. It is to generate revenue by selling other products and services. Open source should only be used insofar as it enables that objective. This decision entails weighing the risk of undetected vulnerabilities as well as all the ongoing maintenance and customization that open source brings along with it.
What’s the solution? According to Wilson, in many cases it’s good to bring on third party vendors to balance things out. These vendors are devoted full-time to maintaining and tweaking software for their clients, providing support on a continual basis. It’s simple, “They make money supporting you.” And they may be able to do it more cost effectively than taking a DIY approach. Even though it might be true that ‘every company is a software company’, not every company needs to do it all in-house. It takes internal teams, the open source community, and the vendor ecosystem working together for a more secure enterprise IT. Perhaps DOES itself will one day morph into the DevSecOps Enterprise Summit to take things one step farther.