SOA Talk


September 9, 2016  6:42 PM

Does Google’s new acquisition mean APIs got cooler?

Fred Churchville
API, API management, api stack, Apigee, Google, Google API

What should we make of Google’s acquisition of API management solution provider Apigee? Is it the forging of a game-changing alliance? Is it a mission of mercy to save a company that, according to Suraj Kumar of Apigee’s competitor Axway, has struggled to gain traction in the market? Or is it simply an attempt by the technology behemoth to gobble up market share as fast as possible?

Whatever it is, it says something about business attitudes towards APIs. While not a huge transaction, it was a big one for Google, and it raises the question of why the company was so eager to get its hands on Apigee. In this space, Google is lagging behind competitors like IBM, Oracle and AWS, but someone must have decided that it had no choice but to get in on the API game for a $625 million buy-in.

Either way, it is refreshing that the value of APIs is finally garnering attention. Too many companies have doled out money for things like EDI VAN services when a simple API is capable of doing the same thing. But it seems that the promising abilities APIs have demonstrated in connecting mobile apps, especially in the global communication space, are what have caused businesses to turn their heads.

It will be interesting to see where this partnership goes, as well as whether the prediction by Forrester analysts that the market for API management solutions will quadruple by 2020 comes true. In the meantime, maybe this can serve as a reminder to take a look at your own API management strategy.

What are your thoughts on Google’s acquisition of Apigee? Let us know with your comments.

Fred Churchville is a writer and editor for the TechTarget Application Development group. You can reach him for questions or comment at fchurchville@techtarget.com or on Twitter @TechTargetFred.

September 1, 2016  3:38 PM

Things that bug me: The Delta and Southwest crashes

Fred Churchville
Airline, Crash, Server crashes

These days, passengers have to be worried about more than one kind of crash when they travel. Over the past two months we’ve seen two major airlines, Delta and Southwest, experience huge computer system crashes that resulted in flight delays, frustrated passengers and revenue losses totaling in the billions.

The two computer crashes had unique origins — one was a power problem and the other was a router problem. However, it seems to me like Southwest and Delta are both guilty of the same thing: they weren’t ready.

While some were blaming local power utility companies for the Delta computer crash that left tens of thousands of customers stranded earlier this month, the airline’s representatives ultimately admitted that a power control module failure cut off power to the main computer network. And while there were backup systems in place, some critical backup systems didn’t kick in, resulting in halting instability.

Some experts speculated that this was the result of years of acquisitions complicating computing systems by creating patchwork systems that may or may not integrate with each other smoothly. However, this has been debunked as Delta spokeswoman Susan Hayes confirmed that Delta has always relied on a single computing system. No, this is just a question of unpreparedness.

There is an added problem: A lot of Delta’s computer systems are just old. Antiquated technology just can’t perform at the same rate as newer software and hardware, which could be why some of those systems didn’t start back up — they were just too old for the backup system to handle.

Now let’s talk about Southwest. According to the company, the crash was the result of a router failure. Did the router have a backup system in place? Yes. However, the router only experienced a “partial failure,” meaning that the backup system was not alerted to start up.

The first thing that bothers me about this is the explanation. I understand that the backup system doesn’t get alerted unless the router experiences a complete failure. But why? Why isn’t there a backup router that can be used in case anything goes wrong with the router? Wouldn’t that make sense?

The other thing that bothers me was the reaction of Southwest CEO Gary Kelly, who equated the company’s delay-causing router failure to a “once-in-a-thousand-year flood,” in that it was a partial failure of the router, something he said they could never have seen coming. Kelly also said that the partial failure “isn’t a drill you can run.” I don’t understand why this is a drill that you can’t run. If you can test a complete shutdown of the router, why can’t you test what happens when only part of it fails? Also, isn’t the point of a backup system or plan to be ready for the unexpected? I doubt they would regard a critical system failure in an airplane with the same sort of “well, it happens” attitude.
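To make the question concrete, here is a minimal Python sketch of a failover check that treats a partial failure the same way as a complete one, plus a helper that fabricates a "partial failure" history for a drill. Nothing here reflects Southwest's actual systems: the `RouterHealth` fields, the thresholds and the function names are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class RouterHealth:
    """One health sample for a router (all fields are hypothetical)."""
    reachable: bool      # did the router answer the probe at all?
    packet_loss: float   # fraction of probe packets lost, 0.0 to 1.0


def should_fail_over(samples, loss_threshold=0.2, bad_samples_needed=3):
    """Return True when the latest samples show a complete OR partial failure."""
    recent = samples[-bad_samples_needed:]
    if len(recent) < bad_samples_needed:
        return False  # not enough evidence yet
    return all(
        (not s.reachable) or s.packet_loss >= loss_threshold
        for s in recent
    )


def simulate_partial_failure(healthy=5, degraded=3, loss=0.4):
    """Build a sample history for a drill: healthy traffic, then degraded traffic."""
    return ([RouterHealth(True, 0.0)] * healthy
            + [RouterHealth(True, loss)] * degraded)
```

Feeding `simulate_partial_failure()` into `should_fail_over` is the drill: the router never goes fully dark, yet the check still asks for the backup. A real system would need far more care (flapping, false positives, split-brain), but the point stands that a degraded state can be simulated and tested like any other.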

Could both of these airlines have taken steps to prevent this from happening? It’s possible that if they had run a complete test of their backup systems, they would have known ahead of time which ones would start up correctly and which ones would cause trouble. However, I’m not unsympathetic to the plight of those tasked with running an airline’s computer systems. The problem with the airline industry is that it never stops. It doesn’t seem like IT personnel get the luxury of “off hours” when they can perform elaborate security and backup checks on their systems. People have to fly all day, every day.

But doesn’t something have to be done here? Computer systems are getting more and more complicated. Over three billion people fly per year. Can we just expect things to stay the same and for more computer crashes to occur? Will these airlines pony up the cash to replace legacy systems? Will they do that at the cost of forgoing upgrades to their planes? I don’t know about you, but I’d prefer a computer crash to the traditional kind when it comes to flying.

Software engineers, developers, testers and other experts: What should Southwest and Delta have done? What should they do now? I’m interested in your thoughts, so let me know with your comments or via email.

Fred Churchville is a writer and editor for the TechTarget Application Development group. You can reach him for comment at fchurchville@techtarget.com or on Twitter @TechTargetFred.


July 29, 2016  8:24 PM

What can we expect from AnDevCon Boston 2016?

Fred Churchville
Android, Android mobile, Mobile Application Development

What will I learn? Will it be fun? Should I leave my iPhone at home?

These are the things we want to know about AnDevCon Boston 2016. So we spoke with Katie Flash, director of conference programs at BZ Media (the coordinators of the conference) about what this year’s AnDevCon will be like.

The event, which has been running since 2011 and will be hosted at the Boston Sheraton, is designed to help developers working with Android learn about important software and tools, acquire new skills and get valuable hands-on experience. According to Flash, the organizers hope that this conference will “continue the education of the Android development community” and help reinforce its status as a credible source of learning and information about the Android platform.

“Not only are [attendees] learning high-level skills, there’s a lot of hands-on content,” said Flash, adding that developers will have access to code that they can bring back to the office and gain “actionable insights that they can use immediately.”

Flash said that another key feature of AnDevCon is its focus on providing info about new developments in popular software and integration tools. This year, she said there is a heavy focus on Firebase, a backend as a service that was acquired by Google and is making efforts to become a unified app platform for Android, iOS and mobile web development.

To provide unique insight into Firebase and other Google-based tools, Flash said that for the first time AnDevCon is featuring the “Google Learning Zone.” This is a dedicated space where developers and representatives from Google will be available to share information about and provide demonstrations of their projects and products. Developers will also get the chance to “roll up their sleeves” and try out the tools for themselves, she said.

Flash said that this is unique because Google personnel have been perceived as distant from their customer base in the past. The Google Learning Zone gives those customers and potential customers a chance to connect with the company and get answers about products, such as Firebase, straight from the source. This includes hosting “Office Hours” all day Tuesday, where attendees can meet one-on-one with Googlers to ask their most burning questions about development.

“For a long time, they weren’t really accessible,” she said. “Now people can come down and talk to the people that are actually developing [the products].”

Flash said that enterprise developers will definitely benefit from this conference in addition to those that develop independently or for small companies.

To cater to enterprise-based attendees, Flash said that they are offering sessions such as a two-part series on Tuesday titled “Enterprise Mobility Management with Android for Work.” Flash said there will be numerous enterprise-centric topics covered in other sessions, particularly scaling at the enterprise level and security. Of course, there will be plenty of sessions geared towards smaller-scale developers as well.

“Our attendees really run the gamut of developers,” Flash said. “There really is something for everyone.”

Flash also said there will be a number of special events running in addition to the sessions that developers and other attendees should watch out for, including two hack-a-thons: one centered around Firebase and another hosted by the technology company Zebra which features its TC8000 enterprise mobile touch computer.

Another cornerstone event that Flash said is a staple at AnDevCon is the Women in Android luncheon, led this year by members of Girl Develop It Boston, a nonprofit organization committed to providing development training for women. “It’s a really great platform for these women to come together and share ideas,” she said.

Finally, Flash highlighted the “Fireside Chats (without the fire).” This is an after-hours event where Android experts and “Googlers” will lead a casual conversation about Firebase, but she said it is also a “time for people to hang out and be laid back.” Winners of the hack-a-thons will also be announced at this event.

Stay tuned for more coverage of AnDevCon Boston 2016.


July 27, 2016  4:01 PM

At the intersection of Legos, Agile and SOA

Valerie Silverthorne

So your Agile teams aren’t working as efficiently as you’d like. Maybe you have specialists spread too thinly through your teams, or your product is just too big and broad for a traditional organization. It’s SOA and Legos to the rescue. Really.

At an Agile 2016 seminar in Atlanta this week, speakers Catherine Louis and Raj Mudhar presented their ideas to make Agile team reorganization easy and visual. The concept — “Whole Team Dynamic Organizational Modeling” — is a bit of a mouthful but it’s based on service oriented architecture principles. And it uses Legos, paper, Sharpies and your imagination.

Louis and Mudhar asked each table to decide on a problem, then create a key where each color Lego represented a part of the team. At our table, white Legos were DevOps folks, purple were DBAs, etc. Then the idea is to simply create a Lego team and see how it works by doing some creative role playing. Using the Sharpie you can actually draw how communication and requests happen between teams and you’ll almost instantly see why something works, or doesn’t. In some teams a single team member may wear several hats (scrum master or acting product owner) and it’s easy to see that because one Lego is covered by another.

The service oriented architecture comes in at this point because for most companies, a single “team” really can’t have every skill needed. So by creating “pools” of scarce talent (again using Legos), you can visualize how those pools can be used by the teams to improve the flow.

The participants at our table agreed that this low-key and fun strategy might be a good way to broach the subject of reorganization without putting stress or pressure on anyone. If you’d like to learn more about using SOA (and Legos) to reorganize your team, Louis and Mudhar have a website with resources.


July 22, 2016  4:41 PM

Can enterprise app developers learn from Pokémon Go?

Fred Churchville
Mobile Application Development, Mobile applications

If Pokémon Go were the bubonic plague, we’d all be dead. Even if you’re not playing yourself, you’re probably in a room with someone who is. At the time that this post is being written, current estimates float somewhere around 9.5 million daily active users.

Some enterprise mobile app developers must look at this phenomenon and think “if that many people will pick up Pokémon Go, why can I not get users to adopt my company’s mobile app?” I often hear about unwillingness among peers in my office to download mobile productivity apps because the apps are too complex or unreliable, or because they simply don’t want the app on their phone. But Pokémon Go attracts users so strongly that even my technology-phobic mother, who is the furthest thing from a Pokémon fan, wants to play the game.

And for the purposes of this post, I was forced against my will to become one of those users and play in my backyard.

Niantic is a relatively small company that has managed to create one of the biggest mobile apps of all time. A Huffington Post blog post talked about how Pokémon Go may be fundamentally changing the application development scene. Maybe enterprise mobile app developers can use some of these ideas to increase adoption of their organization’s mobile apps.

I’ll be the first to admit that all of these ideas are pretty idealistic, but maybe it’s worth at least just thinking about.

Learn what makes users happy

Surely one of Pokémon Go’s strongest building blocks is the fact that Pokémon established itself as a cultural craze once before. I’d imagine anyone who says they’ve never at least heard of a Pikachu today is either lying or has been living in forced quarantine for almost two decades.

Of course, a mobile enterprise app is never going to be able to compete with Pokémon’s established pedigree. The point is that the desire for anything Pokémon existed before the app — the app simply exploited it. While everyone is absolutely enamored with the app’s unique GPS and virtual reality features, I’d imagine a lot of those people simply wanted to play Pokémon on their phone.

Ask yourself: What do your users really want? And how can you build an app that scratches that itch? For instance, perhaps people within your company are constantly frustrated because they can’t find certain areas or meeting rooms. Maybe it’s worth creating a GPS-enabled mobile app that can provide real-time walking directions to locations in the office.

The learning curve aspect

As the HuffPost piece pointed out, Pokémon Go nails the learning curve. It’s simple enough up front to get people into it, but offers plenty of complex features for players to look into once they are familiar enough with the app.

There’s a fine line to walk with apps: An app has to be complex enough to meet the business need, but not so complex that it makes users tear their hair out. So, is your organization’s mobile productivity app easy to use from the get-go, or do users have to sit down with IT countless times to get things to work? Maybe it’s worth considering the inclusion of a simple tutorial for the app’s basic functions. Then users can pick up the more complicated functions as they move along.

The social element

The HuffPost piece also talked about how the social aspect of Pokémon Go is drastically impacting its adoption and use. You can join teams, help each other out, look for Pokémon together and more. This turns the use of the app into a larger social experience.

Is there some way this same kind of social aspect could be woven into an enterprise mobile app? Is there any way that the app can encourage collaboration between team members, or reward them for linking up with other users? Can it help facilitate the creation of a team, like Slack does with its groups? Maybe turning the app into something that people can use together can turn it from a technological burden to a blockbuster.

Is this worth looking into? Are my ideas just totally out of the scope of reality? Let me know with your comments.


July 14, 2016  10:35 PM

Is digital transformation software making people sad?

Fred Churchville
Enterprise architecture

A survey conducted by the software company BiZZdesign and The Open Group has revealed that while businesses are eager to jump into a “digital transformation,” they may not necessarily be happy with the software support available to make that transition happen. It also found that business culture is often a major inhibitor to business transformation — a barrier that could potentially be broken by making key changes to processes and cultural mindsets within the organization.

Part of the survey, which also had support from the University of Twente and the Association of Enterprise Architects, focused on the tools that enterprises are using to enable a significant digital transformation. It also sought to determine how happy they were with those tools — a finding that had mixed results, as Peter Matthijssen, senior consultant at BiZZdesign, explained in a webinar about the study and the process of digital transformation.

Responses indicated that about 85% of businesses are using app and solution integration software for business transformations, and about 80% are using enterprise architecture software. About 80% of those using app and solution integration software indicated that they are at least fairly happy with how well it supports their efforts. But when it came to enterprise architecture software, it seems that this number drops to just over 50%. The webinar did not dive further into specific complaints businesses had, so it is uncertain whether this frustration is stemming from problems with the software, a lack of the skills needed to manage the software, costs — or all of the above.

Matthijssen noted the cultural and process barriers that inhibit an organization’s ability to become what he calls an “adaptive enterprise,” one that is able to use technology to make rapid, continuous changes to business processes and adapt to changing market conditions. He said that organizations often let “business as usual” get in the way of making a digital transformation, meaning that the money and time dedicated to everyday tasks and managing legacy systems simply stop any business transformation efforts in their tracks. Matthijssen also said that many organizations experience what he calls a “lack of organizational commitment.”

“I think here there is a large cultural part — where we can’t get the people on board to move in this new direction,” he said.

Matthijssen did offer numerous pieces of advice for organizations looking to improve their ability to transform, many of which revolve around the idea of changing corporate culture. One piece of advice he gave urges businesses to simplify their systems and business policies. He illustrated this with a mythical anecdote about the U.S. spending millions of dollars to create a pen that could write in space when Russian astronauts were able to accomplish the same goal with a pencil.

“A lot of organizations have those very expensive pens when they could do the job with a pencil,” he said.

Another piece of advice Matthijssen shared is the idea of embracing what he calls a “culture of failure,” where organizations accept project failures and shortcomings as part of the transformation and learning process. He stresses that this failure does not have to occur on a large scale, and encourages organizations to “fail on a smaller scale” and use each failure as a learning experience.

Other pieces of advice Matthijssen offered included reducing legacy complexity and “bridging” the gaps that may exist between software and business professionals within an organization. By reducing legacy complexity, organizations may be able to avoid the “innovation squeeze” that occurs when money that organizations want to put into a digital transformation ends up overtaken by legacy costs. And bridging the disciplines within an organization, he said, will allow organizations to pursue digital transformation by bringing ideas together and empowering personnel. He encouraged participants to “use the wisdom of many, not the power of one.”


July 7, 2016  8:39 PM

They want to adopt DevOps, but does anyone actually know what it is?

Fred Churchville
DevOps, DevOps - testing / continuous delivery

A look at tech’s most popular identity crisis

It seems that the term “DevOps” suffers by straying from one of its own guiding principles: an established, single source of truth. A single Google search for “what is DevOps” provides a whirlwind of answers. Yes, they all talk about the notion of having developers and operations teams work together. But a lot of the similarities end pretty much right there.

Right off the bat, I notice all the contradictory things said about DevOps adoption. Voices from EMC will tell you that you don’t have a choice in the matter. At Computing’s DevOps Summit, Finbarr Joy, group CTO at Lebara, said companies that don’t adopt DevOps now are doomed. Then you have DevOps.com itself warning organizations to adopt practices organically instead of hastily implementing a DevOps program. Some developers say implementing DevOps is a great way to kill your developers. Certain programmers who have experienced “DevOps” plainly say that DevOps is bulls**t, complete with examples of failed implementation attempts.

If you do manage to get past the adoption debate, the next step is to make sense of the dizzying array of advice available about DevOps and what the most important element is. Some authors advocate “creating a DevOps culture” by having developers and system admins just become better friends. One CTO at Chef boldly says that the tooling doesn’t matter at all. DevOps Digest, on the other hand, produced a list of 30 must-have DevOps tools. Another CIO said in a blog that the core of DevOps lies within creating an integrated service model, cross-functional teams and a management framework.

Looking at this, it doesn’t surprise me that so many companies are struggling and failing to adopt DevOps. Perhaps the best thing is for developers and operations folks to simply learn the fundamentals of DevOps on their own. Then they can try to pick out the concepts that work best for them, rather than blindly joining the DevOps stampede.

And if DevOps isn’t confusing enough for you, perhaps you should try NoOps on for size.


June 30, 2016  1:53 PM

QCon New York sessions – The F#orce awakens

Fred Churchville

While collecting and working with data is a burning passion for some, for many it is simply another critical business requirement that is only getting more complicated. In this session, data guru Evelina Gabasova showed participants how the F# language can help make the task of working with data a little more manageable.

Gabasova is definitely a member of the passionate group of data workers. She is a postdoctoral researcher at MRC Cancer Unit, and works with a lot of data. But she is just as interested as anyone else in making the task of working with that data easier.

Parsing with active patterns in F#

Gabasova provides an example of parsing with active patterns in F#.

“I study the genome…and that can get very complicated,” she said. Gabasova said that she relies heavily on the use of F# to work with the massive amounts of data she comes across in her research, particularly because of its strong ability to parse scripts through active patterns. Active patterns allow coders to define input data with names that can be used in a pattern matching expression.
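The slide itself isn't reproduced here, so as a rough illustration of the idea (not Gabasova's code, and not F# syntax), here is a Python analog: a `classify` function gives each raw script line a named case, and the rest of the program branches on those names instead of on regular expressions. The line formats, the case names and both function names are assumptions invented for this sketch.

```python
import re

# Hypothetical line formats, loosely modelled on a screenplay layout.
SCENE_RE = re.compile(r"^(INT\.|EXT\.)\s+(?P<location>.+)$")
SPEAKER_RE = re.compile(r"^(?P<name>[A-Z][A-Z0-9 \-']+)$")


def classify(line):
    """Turn a raw script line into a named case plus its extracted data."""
    text = line.strip()
    if not text:
        return ("Blank", None)
    scene = SCENE_RE.match(text)
    if scene:
        return ("Scene", scene.group("location"))
    speaker = SPEAKER_RE.match(text)
    if speaker:
        return ("Speaker", speaker.group("name"))
    return ("Text", text)


def speakers_by_scene(lines):
    """Collect which characters speak in each scene, in order of appearance."""
    scenes = {}
    current = None
    for line in lines:
        case, value = classify(line)
        if case == "Scene":
            current = value
            scenes.setdefault(current, [])
        elif case == "Speaker" and current is not None:
            if value not in scenes[current]:
                scenes[current].append(value)
    return scenes
```

In F#, the classification and the branching would live in a single `match` expression; the appeal is the same in both cases, in that the messy recognition logic is written once and everything downstream reads in terms of names like "Scene" and "Speaker."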

The features are strong with F#

To show how effective F# is, Gabasova demonstrated how, using publicly available copies of the Star Wars scripts and an API called SWAPI – the “Star Wars API” – she was able to determine exactly who the most important character in the Star Wars universe is. SWAPI, which calls itself the “world’s first quantified and programmatically-accessible data source for all the data from the Star Wars canon universe,” was able to provide Gabasova with detailed information like characters’ heights, birth dates and other intimate details.

One of the biggest reasons Gabasova advocates F# is the availability of built-in type providers that can automatically provide the types, properties and methods you need to work directly with tables in, say, a SQL database. In this way, those working with diverse information sources can do so without having to manually write repetitive lines of code or add on files with a code generator. She proved how this worked by showing us how easy it was for her to determine the average height of a stormtrooper (5′ 9″, I believe?) and verify that Luke actually was a little short for a stormtrooper.

“F# makes it easy to specify certain attribute,” Gabasova said. “Type providers are amazing!”

May the visualizations be with you

One of Gabasova’s major points was the importance of visualizations. Without visualizations, she said, it is not possible to glean the insights you may want from your data. She then went on to reveal the in-depth visualizations she put together outlining the key “social network analysis” factors that determine the importance of a character, such as centrality and density of network.
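For readers who haven't met those two terms, here is a small Python sketch of how degree centrality and network density can be computed from a list of who appears together. It is only an illustration under my own assumptions: linking characters who share a scene is one common way to build such a network, not necessarily the rule Gabasova used, and the function names are made up for this example.

```python
from itertools import combinations


def build_network(scenes):
    """Link every pair of characters that appear in the same scene."""
    edges = set()
    for cast in scenes:
        for a, b in combinations(sorted(set(cast)), 2):
            edges.add((a, b))
    return edges


def degree_centrality(edges):
    """Share of the other characters each character is directly linked to."""
    nodes = {name for edge in edges for name in edge}
    if len(nodes) < 2:
        return {name: 0.0 for name in nodes}
    counts = {name: 0 for name in nodes}
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return {name: counts[name] / (len(nodes) - 1) for name in nodes}


def density(edges):
    """Fraction of all possible links that actually exist."""
    nodes = {name for edge in edges for name in edge}
    n = len(nodes)
    possible = n * (n - 1) / 2
    return len(edges) / possible if possible else 0.0
```

With placeholder scenes such as `[["A", "B", "C"], ["A", "D"]]`, character A is linked to everyone else and scores highest on centrality, while density reports how tightly knit the whole cast is. Note that in this sketch a character who only ever appears alone never enters the network.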

“Whenever you do data analysis, always visualize it,” Gabasova said. “Always.”

So…

…who is the most important character in Star Wars according to Gabasova’s research? In her words, “Darth Vader still rules the universe.”

Gabasova then went on to explain how these same F# techniques can be used within organizations to analyze the network connections that exist within their own companies. By taking data from things like Slack communications, email and other social platforms, it may be a lot easier to gain serious insights into the social structure of your company with the help of F#’s unique features. However, she insists that F# can really be used for all types of data and projects.

Uses for F#

Gabasova lists some of the features and use cases that make F# worth considering.

“I would just encourage you to play with the data you have,” she said.

Audience takeaways

Sameer C. Thiruthikad, a software developer at the Qatar Foundation and attendee of Gabasova’s talk, said he enjoyed the talk and the way Gabasova chose to present it.

“It was good,” he said. “It was really interesting because they used Star Wars to tell the story.”

Thiruthikad said his team currently uses C# and faces the very common challenge of organizing and visualizing large sets of data. But he said the session encouraged him to try F# as a solution going forward.

“I will surely get into F#,” he said. “I’ll just test the waters and see if there’s anything interesting there.”


June 28, 2016  3:28 PM

QCon New York Sessions – Fault injection with Microsoft

Fred Churchville
Fault isolation, Software testing

When it comes to testing software, many of today’s organizations rely heavily on comprehensive testing, especially unit testing, to minimize the risk of outages. But in this session, Michalis Zervos of Microsoft talked to audience members about what some consider the “next generation” of creating software resiliency: actually taking those anticipated faults and forcing them to occur in your software.

“Fault injection,” as Zervos refers to it, can be performed on everything from virtual machines to custom applications to hardware. Zervos’ team at Microsoft actively uses and promotes the practice in order to see not just how particular services are affected by certain unwanted events, but also how dependent services and software are affected.

Fault injection benefits

Zervos explains some of the reasons to adopt fault injection alongside testing.

“We create ‘storms in the cloud’ to see how it performs under pressure and failure and use that to create resiliency,” he said. And according to Zervos, fault injection can be used for more than just testing resiliency. It can also be used for things like testing new features, training and verifying staged deployments.

Zervos covered the numerous faults that teams could consider injecting, including creating a kernel panic, “hooking” and disrupting critical service code, crashing critical processes and even pulling the power plug on your data center. He also suggested a few publicly available tools that development teams can use to make the process easier, such as Consume.exe, Sysinternals tools and “managed code fault injection” through TestApi, a library of test and utility APIs.

Zervos did warn audience members that fault injection cannot be performed without certain precautions and considerations in order to achieve accurate results and avoid creating more problems. He cautioned that teams need to still follow fundamental security principles such as the least-privilege principle, make extensive use of code signing, create a “safety net” for the automatic removal of faults should they get out of a tester’s control and have a “kill switch” available, which he said can save developers and testers “a lot of grief.”
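To picture what a “safety net” and a “kill switch” might look like in practice, here is a toy Python sketch. It is not Microsoft's tooling or anything Zervos showed: the `FaultInjector` class, the `pricing-down` fault name and the two `fetch_price` functions are all hypothetical, and a real injector would also need the access controls and code signing he mentioned.

```python
import time


class FaultInjector:
    """Toy fault injector with an expiry 'safety net' and a global kill switch."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._faults = {}   # fault name -> expiry timestamp
        self.killed = False

    def inject(self, name, ttl_seconds):
        """Activate a named fault that removes itself after ttl_seconds."""
        if self.killed:
            raise RuntimeError("kill switch engaged; refusing to inject faults")
        self._faults[name] = self._clock() + ttl_seconds

    def kill(self):
        """Kill switch: drop every active fault and block new ones."""
        self.killed = True
        self._faults.clear()

    def is_active(self, name):
        """True while the fault exists and has not expired."""
        expiry = self._faults.get(name)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._faults[name]  # safety net: expired faults are dropped
            return False
        return True


def fetch_price(injector, item):
    """Hypothetical service call that honours an injected 'pricing-down' fault."""
    if injector.is_active("pricing-down"):
        raise ConnectionError("injected fault: pricing service unavailable")
    return {"item": item, "price": 10}


def fetch_price_with_fallback(injector, item):
    """Caller under test: it should degrade gracefully instead of crashing."""
    try:
        return fetch_price(injector, item)
    except ConnectionError:
        return {"item": item, "price": None}
```

Every fault here carries a time limit, so a forgotten experiment cleans itself up, and `kill()` gives whoever is on call one switch that ends everything at once. Passing in the clock also makes the injector itself easy to test without waiting for real time to pass.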

Zervos also stressed the importance of extensive verification and reporting when it comes to fault injection. He also told audience members that it is useful to manage fault injection from a centralized location.

“If you are not able to verify what happened, you don’t get the most out of your system,” he said.

System architecture for fault injection

Zervos presents his own system architecture in relation to a centralized fault management service.

One of Zervos’ final points was that it is not enough to simply perform fault injection every now and again. He stressed that teams need to integrate fault injection as a continuous part of the production cycle and find creative ways to encourage teams to adopt its practice. One suggestion he made was the idea of “recovery games,” in which one team member simulates an attack on a particular system and another team member, often a trainee, must record what occurs and take the proper steps to mitigate the risks of an outage. By implementing these types of programs, Zervos said his organization was able to increase adoption of fault-injection and also garner helpful insights about the behaviors of team members, such as that some spent too much time debugging and not enough time actually mitigating the problem.

“It needs to be part of the engineering process and part of the culture of the company,” Zervos said.

Fault injection recovery game goals

Zervos provides examples of the goals that can be achieved through adoption and training programs such as “recovery games.”

John Billings, technical lead on one of the infrastructure teams at Yelp and attendee of Zervos’ talk, said he thoroughly enjoyed the session and believes that fault injection is “the next step in actually testing resiliency of production systems.”

Billings, who also held a talk at QCon on the “human side of microservices,” said he particularly liked the fact that Zervos spent his time discussing the general principles of fault injection rather than talking about specific technologies. And while his company does already make use of fault injection techniques, he is hoping to push the adoption of this strategy even further within his company and hopes that others will as well.

“Tests can only cover so much that you’ve thought about beforehand,” he said. “If you actually have fault injection happening all the time in production, you get that additional level of reliability that otherwise would be very difficult to achieve.”

Billings also said he liked the idea of introducing “fault injection games” as an approach to encouraging the adoption of this strategy, but believes that these adoption strategies must align with a company’s individual culture. For instance, he noted hearing about the idea of a “badge-based system” that awards teams particular badges for completing and adopting certain testing and production techniques.

“You have to experiment and just see what works for your particular culture and your company,” he said.


June 23, 2016  6:14 PM

QCon New York Sessions – Incident Response with Etsy

Fred Churchville
Incident response
“Incident response – what makes it so terribly difficult?” – John Allspaw at QCon New York

“Anomaly response does not happen the way we might imagine it does,” John Allspaw, CTO at Etsy, said in his opening keynote presentation at QCon New York, “Incident Response: Trade-offs Under Pressure.”

Can we trust tools?

One of the first notes that Allspaw made is that organizations cannot simply rely on tools to make it easier to understand how and why incidents are occurring. Instead, teams need to rely on processes and reasoning in order to truly respond to anomalies. And they cannot, he said, treat these outages as a mystery that is constantly developing over time.

John Allspaw on tools at QCon New York

Allspaw believes that tools designed for incident response may never actually simplify the process.

“An outage is not a detective story,” Allspaw said. “It’s static, and it’s there.”

A model of reasoning

In order to properly deal with outage-causing anomalies, Allspaw recommended that organizations implement a “model of reasoning” that does not “distinguish between diagnosis and therapy.”

John Allspaw presents "model of reasoning" at QCon New York

Allspaw presents this model as an ideal strategy for anomaly response.

Avoiding “cognitive fixation”

Listeners were also warned not to fall into the traps of “thematic vagabonding” and “cognitive fixation” – meaning that those debugging the code can become so wrapped up in simply fixing bugs and symptoms that they fail to dig further and discover the actual root cause of the issue.

“As one thread of diagnosis comes in, you start running to more,” Allspaw said. He said that avoiding this requires developers and testers to communicate about what they are seeing and not get stuck alone on a path of just fixing bug after bug.

In fact, he provided a list of “prompts” that teams can use to frame particular questions, dividing the questions into four “stages” of incident response: observations, hypotheses, coordination and suggesting actions. By asking these questions, team members may be able to avoid “cognitive fixation” and get to the root of the problem.

Allspaw's "questions to ask" at QCon New York

Allspaw provided a list of question ideas and prompts that can help move anomaly response forward.

Final notes

Allspaw also talked about the importance of linking anomalies to any known, recent changes in the code or application and, more so, of having peers review your hypotheses.

“Validate the hypothesis that most easily comes to mind,” he said, while also adding that anyone who begins to build confidence about discovering a certain cause of an outage should always check that confidence with a peer review.

The "punch line" of John Allspaw's talk at QCon New York

Allspaw sums up his presentation at QCon New York by saying teams need to rethink how they approach incident and anomaly response.

