I recently posted about delivery issues with Postini, Google’s cloud email security service. This is a follow-on post about the incident root cause analysis and corrective actions. Maybe there are some lessons learned here that you can use in your organization’s service delivery.
The impact on customer email services lasted more than 24 hours while Postini engineers worked to resolve the issues. So, this was not an insignificant event. During this period, messages were delayed and users were not able to get to their quarantines to release messages trapped by filters. Administrators were also unable to access the administration console. The Postini support portal was unreachable at times due to the high volume of users trying to get updates on the event. The support phone line queues were very long and it took a long time to reach a support agent. Nothing like this has happened before in all of the years we have been a Postini customer.
I just received the incident report about the service disruption and wanted to share some of the information with IT Trenches readers.
Since very early today, US Eastern Daylight Time, Google’s Postini services have been experiencing some service issues. As of this writing, the cause and full scope of the issue are unknown. However, when logging into the Postini support portal, an administrator is given the following status indicators:
We have been Postini customers for over 4 years now and this is the first time an outage like this has happened. It’s not a full outage, as messages are still coming in, although at a trickle rather than at normal expected volumes. This outage is so bad that my ability to log in to the support portal is impacted. I receive either an internal 500 server error or “Too many connections Could Not Select DB”. A recent update notification said that a secondary Postini data center has been enabled.
The recent Gmail outage raised some concerns about cloud computing. I wonder if today’s Google Postini outage is a symptom of some deeper Google service delivery problem.
Thanks for reading & let’s continue to be good network citizens! Hopefully you are not trying to send me any messages; who knows how long it might take for a message to reach me today. Otherwise, let me know what you think here in the comments.
In America, October is the time when haunting, evil spirits and curses come to mind. Earlier today I posted a blog entry titled Can IT education bring an end to the recession? I used a quote that is attributed to a series of Chinese curses that go in ascending order of severity. After I used it, I pondered on the other two curses and their applicability to IT services.
According to Wikipedia, the three curses are:
- May you live in interesting times.
- May you come to the attention of those in authority (sometimes rendered May the government be aware of you)
- May you find what you are looking for
Well, by my title I don’t mean entirely end the recession, and especially not just through IT education alone. I was listening to the radio the other day and heard a snippet about an upcoming story. Unfortunately, I was unable to hear the entire story. However, the topic of the upcoming story raised an interesting question about the link between economic stimulus and education. I tried finding the story online but have been unable to find it to cite here. Maybe I just imagined it, but the story topic was about the GI Bill and how it helped a nation recover economically after the Great Depression and a costly war. The story preview continued to say that the nation grew economically and scientifically in the years following the war. It is as a result of those who were educated under the GI Bill that a new world of technology was shaped:
- Man in space and man landing on the moon
- Solar cells
Just think about all the marvels that have appeared in our world since the mid-1940s. The relationship between those educated under the GI Bill and these technological advances is easy to see. Now, fast forward to today and the current economic conditions. What will happen in the next few years in the world of technology as a result of those who have lost jobs being retrained in new skills and starting new careers? Maybe technological advances won’t be as rapid as those in the post-WW2 era, but I expect some life-changing advances due to the education and skill changes resulting from the current education stimulus packages.
Electronic medical records, for example, will change both the patient’s and health care professional’s lives. The technology advance may not be sexy like lasers, but it may have a greater impact on the country we live in as a whole. Maybe we are living under the Chinese curse that goes “May you live in interesting times!”
Thanks for reading and let’s continue to be good network citizens!
Have you ever wondered if vendor case studies are actually solutions to real-life issues or if they are stories about compensated organizations using a particular vendor solution? Well, I am here to tell you that I know of at least one award-winning vendor case study that is about an organization addressing real-life issues. The organization is the company I work for, and the case study is about the challenges we faced in replacing an under-performing legacy Frame Relay network with a more efficient and flexible global solution that delivers high availability, remote access, and integrated security. For the record, no compensation was given for being the subject of this vendor case study.
The case study won the 2009 Best Deployment Scenario – VPN/IPSec/SSL and was featured in the Info Security Products Guide. The winning case study and announcement can be found at Manufacturing Company Achieves Security and Performance Goals with Virtela’s Remote Access Services from the Cloud.
See all 2009 Best Deployment Scenarios and Case Studies. This would be a good time to look at these and see if any of the solutions may meet some of the information security needs of your organization. Consider putting the solutions in your 2010 budgets.
Feel free to leave comments here or contact me through ITKE if you would like more information. Thanks for reading & let’s continue to be good network citizens.
I recently came across an excellent article on the topic of TCP resets. TCP is a connection-oriented protocol, as opposed to the connectionless UDP. So, if you see TCP resets on your network, this is not necessarily a bad thing; resets are inherent in the protocol. Without TCP resets, a host could accumulate a lot of partially established connections that sit in a wait state awaiting further transmissions. This can exhaust the number of available sockets and cause the host to become unresponsive. Attacks such as the TCP SYN flood and LAND denial of service attacks of several years back abused connection setup along these lines. Another reset type is the RST/ACK. This is what a host sends back when a client attempts to connect to a service that is not available on that destination host.
If you manage a network and have seen RST packets in packet captures while working on a problem, or if you expect to do this at some point in your career, you need to understand the purpose and source of the RST packets. Take a few minutes to read this excellent article; it is the best explanation that I have seen on this topic. You will become better informed and better able to understand the nature of the network beast.
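If you want to see the reset for an unavailable service for yourself, here is a minimal Python sketch, not taken from the article. It assumes a local TCP stack that answers a connection attempt to a closed port with a reset, which is the usual behavior when no firewall silently drops the packet. The function name `probe_closed_port` is my own invention for illustration.

```python
import socket


def probe_closed_port() -> str:
    """Connect to a local TCP port that has no listener and report the result."""
    # Ask the OS for a free port, then close the socket so nothing listens there.
    placeholder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    placeholder.bind(("127.0.0.1", 0))
    port = placeholder.getsockname()[1]
    placeholder.close()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(2.0)
    try:
        client.connect(("127.0.0.1", port))
        return "connected"
    except ConnectionRefusedError:
        # The destination's TCP stack answered our SYN with a reset.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. a firewall dropped the SYN instead of resetting.
        return "timeout"
    finally:
        client.close()


if __name__ == "__main__":
    print(probe_closed_port())
```

Run it while capturing loopback traffic and you should see the SYN go out and a packet with the RST and ACK flags come straight back. A "timeout" result instead of "refused" would suggest something is dropping the packet rather than resetting the connection.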
Thanks for reading and let’s continue to be good network citizens.
Okay – if you support networks and have to explain why the network is slow or application performance is not what the users expect, why not use some of the following responses? These statements may or may not have been used in real life. What responses have you given to users when there really wasn’t a problem?
- Unfortunately we have run out of bits/bytes. Don’t worry, the next supply will be coming next week.
- The routing tables are all filled. There is going to be at least a 15-20 minute wait until you can be seated.
- Those packets have to go uphill to their destination. Gravity impacts network performance when you access services at that location.
- That is due to a BNC error. (i.e. brain not connected)
- The developer used a spell checker on that program. The fix will be delayed.
- The parallel processors are running perpendicular today.
Maybe a smile came to your face today while reading this. Maybe you have some similar comments to share with ITKE readers. Feel free to leave some words of wisdom for other IT Trenches members.
Thanks for reading & let’s continue to be good network citizens.
In part one of this series, I discussed ping and pathping. These tools are good for interactive, real-time testing. However, what do you do when you want to run these types of tests over an extended period and then do statistical analysis? In cases like this I use the fping tool. I recently completed an analysis task that required comparing network ping times against web server response times. The tool I used for measuring web server response (time to first byte) is called URL ping. Users were reporting slow web server (SharePoint) performance, and everyone was saying it was a network issue. Since there are so many “moving” parts between the users and the web server farm, I wanted to show that the network was not the issue and that the real issue was something inherent in the way the web server responds to requests.
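To give a feel for the kind of statistical comparison described above, here is a rough Python sketch. It assumes you have already collected two lists of timings in milliseconds, one of ping round-trip times and one of time-to-first-byte values. The `summarize` helper and the sample numbers are hypothetical and are not measurements from my analysis. The 95th percentile uses the simple nearest-rank method.

```python
import math
import statistics


def summarize(samples_ms):
    """Return basic summary statistics for a list of latency samples (ms)."""
    ordered = sorted(samples_ms)
    # Nearest-rank 95th percentile: the smallest sample with at least
    # 95% of all samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "min": ordered[0],
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Hypothetical, made-up numbers purely for illustration.
    ping_ms = [21.0, 22.5, 20.8, 23.1, 21.7]
    ttfb_ms = [850.0, 1200.0, 910.0, 2400.0, 990.0]
    for label, data in (("ping", ping_ms), ("time to first byte", ttfb_ms)):
        print(label, summarize(data))
```

If the ping figures stay low and tightly grouped while the time-to-first-byte figures are large and widely spread, that points away from the network and toward the web server itself.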
I didn’t realize how much I really didn’t know about CPU performance monitoring until I read this Microsoft TechNet blog on Interpreting CPU Utilization for Performance Analysis. As the article says: If you rely on CPU utilization as a crucial performance metric, you could be making some big mistakes interpreting the data.
Take some time and review this recent (August 2009) posting on this issue. If you manage or monitor Windows servers and watch server performance, this article will give you a better understanding of the ins and outs of interpreting CPU utilization.
Here are 4 of the 9 key takeaways that you will learn by reading this article:
Summary of Key Takeaways
Key takeaway #1: Processor of type A @ 100% utilization IS NOT EQUAL TO Processor of type B @ 100% utilization
Key takeaway #2: 2 HW threads on the same package @ 100% utilization IS NOT EQUAL TO 2 HW threads on different packages @ 100% utilization (for better or worse)
Key takeaway #3: 2 HW threads on the same logical core @ 100% utilization IS NOT EQUAL TO 2 HW threads on different logical cores @ 100% utilization (for better or worse)
Key takeaway #4: 2 HW threads on the same NUMA node @ 100% utilization IS NOT EQUAL TO 2 HW threads on different NUMA nodes @ 100% utilization (for better or worse)
Thanks for reading and let’s continue to be good network citizens!
Well, that may not be news to you. However, there is a recent trend in malware propagation that uses Google as the portal to deliver payloads to visitors. Unsuspecting users go to Google and search for topics such as Patrick Swayze’s death or the controversy about Serena Williams cursing at the line judge in her recent US Open tennis match. When a user selects one of the Google search results and visits the page, the site sees that the referrer is Google and downloads malware to the client computer. However, if someone visits the page directly or through another search engine, the website does not serve up malicious software.
For more information, see this article in The Register: Swayze death exploited to serve up fake anti-virus – I’ve had the crime of my life. Seems like malware is bombarding us from all directions now. You can’t even trust ads on the NY Times these days.
Thanks for reading & let’s continue to be good network citizens!