Recently we had an issue at a site where outbound messages larger than 1MB were backing up in the outbound message queue. The messages were tagged with a 421 4.4.2 Connection dropped error. This was a puzzling issue since the smart relay host was on the local LAN, and in fact, on the same switch as the Exchange server. We checked the switch ports and NICs for errors. None were found. We knew messages were successfully coming inbound through this site because the smart relay host was processing hundreds of them per hour (we use regional hubs and this is one of our hub sites).
We first contacted the vendor for the smart relay host appliance and opened a support ticket. No real issues were identified at first review. Since the errors were being reported at the Exchange server, we contacted Microsoft and opened a support ticket. We spent hours testing and changing configuration to another regional smart relay host which seemed to get the messages delivered successfully, but we were still not able to find out what was causing the conversations with the local smart relay host to timeout.
So, we went into deeper debug mode since the application and server event logs did not shed any light on the issue. The Microsoft engineer enabled protocol logging on this particular send connector. The protocol logs did give a little more information on the situation. A snippet is shown below.
2009-01-08T22:36:19.495Z,SendConn,08CB3FF87FA34699,17,exchsvr:20709,relayhost:25,<,”250 Requested mail action okay, completed.”,
2009-01-08T22:36:19.589Z,SendConn,08CB3FF87FA34699,19,exchsvr:20709,relayhost:25,<,”354 Enter mail, end with “”.”” on a line by itself.”,
2009-01-08T22:37:25.431Z,SendConn,08CB3FF87FA346A1,0,,relayhost:25,*,,attempting to connect
The conversation seemed to go fine at the beginning but something was happening at the end. Since this log did not freely give up that information, we used Microsoft’s Network Monitor 3.2 (btw-if you are still using an older version of Network Monitor, you should upgrade to v3.2. It does have some nice features that make it more user friendly – but not as nice as Wireshark) to capture the actual packets between the Exchange server and the smart relay host. We ran Network Monitor directly on the Exchange server.
At this point, we were able to capture the transaction failures. The results were very interesting and a good lesson in packet analysis versus protocol analysis. The packet analysis showed that TCP was working well. Everything at layer 4 and below seemed to be working well. This was a relief. However, it appeared that the actual problem existed at layer 6 & 7. The Exchange server was ending the SMTP (Simple Mail Transport Protocol) conversation with the “.” command (a single dot on a line by itself). The Exchange server was then waiting for the smart relay host to reply with a 250 2.6.0 status message saying the message was successfully queued for delivery. The Exchange server would then reply with a QUIT command and end the SMTP session. Since the smart relay was not responding at all with the expected status message, the SMTP conversation was timing out and messages were building up in the queue.
We found out that there were some patches for the smart relay host so we applied those. Once that was done, the messages seemed to flow normally. The other puzzling thing about this is that we have two other hub sites with the same configuration that are not experiencing this problem. So, sometime today we will be rolling out the patches to those smart relay hosts to prevent this problem from happening at those sites. This issue started out of the blue but seemed coincide with the same time Exchange Server 2007 rollup 5 was applied.
The point of this whole blog posting is that while the TCP protocol was working fine and everything looked good there, the SMTP protcol was not working correctly. It is important for a network engineer to understand networking through all of the OSI layers. You cannot just assume that if things are working well at the lower levels that things at the higher levels will work too. The reverse logic is true also. So, understand the protocols at the lower layers and also the PROTOCOLS at the upper layers if you really want to be an effective troubleshooting expert.
Let’s be good network citizens out there!