It started very suddenly. One morning we came in to find:
• Outbound Exchange mail queue building (inbound ok).
• Some inbound users unable to access our external services (web, terminal server etc.), depending on where they were connecting from: they could connect, but the session then timed out.
• Some outbound users having problems accessing certain pages on certain websites.
Because it had the biggest impact, attention was focused on the Exchange Server. Through our ISP (as a Gold Partner), Microsoft got involved and diagnosed that we were suffering from a “Black Hole Routing Issue”.
Defined broadly as:
A router somewhere along the network path does not respond appropriately when it receives a packet that has the “do not fragment” flag set and is too large for the router to process. Instead of responding with the message “Packet needs fragmenting but DF set” (to tell the source to fragment the packet), it simply drops the packet. This has the rather misleading effect of allowing an initial connection to succeed, which then times out as soon as larger packets are sent.
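To make that behaviour concrete, here is a minimal Python sketch of it. It is a toy model only, not a capture of our network: the `Hop` class, the hop names and the MTU values are all made up for illustration, and the non-DF case is simplified to "the hop fragments and forwards".

```python
from dataclasses import dataclass


@dataclass
class Hop:
    name: str                # hypothetical label for a router on the path
    mtu: int                 # largest IP packet this hop will forward
    sends_icmp: bool = True  # False models the misbehaving "black hole" router


def send(path, size, df=True):
    """Return what the sender observes for one IP packet of `size` bytes."""
    for hop in path:
        if size > hop.mtu:
            if not df:
                continue  # simplified: the hop fragments and forwards
            if hop.sends_icmp:
                return f"frag needed (from {hop.name}, MTU {hop.mtu})"
            return "timeout"  # dropped silently, the sender hears nothing
    return "delivered"


# Illustrative paths only: same MTUs, but one WAN hop stays silent.
healthy = [Hop("lan", 1500), Hop("wan", 1496)]
black_hole = [Hop("lan", 1500), Hop("wan", 1496, sends_icmp=False)]

print(send(black_hole, 576))   # small packet: the connection comes up
print(send(black_hole, 1500))  # large packet with DF: times out
print(send(healthy, 1500))     # a healthy router tells the sender to fragment
```

The small packet getting through on the broken path is the misleading part: the connection looks fine until something large is sent.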
Microsoft added the EnablePMTUBHDetect registry key to the Exchange Server, which forces the server to proactively detect “black holes” for each new connection and then reduce the packet size appropriately.
We duplicated this mod onto the rest of our outward facing servers, plus some client machines that were having browsing problems from the network, which got us back into an operational situation.
However, this is not a permanent solution. The registry mod dramatically increases network traffic, and we certainly do not want to have to implement it on every machine that ever connects to our network, so we need to actually track down the “black hole”.
The generally accepted method of testing for “black holes” is to ping with the -f (do not fragment) flag set, increasing the packet size until you either receive a “Packet needs fragmenting but DF set” message back (healthy) or the ping times out (black hole).
Therefore all the ping tests described below were conducted with DF set!
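For anyone wanting to repeat this kind of sweep, here is a rough Python sketch of it. It assumes the Windows ping command (-f for DF, -l for payload size, -n for count, -w for timeout in ms) and English-language output; the function names and the matched strings are my own choices, so adjust them for your system.

```python
import subprocess


def classify(output):
    """Map the text of one ping run to 'reply', 'frag_needed' or 'timeout'."""
    text = output.lower()
    # Wording differs between systems; both phrasings are assumptions.
    if "needs to be fragmented" in text or "needs fragmenting" in text:
        return "frag_needed"
    if "ttl=" in text:  # a normal echo reply line carries a TTL
        return "reply"
    return "timeout"


def ping_df(host, size):
    """Send one echo of `size` payload bytes with DF set (Windows flags)."""
    cmd = ["ping", "-f", "-l", str(size), "-n", "1", "-w", "2000", host]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return classify(proc.stdout)


def sweep(probe, sizes):
    """Run `probe` (any callable: size -> result) over the given sizes."""
    return [(size, probe(size)) for size in sizes]


def suspect_sizes(results):
    """Sizes that timed out although a smaller size got a reply."""
    replied = [size for size, result in results if result == "reply"]
    if not replied:
        return []  # nothing answers at all, so this test says nothing
    smallest = min(replied)
    return [s for s, r in results if r == "timeout" and s > smallest]
```

Usage would be something like `suspect_sizes(sweep(lambda s: ping_df("gateway-address", s), range(1400, 1473)))`, where `gateway-address` is a placeholder for whatever you are testing against. A non-empty result means some sizes vanished without any ICMP message coming back.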
Because all the problems we were experiencing related exclusively to our in/out connections to the internet, this is where we focused our testing.
The first point out past our Firebox (router/firewall) that we can test to is our ISP’s WAN gateway. When pinging this with a size of up to 1468 (plus header) we got a response; any higher and it timed out, until we reached 1472, which provoked a response of “Packet needs fragmenting but DF set”. (I suspect that this message was being generated by the PC itself rather than the gateway, as 1472 plus a 28 byte header is the 1500 MTU set on the PC.)
So... this would appear to confirm that there is a “black hole” on our WAN and that any packets over 1468 (plus header) are dropped!
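For reference, the sizes quoted here are ping payload sizes, and the 28 byte header is the 20 byte IPv4 header (no options assumed) plus the 8 byte ICMP echo header. A quick Python check of the arithmetic, with helper names of my own:

```python
IP_HEADER = 20    # IPv4 header, assuming no options
ICMP_HEADER = 8   # ICMP echo header


def packet_size(payload):
    """Total IP packet size for a ping with `payload` data bytes."""
    return payload + IP_HEADER + ICMP_HEADER


def max_payload(mtu):
    """Largest ping payload that fits in one packet at the given MTU."""
    return mtu - IP_HEADER - ICMP_HEADER


print(packet_size(1468))  # 1496
print(packet_size(1472))  # 1500
print(max_payload(1500))  # 1472
print(max_payload(1400))  # 1372
```

So a ping of 1468 is a 1496 byte packet on the wire, and an MTU of 1400 corresponds to a largest ping size of 1372.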
What our ISP suggested was to set the MTU on our Firebox WAN interface from the default 1500 down to 1400, on the basis that this might pre-negotiate all outbound traffic down to an acceptable size.
Because this ‘appeared’ to do nothing (we could still ping out onto the internet with 1468), we then set the LAN interface down to 1400 as well. This also had no effect, so we set it back again.
Now this is where things start to appear illogical. . .
We seem to have a situation where the machines on our network are split into two groups:
Group A
• Have suffered from the “black hole” issue from the start (i.e. either could not provide services or could not browse certain sites etc.)
• Can ping the WAN gateway, through the Firebox, with a size greater than the MTU set on the Firebox interfaces
• Can ping any other Group A machine with a size up to 1472 and get a response
• Cannot ping Group B machines with anything greater than 1372, otherwise the connection times out
Group B
• Did not seem to suffer any problems initially
• Now seem to have adjusted their MTU down to match the settings we temporarily put on the Firebox, even though those settings have since been lifted.
o Now, if we ping anything with a size greater than 1372 (1400 less the 28 byte header), the machine generates its own “Packet needs fragmenting but DF set” message.
o This is being held by the machine somehow, to the point where even on my home Wi-Fi network (different adaptor) I get the same results.
To really demonstrate this problem we took a Group A and a Group B machine, removed them from the network completely and connected them together with a single cable.
If we ping from A to B with a size of over 1372 it times out; if we do the same from B to A we get “Packet needs fragmenting but DF set”.
So in summary, it appears that we may have a “black hole” on our immediate WAN. However, we cannot really start to prove this, as our machines either ignore the MTU settings on the Firebox altogether or generate their own “Packet needs fragmenting but DF set” message regardless.