The information you provided really isn’t enough to track down the problem. What kind of switches are you using? What is the distance to the switch? Now that most installations are 100Mbit or even 1Gbit, length limitations are more critical than they were with 10Mbit networks. Is the problem consistent or intermittent?
Are you sure you were using ATM in the old installation? This doesn’t make sense to me.
Do you have any problems saving the files locally? This test will help show if the problem is really reaching the server or something on the machine.
Try pinging the server when the problem occurs. Can you do a capture of traffic during a failure? (Your company may have rules about this.) You can get Ethereal (now distributed as Wireshark) for free. A capture should help explain the nature of a network failure. You also really need to know if there are problems in the switch logs.
We had some problems similar to yours. The intermittent failures were driving me nuts because I couldn’t diagnose the problem before it went away. People had trouble saving files, logging onto the domain, staying connected to the Exchange server, and getting files from the servers. There were no errors in any of the switch logs. As far as I could see, the network was working fine.
Finally the problem got bad enough, for long enough, for me to track it down. It turned out to be our new Dell Gbit switch. A warranty replacement over spring break seemed to fix everything until the students came back.
When I called Dell support again they were unable to figure out what was happening and insisted I was running some new protocol that was messing up the network. A quick look at my captures didn’t show anything unusual. (If I had had time to look deeper into the captures I would have seen the retries.)
Finally, I discovered a pattern in ping failures to the switch. They would fail for a while, then succeed for a while. I figured out that when they were failing, pinging the router “fixed” my pings for a while. I immediately suspected the CAM table. The Dell tech showed me where to find it and I was surprised to see it held only 1K entries. We have more than 1000 nodes at our main campus.
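If you want to catch that kind of fail-for-a-while, succeed-for-a-while pattern without sitting at a console, something like the following might help. It is only a rough sketch: the function names are made up for illustration, it shells out to the system ping command, and it trusts the exit code, which on some platforms can report success even when the reply was an "unreachable" message.

```python
import subprocess
import sys
import time
from itertools import groupby


def ping_once(host, timeout_s=1):
    """Send one echo request with the system ping; True if it exits with 0."""
    # Windows uses -n for the packet count, most other systems use -c.
    count_flag = "-n" if sys.platform.startswith("win") else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,  # bound the wait from the Python side
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def collapse_runs(samples):
    """Turn [(timestamp, ok), ...] into [(ok, first_ts, last_ts, count), ...]."""
    runs = []
    for ok, group in groupby(samples, key=lambda sample: sample[1]):
        group = list(group)
        runs.append((ok, group[0][0], group[-1][0], len(group)))
    return runs


def monitor(host, interval_s=1.0, samples=60):
    """Ping host repeatedly and return the collapsed success/failure runs."""
    log = []
    for _ in range(samples):
        log.append((time.time(), ping_once(host)))
        time.sleep(interval_s)
    return collapse_runs(log)
```

Calling monitor() with the switch's address (and again with the server's) gives you a short list of runs instead of a wall of ping output, so alternating stretches of failures and successes stand out and can be lined up against a capture.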
Later that day I moved our servers over to a slower Cisco switch and the problems went away. My interpretation is that the CAM table overflowed and the switch started silently dropping packets. We are in the process of ordering an HP switch to replace the Dell switch. The CAM table in the HP switch is 16K.
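To put numbers on the table size, here is a toy model of a fixed-size address table. It is not how any particular switch works: I am assuming "1K" means 1024 entries, that the oldest entry gets evicted when the table is full, and the 1100-node figure in the usage note is just a stand-in for "more than 1000". It also does not model what the switch does on a miss (many switches flood frames for unknown addresses rather than drop them).

```python
from collections import OrderedDict


class ToyCamTable:
    """Fixed-capacity address -> port table; the oldest entry is evicted when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def learn(self, mac, port):
        """Record that mac was seen on port, evicting the oldest entry if needed."""
        if mac in self.entries:
            self.entries.move_to_end(mac)  # refresh an address we already know
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # table full: drop the oldest entry
        self.entries[mac] = port

    def lookup(self, mac):
        """Return the learned port, or None if the address is not in the table."""
        return self.entries.get(mac)


def count_misses(capacity, node_count):
    """Learn node_count distinct addresses once each, then count failed lookups."""
    table = ToyCamTable(capacity)
    macs = [f"node-{i}" for i in range(node_count)]  # placeholder addresses
    for port, mac in enumerate(macs):
        table.learn(mac, port)
    return sum(1 for mac in macs if table.lookup(mac) is None)
```

In this model count_misses(1024, 1100) comes out to 76, because every address learned after the table fills pushes an older one out, while count_misses(16 * 1024, 1100) is 0. Learning an evicted address again makes its lookup succeed until something else pushes it out, which is at least consistent with the way a ping to the router "fixed" things for a while.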
The point of this long story is to show you can have problems even when all of the network logs say there is no problem. If I had been set up for a capture during one of the intermittent failures I would have found the problem sooner.