Posted by: Dan O'Connor
I came across a very strange time issue on a Linux machine that I manage. I use ntpd to keep time synced across the network, like anyone should: a handful of machines talk to the internet servers and the rest use those machines as a reference. This worked for years, and then one day the clock fell behind.
Stopping ntpd and forcing a sync with ntpdate fixed it for all of five minutes. When ntpd started back up it would connect to the servers in the list, but it never got a * lock on one. You could also watch ntpq -pn and see the time slowly drift away from the reference servers.
Typically I don’t use iburst with a server declaration in the ntp.conf file. It speeds up the initial synchronization, but I have never run into a situation where it was explicitly needed. Here I gave it a shot, and it seemed to work! I got a * lock on one of the servers and a + and - on the others. Still, I thought it strange that I suddenly needed iburst when I had never needed it before. I kept watch on the server with ntpstat and ntptime, which show the sync status, along with ntpq -pn. Again it started drifting away!
Here is a quick dump of a properly running NTP server.
 remote         refid           st t when poll reach   delay  offset  jitter
+220.127.116.11 18.104.22.168    2 u  644 1024   377  64.714   4.926   1.400
*22.214.171.124 126.96.36.199    2 u  926 1024   377  56.436   4.623   0.687
-188.8.131.52   184.108.40.206   3 u   75 1024   377  42.255   3.743   1.738
+220.127.116.11 18.104.22.168    3 u  938 1024   377  53.608   4.713   0.742
In this case I would get it working for a bit, then I could watch the offset slowly increase on the servers. It would hit a point where it would give up on a server and not mark it at all, then move on to another. It would keep doing this until it finally knocked them all off the list and errored out, saying it could not connect to an NTP server.
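If you would rather not stare at ntpq -pn by hand, the check described above can be roughly sketched in Python. This is only an illustration, not something I ran at the time: the names parse_peers, sync_peer and worst_offset_ms are made up for this example, and it assumes numeric (-n) output with the ten columns shown in the dump above.

```python
# Hypothetical helper names; assumes `ntpq -pn` style output with numeric
# addresses and the ten columns shown in the dump above.
TALLY_CODES = "*+-#x.o"  # characters ntpq may place in front of a peer

SAMPLE = """\
remote refid st t when poll reach delay offset jitter
+220.127.116.11 18.104.22.168 2 u 644 1024 377 64.714 4.926 1.400
*22.214.171.124 126.96.36.199 2 u 926 1024 377 56.436 4.623 0.687
-188.8.131.52 184.108.40.206 3 u 75 1024 377 42.255 3.743 1.738
+220.127.116.11 18.104.22.168 3 u 938 1024 377 53.608 4.713 0.742
"""


def parse_peers(text):
    """Turn the peer lines of ntpq -pn output into a list of dicts."""
    peers = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, the column header and the ===== separator.
        if not line or line.startswith("remote") or line.startswith("="):
            continue
        # A peer with no mark at all is recorded with a blank tally.
        tally = line[0] if line[0] in TALLY_CODES else " "
        fields = (line[1:] if tally != " " else line).split()
        if len(fields) < 10:
            continue  # not a peer line in the expected layout
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "reach": int(fields[-4], 8),     # reach is printed in octal
            "offset_ms": float(fields[-2]),  # offset column, milliseconds
        })
    return peers


def sync_peer(peers):
    """Return the peer marked with *, or None if nothing is locked."""
    return next((p for p in peers if p["tally"] == "*"), None)


def worst_offset_ms(peers):
    """Largest absolute offset among the listed peers, in milliseconds."""
    return max((abs(p["offset_ms"]) for p in peers), default=0.0)
```

Feeding it the text of ntpq -pn every few minutes and logging sync_peer and worst_offset_ms would show the same pattern I saw by eye: the offsets creeping up and, eventually, no peer carrying the * mark.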
I captured the traffic on the interface after restarting the service, and it appeared to be working fine. I could see the frames going back and forth even as ntpd slowly started to ignore the servers.
I was starting to run out of things to even look at when, on a whim, I ran an fsck, and there it was. Errors on the file system, lots of them.
Just like that, after a fix and a reboot, the time locked back on. I have not tried to work out exactly how the file system errors caused the problem; I am just surprised that this was it.