Posted by: David Vasta
8.5, Domino, Domino 8.5.X, IBM i, Lotus, OS/400, System i, V5R4, V6R1, V7R1
I have been fighting a problem in my current employer’s environment that has been hard to diagnose. The mail servers run on IBM’s Power platform and IBM i (OS/400) V5R4. The Domino version is 8.5.1 FP4. Given that, you would think everything is going to be just fine. There is plenty of memory, plenty of CPU and about 1.5TB of extra disk space. Everything should be perfect, but it’s not, and you know this because I would not have started out by saying I was “fighting a problem”.
What was the problem?
There are two servers, one primary and one secondary. The majority of the users would run on the primary until it crashed and then, as expected, would jump over to the secondary. The problem was that when they started to move to the secondary, that server would run up to 300% CPU (which is OK in an LPAR system if you have the CPU) and eventually crash as well, leaving NO mail server, which is a bad thing.
After begging for help from IBM Support I finally got some answers about the issue. It seems there is a problem with OS/400 V5R4 that does not make Domino 8.5.X very happy. The issue lies in the way OS/400 does a lookup on the files/DBs: OS/400 adds a 4K record to every record looked up. On a server with over 5,000 users and a Domino Directory with 25,000 users and 65,000 groups, that is going to get messy. So when the failover started, the server would go nuts trying to index, compact, fail over and do lookups to make sure everything was right, going into a spiral of doom and eventually crashing. You add 4K to every record, every lookup and every little document while it is failing over and you get all kinds of issues.
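To get a feel for the scale, here is a rough back-of-the-envelope sketch. It assumes the simplest possible case, which is my own illustration and not something IBM confirmed: each directory entry (user or group) is looked up exactly once, and each lookup costs exactly 4 KiB extra. The function name and the one-lookup-per-entry model are hypothetical.

```python
# Rough estimate only. The 4K figure comes from the description above;
# "one lookup per directory entry" is an assumption made for illustration.
RECORD_OVERHEAD_BYTES = 4 * 1024  # the extra "4K" per looked-up record


def lookup_overhead_bytes(records_looked_up: int,
                          overhead_per_record: int = RECORD_OVERHEAD_BYTES) -> int:
    """Return the total extra bytes if every lookup adds a fixed overhead."""
    return records_looked_up * overhead_per_record


directory_users = 25_000
directory_groups = 65_000

# One full pass over the Domino Directory, touching every entry once.
one_pass = lookup_overhead_bytes(directory_users + directory_groups)
print(f"{one_pass / 1024**2:.1f} MiB of overhead per pass")  # 351.6 MiB
```

Even under that generous assumption a single pass over the directory adds roughly 350 MiB, and during a failover the same entries are likely hit many times over by indexing, compacting and user lookups.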
So not only was my primary down, now my secondary was down as well.
So with both servers down a few questions come up:
1. Why? and 2. When can we move to Exchange?
1. I am working on it and 2. Never!
I am still looking for the IBM article that describes the problem. I will try to contact IBM Support again this weekend to get it. I am sure some of you want to know more.
The fix is simple (I use the term loosely): upgrade the OS to V6R1 or V7R1 and it fixes the problem. But in a big IT environment it’s not that easy. So I have spent the past 6 months building new servers and moving them into place to replace the old ones. I am still in the process of doing this today and things are going well. The secondary server is doing well and the primary is due for a swap-out in the next few weeks.
Now the question is WHY?
Why is this an issue?
Why did IBM fail to fix it years ago, and lastly, WHY is V5R4 still supported?
I know V5R4 will fall out of support in the next few months. I think it is later this year, but as some of you have pointed out, this is unacceptable and the question needs to be asked: why did IBM not want to invest the time to fix what I consider one of its most stable and most robust OSes (short of OS/2)?