Saturday, May 21, 2011

True Tales From IT Land

Before I joined a financial services company I was a systems programmer, but for the last 15 years I've been doing a variety of jobs in what's known as "IT" (Information Technology). During those 15 years some goofy things occurred, which I will share with you, reader.

The Never Ending Elevation
No, this doesn't have anything to do with an erection lasting longer than four hours. An elevation is the introduction of new software into the production environment. Think of it as software having to get a job following college. Well, this particular elevation involved the company's Intranet application, with some major new features: single-signon with Windows and personalization of the home page.

Note: These two are interdependent. The single-signon allowed us to identify the user and this permitted the user's preferences to be stored and recalled.

This release had been delayed for several months because the project was dependent on another team to deliver the single-signon solution. When this feature was finally ready we scheduled the elevation to start on Friday evening and finish on Saturday. From the start things did not go well. There were configuration changes that could only be tested in the production environment because it was deemed too costly to replicate the hardware. By Saturday evening it was apparent that the elevation would continue into Sunday.

This was right at the time of my life when I was suffering from severe lower back pain. The stress of a major elevation (BTW, I was the Technical Lead responsible for the elevation) was compounding my physical discomfort as I followed the proceedings via an extended conference call.

By Sunday morning we had the configuration right so that the single-signon part was working. Testing of the home page preferences was giving mixed results, however. Sometimes it worked and sometimes it did not. What was particularly strange was that the tests always worked in our test environment.

We finally isolated the problem to the database -- the production database was not behaving like the test database. It was time to get the on-call DBA on the line. This was my job and I spent four hours on the phone with the guy. He couldn't resolve the problem but I couldn't let him go because my boss wasn't going to let us go home until the problem was solved.

Finally the DBA called one of his colleagues and learned that the production database driver needed a patch. The patch had been applied everywhere else, but not to production. Once the patch was applied, everything worked. The elevation had taken around 40 hours to complete. 40 hours of pure hell.

The Urge to Kill
For some reason as Technical Lead I was expected to know every operational aspect of the Intranet. In other words when there was any operational problem, I had to diagnose it and fix it. The common term for this procedure was to find the "root cause" even if this meant tracking down the cosmic particle that flipped a bit on the memory chip of the server.

Remember I said that our Intranet had single-signon? This meant that if you signed on to your Windows computer you did not have to sign on to the Intranet application, as it already recognized your login. This was no small feat because our Intranet ran on Unix servers with all kinds of crazy network devices in the picture.

99.999% of the time the single-signon worked perfectly, but there were occasions when, out of the clear blue, the home page would present a login form. We had seen this during testing and knew that the cause was some process consuming an unexpected amount of CPU, such that the server could not complete the Kerberos protocol within a given number of milliseconds.

Usually this problem wasn't noticed and everything went back to normal. But one day operations noticed it and filed a trouble ticket. I was assigned to resolve the trouble ticket. Now our Intranet involves at least a half dozen servers not to mention specialized network devices. The problem was sporadic and did not follow a pattern. Here's your haystack; have fun finding the needle.

To make matters worse I had to report status via a daily teleconference called the DSR (Daily Status Report). The DSR is designed to improve the overall operations at the firm by the relentless pursuit of "root cause." Actually it's a means to pillory Tech Leads who have no idea why the software occasionally misbehaves.

Anyway, day after day I had to report no progress. Then suddenly someone from operations was on the call and said that he had traced the problem to one of the servers that was used to store employee pictures. He further stated that the problem occurred when a certain script was run by the database group. He arranged for a rep from that area to attend the DSR the next day.

The next day a manager from the database group was on the teleconference and was confronted with the miscreant script. He said, "We don't have to run that script." The case was closed without the slightest apology to me or the rest of my team, who had spent hours trying to diagnose the problem.

The Computer is in Control Here
My last tale is one of the most mysterious that I've ever had in almost 40 years of computer work. Without going into tremendous detail, I'll just say that I made a configuration change in the way one server authenticated to another server using stored credential (username/password) data.

To improve operations I needed to change the credentials the first server used. First I tested this on my desktop by running the first server there. The credentials change worked fine.

Next I proceeded to the production environment. To ensure it worked, I monitored the second server's log file as I made the change. If the new credentials were rejected it would show immediately in the logs, and I would be forced to back out the change. But the change worked and I could see it working in the log files.

About 30 minutes passed and the next thing I know my colleague tells me that he's seeing authentication errors on the first server. So I go directly to the credentials file that I had changed and see that my change has been reverted: the file is now using the previous username but no password. This is truly bizarre and marks the first time I have seen the system spontaneously reject a configuration change made by a human.