Saturday, March 17, 2007

What Happened with the Nation's Tax Software?

Canada's tax software system shut down on March 5, 2007. It's apparently back up as of March 15, 2007.

There are very few details available on what actually happened. Here is a summary based on news reports (see http://www.cbc.ca/canada/story/2007/03/07/tax-glitch.html and http://www.cbc.ca/money/story/2007/03/09/taxglitch.html):

- Canada Revenue Agency (CRA) shut down the entire system that processes tax returns on March 5, 2007.
- CRA stated that neither hackers nor viruses got into the computers.
- By March 7, CRA stated that they had identified the cause of the problem after "working through the night".
- The problem started after 20 "maintenance operations" were carried out over the weekend.
- There are 75 databases and they were all impacted. CRA stated that they were bringing them back online in stages.
- The databases that allow the e-filing are the "most complex" and would be the last to come online.
- Any previously submitted tax returns were put in suspension, i.e. stuck in the queue until the fix was put in place.
- Business returns were not affected.
- CRA didn't have a re-launch date until the last minute - even on March 9th they were not offering a re-launch date.
- On March 9th, CRA stated that "our solution is working and in the past 24 hours we have restored several databases". Presumably this means the fix involved some sort of long-running process, e.g. they had to re-test all the data, migrate the data, or otherwise process it in order to fix the problem.
- CRA stated that the problem was an "infrastructure problem".

All of this leads to some pretty scary questions and points to some basic deployment methodology problems:

1. Why would you be making any changes to your software on March 5, at the prime season of tax filing? What about something called a CODE FREEZE?!

2. Why would it take 2 days to figure out the problem? Surely with a system that large you would have enough change management in place to know exactly what was deployed?

3. What kind of "maintenance operation" corrupts the tax software system?

4. Why does an application suite with 75 databases have such tightly coupled dependencies that you have to take down all 75 of them because of a patch? Why is it that a re-launch couldn't be done in phases, i.e. why did all 75 databases have to be brought back into production in one big bang re-deployment?

5. Has anyone heard of a ROLL-BACK plan?

6. Aren't there audit records that tell what data was changed and when? Couldn't CRA tell what tax returns were clean and what ones had been changed after the deployment?

7. What kind of software patch requires such an extensive re-scanning of all 75 databases over a 10-day period? Speculating here, this points to two scary scenarios - 1) the bug corrupted 75 databases in a very short period of time; 2) no one was sure what data was corrupted and therefore the entire data set had to be re-scanned or re-tested to make sure it wasn't corrupted.

8. What confidence does the Canadian public have, now that it's back up and running, that the problem is fixed? In addition, has the CRA made adjustments for the fact that millions of Canadians who have been waiting to efile are all going to try to get back in at the very same time on Thursday/Friday?

9. How long was the problem going on? CRA states that the problem occurred over the weekend of March 3-4 but that's a pretty small window to corrupt 75 databases...

10. What kind of "infrastructure problem" kills an application? Infrastructure to me means power, hardware, network, maybe OS, etc. It shouldn't include things like database changes, application changes and probably not even things like server updates, OS patches, JDK updates, etc. unless you're really liberal with the term "infrastructure". In my experience, the infrastructure is the least probable culprit for an application failure and the easiest to detect - if the power shuts off you know it immediately. If a database transaction error occurs occasionally because some coder forgot to declare the transaction boundaries in one method, that could kill your application and be difficult to detect. That's not an infrastructure problem.

11. Where was QA?
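
To make the transaction-boundary point in question 10 a bit more concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. Everything in it is hypothetical (the `returns` table, the function names, the simulated failure) and it says nothing about how CRA's systems are actually built. It only shows how the same batch update leaves different data behind depending on whether a transaction boundary was declared.

```python
import sqlite3


def make_db():
    # Hypothetical in-memory table of filed returns. isolation_level=None puts
    # sqlite3 in autocommit mode, so a transaction exists only if we BEGIN one.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE returns (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO returns (id, status) VALUES (?, ?)",
        [(1, "filed"), (2, "filed"), (3, "filed")],
    )
    return conn


def _update_all(conn, fail_at):
    # Shared batch logic: mark every return as reprocessed, with a simulated
    # failure when we reach the row whose id equals fail_at.
    ids = [row[0] for row in conn.execute("SELECT id FROM returns ORDER BY id")]
    for row_id in ids:
        if row_id == fail_at:
            raise RuntimeError("simulated failure partway through the batch")
        conn.execute(
            "UPDATE returns SET status = 'reprocessed' WHERE id = ?", (row_id,)
        )


def reprocess_without_boundary(conn, fail_at):
    # No transaction declared: each UPDATE commits on its own, so a failure
    # midway leaves the table half updated.
    _update_all(conn, fail_at)


def reprocess_with_boundary(conn, fail_at):
    # Explicit transaction boundary: the batch is all-or-nothing.
    conn.execute("BEGIN")
    try:
        _update_all(conn, fail_at)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def statuses(conn):
    return [row[0] for row in conn.execute("SELECT status FROM returns ORDER BY id")]


if __name__ == "__main__":
    for reprocess in (reprocess_without_boundary, reprocess_with_boundary):
        conn = make_db()
        try:
            reprocess(conn, fail_at=3)
        except RuntimeError:
            pass
        print(reprocess.__name__, statuses(conn))
```

In the first case two of the three rows end up changed and the third does not, with nothing in the data itself to tell you which is which. In the second case the failure leaves every row as it was. That kind of silent partial update is the sort of thing that could force a re-check of an entire data set.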

I'm not sure we can get any answers to these questions. No one in the press is technically savvy enough to ask them, and the CRA is obviously trying to keep the story simple and use very user-friendly language (words like "glitch" and "irregularities") to keep the public from getting nervous.

If anyone has any insider details on how the tax system of this country is architected, I would love to hear the details - send them to cwoodill@hotmail.com.

I for one am going to wait a few weeks before efiling - I'm not sure I want to be the first through the gates of a newly patched ("we swear, this time we're good to go!") software application.
