The Six Stages of Field Service Support(TM)
by Frank Durda IV

[Copyright 1981,1984,1987,1996,1998-2012 Frank Durda IV, All Rights Reserved.
Mirroring of any material on this site in any form is expressly prohibited.
The official web site for this material is:  http://nemesis.lonestar.org
Contact this address for use clearances: clearance at nemesis.lonestar.org
Comments and queries to this address: web_stories at nemesis.lonestar.org]

This is a living document, and it is updated from time to time. This document was last updated October 16, 2012.

If you have always wondered about the various stages of support action your computer maker provides when your system self-destructs, here is a handy guide that will tell you everything you ever wanted to know about the Six Stages of Field Service Support and how to identify the symptoms of being at each stage.

Much of this knowledge is based on twenty-five years of careful observations at sites with DECsystem-20s, DEC Alphas, IBM 370/155, VAX 11/780, IBM 4341, and other systems with many different service organizations. Some machines needed more help than others. For example, the DEC-20 once managed to get over 300 hours of down-time in a single month, so it needed lots of support. And that wasn't one of the months when the computer room filled with sewage.

DISCLAIMER #1

I have dealt with other computer companies and the following stages apply to just about every vendor out there, so don't think that I am just picking on Digital Equipment Corporation (DEC), now a proud division of the Compaq Computer Assimilation Corporation, er wait, make that the Hewlett Packard Assimilation Corporation. These rules of the service and support universe also apply to almost every mechanical or electronic thing, including that toner-belching photo-copier you love to hate.

DISCLAIMER #2

For those of you that think I didn't like the DEC-20, don't get me wrong. It was one of the best computer architectures I have ever dealt with, and TWENEX and TOPS-20 were two of the best operating systems around at the time, certainly better than OS/MVT, VM, HASP, MS-DOS and the other junk that the competition were pushing. And the TOPS-20 clock won't go boom in a few years like most other systems I know. DEC learned their lesson in 1978 when all the PDP-11s fell over.

Regardless of who made your computer or how large or small it is, these events may seem hauntingly familiar.

THE "YOU HAVE BEEN WARNED" DISCLAIMER

In case you needed to be told, there are barrels of :-) here.

"May the road rise to meet you, and may you never go beyond Stage Two."
- Ancient greeting, date unknown.

Stage Zero
The Return of the BUGHLT: SWPUPT

or How I hate the message "%DECSYSTEM-20 NOT RUNNING"

Stage Zero is where your journey begins when all is well and then suddenly all the terminals around the university or office go dead and the "French-fryer" beepers on the DECwriter terminals all simultaneously go off. This is the way that the TOPS-20 operating system told everybody that it had crashed and that it had also lost what everybody was doing. Alerted by the beeping sounds or the cries of anguish from the users (or both), the keepers of the system rush to the machine room. On arriving in the machine room, you may smell the problem or just see the flames coming out of the processor.

This type of event invariably happens four minutes after the daily service contract period ends, which means it will cost big bucks to get the Field Service Engineers (FSEs) to come out to fix the problem right now. Your management has also disappeared for the day, leaving you with no authority to spend money to get help.[1]

Meanwhile, the users are already starting to press their noses and cheeks against the computer room windows, as if they think that their concerned stares will somehow make things better. Believe me, they do not. Everybody in the computer room will avoid looking towards the windows, even to the extreme of walking backwards and feeling behind themselves for the manuals in the bookcase that was placed too close to the visitor viewing windows. More often than not, this results in more accidents, such as knocking over the one-gallon jar of jalapeno peppers that formed the complete daily evening meal for the console operator. Seriously.

Left with all the options that don't involve spending new money, you go and call DDC, the Digital Diagnostic Center (saying this is always accompanied by a jarring chord of music, such as that heard in the film "Monty Python and the Holy Grail" when they say "A SHRUBBERY!").

For those of you who don't deal with DEC, DDC is this neat service that you call when your computer system starts doing strange things. DDC can run diagnostics on your computer from where they live (used to be Colorado Springs, but its latest incarnation seems to be in a facility in Atlanta which seems to be named after a space alien who used to appear in badly-written sitcoms - ALF) or study core dumps your machine may have emitted, and these tests help isolate the problem before the local service office even knows that there is a problem. Well, that was the idea anyway.

If the support structure for your computer doesn't have something equivalent to DDC, proceed to Stage One.

Assuming you do have a DDC-a-like, you give them a call, and they will take your name, phone number and serial number. (The serial number of the computer, not yours.) They may even ask what the trouble is.[2] Now they will tell you that they will call you back as soon as they have a service representative available. Actually, this delay is deliberate and gives DDC time to check their records to make sure that the serial number you gave them really resides at the phone number you gave them when you called. They learned this precaution from the pizza delivery industry, either that or else DEC had a lot of problems with the wild guys over at the Delta-I-Q fraternity house at MIT calling in prank trouble tickets on the campus computers (or perhaps on computers belonging to other schools). I hear that "wild" stuff like that still happens all the time up there.

Just substitute the name of your remote support service company where it says "DDC".

Anyway, sometime later, someone from DDC will call you over some trans-Atlantic phone line.[3] Going to a quiet room to talk to the service representative won't help as they always have you go back into the machine room and load the field-service pack and set switches on the front-end processor and "boot from SW" or perhaps they have you just stand in front of the computer and see if the floppy drive light comes on. For machines with more than one switch, the DDC reps always seem to insult your intelligence by giving the switch settings like: "Set the two right-most switches down and then skip two switches and push the next one down." People who have called DDC more than once learn to ignore these instructions since DDC always asks for the same switch settings and so you just set the proper octal value on the front-end processor control panel and say "uh huh" a lot to the rep.

DDC can now take over your system via the front-end processor and run diagnostics which test the various parts of the machine as well as testing the amount of paper you have left in the console DECwriter terminal.

You must not leave the area while these tests are being run, because the DDC person will probably contact you next, not by calling you on the phone, but by typing messages to you on the console. If you aren't sitting right there so you can respond, the DDC rep may go away and you get to start the problem reporting process all over. You need to hang around anyway because someone needs to be standing by to un-jam the console printer.

If the tests eventually find something wrong, DDC will contact the local field service office who will come out to your location with "all the parts necessary"[4] to work on the diagnosed problem whenever the on-site service period resumes. By having a FSE arrive at your site, you proceed to Stage One.

If DDC is unable to run any diagnostics because the front-end processor is dead or the smoke that is pouring out of the system is too thick for you to see if the floppy drive light is coming on, proceed to Stage One.

If DDC doesn't find anything wrong, count the number of times that the system has crashed in the last week from unexplained causes or problems that cannot be diagnosed that you have reported to DDC. If the number of crashes is greater than a secret quantity which you do not know and will not be told, proceed to Stage One.[5] If you haven't reached the magic number, reboot the system and remain in Stage Zero, although after each crash, DDC might give you a slight change to make to the system configuration that will help cloud the issue later when the troops do arrive.

Important Murphy Law: Never hang up the phone before the system comes back up, or else the system will immediately fail again and you will have to re-train another service rep on your ability to work with octal numbers.

If the system crashes again, call DDC again and repeat Stage Zero.

Notes for Stage Zero

[1]

A no-win situation. You will be blamed for not taking any action as the down-time will cost the business money because the systems were down, and you will be blamed for spending unauthorized money if you do take action. The best solution is to take the back door and get out now.

[2]

Don't try to force details of the crash on the first person you reach. They are usually trained only to get a few contact details and if you try to give them information on the actual problem, they are just as likely to overwrite your phone number with the problem description, delaying the return call further.

[3]

Actually the remote support people do not always use a trans-Atlantic phone line. With the discount phone services now available, your phone call can also be routed over those phone lines you see in the country lying in the ditch and bushes along the railroad tracks. All of these phone lines have the requisite Signal-to-Noise Ratio of about 0.2dB. This low quality phone line actually has a valid purpose. Since it is so difficult to communicate with the support representative, you are less likely to start any lengthy conversations about how much smoke is coming out of the malfunctioning system, how many students are standing behind you with final exams next week, or how much money your firm will lose if the machine isn't fixed. Consequently, this allows the rep to spend more time looking at the diagnostic results and frees up reps so that they can help other sites quicker.

[4]

"All the parts necessary" really means, "all the spares kits that should contain the right parts necessary". These black cases can truly be a Pandora's box, because sometimes the boards within have come out of another machine across town and may possess their own special problems that can now be added to yours.

An added complexity is that these days for certain systems, they seem to bring one and only one card out at a time, meaning you have to wait if any associated cable, screw or other part is discovered to be the real culprit.

[5]

Although not revealed, you will be told by phone if you have reached the requisite number of crashes or not. Phrases like "this is a real strange one" mean you are getting close but not quite there. On the other hand, "I'll pass this call on to the local office" means that you are closer to zero than you are to the magic number. "I'll get right back to you" means that the local office is going to call you next, not the person at DDC. I never understood that one.

If the FSE was on site when the problem initially occurred, or if you have fallen back to Stage Zero from a higher level, the phrase "I'm going to get some other kits out of the car" or "we [Royal We] are going out for some lunch" indicates that you have not had sufficient reproducible failures to warrant a gutting of the system. The hope is that the machine will either repair itself before they get back, or it will completely melt-down, allowing them to skip to Support Stage Two, where your service call escalates and the problem becomes Someone Else's Problem (SEP).

Stage One
Field Service Arrives

or "Why you should invest in some ACME handcuffs and chains"

You arrive at Stage One in one of five ways:

1. DDC was called and the diagnostics found something wrong with your system.
2. DDC was called many times and the diagnostics have not found any cause for the crashes. You must reach the magic (and secret) number of crashes and phone calls to support to use this reason. Accompanying this reason is a mandatory replacement of a piece of hardware. When pressed, the vendor may actually admit that none of the diagnostics actually identified this as being the failing part, but they felt they had to bring at least one part to your site.
3. DDC has recommended some combination of the application of OS upgrades, patch kits, backing-out patch kits, reformatting and reinstalling the OS without the patch kits, putting anti-nuclear-blast sticky-tape on the windows, downgrades to earlier OSes, small animal sacrifices, and none of these things have made you go away. You, not the problem.[6] Having exhausted the standard list of software-related causes, the ball gets lobbed into the hardware group even if the problem started the instant you applied a new OS upgrade. Now, field service is sent to make sense of what so far has been done remotely, or if the failure seems to make sense, change some hardware around in your system so that the failure really won't make any sense.
4. You don't have anything equivalent to DDC for your system.
5. Field Service was performing the monthly Preventive Maintenance (PM) and the system wouldn't boot any more when they got finished. (This is the equivalent of your car not starting any more after having the guy at the gas station "check under the hood".)[7]

You are not in Stage One if the FSE was performing PM on your system and when you returned from lunch, you found your entire VAX 11/780 tilting at a 45-degree angle, with the FSE desperately trying to get the system back upright or at least trying to keep it from tilting any further. This actually happened once in my presence - something about not extending those stabilizer legs before opening all of the cabinets. Now, if the system does tip completely over, then you get to go to Stage One, right after the FSE goes to the hospital.

If the Field Service Engineer wasn't already on the scene in Stage Zero, most FSEs must go through a period of disbelief about the severity or existence of the problem that you are reporting before serious work on the problem begins. This hesitant behavior is usually characterized by the FSE walking into the machine room, observing the flames coming out of the system cabinets and saying: "AH, HA! This looks like a software problem."

Even if the FSE was on-site and the machine worked perfectly before they started doing routine maintenance on it, the FSE still may accuse you of running an operating system that has been "patched" or "customized". Anything beyond setting the local time-zone may be considered to be "customized", even if you replaced broken application executables with ones from earlier vendor-provided versions that do work. You are generally doomed if you are running NIDEC ("Not Invented by DEC") software. In the 1980s, you would expect questions such as, "Are you trying to run UNIX or something?" In the late 1990s, it is "Are you trying to run FreeBSD/NetBSD or something?"

If you are talking about problems with a photocopier, the question would be something like "You aren't trying to make double-sided copies, are you?"

Although the FSE is now at your site, he/she may leave at any moment, causing you to return to Stage Zero.[8] You will advance firmly into Stage One if any of the following occur:

1. Something obviously wrong was detected using the diagnostics. (Being unable to even start the diagnostics will always get you to stick in Stage One.)
2. The system has crashed a number of times greater than the secret quantity that forces a Stage One. Because this number is so secret, the FSEs aren't always up to date with what the secret value is today and so the FSE may not think things are as bad as DDC decided they are.
3. The Fire Department has just allowed you to re-enter the building after extinguishing the "software problem" by hosing down the memory cabinet (M-Box), even though the fire was in the processor cabinet (E-Box).

Stage One typically directs the FSE to change any boards that the diagnostics indicate are causing the problem. If the diagnostics won't even run, this step is either skipped or the FSE swaps whatever boards he happens to have in the processor (aka "KL") spare case.[9]

If the diagnostics then run without incident and the operating system will get as far as asking if it is okay to run CHECKD (an incredibly slow fsck[10]), the FSE may consider the problem solved and may leave. Depending on the number of times they have been called to look at the same problem in recent days, the FSE may hang around until the system gets as far as asking for the current date or starts the network interfaces before leaving. The goal here is apparently to be out of the area prior to the "login" prompt appearing, or more likely, not appearing.

If the diagnostics now find a problem, the indicated board is replaced with one from "the spares kit". Hopefully the FSE brought the right spares kit with him. If not, you will experience a delay in getting a replacement component, which he may have to get, or it will be delivered by the tag-team FSE. (Some creative FSEs will, in this case, replace some other unrelated cards that they do have, just in case the diagnostic is mistaken. This kills time - and possibly your machine - nicely.)

After inserting the replacement part, make sure that the FSE re-runs the diagnostics before he leaves to make sure that:

A.
The replacement card cured the problem or made the problem move elsewhere (symptom changes).

B.
The replacement card isn't worse than the one that was in the machine in the first place.

A spooky thing that happens here is that most FS organizations seem to have a policy that a board will be tagged as "bad" only if it has a solid failure which "follows" the board. If a board can be moved to a different machine or different slot in the machine and the problem goes away, or the problem is intermittent, the board will be replaced, but may not be tagged as being defective. This board now ends up in the spares kit. Remember this board; it may come back, or worse, you might get someone else's headache-board.

Once the problem appears to go away, you fall back to Stage Zero, but the failure counter is incremented.

If, when you arrived at Stage One, the FSE ran diagnostics and they ran successfully and did not complain about anything, the FSE usually pulls and re-seats all the cards in the system. Some FSEs then haul out the pencil eraser and clean the connectors when practical, which seems to always be extremely fatal to the cards that get erased. This type of activity seems to always help advance you to Stage Two.

The problem for you at this point in Stage One is figuring out what is going on. The FSE's usually won't tell you what they are up to, so you have to watch for tell-tale signs.

You are quickly advancing toward Stage Two if:

1.

The system diagnostics display a warning like:


	WARNING:  ONLY DIAGNOSTIC TESTS 1 THROUGH 64 CAN BE RUN
		  WITH MAIN MEMORY DISCONNECTED!

This is always a dead giveaway, particularly if you only called the FSE in to fix a tape drive.

2.

The diagnostics display messages on the console like:


	TEST #1, KL LOOP TEST
	#KL LOOP FAILURz
	@@@@@@@@@@@@@@@@@@@...

and the '@' characters continue to print for several pages of paper until someone stops it.

3.

There are absolutely no signs of life in the system at all, and even after some parts are swapped, or possibly *because* of the part swapping, none of the diagnostics will even start.

4.

The FSE's ask if you have any extra copies of the front-end diagnostic or configuration diskettes. Do not anger the FSE by asking "Why? What's wrong with the site copy?"

5.

The FSE asks if you happen to remember the password on the field-service pack, or if you happened to have duplicated that pack recently.[11]

6.

The FSEs ask you where the on-site microfiche is that they left at your site the last time there was significant trouble. The only worse sign is when they actually go through the fiche and merge in the Update and Correction 'fiche and discard the old sheets. This activity is usually given priority close to that of sorting your sock drawer. You may safely assume that you have serious problems if you see this activity. For more modern support organizations, they may ask if there is a web browser on a machine that is working that they can use, but the FSE may also have to make some calls first to find out what the support URLs are this week.

7.

The FSE's ask if you have a 'scope, or worse, a soldering iron. FSE's normally only do board swapping, and only a few of them can use a soldering iron without inflicting serious injury on themselves or your computer.

Notes on Stage One

[6]

This is an important and little-understood part of FS organizations. A considerable percentage of open tickets are resolved simply by wearing out the customer, who simply gives up, or is forced to turn attention to other issues at some point, allowing the FS organization to close the ticket with some log entry like "Issue resolved, no further complaints, feedback or signs of breathing from customer." To speed this date, expect a regular pattern of requests for more information, such as the massive SYSCHECK that DEC has you run every few days on the off-chance that your system configuration has changed, apart from you having to add disk storage to hold all the SYSCHECK output logs. Any lack of promptness in returning this material may be considered to be a sign that the issue no longer exists and that you are happy, when in fact you are actually occupied down in the local bankruptcy court.

[7]

Although Preventive Maintenance used to mainly involve cleaning the air filters, and perhaps running some memory diagnostics, sometimes they also used to apply mandatory field changes to your hardware. That sounds great until you find that the hardware change now renders the OS you have incapable of booting, until you upgrade the OS as well. Scheduling backups just before PM isn't such a bad idea.

[8]

An interesting piece of trivia should be mentioned at this point: You have no doubt seen these data processing facilities with extra-heavy security measures like guard stations that make you sign in and out, card entry, double-door man-traps, cameras everywhere, etc. You no doubt have always assumed that all of this stuff was to keep terrorists, Rush Limbaugh, Barney the Purple object and other unauthorized things from getting into your computer room, and this reason is often given to company auditors and accountants to justify the outrageous cost for all of the cameras, guards and guns.

Experienced Data Processing and Information Systems facility personnel know that the real reason for all of this security is to keep the FSE's from managing to leave the site before the systems can be completely restarted, at which point you might notice that only 512K of your multi-Megabytes of main memory are still visible, and only one CPU is still responding.

The longer it takes to bring the computer systems up to the point where work can be done on them, the more security measures the facility that houses the computers will have. Think about it.

[9]

For you non-DECies, the KL is the last of the "big" 36 bit ECL processors in the PDP-10 line to make it out of the lab. (There was one called "Jupiter" but it had an accident. The KS-10 did come out a bit later, but it was a lower-performance unit.)

[10]

For you non-DECies, CHECKD is this file-system checker that takes a huge amount of time, even if nothing is wrong. CHECKD used to check about 8 megabytes of disk data a minute, so a dual-RP06 public structure (which contains about 400meg total) took about 45 minutes to check, if it didn't have a lot of data stored on it. So it could easily take 50 minutes or more before people could use the computer to find out if the system was working normally again. TOPS-20 would ask if you wanted to run CHECKD with a message like


     Run CHECKD? No
     %XYZZY Warning - Replying 'No' is equivalent to slitting
	    your wrists with a tape leader trimming tool.
     Run CHECKD? Yes

So everybody always ran CHECKD.
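
For the curious, the arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes CHECKD plods along at a constant 8 megabytes a minute and scans the whole structure, neither of which TOPS-20 ever promised.

```python
def checkd_minutes(structure_mb, rate_mb_per_min=8.0):
    """Estimate CHECKD run time, assuming a constant scan rate (an assumption)."""
    return structure_mb / rate_mb_per_min


# A dual-RP06 public structure of about 400 megabytes, scanned in full:
print(checkd_minutes(400))  # prints 50.0
```

That full-scan figure is the "50 minutes or more" case; the 45-minute figure is what you got when the packs did not have a lot of data on them.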

[11]

The field-service disk pack is a removable hard disk (or a compact disc) that usually belongs to the computer maker and is kept at sites that have enough problems or a higher level of service agreement. It usually contains all diagnostics for the machine in question, a burn-in or endurance test directory, and copies of LISP-HAUNT, MDL-ZORK, 10TREK, ADVENTURE, PERL, DOOM and EMACS (all for system stressing/loading purposes, of course.) This pack or disc is the proprietary property of the field service organization and the site management are usually forbidden from even gazing at this object any longer than necessary.

Stage Two
District Service goes on alert

or "Call for back-up, Dano"

or Curly: "Hey Moe! This pipe is full of wires!"
Moe: "That's a fine place for putting wires. Well, rip 'em out!"[12]

Stage Two occurs when the first FSE has been unable to fix the problem after a given amount of time, usually about 6 hours, or (more likely) the problem has grown considerably in scope.

Here additional forces arrive from the local office. Sometimes it is simply one other FSE bringing more spare boards, or he is sent to relieve the first FSE. More often than not, this arrival allows one of the FSEs to keep you and your staff distracted while the other FSE tries to retrieve his wrist-watch from the system backplane[13] without being spotted by you.

The second FSE reviews the situation as briefed by the first FSE, looking for anything obvious that would correct the problem, such as turning *all* of the power circuit breakers back on, putting the right cards in the right slots, using Beta instead of VHS, adding toner, etc. This process frequently catches really stupid errors but you will always be given an extremely complex (and bogus) explanation of what the problem was that has now been fixed. Note that there is no guarantee that the second FSE is more senior or knowledgeable than the first, but sometimes that doesn't matter.[14]

If both FSEs are unable to make headway, the local office may also send the FSE who has the most experience with this particular system or this type of problem, assuming he isn't at your site already. You can usually tell when this FSE appears, as he has a complete set of spare parts in his car, and possibly his own microfiche viewer or laptop computer. (Unlike all others, this FSE will replace burnt-out indicator light bulbs without you having to open a ticket.)

This "senior" FSE will usually get the other FSEs on site to go do something else (like buy food) and while they are gone, he will then try to assess the situation, both by looking at the diagnostics, and by talking to you. This allows him to determine how many of the current problems were there when work began. (Note that in some organizations, the arrival of the Senior FSE is the start of Stage Three.)

Depending on the maintenance agreement you have with the service organization, in Stage Two the FSEs may hang around until whatever time your shop normally closes for the night, or stay on until you pass out, at which point they will sneak-out anyway and may even come back in the morning.

Most problems are resolved in the latter phases of Stage Two, so there aren't a lot of other interesting things to say here. Getting to Stage Three is mainly a function of time, although a really spectacular event, such as any of these headlines in the campus newspaper, will get you to Stage Three faster:

"FSE taken over by stranger-than-usual aliens! Biology department impressed!",

"Walls bleed in campus computer center - OS Upgrade identified as cause", or

"Trouble ticket open for 28 months causes vendor's bug tracking system to form black hole! Reports say damage worse than what was predicted for Y2K problem!"[15]

If the system starts working while at Stage Two, you return to Stage Zero, although the secret failure counter doesn't return to zero. It does go down a little for every day that the system keeps working, and reaches zero after about a week.
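
The secret counter can be modeled, very loosely, as a little Python class. Everything specific here is an assumption made for illustration: the class and method names, the default threshold of 5 (the real number is secret, remember), and the straight-line decay over seven quiet days.

```python
class FailureCounter:
    """Toy model of the secret crash counter; all the numbers are guesses."""

    def __init__(self, threshold=5, decay_days=7):
        self.threshold = threshold    # hypothetical "secret quantity"
        self.decay_days = decay_days  # "reaches zero after about a week"
        self.peak = 0.0               # counter value right after the last crash
        self.quiet_days = 0           # crash-free days since then

    @property
    def count(self):
        # Straight-line decay from the last peak down to zero (an assumption).
        remaining = max(0, self.decay_days - self.quiet_days)
        return self.peak * remaining / self.decay_days

    def crash(self):
        # Each crash bumps whatever is left of the counter by one.
        self.peak = self.count + 1
        self.quiet_days = 0
        return "Stage One" if self.peak > self.threshold else "Stage Zero"

    def quiet_day(self):
        # One more day without a crash; the counter goes down a little.
        self.quiet_days += 1
        return self.count


counter = FailureCounter(threshold=2)
for _ in range(3):
    stage = counter.crash()
print(stage)          # prints Stage One
for _ in range(7):
    counter.quiet_day()
print(counter.count)  # prints 0.0
```

Whether the real counter decays in a straight line, or at all, is of course something you will not be told.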

If after two more days you still have FSEs (possibly four of them lurk around the room by now), you go to Stage Three.[16]

Notes on Stage Two

[12]

The Three Stooges remain one of the best-documented sets of Field Service Engineers ever recorded in action. Their films show how to handle repairs on electrical wiring, plumbing, vehicles and dozens of other service activities. These training shorts may still be shown in "Introduction to Field Service 101" courses.

[13]

Actually it was a metal chain, not a wristwatch. If you recall the TV series "Batman", they had signs hanging over all of the things in the "Bat Cave", so that people who were not familiar with the objects would know what they were. So there were signs that said things like "BAT COMPUTER", "BAT GENERATOR", "BAT GUANO", etc. These signs were clearly unnecessary to Batman and company who supposedly knew what all that stuff was. Here is the secret: The signs were really there for the "Bat Tours".

Well, like most universities, someone in the hierarchy decided that we just had to have signs explaining what the various cabinets in the computer room were ("AIR CONDITIONER", "DISK DRIVE", "BOX THAT BUZZES LOUDLY AND DOESN'T LIKE WATER", etc), so that when a tour occurred, the signs would be there and the visitors could try to read the signs as they flapped wildly around in the forced-air of the computer room. They also helped the director of the facility correctly identify the object he was pointing at while giving tours, since he didn't go in the computer room very often. So on a typical tour, you might hear an exchange like this:

Director: "And this is the IBM 370/155, which cost the school over two million dollars, money that could have been spent on the football program. Yes, a question in back?"
Guest: "Yes, are you sure that is the '370? It really looks more like a swivel chair with a green colored stain."
Second Guest: "Perhaps the '370 is the large blue box with all the flashing lights behind the chair?"
Director: "No, these signs are very accurate and thousands of dollars were spent to make and install them, money that could have been spent on the new stadium."

In our case, some signs would flap around so wildly that they would come loose from the ceiling and fall to the floor. Eventually, someone would pick these lost signs up and simply lay them on top of the appropriate box. Months later, probably just minutes before the next start-of-semester tours, some knowledgeable soul would get a ladder and reinstall the signs in the appropriate locations, more or less. In the meantime, the signs quietly rested on top of the computers, waiting for their chance.

Which brings us to our FSE, the victim of this story. Called to correct a minor problem with eight of the 64 serial ports on a DECsystem-20, he has the PDP-11 cabinet on the DEC-20 extended out of the machine to measure some voltages and reaches on top of the computer for a screwdriver he placed there earlier. Instead of picking the screwdriver up, he drags it across the top, dragging the chain and the "LARGE ORANGE BOX WITH NO FLASHING LIGHTS WHICH MAKES A BORING TOUR STOP" sign along with it, and the chain and sign fall neatly into the card cage, in full view of our staff.

The FSE, thinking it would be better to get that metal chain out of there, particularly since the computer was on, pulls the chain out and puts it back on top and continues with the measurements. After a few minutes he returns to the console area, puzzled that he can't find the signal he was looking for on ANY board. And for some reason, now the console doesn't work any more. This person had clearly lost the gift of cause and effect analysis.

We nearly went to Stage Three to resolve this one, despite witnesses repeatedly telling the additional FSEs exactly what happened. Eventually, our service vendor did get a nice letter from our management saying that this particular FSE would not be let in the building ever again.

[14]

Occasionally, it is the novice-"Wesley Crusher"-type who shows up mainly to bring parts and ends up depressing the HALT button that saves the day, assuming the system can be saved at this point. When this "I just flipped this switch and everything works great now" stuff happens, it irritates the senior FSEs who have been on site for hours, and who probably will secretly dispose of Wesley's body in the unused space at the bottom of the DN20 cabinet or under the false floor. Check here for odd smells.

[15]

Recent evidence shows that if a trouble ticket stays open for an embarrassingly-long amount of time (like 14 months), the vendor might "accidentally" close it, and immediately open a new ticket for the same problem, resetting the response-time clock, at least in their minds. This is similar to airlines that pull away from the gate on time, and then park the plane on some unused part of the airport for two hours before actually taking off because of bad weather somewhere else. Officially, the plane did leave on time.

[16]

Note that if at some point, a crash is determined to be, or strongly suspected of being, a software problem, all support clocks freeze. About the only way to go to the next level of support is to start reporting the problem and the misdeeds of the service organization on the Internet. At minimum, this gets you a conference call with a lot of people "committed to your problem" but nothing happens for months, and at best, the FSE structure advances to Stage Three.

Stage Three
Your problems go Regional!

or Note one of the words in 'Regional Playoffs' is 'off'.

Congratulations, you have reached Stage Three. You did this by keeping FSEs at your site over three or more days (two days if they were there more than 14 hours a day), or whining about the situation in public forums on the Internet with samples and embarrassing photos. Your management can speed or slow the arrival of Stage Three support depending on how many calls are made to the computer vendor and how threatening he/she/it can sound:

Educational Non-Threatening: "We have 25,000 students that are unable to complete their projects and will get failing grades and then beat on our cars or use the skills they learned in chemistry classes against us if you don't fix this computer."
Educational Threatening: "I own a rocket launcher and am coming to your house NOW if you don't fix this computer."
Business Non-Threatening: "We have over 50,000 customers who can't download cooking recipes off the Internet because of this problem."
Business Threatening: "We are going to tell 50,000 of our customers - all of whom own rocket launchers - where you are if you don't fix this computer, as you are preventing them from downloading 'porn off the Internet fast enough."

To be at Stage Three, things are really messed up now and parts of your system that never bothered you before are probably malfunctioning. You might even be getting error messages regarding peripherals your system doesn't even have.

The failing system is also starting to look more like it did prior to being originally assembled, as more and more loose parts litter the area.

The Regional FSEs, usually out of a major city like Chicago, Creede, Houston, or Twin Peaks[18], arrive to at least stabilize the situation, and hopefully, get the system back to the level of functionality you had back at Stage Zero. If they fix it completely, that is a plus, but no longer the main goal.

The Stage Three patrol does a lot of rediscovering. This means that they ignore most or all of the information obtained in the earlier stages and must experience it for themselves[19]. If your system will run, you have to bring it up and let the users on, knowing that at any moment it will go down again, erasing the unsaved work of hundreds of students, co-workers or customers, who know where you are and possibly which car you drive.

You are told to not warn the users about what is going on because they would not use the system in the same way they normally do if they knew it might crash at any second, and this might cause the problem to not occur.

Note it is almost impossible to descend from Stage Three back to Stage Two. Even if it takes a day or more to fail, these guys usually hang around, along with your growing collection of FSEs from the earlier stages that have taken up residence.

Finally, the system crashes. Hurrah! Now the suit jackets come off and the neckties get caught in the printer drum. No, wait, that only happened once.

Stage Three personnel bring their own 'scopes and other strange test equipment, most of which appears to have been obtained from the set of the film "Frankenstein", which they may have wired into your machine before the demonstration crash. If they didn't do this earlier, they will wire it all up now and ask you to make the system crash again in the same way, as they really will be watching this time.

You may go to Stage Four if the system will not run with the test instruments attached, but this isn't a sure thing. You will definitely go to Stage Four if they disconnect the test equipment and now the system won't do anything at all. "They've taken Spock's brain!"[20]

After a crash occurs in front of the instruments, the FSEs will take action. Usually it's a phone call to someone at the corporate headquarters, where you may overhear them say stuff like "If it was human, it's dead. No wait, the 'scope probe came loose." (Now speaking to you) "Uh, can you make it crash again?"[21]

The people from Stage Three usually have special diagnostics that the local office never gets, or didn't know about, or left at another local site, or that are at this moment stuffed in the dollar-changer in the Coke machine back at their office with a "Thieving @*%!! Machine" note written on it. Anyway, these newly-used diagnostics provide a new wealth of information to examine, but invariably result in more phone calls back to Mr. Peabody[22], who is the only person in the world who knows what the diagnostics are trying to report.

The chant of Stage Three becomes "Okay, we have a recorded failure. We can rebuild the system. We have the technology. We have the spares kits. We've got lots of YOUR spare time. Let's do it!"[23]

Stage Three also starts a more methodical replacement of components on a scale not attempted at earlier stages. Let us say that we have what appears to be a hard disk problem. Here is a typical replacement checklist: (You will be asked to reboot and let users use the system between each change to see if the problem is really fixed or see if it will crash again)

1.: Reformat media and restore entire system from backup tapes.
2.: Switch to different disk media and install system on new media.
3.: Re-align heads on drives, reformat media and restore entire system.[24]
4.: Replace hard disk servo and controller boards. This usually means reformatting and restoring the entire system.
5.: Replace cables between drives and host controller. Oh, and why not? Restore entire system from tape just in case. It only takes several hours.
6.: Replace all cards in tape drives in case they are somehow affecting disk drives located across the room. You might be able to talk them out of reformatting the hard disks on this one.
7.: Replace host disk channel controller.
8.: Call into question integrity of the backup tapes used for all previous restores. Restore entire system from two-month-old tapes. Expect some user annoyance when two months of work and mail disappear and later when all new work also disappears when you go back to newer tapes.
9.: Swap all memory cards to see if problem might really be a memory problem in main CPU.
10.: Embark on new wave of diagnostics to determine why the CPU can't find main memory any more.
11.: Experiment with microcode, EISA configuration and other firmware revisions to see if any combinations work better than others, including combinations not recommended by the software side of the organization.
12.: Replace CPU or if multiple CPUs, replace them one at a time, or swap them around from one slot to another.
13.: Replace backplane (if possible in the field). Also consider replacing all interconnect cabling in CPU cabinet.
14.: If any error messages about nonexistent peripherals are seen, ask customer to purchase missing peripherals to see if that helps. For best results and most reliable operation, your system might really need to have a punch card reader.[25]

No matter what the problem is, there are just over a dozen steps worth of work in Stage Three before the Stage Three timer expires. If the problem isn't getting any worse, you get about a week at Stage Three. If the system is degenerating and less of it works by the hour, Stage Four can arrive in as few as four days.

You really can't accelerate movement to Stage Four on your own.[26] Reporting the problems and misdeeds of the FSEs for all to see on the Internet only works once and doesn't get you beyond Stage Three no matter when you use that ploy.

Notes on Stage Three

[18]: The locations selected by field service organizations for their "Regional" facilities always seem to be in places where there are no direct flights between their location and yours, forcing these people or their part inventories to travel by goat track to reach your site, arriving some days later.
[19]: Your input isn't really desired at this point either. They pretty much want to experience the entire thing themselves, even if the carnage is well-documented. The chalk outline on the floor of an FSE from an earlier stage of this problem ticket who dropped a wrench into the power distribution system isn't good enough evidence; one of the Stage Three people must try doing the same thing to see if it really is a bad thing to do.
[20]: A reference to an original Star Trek series episode, "Spock's Brain", considered one of the worst Star Trek episodes ever, even worse than the ones where William Shatner makes computers commit suicide by talking to them. At this point in the support structure, you will discover that sometimes the FSEs don't even need to talk to your computer to make it kill itself - just being in the room is deadly enough.
[21]: FSEs clearly have portable personal defensive shields, or else they all would be dead now from asking questions like this. There are far too many heavy things available in the typical computer room that can be used to commit a mischief.
[22]: That's Mr. Peabody from "The Rocky and Bullwinkle Show", who knows exactly what all the diagnostics are saying, is the only person who does, and probably wrote the diagnostics as well in some vendor-unique language, like BLISS. In addition, on one of his trips to historic earth in his Way-Back Machine, Mr. Peabody brought back some Neanderthals, solely because they happened to exactly fit the company FSE uniforms that had already been bought and the company didn't want to pay to have them altered. It appears that these FSEs deal mainly with disk drive alignments and other delicate adjustments on your equipment. "No problem! Mongo got pipe wrench!"
(For those of you cheating at Trivial Pursuit, the answer is "his boy Sherman".)
[23]: Unlike the television show, "The Six Million Dollar Man", all your troubles will not be solved in 48 minutes plus commercial breaks, but your costs could be similar.
[24]: If you don't have removable hard disk media, this may sound confusing. Simply substitute "replace hard disk assembly" with "replace drive with one that has a different head/cylinder geometry and a different block count" for equal levels of grief.
[25]: You think I am making this one up. Wrong. At one site, the FSEs actually brought in a piece of equipment unlike any we had, and connected it up to see if that would make a problem go away, or at least cause the system to quit whining about this optional hardware not being present. Of course, it is never any piece of hardware you would actually want or could use.
[26]: Stuffing the Stage Three FSEs under the false floor or into an expansion cabinet does not count towards reaching Stage Four. By this time, you should have six or seven FSEs in your computer room, taking on the look of the state room scene from the Marx Brothers movie, "A Night at the Opera". Somehow, it isn't as funny being there.

Stage Four
Your problems go National!

or "We've come to sell you the new model or service package[27]
and to tell you that we really don't support the configuration you
are having the trouble with[28]. What, does that position make
you angry? Okay, Okay! Put down the chair and we'll keep
working on your issue."

Okay, so Stage Three didn't work out so well. Don't worry, it's probably something obvious, like the planetary alignment of your computer room.

Stage Four personnel usually come from the corporate headquarters and review the things that Stage Three replaced, and will probably replace a few of them again, but in a different order.

Stage Four specializes in replacing things that appear to have (and usually have) nothing to do with the problem whatsoever and doesn't seem to have the same goal as any of the previous stages. Changing things for the sake of change seems to be part of the art of problem resolution in Stage Four. Science and logical process were killed and swapped-out in Stage Three. So for your hard disk problem, expect the ribbon on your printers to be replaced, tape drive heads to be cleaned and calibrated, and to have most of the false floor ripped up for days. Part by part, the system will be replaced. You need plenty of room for all the replaced parts and cables that will start to accumulate around the area.[29]

Tip: You might want to write down what your system's configuration was when you started all of this so that you might be able to get back to that arrangement, or at least so that you can get all your parts back. Systems have been known to have been "fixed" by disconnecting the offending hardware or turning off the alarms from the hardware that are trying to warn you of data corruption because that hardware really is broken.

Stage Four also has a special squad that deals in "blame assignment"[30], looking for anything at all that might be external to the equipment listed in the service agreement that might be the cause of the problem, at least in some envisioned parallel dimension where our physical laws of nature do not apply. They will look for current leakage from the false floor to the computer cabinet, ignoring comments that the rubber wheels on the computer probably take care of insulating the system against any millivolt of differential they happen to locate across a 100 foot computer room floor.

Then come the temperature and humidity monitors, and the stern recommendations to change both settings in ways that end up making it rain in your computer room. Of course once that happens, the Stage Four blame squad can point out that it may have rained in there previously and that this might have been the cause of the previous failures. This is your opportunity to respond by pointing out that, until now, your computer room has never been visited by beings from the planet Cretin, so prior rain-making activity was unlikely.[31] Despite your assurances of no prior moronic or paranormal activity, they may hang doggedly onto this discovery even if the rain causes a completely different problem, like shocking the stuffing out of one of the lower-tier FSEs, still hiding behind the system.

Finally, you will see them bring up the Power Disturbance Monitor(TM), and gradually you will be unplugging things all over the building to prove that they are not causing the problem.[32] Oh, you can leave the coffee maker plugged-in, as that is a priority piece of equipment for the FSEs.

Another thing that may happen at Stage Four is the recovery of old parts. You remember all the boards swapped-out by all the previous FSEs? Stage Four has been known to conclude that this system only works well with a particular vintage of cards, and will try to retrieve them and put them all back in your system, even if some have since made it into machines across town. Due to the lack of tagging boards with solid failures, this ends up being an opportunity for more random cards from the ten or twelve spares kits lining the room to find their way into your machine. As you might guess, the chances of things not getting even worse after doing this are very small.

Stage Four will last as long as the service company can possibly stand it. Only when the FSEs are down to two flat-head screws and the Emergency-Off switch as the only things that haven't been replaced or fiddled-with, will you possibly move on to Stage Five on the FSE timetable. I say possibly, because they may decide the problem is "hard" and toss the problem back to the software group for a couple of months at this juncture.[33]

The threat of legal action, or long-range nuclear weapons directed at high levels of the company can cause Stage Five to appear before the end of time, but you might have to start walking towards the court house with the lawsuit papers as well as the launch codes in hand before anything happens.

Notes on Stage Four

[27]

Also known as "Bad Timing". This is where the sales and marketing organization arrives during (and usually oblivious to) your crisis to try to sell your firm some higher level of support or new hardware. Now, someone on your side of the table mentions that we would really like to see the vendor solve the problems with the hardware we have now before we buy any new products. The vendor representatives always seem ready for a response like this, smoothly replying by saying something like "Well, I am certain the issue is just lost in the system or is waiting on more input from your staff, but if you would like, you can call me next week with the open ticket number and I'll be happy to look into it." This is your cue to have one of your people speak up and say, "the ticket number is C970207-3290", quoting from memory, and also point out it has been open over a year. This revelation basically kills the tempo of the presentation, even before the PowerPoint demo crashes of its own accord. Tip: Wait until after they pay for lunch before torpedoing the show.

[28]

Related, sometimes the vendor will "address" the problem by claiming your hardware configuration isn't a valid one, even after you point out that the unit was shipped from the factory that way, and it is even a configuration listed in the catalog. Now the statement may change to "Well, it isn't supported now, because we found people might have used it stupidly", even if what you are trying to do isn't the slightest bit stupid.

[29]

Any attempts by the FSEs to track what parts are on site are abandoned by Stage Three, and even the cards that end up in your system when the problem is fixed won't match any of the serial numbers the FS organization has on file for your system, which can cause additional problems the next time you need maintenance.

At the end of the service visits, any survivors in the Stage Zero or Stage One personnel will be given the task to cart six carloads of stuff back to the local depot. All the other FSEs will escape back into whatever dimension that they came from rather than participate in this task.

[30]

Shortly after the arrival, you will be able to identify the member of the secret "Blame Squad", as he/she will wander into other areas of your computer room, looking to see what brands of equipment you have from other vendors, devices that can be later blamed as possible sources of the problems. You can usually tell the less-experienced "blame assigners", who will eventually say stuff like, "This is probably all caused by that General Electric XG47131480D you have over there." After looking for this device you have never heard of, you respond with, "Uh, that's the computer room power transformer." Experienced "blame assigners" usually are better at picking their blame targets.

[31]

Always be concerned when your computer repairman starts altering settings on the air handlers for the computer room. Use duct tape to stop this activity. What you use the duct tape on (air handler or FSE) is left to your discretion.

[32]

In a datacenter with two mainframes, the FSEs seriously wanted us to shut down the other system to see if their system was affected in any way by the presence of the other computer. Since they refused to let go of this theory and would not try anything else until this was checked, we did as they asked, annoying thousands of students. As fate would have it, the failing machine now failed even faster, but then, they were changing several things at once.

[33]

In the case I am thinking of, ticket A has been open for a while and the vendor finally comes up with a fix, but the fix requires the customer to upgrade to operating system release B, and then apply corrections to that in the form of a patch kit C. Only then can fix D be added to that B+C system. The customer installs OS B and immediately starts getting machine checks. On the chance that the patch kit also solves the problem, patch kit C is also applied since that was planned anyway. The crashes continue. The release notes for OS upgrade B report that it fixes a problem exactly like the one the system started having, and the notes go on to say that despite the evidence, this isn't a software problem (uh, yeah), and that the failure only happens if you have the hardware revision E which has some sort of flaw. Our system had a newer hardware revision F, which supposedly did not have the hardware problem.

Based on that information, what do you as the computer vendor do? Why, you recall Patch Kit C and have the customer remove it, replace the customer's revision F hardware with revision F (yes, same version) hardware, and guess what? The system still crashes. Now the vendor starts talking about replacing memory cards, CPU cards and the system backplane and other random hardware, and eventually replaces all of these things (some repeatedly) with no improvement. Eventually, they replaced the backplane and entire metal frame, thinking that they had swapped everything else. This didn't fix the problem.

Now, you might think that maybe, just maybe, the problem might have something to do with the software that was changed at the point when the system started crashing (installing upgrade B), but no, that was one of the last things tried. Meanwhile, the original goal of trying to apply the fix to problem A was on hold for months. (An FDDI network card completely unrelated to what the error messages were reporting was eventually found to be the true culprit.) This is an unusual case, since the hardware people are usually keen to lob this type of problem back over the net into the OS group's hands.

Stage Five
They let the guru out of the box!

or "You can have what's behind the curtain!"

Stage Five. Wow. It almost never happens. Until recently, I thought it wasn't possible at DEC now that Ken has retired (who has the key to the box?). Being at Stage Five, you have arrived at the pinnacle of Field Support. There are only two choices left to the equipment maker: One, to roll in a complete replacement system and let the existing system be "accidentally" exported to a forbidden country, or Two, let one of the gurus out of the box, who might be able to identify the real problem and get it fixed.[34]

Neither choice is popular with the vendor. The first costs a lot of money, and sometimes those pesky border patrol or customs people catch the faulty system being exported to North Korea or wherever. The second option might allow the guru to be exposed to real life, something that could be far worse. He might find out that Nixon resigned, that nobody makes slide rules any more, or that IBM finally built a computer with a stack pointer.[35] Plus, the guru might still recommend replacing the system and the vendor will still have to try to smuggle your broken system out of the country in order to dispose of it.

In one of the cases where I witnessed a Stage Five escalation, the system had been putting out a BUGHLT: SWPUPT message for six weeks and crashing each time. The message claims that the hard disk driver or paging code (or both), which should never be swapped to disk, had been swapped. This is bad. However, at least there was code in the operating system that said "I did a bad thing" when it happened.[36]

In response to this crash that was increasing in frequency daily, the FSEs had gradually replaced nearly every component of the mainframe, all the cables to the hard disk drives, most internal cabling, all boards in the drives, realigned the heads on the drives several times, reformatted the packs and had us reinstall the operating system more than a dozen times. They examined the raised flooring for electrical current loops, looked for uneven cooling, and at various times blamed the failures on the telephone lines the modems were connected to, the console DECwriter, the printers and tape drives, even the proximity of an IBM 370/155 (hmm, maybe...), but one by one these were eliminated, except for the IBM. I think we erected a curtain or something as a joke so the DEC-20 could not "see" the IBM. It didn't help.

After six weeks, there were only a few sheets of metal casing and the wheels that hadn't been replaced on the entire system and the problem persisted. We had also gone through periods of having a completely useless system as working parts were replaced with broken parts in such large quantities that no one could recall what had been changed last.

So, just when they were about to break down and replace the entire system, they decided to let a guru out of the box in Maynard.

For the arrival, nearly all of the non-Stage Four FSEs were asked to leave the area.[37] The guru walks in and looks at the OS logs of the crashes, which had about as much debug information as your typical MS-DOS "Null-pointer assignment" error message. He completely ignores the reams of output from the diagnostics run over the previous six weeks. After ten minutes or so, he proclaims that there is some dirt in a glide track in the head assembly of a particular hard disk drive. Then he leaves for the airport and goes back to Maynard. Maybe he asked the taxi to wait.

The entire FSE garrison echoes "WHAT?" This was impossible. Sure, they hadn't taken this particular assembly apart (it wasn't easy), but why would it cause that error message instead of a disk read or write error message?[38] And how could a U-shaped piece of metal cause all this? And why didn't the diagnostics ever fail or report a problem in this area?[39]

They were almost ready to ignore this advice and roll the replacement system in when someone decided to spend the hour cleaning this track. It worked. The system, now with several hundred stripped bolts and worn connectors, ran fine, even when we started four copies of HAUNT at the same time, the ultimate system stress test.[40]

So we didn't get a new computer, but the problem was solved, and a considerable percentage of the computer science college got a grade of Incomplete for the semester. The system behaved itself just fine until it rained in the computer room (real rain, not the sprinkler or air conditioning system), but that's another story.

So there you have it, the Six Stages of Field Service Support.

Notes on Stage Five

[34]

The above description has dealt with issues of a hardware nature and their escalation. I have also seen a case of reaching Stage Five on the software side of a computer company. In this case, after complaining about the same software problem for 18 months, the computer company sent a letter saying that they didn't have any plans to fix the problem in question, or even research it seriously, despite prior claims of doing work to resolve the problem over the previous year and a half. Previous requests for the source code for the affected modules so that we could fix it ourselves were always ignored without comment.

My company immediately responded to the letter by announcing to this vendor that we would never buy anything from them ever again and would dispose of all our existing equipment at the first opportunity, possibly utilizing our 30th floor perch in some spectacle that would be televised: "Tonight on Jerry Springer! When computers go bad!" or "The Late Show Computer Toss!"

The computer vendor (who had previously been saying that they were working on the problem on and off for the previous 18 months, had come up with a patch a year earlier that wouldn't even boot, but mostly used a series of conference calls to keep us informed on the lack of progress) now suddenly decides that we might be serious about having this problem fixed and to listen to why we would not abandon the software configuration we preferred for one that they supported "better". (I point out here that most positive values are greater than zero.) We finally forced this issue by asking fellow users of this type of system on the Internet for their opinion of the two filesystem methods in question, the one with the bug and the vendor's favorite. With one exception, everybody who responded was using the filesystem with the bug that the computer vendor was effectively abandoning. Some sites had tried the vendor's sweetheart system for a few days and then switched back, bugs and all.

Well, apparently the computer vendor saw this discussion on the Internet and ran some of their own tests and found out that their customers might actually have a point about the performance difference of the two systems and that their little darling was in fact a performance pig. Suddenly, the orphaned filesystem software was being looked at and bugs and stupid inefficiencies were being found all over the place in the operating system, at least according to the now-resumed weekly conference calls.

At one point, to show positive activity, the vendor let a software guru out of the box to visit us, along with someone who can be best described as part diplomat and part damage-control officer. Of course, to make sure we wouldn't actually solve the problem while they were at the customer's site or honor our previous requests for source code to let us fix it ourselves, they brought paper listings of the source code in question, which we never got to see.

Of course, having been in a similar position in a former job, I can say the reality is that the software guru can do little apart from observe during the visit, since doing code development/correction while surrounded by a bunch of pacing and chattering people is tough, and you probably don't have any of your normal development tools (plus nothing but a paper listing, but that was self-inflicted). Therefore, the secret goal is to make sure that you know how to make the failure occur predictably and get back to the lab where you can actually work on the problem and not have to work in some corridor that the customer sticks you in that the customer calls "a work space".

As you might expect, letting the software guru out of the box in a software-related Stage Five doesn't mean the problem gets fixed right away, or even near-term. That is only the first phase of the Software Stage Five, which can drag on for months. In my case, after the open problem report passed its 55th month mark and 26 months since the most recent "get serious about fixing it" event, we just stopped using that computer and moved to an operating system and hardware that didn't have that issue.

[35]

I ran into similar people when I visited AT&T's Murray Hill facility in 1991, but the person I talked to there thought Andropov would release his grip on the Soviet people in a few years. Of course, Andropov had died several years earlier and the Soviet Union barely existed in 1991.

[36]

This opens an even more interesting question. Why do you put a message in the system that reports the equivalent of:

"Hi there!! I just did something stupid and committed suicide! I will crash in a few moments and lose everybody's work. In the meantime, here is some music to listen to!" BEEP BEEP BEEP ... (DEC-20s would send out nine BEEPs in a pattern that sounded almost exactly like the Morse code S-O-S.)

Also, why do you document this type of event which will cause your customers to ask these types of pointed questions?

[37]

Please refer to the "Monty Python" "New Gas Cooker Sketch", where a seemingly endless supply of gas company employees show up to connect up a single gas stove but are all unable to do so because of various regulations or lack of paperwork. This is the scene in the computer room all through Stage Five. Reducing the number of people in the room is a good plan, but at the same time it would be nice if these people could learn something from the guru instead of waiting in the hall.

[38]

You know you are getting close to the truth of the situation when the FSEs themselves start asking the types of questions you normally ask to embarrass them, stuff like "won't it work better plugged-in?" or "You do know that the shelf is installed upside-down in this unit, right? Turning the cards over might make them go in without pushing so hard."

[39]

Important Axiom:

Diagnostics only find known problems.

[40]

At the first hint of a solved problem in Stage Five, the doors on all the equipment cabinets will slam shut, so no one has any idea what cards are where in the system, but that doesn't matter. If any major peripherals don't work now, you can always open new tickets and start at Stage Zero. The goal for the FSEs now is to back away slowly and not make any sudden moves that might cause the system to crash again. Even if the system is hanging by a thread, it probably won't break until the next time someone bumps into the cabinet or when the local FSE comes out for the next preventative maintenance.

As always, the FSE prime directive is that:

If you can't fix it right away, make sure whatever it is rapidly becomes Someone Else's Problem (SEP).

Note that Douglas Adams, the late science fiction writer, based an entire novel on the subject of what he called "SEP fields". He suggested that such fields could be used to "cloak" starships, planets and other objects from prying eyes, simply by making the object completely uninteresting and unimportant to any viewer, and thus essentially invisible. Such a scheme would accomplish "practical" invisibility at a fraction of the cost of all those systems that try to bend light and such. What Mr. Adams failed to discover in his research is that for years before he came along, FSEs had been employing small SEP fields to distract computer system owners from the real problems, both to save their respective firms money and to make sure the problems that can't be concealed escalate rapidly into someone else's jurisdiction, effectively making the problem invisible, at least to the original FSE.

