Monday, July 21, 2008

Don't blame the OS for hardware problems

Hardware problems are often the real cause of trouble that gets blamed on the OS. It's very easy to blame an OS, and it's fashionable to blame Microsoft, and Vista in particular, when something goes wrong. Many times the culprit is not the OS but an application or driver. That's a subject for another post. Today I'm going to talk about another problem that is often mistakenly blamed on a flaky OS: hardware errors.

Computers are very complex systems. A motherboard needs to be manufactured to very close tolerances. A minuscule stray bit of solder or a bad trace can change the capacitance of a circuit enough to cause seemingly random errors. PSUs (power supply units) are another cause of random hardware errors. In Windows these errors show up as random BSODs and lockups.

Testing the hardware with software running on that same hardware rarely finds problems like this. You would need very expensive equipment and the knowledge of how to interpret the results of testing with that equipment. Testing RAM runs into the same problem. I like memtest86+. The problem is that even if memtest86+ passes all the tests, even after several hours of testing, this doesn't mean the RAM is OK. Even a memtest86+ failure is only an indication that the RAM is bad. The actual problem may be the motherboard, CPU, or power supply.

Does this mean that running software that tests the hardware is useless? No, it doesn't. I use several different software tests when diagnosing computer problems. They can be very useful for narrowing down the problem. If a software test fails, you know you have a hardware problem and can be fairly sure which component is causing it. If a software test passes, you have a reasonable chance that there are no hardware problems related to the test, but you can't be sure.

I was recently working on a computer that illustrates this. It was a new computer running XP Home SP3, only a couple of weeks old. It was experiencing intermittent problems with Internet connections. Programs would quit with the infamous “This program has experienced a serious error”. The event logs had several seemingly unrelated errors. I tried changing the AV program, updating the BIOS, making sure all the latest drivers were installed, yada, yada, yada. Everything would be fine for a few days or even a week, then something different would happen. None of the errors were repeatable. I ran memtest86+ for six hours with no errors. I ran several hard drive testing programs with no errors. I swapped out the PSU. At this point many people would have said it's just the way Windows works, live with it. If Vista had been on the computer, I'm sure that's where many would have placed the blame.

Finally, I replaced the RAM, which had been tested many times for many hours. The computer has been running trouble free ever since. The old RAM is now in a different computer, which is also running trouble free. Who knows what the cause of the problem was? My guess is that mass producing things like motherboards and RAM to a price point means that corners are cut. The moral of this story? Diagnosing computer problems is as much an art as a science.
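
To put a rough number on why a clean six-hour test proves so little, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that an intermittent fault strikes at random at a constant average rate (a Poisson process) and that the test would catch it whenever it strikes. The function name `miss_probability` and the "one glitch every five days" figure are my own assumptions, loosely based on the story above, not measurements.

```python
import math

def miss_probability(test_hours, mean_hours_between_faults):
    """Chance that a test window contains zero faults, assuming faults
    arrive independently at a constant average rate (Poisson process)."""
    return math.exp(-test_hours / mean_hours_between_faults)

# Hypothetical figures: roughly one glitch every five days,
# compared against test runs of different lengths.
mean_gap_hours = 5 * 24

for hours in (6, 24, 72):
    p = miss_probability(hours, mean_gap_hours)
    print(f"{hours:>2} hour test: {p:.0%} chance of passing despite the fault")
```

Under these assumptions a six-hour run passes about 95% of the time even though the fault is real, and a three-day run still passes more often than not. Real faults are rarely this tidy, since a test may never exercise the conditions that trigger them, so treat this as an optimistic estimate.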


2 comments:

Alex Garfield said...

Windows Vista has great external security of NT source code. Windows Vista has a nice graphical interface with Aero. Windows Vista lacks a maintenance operating system like DOS and still has some issues with backwards compatibility. One example: Windows Vista did not recognize my 6 GB iPod Mini on Day 1, then on Day 2 it played the music, making me look like a fool at the Apple Store in Albuquerque, New Mexico. Dan W.

Kerry Brown said...

Vista has WinRE, which is a very rich maintenance and recovery environment.

http://en.wikipedia.org/wiki/Windows_Recovery_Environment