Tuesday, November 20, 2012

Server Hardware Testing and Burn-in - Detailed Stress Testing and Fault Detection on New Hardware

Go on, admit it, you've thought about it yourself. Wouldn't it be satisfying to set your computer alight? Sadly, that is not what this article is about. Burning in is the term used to describe the process of testing new managed server hardware for faults before putting it to use in a live environment. This is done by running stress-testing software for an extended period of time.

Whenever we get new server hardware, we always do a thorough burn-in to ensure that it is up to our high standards. If the hardware fails at any point, we send it back to the supplier. The actual process is easy, although setting it up isn't.

Memory

First, when the new server is turned on, we boot it from the network, which lets us boot many machines at once without needing 20+ bootable disks. The first test we run is the well-known Memtest (you'll find it on Google), which thoroughly checks the computer's memory and runs for about a day.
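The general idea behind a memory test can be sketched as writing known bit patterns into a buffer and reading them back. The following is only an illustration: real Memtest runs without an operating system and walks physical memory with far more elaborate patterns, whereas this sketch exercises an ordinary Python buffer. The function name `pattern_test` and the chosen patterns are assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def pattern_test(buffer_size: int,
                 patterns: Sequence[int] = (0x00, 0xFF, 0x55, 0xAA)
                 ) -> List[Tuple[int, int, int]]:
    """Fill a buffer with each byte pattern, read it back, and report mismatches.

    Returns a list of (pattern, offset, value_read) tuples; an empty list
    means every byte read back as written.
    """
    buf = bytearray(buffer_size)
    errors = []
    for pattern in patterns:
        expected = bytes([pattern]) * buffer_size
        buf[:] = expected  # write the pattern across the whole buffer
        if buf != expected:  # fast whole-buffer comparison first
            # Only scan byte by byte when something actually differs.
            for offset, value in enumerate(buf):
                if value != pattern:
                    errors.append((pattern, offset, value))
    return errors


if __name__ == "__main__":
    faults = pattern_test(16 * 1024 * 1024)  # 16 MiB, an arbitrary size
    print("memory faults found:", len(faults))
```

Alternating patterns such as 0x55 and 0xAA are used so that every bit is tested in both states and neighbouring bits hold opposite values.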

If the computer passes Memtest, it is restarted and booted into a custom Red Hat kickstart install that sets up a bare Red Hat environment along with the Cerberus Test Control System, additional software that runs numerous tests on all the hardware in the system.

CPU

Cerberus performs several tasks to test the CPU. It compiles the Linux kernel over and over again, works through complex mathematical problems (how long would it take you to work out whether 3214235409234472020393848453 is prime?), and runs code written specifically to run the CPU at its hottest.
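As a rough illustration of the kind of arithmetic workload mentioned above, here is a minimal sketch of a Miller-Rabin primality test with a repeat-and-compare check wrapped around it. This is not Cerberus's code; the function names and the number of rounds are assumptions for the example. Miller-Rabin is probabilistic, so a `True` result means "almost certainly prime" rather than proven prime.

```python
import random


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin test: False means composite, True means probably prime."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29):
        if n == p:
            return True
        if n % p == 0:
            return False
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    rng = random.Random(0)  # fixed seed so every run does identical work
    for _ in range(rounds):
        a = rng.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def cpu_consistency_check(n: int, repeats: int = 5) -> bool:
    """Repeat the same calculation; a healthy CPU gives the same answer each time."""
    results = {is_probable_prime(n) for _ in range(repeats)}
    return len(results) == 1


if __name__ == "__main__":
    n = 3214235409234472020393848453
    print("probably prime:", is_probable_prime(n))
    print("results consistent:", cpu_consistency_check(n))
```

The point for burn-in purposes is less the answer itself than that the same deterministic calculation, repeated under load, must always produce the same result.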

Hard Drive

Cerberus writes large volumes of data to the hard drives over and over again to ensure that the drive platters are functional. It also deletes and moves files, and checks the disks for errors.
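A single write-then-verify pass of the sort described above might look like the sketch below. It is an assumption-laden illustration, not how Cerberus does it: the function name `disk_pass`, the sizes, and the use of SHA-256 are all choices made for the example. Note also that a read straight after a write may be served from the operating system's cache rather than the disk, so a real test would write far more data than the machine has RAM.

```python
import hashlib
import os
import tempfile


def disk_pass(directory: str,
              total_bytes: int = 64 * 1024 * 1024,
              chunk_bytes: int = 1024 * 1024) -> bool:
    """Write random data to a temp file in `directory`, read it back, compare hashes."""
    written = hashlib.sha256()
    remaining = total_bytes
    with tempfile.NamedTemporaryFile(dir=directory, delete=False) as f:
        path = f.name
        while remaining > 0:
            chunk = os.urandom(min(chunk_bytes, remaining))
            written.update(chunk)
            f.write(chunk)
            remaining -= len(chunk)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data out to the device
    try:
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_bytes), b""):
                read_back.update(chunk)
    finally:
        os.remove(path)  # the delete step: clean up after every pass
    return written.digest() == read_back.digest()


if __name__ == "__main__":
    print("disk pass ok:", disk_pass(tempfile.gettempdir()))
```

Running this in a loop, pointed at a directory on each drive under test, gives the "over and over again" behaviour; any `False` result or I/O exception would count as a failure.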

If after a week the server is still running (not smoking) and hasn't crashed, it is considered good enough for use as a production machine. If it fails the tests anywhere along the way, it is packed up and returned to be replaced. Web servers that have survived this process will easily survive whatever you can throw at them.
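The pass-or-return rule above can be sketched as a loop that keeps running every check until a deadline and stops at the first failure. The names `burn_in` and `checks` are hypothetical, and the checks themselves are stand-ins; this is only meant to show the shape of the decision, not our actual tooling.

```python
import time
from typing import Callable, Dict, Optional


def burn_in(checks: Dict[str, Callable[[], bool]],
            duration_seconds: float) -> Optional[str]:
    """Run all checks repeatedly until the deadline.

    Returns None if every check passed every time, otherwise the name of
    the first check that failed or raised an exception.
    """
    deadline = time.monotonic() + duration_seconds
    while time.monotonic() < deadline:
        for name, check in checks.items():
            try:
                ok = check()
            except Exception:
                ok = False  # a crashing test counts as a failed test
            if not ok:
                return name
    return None


if __name__ == "__main__":
    one_week = 7 * 24 * 60 * 60
    # Placeholder checks; real ones would exercise memory, CPU, and disks.
    failed = burn_in({"cpu": lambda: True, "disk": lambda: True}, duration_seconds=1)
    print("send back to supplier:" if failed else "ready for production", failed or "")
    print("a full run would use duration_seconds =", one_week)
```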

You would normally expect this level of testing to have been completed by the hardware manufacturers, so these tests shouldn't show up any faults. In our experience of testing hundreds of machines, we do regularly find faults, and we do send components back.

The reason it is so important to carry out this level of testing on computers that will be used as servers is that the uptime demands are so high. The slightest fault will cause outages and downtime. Once a web server is deployed, you will never again have the opportunity to take it offline and carry out such detailed testing. Even if it were to crash, there is always a demand that it be put back online as fast as possible, not left offline whilst thorough diagnostics are completed.
