We believe that system administrators should have visibility into their networks, confidence that their infrastructure is running securely, efficiently and without failures, and assurance that problems will be detected as early as possible, so that they can be resolved promptly and with minimal impact.

It is for those reasons that we created EventSentry, which enables visibility, gives confidence and provides assurance. So the better question to ask is: Why NOT monitor?

Fault Tolerance vs. Uptime

It’s easy to take server uptime for granted and to associate fault-tolerant hardware (RAID, dual power supplies, ECC memory) with increased uptime. But downtime can occur even when your servers are up: software failures are at least as likely as hardware failures, and hardware failures can be masked by fault tolerance.

From our own experience, one of the most stressful situations for system administrators is unexpected downtime: the sudden loss of a critical service or hardware component that nobody saw coming. Most of us have spent more than one evening or weekend in the office because of an “unexpected” failure of hardware or software. Unexpected is in quotes because many so-called unexpected outages can be prevented with proper system monitoring in place.

White Paper: Why an economical monitoring solution like EventSentry makes business sense by reducing downtime, helping with software license evaluation, saving valuable IT staff time and much more.

Example: RAID

Let’s consider a fairly typical example: you are running a modern fault-tolerant server that includes a RAID array as well as dual power supplies. Let’s also assume this server is running out of the box, with no monitoring set up. Eight months after it goes into service, one of the hard drives in the array fails. Your server, hard-working as it is, will of course keep running; after all, that’s what RAID is for. But the server has now become a ticking time bomb, since with the array's redundancy used up, a subsequent drive failure can render the server useless. If you are lucky, you might notice the problem through degraded performance, or spot it in the server room. Or maybe the server is in a remote location, and nobody will notice.

With proper monitoring in place, the RAID controller logs an event in the event log, which is then emailed or paged to your IT staff. You can then immediately order a replacement drive and get the server back into a fault-tolerant state (how quickly depends, of course, on your support contract with the hardware manufacturer).
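To make that flow concrete, here is a minimal Python sketch of the idea: look at event records, pick out the ones that suggest a RAID problem, and turn them into notification text. This is not how EventSentry is configured or implemented. The Event record, the keyword list and the function names are all hypothetical, and an in-memory list stands in for the real event log.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical, simplified stand-in for one event log record."""
    source: str   # component that wrote the event, e.g. a RAID driver
    level: str    # "INFO", "WARNING" or "ERROR"
    message: str


# Assumed keywords; real controllers use vendor-specific wording.
RAID_KEYWORDS = ("degraded", "drive failed", "rebuild")


def needs_alert(event: Event) -> bool:
    """True for warnings or errors whose text hints at a RAID problem."""
    if event.level not in ("WARNING", "ERROR"):
        return False
    text = event.message.lower()
    return any(keyword in text for keyword in RAID_KEYWORDS)


def alerts_for(host: str, events: list[Event]) -> list[str]:
    """Build one notification line per matching event."""
    return [
        f"[{e.level}] {host}: {e.source}: {e.message}"
        for e in events
        if needs_alert(e)
    ]


if __name__ == "__main__":
    sample = [
        Event("raidctl", "INFO", "Patrol read completed"),
        Event("raidctl", "ERROR", "Logical drive degraded: drive failed in bay 2"),
    ]
    for line in alerts_for("srv01", sample):
        print(line)  # a real setup would email or page this instead
```

In practice the matching rules would be based on the event sources and IDs your hardware vendor documents, rather than on free-text keywords.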

Unexpected Failures Are Stressful

Let’s face it: unexpected failures are stressful and harm your business. Most of our jobs are stressful enough without them. Many critical problems start small, perhaps with a single warning issued somewhere in the system. For example, a memory module on its way out might log ECC errors in the days before it fails. Again, with proper monitoring in place you can schedule a replacement and avoid downtime.

The same applies to software components such as SQL Server, MySQL, Exchange Server and so forth. All of these applications log problems to the event log, and ignoring those warnings and error messages will, in most cases, lead to larger problems down the road.
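The early-warning idea can also be sketched in a few lines of Python: count how many warnings or errors each source has logged recently, and flag the sources that keep complaining. The tuple layout, the seven-day window and the threshold of three are illustrative assumptions, not EventSentry settings.

```python
from collections import Counter
from datetime import datetime, timedelta


def recurring_warnings(events, now, window=timedelta(days=7), threshold=3):
    """Return {source: count} for sources that logged at least
    `threshold` warnings or errors within `window` before `now`.

    `events` is an iterable of (timestamp, source, level) tuples,
    a hypothetical simplification of real event log records.
    """
    cutoff = now - window
    counts = Counter(
        source
        for timestamp, source, level in events
        if timestamp >= cutoff and level in ("WARNING", "ERROR")
    )
    return {source: n for source, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    now = datetime(2024, 1, 10)
    sample = [
        (datetime(2024, 1, 7), "memory", "WARNING"),
        (datetime(2024, 1, 8), "memory", "WARNING"),
        (datetime(2024, 1, 9), "memory", "WARNING"),
        (datetime(2024, 1, 8), "sqlserver", "WARNING"),
        (datetime(2023, 12, 1), "memory", "WARNING"),  # outside the window
    ]
    print(recurring_warnings(sample, now))
```

A source that shows up in the result is a candidate for a scheduled fix, such as a memory module replacement, before it turns into an outage.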