Song of The Day: Available - Artist: Moving Units
So it’s been a while since I’ve posted. I don’t mean to let my blog slip, but I’ve been pretty busy lately. For one, I took a graduate class in fault tolerant computing that started in January. That class has been awesome; however, now that I’ve also started a new job with Corporate Technology Partners, Inc., I’ve had little time for anything else :).
I’m sure there are some that would be interested in knowing what I’m doing at CT Partners, and possibly how that all relates to VelociPeek.com LLC. Rest assured, I’ll make some sense of it all in future posts. For now I’d like to delve a little into reliability. And I don’t mean job reliability, silly! :) Today, I mean computer system reliability.
What’s interesting to me is that there used to be debates about system reliability and availability within AOL concerning certain, unnamed, systems (um, those systems will remain anonymous). It’s somewhat amusing to ponder this all now. Most technical folks, mostly OPS folks, understood the relation Availability = MTTF/(MTTF+MTTR); however, I wonder how many truly appreciated the field of study and theory behind it all. I appreciated the field and its usefulness, but my current class has given me an opportunity to delve a little deeper into the research. Because of my experience and current study, I thought I’d share a few thoughts about it now.
So what is reliability?
Reliability is the probability that a particular system will run without failure up to a given time. Given some failure rate (FR or λ), measured over a number of units, and a mission time (T), we can define reliability (R) as: R(T) = e^[-(λ * T)], where e is the base of the natural log (~2.7183).
So isn’t that availability?
Well, not exactly. Availability, or what we’ll call operational availability (Ao), is the probability that a particular system will be operational at a particular time. Many people use these terms interchangeably, but technically and mathematically, they’re different. Reliability is about the probability of running without failure over an interval, whereas availability is about the probability of being operational overall, recovery included. Mathematically, Ao = MTTF/(MTTF+MTTR), or the ratio of the mean time to failure to the sum of the mean time to failure and the mean time to recovery. There are varying definitions of availability (e.g., whether it includes planned maintenance); however, this formula will suffice for most discussions. Regarding MTTF within Ao, some refer to MTTF as reliability; however, that doesn’t fully describe it.
Perhaps an example is best to illustrate.
FR gives the number of failures per unit of operating time. For example, let’s say out of 2 computer chips 1 failed in 10 hours. This would yield an FR = { 1 chip failed/(2 chips * 10 hours) } = 1/20 or .05/hr. The inverse of the FR is the MTTF (λ^-1, or 1/λ): MTTF = 1/(.05/hr) or 20 hrs.
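For anyone who’d rather see that arithmetic as code, here’s a minimal Python sketch of the chip example. The function names (failure_rate, mttf) are just my own labels for illustration, and it assumes every unit is counted for the whole observation window, as in the example.

```python
def failure_rate(failures: int, units: int, hours: float) -> float:
    """Failures per unit-hour: failures / (units * hours)."""
    return failures / (units * hours)


def mttf(rate: float) -> float:
    """Mean time to failure is the inverse of the failure rate (1/λ)."""
    return 1.0 / rate


# The chip example: 1 failure out of 2 chips observed for 10 hours.
fr = failure_rate(failures=1, units=2, hours=10)
print(f"FR = {fr:.2f}/hr, MTTF = {mttf(fr):.0f} hrs")
```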
Given this metric, what is the reliability of this 2 chip system?
Well, you can plug things in: with T = 20 and λ = .05, R(20) = e^[-(.05 * 20)] = e^(-1) = ~37% reliable over 20 hours. So far FR = .05/hr and MTTF = 20 hrs. If we pick an average recovery time (MTTR) of, say, 1 hour, we have Ao = 20/(20+1) = ~95%. If our failure rate increases (e.g., FR = { 1 chip failed / (2 chips * 1 hour) } = .5/hr, so MTTF = 2 hrs) then our reliability and availability go down: R(20) = e^(-10) = ~.0045% for a mission T = 20, and Ao = 2/(2+1) = ~67%, respectively. Hopefully, that helps to show the relationship between the terms.
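If you want to play with the numbers yourself, here’s a rough Python sketch that runs both scenarios side by side. The helper names (reliability, availability) are mine, the 1-hour recovery time is the same assumed figure as in the example, and the reliability formula assumes the failure rate stays constant over the mission.

```python
import math


def reliability(rate: float, mission_hours: float) -> float:
    """R(T) = e^(-λ * T), assuming a constant failure rate λ."""
    return math.exp(-rate * mission_hours)


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Ao = MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


MISSION_HOURS = 20  # mission time T from the example
MTTR_HOURS = 1      # assumed average recovery time

for rate in (0.05, 0.5):  # failures per hour
    mttf_hours = 1.0 / rate
    r = reliability(rate, MISSION_HOURS)
    ao = availability(mttf_hours, MTTR_HOURS)
    print(f"FR={rate}/hr  MTTF={mttf_hours:.0f} hrs  "
          f"R({MISSION_HOURS})={r:.4%}  Ao={ao:.1%}")
```

Bumping the failure rate by 10x barely dents availability compared to what it does to reliability over a 20-hour mission, which is exactly why the two terms shouldn’t be used interchangeably.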
So how does this all relate to those previous debates?
Well, ignoring all those other complications regarding the theory of reliability and availability for a moment, one can derive a correlation between failure rate, cost per failure, and cost of a reliable and available system. For a business this type of understanding can help improve communication (e.g., expectations) and total cost of ownership (TCO). Hmm…that does appear useful :).
Well, that’s enough for this entry. Since there is a lot more that could be written, I may have to revisit this later for series and parallel configurations. Both topics are interesting and useful. For a good reference on some of this, check out START Volume 11, Number 5.
Tags: Eric O’Laughlen, Reliability