Table of Contents
Front page:
President's Message
From the Editor
Messages from VPs:
VP Publications Report from Dr. Robert Loomis
Society News:
2011 EXCOM and ADCOM Members
Prestigious Engineer of the Year Award
Best Chapter Awards
AdCom Meeting
Nominations for IEEE Medals and Recognitions
RS seeks Administrative Committee Candidates for 2012/2013/2014 Term
Reliability Society Past AdCom Members Obituaries:
Former RS President Monshaw Dies At 84
Obituary for Ann Miller
Feature Articles:
Reliability through the Ages
Reliability Overview of Air Traffic Reliability in the National Air Space
Regular Articles:
Field Based Reliability Calculations (MTBF) – Surmounting Practical Challenges: An Outside-the-Box Approach
Applying basic and familiar reliability theory to estimating and improving the availability of software-intensive systems
Fault Tolerance in Web Services
PHM Articles:
Detection of Multiple Failure-Modes in Electronics using Self-Organized Mapping
Book Review:
Reliability Engineering Book Review
Chapter Activities:
Cleveland Chapter
Taipei/Tainan Chapter
The Denver Chapter awarded a certificate to Hobbs Engineering
Announcements:
Solicitation for Society Technical Committees
UK&RI Workshop on Reliability and Safety
WCEAM-IMS 2001
Links:
Reliability Society Home
RS Newsletter Homepage
Field Based Reliability Calculations (MTBF) – Surmounting Practical Challenges: An Outside-the-Box Approach
Gay Gordon-Byrne
Population-based field reliability measurements are the statistical "Gold Standard". In the real world,
such measurements have been extremely difficult to execute. This is not for lack of interest in, or
understanding of, the value of reliability (MTBF) measurements, but because of the very real
challenges posed by the lack of common descriptive standards. MTBF calculations are simple; the data
sources are the problem. Anyone who has struggled to merge descriptions as simple as "Dell Poweredge"
and "Dell PE" in order to tabulate the number of devices in a study understands the syntax problem.
Everything must match or the math doesn't work.
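The "Dell Poweredge" versus "Dell PE" mismatch is typically handled with a translation table mapping free-text descriptions onto one canonical name. A minimal sketch in Python; the alias entries and the `normalize` helper are invented for illustration, not drawn from any published taxonomy:

```python
# Illustrative translation table: each free-text variant maps to one
# canonical model name. Real tables run to thousands of entries.
CANONICAL = {
    "dell poweredge": "Dell PowerEdge",
    "dell pe": "Dell PowerEdge",
    "dell pwredge": "Dell PowerEdge",
}

def normalize(description: str) -> str:
    """Collapse case and whitespace, then look up the canonical name."""
    key = " ".join(description.lower().split())
    return CANONICAL.get(key, description)  # unknowns pass through unchanged

for raw in ["Dell Poweredge", "DELL  PE", "IBM xSeries"]:
    print(raw, "->", normalize(raw))
```

Only after every record resolves to the same canonical name can the device counts for a study be tabulated at all.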
Field experience of reliability is different from that of bench testing or small-scale testing. We all have
examples of devices which performed admirably in tests but poorly in the field. It is costly to test large
enough quantities of low-failure-rate devices to produce a valid result. For example, a wireless
device with a specified MTBF of 300,000 hours would need to be tested in quantities of thousands in
order to deliver an MTBF result within a brief test window. Some testing may be cursory as a result. If
users do not demand, and are not willing to pay for, large-scale reliability testing, the test bed will be the
field.
In order to get the most out of the field testing, both inadvertent and deliberate, those responsible for
reliability and performance need to understand the limitations of existing data collection techniques and
consider alternatives for more effective collaboration.
Challenges:
Lacking widely adopted standards for syntax and definitions for failures, the industry has relied upon
manufacturers, voluntary surveys, and anecdote from end users for reliability data. None of these are
statistically valid, for the reasons laid out below. Our contribution to the state of the art in reliability
measurement is a technique for surmounting these problems.
End User Effort:
Many organizations cull their own databases to make calculations of MTBF. These efforts are by
definition limited in scope, effectively covering only products installed in large quantities or over long
periods of time. (Either large quantities or long time spans are needed to calculate MTBF; larger
volumes of data make for better statistics.) Further, without standards, experiences at one location are
not comparable to experiences at another. No one can learn from anyone else.
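The calculation itself is as simple as claimed: total accumulated device-hours across the population divided by the number of failures observed in the same period. A minimal sketch with invented figures:

```python
# Minimal field MTBF: total operating hours divided by observed failures.
def mtbf(device_hours: float, failures: int) -> float:
    if failures == 0:
        raise ValueError("no failures observed; MTBF cannot be computed")
    return device_hours / failures

# 500 devices, each observed for one year (8,760 hours), 12 failures:
print(mtbf(500 * 8_760, 12), "hours")  # 365000.0 hours
```

Everything hard about the problem lies in assembling trustworthy values for those two inputs, not in the division.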
End-user efforts are typically hampered by poor-quality data. Most organizations have grown over time
and carried legacy systems and legacy databases with them for decades. We have learned from hard
experience that user data is of highly variable quality. Most can barely tabulate the total quantity of
devices in use by model, and it is the rare user that has invested in service processes that allow
product-reliability metrics to be run. Lacking standards and scrutiny, most user-driven
reliability calculations are questionable at best.
Manufacturer Effort:
Field reliability measurements managed by manufacturers are also of limited quality, but for a different
reason: they don't have the data. Reliability calculations require an accurate population of devices and
an equally complete set of failure reports over a known period of time. Many equipment users expect
the manufacturer to have such data and believe the information is being withheld. This is not the case.
The population of devices deployed at any point in time cannot be known by the manufacturer, who
can only be expected to have accurate records of quantities shipped. (*) Many products are warehoused
and delivered through distribution, further disconnecting the manufacturer from the deployed asset
inventory. On the failure side of the equation, warranties cover only a subset of the total product
lifecycle. Anything out of warranty would not be serviced consistently by the manufacturer. Further,
regardless of age, not all eligible warranty returns are made. Many products are serviced
independently or by the user. Any attempt to calculate MTBF from these data sources necessitates
extrapolations and statistical adjustments for uncertainty.
Even if quantity and quality issues can be resolved, the lack of a common definition of failure distorts the
reliability picture for both user- and manufacturer-driven efforts. The perspective of the manufacturer is
different from that of the end user. Manufacturers perform root cause analysis to learn whether the problem
was design, manufacturing, software, or even user error. "No Fault Found" is a great result from the
perspective of the manufacturer. In the world of the end user, anytime a field service technician is
involved, something was broken. The two do not speak the same language regarding reliability.
Surveys:
In order to compare apples to apples, all parties must use the identical syntax for both asset descriptions
and problems. End users will not invest in structural changes to data collection systems, so most
attempts to collect data are limited to surveys where the data is voluntarily pre‐conformed by the
participants. Due to the costs involved, these surveys are limited in scope and open to statistical
suspicion over non‐response bias and other manipulations.
Innovation: External Standards
We approach the problem of standards externally. Rather than beat the drum for better standards to be
adopted by each organization, we have done an end run around the standards problem by using
technology to enforce consistent order on existing data. By avoiding the expense and delay involved in
having users adopt better methods, we can immediately calculate MTBF for any product reported to us.
Culling Electronic Databases
The proliferation of electronic databases in both asset management and field service organizations
allows us to use technology to make connections that were previously impossible. Organizations with
the ability to report on both the deployed quantities of assets and associated failures export raw data to
us without manipulation or reprogramming. We externally clean, standardize, and then calculate MTBF.
Results are then consistent across all products reported, allowing comparisons to be made.
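The clean-standardize-calculate flow described above can be sketched end to end. The records, alias entries, and figures below are all invented for illustration; real exports carry far more fields:

```python
from collections import defaultdict

# Raw exports from two sites, with inconsistent model descriptions.
records = [
    {"model": "Dell Poweredge", "hours": 1_000_000, "failures": 4},
    {"model": "dell pe",        "hours":   500_000, "failures": 1},
    {"model": "HP Proliant",    "hours":   800_000, "failures": 8},
]
# Illustrative translation table applied externally, not by the user.
ALIASES = {"dell poweredge": "Dell PowerEdge",
           "dell pe": "Dell PowerEdge",
           "hp proliant": "HP ProLiant"}

totals = defaultdict(lambda: [0, 0])  # model -> [hours, failures]
for r in records:
    key = " ".join(r["model"].lower().split())
    model = ALIASES.get(key, r["model"])
    totals[model][0] += r["hours"]
    totals[model][1] += r["failures"]

for model, (hours, failures) in sorted(totals.items()):
    print(f"{model}: MTBF = {hours / failures:,.0f} h")
```

Because every record is forced through the same table, the resulting MTBF figures are directly comparable across products and across reporting sites.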
Results and Opportunities
MTBF data calculated consistently across multiple products of similar function immediately permits
comparisons of commonly used products on the basis of reliability. Designs and manufacturing
techniques which deliver higher reliability are reflected in the statistics with lower failure rates. Those
with poor design or quality report higher failure rates. Even where available data sets do not deliver
part‐level details, the net MTBF of the device reveals differences. Where multiple users report on the
same device, comparisons are made between the failure rates at different users, allowing users to
compare operations as well as products.
This same field‐driven data can be reported back to the vendor as part of the feedback loop to quality
control and design. If there are differences between internal test results and field results — the vendor
can then do further testing to determine cause. Products with initial problems can be improved more
quickly and with greater confidence on the part of the customer.
TekTrakker Observations
Founded in 2006, TekTrakker was designed from the outset to calculate and deliver MTBF statistics as a
measure of equipment reliability for business computing (IT) technology products. We began with the
supposition that different designs and manufacturing techniques would manifest different failure rates
and set a threshold of 3 STDEV as a range of "normal" performance. We have since discovered that the
range of quality between common IT products is so large that our scale of normal needs to be adjusted
to be within 10 STDEV, if not higher. There is clearly much to be learned and much to be improved
when it comes to IT reliability from the field perspective.
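The widening "normal" band can be pictured as a simple peer-group outlier test: flag any product whose failure rate sits more than k standard deviations from the mean of its peers. A sketch with invented rates; the function name and the sample values are illustrative only:

```python
import statistics

# Flag products whose annualized failure rate deviates from the
# peer-group mean by more than k sample standard deviations.
def outliers(rates: dict, k: float) -> list:
    mean = statistics.mean(rates.values())
    sd = statistics.stdev(rates.values())
    return sorted(name for name, r in rates.items() if abs(r - mean) > k * sd)

# Product D fails an order of magnitude more often than its peers.
rates = {"A": 0.02, "B": 0.03, "C": 0.025, "D": 0.30}
print(outliers(rates, k=1.0))
```

When the spread among products is as wide as described, a large k is needed before anything stops being "abnormal", which is exactly the adjustment reported above.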
Evidence from other industries, such as "Smart Grid" devices, shows that the reliability of electronics is a
widespread mystery. Anecdotal reports of annual failure rates of 5-10% are surprising to utility insiders,
but not surprising to us, given our monitoring of related devices. We believe this is a very timely technique
which can be used immediately to help the electric industry avoid billions in wasted spending.
Limitations
There are limitations to this technique. Asset and service databases must capture a minimum of data
points or we cannot use the information. Results can only be calculated to the least common level of
consistency, which in the case of complex configurable products means some loss of granularity.
Part-level detail is captured where available, but we cannot create detail where none is provided. Until users
improve their data collection to capture particular details, all are limited to the least common level of
quality.
Custom products do not lend themselves to this method, nor do those installed in very small quantities,
as the time needed to develop a useful result may exceed the useful life of the product.
The largest limitation of our approach is the organizer's investment of time to build out the
"taxonomy", develop the standards, and understand the failure issues, along with the sheer willpower
and patience needed to build endless translation tables that deal with syntactical (not statistical) issues.
We welcome the opportunity to work with all people interested in product reliability and hope that we
can assist the industry with improving reliability anywhere and everywhere.
About the Author:
Gay Gordon‐Byrne is the President and co‐founder of TekTrakker Information Systems, LLC. With an
extensive background in business technology solutions, first as a systems programmer, and then in large
systems software sales, hardware systems sales and management, Gay has brought her wide experience
of IT products and policies to bear on bringing reliability measurement and comparison tools to the
industry.
(*) A very small set of products may be registered or entitled in such a way as to connect the
manufacturer to the total quantity of deployed devices.
Download Full Article