Table of Contents
Front page:
President's Message
From the Editor
Messages from VPs:
VP Publications Report from Dr. Robert Loomis
Society News:
2011 EXCOM and ADCOM Members
Prestigious Engineer of the Year Award
Best Chapter Awards
AdCom Meeting
Nominations for IEEE Medals and Recognitions
RS seeks Administrative Committee Candidates for 2012/2013/2014 Term
Reliability Society Past AdCom Members Obituaries:
Former RS President Monshaw Dies At 84
Obituary for Ann Miller
Feature Articles:
Reliability through the Ages
Reliability Overview of Air Traffic Reliability in the National Air Space
Regular Articles:
Field Based Reliability Calculations (MTBF) – Surmounting Practical Challenges: An Outside-the-Box Approach
Applying basic and familiar reliability theory to estimating and improving the availability of software-intensive systems
Fault Tolerance in Web Services
PHM Articles:
Detection of Multiple Failure-Modes in Electronics using Self-Organized Mapping
Book Review:
Reliability Engineering Book Review
Chapter Activities:
Cleveland Chapter
Taipei/Tainan Chapter
The Denver Chapter awarded a certificate to Hobbs Engineering
Announcements:
Solicitation for Society Technical Committees
UK&RI Workshop on Reliability and Safety
WCEAM-IMS 2001
Links:
Reliability Society Home
RS Newsletter Homepage
Field Based Reliability Calculations (MTBF) – Surmounting Practical Challenges: An Outside-the-Box Approach
Gay Gordon-Byrne
Population-based field reliability measurements are the statistical "Gold Standard". In the real world,
such measurements have been extremely difficult to execute. This is not for lack of interest in, or
understanding of, the value of reliability (MTBF) measurements, but because of the very real
challenges posed by the lack of common descriptive standards. MTBF calculations are simple; the data
sources are the problem. Anyone who has struggled to merge descriptions as simple as "Dell Poweredge"
and "Dell PE" in order to tabulate the number of devices in a study understands the syntax problem.
Everything must match or the math doesn't work.
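The "Dell Poweredge" versus "Dell PE" mismatch is typically handled with a translation table mapping free-text descriptions onto one canonical name. A minimal sketch in Python; the alias entries and the `normalize` helper are invented for illustration, not drawn from any published taxonomy:

```python
# Illustrative translation table: each free-text variant maps to one
# canonical model name. Real tables run to thousands of entries.
CANONICAL = {
    "dell poweredge": "Dell PowerEdge",
    "dell pe": "Dell PowerEdge",
    "dell pwredge": "Dell PowerEdge",
}

def normalize(description: str) -> str:
    """Collapse case and whitespace, then look up the canonical name."""
    key = " ".join(description.lower().split())
    return CANONICAL.get(key, description)  # unknowns pass through unchanged

for raw in ["Dell Poweredge", "DELL  PE", "IBM xSeries"]:
    print(raw, "->", normalize(raw))
```

Only after every record resolves to the same canonical name can the device counts for a study be tabulated at all.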
Field experience of reliability is different from that of bench testing or small-scale testing. We all have
examples of devices which performed admirably in tests but poorly in the field. It is costly to test large
enough quantities of low-failure-rate devices to produce a valid result. For example, a wireless
device with a specified MTBF of 300,000 hours would need to be tested in quantities of thousands in
order to deliver an MTBF result within a brief test window. Some testing may be cursory as a result. If
users do not demand, and are not willing to pay for, large-scale reliability testing, the test bed will be the
field.
In order to get the most out of the field testing, both inadvertent and deliberate, those responsible for
reliability and performance need to understand the limitations of existing data collection techniques and
consider alternatives for more effective collaboration.
Challenges:
Lacking widely adopted standards for syntax and definitions for failures, the industry has relied upon
manufacturers, voluntary surveys, and anecdote from end users for reliability data. None of these are
statistically valid, for the reasons laid out below. Our contribution to the state of the art in reliability
measurement is a technique for surmounting these problems.
End User Effort:
Many organizations cull their own databases to make calculations of MTBF. These efforts are by
definition limited in scope, effectively covering only products installed in large quantities or over long
periods of time. (Either large quantities or long time spans are needed to calculate MTBF; larger
volumes of data make for better statistics.) Further, without standards, experiences at one location are
not comparable to experiences at another. No one can learn from anyone else.
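The calculation itself is as simple as claimed: total accumulated device-hours across the population divided by the number of failures observed in the same period. A minimal sketch with invented figures:

```python
# Minimal field MTBF: total operating hours divided by observed failures.
def mtbf(device_hours: float, failures: int) -> float:
    if failures == 0:
        raise ValueError("no failures observed; MTBF cannot be computed")
    return device_hours / failures

# 500 devices, each observed for one year (8,760 hours), 12 failures:
print(mtbf(500 * 8_760, 12), "hours")  # 365000.0 hours
```

Everything hard about the problem lies in assembling trustworthy values for those two inputs, not in the division.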
End-user efforts are typically hampered by poor-quality data. Most organizations have grown over time
and carried legacy systems and legacy databases with them for decades. We have learned from hard
experience that user data is of highly variable quality. Most can barely tabulate the total quantity of
devices in use by model, and it is the rare user that has invested in service processes that allow
product-reliability metrics to be run. Lacking standards and scrutiny, most user-driven
reliability calculations are questionable at best.
Manufacturer Effort:
Field reliability measurements managed by manufacturers are also of limited quality, but for a different
reason: they don't have the data. Reliability calculations require an accurate population of devices and
an equally complete set of failure reports over a known period of time. Many equipment users expect
the manufacturer to have such data and believe the information is being withheld. This is not the case.
The population of devices deployed at any point in time cannot be known by the manufacturer, who
can only be expected to have accurate records of quantities shipped. (*) Many products are warehoused
and delivered through distribution, further disconnecting the manufacturer from the deployed asset
inventory. On the failure side of the equation, warranties cover only a subset of the total product
lifecycle. Anything out of warranty would not be serviced consistently by the manufacturer. Further,
regardless of age, not all eligible warranty returns are made. Many products are serviced
independently or by the user. Any attempt to calculate MTBF from these data sources necessitates
extrapolations and statistical adjustments for uncertainty.
Even if quantity and quality issues can be resolved, the lack of a common definition of failure distorts the
reliability picture for both user- and manufacturer-driven efforts. The perspective of the manufacturer is
different from that of the end user. Manufacturers perform root cause analysis to learn whether the problem
was design, manufacturing, software, or even user error. "No Fault Found" is a great result from the
perspective of the manufacturer. In the world of the end user, anytime a field service technician is
involved, something was broken. The two do not speak the same language regarding reliability.
Surveys:
In order to compare apples to apples, all parties must use the identical syntax for both asset descriptions
and problems. End users will not invest in structural changes to data collection systems, so most
attempts to collect data are limited to surveys where the data is voluntarily pre‐conformed by the
participants. Due to the costs involved, these surveys are limited in scope and open to statistical
suspicion over non‐response bias and other manipulations.
Innovation: External Standards
We approach the problem of standards externally. Rather than beat the drum for better standards to be
adopted by each organization, we have done an end run around the standards problem by using
technology to enforce consistent order on existing data. By avoiding the expense and delay involved in
having users adopt better methods, we can immediately calculate MTBF for any product reported to us.
Culling Electronic Databases
The proliferation of electronic databases in both asset management and field service organizations
allows us to use technology to make connections that were previously impossible. Organizations with
the ability to report on both the deployed quantities of assets and associated failures export raw data to
us without manipulation or reprogramming. We externally clean, standardize, and then calculate MTBF.
Results are then consistent across all products reported, allowing comparisons to be made.
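The clean-standardize-calculate flow described above can be sketched end to end. The records, alias entries, and figures below are all invented for illustration; real exports carry far more fields:

```python
from collections import defaultdict

# Raw exports from two sites, with inconsistent model descriptions.
records = [
    {"model": "Dell Poweredge", "hours": 1_000_000, "failures": 4},
    {"model": "dell pe",        "hours":   500_000, "failures": 1},
    {"model": "HP Proliant",    "hours":   800_000, "failures": 8},
]
# Illustrative translation table applied externally, not by the user.
ALIASES = {"dell poweredge": "Dell PowerEdge",
           "dell pe": "Dell PowerEdge",
           "hp proliant": "HP ProLiant"}

totals = defaultdict(lambda: [0, 0])  # model -> [hours, failures]
for r in records:
    key = " ".join(r["model"].lower().split())
    model = ALIASES.get(key, r["model"])
    totals[model][0] += r["hours"]
    totals[model][1] += r["failures"]

for model, (hours, failures) in sorted(totals.items()):
    print(f"{model}: MTBF = {hours / failures:,.0f} h")
```

Because every record is forced through the same table, the resulting MTBF figures are directly comparable across products and across reporting sites.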
Results and Opportunities
MTBF data calculated consistently across multiple products of similar function immediately permits
comparisons of commonly used products on the basis of reliability. Designs and manufacturing
techniques which deliver higher reliability are reflected in the statistics with lower failure rates. Those
with poor design or quality report higher failure rates. Even where available data sets do not deliver
part‐level details, the net MTBF of the device reveals differences. Where multiple users report on the
same device, comparisons are made between the failure rates at different users, allowing users to
compare operations as well as products.
This same field‐driven data can be reported back to the vendor as part of the feedback loop to quality
control and design. If there are differences between internal test results and field results — the vendor
can then do further testing to determine cause. Products with initial problems can be improved more
quickly and with greater confidence on the part of the customer.
TekTrakker Observations
Founded in 2006, TekTrakker was designed from the outset to calculate and deliver MTBF statistics as a
measure of equipment reliability for business computing (IT) technology products. We began with the
supposition that different designs and manufacturing techniques would manifest different failure rates
and set a threshold of 3 STDEV as a range of "normal" performance. We have since discovered that the
range of quality between common IT products is so large that our scale of normal needs to be adjusted
to be within 10 STDEV, if not higher. There is clearly much to be learned and much to be improved
when it comes to IT reliability from the field perspective.
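The widening "normal" band can be pictured as a simple peer-group outlier test: flag any product whose failure rate sits more than k standard deviations from the mean of its peers. A sketch with invented rates; the function name and the sample values are illustrative only:

```python
import statistics

# Flag products whose annualized failure rate deviates from the
# peer-group mean by more than k sample standard deviations.
def outliers(rates: dict, k: float) -> list:
    mean = statistics.mean(rates.values())
    sd = statistics.stdev(rates.values())
    return sorted(name for name, r in rates.items() if abs(r - mean) > k * sd)

# Product D fails an order of magnitude more often than its peers.
rates = {"A": 0.02, "B": 0.03, "C": 0.025, "D": 0.30}
print(outliers(rates, k=1.0))
```

When the spread among products is as wide as described, a large k is needed before anything stops being "abnormal", which is exactly the adjustment reported above.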
Evidence from other industries, such as "Smart Grid" devices, shows that the reliability of electronics is a
widespread mystery. Anecdotal reports of annual failure rates of 5-10% are surprising to utility insiders,
but not surprising to us, given our monitoring of related devices. We believe this is a very timely technique
which can be used immediately to help the electric industry avoid billions in wasted spending.
Limitations
There are limitations to this technique. Asset and service databases must capture a minimum of data
points or we cannot use the information. Results can only be calculated to the least common level of
consistency, which in the case of complex configurable products means some loss of granularity.
Part-level detail is captured where available, but we cannot create detail where none is provided. Until users
improve their data collection to capture particular details, all are limited to the least common level of
quality.
Custom products do not lend themselves to this method, nor do those installed in very small quantities,
as the time needed to develop a useful result may exceed the useful life of the product.
The largest limitation of our approach is the organizer's investment of time to build out the
"taxonomy", develop the standards, and understand the failure issues, along with the sheer willpower
and patience needed to build endless translation tables that deal with syntactical (not statistical) issues.
We welcome the opportunity to work with all people interested in product reliability and hope that we
can assist the industry with improving reliability anywhere and everywhere.
About the Author:
Gay Gordon‐Byrne is the President and co‐founder of TekTrakker Information Systems, LLC. With an
extensive background in business technology solutions, first as a systems programmer, and then in large
systems software sales, hardware systems sales and management, Gay has brought her wide experience
of IT products and policies to bear on bringing reliability measurement and comparison tools to the
industry.
(*) A very small set of products may be registered or entitled in such a way as to connect the
manufacturer to the total quantity of deployed devices.
Download Full Article