Pankaj Jalote, Professor, CSE, IIT Kanpur, India
System Reliability
Reliability of a system: its ability to provide failure-free operation. Failure: the system behavior is incorrect or not as expected; it is a random phenomenon.
Reliability Quantification
Reliability of a system is defined as the probability of failure-free operation over a time period: R(t) = probability that the system has not failed by time t. For reliability work, the distribution of R(t) is often specified.
Reliability Quantification..
Reliability can also be quantified by Mean Time to Failure (MTTF), and by failure rate (number of failures per unit time). From R(t), the MTTF or failure rate can be determined. Under some assumptions, failure rate and MTTF are inversely related.
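One common assumption under which the inverse relation holds is a constant failure rate, which gives R(t) = exp(-rate * t). The sketch below assumes that exponential form (the slides do not name a distribution) and checks numerically that the area under R(t) comes out close to 1/rate. The function names and the rate value are illustrative only.

```python
import math

def reliability(t, rate):
    """R(t): probability of no failure by time t, assuming a constant failure rate."""
    return math.exp(-rate * t)

def mttf_numeric(rate, horizon=2000.0, steps=200_000):
    """Approximate MTTF as the integral of R(t) over [0, horizon] (trapezoid rule)."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0, rate) + reliability(horizon, rate))
    for k in range(1, steps):
        total += reliability(k * dt, rate)
    return total * dt

rate = 0.01  # assumed value: failures per hour
print("R(100) =", reliability(100.0, rate))
print("numeric MTTF =", mttf_numeric(rate), "vs 1/rate =", 1.0 / rate)
```

With other distributions of R(t), the failure rate varies with time and the simple 1/rate relation need not hold.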
Software Reliability
Software (un)reliability is not caused by aging but by bugs. The more bugs there are, the lower the reliability of the software. Still, failures seem random, hence reliability theory can be applied.
Software Reliability
Software systems are often one-off. Measuring reliability in the lab is not practical, as too much failure data is needed, which requires time.
Assume that reliability is a function of the defect level, and that as defects are removed, reliability improves. Model the failure-fix process of software evolution. Many models have been proposed in the last 3 decades. Model parameters are determined from past data on failures and fixes.
For software products, a large population exists in the field, and faults are not removed as failures occur. According to software reliability growth models (SRGMs), the reliability should then remain the same, i.e. the failure rate should be constant.
[Figure: failures per month per unit]
Users learn with time and avoid failure-causing situations. Users start by exploring more, then limit themselves to some part of the product.
Most users use a few product features
Configuration-related failures are much more frequent at the start. These failures reduce with time.
For a user, there is a transient failure rate, which decays by a factor. With time the transient dies out, and the failure rate reaches a steady state. The steady-state failure rate represents the reliability of the product.
Failure rate for one unit is λ(i) = λ0 · α^i + λf, where λ0 is the initial transient rate, λf is the final steady-state rate, and α is the decay factor.
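A minimal sketch of this per-unit model, reading the formula as an initial transient rate multiplied by the decay factor raised to the period index i, plus the steady-state rate. The function name and the parameter values below are hypothetical and chosen only to show the shape of the curve.

```python
def unit_failure_rate(i, transient0, steady, decay):
    """Failure rate of one unit in period i: transient0 * decay**i + steady."""
    return transient0 * decay ** i + steady

# Hypothetical parameters, not taken from any data set.
rates = [unit_failure_rate(i, transient0=0.1, steady=0.02, decay=0.5)
         for i in range(6)]

for period, rate in enumerate(rates):
    print(period, round(rate, 5))
```

The printed rates fall in each period and approach the steady-state value from above, which is the transient-then-steady behavior described for a single user.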
[Figure: failure rate versus time]
Applying it to a Product
We considered the failure and sales data of a real product for MS. Applying the model to the data and determining the parameters, we get:
λ0 = 0.04 failures/month, λf = 0.008 failures/month, α = 0.4 (i.e. 40% decay each month)
Example
Steady-state failure rate is 1/6th of the average rate in month 2 and 1/3rd of the average rate in month 4, i.e. the initial MTTF could be 1/6th of the steady-state MTTF. Steady state is reached quite soon, in two to three months.
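The fitted values can be plugged into the per-unit formula to see how quickly the transient fades. This is a sketch for a single unit only, assuming the decay factor multiplies the transient once per month. The averages quoted above are taken over the product's installed base and are not reproduced here, and 1/rate is used as a rough MTTF only under a constant-rate assumption. Variable names are illustrative.

```python
# Fitted values quoted above; the variable names are illustrative.
TRANSIENT0 = 0.04  # initial transient rate, failures/month
STEADY = 0.008     # steady-state rate, failures/month
DECAY = 0.4        # decay factor, assumed to multiply the transient once per month

def unit_failure_rate(month):
    """Per-unit failure rate, assuming TRANSIENT0 * DECAY**month + STEADY."""
    return TRANSIENT0 * DECAY ** month + STEADY

# Ratio of the steady-state rate to the rate of a brand-new unit (month 0).
ratio_at_start = STEADY / unit_failure_rate(0)

for month in range(6):
    rate = unit_failure_rate(month)
    transient_share = (rate - STEADY) / rate
    print(f"month {month}: rate={rate:.5f}/month, "
          f"1/rate={1 / rate:.1f} months, transient share={transient_share:.0%}")

print(f"steady/initial = {ratio_at_start:.3f}")
```

Under these assumptions, the steady-state rate is one sixth of a new unit's rate at month 0, since 0.008 / (0.04 + 0.008) = 1/6.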
Sw Architecture
Architecture is the components in the system and how they are connected. It is decided very early in a software project. If reliability and performance can be modeled from the architecture, the architecture can be improved. Some work is going on in architecture-based performance and reliability modeling.
Program Verification
Basic goal: to ensure that the program is free of defects (bugs) as much as possible. Good program verification leads to higher reliability.
Testing: the program is executed with test data to find bugs
Static analysis: the program source code is analyzed
Dynamic analysis: the program is run on some data and assertions are made
Model checking
Formal verification
Techniques
Most techniques work in isolation. Sometimes they are complementary in their defect detection capability. Combining techniques meaningfully can improve reliability. We are working on techniques for combining testing and static analysis.
Testing
Testing remains the main verification activity, with most reliance placed on it. It consumes as much as half of the total effort in a software product. Testing involves test case design, execution, checking the results, then debugging, fixing, and retesting. Each step is expensive.
Test Automation
Test automation can help reduce cost and make testing more effective. Most test automation approaches focus on data collection and re-testing. Little effort has gone into complete end-to-end automation. We are working on automating OO testing using state-based models.
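To illustrate the general idea of state-based test generation (not the specific approach being developed), the sketch below derives one method sequence per transition of a small state model and replays each sequence against a class under test. The bounded-stack class, its state model, and all names are hypothetical.

```python
from collections import deque

# Hypothetical state model of a stack with capacity 2: (state, method) -> next state.
MODEL = {
    ("empty", "push"): "partial",
    ("partial", "push"): "full",
    ("partial", "pop"): "empty",
    ("full", "pop"): "partial",
}

class BoundedStack:
    """Hypothetical class under test."""
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.items = []

    def push(self, value):
        if len(self.items) >= self.capacity:
            raise OverflowError("stack is full")
        self.items.append(value)

    def pop(self):
        return self.items.pop()

    def state(self):
        if not self.items:
            return "empty"
        return "full" if len(self.items) == self.capacity else "partial"

def transition_cover(model, start):
    """Return one method sequence per transition, using shortest paths found by BFS."""
    paths = {start: []}  # shortest method sequence reaching each state
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for (src, method), dst in model.items():
            if src == state and dst not in paths:
                paths[dst] = paths[state] + [method]
                queue.append(dst)
    return [paths[src] + [method] for (src, method) in model if src in paths]

def run_sequence(sequence):
    """Replay a method sequence and check the object's state after every call."""
    stack, expected = BoundedStack(), "empty"
    for method in sequence:
        if method == "push":
            stack.push(0)
        else:
            stack.pop()
        expected = MODEL[(expected, method)]
        assert stack.state() == expected, (sequence, method)
    return expected

for seq in transition_cover(MODEL, "empty"):
    print(seq, "->", run_sequence(seq))
```

Each generated sequence doubles as a test case: the model supplies both the calls to make and the state expected after each call, so design, execution, and result checking are driven from one description.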
Summary
Software reliability is a rich and wide area. Exciting work is going on across the world in modeling, analysis, program checking, testing, etc. Lots of open issues remain.