Project Proposal

Early Stage Software Reliability Prediction Using Bayesian Network

1. Introduction

ADS (ATM Distributed Switching) is a new telephone switching system that Nortel is currently developing. Its major feature is that it consolidates voice, data, and leased-line networks onto a common broadband ATM network. The skeleton of ADS looks like this:

                   ______________________________   
                  | Servers running applications |
                  |     like ISUP, H.323, MIB    |
                  |______________________________|
                                 |
                                 |
                             ATM network
                                 |
                       __________|___________
                      | User access services |
                      |  through interfaces  |
                      |______________________|

The servers are connected by a ring (the Controller Ring) with some redundancy. The Controller Ring consists of Controllers (servers, including a Master processor, a Totem processor, and an Ether card), Gateways (including an ATM I/C, a Totem processor, and an Ether card), and Ethernet links.

                    ----Ctrl------------Ctrl----
                   |                            |
                   |                            |  <---- Controller Ring
                    ---- GW--------------GW-----
                         |                |
                         |                |
                     ATM Switch       ATM Switch

The applications run on Controllers. They are built upon a number of software component layers (see the runtime subcomponents slide). Since software service reliability is vital for telecommunication systems, metrics and models that help to assess and improve the quality of software during its life cycle are of practical benefit.

One of the layers of the software architecture is the TCP/IP Adaptor layer, which transfers data between servers, routers, and gateways. Right now we are focusing on predicting the reliability of this layer based solely on its characteristics and its relationships with other layers. We are using an inference tool, the Bayes net, to determine the achievable reliability.

2. Project motivation

While much work has been done to determine the reliability of software after it is developed, little has been done to predict the achievable reliability of software before it is actually built. However, predicting reliability during the software design stage may give the design team concrete guidance or direction. It also helps us make trade-offs among reliability, resources, costs, and release schedules.

A Bayes net is a technique for modeling uncertainty. It has been popular in the AI uncertainty community and has been applied to various problems such as medical diagnosis, map learning, language understanding, vision, and heuristic search[2]. Uncertainty is inherent and inevitable in software development processes and products[6], and a Bayes net lets us easily incorporate analytical, experimental, or even observational data. The Bayes net can also be readily updated as software development progresses. Using a Bayes net, we can confirm, evaluate, or predict software uncertainty[6]. The project is challenging in two ways: first, little work has been done on applying Bayes nets to software reliability prediction; second, there are still doubts about the value of early stage software reliability prediction.

3. Previous approach

The usual steps to predict reliability for a practical application are as follows[1]:
1. failure definition;
2. system definition and decomposition;
3. test run selection;
4. determination of parameters for both the execution time and calendar time components of the model;
5. performance of studies and computation of useful quantities.

Reliability is computed as follows: first compute the failure intensity based on the selected model and the failures experienced up to a specific time, then use the formula R=exp(-rt) (where r is the computed failure intensity and t is the CPU execution time) to compute the reliability. This approach therefore requires a large amount of test results before any prediction can be made.
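As a small illustration of the formula R=exp(-rt), the following Python sketch evaluates it for given inputs. The function name and the sample values in the test are hypothetical; in practice r would come from the selected reliability model and the observed failure data.

```python
import math


def reliability(failure_intensity: float, execution_time: float) -> float:
    """Return R = exp(-r * t).

    failure_intensity (r): failures per unit of CPU execution time.
    execution_time (t): CPU execution time, in the same time unit as r.
    """
    if failure_intensity < 0 or execution_time < 0:
        raise ValueError("failure intensity and execution time must be non-negative")
    return math.exp(-failure_intensity * execution_time)
```

Note that r and t must use the same time unit (for example, failures per CPU hour and CPU hours), otherwise the result is meaningless.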

4. Our methodology

The field of software metrics assumes that characteristics of software products and development processes strongly influence the quality of the released product, and its reliability in particular[3]. Since there may be no test data available for early stage software, we try to predict its reliability based on the following three aspects of software metrics:
1. Software architecture, including software layering, modularity, degree of coupling, redundancy, etc.
2. Software development environment, including OO vs. non-OO, validation & verification techniques, etc.
3. Computation, communication & data distribution paradigms, including centralized vs. distributed computation, shared vs. distributed memory, OS characteristics, etc.

Bayes nets are essentially directed acyclic graphs. To construct the graph, we will first decompose the software into a set of logical components, each falling into one of the above three categories. These components correspond to the nodes in a Bayes net. Two nodes are connected directly if there is a dependency between them (we will still call the node which has no parents a root node, as in a tree). Our goal is to determine the reliability of the software by inferring the probability that the root node is in a good state. The key difficulty of the method lies in constructing a realistic Bayes net for the software.
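To make the inference step concrete, here is a minimal Python sketch of a very small Bayes net with one node per category and a single target node standing for "the software is in a good state". Everything in it is an assumption for illustration: the node names, the prior probabilities, the conditional probability table (which depends only on how many factors are good), and the edge direction (factors pointing to the target node). It uses brute-force enumeration, not a Bayes net tool.

```python
from itertools import product

# Hypothetical priors: P(factor node is in a good state).
PRIOR_GOOD = {"architecture": 0.90, "environment": 0.80, "paradigm": 0.95}

# Hypothetical CPT: P(target good | number of factor nodes that are good).
P_TARGET_GOOD_GIVEN_COUNT = {3: 0.999, 2: 0.95, 1: 0.70, 0: 0.20}


def prob_target_good(evidence=None):
    """Return P(target good | evidence) by enumerating all factor states.

    evidence maps a factor name to True (good) or False (bad).
    """
    evidence = evidence or {}
    names = list(PRIOR_GOOD)
    joint = 0.0  # accumulates P(target good, evidence)
    norm = 0.0   # accumulates P(evidence)
    for states in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, states))
        # Skip assignments that contradict the observed evidence.
        if any(assignment[name] != value for name, value in evidence.items()):
            continue
        weight = 1.0
        for name in names:
            p = PRIOR_GOOD[name]
            weight *= p if assignment[name] else 1.0 - p
        norm += weight
        joint += weight * P_TARGET_GOOD_GIVEN_COUNT[sum(states)]
    return joint / norm
```

Calling prob_target_good() gives the predicted probability before anything is observed; passing evidence such as {"architecture": False} shows how the prediction is updated once the state of a node becomes known during development.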

5. Current status

Right now we are building a Bayes net for the TCP/IP layer of the ADS application architecture. We are trying to come up with a set of reasonable nodes. For each node, work is being done to determine its possible states and probability distribution.

I am thinking about building the reliability Bayes net recursively; that is, every software layer's reliability depends on the reliability of the layer one level down, its own reliability, and the degree of interface complexity between itself and the layer below. For each of the three categories mentioned above, we can first list all possible nodes. We can then simply select the necessary nodes from the node list of a specific category.
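The recursive idea can be sketched in Python as follows. This is only a simplified stand-in for the Bayes net: it assumes that a layer is good only if its own code, its interface to the layer below, and the layer below are all good, and that these are independent, so the probabilities simply multiply. The class, the field names, and the numbers in the test are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layer:
    name: str
    own_reliability: float               # P(this layer's own code is good)
    interface_reliability: float = 1.0   # P(interface to the layer below is good)
    below: Optional["Layer"] = None      # next layer down, None for the bottom layer


def stack_reliability(layer: Optional[Layer]) -> float:
    """Recursively compute P(layer and everything beneath it are good).

    Assumes independence, so the result is a plain product (series system).
    """
    if layer is None:
        return 1.0  # nothing below the bottom layer can fail
    return (
        layer.own_reliability
        * layer.interface_reliability
        * stack_reliability(layer.below)
    )
```

A more complex interface would be modeled here by a lower interface_reliability value. In a full Bayes net, the product would be replaced by a conditional probability table for each layer node.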

The allowed down time of the S/W on the Master processor is supposed to be 124 minutes (per year?) according to the "Processor and I/F Card Availability" slide. As one of the 5 layers, the TCP/IP layer must have a much lower down time, about 24.8 minutes, assuming every layer has the same reliability. That means the minimum reliability of the TCP/IP layer is 0.999952811 (the fifth root of 1 - 124/525600, if the 124 minutes are per year).
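The arithmetic behind this budget can be checked with a short Python sketch. It assumes that the 124 minutes are per year (365 days), that the 5 layers are independent and equally reliable, and that the S/W is up only when all layers are up; the function and constant names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525600, assuming a 365-day year


def per_layer_availability(allowed_downtime_min: float, layers: int) -> float:
    """Split a yearly down time budget equally across independent layers.

    The system is up only if every layer is up, so the per-layer value
    is the n-th root of the system value.
    """
    system = 1.0 - allowed_downtime_min / MINUTES_PER_YEAR
    return system ** (1.0 / layers)


layer_value = per_layer_availability(124, 5)
# Yearly down time implied for one layer, in minutes.
layer_downtime_min = (1.0 - layer_value) * MINUTES_PER_YEAR
```

Under these assumptions the per-layer down time comes out very close to 124/5 minutes, because the fifth root and the simple division nearly coincide for values this close to 1.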

Some data are available from [4], which discusses the reliability of a commercial telecommunications system. Briand's report[5] on metrics in the early phases of software development needs to be studied. There are also papers on communication and OO design patterns in software development.

6. Future work

The net for the TCP/IP layer is being refined. Data for various probabilities are being collected.

Since there are many layers in ADS's software architecture and each layer will contribute to the reliability of the entire software, we would like to model other layers and eventually the entire software.


References:
[1] J. Musa, A. Iannino, and K. Okumoto, Software Reliability: Measurement, Prediction, Application, Professional Edition, McGraw-Hill, 1990.
[2] E. Charniak, Bayesian Networks without Tears, AI Magazine, 12(4):50-63, 1991.
[3] J.P. Hudepohl, S.J. Aud, T.M. Khoshgoftaar, et al., Integrating Metrics and Models for Software Risk Assessment, ?:93-98, 1996.
[4] M. Kaaniche, K. Kanoun, Reliability of a Commercial Communications System, ?:207-212, 1996.
[5] L. Briand, S. Morasca, V.R. Basili, Defining and Validating High-Level Design Metrics, CS-TR-3301, University of Maryland, College Park.
[6] H. Ziv, D.J. Richardson, Constructing Bayesian-network Models of Software Testing and Maintenance Uncertainties, UCI-ICS TR 97-23, University of California, Irvine, 1997.