EXTENDING RT-MINIX WITH FAULT TOLERANCE CAPABILITIES
Pablo J. Rogina
Universidad de Buenos Aires, Fac. de Cs. Exactas y Naturales, Depto. de Computación,
Pabellón I - Ciudad Universitaria, Buenos Aires, Argentina, (1428)
and
Gabriel Wainer
Universidad de Buenos Aires, Fac. de Cs. Exactas y Naturales, Depto. de Computación,
Pabellón I - Ciudad Universitaria, Buenos Aires, Argentina, (1428)
ABSTRACT
The MINIX operating system was extended with real-time services, ranging from A/D drivers to new scheduling
algorithms and statistics collection. A testbed was constructed to test several sensor replication techniques in order to
implement and verify several robust sensing algorithms. As a result, new services enhancing fault tolerance for
replicated sensors were also provided within the kernel. The resulting OS offers new features such as real-time task
management (for both periodic and aperiodic tasks), clock resolution handling, and sensor replication manipulation.
Keywords: Fault Tolerance, Operating Systems, Real-time Systems, Sensing Algorithms, Sensor Replication.
SUMMARY
The MINIX operating system was extended with real-time services, from A/D device drivers to
new scheduling algorithms and statistics collection. A test bench was built to try several
sensor replication techniques in order to implement and verify various robust sensing algorithms. As a
result, new services that improve the fault tolerance of replicated sensors were also provided within the OS
kernel. The resulting OS offers new features such as real-time task management
(for either periodic or aperiodic tasks), clock resolution handling, and sensor replication
management.
Keywords: Fault Tolerance, Operating Systems, Real-time Systems, Sensing Algorithms,
Sensor Replication.
1 INTRODUCTION
Computing systems are now part of almost every human activity. In particular, real-time systems (those whose
correctness depends not only on the results obtained, but also on the time at which these results are produced) are
present in increasingly complex tasks, where an error can lead to catastrophic situations (even endangering
human life). Therefore, fault tolerance capabilities are critical to the success of this kind of system throughout its
life cycle. Although fault tolerance strategies have been developed for a long time, they were oriented mainly
to distributed systems.
These systems range from microcontrollers in automobile engines to very complex applications, such as aircraft
flight control or process control in manufacturing plants. Nonetheless, most real-time systems consist of a control system
and a controlled system. Information about the environment is provided by sensors, and the system can in turn modify
the state of the environment through actuators. Take, for example, a simple manufacturing process: a water tank must
have its temperature and pH kept within a certain range; this is a basic control process (see Fig. 1). The environment is the
controlled system, and a computer must maintain the temperature and balance the pH. The control system must
monitor the environment using sensors (a thermometer and a pH meter in this case), and it changes the
environment by means of another type of component: actuators (in this example, a heater and an acid injector).
A control process repeats the following steps, subject to time constraints:
Sensing: the real-world status must be known (by measuring temperature and pH value).
Controlling: the real-world values must be checked. Temperature and pH should be within certain limits (lower and upper).
Acting: the real-world status may need to be changed, for example by turning the heater on to raise the temperature until
the required value is reached.
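The sensing, controlling and acting steps can be illustrated with a minimal sketch for the water tank example. It is only an illustration and not part of RT-MINIX: the names Limits and control_step, the limit values, and the assumption that the acid injector lowers the pH are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Limits:
    """Allowed range for one physical variable (values are illustrative)."""
    lower: float
    upper: float


def control_step(temperature, ph, temp_limits, ph_limits):
    """One cycle: take sensed values, check them, return actuator commands.

    Sensing is represented by the two arguments; a real system would read
    them from the thermometer and the pH meter within its time constraints.
    """
    # Controlling: compare each sensed value against its limits.
    too_cold = temperature < temp_limits.lower
    too_alkaline = ph > ph_limits.upper
    # Acting: decide what each actuator should do in this cycle.
    # Assumption: injecting acid is the way to bring the pH down.
    return {"heater": too_cold, "acid_injector": too_alkaline}


if __name__ == "__main__":
    temp_limits = Limits(lower=20.0, upper=25.0)   # hypothetical limits
    ph_limits = Limits(lower=6.5, upper=7.5)       # hypothetical limits
    print(control_step(18.0, 7.0, temp_limits, ph_limits))
```

In a real-time setting this function would be invoked once per period, and a late result would count as a failure even if the computed commands were correct.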
them to fail.
Timing faults can be classified as:
Transient: they happen once and then disappear. If the task or action is repeated, the fault does not occur again.
Intermittent: they appear, disappear, and then reappear. This behavior makes them hard to diagnose.
Permanent: they remain present until the failed component is replaced or repaired.
This work presents the effort to build a programming environment for real-time systems. The work is
based on a modification of the MINIX operating system, so that the results can be used for educational purposes. Sensor
replication schemes were included in the kernel, providing fault tolerance when sensing values from the real world.
The work is organized as follows: Section 2 describes the extensions made to the RT-MINIX operating system. Section 3
is devoted to fault tolerance capabilities related to sensing algorithms and sensor replication, while the sensing algorithms
are presented in Section 4. Static and dynamic tests are discussed in Sections 5 and 6, respectively. Finally,
conclusions and proposals for future work are given in Section 7.
user queues can be joined, allowing File System (FS) and Memory Manager (MM) processes to be moved from server to
user process category. An in-depth analysis was made to check the possibility of deadlock between FS and MM, first
revisiting their semantics and then trying to measure the impact of the new scheduler (with the joined queues);
the analysis showed that deadlocks cannot occur with the changed scheduler.
Once the OS was extended with real-time services, the need arose for several measuring tools to follow
the evolution of the executing tasks under the different scheduling strategies. The impact of the different
workloads should also be considered. To this end, the kernel maintains a new data structure that is accessible to the user
via a system call. Statistics can also be monitored online by means of a function key that displays all that information.
MINIX proved to be a feasible testbed for OS development, and real-time extensions could be easily added to it.
This “new” operating system (a MINIX 2.0 base with real-time extensions) has a rich set of features, which makes it a
good choice for conducting real-time experiments. The added real-time services cover several areas:
Task creation: tasks can be created as either periodic or aperiodic, stating their period, worst-case execution time and priority.
Clock resolution management: the resolution (grain) of the internal clock can be changed to get better accuracy while
scheduling tasks.
Scheduling algorithms: both Rate Monotonic Scheduling (RMS) and Earliest Deadline First (EDF) are supported, and can be
selected on the fly.
Statistics: several variables about the whole operation are accessible to the user to provide data for benchmarking and
testing new developments.
Supervisory Control and Data Acquisition: as a user application, it makes full use of the real-time services.
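The difference between the two scheduling policies listed above can be seen in a small sketch of how the next task might be chosen from a ready queue. This is not the RT-MINIX scheduler code: the Task record, its field names and the tick values are assumptions made only for illustration. RMS gives the highest priority to the task with the shortest period, while EDF picks the task whose absolute deadline is nearest.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical descriptor of a ready periodic task."""
    name: str
    period: int          # ticks between consecutive releases
    next_deadline: int   # absolute deadline of the current job, in ticks


def pick_rms(ready):
    """Rate Monotonic: static priority, shortest period runs first."""
    return min(ready, key=lambda task: task.period)


def pick_edf(ready):
    """Earliest Deadline First: dynamic priority, nearest deadline runs first."""
    return min(ready, key=lambda task: task.next_deadline)


if __name__ == "__main__":
    ready = [
        Task("sampler", period=10, next_deadline=25),
        Task("logger", period=20, next_deadline=22),
    ]
    print("RMS picks:", pick_rms(ready).name)
    print("EDF picks:", pick_edf(ready).name)
```

With the values above the two policies disagree: the sampler has the shorter period, but the logger has the nearer deadline, which is the kind of situation where selecting the algorithm on the fly changes the resulting schedule.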
4 SENSING ALGORITHMS
The algorithms selected to be implemented under RT-MINIX were taken from [1], and are described below:
Algorithm: Approximate-agreement
Input: A set of sensors, each with a value.
Output: A set of sensors, each with a new value converging toward a common value.
Step 1: each sensor broadcasts its value.
Step 2: each sensor receives the values from the other sensors and sorts the values into vector v.
Step 3: the lowest τ values and the highest τ values are discarded from v at each sensor.
Step 4: each sensor forms a new vector v' by taking the remaining values v[i*τ] where i=0,1,... (the smallest remaining
value and every τ-th remaining value after it, in order).
Step 5: the new value is the mean of the values in v'.
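Steps 2 to 5 above, as executed at a single sensor after the broadcast, can be sketched as follows. This is a minimal illustration and not the RT-MINIX kernel code: the function name and the handling of τ=0 (keeping every value) are assumptions.

```python
def approximate_agreement(received, tau):
    """One round of approximate agreement at one sensor.

    received: values collected from all sensors (Step 1 already done).
    tau: number of values discarded at each end of the sorted vector.
    """
    if tau < 0 or len(received) <= 2 * tau:
        raise ValueError("more than 2*tau values are required")
    v = sorted(received)                  # Step 2: sort into vector v
    remaining = v[tau:len(v) - tau]       # Step 3: drop tau lowest and tau highest
    step = max(tau, 1)                    # assumption: tau = 0 keeps every value
    v_prime = remaining[::step]           # Step 4: smallest, then every tau-th value
    return sum(v_prime) / len(v_prime)    # Step 5: mean of v'


if __name__ == "__main__":
    # Five readings with one outlier, tau = 1.
    print(approximate_agreement([4.7, 1.6, 3.0, 1.8, 3.0], tau=1))  # about 2.6
```

Because the τ lowest and τ highest values are discarded before averaging, up to τ arbitrarily wrong readings cannot pull the result outside the range spanned by the correct ones.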
sensors are faulty.
5 STATIC TESTS
To verify that the algorithms were implemented properly, a set of tests was conducted. As a first step, data was
used "statically", that is, hard-coded in the test programs. The set of values used in the first test was the same as that
presented in [1] and is shown in Table 1. It simulates a set of 5 sensors, one of them faulty and thus providing a
different value each time a reading is made. This set of sensors can be thought of as belonging to a robotic arm,
providing information about the position of the arm's elbow, for example. The measured angle is expressed as a value with a
tolerance (both plus and minus). Those ranges embody the concept of an abstract sensor: "a set of values that contains the
physical variable of interest" [4].
Case  S1         S2         S3         S4         S5
1     4.7 ± 2.0  1.6 ± 1.6  3.0 ± 1.5  1.8 ± 1.0  3.0 ± 1.6
2     4.7 ± 2.0  1.6 ± 1.6  3.0 ± 1.5  1.8 ± 1.0  1.0 ± 1.6
3     4.7 ± 2.0  1.6 ± 1.6  3.0 ± 1.5  1.8 ± 1.0  2.5 ± 1.6
4     4.7 ± 2.0  1.6 ± 1.6  3.0 ± 1.5  1.8 ± 1.0  0.9 ± 1.6
Table 1 - Sensors and their broadcast values [1]
Each of the algorithms shown above was applied to all four cases in Table 1. In every case, the number of
sensors is 5, and the number of sensors with intermittent failures is 1. These conditions preserve the effectiveness of the
algorithms (because 1<5/2). The results obtained by our own version of the algorithms running under RT-MINIX were the
same as those stated in [1], thus validating our implementation.
The algorithms were also tested using another set of values, this time taken from [2]. Fig. 2 shows both the set of values
and the expected results.
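To make the abstract sensor concept concrete, each "value ± tolerance" reading can be turned into a closed interval, and the intervals can then be swept to find where enough sensors agree. The sketch below is an interval-overlap sweep in the spirit of [4]; it is not necessarily the procedure implemented in RT-MINIX, and the function names are hypothetical. It returns the hull of all points covered by at least a given number of intervals, so separate agreeing regions would be merged into one range.

```python
def to_interval(value, tolerance):
    """Abstract sensor: the closed interval [value - tolerance, value + tolerance]."""
    return (value - tolerance, value + tolerance)


def agreement_hull(intervals, min_overlap):
    """Hull (lo, hi) of the points covered by at least min_overlap intervals.

    Returns None when no point is covered by that many intervals.
    """
    START, END = 0, 1   # START sorts first, so touching intervals count as overlapping
    events = []
    for lo, hi in intervals:
        events.append((lo, START))
        events.append((hi, END))
    events.sort()

    count = 0
    hull_lo = hull_hi = None
    for x, kind in events:
        if kind == START:
            count += 1
            if count >= min_overlap and hull_lo is None:
                hull_lo = x          # first point with enough agreement
        else:
            if count >= min_overlap:
                hull_hi = x          # last point seen so far with enough agreement
            count -= 1
    return None if hull_lo is None else (hull_lo, hull_hi)


if __name__ == "__main__":
    # Case 1 of Table 1, with N = 5 sensors and at most 1 faulty one,
    # so at least N - 1 = 4 intervals must overlap.
    case1 = [(4.7, 2.0), (1.6, 1.6), (3.0, 1.5), (1.8, 1.0), (3.0, 1.6)]
    intervals = [to_interval(v, t) for v, t in case1]
    print(agreement_hull(intervals, min_overlap=4))
```

For case 1 the sweep yields a range of roughly 1.5 to 3.2, which is narrower than the span of the input intervals (0.0 to 6.7).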
6 DYNAMIC TESTS
After the algorithms had been successfully verified with static data, the next step was to test them once
again, this time with dynamic data, i.e. data that varies from test to test. To provide the algorithms with such sets of
values, a device was built: four linear 100 kΩ potentiometers were connected, one to each of the four resistive inputs on the
game port of a PC. This testbed uses one of the recent real-time services available in RT-MINIX (Analog/Digital
conversion capabilities through the joystick driver).
This time the potentiometers can be thought of as sensors for a valve in a pipeline, providing information about the valve
position, where the minimum value represents the valve totally closed and the maximum value represents the
valve totally open. The wiring diagram for the testbed is shown in Fig. 4.
An auxiliary program was written to read the four inputs simultaneously and show the values on screen. This application
is used to adjust the "sensors" to the desired values, which allows a faulty one to be simulated by positioning it out of
range of the remaining ones (for this test, N=4 and τ=1).
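The dynamic setup can be imitated in a few lines: raw readings are mapped to a valve opening between 0.0 (totally closed) and 1.0 (totally open), and the four openings are combined while discarding the τ lowest and τ highest values. This is only a sketch under stated assumptions: the raw range of 0 to 1023, the sample readings and the function names are hypothetical and do not come from the RT-MINIX joystick driver.

```python
def valve_opening(raw, raw_min, raw_max):
    """Map a raw reading to 0.0 (closed) .. 1.0 (open), clamped to that range."""
    span = raw_max - raw_min
    if span <= 0:
        raise ValueError("raw_max must be greater than raw_min")
    return min(1.0, max(0.0, (raw - raw_min) / span))


def fuse(readings, tau):
    """Mean of the readings left after dropping the tau lowest and tau highest."""
    if tau < 0 or len(readings) <= 2 * tau:
        raise ValueError("more than 2*tau readings are required")
    v = sorted(readings)
    kept = v[tau:len(v) - tau]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    RAW_MIN, RAW_MAX = 0, 1023          # hypothetical converter range
    raw = [510, 498, 505, 990]          # fourth "sensor" set out of range on purpose
    openings = [valve_opening(r, RAW_MIN, RAW_MAX) for r in raw]
    print(fuse(openings, tau=1))        # N = 4, tau = 1 as in the test above
```

With N=4 and τ=1 only the two middle readings contribute, so the deliberately misplaced potentiometer has no influence on the fused value.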
After the implementation steps and tests were finished, some comparisons could be drawn:
Development: none of the algorithms posed difficulties in its implementation.
Response time: no evident differences in response time were found among the algorithms.
Results: Approximate Agreement (AA) and Fast Convergence (FC) return a value, while Optimal Region (OR) returns a range,
and Brooks-Iyengar Hybrid (BIH) returns a range plus a value. OR and BIH give
answers within a narrower range than the input data. As several dynamic tests were performed, with the model adjusted to
different situations, it was found that the answer from AA always fell inside the range returned by
OR and BIH, and that the broader the range, the larger the difference between this value and the answer from BIH.
7 CONCLUSION
Fault tolerance, as a key discipline with growing use in real-time systems, provides several techniques and schemes
that can and must be used in different areas of such systems: from specification languages and temporal logic in the
definition steps to scheduling and the replication of sensors and actuators in the implementation steps.
This work described how the real-time extensions to the MINIX operating system, which transformed it into RT-MINIX, have
been complemented with fault-tolerant sensing algorithms to allow the development of applications that take advantage of
such services provided by the operating system kernel. With these extensions, RT-MINIX can be used as a
platform for real-time processing or as a starting point for adding more real-time services. Robust sensing algorithms
were implemented and tested under RT-MINIX, and are now available as a service to applications that have to deal with
sensor replication.
Future work may include extending the sensing algorithms to deal with multidimensional sensors (replacing each
interval corresponding to a physical value by a vector of intervals). Fault-tolerant schedulers must be studied and
integrated in a future version of RT-MINIX, providing the programmer with a specialized and improved fault-tolerant
environment.
ACKNOWLEDGEMENTS
This work was partially supported by the UBA-SECYT research project TX-004, "Concurrency in Distributed Systems".
All the related source code can be obtained at http://www.dc.uba.ar/people/proyinv/cso/rt-minix together with
downloading and installation instructions.
REFERENCES
1. Brooks, R. and Iyengar, S. Robust Distributed Computing and Sensing Algorithm. IEEE Computer, pp 53-60, June
1996.
2. Jayasimha, D. Fault Tolerance in a Multisensor Environment. Dept. of Computer Science, The Ohio University,
May 1994.
3. Laprie, J. Dependable Computing and Fault Tolerance: Concepts and Terminology. 15th Annual Int. Symposium on
Fault-Tolerant Computing, pp 2-11, June 1985.
4. Marzullo, K. Tolerating failures of continuous-valued sensors. ACM Transactions on Computer Systems, 8(4):284-304,
November 1990.
5. Paulik, V. Joystick device driver for Linux. Source code and installation details available online at
ftp://atrey.karlin.mff.cuni.cz/pub/linux/joystick/joystick-0.8.0.tar.gz
6. Rogina, P. and Wainer, G. New Real-Time Extensions to the MINIX operating system. Proc. of 5th Int. Conference
on Information Systems Analysis and Synthesis (ISAS'99), August 1999.
7. Tanenbaum, A. A Unix clone with source code for operating systems courses. ACM Operating Systems Review,
21:1, January 1987.
8. Wainer, G. Implementing Real-Time Scheduling in a Time-Sharing Operating System. ACM Operating Systems
Review, July 1995.