CS 410/510 - Software Engineering

Resilience Engineering

Reference: Sommerville, Software Engineering, 10th ed., Chapter 14

The big picture

The resilience of a system is a judgment of how well that system can maintain the continuity of its critical services in the presence of disruptive events, such as equipment failure and cyberattacks. This view encompasses these three ideas:

Resilience engineering places more emphasis on limiting the number of system failures that arise from external events such as operator errors or cyberattacks. Assumptions:

Four related resilience activities are involved in the detection of, and recovery from, system problems:

Cybersecurity

Cybercrime is the illegal use of networked systems and is one of the most serious problems facing our society. Cybersecurity is a broader topic than system security engineering. Cybersecurity is a socio-technical issue covering all aspects of ensuring the protection of citizens, businesses, and critical infrastructures from threats that arise from their use of computers and the Internet. Cybersecurity is concerned with all of an organization's IT assets from networks through to application systems.

Factors contributing to cybersecurity failure:

Cybersecurity threats:

Examples of controls to protect the assets:

Redundancy and diversity are valuable for cybersecurity resilience:

Cyber resilience planning:

Socio-technical resilience

Resilience engineering is concerned with adverse external events that can lead to system failure. To design a resilient system, you have to think about socio-technical systems design and not exclusively focus on software. Dealing with these events is often easier and more effective in the broader socio-technical system.

Four characteristics that reflect the resilience of an organization:

The ability to respond
Organizations have to be able to adapt their processes and procedures in response to risks. These risks may be anticipated risks or may be detected threats to the organization and its systems.
The ability to monitor
Organizations should monitor both their internal operations and their external environment for threats before they arise.
The ability to anticipate
A resilient organization should not simply focus on its current operations but should anticipate possible future events and changes that may affect its operations and resilience.
The ability to learn
Organizational resilience can be improved by learning from experience. It is particularly important to learn from successful responses to adverse events, such as the effective resistance to a cyberattack. Learning from success allows good practice to be disseminated throughout the organization.

People inevitably make mistakes (human errors) that sometimes lead to serious system failures. There are two ways to consider human error:

Systems engineers should assume that human errors will occur during system operation. To improve the resilience of a system, designers have to think about the defenses and barriers to human error that could be part of a system. Can these barriers be built into the technical components of the system (technical barriers)? If not, they could be part of the processes, procedures and guidelines for using the system (socio-technical barriers).
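As an illustration of a technical barrier, the sketch below checks an operator-entered value before it reaches the system. It is a minimal example under stated assumptions, not a method from the reference: the names check_setpoint and BarrierResult, the safe range of 0 to 100, and the 10-unit step limit are all hypothetical. The idea is that an obviously wrong input is rejected outright, while an unusually large change needs an explicit confirmation from the operator.

```python
from dataclasses import dataclass

# Hypothetical safe operating range; a real system would take these
# limits from its own requirements.
MIN_SETPOINT = 0.0
MAX_SETPOINT = 100.0


@dataclass
class BarrierResult:
    accepted: bool
    reason: str


def check_setpoint(value: float, current: float,
                   confirmed: bool = False,
                   max_step: float = 10.0) -> BarrierResult:
    """Technical barrier against operator input errors.

    Rejects values outside the safe range, and requires explicit
    confirmation for changes larger than max_step.
    """
    # Range check: catches slips such as typing 500 instead of 50.
    # A NaN also fails this comparison and is therefore rejected.
    if not (MIN_SETPOINT <= value <= MAX_SETPOINT):
        return BarrierResult(False, "value outside safe range")

    # Large changes are allowed, but only when the operator confirms.
    if abs(value - current) > max_step and not confirmed:
        return BarrierResult(False, "large change requires confirmation")

    return BarrierResult(True, "ok")
```

A check of this kind covers only errors that can be detected mechanically. A confirmed but mistaken change still gets through, which is where socio-technical barriers such as a second-person review procedure come in.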

Defensive layers have vulnerabilities: they are like slices of Swiss cheese with holes in the layer corresponding to these vulnerabilities. Vulnerabilities are dynamic: the 'holes' are not always in the same place and the size of the holes may vary depending on the operating conditions. System failures occur when the holes line up and all of the defenses fail.
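The Swiss cheese model can be made concrete with a small calculation. The sketch below assumes that each defensive layer has some probability of having a 'hole' in the path of a given adverse event, and that the layers fail independently of one another. Independence is a simplifying assumption made here for illustration and is not stated in the reference; the function names and the probabilities in the usage note are likewise hypothetical.

```python
import random


def failure_probability(hole_probs):
    """Probability that every layer fails, assuming independent layers.

    hole_probs[i] is the assumed probability that layer i has a hole
    in the path of the event.
    """
    p = 1.0
    for h in hole_probs:
        p *= h
    return p


def simulate(hole_probs, trials=100_000, seed=0):
    """Estimate the same probability by random simulation.

    In each trial every layer independently either blocks the event or
    lets it through; a system failure is counted only when the holes in
    all layers line up.
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        if all(rng.random() < h for h in hole_probs):
            failures += 1
    return failures / trials
```

For example, three layers with hole probabilities 0.1, 0.2 and 0.5 give failure_probability([0.1, 0.2, 0.5]) of about 0.01, much lower than for any single layer. Because vulnerabilities are dynamic, the per-layer probabilities would change with operating conditions, and real layers may share causes of failure, in which case the independence assumption makes the result optimistic.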

Strategies to increase system resilience:

Resilient systems design

Designing systems for resilience involves two streams of work:

Survivable systems analysis