BTR for CPS

Problem#

Distributed cyber-physical systems, such as vehicles, factories, or warehouses, have become increasingly connected both internally and with the rest of the world. This can provide some great benefits, a prime example of which can be driver assist features that prevent accidents. However, as these distributed systems grow in complexity and connectivity, they become increasingly likely to experience both benign faults and attacks by adversaries. The goal of this project is to provide resource-efficient techniques by which we can find these faults and recover from them, in a predetermined bounded amount of time.

Why not use something else#

There are a number of well-established techniques to protect systems from faults. In the realm of benign faults in traditional distributed computing systems (like data centers), one can deploy Paxos or RAFT to prtect against benign faults. If Byzantine faults are a concern, then other options like PBFT are possible. However, the key problem with these approaches is that they require significant resources that many CPS simply do not have. PBFT, for instance, requires 3f+1 replicas to be able to mask the presence of f faults. BTR techniques, in contrast, are able to provide useful security with only f+1 replicas.

Approach#

Our approach involves leveraging the inertia that is inherently present in CPS. That is, because CPS have inertia, there is a brief period of time after a fault occurs during which the fault starts to affect the system, but a catastrophe has not yet occurred. If a security technique can guarantee that it will discover the source of the fault and take corrective action such that the system can return to normal behavior in a bounded amount of time, where this bounded time is dependent on the inertia of the system, then we can provide similar levels of security performance as BFT, but at much less cost.

To illustrate, consider the example in the following figure. Four actuators (alarm, burner, valve, and monitor) are controlled using four data flows that take input from one or both of the sensors S1 and S2

(a) A small example system in which the state of a chemical vat must be controlled. (b) There are four applications that run in this system, using data from the two sensors to control each of the four actuators. (c) Timing properties of of each of the four application.

A BTR technique will then provide the ability for the system to reconfigure the data flows as nodes in the set {N1,..,N4} fail. See, e.g., the below figure. As additional faults occur, lower-criticality tasks (and their replicas) are dropped by the system. Here, the small numbered squares are the tasks, with grey boxes containing the same number as a colored counterpart indicating that the task is a replica. In total, the primary task and its replicas will represent f+1 total copies of a task that run on unique nodes.

When using BTR, the system can safely reconfigure data flows to ensure that the state of the chemical vat continues to remain in a safe range despite the occurrence of faults.

Sample Result#

In a simulated experiment of the Volvo XC90 (slightly modified) network, we show the cruise control system, where the system should maintain a speed of 65 mph. The figure shows (a) a no-fault scenario, (b) an attacked system with no defense, (c) an attacked system that uses REBOUND, and (d) a close-up of the behavior when the attack occurs. With REBOUND, we see that the system only misbehaves for tens of milliseconds before recovery completes.

Publications#

REBOUND: Defending Distributed Systems Against Attacks with Bounded‑Time Recovery. N. Gandhi, E. Roth, B. Sandler, A. Haeberlen, L.T.X. Phan. EuroSys, 2021.
Bounded Time Recovery for Distributed Real‑Time Systems N. Gandhi, E. Roth, R. Gifford, L.T.X. Phan, A. Haeberlen. RTAS, 2020.