BTR for CPS
#
ProblemDistributed cyber-physical systems, such as vehicles, factories, or warehouses, have become increasingly connected both internally and with the rest of the world. This can provide some great benefits, a prime example of which can be driver assist features that prevent accidents. However, as these distributed systems grow in complexity and connectivity, they become increasingly likely to experience both benign faults and attacks by adversaries. The goal of this project is to provide resource-efficient techniques by which we can find these faults and recover from them, in a predetermined bounded amount of time.
#
Why not use something elseThere are a number of well-established techniques to protect systems from faults. In the realm of benign faults in traditional distributed computing systems (like data centers), one can deploy Paxos or RAFT to prtect against benign faults. If Byzantine faults are a concern, then other options like PBFT are possible. However, the key problem with these approaches is that they require significant resources that many CPS simply do not have. PBFT, for instance, requires 3f+1 replicas to be able to mask the presence of f faults. BTR techniques, in contrast, are able to provide useful security with only f+1 replicas.
#
ApproachOur approach involves leveraging the inertia that is inherently present in CPS. That is, because CPS have inertia, there is a brief period of time after a fault occurs during which the fault starts to affect the system, but a catastrophe has not yet occurred. If a security technique can guarantee that it will discover the source of the fault and take corrective action such that the system can return to normal behavior in a bounded amount of time, where this bounded time is dependent on the inertia of the system, then we can provide similar levels of security performance as BFT, but at much less cost.
To illustrate, consider the example in the following figure. Four actuators (alarm, burner, valve, and monitor) are controlled using four data flows that take input from one or both of the sensors S1 and S2
A BTR technique will then provide the ability for the system to reconfigure the data flows as nodes in the set {N1,..,N4} fail. See, e.g., the below figure. As additional faults occur, lower-criticality tasks (and their replicas) are dropped by the system. Here, the small numbered squares are the tasks, with grey boxes containing the same number as a colored counterpart indicating that the task is a replica. In total, the primary task and its replicas will represent f+1 total copies of a task that run on unique nodes.
#
Sample Result#
Publications- REBOUND: Defending Distributed Systems Against Attacks with Bounded‑Time Recovery. N. Gandhi, E. Roth, B. Sandler, A. Haeberlen, L.T.X. Phan. EuroSys, 2021.
- Bounded Time Recovery for Distributed Real‑Time Systems N. Gandhi, E. Roth, R. Gifford, L.T.X. Phan, A. Haeberlen. RTAS, 2020.