Communication and Fault Tolerance in Parallel Computers (Thesis)
Abstract:
This thesis explores two fundamental issues in the design of
large-scale parallel computers: {it communication /} and {it fault
tolerance/}.
In Chapter 1, we introduce and provide motivation for the problems
that we study in this thesis.
Chapter 2 examines several simple algorithms for routing packets on
butterfly networks with bounded queues. Among other things, we show
that for any greedy queuing protocol, a routing problem in which each
of the $N$ inputs sends a packet to a randomly chosen output requires
$O(log N)$ steps, with high probability, provided that the queue size
is a sufficiently large, but fixed, constant.
In Chapter 3, we analyze the fault-tolerance properties of several
bounded-degree hypercubic networks that are commonly used for parallel
computation. Among other things, we show that an $N$-node butterfly
containing $N^{1-epsilon}$ worst-case faults (for any constant
$epsilon > 0$) can emulate a fault-free butterfly of the same size
with only constant slowdown. Similar results are proved for the
shuffle-exchange graph. Hence, these networks become the first
connected bounded-degree networks known to be able to sustain more
than a constant number of worst-case faults without suffering more
than a constant-factor slowdown in performance.
In Chapter 4, we study the ability of array-based networks to tolerate
faults. Among other things, we show that an $N imes N$ two-dimensional
array can
sustain $N^{1-epsilon}$ worst-case faults, for some fixed $epsilon <
1$, and still emulate a fully-functioning $N imes N$ array with only
constant slowdown.
In Chapter 5, we study a concurrent error detection scheme called
Algorithm Based Fault Tolerance (ABFT). Unlike the schemes
developed in Chapters 3 and 4 to
tolerate permanent faults, the scheme studied in this chapter is
primarily aimed at tolerating transient faults in a parallel
computer. The main contribution of this chapter is to propose a simple
and novel algorithm called RANDGEN to generate data-check relationships.
By simply varying its parameters, RANDGEN
can produce data-check relationships with a wide spectrum of
properties, many of which have been considered important in recent ABFT
designs.