Understanding Resource Usage and Performance in Wide-Area Distributed Systems
Abstract:
Many Internet services employ wide-area frameworks to deliver
exponentially growing network traffic to end users with low response
time. These systems typically leverage a large number of remote nodes
at the edge of the Internet, which makes the systems difficult to
develop and test. Federated testbeds are therefore essential
infrastructure for developing wide-area systems because they allow
researchers to deploy new services under realistic network conditions.
In this dissertation, we study resource usage in PlanetLab to
understand and characterize user behavior in federated testbeds. We
also present Lsync, a low-latency file transfer system for
coordinating remote nodes in wide-area platforms, including testbeds.

To support the development of new network services on a global scale,
the next generation of federated testbeds is under active
development, but very little is known about resource usage in these
shared infrastructures. We conduct an extensive study of the usage
profiles in PlanetLab that we collected for six years by running
CoMon, a PlanetLab monitoring service. We examine various aspects of
node-level behavior as well as experiment-centric behavior, and
describe their implications for resource management in federated
testbeds. We find that usage differs substantially from that of shared
compute clusters, that conventional wisdom does not hold for
PlanetLab, and that several properties of PlanetLab as a network
testbed are largely responsible for this difference.

We also present a low-latency file transfer system, Lsync, that can be
used as a synchronization building block for wide-area distributed
systems where latency matters. While many distributed systems depend
on fast data synchronization for coordinating remote nodes, current
data dissemination systems focus on efficiency for open client
populations, rather than focusing on completion latency for a known
set of nodes. In examining this problem, we find that optimizing for
latency produces strategies radically different from those of existing
distribution tools, and can dramatically reduce latency across a wide
range of scenarios. Lsync performs novel node selection, scheduling,
and adaptive policy switching that dynamically chooses the best
synchronization method using information available at runtime. Our
evaluation results show that Lsync reduces latency by more than a
factor of 14 compared to a widely used synchronization tool, and keeps
most remote nodes fully synchronized even under frequent file updates.