Pipelined Multithreading Transformations and Support Mechanisms (thesis)
Abstract:
Even though chip multiprocessors have emerged as the predominant
organization for future microprocessors, the multiple on-chip cores do not
directly result in improved application performance (especially for legacy
applications, which are predominantly sequential C/C++ codes).
Consequently, parallelizing applications to execute on multiple cores is
essential to their success. Independent multithreading techniques, like
DOALL extraction, create partially or fully independent threads, which
communicate rarely, if at all. While such strategies keep high
inter-thread communication costs from impacting program performance,
they cannot be applied to parallelize general-purpose applications which
are characterized by difficult-to-break recurrences. Even though cyclic
multithreading techniques, such as DOACROSS, are more applicable, the
cyclic inter-thread dependences created by these techniques cause them to
have very low tolerance to rising inter-core latencies.To address these problems, this work introduces a pipelined
multithreading (PMT) transformation called Decoupled Software Pipelining
(DSWP). DSWP, in particular, and PMT techniques, in general, are able to
tolerate inter-core latencies, while still handling codes with complex
recurrences. They achieve this by enforcing an acyclic communication
discipline amongst threads, which allow threads to use inter-thread queues
to communicate in a pipelined fashion. This dissertation demonstrates that
DSWPed codes not only tolerate inter-core communication costs, but also
effectively tolerate variable latency stalls in applications better than
single-threaded execution on both in-order and out-of-order issue
processors with comparable resources. It then performs a thorough analysis
of the performance scalability of automatically generated DSWPed codes and
identifies the conditions necessary to achieve peak PMT performance.Next, the dissertation shows that even though PMT applications tolerate
inter-core latencies well, the high frequency of inter-thread
communication (once every 5 to 20 dynamic instructions) in these codes,
makes them very sensitive to the intra-thread overhead imposed by
communication operations. In order to understand the issues surrounding
inter-thread communication for PMT applications, this dissertation
undertakes a methodical exploration of the design space of communication
support options for PMT. Three new communication mechanisms with varying
cost-performance tradeoffs are introduced and are shown to perform 38% to
200% better than the state of the art.