Parallel
computing is a mainstay of modern computation and information analysis and
management, ranging from scientific computing to information and data services.
The computer industry's inevitable and rapidly growing adoption of multi-core
parallel architectures within a single processor chip pushes explicit
parallelism to the forefront of computing at all scales and for all
applications, and makes the challenges of parallel programming and of
understanding parallel systems all the more crucial. Leaders of even the
largest desktop companies have highlighted programming parallel systems as one
of the three greatest challenges facing the computer industry.
This course caters to students from all departments who are interested in using
parallel computers of various scales to speed up the solution of problems. It
also caters to computer science and engineering students who want to understand
and grapple with the key issues in the design of parallel architectures and
software systems. A significant theme of the treatment of systems issues is
their implications for application software, drawing the connection between the
two and thereby making the systems issues relevant to users of such systems as
well. In addition to general parallel programming and systems issues, there
will be a significant focus on the modern trend toward increasingly parallel
multi-core processors within a single chip.
The first two thirds of the course will focus on the key issues in parallel
programming and architecture. In the last third, we will examine some advanced
topics, ranging from methods for tolerating latency, to programming models for
clustered commodity systems, to new classes of information applications and
services that strongly leverage large-scale parallel systems. Students will do a
parallel programming project, either with an application they propose from their
area of interest or with one that is suggested by the instructors. Students will
have dedicated access to two kinds of multi-core processor systems in addition
to large-scale multiprocessors for their projects.
Prerequisites: COS 318 and 475 or instructor's permission.
Textbook: Parallel Computer Architecture: A Hardware/Software Approach by David Culler and Jaswinder P. Singh, with Anoop Gupta. Morgan Kaufmann Publishers, 1998.
Week 1 (2/5): Overview of Parallel Architecture (Lecture notes)
Motivation for parallel systems; history of programming models, architectures and convergence to modern system design; fundamental design issues; trends in modern processor and communication architecture and in the usage of parallel computers.
Week 2 (2/12): Parallel Programs (Lecture notes)
A structured process for parallelizing applications, illustrated through some representative case studies. What parallel programs look like in three programming models: a shared address space, explicit message passing, and a model proposed for multicore processors. Synchronization and coordination methods for parallel programs, including multithreading and event-based pipeline models.
Week 3 (2/19): Shared Memory Multiprocessors (Lecture notes, warm-up parallel programming assignment)
Overview of small-scale cache-coherent shared address space multiprocessors that have a uniformly accessible memory system. Overview of cache coherence and memory consistency, and an introduction to the design space of protocols and their tradeoffs. How synchronization is implemented in such systems, and the implications for parallel software.
Week 4 (2/26): Project Proposal Presentations
Students discuss projects they plan to do for the course, chosen from their own ideas or ones that the instructors provide.
Week 4 (2/27): Invited Lecture on Multicore Processors by Prof. Kunle Olukotun (Lecture notes)
An in-depth look at the history, motivation and trends in on-chip parallel design at processor scale, namely the inevitable trend toward modern multi-core processors. Design issues for these systems, industrial case studies, and future directions.
Week 5 (3/5): Programming for Performance (Lecture notes; Christian Bienia's Tools Slides)
An exploration of the key issues in writing high performance parallel programs, following the stages of the structured process above. Focus on the shared address space model. Load balancing, data locality, communication cost, etc. An in-depth look at some case studies, and the implications for programming models.
Week 6 (3/12): Workload-driven Evaluation and Project Status
Issues in evaluating parallel systems and design tradeoffs using application workloads. Methods for scaling workloads and machines, for evaluating real systems, and for evaluating ideas and tradeoffs through simulation. Characterizing workloads for use in system evaluation.
Week 7 (3/26): Scalable Computers (Lecture notes)
The design of scalable systems, which involve physically distributing memory with processing nodes. Methods for realizing programming models in distributed-memory systems, and the relationship between support in the communication architecture and efficiency of realizing programming models. The implications of communication architecture support for the design of application software. Scalable synchronization methods.
Week 8 (4/2): Directory-based Cache Coherence (Lecture notes)
Supporting a cache-coherent shared address space on scalable systems with physically distributed memory. An overview of directory-based approaches, assessment of key tradeoffs and design challenges, and implications for the design of application software. Synchronization in such systems, and case studies of commercial realizations.
Week 9 (4/9): Latency Tolerance (Lecture notes)
Methods for tolerating the high latency of memory access and inter-processor communication, which unlike bandwidth and processing limitations is not solved by throwing more money at the problem. Trading bandwidth for latency, using techniques like precommunication, block data transfer, relaxed memory consistency models, and multithreading within a processing core.
Week 10 (4/16): Clusters and their Applications (cancelled)
Commodity-based systems that do not lend themselves well to supporting a cache-coherent shared address space, but that are increasingly important in practice and at large scale, for both scientific computing as well as scalable information services. Programming models for such systems, including a symmetric but non-coherent shared address space (using the Unified Parallel C example) and explicit message passing. Implications for application software.
Week 11 (4/23): Invited lecture by Dr. Andrew Birrell from Microsoft Research (Lecture notes)
Mutual Exclusion: Some History, Some Problems, and a Glimmer of Hope
Week 12 (4/30): Invited lecture by Prof. Kathy Yelick from UC Berkeley
5/15: Final project due
5/16: Final project presentations starting 1:30pm
There are three types of parallel machines available for you to use.
IMPORTANT NOTE: You have to change your password on the niagara
machine as soon as possible. The current default password is your
login name. To change your password, log into the machines following
the instructions below and type "passwd". The program will guide you through
the process and ask for your new password.
Three different types of shared-memory computing resources are offered to
allow you to work on your projects. All machines use a Unix operating system
and can be accessed using SSH. If you are working from a workstation which
also uses a Unix operating system, you can log into a machine called
"hostname" as follows:
ssh hostname
Replace "hostname" with the correct name of the machine (a full description
of all machines is given below). The computers do not share a common
filesystem; you have to copy any files you need to each machine yourself.
If you're working from a Unix workstation, you can use scp to copy
a file "my_project.tgz" as follows:
scp my_project.tgz hostname: (Note the colon at the end of "hostname")
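To unpack the archive on the machine and, later, to copy results back to your workstation, commands along the following lines should work (the file names here are placeholders for your own files):
ssh hostname
tar xzf my_project.tgz (run on the machine; if its tar does not accept the z flag, run gunzip on the archive first)
scp hostname:results.txt . (run on your workstation; note the dot, which means "copy into the current directory")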
SSH clients are also available for Windows, but you will have to install and
set them up yourself. We do not offer any support for that.
All machines have a pre-installed version of gcc which you can use to
compile your programs. While it is possible to write your programs directly
on the servers, we recommend that you work offline on your local workstation
using an editor or integrated development environment (IDE) which you are
familiar with. This will allow you to write and test your program in a more
convenient way. Only use the shared-memory computers for performance
experiments.
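For example, a pthreads program can typically be built with a command along these lines (the file and program names are placeholders; if your project uses OpenMP instead and the installed gcc supports it, replace -pthread with -fopenmp):
gcc -O2 -pthread -o my_program my_program.c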
Of the shared-memory resources which we offer, only "hecate" has a job
submission system which guarantees that no more than one program runs on a
set of CPUs at any time. On all other machines, it is possible that multiple
computationally intensive programs run at the same time. This is a problem
if you are running performance experiments and need the exact timing of the
program. To get accurate timing numbers, monitor the execution of your
program from another shell with "top" and re-run your program as needed.
"top" is an interactive tool which lists the resource requirements of all
running programs ordered by CPU time used. The column labeled "%CPU" shows
the share of CPU time that each program has received during the last
monitoring interval. Your program should have close to 100% CPU time. If
other CPU-intensive programs are running at the same time, your program will
get less CPU time and the other programs will show up at the beginning of
the list generated by "top". On Solaris, you can use "prstat".
To use the job submission system on hecate you have to write a job
submission script and use a small set of tools to manage your program runs.
You can use the following example script "run.cmd" as a template (adjust the
number of CPUs and other values as needed):
#PBS -l ncpus=4,walltime=1:00:00
#PBS -m abe
#PBS -M your_puid@princeton.edu
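# (The -l directive above requests 4 CPUs and a one-hour wall-clock limit;
#  -m abe asks for mail when the job begins, ends or aborts, and -M sets the
#  address that the mail is sent to.)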
#
# You can list all commands as you would use them in a shell, for example:
cd my_projects
# As the last command, simply execute your program as you would normally do,
# for example:
./my_program --threads=4
Submit your job as follows:
qsub run.cmd
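Besides qsub, the standard PBS client tools let you inspect and cancel jobs, for example (the job ID shown is only an illustration; qsub prints the real ID when you submit):
qstat (list queued and running jobs)
qstat -u your_login (show only your own jobs; replace your_login with your login name)
qdel 1234 (cancel the job with ID 1234)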
More explanations are available in the hecate
tutorial.
The same job control system is also used on hbar, but it requires a slightly different job submission script. Hbar is a "2-node cluster". The frontend, hbar.cs.princeton.edu, is the node which is publicly accessible. This node is intended for development work, test runs and job control. The second node is hidden and used as a dedicated compute node; it is intended for time-sensitive performance measurements. The job control system guarantees that submitted jobs have exclusive access to the resources specified in the job submission script, without any interference from other users.
You can use a job submission script such as the following:
#PBS -l nodes=1:ppn=8
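# (nodes=1:ppn=8 requests one node with eight processors on it)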
#PBS -m abe
#PBS -M your_puid@princeton.edu
#
cd my_projects
./my_program --threads=8
Job submission and control works like on hecate (see above).
A more detailed tutorial on the PBS job submission system can be found here.