Parallel
computing is a mainstay of modern computation and information analysis and
management, ranging from scientific computing to information and data services.
The computer industry's inevitable and rapidly growing adoption of multi-core
parallel architectures within a single processor chip pushes explicit
parallelism to the forefront of computing at all scales and for all
applications, and makes the challenges of parallel programming and of
understanding parallel systems all the more crucial. Leaders of even the
largest desktop companies have highlighted programming parallel systems as one
of the three greatest challenges facing the computer industry.
This course caters to students from all departments who are interested in using
parallel computers of various scales to speed up the solution of problems. It
also caters to computer science and engineering students who want to understand
and grapple with the key issues in the design of parallel architectures and
software systems. A significant theme of the treatment of systems issues is
their implications for application software, drawing the connection between the
two and thereby making the systems issues relevant to users of such systems as
well. In addition to general parallel programming and systems issues, there
will be a significant focus on the modern trend toward increasingly parallel
multi-core processors within a single chip.
The first two thirds of the course will focus on the key issues in parallel
programming and architecture. In the last third, we will examine some advanced
topics, ranging from methods for tolerating latency, to programming models for
clustered commodity systems, to new classes of information applications and
services that strongly leverage large-scale parallel systems. Students will do a
parallel programming project, either with an application they propose from their
area of interest or with one that is suggested by the instructors. Students will
have dedicated access to two kinds of multi-core processor systems in addition
to large-scale multiprocessors for their projects.
Prerequisites: COS 318 and 475 or instructor's permission.
Textbook: Parallel Computer Architecture: A Hardware/Software Approach by David Culler and Jaswinder P. Singh, with Anoop Gupta. Morgan Kaufmann Publishers, 1998.
Week 1 (2/5): Overview of Parallel Architecture (Lecture notes)
Motivation for parallel systems; history of programming models, architectures and convergence to modern system design; fundamental design issues; trends in modern processor and communication architecture and in the usage of parallel computers.
Week 2 (2/12): Parallel Programs (Lecture notes)
A structured process for parallelizing applications, illustrated through some representative case studies. What parallel programs look like in three programming models: a shared address space, explicit message passing, and a model proposed for multicore processors. Synchronization and coordination methods for parallel programs, including multithreading and event-based pipeline models.
Week 3 (2/19): Shared Memory Multiprocessors (Lecture notes, warm-up parallel programming assignment)
Overview of small-scale cache-coherent shared address space multiprocessors that have a uniformly accessible memory system. Overview of cache coherence and memory consistency, and an introduction to the design space of protocols and their tradeoffs. How synchronization is implemented in such systems, and the implications for parallel software.
Week 4 (2/26): Project Proposal Presentations
Students discuss projects they plan to do for the course, chosen from their own ideas or ones that the instructors provide.
Week 4 (2/27): Invited Lecture on Multicore Processors by Prof. Kunle Olukotun (Lecture notes)
An in-depth look at the history, motivation and trends in on-chip parallel design at processor scale, namely the inevitable trend toward modern multi-core processors. Design issues for these systems, industrial case studies, and future directions.
Week 5 (3/5): Programming for Performance (Lecture notes; Christian Bienia's Tools Slides)
An exploration of the key issues in writing high performance parallel programs, following the stages of the structured process above. Focus on the shared address space model. Load balancing, data locality, communication cost, etc. An in-depth look at some case studies, and the implications for programming models.
Week 6 (3/12): Workload-driven Evaluation and Project Status
Issues in evaluating parallel systems and design tradeoffs using application workloads. Methods for scaling workloads and machines, for evaluating real systems, and for evaluating ideas and tradeoffs through simulation. Characterizing workloads for use in system evaluation.
Week 7 (3/26): Scalable Computers (Lecture notes)
The design of scalable systems, which involve physically distributing memory with processing nodes. Methods for realizing programming models in distributed-memory systems, and the relationship between support in the communication architecture and efficiency of realizing programming models. The implications of communication architecture support for the design of application software. Scalable synchronization methods.
Week 8 (4/2): Directory-based Cache Coherence (Lecture notes)
Supporting a cache-coherent shared address space on scalable systems with physically distributed memory. An overview of directory-based approaches, assessment of key tradeoffs and design challenges, and implications for the design of application software. Synchronization in such systems, and case studies of commercial realizations.
Week 9 (4/9): Latency Tolerance (Lecture notes)
Methods for tolerating the high latency of memory access and inter-processor communication, which unlike bandwidth and processing limitations is not solved by throwing more money at the problem. Trading bandwidth for latency, using techniques like precommunication, block data transfer, relaxed memory consistency models, and multithreading within a processing core.
Week 10 (4/16): Clusters and their Applications (cancelled)
Commodity-based systems that do not lend themselves well to supporting a cache-coherent shared address space, but that are increasingly important in practice and at large scale, for both scientific computing as well as scalable information services. Programming models for such systems, including a symmetric but non-coherent shared address space (using the Unified Parallel C example) and explicit message passing. Implications for application software.
Week 11 (4/23): Invited lecture by Dr. Andrew Birrell from Microsoft Research (Lecture notes)
Mutual Exclusion: Some History, Some Problems, and a Glimmer of Hope
Week 12 (4/30): Invited lecture by Prof. Kathy Yelick from UC Berkeley
5/15: Final project due
5/16: Final project presentations starting 1:30pm
There are three types of parallel machines available for you to use.
IMPORTANT NOTE: You have to change your password on the niagara
machine as soon as possible. The current default password is your
login name. To change your password, log into the machines following
the instructions below and type "passwd". The program will guide you through
the process and ask for your new password.
Three different types of shared-memory computing resources are offered to
allow you to work on your projects. All machines use a Unix operating system
and can be accessed using SSH. If you are working from a workstation which
also uses a Unix operating system, you can log into a machine called
"hostname" as follows:
ssh hostname
Replace "hostname" with the correct name of the machine (a full description
of all machines is given below). The computers do not share a common
filesystem; you have to copy any files you need to each machine yourself.
If you're working from a Unix workstation, you can use scp to copy
a file "my_project.tgz" as follows:
scp my_project.tgz hostname: (Note the colon at the end of "hostname")
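To unpack the archive on the machine and, later, to copy results back to your workstation, commands along the following lines should work (the file names here are placeholders for your own files):
ssh hostname
tar xzf my_project.tgz (run on the machine; if its tar does not accept the z flag, run gunzip on the archive first)
scp hostname:results.txt . (run on your workstation; note the dot, which means "copy into the current directory")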
SSH clients are also available for Windows, but you will have to install and
set them up yourself. We do not offer any support for that.
All machines have a pre-installed version of gcc which you can use to
compile your programs. While it is possible to write your programs directly
on the servers, we recommend that you work offline on your local workstation
using an editor or integrated development environment (IDE) which you are
familiar with. This will allow you to write and test your program in a more
convenient way. Only use the shared-memory computers for performance
experiments.
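For example, a pthreads program can typically be built with a command along these lines (the file and program names are placeholders; if your project uses OpenMP instead and the installed gcc supports it, replace -pthread with -fopenmp):
gcc -O2 -pthread -o my_program my_program.c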
Of the shared-memory resources which we offer, only "hecate" has a job
submission system which guarantees that no more than one program runs on a
set of CPUs at any time. On all other machines, it is possible that multiple
computationally intensive programs run at the same time. This is a problem
if you are running performance experiments and need the exact timing of the
program. To get accurate timing numbers, monitor the execution of your
program from another shell with "top" and re-run your program as needed.
"top" is an interactive tool which lists the resource requirements of all
running programs ordered by CPU time used. The column labeled "%CPU" shows
the share of CPU time that each program has received during the last
monitoring interval. Your program should have close to 100% CPU time. If
other CPU-intensive programs are running at the same time, your program will
get less CPU time and the other programs will show up at the beginning of
the list generated by "top". On Solaris, you can use "prstat".
To use the job submission system on hecate you have to write a job
submission script and use a small set of tools to manage your program runs.
You can use the following example script "run.cmd" as a template (adjust the
number of CPUs and other values as needed):
#PBS -l ncpus=4,walltime=1:00:00
#PBS -m abe
#PBS -M your_puid@princeton.edu
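# (The -l directive above requests 4 CPUs and a one-hour wall-clock limit;
#  -m abe asks for mail when the job begins, ends or aborts, and -M sets the
#  address that the mail is sent to.)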
#
# You can list all commands as you would use them in a shell, for example:
cd my_projects
# As the last command, simply execute your program as you would normally do,
# for example:
./my_program --threads=4
Submit your job as follows:
qsub run.cmd
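Besides qsub, the standard PBS client tools let you inspect and cancel jobs, for example (the job ID shown is only an illustration; qsub prints the real ID when you submit):
qstat (list queued and running jobs)
qstat -u your_login (show only your own jobs; replace your_login with your login name)
qdel 1234 (cancel the job with ID 1234)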
More explanations are available in the hecate
tutorial.
The same job control system is also used on hbar, but it requires a slightly different job submission script. Hbar is a "2-node cluster". The frontend, hbar.cs.princeton.edu, is the node which is publicly accessible. This node is intended for development work, test runs and job control. The second node is hidden and used as a dedicated compute node; it is intended for time-sensitive performance measurements. The job control system guarantees that submitted jobs have exclusive access to the resources specified in the job submission script, without any interference from other users.
You can use a job submission script such as the following:
#PBS -l nodes=1:ppn=8
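# (nodes=1:ppn=8 requests one node with eight processors on it)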
#PBS -m abe
#PBS -M your_puid@princeton.edu
#
cd my_projects
./my_program --threads=8
Job submission and control works like on hecate (see above).
A more detailed tutorial on the PBS job submission system can be found here.