Proposal due: April 4, 2008.
Poster session: Tuesday, May 6, 2008, 11am-1pm, CS building "banana room" (outside room 105).
Final report due: May 13, 2008 (Dean's day).
For the final project, you are asked to explore the application of data analysis techniques to the data problem of your choice. The project is quite open ended. For instance, you can choose to study an algorithm and its variations in depth, making controlled experimental comparisons on various datasets. Or you can choose to study one particular data problem, giving special consideration to the unique properties of the problem domain, and testing one or more methods on it. Every project should involve the experimental application of at least one algorithm or technique to at least one dataset, although more than this minimal requirement will generally be expected. You may work individually or in groups of 2-3. We strongly encourage you to work in groups so as to be able to complete more ambitious projects.
By Friday, April 4, 2008, please turn in a brief, written description of what you propose to do for your final project. The proposal should be submitted in hard copy to the TAs. If you are working as a group, only one proposal need be turned in for the group (be sure to include everyone's name and email address). Feel free to come talk to us about your ideas. We will notify you once your project has been approved. At that time, we will also assign one of the two TAs to supervise your project, so that you will have a clear point of contact for getting assistance.
There will be a poster session on Tuesday, May 6, 2008. Presenting a poster is a required part of this project, although the poster itself will not be graded. See below for further information.
This project is due on Tuesday, May 13, 2008. Please make every effort to turn in your project on time. The final project cannot be turned in late, whether or not you are using "free" late days.
We strongly advise starting early on your project. Preparing data and running experiments can take a lot of time!
Every project must involve at least one algorithm or technique, and at least one dataset; your analysis should go beyond what has already been explored in this course. Your project can focus either on the algorithm or on the data. Every project should have a clearly defined purpose.
Here are some examples of possible types of projects:
These examples are only meant to provide a starting point. You are strongly encouraged to be creative and original in your choice of topic!
You will need to do some background reading on your topic to help you decide how to proceed, and to give context to your work.
Here are some places to look to get ideas for topics, and for background reading:
It is okay to do a project that is related to independent research that you are doing as part of your graduate study, junior project or senior thesis. In this case, you will need to carve out a project that is focused and relevant to this course. If this is part of a junior or senior project, please inform your advisor of this so as to recalibrate expectations. We emphasize that turning in a project based on previously completed research is not appropriate.
Every project must involve at least one dataset, ideally one which was not already used in the course. There are many interesting and freely available datasets that you can find with Google searches. Here are some of the other places you might look for data:
Although you are encouraged to use data that you find or gather on your own, it should go without saying that if you plan to use data that is private, confidential, classified, copyrighted, controlled, sensitive, etc., it is your responsibility to be sure that it is legally and ethically okay for you to use the data for the purposes of this project (including possibly sharing the data with the graders, should the need arise). Please do not use any data in any way that might be considered illegal, unethical, immoral or inappropriate.
You can implement your project using R, or you can use another software environment of your choice, such as Java, C or MATLAB. (You may also need to pre-process your data to get it into an appropriate format, for instance, using Perl or Python.) The project should involve a number of analyses, and a detailed exploration and summary of the results using visualization. Your analysis should keep in mind the overall purpose you have chosen for your project.
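As an illustration of the kind of pre-processing mentioned above, here is a minimal Python sketch that converts whitespace-separated records into CSV, a format that R and most other environments can read directly. The function name `whitespace_to_csv`, the column names, and the sample records are all hypothetical; your own data will almost certainly need different handling.

```python
import csv
import io


def whitespace_to_csv(lines, header):
    """Convert whitespace-separated records into CSV text.

    Blank lines and lines starting with '#' are skipped. A row whose
    field count does not match the header raises ValueError, so that
    malformed records are noticed rather than silently dropped.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for lineno, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) != len(header):
            raise ValueError(
                f"line {lineno}: expected {len(header)} fields, got {len(fields)}"
            )
        writer.writerow(fields)
    return out.getvalue()


# Made-up sample input: a comment line, two records, and a blank line.
raw = [
    "# height weight label",
    "1.70  65.0  a",
    "",
    "1.82  80.5  b",
]
csv_text = whitespace_to_csv(raw, ["height", "weight", "label"])
print(csv_text)
```

Failing loudly on a malformed row is a deliberate choice in this sketch: a record that is quietly dropped during pre-processing can distort every analysis that follows.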
You can use any of the library functions built into R, or that you download online, or you can use other publicly available software packages. If you use any software that you did not write yourself, please note this in your report, and, as with any project, demonstrate in your report that you understand how the underlying algorithm works.
If you implement code yourself, be aware as always that it can be tricky to be sure that this kind of program is actually working properly. Be sure that it is carefully tested before running your experiments. For instance, check the output of the program carefully on tiny datasets where you know what the output should be (for instance, you have computed it by hand, or you have found or implemented another program, say, in another language or using a different technique, that computes it for you). Also keep an eye out for clues that your program might have problems, for instance, if the results violate proven theorems or differ substantially from results in the published literature. Your report should include a brief description of what measures you took to be sure that your program is working properly.
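The checks described above might look like the following Python sketch. Least-squares line fitting is used purely as a stand-in for whatever algorithm you implement, and the function names and data values are made up for illustration. The sketch compares the output with a hand-computed answer on a tiny dataset, with a second implementation that uses a different technique, and with a property that is known to hold.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept via the centered closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_line_normal_equations(xs, ys):
    """Independent check: solve the 2x2 normal equations by Cramer's rule."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    slope = (n * sxy - sx * sy) / det
    intercept = (sxx * sy - sx * sxy) / det
    return slope, intercept


# Check 1: a tiny dataset worked out by hand. The points lie exactly on
# y = 2x + 1, so the fit should return slope 2 and intercept 1.
slope, intercept = fit_line([0, 1, 2], [1, 3, 5])
assert math.isclose(slope, 2.0) and math.isclose(intercept, 1.0)

# Check 2: agreement with the independent implementation on noisy data.
xs = [0, 1, 2, 3]
ys = [1.0, 2.9, 5.2, 6.8]
a1, b1 = fit_line(xs, ys)
a2, b2 = fit_line_normal_equations(xs, ys)
assert math.isclose(a1, a2, rel_tol=1e-9)
assert math.isclose(b1, b2, rel_tol=1e-9)

# Check 3: a known property. When the model includes an intercept,
# the least-squares residuals sum to zero (up to rounding error).
residual_sum = sum(y - (a1 * x + b1) for x, y in zip(xs, ys))
assert math.isclose(residual_sum, 0.0, abs_tol=1e-9)

print("all checks passed")
```

None of these checks proves that a program is correct, but a failure in any one of them is a clear sign that something needs fixing before you run your experiments.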
The project proposal should be about a half page in length (no more than a single full page), single-spaced, in 12pt font with 1-inch margins. It will not be graded.
The proposal should describe the following, as best as you can. (We understand that some details will have to be worked out as the project proceeds.)
We will hold a poster session on Tuesday, May 6, 2008 from 11am-1pm in the "banana room" of the Computer Science building, which is the area right outside room 105. This is intended to be a fun "science fair" kind of event to give you a chance to present your own project, and to hear about the projects of others. Participation in the poster session is required, but will not be graded. Lunch (probably pizza) will be served around noon.
Each group or individual project will be provided with a 4' x 4' bulletin board. You should prepare material to place in this space so that others can learn about your project. You can either prepare and print out an actual poster (if you can find a printer that handles oversize paper), or you can simply prepare PowerPoint-style slides which you can then print out on ordinary paper and tack to your bulletin board. Push-pins will be provided. Your poster should describe at a high level what you did and what results you got. During the poster session, you (or others in your group) should spend at least half the time physically at your poster so that you can explain it in a one-on-one fashion to anyone who is interested. The rest of your time can be spent looking at your classmates' posters.
You should be finished attaching your poster to your bulletin board before the poster session begins at 11am. You can start setting up your poster at 10am. (At the end of the poster session, please be sure to take down your poster, and return any push-pins that you used. Any materials that are left behind after the end of the poster session will be discarded.)
We understand that you may not be entirely finished with your project by the time of the poster session. In that case, your poster should reflect what you have done so far, and indicate what you plan to accomplish in the remaining time.
The end result of your project is a report that clearly and concisely describes what you did, the results you obtained, and what they mean. The report should be submitted in hard copy in the homework box by May 13, 2008. If you are working individually, your report should be 3-5 pages long. If you are working as part of a small group, your group should submit one report that is 5-7 pages long. The report should use a 12pt font, 1-inch margins, and single spacing. The page length limits do not include figures.
Your report should follow the general outline of a scholarly paper in this area. You should write your report as clearly as possible in a manner that would be understandable to a fellow COS424 student. In other words, you should not assume that the reader has background beyond what has been covered in class (as well as a general computer science background).
Begin by describing the problem you are studying, the motivation for studying it, and some background material (i.e., what's been done before). Previous work and outside sources should be cited throughout your report in a scholarly fashion following the style of academic papers in this area. (You can find examples by looking in the journals or conferences listed above.)
Next, clearly explain what you did, both at a high level and in more low-level detail. Explain why you chose your data and the analyses you performed, and describe how your study was conducted in enough detail that a motivated reader can replicate it. State the results of your analysis clearly, and include good visualizations to illustrate the results. (A table of numbers is usually less compelling than a well-chosen plot of the same data.) All reports must include at least one informative visualization.
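One way to make an experimental comparison easier to replicate is to fix the random seeds and to evaluate every method on exactly the same train/test splits. The Python sketch below illustrates the idea under stated assumptions: the two predictors (a mean baseline and a one-nearest-neighbor rule), the synthetic data, the seed values, and all function names are placeholders chosen only for illustration, not methods required for the project.

```python
import random


def split_indices(n, test_fraction, seed):
    """Return (train, test) index lists from a seeded shuffle."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = max(1, int(round(n * test_fraction)))
    return idx[n_test:], idx[:n_test]


def mean_predictor(train_x, train_y):
    """Baseline: always predict the mean training response."""
    mean_y = sum(train_y) / len(train_y)
    return lambda x: mean_y


def nearest_neighbor_predictor(train_x, train_y):
    """Predict the response of the closest training point."""
    pairs = list(zip(train_x, train_y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mean_squared_error(predict, xs, ys):
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def compare(methods, xs, ys, seeds, test_fraction=0.25):
    """Average test error of each method over the same seeded splits."""
    totals = {name: 0.0 for name in methods}
    for seed in seeds:
        train, test = split_indices(len(xs), test_fraction, seed)
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        test_x = [xs[i] for i in test]
        test_y = [ys[i] for i in test]
        for name, fit in methods.items():
            predict = fit(train_x, train_y)
            totals[name] += mean_squared_error(predict, test_x, test_y)
    return {name: total / len(seeds) for name, total in totals.items()}


# Synthetic data for illustration only: twelve points on y = 2x + 1.
xs = [float(x) for x in range(12)]
ys = [2.0 * x + 1.0 for x in xs]
methods = {
    "mean_baseline": mean_predictor,
    "nearest_neighbor": nearest_neighbor_predictor,
}
seeds = [0, 1, 2, 3, 4]
results = compare(methods, xs, ys, seeds)
print(results)
```

Because the splits depend only on the listed seeds, reporting those seeds along with the split fraction lets a reader regenerate the same partitions, and any difference between methods cannot be attributed to one of them having received an easier split.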
It is important to discuss and interpret your results. Don't just give a table of results and assume that their meaning is obvious to the reader. Explain what they mean, and what conclusions can be drawn from them (and what conclusions cannot be drawn from them). Again, do all this in a way that would be understandable and interesting to a fellow COS424 student. What did you expect to find? What did you find instead? What are the implications? If you found something surprising, can you think of how it might be explained? Be thoughtful, observant and critical.
Projects will be graded along the following dimensions:
As always, feel free to contact us at any time with questions or difficulties you encounter, or if you have trouble thinking of a topic or finding papers to read.