
Problem Set 5: Moogle

Acknowledgement: Thanks to Greg Morrisett at Harvard for the core components of this assignment!

Quick Links

  • Code bundle
  • Large wiki test case. Download it, unpack it (tar xfz wiki.tgz) in your directory and try it with the command:
    ./moogle 8080 140 wiki/Teenage_Mutant_Ninja_Turtles
    
    It will take several minutes to index the pages (about 3-5 on my desktop, which has an Intel 3.4GHz i7), even with optimized data structures. Search is nearly instantaneous, which is the way Google would want it! We'll talk about how to parallelize web indexing in future classes.

Objectives

The goal of this problem set is to build a small web search engine called Moogle. This is really your first (somewhat) large-scale development project in OCaml. In completing this assignment, you will demonstrate your understanding of OCaml's module system. You are encouraged to work in pairs for this problem set.

Your job is to implement some key components of Moogle, including efficient abstractions for sets and dictionaries and the basic crawler that constructs the web index. You will also implement a page-rank algorithm that can be used to sort the links returned from a query so that more "important" links are returned first.

The naive implementations of sets and dictionaries we provide won't scale very well, so we're going to have you re-implement dictionaries and sets. You'll be implementing solutions based on balanced trees.

TESTING EXPLANATION

Testing for your functors will be required. All the functors you write will have a run_tests : unit -> unit function. See the examples of tests in dict.ml (the tests for remove), and see how to run them by scrolling to the very bottom of dict.ml.

In addition to writing unit tests, we have provided three sample dumps of webpages to test your crawler.

  • simple-html: a directory containing a small set of pages (7 of them) to test your crawler.
  • html: a directory containing the OCaml manual (20 HTML pages).
  • wiki: this will be a larger set of pages you can use to test the performance of your final implementation. I am still working on assembling it. It is not required for completing the project. I wanted to get something out on the web before the hurricane hits and I lose power ...

Style

You must follow proper style as discussed in section and in the online style guides here.

Words of Warning

This project is significantly more difficult than previous problem sets because you need to read and understand all of the code that we have given you, as well as write new code. We HIGHLY recommend you start as early as possible. This is not the kind of problem set you can expect to complete last minute.

As a warning, you shouldn't just point Moogle at an arbitrary web site and start crawling it unless you understand the robots.txt protocol. This protocol lets web sites tell crawlers (like Google's, Yahoo's, or Bing's) which sub-directories they are allowed to index, at what rate they can connect to the server, how frequently they can index the material, etc. If you don't follow this protocol, then you are likely to have some very angry people on your doorstep. So to be on the safe side, you should restrict your crawling to your local disk.

Another word of warning: Setting up and configuring a real web server demands a lot of knowledge about security. In particular, we do not recommend that you poke holes in your machine's firewall so that you can talk to Moogle from a remote destination. Rather, you should restrict your interaction to a local machine.

Getting Started

Download the code here. Unzip and untar it:

$ tar xfz a5.tgz

Within the moogle directory, you will find a Makefile and a set of .ml files that make up the project sources. You will also find three directories of web pages that you can use for testing: simple-html, html, and wiki. Below is a brief description of the contents of each file:

  • Makefile: used to build the project -- type "make all" at the command line to build the project.
  • order.ml: definitions for an order datatype used to compare values.
  • myset.ml: an interface and simple implementation of a set abstract datatype.
  • dict.ml: an interface and simple implementation of a dictionary abstract datatype.
  • query.ml: a datatype for Moogle queries and a function for evaluating a query given a web index.
  • util.ml: includes an interface and the implementation of crawler services needed to build the web index. This includes definitions of link and page datatypes, a function for fetching a page given a link, and the values of the command line arguments (e.g., the initial link, the number of pages to search, and the server port.)
  • moogle.ml: the main code for the Moogle server.
  • crawl.ml: includes a stub for the crawler that you will need to complete.
  • graph.ml: definitions for a graph abstract data type, including a graph signature, a node signature, and a functor for building graphs from a node module.
  • nodescore.ml: definitions for node score maps, which are used as part of the page-rank algorithm.
  • pagerank.ml: skeleton code for the page-rank algorithm including a dummy indegree algorithm for computing page ranks. You will be editing this.

Build Moogle

This subsection explains how to build Moogle. Initially, Moogle won't do much for you, but it will compile. Once you implement the necessary routines in crawl.ml, you will be able to index small web sites (such as simple-html). To index and query larger sites, you will need to implement more efficient data structures to manage sets and dictionaries.

Compile Moogle via command line by typing make all. To start Moogle up from a terminal or shell type:


  $ ./moogle 8080 42 simple-html/index.html

The first command line argument (8080) represents the port that your Moogle server listens on. Unless you know what you are doing, you should leave it as 8080.

The second command line argument (42) represents the number of pages to index. Moogle will index at most that number of pages.

The last command line argument (simple-html/index.html) indicates the page from which your crawler should start. Moogle will only index pages that are on your local file system (inside the simple-html, html, or wiki directories.)

You should see that the server starts and then prints some debugging information ending with the lines:

  Starting Moogle on port 8080.
  Press Ctrl-c to terminate Moogle.

Now try to connect to Moogle with your web-browser -- I have tested this with Chrome, which you can download here. I cannot promise other browsers will work properly. (My version of Firefox was glitchy and there may be problems with Safari.) Once you have selected a browser, open it up and type in the following URL:

  http://localhost:8080

You may need to try a different port (e.g., 8081, 8082, etc.) to find a free port.

Once you connect to the server, you should see a web page that says "Welcome to Moogle!". You can try to type in a query, but you'll see an empty response until you implement crawl.ml.

The Moogle home page lets you enter a search query. The query is either a word (e.g., "moo"), a conjunctive query (e.g., "moo AND cow"), or a disjunctive query (e.g., "moo OR cow"). By default, queries are treated as conjunctive, so if you write "moo cow", you should get back all of the pages that have both the word "moo" and the word "cow".

The query is parsed and sent back to Moogle as an HTTP request. At this point, Moogle computes the set of URLs (drawn from those pages we have indexed) whose web pages satisfy the query. For instance, if we have a query "Greg OR cow", then Moogle would compute the set of URLs of all pages that contain either "Greg" or "cow".

After computing the set of URLs that satisfy the user's query, Moogle builds an html page response and sends the page back to the user to display in their web-browser.

Because crawling the actual web demands careful attention to protocols (in particular robots.txt), we will be crawling web pages on your local hard-disk instead.

1. Implement the Crawler

Your first major task is to implement the web crawler and build the search index. In particular, you need to replace the dummy crawl function in the file crawl.ml with a function that builds a WordDict.dict (a dictionary from words to sets of links).

You will find the definitions in the CrawlerServices module (in the file util.ml) useful. For instance, you will notice that the initial URL provided on the command line can be found in the variable CrawlerServices.initial_link. You should use the function CrawlerServices.get_page to fetch a page given a link. A page contains the URL for the page, a list of links that occur on that page, and a list of words that occur on that page. You need to update your WordDict.dict so that it maps each word on the page to a set that includes the page's URL. Then you need to continue crawling the other links on the page recursively. Of course, you need to figure out how to avoid an infinite loop when one page links to another and vice versa; for efficiency, you should visit each page at most once.

The variable CrawlerServices.num_pages_to_search contains the command-line argument specifying how many unique pages you should crawl. So you'll want to stop crawling after you've seen that number of pages or when you run out of links to process.

The module WordDict provides operations for building and manipulating dictionaries mapping words (strings) to sets of links. Note that it is defined in crawl.ml by calling a functor and passing it an argument where keys are defined to be strings, and values are defined to be LinkSet.set. The interface for WordDict can be found in dict.ml.

The module LinkSet is defined in pagerank.ml and, like WordDict, it is built using a set functor, where the element type is specified to be a link. The interface for LinkSet can be found in the myset.ml file.
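
Putting these pieces together, here is a rough sketch of one way crawl could look. This is only a sketch under assumptions: it supposes get_page returns a page directly, that a page has fields url, links, and words, and that the set and dict operations (choose, member, insert, lookup) have the argument orders shown. Check the actual signatures in the starter code, since they may differ.

let rec crawl (n: int) (frontier: LinkSet.set)
    (visited: LinkSet.set) (d: WordDict.dict) : WordDict.dict =
  if n <= 0 then d
  else match LinkSet.choose frontier with
    | None -> d                                   (* out of links *)
    | Some (link, rest) ->
      if LinkSet.member visited link then crawl n rest visited d
      else
        let page = CrawlerServices.get_page link in
        (* map each word on the page to a set including this URL *)
        let d' =
          List.fold_left
            (fun acc w ->
              let s = match WordDict.lookup acc w with
                | None -> LinkSet.empty
                | Some s -> s in
              WordDict.insert acc w (LinkSet.insert link s))
            d page.words in
        (* enqueue outgoing links and mark this page as visited *)
        let frontier' =
          List.fold_left (fun f l -> LinkSet.insert l f) rest page.links in
        crawl (n - 1) frontier' (LinkSet.insert link visited) d'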

Running the crawler in the top-level loop won't really be possible, so to test and debug your crawler, you will want to compile via command line using make, and add code that prints things out. See the OCaml documentation for the Printf.printf functions for how to do this, and note that all of our abstract types (e.g., sets, dictionaries, etc.) provide operations for converting values to strings for easy printing. We have provided you with three sample sets of web pages to test your crawler. Once you are confident your crawler works, run it on the small html directory:


./moogle 8080 7 simple-html/index.html

simple-html contains 7 very small html files that you can inspect yourself, and you should compare that against the output of your crawler. If you attempt to run your crawler on the larger sets of pages, you may notice that your crawler takes a very long time to build up your index. The dummy list implementations of dictionaries and sets do not scale very well.

Note: Unless you do a tiny bit of extra work, your index will be case sensitive. This is fine, but may be confusing when testing, so keep it in mind.

2. Sets as Dictionaries

A set is a data structure that stores keys where keys cannot be duplicated (unlike a list which may store duplicates). Moogle uses lots of sets. For example, our web index maps words to sets of URLs, and our query evaluator manipulates sets of URLs. We also want sets to be faster than the naive list-based implementation we have provided. Here, we build sets of ordered keys, keys on which we can provide a total ordering.

Instead of implementing sets from scratch, we want you to use what you already have for dictionaries to build sets. To do so, you must write a functor, which, when given a COMPARABLE module, produces a SET module by building and adapting an appropriate DICT module. That way, we can just re-use all of the work that we will put later into implementing dictionaries efficiently.

For this part, you will need to uncomment the DictSet functor in the file myset.ml. The first step is to build a suitable dictionary D by calling Dict.Make. The key question is: what can you pass in for the type definitions of keys and values? Then you need to build the set definitions in terms of the operations provided by D : DICT.
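
To make the shape concrete, here is a hedged sketch of how the functor might be organized, assuming the SET and DICT signatures described above. The key observation: set elements become dictionary keys, and the value type carries no information, so unit works. The elided DICT_ARG fields (string and generator functions) and the exact operation names are for you to fill in from the actual interfaces in myset.ml and dict.ml.

module DictSet(C : COMPARABLE) : (SET with type elt = C.t) =
struct
  (* keys carry the set elements; values carry no information *)
  module D = Dict.Make(struct
    type key = C.t
    type value = unit
    let compare = C.compare
    (* ... string_of_key, string_of_value, generators, etc. ... *)
  end)

  type elt = D.key
  type set = D.dict
  let empty = D.empty
  let insert x s = D.insert s x ()
  let member s x = D.member s x
  let remove x s = D.remove s x
  (* ... remaining SET operations, each in terms of D ... *)
end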

You should make sure to write a set of unit tests that exercise your implementation, following the TESTING EXPLANATION. You must test ALL your functions (except for the string functions). Finally, you should change the Make functor at the bottom of myset.ml so that it uses the DictSet functor instead of ListSet. Try your crawler on the simple-html and html directories.

3: Dictionaries as Trees

The critical data structure in Moogle is the dictionary. A dictionary is a data structure that maps keys to values. Each key in the dictionary has a unique value, although multiple keys can have the same value. For example, WordDict is a dictionary that maps words on a page to the set of links that contain that word. At a high level, you can think of a dictionary as a set of (key,value) pairs.

As you may have observed while testing your crawler with your new set functor, building dictionaries on top of association lists (as we have done for you) does not scale particularly well. The implementation we have provided in dict.ml is pretty inefficient: because it is based on lists, operations such as looking up the value associated with a key can take time linear in the size of the dictionary.

For this part of the problem set, we are going to build a different implementation of dictionaries using a kind of balanced tree called a 2-3 tree. You studied many of the properties of 2-3 trees and their relationship to red-black trees in COS 226; in this course, you will learn how to implement them in a functional language. Since our trees are guaranteed to be balanced, the complexity of our insert, remove, and lookup functions will be logarithmic in the number of keys in our dictionary.

Before you start coding, please read this writeup, which explains the invariants in a 2-3 tree and how to implement insert and remove.

Read every word of it. Seriously.

In the file dict.ml, you will find a commented-out functor BTDict, which is intended to build a DICT module from a DICT_ARG module. Here is the type definition for a dictionary implemented as a 2-3 tree:


type pair = key * value

type dict = 
    | Leaf
    | Two of dict * pair * dict
    | Three of dict * pair * dict * pair * dict

Notice the similarities between this type definition of a 2-3 tree and the type definition of the binary search trees we have covered in lecture. Here, in addition to nodes that contain two subtrees, we have a Three node which contains two (key,value) pairs (k1,v1) and (k2,v2), and three subtrees left, middle, and right. Below are the invariants as stated in the handout:

2-node: Two(left,(k1,v1),right)
  1. Every key k appearing in subtree left must satisfy k < k1.
  2. Every key k appearing in subtree right must satisfy k > k1.
  3. The length of the path from the 2-node to every leaf in its two subtrees must be the same.
3-node: Three(left,(k1,v1),middle,(k2,v2),right)
  1. k1 < k2.
  2. Every key k appearing in subtree left must satisfy k < k1.
  3. Every key k appearing in subtree right must satisfy k > k2.
  4. Every key k appearing in subtree middle must satisfy k1 < k < k2.
  5. The length of the path from the 3-node to every leaf in its three subtrees must be the same.

Note that all of the bounds here are strict (< and >, not <= and >=), unlike in the handout. The handout permitted storing multiple equal keys; in our dictionary, however, a key can be stored only once. The last invariant of each kind of node implies that our tree is balanced: the length of the path from the root node to any Leaf node is the same.
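
For instance, assuming key = int and value = string (as in the IntStringBTDict instantiation used by the provided tests), the following tree satisfies both kinds of invariants: keys increase from left to right, and every path from the root to a Leaf has length 2.

let example : dict =
  Two (Three (Leaf, (1, "one"), Leaf, (2, "two"), Leaf),
       (3, "three"),
       Two (Leaf, (4, "four"), Leaf))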

Open up dict.ml and locate the commented out BTDict module.

3a: balanced

Locate the balanced function at the bottom of the BTDict implementation (right before the tests):

balanced : dict -> bool

You must write this function, which takes a dictionary (our 2-3 tree) and returns true if and only if the tree is balanced. To do so, you only need to check the path-length invariants, not the order of the keys. In addition, your solution must be efficient -- our solution runs in O(n), where n is the size of the dictionary. In a comment above the function, please explain carefully in English how you are testing that the tree is balanced.
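
As a hint about the expected shape (one possible approach, not the only one): compute each subtree's height while checking balance in a single pass, so every node is visited once, giving O(n). A minimal sketch, assuming the dict type above:

(* returns Some h if d is balanced with height h, None otherwise *)
let rec height (d: dict) : int option =
  match d with
  | Leaf -> Some 0
  | Two (l, _, r) ->
    (match height l, height r with
     | Some hl, Some hr when hl = hr -> Some (hl + 1)
     | _ -> None)
  | Three (l, _, m, _, r) ->
    (match height l, height m, height r with
     | Some hl, Some hm, Some hr when hl = hm && hm = hr -> Some (hl + 1)
     | _ -> None)

let balanced (d: dict) : bool = height d <> None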

Once you have written this function, scroll down to run_tests and uncomment test_balanced ();. Next, scroll down to IntStringBTDict and uncomment those two lines. All the tests should pass. Now that you are confident in your balanced function, you are required to use it in all your tests involving insert.

3b: strings, fold, lookup, member

Implement these functions according to their specifications provided in DICT_ARG:


val fold : (key -> value -> 'a -> 'a) -> 'a -> dict -> 'a
val lookup : dict -> key -> value option
val member : dict -> key -> bool
val string_of_key : key -> string                                       
val string_of_value : value -> string 
val string_of_dict : dict -> string 

You may change the order in which the functions appear, but you may not change the signature of any of these functions (name, order of arguments, types). You may add a "rec" to make it "let rec" if you feel that you need it.
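
As a sketch of the general pattern, here is one possible lookup, along with a member that follows from it. It assumes the DICT_ARG argument module is bound as D and that D.compare returns the order type from order.ml, written here as Less | Eq | Greater; adjust to the actual names in the starter code.

let rec lookup (d: dict) (k: key) : value option =
  match d with
  | Leaf -> None
  | Two (left, (k1, v1), right) ->
    (match D.compare k k1 with
     | Eq -> Some v1
     | Less -> lookup left k
     | Greater -> lookup right k)
  | Three (left, (k1, v1), middle, (k2, v2), right) ->
    (match D.compare k k1 with
     | Eq -> Some v1
     | Less -> lookup left k
     | Greater ->
       (match D.compare k k2 with
        | Eq -> Some v2
        | Less -> lookup middle k
        | Greater -> lookup right k))

let member (d: dict) (k: key) : bool = lookup d k <> None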

3c: insert (upward phase)

Read the handout if you have not already done so. To do insertion, there are two "phases"--the downward phase to find the right place to insert, and the upward phase to rebalance the tree, if necessary, as we go back up the tree. To distinguish when we need to rebalance and when we don't, we have created a new type to represent a "kicked-up" configuration:


type kicked = 
   | Up of dict * pair * dict (* only 2-nodes require rebalancing *)
   | Done of dict             (* we are done rebalancing *)

The only kind of node that could potentially require rebalancing is a 2-node (we can see this pictorially: the only subtrees that require rebalancing are represented by an up arrow in the handout; only 2-nodes have up arrows), hence we make the Up constructor take the same arguments as the Two constructor. We have provided stubs for functions to perform the upward phase on a 2-node and a 3-node:

let insert_upward_two (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (x_other: dict) : kicked = ...

let insert_upward_three (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (y: pair) (other_left: dict) (other_right: dict) : kicked = ...

Please read the specification in the source code. These functions return a "kicked" configuration containing the new balanced tree (see page 5 of the handout); the new tree is wrapped with "Up" or "Done", depending on whether it requires further upstream rebalancing (indicated by an upward arrow). Implement these functions according to the specification in the handout.
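
To give a sense of how small the upward phase becomes once you internalize the handout's pictures, here is a hedged sketch of the 2-node case: the kicked-up node w is absorbed into the parent (x, x_other) to form a 3-node, and no arrow propagates further, so the result is wrapped in Done. The sketch compares keys to recover which side w came from, and assumes D.compare as in part 3b; verify it against the handout before relying on it.

let insert_upward_two (w: pair) (w_left: dict) (w_right: dict)
    (x: pair) (x_other: dict) : kicked =
  match D.compare (fst w) (fst x) with
  | Less -> Done (Three (w_left, w, w_right, x, x_other))
  (* Eq should not arise: equal keys are replaced during the downward phase *)
  | Greater | Eq -> Done (Three (x_other, x, w_left, w, w_right))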

3d: insert (downward phase)

Now that we have the upward phase (the hard part) taken care of, we can worry about the downward phase. We have provided three mutually recursive functions to simplify the main insert function. Here are the downward phase functions:


  let rec insert_downward (d: dict) (k: key) (v: value) : kicked =
    match d with
      | Leaf -> ??? (* base case! see handout *)
      | Two(left,n,right) -> ??? (* mutual recursion! *) 
      | Three(left,n1,middle,n2,right) -> ??? (* mutual recursion! *)

  and insert_downward_two ((k,v): pair) ((k1,v1): pair) 
      (left: dict) (right: dict) : kicked = ...

  and insert_downward_three ((k,v): pair) ((k1,v1): pair) ((k2,v2): pair) 
      (left: dict) (middle: dict) (right: dict) : kicked = ...

  (* the main insert function *)
  let insert (d: dict) (k: key) (v: value) : dict =
    match insert_downward d k v with
      | Up(l,(k1,v1),r) -> Two(l,(k1,v1),r)
      | Done x -> x

Note that, like the upward phase, the downward phase also returns a "kicked-up" configuration. The main insert function simply extracts the tree from the kicked-up configuration returned by the downward phase, and returns the tree.

Implement these three downward phase functions. You will need to use your upward phase functions that you wrote in part c. Once you have written your code, you must write tests to test your insert function, lookup function, and member function. We have provided generator functions to generate keys, values, and pairs to help you test (See DICT_ARG). Use the test functions that we have provided for remove as an example. Remember to make run_tests () call your newly created test functions.

3e: remove (upward phase)

Like insert, remove has an upward and downward phase. However, instead of passing "kicked-up" configurations, we pass a "hole". We have provided the following type definition for a "hole":


type hole = 
   | Hole of pair option * dict
   | Absorbed of pair option * dict

Here, a "hole" type can either contain a Hole (represented by an empty circle in the handout), or the hole could have been absorbed, which we represent with Absorbed. The Hole and Absorbed constructors correspond directly to the Hole and Absorbed terminology used in 2-3 tree specification document. The constructors take the tree containing the Hole/Absorbed hole, and a pair option containing the (key,value) pair we removed. This is useful when we need to remove a (key,value) pair that is in the middle of the tree (and therefore need to replace it with the minimum (key,value) pair in the current node's right subtree). In addition to a "hole" type, we have also provided a "direction2/3" type, to indicate, for the current node, which subtree is currently the "hole":


type direction2 = 
   | Left2
   | Right2

type direction3 =
   | Left3
   | Mid3
   | Right3

We have provided the downward phase for you, so you will only need to implement the upward phase (see pages 9 and 10 of the handout):

  (* cases 1-2 *)
  let rec remove_upward_two (n: pair) (rem: pair option) 
      (left: dict) (right: dict) (dir: direction2) : hole =
    match dir,n,left,right with
      | Left2,x,l,Two(m,y,r) -> Hole(rem,Three(l,x,m,y,r))
      | Right2,y,Two(l,x,m),r -> ???
      | Left2,x,a,Three(b,y,c,z,d) -> ???
      | Right2,z,Three(a,x,b,y,c),d -> ???
      | Left2,_,_,_ | Right2,_,_,_ -> Absorbed(rem,Two(Leaf,n,Leaf))

  (* cases 3-4 *)
  let rec remove_upward_three (n1: pair) (n2: pair) (rem: pair option)
      (left: dict) (middle: dict) (right: dict) (dir: direction3) : hole =
    match dir,n1,n2,left,middle,right with
      | Left3,x,z,a,Two(b,y,c),d -> Absorbed(rem,Two(Three(a,x,b,y,c),z,d))
      | Mid3,y,z,Two(a,x,b),c,d -> ???
      | Mid3,x,y,a,b,Two(c,z,d) -> ???
      | Right3,x,z,a,Two(b,y,c),d -> ??? 
      | Left3,w,z,a,Three(b,x,c,y,d),e -> ??? 
      | Mid3,y,z,Three(a,w,b,x,c),d,e -> ???
      | Mid3,w,x,a,b,Three(c,y,d,z,e) -> ???
      | Right3,w,z,a,Three(b,x,c,y,d),e -> ???
      | Left3,_,_,_,_,_ | Mid3,_,_,_,_,_ | Right3,_,_,_,_,_ ->
        Absorbed(rem,Three(Leaf,n1,Leaf,n2,Leaf))

Once you have finished writing the upward phase functions, uncomment the test_remove_* functions in run_tests (). All the tests should pass.

3f: choose

Implement the function according to the specification in DICT:

choose : dict -> (key * value * dict) option
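
One hedged sketch, assuming the remove function from part 3e: take the pair stored at the root and remove its key from the tree. (Or-patterns let both node shapes share one arm.)

let choose (d: dict) : (key * value * dict) option =
  match d with
  | Leaf -> None
  | Two (_, (k, v), _)
  | Three (_, (k, v), _, _, _) -> Some (k, v, remove d k)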

Write tests to test your choose function.

4. Try your crawler!

Finally, because we have properly factored out an interface, we can easily swap in your tree-based dictionaries in place of the list-based dictionaries by modifying the Make functor at the end of dict.ml. So make the change and try out your new crawler on the three sets of webpages. Our startup code prints out the crawling time, so you can see which implementation is faster. Try it on the larger test set in the wiki directory in addition to the html and simple-html directories with:


./moogle 8080 45 wiki/Teenage_Mutant_Ninja_Turtles

If you are confident everything is working, try changing 45 to 224. (This may take several minutes to crawl and index.) You're done!

We've provided you with a debug flag at the top of moogle.ml that can be set to either true or false. If debug is set to true, Moogle prints out debugging information using the string conversion functions defined in your dictionary and set implementations. The flag is set to true by default.

5. Pagerank

Only do this part when you know for sure the rest of your problem set is working. Making too many changes all at once is a recipe for disaster. Always keep your code compiling and working correctly. Make changes one module at a time.

We will now apply our knowledge of ADTs and graphs to explore solutions to a compelling problem: finding "important" nodes in graphs like the Internet, or the set of pages that you're crawling with Moogle.

The concept of assigning a measure of importance to nodes is very useful in designing search algorithms, such as those that many popular search engines rely on. Early search engines often ranked the relevance of pages based on the number of times that search terms appeared in the pages. However, it was easy for spammers to game this system by including popular search terms many times, propelling their results to the top of the list.

When you enter a search query, you really want the important pages: the ones with valuable information, a property often reflected in the quantity and quality of other pages linking to them. Better algorithms were eventually developed that took into account the relationships between web pages, as determined by links. (For more about the history of search engines, you can check out this page.) These relationships can be represented nicely by a graph structure, which is what we'll be using here.

NodeScore ADT

Throughout the assignment, we'll want to maintain associations of graph nodes to their importance, or NodeScore: a value between 0 (completely unimportant) and 1 (the only important node in the graph).

In order to assign NodeScores to the nodes in a graph, we've provided a module with an implementation of an ADT, NODE_SCORE, to hold such associations. The module makes it easy to create, modify, normalize (to sum to 1), and display NodeScores. You can find the module signature and implementation in nodescore.ml.

NodeScore Algorithms

In this section, you'll implement a series of NodeScore algorithms in different modules: that is, functions rank that take a graph and return a node_score_map on it. As an example, we've implemented a trivial NodeScore algorithm in the IndegreeRanker module that gives all nodes a score equal to the number of incoming edges.

5a: Random Walk Ranker, or Sisyphus walks the web

In this section, we will walk you through creating a robust ranking algorithm based on random walks. The writeup goes through several steps, and we encourage you to do your implementation in that order. However, you will only submit the final version.

Part 1

You may realize that we need a better way of saying that nodes are popular or unpopular. In particular, we need a method that considers global properties of the graph and not just edges adjacent to or near the nodes being ranked. For example, there could be an extremely relevant webpage that is several nodes removed from the node we start at. That node might normally fare pretty low on our ranking system, but perhaps it should be higher based on there being a high probability that the node could be reached when browsing around on the internet.

So consider Sisyphus, doomed to crawl the Web for eternity: or more specifically, doomed to start at some arbitrary page, and follow links randomly. (We assume for the moment that the Web is strongly connected and that every page has at least one outgoing link, unrealistic assumptions that we will return to address soon.)

Let's say Sisyphus can take k steps after starting from a random node. We design a system to determine nodescores based on how likely Sisyphus is to reach a certain page. In other words, we ask: where will Sisyphus spend most of his time?

Step 1

Implement rank in the RandomWalkRanker functor in pagerank.ml, which takes a graph and returns a normalized nodescore on the graph, where the score for each node is proportional to the number of times Sisyphus visits the node in num_steps steps, starting from a random node. For now, your function may raise an exception if some node has no outgoing edges. Note that num_steps is passed in as part of the functor parameter P.

You may find the library function Random.int useful, and may want to write some helper functions to pick random elements from lists (don't forget about the empty list case).

Step 2

Our Sisyphean ranking algorithm does better at identifying important nodes according to their global popularity, rather than being fooled by local properties of the graph. But what about the error condition we mentioned above: that a node might not have any outgoing edges? In this case, Sisyphus has reached the end of the Internet. What should we do? A good solution is to jump to some random page and surf on from there.

Modify your rank function so that, instead of raising an error at the end of the Internet, it has Sisyphus jump to a random node.

Step 3

This should work better. But what if there are very few pages that have no outgoing edges -- or worse, what if there is a set of vertices with no outgoing edges to the rest of the graph, but there are still edges between vertices in the set? Sisyphus would be stuck there forever, and that set would win the node popularity game hands down, just by having no outside links. What we'd really like to model is the fact that Sisyphus doesn't have an unlimited attention span, and starts to get jumpy every now and then...

Modify your rank function so that if the module parameter P.do_random_jumps is Some alpha, then with probability alpha at every step, Sisyphus will jump to a random node in the graph, regardless of whether the current node has outgoing edges. (If the current node has no outgoing edges, Sisyphus should still jump to a random node in the graph, as before.)
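
To make the three steps concrete, here is a self-contained illustration of the whole walk in plain OCaml, using ints for nodes and an array of out-neighbor lists rather than the starter code's graph and NodeScore modules. All names here are ours, and the graph is assumed non-empty; your rank function must instead be written against the functor's G and NS modules.

let random_walk_scores (edges: int list array) (num_steps: int)
    (do_random_jumps: float option) : float array =
  let n = Array.length edges in
  let counts = Array.make n 0 in
  let jump () = Random.int n in
  let pick_random l = List.nth l (Random.int (List.length l)) in
  let step node =
    let jumpy = match do_random_jumps with
      | Some alpha -> Random.float 1.0 < alpha   (* Step 3: get jumpy *)
      | None -> false in
    if jumpy then jump ()
    else match edges.(node) with
      | [] -> jump ()                            (* Step 2: dead end, jump *)
      | outs -> pick_random outs                 (* Step 1: follow a link *)
  in
  let rec walk node steps =
    if steps > 0 then begin
      counts.(node) <- counts.(node) + 1;
      walk (step node) (steps - 1)
    end in
  walk (jump ()) num_steps;
  (* normalize visit counts so the scores sum to 1 *)
  Array.map (fun c -> float_of_int c /. float_of_int num_steps) counts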

Don't forget to add some testing code. As stated in the TESTING EXPLANATION section above, possible approaches include running it by hand many times and verifying that the results seem reasonable, or writing an approx-equal function to compare the result of a many-step run with what you'd expect to find.

Submit your final version of RandomWalkRanker once you have tested it out.

5b: Using your new page rank algorithm

At the top of crawl.ml is a definition of a MoogleRanker module that's used to order the search results. Replace the call to the IndegreeRanker functor with a call to the RandomWalkRanker you wrote, and experiment with how it changes the results.

5c: Karma: QuantumRanker

This portion of the assignment is worth Karma that may or may not be useful at the end of the semester. If you are right on the boundary between grades, the amount of Karma you accumulate over the semester may tilt your grade one way or the other. However, this part of the assignment is entirely optional.

Our algorithm so far works pretty well, but on a huge graph it would take a long time to find all of the hard-to-get-to nodes. (Especially when new nodes are being added all the time...)

We'll need to adjust our algorithm somewhat. In particular, let's suppose Sisyphus is bitten by a radioactive eigenvalue, giving him the power to subdivide himself arbitrarily and send parts of himself off to multiple different nodes at once. We have him start evenly spread out among all the nodes. Then, from each of these nodes, the pieces of Sisyphus that start there will propagate outwards along the graph, dividing themselves evenly among all outgoing edges of that node.

So, let's say that at the start of a step, we have some fraction q of Sisyphus at a node, and that node has 3 outgoing edges. Then q/3 of Sisyphus will propagate outwards from that node along each edge. This will mean that nodes with a lot of value will make their neighbors significantly more important at each timestep, and also that in order to be important, a node must have a large number of incoming edges continually feeding it importance.

Thus, our basic algorithm takes an existing graph and NodeScore, and updates the NodeScore by propagating all of the value at each node to all of its neighbors. However, there's one wrinkle: we want to include some mechanism to simulate random jumping. The way that we do this is to use a parameter alpha, similarly to what we did in exercise 2. At each step, each node propagates a fraction 1-alpha of its value to its neighbors as described above, but also a fraction alpha of its value to all nodes in the graph. This will ensure that every node in the graph is always getting some small amount of value, so that we never completely abandon nodes that are hard to reach.

We can model this fairly simply. If each node distributes alpha times its value to all nodes at each timestep, then at each timestep each node accrues (alpha/n) times the overall value in this manner. Thus, we can model this effect by having the base NodeScore at each timestep give (alpha/n) to every node.

That gives us our base case for each timestep of our quantized Sisyphus algorithm. What else do we need to do at each timestep? Well, as explained above, we need to distribute the value at each node to all of its neighbors.

Step 1

Write a function propagate_weight that takes a node, a graph, a NodeScore that it's building up, and the NodeScore from the previous timestep, and returns the new NodeScore resulting from distributing the weight that the given node had in the old NodeScore to all of its neighbors. Remember that only (1-alpha) of the NodeScore gets distributed among the neighbors (alpha of the weight was distributed evenly to all nodes in the base case). A value for alpha is passed in to the functor as part of the P parameter. Note: To handle the case of nodes that have no outgoing edges, assume that each node has an implicit edge to itself.

Step 2

Now we have all the components we need. Implement the rank function in the QuantumRanker module. It should return the NodeScore resulting from applying the updating procedures described above. The number of timesteps to simulate is passed in to the functor as part of the P parameter.
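
As a self-contained illustration of one timestep (again over plain arrays rather than the NodeScore ADT; the names are ours), assume the old scores sum to 1. Each node first receives the uniform alpha/n base share, and then each node's old value, scaled by 1-alpha, is split evenly among its neighbors, with an implicit self-edge for dead ends:

let quantum_step (edges: int list array) (alpha: float)
    (old_score: float array) : float array =
  let n = Array.length edges in
  (* base case: every node gets alpha/n of the (unit) total value *)
  let fresh = Array.make n (alpha /. float_of_int n) in
  Array.iteri
    (fun u q ->
      (* a node with no outgoing edges keeps an implicit self-edge *)
      let outs = match edges.(u) with [] -> [u] | l -> l in
      let share = (1.0 -. alpha) *. q /. float_of_int (List.length outs) in
      List.iter (fun v -> fresh.(v) <- fresh.(v) +. share) outs)
    old_score;
  fresh

Iterating a step like this for the given number of timesteps, starting from a uniform score, yields the final NodeScore.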

Don't forget to add some tests!

6: Submission

NOTE: if you are working in a pair, only one of you should submit. To submit, use the dropbox link posted next to the assignment link.


7: Super Karma: Moogle Design Contest

Google has a different image on their homepage every day. Moogle deserves the same -- or at least a different design every year. We challenge you to make the Moogle homepage better. Perhaps you can think of something to give it some Princeton flair. I'm also open to name changes (PU-gle? Tiger Oogle?). The creator of the best design will be rewarded with untold riches, including having your homepage featured in next year's problem set. Should you accept the challenge, you can directly edit your moogle.html file and submit a new design as part of your solution to the problem set.

If you have ideas for a new Princeton-based set of web pages that could be downloaded and used next year for testing and experimentation, please let me know. The set of pages cannot be too large. Moreover, as mentioned above, the robots.txt protocol governs who is allowed to crawl which pages and how often. Even if you are well-meaning, an accidental bug in your implementation can get you in quite a bit of trouble.

If you have comments or suggestions concerning this assignment -- perhaps ideas for how to make it better in the future -- please let me know in your README.txt file.