COS 461, Princeton University, Spring 2015
Researchers often summarize a large collection of measurement data using distribution functions. Imagine you have a list of Web pages with different sizes, in terms of number of bytes. The cumulative distribution function (CDF) of page sizes would have a y-axis of "the fraction of Web pages that are less than or equal to x bytes", and an x-axis of the number of bytes. The graph would start at y=0, since no Web pages have less than or equal to 0 bytes, and reach y=1 when x reaches the size of the largest page. Sometimes researchers plot the complementary cumulative distribution function (CCDF), which is "the fraction of Web pages that are greater than x bytes". The graph would start at y=1, since all Web pages have more than 0 bytes, and gradually decrease toward y=0 upon reaching the x-axis value for the largest page. Researchers sometimes plot one or both axes of the CCDF on a logarithmic scale to see more of the detail in some region of the curve. In the questions below, you will plot CCDF distributions, on either linear or logarithmic scales.
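As a rough illustration of the CCDF definition above, the following Python sketch computes "the fraction of values greater than x" for each distinct x in a list. The function name ccdf and the page sizes are made up for this example; they are not part of the assignment data.

```python
from bisect import bisect_right

def ccdf(values):
    """Return (xs, ys), where ys[i] is the fraction of values strictly greater than xs[i]."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return [], []
    xs = sorted(set(ordered))
    # bisect_right counts the values <= x, so n minus that count is the number > x.
    ys = [(n - bisect_right(ordered, x)) / n for x in xs]
    return xs, ys

# Hypothetical page sizes in bytes, invented for illustration only.
page_sizes = [100, 200, 200, 400, 1000]
xs, ys = ccdf(page_sizes)
for x, y in zip(xs, ys):
    print(x, y)
```

The last point always has y=0 (nothing is larger than the largest value), which cannot be drawn on a logarithmic y-axis, so that point is usually dropped when plotting on a log scale.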
The important fields in the Netflow data are: dpkts and doctets (the number of packets and bytes in the flow, respectively), first and last (the timestamps of the first and last packets in the flow, respectively), srcaddr and dstaddr (the source and destination IP addresses, respectively), srcport and dstport (the source and destination transport port numbers, respectively), prot (the transport protocol, e.g., TCP, UDP), src_mask and dst_mask (the length of the longest matching IP prefix for the source and destination IP addresses, respectively), and src_as and dst_as (the AS that originated the IP prefixes matching the source and destination IP addresses, respectively). For example, looking at the first two lines of the file
#:unix_secs,unix_nsecs,sysuptime,exaddr,dpkts,doctets,first,last,engine_type,engine_id,srcaddr,dstaddr,nexthop,input,output,srcport,dstport,prot,tos,tcp_flags,src_mask,dst_mask,src_as,dst_as
1285804501,0,2442636503,127.0.0.1,1,40,2442590868,2442590868,0,0,128.103.176.0,24.8.80.0,64.57.28.75,213,225,80,51979,6,0,17,16,0,1742,0
you have a flow with one 40-byte packet that arrived at time 2442590868. The packet was sent by source 128.103.176.0 to destination 24.8.80.0, though the last 11 bits of each address are set to 0 due to the anonymization of the data. The source port is 80 (i.e., HTTP) and the destination port is 51979 (i.e., an ephemeral port), suggesting this is traffic from a Web server to a Web client. The protocol is 6 (i.e., TCP). The tcp_flags value of 17 (16 + 1) means that the ACK and FIN bits were set to 1, so this is a FIN-ACK packet; the other packets of the Web transfer were presumably not included in the flow record due to packet sampling. The source and destination masks were 16 and 0, respectively, meaning that the source prefix was 128.103.0.0/16 and the destination prefix was either unknown or 0.0.0.0/0. The source AS was 1742 (Harvard, according to "whois -h whois.arin.net 1742"), and for whatever reason the destination AS was not known.
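The reading of this record can be reproduced with a short Python sketch. It is only one possible way to decode the fields: the helper names decode_tcp_flags and prefix_of are hypothetical, and the flag table lists the six classic TCP flag bits.

```python
import ipaddress

# Bit values of the six classic TCP flags.
TCP_FLAGS = [(0x01, "FIN"), (0x02, "SYN"), (0x04, "RST"),
             (0x08, "PSH"), (0x10, "ACK"), (0x20, "URG")]

def decode_tcp_flags(value):
    """Return the names of the flag bits that are set in value."""
    return [name for bit, name in TCP_FLAGS if value & bit]

def prefix_of(addr, mask_len):
    """Return the prefix of the given length that covers addr."""
    # strict=False zeroes any address bits beyond mask_len.
    return str(ipaddress.ip_network(f"{addr}/{mask_len}", strict=False))

header_line = ("#:unix_secs,unix_nsecs,sysuptime,exaddr,dpkts,doctets,first,last,"
               "engine_type,engine_id,srcaddr,dstaddr,nexthop,input,output,srcport,"
               "dstport,prot,tos,tcp_flags,src_mask,dst_mask,src_as,dst_as")
record_line = ("1285804501,0,2442636503,127.0.0.1,1,40,2442590868,2442590868,0,0,"
               "128.103.176.0,24.8.80.0,64.57.28.75,213,225,80,51979,6,0,17,16,0,1742,0")

# Drop the leading "#:" from the header, then pair field names with values.
fields = header_line[2:].split(",")
flow = dict(zip(fields, record_line.split(",")))

flags = decode_tcp_flags(int(flow["tcp_flags"]))
src_prefix = prefix_of(flow["srcaddr"], int(flow["src_mask"]))
dst_prefix = prefix_of(flow["dstaddr"], int(flow["dst_mask"]))
print(flags, src_prefix, dst_prefix)
```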
A hint: You may find various UNIX commands like cut, sort, uniq, and grep useful in parsing and analyzing the data. For example, if you are processing the file foo.gz you can do:
gzcat foo.gz | cut -d "," -f6 | sort | uniq -c | sort -nr
to extract the sixth comma-separated field (i.e., number of bytes in the flow), count the number of occurrences of each value, and list the frequency counts from most-popular value to least-popular. Including small awk/perl/ruby/python scripts in the pipeline can be helpful for computing sums, averages, and so on. (While testing your code, you may want to test with smaller inputs by piping the data through "head -40" to see just the first 40 lines of the file. You may also find "tail +2", written "tail -n +2" on systems that reject the older syntax, useful for skipping the first line of the file, which consists of the names of the data fields.) Answer the following questions:
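As an example of the kind of small script mentioned above, here is a Python sketch that computes the number of flows, the total bytes, and the mean bytes per flow. The function name summarize_bytes and the shortened sample lines are invented for illustration; real records have 24 fields.

```python
def summarize_bytes(lines, column=5):
    """Return (flow count, total bytes, mean bytes) over an iterable of CSV lines."""
    count = 0
    total = 0
    for line in lines:
        # Skip the header line (it starts with "#") and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        # column=5 is the sixth comma-separated field, i.e., doctets.
        total += int(line.split(",")[column])
        count += 1
    mean = total / count if count else 0.0
    return count, total, mean

# Made-up, shortened records; only the sixth field matters for this sketch.
sample = [
    "#:unix_secs,unix_nsecs,sysuptime,exaddr,dpkts,doctets",
    "1285804501,0,2442636503,127.0.0.1,1,40",
    "1285804501,0,2442636503,127.0.0.1,3,120",
]
result = summarize_bytes(sample)
print(result)
```

To use the function at the end of a pipeline, pass it sys.stdin (after "import sys") instead of the sample list; lines are then processed one at a time as they arrive.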
Here is a sample line from an RIB table in TABLE_DUMP_V2 MRT format:
TABLE_DUMP2|1388750400|B|202.167.228.107|7631|0.0.0.0/0|7631 4826|IGP|202.167.228.107|0|0|7631:50|NAG||
The source IP is in the fourth column, the source AS is in the fifth, the prefix is
in the sixth, and the AS path is in the seventh, with the origin AS listed last.
The update files are in BGP4MP MRT format. Here is a sample line from
an update file:
BGP4MP|1391431556|A|202.167.228.37|38809|66.240.183.0/24|38809 2914 174 23136|IGP|202.167.228.37|0|0|2914:420 2914:1005 2914:2000 2914:3000 65504:174|NAG||
The prefix affected by the update is in the sixth column.
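Both line formats are pipe-separated, so one small parser can handle them. The Python sketch below is an illustration only: the function name parse_mrt_line and the dictionary keys are hypothetical, and it assumes that some lines (for example, withdrawals) may have no AS path column.

```python
def parse_mrt_line(line):
    """Split one pipe-separated RIB or update line into named fields."""
    cols = line.rstrip("\n").split("|")
    # Assumption: lines without a seventh column carry no AS path.
    as_path = cols[6].split() if len(cols) > 6 else []
    return {
        "format": cols[0],            # TABLE_DUMP2 or BGP4MP
        "timestamp": int(cols[1]),
        "src_ip": cols[3],            # fourth column
        "src_as": cols[4],            # fifth column
        "prefix": cols[5],            # sixth column
        "as_path": as_path,           # seventh column, origin AS last
        "origin_as": as_path[-1] if as_path else None,
    }

rib_line = ("TABLE_DUMP2|1388750400|B|202.167.228.107|7631|0.0.0.0/0|7631 4826|"
            "IGP|202.167.228.107|0|0|7631:50|NAG||")
update_line = ("BGP4MP|1391431556|A|202.167.228.37|38809|66.240.183.0/24|"
               "38809 2914 174 23136|IGP|202.167.228.37|0|0|"
               "2914:420 2914:1005 2914:2000 2914:3000 65504:174|NAG||")

rib_entry = parse_mrt_line(rib_line)
update_entry = parse_mrt_line(update_line)
print(rib_entry["prefix"], rib_entry["origin_as"])
print(update_entry["prefix"], update_entry["origin_as"])
```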
Be aware that the number of prefixes and (especially) the number of BGP update
messages are fairly large, so you will need to take care that your analysis
programs make efficient use of memory. It is highly recommended
that you do not try to complete this assignment on the VM, since you
will run into warnings about limited available memory.
BGP is an incremental protocol, sending an update message only when the route for a destination prefix changes. So, most analysis of BGP updates must start with a snapshot of the RIB, to know the initial route for each destination prefix. Use the RIB snapshot to identify the initial set of destination prefixes, and then analyze the update messages to count the number of update messages for each prefix on a single BGP session (i.e., from one BGP speaker to the RouteViews server).
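One way to organize this analysis is sketched below in Python. It is not the required method. It assumes that a session can be identified by the source IP in the fourth column, that each line of an update file counts as one update for its prefix, and that prefixes first seen in an update are counted too; the function name count_updates is hypothetical.

```python
from collections import defaultdict

def count_updates(rib_lines, update_lines):
    """Return {source IP: {prefix: number of updates}} from two iterables of lines."""
    counts = defaultdict(dict)
    # Seed every prefix in the RIB snapshot with zero updates, per session.
    for line in rib_lines:
        cols = line.rstrip("\n").split("|")
        if len(cols) >= 6:
            counts[cols[3]].setdefault(cols[5], 0)
    # Add one for every update line that mentions the prefix.
    for line in update_lines:
        cols = line.rstrip("\n").split("|")
        if len(cols) >= 6:
            per_prefix = counts[cols[3]]
            per_prefix[cols[5]] = per_prefix.get(cols[5], 0) + 1
    return counts

rib_sample = ["TABLE_DUMP2|1388750400|B|202.167.228.107|7631|0.0.0.0/0|7631 4826|"
              "IGP|202.167.228.107|0|0|7631:50|NAG||"]
update_sample = ["BGP4MP|1391431556|A|202.167.228.37|38809|66.240.183.0/24|"
                 "38809 2914 174 23136|IGP|202.167.228.37|0|0|"
                 "2914:420 2914:1005 2914:2000 2914:3000 65504:174|NAG||"]

counts = count_updates(rib_sample, update_sample)
for session, per_prefix in counts.items():
    print(session, per_prefix)
```

Passing open file objects rather than lists lets the function read each file line by line, so only the per-prefix counters are held in memory, not the raw lines.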
You should repeat your analysis for all parts of questions 2.1-2.4 for each session (each 12-2pm time period). Reporting the results for each session (rather than averaging) is fine. You should get a sense of how similar or different various trends are across different sessions.
The assignment page says, "Count each prefix equally, independently of [...] whether one prefix is contained inside another." These should be counted separately.
Yes.
You have to consider the RIB data (the full contents of the routing table) as well as the update data. Some of the prefixes in the table are never updated, so you will only know about them if you look at the table dumps.