SO_REUSEADDR option to allow the proxy to re-use an address that is already in use; see the setsockopt man page for more information.
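As a rough illustration of where the option goes, the sketch below sets SO_REUSEADDR on a listening socket before calling bind(). The helper name make_listener and the backlog of 16 are assumptions made for this example, not part of the assignment.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP listening socket on `port` with
 * SO_REUSEADDR enabled. Returns the socket descriptor, or -1 on error. */
int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Set the option before bind(); otherwise restarting the proxy soon
     * after a previous run may fail with "Address already in use". */
    int yes = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes) < 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    /* The backlog of 16 is an arbitrary value chosen for this sketch. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned descriptor can then be passed to accept() as usual.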
fputs to send data to the client: it will work correctly for text data, but will likely fail for binary data, because it stops at the first zero byte. Use read/write or fread/fwrite instead. Another possible cause of this problem is that your proxy is running single-threaded and is missing subsequent HTTP requests while it serves the request for the main page.
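One way to send binary data safely is to pass an explicit length to write() and loop until every byte has been sent. The following is a minimal sketch; write_all is a hypothetical helper name, and it assumes a blocking descriptor.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write exactly `len` bytes from `buf` to `fd`.
 * The length is explicit, so zero bytes inside images or compressed
 * data are sent like any other byte.
 * Returns 0 on success, -1 on error. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        /* write() may accept fewer bytes than requested; advance and loop. */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

The byte count passed in should be the value returned by the read() or fread() call that filled the buffer, not the result of strlen().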
realloc man page for information on dynamically growing a buffer.
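A common pattern for growing a buffer with realloc is to double its capacity whenever it fills up. The sketch below is one possible version; the struct buffer type, the buffer_append name, and the 1024-byte starting size are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; initialize it with all fields zero. */
struct buffer {
    char  *data;
    size_t len;   /* bytes currently stored */
    size_t cap;   /* bytes allocated */
};

/* Append `n` bytes from `src`, doubling the capacity as needed.
 * Returns 0 on success, or -1 if memory runs out, in which case the
 * buffer is left unchanged. */
int buffer_append(struct buffer *b, const void *src, size_t n)
{
    assert(b != NULL && (src != NULL || n == 0));
    if (n == 0)
        return 0;

    if (n > b->cap - b->len) {
        size_t new_cap = b->cap ? b->cap : 1024;  /* arbitrary starting size */
        while (new_cap - b->len < n) {
            if (new_cap > SIZE_MAX / 2)
                return -1;                        /* doubling would overflow */
            new_cap *= 2;
        }
        /* Assign to a temporary: if realloc() fails it returns NULL and
         * the old block is still valid, so b->data must not be lost. */
        char *tmp = realloc(b->data, new_cap);
        if (tmp == NULL)
            return -1;
        b->data = tmp;
        b->cap = new_cap;
    }

    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```

Call free() on the data field once the response has been forwarded.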
Almost all large commercial websites implement some sort of load-balancing or distribution scheme in order to cope with large volumes of traffic and ensure availability in the face of hardware failure. When you send a request to a website like Google, rather than being directed to a single specific server your request is sent to one of several (possibly several thousand!) machines selected by some combination of software and hardware that is looking at inbound requests. One strategy is to keep track of the load on the cluster of servers, and then direct the next incoming request to the machine with the lowest load. Simpler schemes (like round robin balancing) just direct traffic to each server in the cluster in sequence without paying attention to existing load information.
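The two selection strategies described above can be sketched in a few lines. This is a toy illustration only: the function names and the array of per-server load counts are assumptions, and real load balancers are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Round robin: hand out server indices 0, 1, ..., n-1, 0, 1, ... in
 * sequence, ignoring load. `*next` holds the state between calls. */
size_t pick_round_robin(size_t *next, size_t n)
{
    assert(next != NULL && n > 0);
    size_t chosen = *next % n;
    *next = (chosen + 1) % n;
    return chosen;
}

/* Least load: return the index of the server with the smallest load
 * value (the first such server wins on ties). */
size_t pick_least_loaded(const unsigned *load, size_t n)
{
    assert(load != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (load[i] < load[best])
            best = i;
    }
    return best;
}
```

With either scheme, two back-to-back requests for the same URL can land on different servers.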
The result is that two requests to the same URL, sent very close together, may ultimately be serviced by two different machines with different hardware and software configurations! In principle, this sort of load distribution is meant to be entirely transparent to the user; all the servers should return exactly the same results and the same data. In practice, this is often not quite the case. One cause of differences can be errors in the cluster setup. If a particular machine is misconfigured, or not properly updated, it may reply to requests with out-of-date data. A machine whose clock drifts out of sync quickly may return timestamps that are very different from those of its siblings in the cluster.
Other differences are intentional; in order to debug problems with individual machines in the cluster, the machines may be configured to put an identifying marker in the headers or data of the messages that they return. Finally, there can sometimes be transient errors that occur as the cluster is updated; in order to improve redundancy, clustered machines may cache local copies of data rather than reading it from a single shared storage medium. Updates to the cached copy can either be pushed out to the cluster machines, or pulled from a central server according to some schedule. It is possible to hit two different machines in a cluster while they are in the process of updating, in which case one machine will return the 'new' data, and another machine will still be serving the older, cached copy. Such windows are hopefully quite small, but can occasionally become visible to users who are engaging in unusual activities (such as making repeated, rapid web queries to a single URL in order to test the web proxy that they wrote!).