C代写:CS213WritingaCachingWebProxy


实现一个Web代理器,完成代理基本的HTTP协议。

Introduction

A proxy server is a computer program that acts as an intermediary between
clients making requests to access resources and the servers that satisfy those
requests by serving content. A web proxy is a special type of proxy server
whose clients are typically web browsers and whose servers are the same
servers that browsers use. When a web browser uses a proxy, it contacts the
proxy instead of communicating directly with the web server; the proxy
forwards its client’s request to the web server, reads the server’s response,
then forwards the response to the client.
Proxies are useful for many purposes. Sometimes proxies are used in firewalls,
so that browsers behind a firewall can only contact a server beyond the
firewall via the proxy. A proxy may also perform translations on pages, for
example, to make them viewable on web-enabled phones. Importantly, proxies are
used as anonymizers: by stripping requests of all identifying information, a
proxy can make the browser anonymous to web servers. Proxies can even be used
to cache web objects by storing local copies of objects from servers then
responding to future requests by reading them out of its cache rather than by
communicating again with remote servers.
In this lab, you will write a simple HTTP proxy that caches web objects. For
the first part of the lab, you will set up the proxy to accept incoming
connections, read and parse requests, forward requests to web servers, read
the servers’ responses, and forward those responses to the corresponding
clients. This first part will involve learning about basic HTTP operation and
how to use sockets to write programs that communicate over network
connections. In the second part, you will upgrade your proxy to deal with
multiple concurrent connections. This will introduce you to dealing with
concurrency, a crucial systems concept. In the third and last part, you will
add caching to your proxy using a simple main memory cache of recently
accessed web content.

Logistics

This is an individual project. You may not use your grace days on this lab.
You should do your work on one of the Shark machines.

Handout instructions

As usual, start by downloading the lab handout (proxylab-handout.tar) from
Autolab and extracting it into the directory in which you plan to workissue
the following command:
linux> tar xvf proxylab-handout.tar
If possible, extract the files directly onto a Shark machine; some operating
systems and file transfer programs clobber Unix file system permission bits.

Robust I/O package

The handout directory contains the files csapp.c and csapp.h, which comprise
the CS:APP package discussed in the CS:APP3e textbook. In addition to various
error-handling wrapper functions and helper functions, the CS:APP package
includes the robust I/O (RIO) package. When reading and writing socket data,
you should use the RIO package instead of low-level I/O functions, such as
read, write, or standard I/O functions, such as fread, and fwrite.
Keep in mind that the error-handling functions in CS:APP may not be
appropriate for use in your proxy. Before blindly using wrapper functions or
writing any of your own, carefully consider the proper action a server should
take on each particular error.

Modularity

The skeleton file proxy.c is provided in the handout. proxy.c contains a main
function that does practically nothing, and you should fill in that file with
the guts of your implementation. Modularity, though, should be an important
consideration, and it is permissible and encouraged for you to separate the
individual modules of your implementation into different files. For example,
your cache should be largely (or completely) decoupled from the rest of your
proxy, so one popular idea is to move the implementation of the cache into
separate files like cache.c and cache.h.

Makefile

Since you are free to add your own source files for this lab, you are
responsible for updating the Makefile. The entire project should compile
without warnings (you may want to use the -Werror flag), and you will want to
determine the appropriate set of compilation flags (including optimization,
linking, and debugging flags) for your final submission.

Part I: Implementing a sequential web proxy

The first step is implementing a basic sequential proxy that handles HTTP/1.0
GET requests. Other requests type, such as POST, are strictly optional.
When started, your proxy should listen for incoming connections on a port
whose number will be specified on the command line. Once a connection is
established, your proxy should read the entirety of the request from the
client and parse the request. It should determine whether the client has sent
a valid HTTP request; if so, it can then establish its own connection to the
appropriate web server then request the object the client specified. Finally,
your proxy should read the server’s response and forward it to the client.

HTTP/1.0 GET requests

When an end user enters a URL into the address bar of a web browser, the
browser will send an HTTP request to the proxy that begins with a line.
In that case, the proxy should parse the request into at least the following
fields: the hostname; and the path or query and everything following it,
/hub/index.html. That way, the proxy can determine that it should open a
connection and send an HTTP request of its own starting with a line of the
following form:
GET /hub/index.html HTTP/1.0
Note that all lines in an HTTP request end with a carriage return, ‘\r’,
followed by a newline, ‘\n’. Also important is that every HTTP request is
terminated by an empty line: “\r\n”.
You should notice in the above example that the web browser’s request line
ends with HTTP/1.1, while the proxy’s request line ends with HTTP/1.0. Modern
web browsers will generate HTTP/1.1 requests, but your proxy should handle
them and forward them as HTTP/1.0 requests.
It is important to consider that HTTP requests, even just the subset of
HTTP/1.0 GET requests, can be incredibly complicated. The textbook describes
certain details of HTTP transactions, but you should refer to RFC 1945 for the
complete HTTP/1.0 specification. Ideally your HTTP request parser will be
fully robust according to the relevant sections of RFC 1945, except for one
detail: while the specification allows for multiline request fields, your
proxy is not required to properly handle them. Of course, your proxy should
never prematurely abort due to a malformed request.

Request headers

Request headers are very important elements of an HTTP request. Headers are
essentially key-value pairs provided line-by-line following the first request
line of an HTTP request. Of particular importance for this lab are the Host,
User-Agent, Connection, and Proxy-Connection headers. Your proxy must perform
the following operations with regard to the listed HTTP request headers:

  • Always send a Host header. While this behavior is technically not sanctioned by the HTTP/1.0 specification, it is necessary to coax sensible responses out of certain web servers, especially those that use virtual hosting.
    The Host header describes the hostname of the web server your proxy is trying
    to access. For example, to access your proxy would send the following header:
    It is possible that web browsers will attach their own Host headers to their
    HTTP requests. If that is the case, your proxy should use the same Host header
    as the browser.
  • You may choose to always send the following User-Agent header
    The header is provided on two separate lines because it does not fit as a
    single line in the writeup, but your proxy should send the header as a single
    line.
    The User-Agent header identifies the client (in terms of parameters such as
    the operating system and browser), and web servers often use the identifying
    information to manipulate the content they serve. Sending this particular
    User-Agent: string may improve, in content and diversity, the material that
    you get back during simple telnet-style testing.
  • Always send the following Connection header:
    Connection: close
  • Always send the following Proxy-Connection header:
    Proxy-Connection: close
    The Connection and Proxy-Connection headers are used to specify whether a
    connection will be kept alive after the first request/response exchange is
    completed. It is perfectly acceptable (and suggested) to have your proxy open
    a new connection for each request. Specifying close as the value of these
    headers alerts web servers that your proxy intends to close connections after
    the first request/response exchange.
    With the exception of the Host header, your proxy should ignore the values of
    the request headers described above; instead, your proxy should always send
    the headers this document specifies.
    For your convenience, the values of the described User-Agent header is
    provided to you as a string constant in proxy.c.
    Finally, if a browser sends any additional request headers as part of an HTTP
    request, your proxy should forward them unchanged.

Port numbers

There are two significant classes of port numbers for this lab: HTTP request
ports and your proxy’s listening port.
The HTTP request port is an optional field in the URL of an HTTP request. That
is, the URL may be of the form, in which case your proxy should connect to the
host on port 8080 instead of the default HTTP port, which is port 80. Your
proxy must properly function whether or not the port number is included in the
URL.
The listening port is the port on which your proxy should listen for incoming
connections. Your proxy should accept a command line argument specifying the
listening port number for your proxy. For example, with the following command,
your proxy should listen for connections on port 12345: linux] ./proxy 12345
You will have to supply a port number every time you wish to test your proxy
by running it. You may select any non-privileged port (greater than 1,024 and
less than 65,536) as long as it is not used by other processes.
Since each proxy must use a unique listening port and many people will
simultaneously be working on each machine, the script port-for-user.pl is
provided to help you pick your own personal port number.
Use it to generate port number based on your Andrew ID:
linux> ./port-for-user.pl droh droh: 45806
The port, p, returned by port-for-user.pl is always an even number. So if you
need an additional port number, say for the Tiny server, you can safely use
ports p and p + 1.
Please don’t pick your own random port. If you do, you run the risk of
interferring with another user.

Part II: Dealing with multiple concurrent requests

Production web proxies usually do not process requests sequentially; they
process multiple requests in parallel. This is particularly important when
handling a single request can involve a lengthy delay (as it might when
contacting a remote web server). While your proxy waits for a response from
the remote web server, it can work on a pending request from another client.
Thus, once you have a working sequential proxy, you should alter it to
simultaneously handle multiple requests.

POSIX Threads

You will use the POSIX Threads (Pthreads) library to spawn threads that will
execute in parallel to serve multiple simultaneous requests. A simple way to
implement concurrent request service is to spawn a new thread to process each
new incoming request. In this architecture, the main server thread simply
accepts connections and spawns off independent worker threads that deal with
each request to completion and terminate when they are done. Other designs are
also viable: you might alternatively decide to have your proxy create a pool
of worker threads from the start. You may use any architecture you wish as
long as your proxy exhibits true concurrency, but spawning a new worker thread
for each request is the simplest and historically most common method.
The basic usage of Pthreads involves the implementation of a function that
will serve as the start routine for new threads. Once a start routine exists,
you can use the pthread_create function to create and start a new thread. New
threads are by default joinable, which means that another thread must clean up
spare resources after the thread exits, similar to how an exited process must
be reaped by a call to wait.
Luckily, it is possible to detach threads, meaning spare resources are
automatically reaped upon thread exit.
To properly detach threads, the first line of the start routine should be as
follows:
pthread_detach(pthread_self());

Race conditions

While multithreading will almost certainly improve the performance of your web
proxy, concurrency comes at a price: the threat of race conditions. Race
conditions most often arise when there is a shared resource between multiple
threads. You must find ways to avoid race conditions in your concurrent proxy.
That will likely involve both minimizing shared resources and synchronizing
access to shared resources. Synchronization involves the use of objects called
locks, which come in many varieties. The Pthreads library contains all of the
locking primitives you might need for synchronization in your proxy. See the
your textbook for details.

Thread safety

The open clientfd and open listenfd functions described in the CS:APP3e
textbook are thread safe are based on the modern and protocol-independent
getaddrinfo function, and thus are thread safe.

Part III: Caching web objects

For the final part of the lab, you will add a cache to your proxy that will
keep recently used Web objects in memory. HTTP actually defines a fairly
complex model by which web servers can give instructions as to how the objects
they serve should be cached and clients can specify how caches should be used
on their behalf. However, your proxy will adopt a simplified approach.
When your proxy receives a web object from a server, it should cache it in
memory as it transmits the object to the client. If another client requests
the same object from the same server, your proxy need not reconnect to the
server; it can simply resend the cached object.
Obviously, if your proxy were to cache every object that is ever requested, it
would require an unlimited amount of memory. Moreover, because some web
objects are larger than others, it might be the case that one giant object
will consume the entire cache, preventing other objects from being cached at
all. To avoid those problems, your proxy should have both a maximum cache size
and a maximum cache object size.

Maximum cache size

The entirety of your proxy’s cache should have the following maximum size:
MAX_CACHE_SIZE = 1 MiB
When calculating the size of its cache, your proxy must only count bytes used
to store the actual web objects; any extraneous bytes, including metadata,
should be ignored.

Maximum object size

Your proxy should only cache web objects that do not exceed the following
maximum size:
MAX_OBJECT_SIZE = 100 KiB
For your convenience, both size limits are provided as macros in proxy.c.
The easiest way to implement a correct cache is to allocate a buffer for each
active connection and accumulate data as it is received from the server. If
the size of the buffer ever exceeds the maximum object size, the buffer can be
discarded. If the entirety of the web server’s response is read before the
maximum object size is exceeded, then the object can be cached. Using this
scheme, the maximum amount of data your proxy will ever use for web objects is
the following, where T is the maximum number of active connections:
MAX_CACHE_SIZE + T * MAX_OBJECT_SIZE

Eviction policy

Your proxy’s cache should employ an eviction policy that approximates a least-
recently-used (LRU) eviction policy. It doesn’t have to be strictly LRU, but
it should be something reasonably close. Note that both reading an object and
writing it count as using the object.

Synchronization

Accesses to the cache must be thread-safe, and ensuring that cache access is
free of race conditions will likely be the more interesting aspect of this
part of the lab. As a matter of fact, there is a special requirement that
multiple threads must be able to simultaneously read from the cache. Of
course, only one thread should be permitted to write to the cache at a time,
but that restriction must not exist for readers.
As such, protecting accesses to the cache with one large exclusive lock is not
an acceptable solution. You may want to explore options such as partitioning
the cache, using Pthreads readers-writers locks, or using semaphores to
implement your own readers-writers solution. In either, the fact that you
don’t have to implement a strictly LRU eviction policy will give you some
flexibility in supporting multiple readers.

Evaluation

This assignment will be graded out of a total of 100 points, which will be
awarded based on the following criteria:

  • BasicCorrectness: 40 points for basic proxy operation (autograded)
  • Concurrency: 15 points for handling concurrent requests (autograded)
  • Cache: 15 points for a working cache (autograded)
  • RealPages: 20 points for correctly serving the following pages (5 pts each):
  • Style: 10 points for style.

Autograding

Your handout materials include an autograder, called driver.sh, that Autolab
will use to assign scores for BasicCorrectness, Concurrency, and Cache. From
the proxylab-handout directory:
linux> ./driver.sh
You must run the driver on a Shark machine.

Manual testing

The autograder does only simple checks to confirm that your code is acting
like a concurrent caching proxy.
Therefore, your TAs will do additional manual testing to see how your proxy
deals with real pages of increasing complexity.
The TAs will test your proxy by running it on a Shark machine, and typing the
RealPages URLs directly into the most recent version of the Firefox browser.
We will not evaluate your proxy under any other conditions, or on any other
pages.
Your TAs will also examine your code for any correctness issues that weren’t
detected by the earlier tests. In particular, we will be looking for errors
such as race conditions, thread safety issues, and non-approximate LRU cache
implementations, memory and descriptor leaks, and improper SIGPIPE handling.

Robustness

As always, you must deliver a program that is robust to errors and even
malformed or malicious input.
Servers are typically long-running processes, and web proxies are no
exception. Think carefully about how long-running processes should react to
different types of errors. For many kinds of errors, it is certainly
inappropriate for your proxy to immediately exit.
Robustness implies other requirements as well, including invulnerability to
error cases like segmentation faults and a lack of memory leaks and file
descriptor leaks.

Style

Style points will be awarded based on the usual criteria. Proper error
handling is as important as ever, and modularity is of particular importance
for this lab, as there will be a significant amount of code. You should also
strive for portability.

Testing and debugging

Besides the simple autograder, you will not have any sample inputs or a test
program to test your implementation. You will have to come up with your own
tests and perhaps even your own testing harness to help you debug your code
and decide when you have a correct implementation. This is a valuable skill in
the real world, where exact operating conditions are rarely known and
reference solutions are often unavailable.
Fortunately there are many tools you can use to debug and test your proxy. Be
sure to exercise all code paths and test a representative set of inputs,
including base cases, typical cases, and edge cases.

thttpd

thttpd is a tiny HTTP server that is available for download on the web. You
can run thttpd on any port given you have sufficient permissions, and it will
serve files as requested. You can use thttpd to test your proxy’s ability to
handle arbitrary content that you can make available as files to thttpd.

Tiny web server

Your handout directory the source code for the CS:APP Tiny web server. While
not as powerful as thttpd, the CS:APP Tiny web server will be easy for you to
modify as you see fit. It’s also a reasonable starting point for your proxy
code. And it’s the server that the driver code uses to fetch pages.

curl

You can use curl to generate HTTP requests to any server, including your own
proxy. It is an extremely useful debugging tool. For example, if your proxy
and Tiny are both running on the local machine, Tiny is listening on port
15213, and proxy is listening on port 15214, then you can request a page from
Tiny via your proxy.

telnet

As you saw during lecture, you can use telnet to open a connection to your
proxy and send it HTTP requests.

netcat

netcat, also known as nc, is a versatile network utility. You can use netcat
just like telnet, to open connections to servers. Hence, imagining that your
proxy were running on catshark using port 12345 you can do something like the
following to manually test your proxy
In addition to being able to connect to Web servers, netcat can also operate
as a server itself. With the following command, you can run netcat as a server
listening on port 12345.
Once you have set up a netcat server, you can generate a request to a phony
object on it through your proxy, and you will be able to inspect the exact
request that your proxy sent to netcat.

Web browsers

Eventually you should test your proxy using the most recent version of Mozilla
Firefox. Visiting About Firefox will automatically update your browser to the
most recent version.
It will be very exciting to see your proxy working through a real Web browser.
Although the functionality of your proxy will be limited, you will notice that
you are able to browse the vast majority of websites through your proxy.
An important caveat is that you must be very careful when testing caching
using a Web browser. All modern
Web browsers have caches of their own, which you should disable before
attempting to test your proxy’s cache.


文章作者: SafePoker
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 SafePoker !
  目录