A Simple Web Client
For this week's programming exercise, we will create a truly barebones web client. Based on the example TCP client code in hw1.zip and the HTTP example sessions shown in class, write a command-line program called hw1 that takes a URL as its only parameter, retrieves the indicated file, and stores it in the local directory with the appropriate filename. If the URL does not end in a filename, use 'index.html'. Make sure it works for both text and images by opening the stored file in a web browser. You may assume that the URL is of the form http://host/path, where path may or may not be an empty string, may or may not contain multiple slashes (for subdirectories), and may or may not contain a file name. You may assume files to be no larger than one megabyte, and you are not expected to handle HTTP redirect (3xx) return codes other than to report them.
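One possible way to split the URL into its pieces is sketched below. The names parse_url and struct url_parts, and the buffer sizes, are made up for illustration and are not part of hw1.zip. The sketch assumes the http://host/path form described above and does not treat query strings or port numbers specially.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the pieces of an http://host/path URL. */
struct url_parts {
    char host[256];   /* e.g. "www.google.com" */
    char path[1024];  /* e.g. "/intl/en_ALL/images/logo.gif", or "/" if empty */
    char file[256];   /* local filename, "index.html" if the URL names none */
};

/* Returns 0 on success, -1 if the URL is malformed or a piece does not fit. */
int parse_url(const char *url, struct url_parts *out)
{
    static const char prefix[] = "http://";
    size_t prefix_len = sizeof prefix - 1;

    assert(url != NULL && out != NULL);
    if (strncmp(url, prefix, prefix_len) != 0)
        return -1;

    /* The host runs from the end of the prefix to the first slash (if any). */
    const char *rest = url + prefix_len;
    const char *slash = strchr(rest, '/');
    size_t host_len = slash ? (size_t)(slash - rest) : strlen(rest);
    if (host_len == 0 || host_len >= sizeof out->host)
        return -1;
    memcpy(out->host, rest, host_len);
    out->host[host_len] = '\0';

    /* An empty path is requested as "/". */
    const char *path = slash ? slash : "/";
    if (strlen(path) >= sizeof out->path)
        return -1;
    strcpy(out->path, path);

    /* The filename is whatever follows the last slash; path always has one. */
    const char *name = strrchr(path, '/') + 1;
    if (*name == '\0')
        name = "index.html";
    if (strlen(name) >= sizeof out->file)
        return -1;
    strcpy(out->file, name);
    return 0;
}
```

With this sketch, http://www.google.com/ yields the path "/" and the filename index.html, while a URL with no trailing slash after its last component uses that component as the filename.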
The hostname may be a name like www.google.com, but the example code requires an IP address (like 128.30.87.92). To look up the IP address of a given host name, use gethostbyname() or getaddrinfo(). Running "man 3 gethostbyname" on the command line will give you the details, or consult the gethostbyname or getaddrinfo example.
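As a rough illustration of the lookup step, the sketch below uses getaddrinfo() to turn a host name into a dotted IPv4 string that could be handed to the example client. The helper name resolve_ipv4 is hypothetical, and taking only the first IPv4 result is a simplifying assumption.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes getaddrinfo() under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: writes the first IPv4 address found for `host`
 * into `ip` as a dotted string. Returns 0 on success, -1 on failure. */
int resolve_ipv4(const char *host, char *ip, size_t ipsz)
{
    struct addrinfo hints, *res;

    assert(host != NULL && ip != NULL);
    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_INET;        /* IPv4 only, to match the example address */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */

    if (getaddrinfo(host, NULL, &hints, &res) != 0)
        return -1;                    /* e.g. the host name does not exist */

    struct sockaddr_in *addr = (struct sockaddr_in *)res->ai_addr;
    const char *ok = inet_ntop(AF_INET, &addr->sin_addr, ip, (socklen_t)ipsz);
    freeaddrinfo(res);
    return ok ? 0 : -1;
}
```

A caller would typically pass a buffer of INET_ADDRSTRLEN bytes and exit with an error when the function returns -1, which covers the case of a host that cannot be resolved.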
A few hints:
- Use HTTP version 1.0. Version 1.1 can get a lot more complicated.
- Good functions to use for handling filenames and text include sprintf, sscanf, strstr, and strchr. Read more about these using the "man pages". For example, try "man sprintf" on the command line.
- NOTE: Newlines in HTTP are represented as "\r\n", not just "\n".
- Sections 2.2.2-2.2.3 in the book should be helpful. Your book talks about the "request line" and "header lines" for an HTTP request. You will only need to use the request line and the host line of the header.
- You will need to remove the HTTP headers from the web server's response before saving the data to a file.
- Your program will be tested (at least) on these URLs:
http://www.google.com/
http://www.google.com/intl/en_ALL/images/logo.gif
http://www.google.com/thispagedoesnotexist
http://www.thissitedoesnotexist1000.com
http://www.engadget.com/2010/08/27/amazon-kindle-review
http://www.engadget.com/2010/08/27/amazon-kindle-review/
Make sure you handle all of these cases gracefully. The first should produce a file index.html. The second should produce a logo.gif (containing the picture). The third and fourth should exit with an error, i.e. exit(1). For the engadget URLs, you need to supply an extra "Host:" header line in the request.
- Beej's Guide to Network Programming is a great resource you may want to make use of.
- Spend some time thinking about how to do the string manipulation. It does not have to be complicated. The complete program, including comments, error handling etc. can be written in about 100 leisurely lines.
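To make a few of the hints above concrete, here is a minimal sketch of the string handling around the request and the response. The function names build_request, status_code, and find_body are hypothetical. The sketch assumes the whole response has already been read into one buffer, and that the buffer is NUL-terminated when passed to status_code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats an HTTP/1.0 GET request with a Host header line.
 * Returns the request length, or -1 if it does not fit in buf. */
int build_request(char *buf, size_t bufsz, const char *host, const char *path)
{
    assert(buf != NULL && host != NULL && path != NULL);
    int n = snprintf(buf, bufsz,
                     "GET %s HTTP/1.0\r\n"   /* request line */
                     "Host: %s\r\n"          /* host header line */
                     "\r\n",                 /* blank line ends the headers */
                     path, host);
    return (n < 0 || (size_t)n >= bufsz) ? -1 : n;
}

/* Extracts the numeric status from a status line such as
 * "HTTP/1.0 200 OK". Returns -1 if the line does not parse. */
int status_code(const char *resp)
{
    int code;
    return sscanf(resp, "HTTP/%*d.%*d %d", &code) == 1 ? code : -1;
}

/* Returns a pointer to the first byte after the blank line that ends the
 * headers, and stores the number of remaining bytes in *body_len.
 * Returns NULL if no blank line is found. The scan is length-based, so
 * image data containing zero bytes is handled. */
const char *find_body(const char *resp, size_t len, size_t *body_len)
{
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(resp + i, "\r\n\r\n", 4) == 0) {
            *body_len = len - (i + 4);
            return resp + i + 4;
        }
    }
    return NULL;
}
```

The body returned by find_body could then be written out with fwrite() to a file opened in "wb" mode, so that image data is stored unchanged.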
Handin
We will be using Handin to turn in all assignments. Turn this in for class 342, assignment 1.