HTTP

Introduction

The World Wide Web is a major client-server system, with millions of users. A site becomes a Web host by running an HTTP server. A user becomes a Web client by running a browser such as Netscape. A client can use any of the servers, and often uses a series of them.

Servers

There are a number of servers available. They all use the same protocol for communication with clients, but they differ in capabilities such as speed and reliability. The original ones were the CERN server and the NCSA server. These have given way to servers from Apache, Netscape, Microsoft, O'Reilly, Silicon Graphics and others.

The primary purpose of a Web server is to deliver a document on request to a client. The document may be text, an image file, or other type of file. The document is identified by a name called a URL (Uniform Resource Locator). If the server stores that particular URL (or can generate content for that URL), then it returns the document as the message reply.

Browsers

The purpose of a browser is to let the user request documents and to display them in some meaningful way. Browsers differ in the version of HTML they support, in extra features such as non-standard extensions and email support, and in the amount of customisation, speed, caching capabilities, etc. Browsers include Netscape, IE, Mozilla, Konqueror, Opera, Lynx and Amaya, among others.

URLs

URLs specify a document access method (a client server protocol), a server machine and the location of a document on that machine.

http://pandonia/OS.html
ftp://services.canberra.edu.au/bin/ls
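
The three parts of a URL can be pulled apart mechanically. The following is a minimal sketch using Python's standard urllib.parse module; the function name describe_url is made up for illustration.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into access method, server machine and document location."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path

# The two example URLs from the text above.
for url in ("http://pandonia/OS.html", "ftp://services.canberra.edu.au/bin/ls"):
    method, server, document = describe_url(url)
    print(f"method={method} server={server} document={document}")
```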

HTTP

Design

HTTP is a stateless, reliable protocol, and is connectionless in the sense that each request from a client is handled reliably and then the connection is broken. The Web is an excellent example of a set of protocols stretched way beyond their original scope, with a huge series of patches at all levels to try to fix problems.

Versions

There are three versions of HTTP:

Version 0.9 - totally obsolete
Version 1.0 - almost obsolete
Version 1.1 - current

An object-oriented (O/O) version was under development to replace HTTP/1.1 but seems to have vanished.

Each version must understand all earlier versions.

HTTP 0.9

Request format

Request = Simple-Request

Simple-Request = "GET" SP Request-URI CRLF

Response format

A response is of the form

Response = Simple-Response

Simple-Response = [Entity-Body]

HTTP 1.0

This version added much more information to the requests and responses. Rather than "grow" the 0.9 format, it was just left alongside the new version.

Request format

The format of requests from client to server is

Request = Simple-Request | Full-Request

Simple-Request = "GET" SP Request-URI CRLF

Full-Request = Request-Line
		*(General-Header
		| Request-Header
		| Entity-Header)
		CRLF
		[Entity-Body]

A Simple-Request is an HTTP/0.9 request and must be replied to by a Simple-Response.

A Request-Line has format

Request-Line = Method SP Request-URI SP HTTP-Version CRLF

where

Method = "GET" | "HEAD" | "POST" |
	 extension-method

e.g.

GET http://jan.newmarch.name/index.html HTTP/1.0
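
Since a server has to accept both request forms, it needs to tell them apart from the first line. The following is a minimal sketch of how that might be done, assuming the line has already been read from the connection; parse_request_line is an illustrative name, and labelling a Simple-Request as "HTTP/0.9" is this sketch's own convention, since such a request carries no version field.

```python
def parse_request_line(line):
    """Classify a request line as a Simple-Request or a Full-Request line."""
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) == 2 and parts[0] == "GET":
        # Simple-Request = "GET" SP Request-URI CRLF
        return {"kind": "simple", "method": "GET",
                "uri": parts[1], "version": "HTTP/0.9"}
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        # Request-Line = Method SP Request-URI SP HTTP-Version CRLF
        method, uri, version = parts
        return {"kind": "full", "method": method,
                "uri": uri, "version": version}
    raise ValueError(f"malformed request line: {line!r}")

print(parse_request_line("GET http://jan.newmarch.name/index.html HTTP/1.0\r\n"))
print(parse_request_line("GET /index.html\r\n"))
```

A request classified as "simple" must be answered with a Simple-Response, that is, the bare entity body with no status line or headers.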

Response format

A response is of the form

Response = Simple-Response | Full-Response

Simple-Response = [Entity-Body]

Full-Response = Status-Line
		*(General-Header 
		| Response-Header
		| Entity-Header)
		CRLF
		[Entity-Body]

The Status-Line gives information about the fate of the request:

Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF

e.g.

HTTP/1.0 200 OK

The codes are

Status-Code =	  "200" ; OK
		| "201" ; Created
		| "202" ; Accepted
		| "204" ; No Content
		| "301" ; Moved permanently
		| "302" ; Moved temporarily
		| "304" ; Not modified
		| "400" ; Bad request
		| "401" ; Unauthorised
		| "403" ; Forbidden
		| "404" ; Not found
		| "500" ; Internal server error
		| "501" ; Not implemented
		| "502" ; Bad gateway
		| "503" ; Service unavailable
		| extension-code
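
A client reads the Status-Line before anything else in a Full-Response. The following is a minimal sketch of splitting it into its three fields; the names parse_status_line and status_class are illustrative, and grouping codes by their first digit follows the layout of the table above.

```python
def parse_status_line(line):
    """Split a Status-Line into HTTP version, numeric code and reason phrase."""
    # The reason phrase may itself contain spaces, so split at most twice.
    version, code, reason = line.rstrip("\r\n").split(" ", 2)
    if not (len(code) == 3 and code.isdigit()):
        raise ValueError(f"bad status code: {code!r}")
    return version, int(code), reason

# The first digit of the code gives its general class.
STATUS_CLASSES = {2: "success", 3: "redirection",
                  4: "client error", 5: "server error"}

def status_class(code):
    """Return the general class of a status code, or "extension" if unknown."""
    return STATUS_CLASSES.get(code // 100, "extension")

version, code, reason = parse_status_line("HTTP/1.0 200 OK\r\n")
print(version, code, reason, status_class(code))
```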

The Entity-Header contains useful information about the Entity-Body to follow:

Entity-Header =	Allow
		| Content-Encoding
		| Content-Length
		| Content-Type
		| Expires
		| Last-Modified
		| extension-header

For example:

HTTP/1.1 200 OK
Date: Fri, 29 Aug 2003 00:59:56 GMT
Server: Apache/2.0.40 (Unix)
Accept-Ranges: bytes
Content-Length: 1595
Connection: close
Content-Type: text/html; charset=ISO-8859-1
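
The header lines of such a response can be turned into a lookup table quite simply. The following is a minimal sketch, assuming the whole response head is already in memory as a string; parse_headers is an illustrative name, and the sketch ignores folded (multi-line) header values.

```python
RESPONSE_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Fri, 29 Aug 2003 00:59:56 GMT\r\n"
    "Server: Apache/2.0.40 (Unix)\r\n"
    "Accept-Ranges: bytes\r\n"
    "Content-Length: 1595\r\n"
    "Connection: close\r\n"
    "Content-Type: text/html; charset=ISO-8859-1\r\n"
    "\r\n"
)

def parse_headers(head):
    """Return the status line and a dict of header fields from a response head."""
    lines = head.split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header section
        # Split at the first colon only: values such as dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()  # names are case-insensitive
    return status_line, headers

status_line, headers = parse_headers(RESPONSE_HEAD)
print(status_line)
print(int(headers["content-length"]))
```

Content-Length tells the client how many bytes of Entity-Body to read after the blank line.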

HTTP 1.1

HTTP 1.1 fixes many problems with HTTP 1.0, but is more complex as a result. It works by extending or refining the options available in HTTP 1.0. For example:

there are more commands such as TRACE and CONNECT
you should use absolute URLs, particularly for connecting by proxies, e.g. GET http://www.w3.org/index.html HTTP/1.1
there are more attributes such as If-Modified-Since, also for use by proxies

The changes include

hostname identification (allows virtual hosts)
content negotiation (multiple languages)
persistent connections (reduces TCP overheads - this is very messy)
chunked transfers
byte ranges (request parts of documents)
proxy support
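
Hostname identification is carried by the Host header. The following is a minimal sketch of a request built for one virtual host; build_get_request is an illustrative name. The sketch uses a path plus a Host header, which is the form normally sent directly to a server. When going through a proxy, the absolute URL form shown earlier would take the place of the path.

```python
def build_get_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request addressed to one virtual host."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",       # hostname identification: selects the virtual host
        "Connection: close",   # opt out of the persistent connection
        "",                    # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

print(build_get_request("www.w3.org", "/index.html").decode("ascii"))
```

Two sites sharing one IP address receive byte-for-byte identical requests apart from the Host line, which is how the server decides which site's document to return.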

The 0.9 protocol took one page. The 1.0 protocol was described in about 20 pages. 1.1 takes 120 pages.

Character set

HTTP messages use the US ASCII character set
Some parts of a message need not be understood by the HTTP client or server, but are intended for other parts of the application
These "content" parts can be in any character set

HTTP 1.1 Requests

The set of requests has been expanded to

"OPTIONS"
"GET"
"HEAD"
"POST"
"PUT"
"DELETE"
"TRACE"
"CONNECT"
extension-method

Content negotiation

An HTTP request can specify what types of content the client can handle by means of the request headers

    Accept              
    Accept-Charset      
    Accept-Encoding     
    Accept-Language

The Accept header can tell what type of document can be handled
```
    Accept: audio/*; q=0.2, audio/basic
```

Accept-Charset can tell the character sets handled

     Accept-Charset: iso-8859-5, unicode-1-1;q=0.8

Accept-Encoding can tell the encodings handled

      Accept-Encoding: compress;q=0.5, gzip;q=1.0

Accept-Language can tell the languages handled

    Accept-Language: da, en-gb;q=0.8, en;q=0.7
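
The q values in these headers rank the alternatives from 0 to 1, with 1 assumed when q is omitted. The following is a minimal sketch of how a server might turn any of the headers above into a preference list; parse_accept is an illustrative name, and the sketch ignores any parameters other than q.

```python
def parse_accept(value):
    """Parse an Accept-style header into (item, q) pairs, most preferred first."""
    prefs = []
    for element in value.split(","):
        item, *params = [p.strip() for p in element.split(";")]
        q = 1.0  # q defaults to 1 when omitted
        for param in params:
            name, _, val = param.partition("=")
            if name.strip() == "q":
                q = float(val)
        prefs.append((item, q))
    # sorted() is stable, so items with equal q keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

print(parse_accept("da, en-gb;q=0.8, en;q=0.7"))
print(parse_accept("compress;q=0.5, gzip;q=1.0"))
```

For the Accept-Language example this ranks Danish first, then British English, then any other English.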

Dates

For caching and expiry, the client and server need to exchange dates.

HTTP recognises three date formats:

Sun, 06 Nov 1994 08:49:37 GMT  ; RFC 822, updated by RFC 1123
Sunday, 06-Nov-94 08:49:37 GMT ; RFC 850, obsoleted by RFC 1036
Sun Nov  6 08:49:37 1994       ; ANSI C's asctime() format

The HTTP protocol specifies the possible date formats as

      HTTP-date    = rfc1123-date | rfc850-date | asctime-date
      rfc1123-date = wkday "," SP date1 SP time SP "GMT"
      rfc850-date  = weekday "," SP date2 SP time SP "GMT"
      asctime-date = wkday SP date3 SP time SP 4DIGIT
      date1        = 2DIGIT SP month SP 4DIGIT
                     ; day month year (e.g., 02 Jun 1982)
      date2        = 2DIGIT "-" month "-" 2DIGIT
                     ; day-month-year (e.g., 02-Jun-82)
      date3        = month SP ( 2DIGIT | ( SP 1DIGIT ))
                     ; month day (e.g., Jun  2)
      time         = 2DIGIT ":" 2DIGIT ":" 2DIGIT
                     ; 00:00:00 - 23:59:59
      wkday        = "Mon" | "Tue" | "Wed"
                   | "Thu" | "Fri" | "Sat" | "Sun"
      weekday      = "Monday" | "Tuesday" | "Wednesday"
                   | "Thursday" | "Friday" | "Saturday" | "Sunday"
      month        = "Jan" | "Feb" | "Mar" | "Apr"
                   | "May" | "Jun" | "Jul" | "Aug"
                   | "Sep" | "Oct" | "Nov" | "Dec"
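
A receiver has to accept all three formats. The following is a minimal sketch that tries each in turn using datetime.strptime; parse_http_date is an illustrative name. It assumes the default C/English locale for day and month names, and treats the asctime form as GMT since that format carries no zone of its own.

```python
from datetime import datetime, timezone

# One strptime pattern per production in the grammar above.
HTTP_DATE_FORMATS = (
    "%a, %d %b %Y %H:%M:%S GMT",  # rfc1123-date
    "%A, %d-%b-%y %H:%M:%S GMT",  # rfc850-date (two-digit year)
    "%a %b %d %H:%M:%S %Y",       # asctime-date
)

def parse_http_date(text):
    """Parse any of the three HTTP date formats into an aware UTC datetime."""
    for fmt in HTTP_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"not an HTTP date: {text!r}")

for sample in ("Sun, 06 Nov 1994 08:49:37 GMT",
               "Sunday, 06-Nov-94 08:49:37 GMT",
               "Sun Nov  6 08:49:37 1994"):
    print(parse_http_date(sample).isoformat())
```

All three sample strings denote the same instant, so the three printed values should be identical.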

Authentication

If a server wishes the client to authenticate its request, it does so by first rejecting the request with a "401" message. As part of this rejection, it should indicate in the "WWW-Authenticate" field information about the authorisation "realm" so that the client can determine if it possesses an authorisation for that realm. The client can then try again, but this time it includes a user-id and password.

This is not a very secure scheme. All the HTTP messages are sent in plain text format. The user-id and password are not encrypted in any way.
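
To see why this is weak, it helps to look at what is actually sent. The text above does not name a scheme; the following sketch assumes the common "Basic" scheme, in which the user-id and password are joined by a colon and Base64-encoded. The function names and the credentials are made up for illustration.

```python
import base64

def basic_authorization(user_id, password):
    """Build an Authorization header value for the Basic scheme."""
    raw = f"{user_id}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

def recover_credentials(header_value):
    """Show that anyone who sees the header can read the credentials back."""
    _scheme, _, token = header_value.partition(" ")
    user_id, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user_id, password

header = basic_authorization("Aladdin", "open sesame")
print("Authorization:", header)
print(recover_credentials(header))
```

Base64 is an encoding, not encryption: the second function needs no key to undo it.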

POST versus GET

"Normal" queries use GET. Strictly, if a request is "idempotent" it should use GET. Idempotent means that the client is not asking for a state change in the server, and would expect a repeat request to return the same result. This is the norm for static document requests

GET http://localhost/index.html

GET should also be used for idempotent form requests. Again, these are ones that do not cause any (visible) change of state.

GET http://localhost/cgi-bin/test-cgi?name=jan

Parameters are passed after a '?', in the form name=value. Any problematic characters have to be escaped, e.g. a space is written as its ASCII value in hex as '%20' (or as '+'). GET URLs can become very long. They can also be a security leak, since the form data is visible in the URL and is often saved in bookmarks, log files, etc.
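
The escaping rules need not be applied by hand. The following is a minimal sketch using Python's standard urllib.parse module; build_query_url is an illustrative name, and the value "jan newmarch" is made up to show a space being escaped.

```python
from urllib.parse import parse_qs, quote, urlencode

def build_query_url(base, params):
    """Append form parameters to a URL as an escaped query string."""
    return f"{base}?{urlencode(params)}"

# The example from the text: nothing needs escaping.
print(build_query_url("http://localhost/cgi-bin/test-cgi", {"name": "jan"}))

# A value containing a space: urlencode() uses the '+' form,
# while quote() uses the '%20' form.
print(urlencode({"name": "jan newmarch"}))
print(quote("jan newmarch"))

# The receiving side reverses the escaping.
print(parse_qs("name=jan+newmarch"))
```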

Note that a GET request that e.g. increases a count of logins to the server is still regarded as idempotent since it is not visible to the client.

Queries may be intended to result in state changes on the server, e.g. uploading a file or confirming a transaction. These queries should use POST, and include the form data in the content part of the message.
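
The difference from GET is where the form data travels. The following is a minimal sketch of a POST request carrying the same name=jan parameter in the message body rather than in the URL; build_post_request is an illustrative name, and the host and path are reused from the GET example purely for comparison.

```python
from urllib.parse import urlencode

def build_post_request(host, path, form):
    """Build a POST request with the form data in the message body."""
    body = urlencode(form).encode("ascii")
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"  # byte count of the body
        "\r\n"                              # blank line separates head and body
    ).encode("ascii")
    return head + body

print(build_post_request("localhost", "/cgi-bin/test-cgi", {"name": "jan"}).decode("ascii"))
```

Because the parameters are no longer part of the URL, they do not end up in bookmarks, although they are still sent in plain text.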

SOAP (see later) is criticised for forcing use of POST even for idempotent queries.

This page is maintained by Jan Newmarch http://jan.newmarch.name