HTML and HTTP

Introduction

The World Wide Web is a major client-server system, with probably millions of users. A site may become a Web server by running an http daemon. A user becomes a Web client by running a browser such as Netscape. A client can use any of the servers, and often uses a series of them.

Servers

A number of servers are available. They all use the same protocol to communicate with clients, but differ in capabilities such as speed and reliability. The original ones were the CERN server and the NCSA server. These have given way to servers from Netscape, Microsoft, O'Reilly, Silicon Graphics, the WWW Consortium (Jigsaw), etc.

The primary purpose of a Web server is to deliver a document on request to a client. The document may be text, an image file, or another type of file. The document is identified by a name called a URL (Uniform Resource Locator). If the server stores the document for that URL (or can generate content for it), then it returns the document as the message reply.

Browsers

The purpose of a browser is to let the user request documents and to display them in some meaningful way. Browsers differ in the version of HTML they support and in extra features such as non-standard extensions, email support, the amount of customisation, speed, caching capabilities, etc.

URLs

URLs specify a document access method (a client-server protocol), a server machine and the location of a document on that machine, for example:

http://pandonia/OS.html
ftp://services.canberra.edu.au/bin/ls
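As a rough illustration of those three parts, the following Python sketch splits the two example URLs using the standard urllib.parse module. The function name describe_url and the dictionary keys are made up for this example.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into access method, server machine and document location."""
    parts = urlsplit(url)
    return {
        "method": parts.scheme,    # the client-server protocol
        "server": parts.hostname,  # the machine holding the document
        "location": parts.path,    # where the document lives on that machine
    }

for url in ("http://pandonia/OS.html", "ftp://services.canberra.edu.au/bin/ls"):
    print(describe_url(url))
```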

HTML

Document structure

HTML is a markup language defined in SGML (Standard Generalized Markup Language). HTML defines a structure for a document without specifying the details of layout. For example, headers of various levels are defined, but originally there was no way to control how those headers were laid out.

A trivial document looks like

<html>
<head>
<title>
Title of document
</title>
</head>

<body>
<h1> Header level 1 </h1>
Some text in here
</body>
</html>

Hypertext links

An HTML document may contain links to other documents. When a link is selected, the browser is expected to fetch the new document and display it in place of the current one.
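To give an idea of what a browser has to do before it can follow a link, here is a small Python sketch that collects link targets from a page with the standard html.parser module. It assumes links are written as anchor elements with an href attribute. The class name LinkCollector, the function find_links and the sample page are invented for the illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> element seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names in lower case.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def find_links(html_text):
    """Return the link targets found in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links

# Hypothetical page that links to one of the example URLs.
page = ('<html><body><h1>Notes</h1>'
        '<a href="http://pandonia/OS.html">OS notes</a>'
        '</body></html>')
print(find_links(page))
```

A real browser would then resolve each target against the URL of the current document and fetch it when the user selects the link.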

HTTP

Design

HTTP is a stateless, connectionless, reliable protocol. The server keeps no memory of earlier requests: each request from a client is handled reliably and then the connection is broken.
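The "one request, then the connection is broken" behaviour can be observed with a small experiment. The Python sketch below starts a throwaway server on the local machine using the standard http.server module (which speaks HTTP/1.0 by default) and makes two requests over raw sockets. The names HelloHandler, fetch_once and demo are invented for this example, and it assumes the loopback interface is usable.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Reply to every GET with the same small text document."""

    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demonstration quiet

def fetch_once(port):
    """Open a connection, send one request, read until the server closes."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server has broken the connection
                break
            chunks.append(data)
    return b"".join(chunks)

def demo():
    """Serve on a free local port and make two independent requests."""
    server = HTTPServer(("127.0.0.1", 0), HelloHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return [fetch_once(port) for _ in range(2)]
    finally:
        server.shutdown()
        server.server_close()

if __name__ == "__main__":
    for reply in demo():
        print(reply.decode("ascii"))
```

Each call to fetch_once needs a fresh connection, and the server treats the second request exactly like the first.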

Versions

The current version of HTTP is version 1.1. The previous versions were 0.9 and 1.0. The first line of any message should include the version number as in

HTTP/1.1

If this is not present, version 0.9 is assumed.

HTTP/1.0 servers must handle different versions of request as follows:

recognise the format of the Request-Line for HTTP/0.9 and HTTP/1.0 requests
understand any valid request in the format of HTTP/0.9 or HTTP/1.0
respond appropriately with a message in the same version as the client

HTTP/1.0 clients must

recognise the format of the Status-Line for HTTP/1.0 responses
understand any valid response in the format of HTTP/0.9 or HTTP/1.0

Request format

The format of requests from client to server is

Request = Simple-Request | Full-Request

Simple-Request = "GET" SP Request-URI CRLF

Full-Request = Request-Line
		*(General-Header
		| Request-Header
		| Entity-Header)
		CRLF
		[Entity-Body]

A Simple-Request is an HTTP/0.9 request and must be replied to by a Simple-Response.

A Request-Line has format

Request-Line = Method SP Request-URI SP HTTP-Version CRLF

where

Method = "GET" | "HEAD" | "POST" |
	 extension-method

e.g.

GET http://jan.newmarch.name/index.html HTTP/1.0
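A server's first job is to decide which of the two request forms it has received. The following Python sketch shows one way this might be done. The function name parse_request_line is invented, and the checks are deliberately minimal rather than a full implementation of the grammar.

```python
def parse_request_line(line):
    """Parse the first line of a request into (method, uri, version).

    A Simple-Request carries no version field, so HTTP/0.9 is assumed.
    """
    if not line.endswith("\r\n"):
        raise ValueError("request line must end with CRLF")
    parts = line[:-2].split(" ")
    if len(parts) == 2 and parts[0] == "GET":
        # Simple-Request = "GET" SP Request-URI CRLF
        return parts[0], parts[1], "HTTP/0.9"
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        # Request-Line = Method SP Request-URI SP HTTP-Version CRLF
        return parts[0], parts[1], parts[2]
    raise ValueError("malformed request line: %r" % line)

print(parse_request_line("GET http://jan.newmarch.name/index.html HTTP/1.0\r\n"))
print(parse_request_line("GET /index.html\r\n"))
```

The version returned tells the server whether to reply with a Simple-Response or a Full-Response.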

Response format

A response is of the form

Response = Simple-Response | Full-Response

Simple-Response = [Entity-Body]

Full-Response = Status-Line
		*(General-Header 
		| Response-Header
		| Entity-Header)
		CRLF
		[Entity-Body]

The Status-Line gives information about the fate of the request:

Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF

e.g.

HTTP/1.0 200 OK

The codes are

Status-Code =	  "200" ; OK
		| "201" ; Created
		| "202" ; Accepted
		| "204" ; No Content
		| "301" ; Moved permanently
		| "302" ; Moved temporarily
		| "304" ; Not modified
		| "400" ; Bad request
		| "401" ; Unauthorised
		| "403" ; Forbidden
		| "404" ; Not found
		| "500" ; Internal server error
		| "501" ; Not implemented
		| "502" ; Bad gateway
		| "503" ; Service unavailable
		| extension-code
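On the client side the Status-Line has to be pulled apart before anything else can happen. A minimal Python sketch follows. The function names are invented, and the grouping of codes by their first digit is taken from the HTTP/1.0 specification rather than from the list above.

```python
# Classes of status code, keyed by first digit (from the HTTP/1.0 specification).
STATUS_CLASSES = {
    "1": "informational",
    "2": "success",
    "3": "redirection",
    "4": "client error",
    "5": "server error",
}

def parse_status_line(line):
    """Split a Status-Line into (version, code, reason)."""
    if not line.endswith("\r\n"):
        raise ValueError("status line must end with CRLF")
    # The Reason-Phrase may itself contain spaces, so split at most twice.
    version, code, reason = line[:-2].split(" ", 2)
    if not version.startswith("HTTP/") or len(code) != 3 or not code.isdigit():
        raise ValueError("malformed status line: %r" % line)
    return version, int(code), reason

def status_class(code):
    """Classify a code, so that unknown extension codes can still be handled."""
    return STATUS_CLASSES.get(str(code)[0], "unknown")

version, code, reason = parse_status_line("HTTP/1.0 404 Not found\r\n")
print(version, code, reason, "->", status_class(code))
```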

The Entity-Header contains useful information about the Entity-Body to follow:

Entity-Header =	Allow
		| Content-Encoding
		| Content-Length
		| Content-Type
		| Expires
		| Last-Modified
		| extension-header
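Putting the pieces together, a Full-Response for a successful request might be assembled as in the Python sketch below. The function name build_response and its default arguments are assumptions made for the example. Only two of the entity headers are filled in.

```python
def build_response(body, content_type="text/html", version="HTTP/1.0"):
    """Assemble a Full-Response: Status-Line, entity headers, CRLF, body.

    body must be bytes, since Content-Length counts bytes, not characters.
    """
    headers = [
        ("Content-Type", content_type),
        ("Content-Length", str(len(body))),
    ]
    head = "%s 200 OK\r\n" % version
    head += "".join("%s: %s\r\n" % (name, value) for name, value in headers)
    # The empty line (a bare CRLF) separates the headers from the Entity-Body.
    return head.encode("ascii") + b"\r\n" + body

print(build_response(b"<html></html>").decode("ascii"))
```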

Authentication

If a server wishes the client to authenticate its request, it does so by first rejecting the request with a "401" message. As part of this rejection, it should indicate in the "WWW-Authenticate" field the authorisation "realm", so that the client can determine whether it possesses an authorisation for that realm. The client can then try again, this time including a user-id and password.

This is not a very secure scheme. All the HTTP messages are sent in plain text format. The user-id and password are not encrypted in any way.
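To see how little protection this gives, the Python sketch below assumes the scheme in question is the "Basic" scheme from the HTTP specification, in which the user-id and password are joined with a colon and base64-encoded. The function names and the sample credentials are invented for the illustration.

```python
import base64

def basic_credentials(user_id, password):
    """Build the Authorization header a client sends on its second attempt."""
    token = base64.b64encode(("%s:%s" % (user_id, password)).encode("utf-8"))
    return "Authorization: Basic " + token.decode("ascii")

def recover_credentials(header):
    """What any eavesdropper can do: base64 is an encoding, not encryption."""
    token = header.split(" ")[-1]
    user_id, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user_id, password

header = basic_credentials("Aladdin", "open sesame")
print(header)
print(recover_credentials(header))
```

Anyone who can read the request can run the second function, which is why the scheme is only reasonable over a channel that is protected by other means.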

HTTP 1.1

HTTP 1.1 fixes many problems with HTTP 1.0, but is more complex as a result. Its additions include:

hostname identification (allows virtual hosts)
content negotiation (multiple languages)
persistent connections (reduces TCP overheads - this is very messy)
chunked transfers
byte ranges (request parts of documents)
proxy support
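Of these features, chunked transfers are the easiest to show in a few lines. With a persistent connection the end of a document can no longer be signalled by closing the connection, so the body may be sent as a series of chunks, each preceded by its size in hexadecimal, and terminated by a zero-size chunk. The Python sketch below is a simplified encoder and decoder. The function names are invented, chunk extensions are skipped and trailing headers are not handled.

```python
def encode_chunked(pieces):
    """Encode an iterable of byte strings as a chunked message body."""
    out = b""
    for piece in pieces:
        if piece:  # a zero-length chunk would end the body early
            out += b"%x\r\n" % len(piece) + piece + b"\r\n"
    return out + b"0\r\n\r\n"  # last chunk, then the empty trailer

def decode_chunked(data):
    """Reassemble the body from chunked data, stopping at the zero chunk."""
    body = b""
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        size = int(data[pos:eol].split(b";")[0], 16)  # ignore extensions
        if size == 0:
            return body
        start = eol + 2
        body += data[start:start + size]
        pos = start + size + 2  # skip the CRLF that follows the chunk data

wire = encode_chunked([b"Hello, ", b"world"])
print(wire)
print(decode_chunked(wire))
```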

Fatter clients and servers

CGI scripts run on the server side and provide an indefinite amount of server-side processing.
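A CGI script is simply a program that the server runs for a request: the server passes request details in environment variables, and the script writes headers, an empty line and then the document to standard output. The Python sketch below is a minimal example under those conventions. The query parameter "name" and the function name cgi_response are invented for the illustration.

```python
import os
import sys
from urllib.parse import parse_qs

def cgi_response(environ):
    """Produce the CGI output for a request described by environ."""
    # QUERY_STRING holds whatever followed the "?" in the requested URL.
    query = parse_qs(environ.get("QUERY_STRING", ""))
    name = query.get("name", ["world"])[0]
    body = "Hello, %s\n" % name
    # Headers first, then an empty line, then the generated document.
    return "Content-Type: text/plain\r\n\r\n" + body

if __name__ == "__main__":
    sys.stdout.write(cgi_response(os.environ))
```

The server turns this output into the full HTTP response, adding the Status-Line itself.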

Helpers handle documents on the client side that the browser cannot. The browser does so by calling another process and passing the document to it. There is little communication between browser and helper.

Plugins also handle documents that the browser cannot. However, plugins run within the browser address space as DLLs.

JavaScript and VBScript are run by interpreters within the browser. Typically they are used for field validation.

Java applets are run by an interpreter within the browser. They can accomplish far more than JavaScript or VBScript.

ActiveX controls are DLLs that run within the browser address space. They are built from native code and can do anything.

CGI scripts fatten the server. JavaScript and VBScript really add to the presentation layer. Java and ActiveX can carry application as well as presentation logic.

This page is maintained by Jan Newmarch http://jan.newmarch.name