IP

General

Introduction

There are many kinds of networks in the world. These range from the very old, such as serial links, through wide area networks made from copper and fibre, to wireless networks of various kinds, both for computers and for telecommunications devices such as phones. These networks obviously differ at the physical link layer, but in many cases they also differ at higher layers of the OSI stack.

Over the years there has been a convergence to the "internet stack" of IP and TCP/UDP. For example, Bluetooth defines physical layers and protocol layers, but on top of that is an IP stack so that the same internet programming techniques can be employed on many Bluetooth devices. Similarly, developing 4G wireless phone technologies such as LTE (Long Term Evolution) will also use an IP stack.

While IP provides the networking layer 3 of the OSI stack, TCP and UDP deal with layer 4. These are not the final word, even in the internet world: SCTP has come from the telecommunications world to challenge both TCP and UDP, while providing internet services in interplanetary space requires new protocols that are still under development, such as DTN. Nevertheless, IP, TCP and UDP hold sway as the principal networking technologies now and for at least a considerable time into the future. Each language has some measure of support for this style of programming.

This chapter discusses the IP layer as this is fundamental to all IP networking programs.

The TCP/IP stack

The OSI model was devised using a committee process wherein the standard was set up and then implemented. Some parts of the OSI standard are obscure, some parts cannot easily be implemented, some parts have not been implemented.

The TCP/IP protocol suite was devised through a long-running DARPA project. This worked by implementation followed by RFCs (Requests for Comments). TCP/IP, which stands for Transmission Control Protocol/Internet Protocol, is the principal Unix networking protocol.

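The TCP/IP stack is shorter than the OSI one: it is usually described with four layers (link, internet, transport and application) rather than the seven of OSI.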

TCP is a connection-oriented protocol, UDP (User Datagram Protocol) is a connectionless protocol.

IP datagrams

The IP layer provides a connectionless and unreliable delivery system. It considers each datagram independently of the others. Any association between datagrams must be supplied by the higher layers.

The IP layer supplies a checksum that covers its own header (in IPv4 the checksum protects the header only, not the data that follows it). The header includes the source and destination addresses.
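As an illustration of the header checksum, here is a minimal Python sketch of the 16-bit ones' complement sum used by IPv4. The function name internet_checksum and the sample header bytes are illustrative assumptions, not taken from the text; the checksum field (bytes 10 and 11 of the header) is set to zero before the sum is computed.

```python
import struct

def internet_checksum(header: bytes) -> int:
    """Return the 16-bit ones' complement checksum of the given bytes."""
    if len(header) % 2:
        header += b"\x00"  # pad to a whole number of 16-bit words
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    # Fold any carries above 16 bits back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# A sample 20-byte IPv4 header (illustrative) with the checksum field zeroed.
sample_header = bytes.fromhex("45000073000040004011" "0000" "c0a80001c0a800c7")
checksum = internet_checksum(sample_header)
print(f"{checksum:#06x}")  # → 0xb861
```

A receiver can run the same sum over the header with the checksum filled in; a result of zero suggests the header arrived intact.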

The IP layer handles routing through an Internet. It is also responsible for breaking up large datagrams into smaller ones for transmission and reassembling them at the other end.
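The splitting and reassembly can be sketched roughly as below. This is a simplified model rather than a real IP implementation: the 1500-byte link limit, the 20-byte header and the convention that fragment offsets count 8-byte units are IPv4 details assumed here, and the function names are hypothetical.

```python
def fragment(payload: bytes, mtu: int = 1500, header_len: int = 20):
    """Split payload into (offset_in_8_byte_units, more_fragments, chunk) tuples."""
    # Every fragment except the last must carry a multiple of 8 data bytes.
    max_data = (mtu - header_len) // 8 * 8
    fragments = []
    for start in range(0, len(payload), max_data):
        chunk = payload[start:start + max_data]
        more = start + max_data < len(payload)
        fragments.append((start // 8, more, chunk))
    return fragments

def reassemble(fragments):
    """Rebuild the payload from fragments that may arrive in any order."""
    ordered = sorted(fragments, key=lambda f: f[0])
    return b"".join(chunk for _, _, chunk in ordered)

payload = bytes(4000)
parts = fragment(payload)
print([(offset, more, len(chunk)) for offset, more, chunk in parts])
# → [(0, True, 1480), (185, True, 1480), (370, False, 1040)]
```

Because each datagram is treated independently, the fragments may arrive out of order, which is why the sketch sorts by offset before joining.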

Internet addresses

In order to use a service you must be able to find it. The Internet uses an address scheme for devices such as computers so that they can be located. This addressing scheme was originally devised when there were only a handful of connected computers, and very generously allowed up to 2^32 addresses, using a 32-bit unsigned integer. These are the so-called IPv4 addresses. In recent years, the number of connected (or at least directly addressable) devices has threatened to exceed this number, and so "any day now" we will switch to IPv6 addressing, which will allow up to 2^128 addresses, using an unsigned 128-bit integer. The changeover is most likely to be forced by emerging countries, as the developed world has already taken nearly all of the pool of IPv4 addresses.

IPv4 addresses

An IPv4 address is a 32-bit integer. It addresses down to a single network interface card on a device. The address is usually written as four bytes in decimal with a dot '.' between them, as in "127.0.0.1" or "66.102.11.104".
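The relationship between the dotted form and the underlying integer can be sketched in Python. The helper dotted_to_int is a hypothetical name used only for illustration; the standard library's ipaddress module does the same conversion and is shown alongside it.

```python
import ipaddress

def dotted_to_int(dotted: str) -> int:
    """Pack the four decimal bytes of a dotted address into one 32-bit integer."""
    value = 0
    for part in dotted.split("."):
        value = (value << 8) | int(part)
    return value

print(dotted_to_int("127.0.0.1"))                # → 2130706433
print(int(ipaddress.IPv4Address("127.0.0.1")))   # same value via the standard library
print(ipaddress.IPv4Address(2130706433))         # back to dotted form: 127.0.0.1
```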

The IP address of any device is generally composed of two parts: the address of the network in which the device resides, and the address of the device within that network. Once upon a time, the split between network address and internal address was simple and was based upon the bytes used in the IP address.

In a class A network, the first byte identifies the network, while the last three identify the device. There are only 128 class A networks, owned by the very early players in the internet space such as IBM, the General Electric Company and MIT (http://www.iana.org/assignments/ipv4-address-space/ipv4-address-space.xml).
Class B networks use the first two bytes to identify the network and the last two to identify devices within the subnet. This allows up to 2^16 (65,536) devices on a subnet.
Class C networks use the first three bytes to identify the network and the last one to identify devices within that network. This allows up to 2^8 devices (actually 254, not 256, as the bottom and top addresses are reserved).
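For illustration only, the historical class of an address can be read off its first byte. The byte ranges used below (0 to 127 for class A, 128 to 191 for class B, 192 to 223 for class C) come from the old classful scheme and are not spelled out in the text above; address_class is a hypothetical helper.

```python
def address_class(dotted: str) -> str:
    """Return the historical class (A, B or C) implied by the first byte."""
    first = int(dotted.split(".")[0])
    if first < 128:
        return "A"      # first byte is the network, 128 possible networks
    if first < 192:
        return "B"      # first two bytes are the network
    if first < 224:
        return "C"      # first three bytes are the network
    return "other"      # ranges outside the three classes discussed here

for address in ("66.102.11.104", "172.16.0.1", "192.168.1.1"):
    print(address, "-> class", address_class(address))
```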

This scheme doesn't work well if you want, say, 400 computers on a network. 254 is too small, while 65,536 (-2) is too large. In binary arithmetic terms, you want about 512 (-2). This can be achieved by using a 23-bit network address and 9 bits for the device addresses. Similarly, if you want up to 1024 (-2) devices, you use a 22-bit network address and a 10-bit device address.
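The arithmetic above can be checked with a short Python sketch. The function host_bits_needed is a hypothetical helper; it finds the smallest number of device bits whose address count, less the two reserved addresses, still fits the requested number of devices.

```python
def host_bits_needed(devices: int) -> int:
    """Smallest number of device bits b such that 2**b - 2 >= devices."""
    bits = 1
    while (1 << bits) - 2 < devices:
        bits += 1
    return bits

for devices in (254, 400, 1000):
    bits = host_bits_needed(devices)
    print(devices, "devices:", bits, "device bits,", 32 - bits, "network bits")
```

For 400 devices this gives 9 device bits and a 23-bit network address, matching the figures in the text.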

Given the IP address of a device and the number of bits N used for the network address, it is relatively straightforward to extract the network address and the device address within that network. Form a "network mask", which is a 32-bit binary number with all ones in the first N places and all zeroes in the remaining ones. For example, if 16 bits are used for the network address, the mask is 11111111111111110000000000000000. It's a little inconvenient using binary, so decimal bytes are usually used. The netmask for 16-bit network addresses is 255.255.0.0, for 24-bit network addresses it is 255.255.255.0, while for 23-bit addresses it would be 255.255.254.0 and for 22-bit addresses it would be 255.255.252.0.
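A minimal Python sketch of building the mask from N follows. The names netmask and to_dotted are illustrative, not from any library.

```python
def netmask(prefix_len: int) -> int:
    """32-bit mask with ones in the first prefix_len places, zeroes elsewhere."""
    return (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF

def to_dotted(value: int) -> str:
    """Write a 32-bit integer as four decimal bytes separated by dots."""
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))

for n in (16, 24, 23, 22):
    print(n, format(netmask(n), "032b"), to_dotted(netmask(n)))
```

The four lines printed correspond to the masks 255.255.0.0, 255.255.255.0, 255.255.254.0 and 255.255.252.0 quoted above.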

Then, to find the network of a device, bit-wise AND its IP address with the network mask. The device address within the subnet is found by a bit-wise AND of the IP address with the 1's complement of the mask.
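The two AND operations can be sketched in Python as follows. The address 192.168.3.77 and the 23-bit network length are arbitrary illustrative choices, and split_address is a hypothetical helper.

```python
import ipaddress

def split_address(dotted: str, prefix_len: int):
    """Return (network address in dotted form, device number within the network)."""
    address = int(ipaddress.IPv4Address(dotted))
    mask = (0xFFFFFFFF << (32 - prefix_len)) & 0xFFFFFFFF
    network = address & mask                  # AND with the mask
    device = address & (~mask & 0xFFFFFFFF)   # AND with the mask's complement
    return str(ipaddress.IPv4Address(network)), device

print(split_address("192.168.3.77", 23))  # → ('192.168.2.0', 333)
```

With a 23-bit network address the low bit of the third byte belongs to the device part, which is why the device number (333) is larger than 255.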

IPv6 addresses

The internet has grown vastly beyond original expectations. The initially generous 32-bit addressing scheme is on the verge of running out. There are unpleasant workarounds such as NAT addressing, but eventually we will have to switch to a wider address space. IPv6 uses 128-bit addresses. Even bytes become cumbersome for expressing such addresses, so hexadecimal digits are used, in groups of 4 digits separated by a colon ':'. A typical address might be 2002:c0e8:82e7:0:0:0:c0e8:82e7.

These addresses are not easy to remember! DNS will become even more important. There are tricks for shortening some addresses: leading zeroes within a group can be dropped, and one run of consecutive all-zero groups can be replaced by '::'. For example, "localhost" is 0:0:0:0:0:0:0:1, which can be shortened to ::1.
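Python's standard ipaddress module can show both the full and the shortened forms, which makes a convenient way to check the shortening by hand. This is only a sketch using the two addresses mentioned above.

```python
import ipaddress

for text in ("0:0:0:0:0:0:0:1", "2002:c0e8:82e7:0:0:0:c0e8:82e7"):
    address = ipaddress.IPv6Address(text)
    # exploded: every group written with all 4 digits
    # compressed: leading zeroes dropped and one run of zero groups elided
    print(address.exploded, "->", address.compressed)
```

The second address shortens to 2002:c0e8:82e7::c0e8:82e7.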

Each address is divided into three components. The first is the network address used for internet routing. My ISP, for example, gives me a 56-bit network address for my home network. Beyond that, I have 8 bits in which to create subnets (56 + 8 + 64 = 128), although most homes will only have a single subnet. The last part is the device component, of 64 bits, often based on a host's MAC address, but not necessarily.
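The three components can be pulled apart with shifts and masks. The sketch below assumes the device component is always the last 64 bits, so the subnet field is whatever remains of the first 64 bits after the routing prefix (8 bits for a 56-bit prefix, 16 bits for a 48-bit one). The address is made up from the 2001:db8:: documentation range, and split_ipv6 is a hypothetical helper.

```python
import ipaddress

def split_ipv6(text: str, prefix_len: int):
    """Return (routing prefix, subnet number, device component) as integers."""
    value = int(ipaddress.IPv6Address(text))
    subnet_bits = 64 - prefix_len                 # assumed layout, see lead-in
    interface = value & ((1 << 64) - 1)           # last 64 bits
    subnet = (value >> 64) & ((1 << subnet_bits) - 1)
    prefix = value >> (128 - prefix_len)          # first prefix_len bits
    return prefix, subnet, interface

prefix, subnet, interface = split_ipv6("2001:db8:1234:5601:211:22ff:fe33:4455", 56)
print(hex(prefix), hex(subnet), hex(interface))
# → 0x20010db8123456 0x1 0x21122fffe334455
```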

IPv6 addresses can be unicast or multicast. Unicast addresses are primarily of three types:

Global: these are addresses which are unique across the internet, and are routable across the internet.
Link local: these are only routable across a single network link. They may not be unique. On any host, there may be many NICs (network interface cards), and each may be connected to hosts which (probably accidentally) have the same local address. Consequently, to know which one is intended for use by an application, the address often has to have a NIC identifier added, often after a '%' sign, e.g. fe80::c474:4605:44af:462c%eth0. They are in the range fe80::/10.
Unique local: these are only intended for routing across some "site", whatever that means. They can be allocated by site administrators, preferably using some random scheme, and are probably (but not guaranteed to be) globally unique. They are in the range fd00::/8.
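The types above can be told apart by testing which range an address falls in. The sketch below reads the two ranges as the prefixes fe80::/10 and fd00::/8, and splits off any NIC identifier after the '%' sign by hand; classify is a hypothetical helper, and anything not matched is only labelled as possibly global rather than checked against the global allocations.

```python
import ipaddress

LINK_LOCAL = ipaddress.ip_network("fe80::/10")
UNIQUE_LOCAL = ipaddress.ip_network("fd00::/8")

def classify(text: str):
    """Return (kind of address, NIC identifier or None)."""
    address_text, _, zone = text.partition("%")
    address = ipaddress.IPv6Address(address_text)
    if address in LINK_LOCAL:
        kind = "link local"
    elif address in UNIQUE_LOCAL:
        kind = "unique local"
    elif address.is_multicast:
        kind = "multicast"
    else:
        kind = "other (possibly global)"
    return kind, zone or None

for text in ("fe80::c474:4605:44af:462c%eth0", "fd12:3456:789a::1", "ff02::1"):
    print(text, "->", classify(text))
```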

A further type (site local) has been deprecated and should no longer be used (see RFC 3879, "Deprecating Site Local Addresses").

Domain name service

For users, working with IP addresses is too difficult. Consequently, most hosts are given a host name such as www.google.com. These names are much easier for users to work with. However, the names must be resolved to IP addresses for most network functions. The resolver may use a list of hard-coded name-address pairs, but it is much more common to use the Domain Name System (DNS). This is a highly distributed service that maps names to IP addresses, and sometimes IP addresses back to names.
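As a small taste of name resolution, the following Python sketch asks the system resolver for the addresses of a name using the standard socket.getaddrinfo call. The helper resolve is a hypothetical wrapper, and the results depend entirely on the machine and network it runs on, so no output is shown.

```python
import socket

def resolve(host: str):
    """Return the sorted, de-duplicated addresses the system resolver gives for host."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})

for name in ("localhost", "127.0.0.1"):
    try:
        print(name, "->", resolve(name))
    except socket.gaierror as error:
        print(name, "-> lookup failed:", error)
```

A name such as www.google.com can be passed in the same way when a network connection is available; it will typically return both IPv4 and IPv6 addresses.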

We won't go into any of the details of DNS, but most of the rest of this chapter is concerned with how each language uses DNS services to get IP addresses from host names.

Internationalized domain names

The world no longer accepts ASCII as the 'only' text encoding. An increasing number of organisations prefer to work in their own language such as Greek, Arabic, Thai, etc. IDN (Internationalized Domain Names) allows host names to be written in any language. However, DNS itself won't accept most such names, so they have to be encoded into ASCII for lookup services to find them.

The actual domain name registered is the ASCII name, and software has to convert the language-specific name to the ASCII version. This will be discussed in the chapter on Text as it is a complex issue.
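To give a flavour of the conversion ahead of that chapter, Python ships an "idna" codec that turns a name into its ASCII form and back. This is only a sketch: the host name bücher.example is made up, the helper names are hypothetical, and the built-in codec follows an older version of the IDN rules, so it should not be taken as the full story.

```python
def to_ascii(hostname: str) -> str:
    """Encode a language-specific host name into the ASCII form used for lookup."""
    return hostname.encode("idna").decode("ascii")

def to_unicode(ascii_name: str) -> str:
    """Decode the ASCII form back into the language-specific name."""
    return ascii_name.encode("ascii").decode("idna")

print(to_ascii("bücher.example"))            # → xn--bcher-kva.example
print(to_unicode("xn--bcher-kva.example"))   # → bücher.example
```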

Copyright © Jan Newmarch, jan@newmarch.name

"Network Programming using Java, Go, Python, Rust, JavaScript and Julia" by Jan Newmarch is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://jan.newmarch.name/NetworkProgramming/.
