Making the Internet faster in 5 minutes (?)
Every day there are millions of web page requests. Each request could (and should) be simple; but, for some reason, redundant data is sent between machines, wasting bandwidth and time. The idea described here reduces that redundancy and has the potential to save bandwidth across the Internet. It would reduce the amount of data sent with HTTP requests, which may in turn reduce the number of packets that have to be sent.
Talking in Human Language
After making an HTTP request for http://infinity-infinity.com/, this is the response you get back:
HTTP/1.1 200 OK
Date: Sat, 27 Jun 2009 11:08:46 GMT
Server: Apache/2.0.54
X-Powered-By: PHP/4.4.8
X-Pingback: http://infinity-infinity.com/xmlrpc.php
Vary: Accept-Encoding
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
Following this, you get some data: the “thing” you requested. I am concerned with the above. These headers are not the content; they describe the content. They actually make sense to ordinary people: even if you have no experience with this, you can deduce some of the meaning. Date: tells us a date, Server: describes the server and Connection: close literally means “Close the connection”.
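To make the structure of that response concrete, here is a minimal Python sketch that splits the header block above into a status line and name/value pairs. The name parse_response_head is purely illustrative, and the parser deliberately ignores edge cases such as repeated headers.

```python
RAW_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Sat, 27 Jun 2009 11:08:46 GMT\r\n"
    "Server: Apache/2.0.54\r\n"
    "X-Powered-By: PHP/4.4.8\r\n"
    "X-Pingback: http://infinity-infinity.com/xmlrpc.php\r\n"
    "Vary: Accept-Encoding\r\n"
    "Connection: close\r\n"
    "Transfer-Encoding: chunked\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "\r\n"
)


def parse_response_head(raw):
    """Return (status_code, reason, headers) for a text HTTP response head."""
    # Everything before the first blank line is the head; the body follows it.
    head, _, _ = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        # Split on the first colon only, so URLs in values stay intact.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers


status, reason, headers = parse_response_head(RAW_HEAD)
print(status, reason)
for name, value in headers.items():
    print(f"{name} = {value}")
```

Every header is just a name, a colon and a value on its own line, which is what makes the format easy for a person to read.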
Talking in Machine Language
So, where is the problem? Well, all of the above text is literally sent to your browser, which then has to decode this “human language” and act on it. So, why do machines need to talk in human language?
They don’t. This is where my idea comes in. Rather than talking in human language, why not talk in “machine language”? In reality, it is more efficient to talk using machine language because we don’t need to send as much data. If we don’t need to send as much data, we don’t need to send as many packets, and the Internet becomes (a little) faster.
In fact, every single header in the above response could be shortened to save some space.
The Date header was previously (37 bytes, counting the trailing \r\n):
Date: Sat, 27 Jun 2009 11:08:46 GMT
could be reduced to this (5 bytes):
D####
Where #### represents the 4 bytes for a Unix timestamp. Since it always contains 4 bytes, there is no need for a separator between headers (\r\n).
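As a rough illustration of the Date example, the Python sketch below packs the header into a one-byte tag plus a 4-byte timestamp and unpacks it again. The function names encode_date and decode_date are hypothetical, and the choice of an unsigned big-endian integer is an assumption; the text only says "4 bytes". An unsigned 32-bit count of seconds lasts until the year 2106.

```python
import calendar
import struct
from email.utils import formatdate, parsedate


def encode_date(value):
    """Pack an HTTP date string as b'D' plus a 4-byte big-endian Unix timestamp."""
    # parsedate returns a time tuple; HTTP dates are always expressed in GMT.
    timestamp = calendar.timegm(parsedate(value))
    return b"D" + struct.pack(">I", timestamp)


def decode_date(blob):
    """Turn the 5-byte compact form back into the usual HTTP date string."""
    (timestamp,) = struct.unpack(">I", blob[1:5])
    return formatdate(timestamp, usegmt=True)


original = "Sat, 27 Jun 2009 11:08:46 GMT"
compact = encode_date(original)
print(len(f"Date: {original}\r\n"), "bytes as text")
print(len(compact), "bytes in compact form")
print(decode_date(compact))
```

Because the compact form can be expanded back into the original string, no information is lost; only the human-readable spelling is removed from the wire.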
The server header (although pretty useless) was previously (23 bytes):
Server: Apache/2.0.54
could be reduced to this (16 bytes):
SApache/2.0.54
Because this header is not fixed length, a separator (\r\n) is required. Note that there is no way to reduce the value “Apache/2.0.54” without losing its meaning. The server could simply choose not to send this header (saving all 23 bytes).
“X-Powered-By” and “X-Pingback” are custom headers, and hence out of HTTP’s scope. They could, however, be significantly reduced.
Vary: Accept-Encoding
(23 bytes)
could become this (2 bytes):
VA
Note, the value has been changed here. This is a common header value and so its value (A) could easily be synonymised with “Accept-Encoding”. A header separator is not required because this constant value never has additional data.
Connection: close
(19 bytes)
could become (2 bytes):
CC
Transfer-Encoding: chunked
(28 bytes)
becomes (2 bytes)
Tc
Content-Type: text/html; charset=UTF-8
(40 bytes)
becomes (3 bytes):
N12
Where N represents “Content-Type”, 1 the byte constant for “text/html” and 2 the byte constant for “UTF-8”.
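The fixed-value headers above amount to a lookup table. Here is a minimal Python sketch of that idea; the table COMPACT_CODES and the functions are hypothetical, the codes are simply the ones used in the examples, and the constants 1 and 2 are written as ASCII characters for readability (raw byte values would take the same space). Anything not in the table falls back to the ordinary text line.

```python
# Hypothetical code table, taken from the examples in the text.
COMPACT_CODES = {
    ("Vary", "Accept-Encoding"): b"VA",
    ("Connection", "close"): b"CC",
    ("Transfer-Encoding", "chunked"): b"Tc",
    ("Content-Type", "text/html; charset=UTF-8"): b"N12",
}


def text_line(name, value):
    """The header as it is sent today, including the trailing CRLF."""
    return f"{name}: {value}\r\n".encode("ascii")


def compact_line(name, value):
    """The compact code if one is defined, otherwise the unchanged text line."""
    return COMPACT_CODES.get((name, value), text_line(name, value))


for name, value in COMPACT_CODES:
    before = len(text_line(name, value))
    after = len(compact_line(name, value))
    print(f"{name}: {before} bytes -> {after} bytes")
```

The fallback matters for custom headers such as X-Pingback, which have no agreed code and would still have to travel as text.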
The New Response
If this were reality, the new response would become:
HTTP/1.1 200 OK
D####SApache/2.0.54
X-Powered-By: PHP/4.4.8
X-Pingback: http://infinity-infinity.com/xmlrpc.php
VACCTcN12
This requires just 127 bytes, compared to the 265 bytes in the ordinary (today’s) response. If we got rid of the Server, X-Powered-By and X-Pingback headers, it would be 33 bytes against 164 bytes, saving almost 80%! And it is possible, so why aren’t we making HTTP and other commonly used protocols highly optimised? Your guess is as good as mine.
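For anyone who wants to check the arithmetic, this short Python sketch recomputes the size of today's response from its lines (each counted with its \r\n) and derives the percentage savings. The compact sizes of 127 and 33 bytes are taken as quoted in the text rather than recomputed, and the variable names are illustrative.

```python
STATUS_LINE = "HTTP/1.1 200 OK"
HEADERS = [
    "Date: Sat, 27 Jun 2009 11:08:46 GMT",
    "Server: Apache/2.0.54",
    "X-Powered-By: PHP/4.4.8",
    "X-Pingback: http://infinity-infinity.com/xmlrpc.php",
    "Vary: Accept-Encoding",
    "Connection: close",
    "Transfer-Encoding: chunked",
    "Content-Type: text/html; charset=UTF-8",
]
OPTIONAL = ("Server:", "X-Powered-By:", "X-Pingback:")


def size(lines):
    """Total bytes on the wire, counting a CRLF after every line."""
    return sum(len(line) + 2 for line in lines)


full = size([STATUS_LINE] + HEADERS)
trimmed = size([STATUS_LINE] + [h for h in HEADERS if not h.startswith(OPTIONAL)])

# Compact sizes as quoted in the text (assumed, not recomputed here).
COMPACT_FULL, COMPACT_TRIMMED = 127, 33

saving_full = 1 - COMPACT_FULL / full
saving_trimmed = 1 - COMPACT_TRIMMED / trimmed
print(f"all headers:     {full} -> {COMPACT_FULL} bytes ({saving_full:.1%} saved)")
print(f"without extras:  {trimmed} -> {COMPACT_TRIMMED} bytes ({saving_trimmed:.1%} saved)")
```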
If anybody would like to do some statistics to estimate how much bandwidth could be saved, it would be great to know…

Comments
I’m very interested in how many packets and how much data this would save.
The easiest way to see that would be to sniff the traffic for, let’s say, a few days, save all the HTTP requests and see how much data there is, then compress the data in your way and compare the two.
I’m very new to programming and such stuff, but I think I can manage to do it with Wireshark, maybe.
I don’t know how I can compress/exchange all the data, though.
If I do succeed in doing it, I will post my results.
>why aren’t we making HTTP and other commonly used protocols highly optimised?
Because it isn’t worth it. Yes, you could save a few bytes in each header, but that doesn’t make the Internet noticeably faster.
Using human-readable text protocols simplifies development and debugging, and the modest bloat in data transfer size is barely noticeable on anything but the slowest connections.
Improvements like content caching (Last-Modified, ETags, etc.), content compression, minification, CSS sprites, CDNs, and basically everything that YSlow tells you to do can make a huge difference in the end-user performance of web sites. Compared to those things, cutting a few dozen bytes out of the headers is a waste of time.
Consider just the issue of latency. A ping from my house in San Francisco to a server in Dallas and back takes 50 ms. In that time, on a 1.5 Mbit/s DSL line, I could download roughly 9,000 bytes (1.5 Mbit/s × 0.05 s = 75,000 bits). So changing the size of that ping packet doesn’t matter much compared to trying to cut down on the number of round trips to the server.
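The latency argument can be put into numbers with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the raw line rate and ignores protocol overhead, and the 265 and 127 byte header sizes are the figures from the post. At the raw rate, one 50 ms round trip is worth 9,375 bytes of transfer, the same order of magnitude as the estimate above.

```python
LINE_RATE_BPS = 1_500_000   # 1.5 Mbit/s DSL line, as in the comment
ROUND_TRIP_MS = 50          # San Francisco to Dallas and back


def bytes_per_round_trip(rate_bps, rtt_ms):
    """Bytes that could be transferred while waiting for one round trip."""
    return rate_bps * rtt_ms / 1000 / 8


def transfer_time_ms(num_bytes, rate_bps):
    """Time needed to push num_bytes through the line, in milliseconds."""
    return num_bytes * 8 / rate_bps * 1000


in_flight = bytes_per_round_trip(LINE_RATE_BPS, ROUND_TRIP_MS)
saved_bytes = 265 - 127     # header sizes quoted in the post
saved_ms = transfer_time_ms(saved_bytes, LINE_RATE_BPS)

print(f"one round trip is worth {in_flight:.0f} bytes of transfer")
print(f"shrinking the headers saves {saved_bytes} bytes, about {saved_ms:.2f} ms")
```

Under these assumptions the header saving is worth well under a millisecond, while avoiding a single round trip is worth fifty.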
And when the content is much larger, the percentage of time spent on the header is minimal compared to the time spent on the content, which can be compressed, provided that you add yet another header telling the HTTP client so.
In this paper the authors do the hard work of substituting a manually-compressed binary protocol for HTTP traffic and compare the real world results.
ftp://ftp.computer.org/MAGS/MULTIMED/mms/webEng/112272.pdf
To share what I think is a relevant quote here:
“There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
– Donald Knuth
It’s not obvious that there would be a substantial improvement by a small reduction in header length; we need a quantitative measure for RTT effects and number of packets in real-world implementations to say. But there would definitely be a substantial interoperability cost; adjusting server and client implementations, and working out a usable migration path, takes more than 5 minutes.
HTTP is such a widely used protocol that nobody would accept a flag day. To get adoption, the transition must be smooth.
In addition, one of the benefits of HTTP is that the headers aren’t set in stone: new ones can be added, and servers and clients can support custom headers, including core extensions like basic authentication.
Using single characters to designate header types basically limits the extensibility.
Your HTTP client has to somehow figure out whether the server supports this new header compression extension, which most likely means an additional round trip and therefore additional latency: the page will take longer to load.
The alternative is to break interoperability, in violation of the robustness principle.
But interoperability and latency are _everything_ in HTTP on the World Wide Web. The delay before a request is actually received is what matters, not the precise size of the request. Especially in the 21st century, the age of broadband, a few hundred extra bytes per request are almost negligible, as long as the request still fits in one frame.
The complexity avoided by having human-friendly headers may be well worth the small number of extra bytes.
As for the interoperability issues: if the web server doesn’t speak the same language as you, then you don’t get to see any web page, which is even worse than transferring more bytes. There is currently an HTTP standard. It would be essential for any improvements to ensure backwards compatibility, and I don’t see an obvious way to introduce a change like this without unacceptable costs.
In addition, the full extended header (even without the shortening) very likely fits into one packet, and has the advantage that humans can easily read and construct headers.
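The one-packet point is easy to check. The Python sketch below assumes a typical Ethernet path with a 1500-byte MTU and 20-byte IPv4 and TCP headers without options, which leaves 1460 bytes of payload per packet; real paths can differ. The header sizes are the ones quoted in the post.

```python
import math

# Assumed path: Ethernet MTU minus IPv4 and TCP headers (no options).
MTU = 1500
IP_HEADER = 20
TCP_HEADER = 20
MAX_SEGMENT = MTU - IP_HEADER - TCP_HEADER


def packets_needed(payload_bytes):
    """Number of TCP segments needed to carry payload_bytes."""
    return max(1, math.ceil(payload_bytes / MAX_SEGMENT))


for label, size in (("text headers", 265), ("compact headers", 127)):
    print(f"{label}: {size} bytes -> {packets_needed(size)} packet(s)")
```

Both forms fit in a single packet under these assumptions, so shortening the headers alone does not reduce the packet count for this response.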
Protocols that humans can enter messages in directly are easier to troubleshoot; it’s a great debugging aid to be able to simply telnet to a web server and type in an HTTP request.
The real consumer of HTTP bandwidth is the file itself; typically HTML or XML, which is a text-based language. There is a standards-based compression option to use deflate/gzip, when requested by the client.
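As a small illustration of that compression option, the Python sketch below gzips a response body only when the client's Accept-Encoding header offers it. The function name maybe_gzip and the sample page are made up for the example, and the header parsing is deliberately naive (it ignores q-values).

```python
import gzip


def maybe_gzip(body, accept_encoding):
    """Return (body, extra_headers), compressing only if the client offered gzip."""
    offered = {
        token.split(";")[0].strip().lower()
        for token in accept_encoding.split(",")
    }
    if "gzip" in offered:
        extra = {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
        return gzip.compress(body), extra
    return body, {}


# A made-up, repetitive HTML page; real pages compress less dramatically.
page = ("<html><body>" + "<p>Hello, world!</p>" * 200 + "</body></html>").encode("utf-8")

sent, extra_headers = maybe_gzip(page, "gzip, deflate")
print(len(page), "bytes uncompressed")
print(len(sent), "bytes with gzip", extra_headers)
```

Note that only the body is compressed here; the response headers, including the extra Content-Encoding line, still travel as plain text.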
However, it might be of some benefit if a new standard, some form of ‘Binary XML/HTML’, were developed. In this case, you’d replace tags like the ‘body’ tag with a 16-bit code.
Instead of ‘img’ tags referring to external files, the img object would contain a binary pointer to the image, which would exist in the same binary stream as the text (probably at the end, so lower-bandwidth text could be rendered first).
In fact, you’d no longer need start and stop tags at all, just a tag-type bit length, and then binary pointers to objects in the document.
Optimizing the size of the actual payload has benefits, because the payload is normally a larger portion of the data being transferred.
In addition, an extra round trip shouldn’t be required; the client can declare that it supports the special document type by using HTTP headers.
Thank you both for your informative, extended comments. I agree that using a binary representation of information and its meta-information would save a lot more bandwidth. However, the main point of my argument is why these standards were not developed in a binary form when they were created.
I don’t believe this would be “premature optimization”; I believe it would be sensible. If you are sending useless data between machines millions (billions?) of times per day, then it would make sense to remove the redundant information. Maybe it’s too late now, though.
Thank you for the link to that paper.
Thanks,
Brendon.
Look at HTTP compression. It does this, and more, for the content, although in HTTP/1.1 the headers themselves are still sent uncompressed. Most modern servers (Apache, for example) support this, as do most modern browsers.
The reason for the headers is so that you can issue a HEAD request to see if the page has been updated and, if not, serve it from a cache (on your machine, at the ISP, or anywhere downstream).
Why? Well, things like Google also only have to issue a HEAD request to know if it’s worth re-crawling your page. 99.9% of pages on the net are old, stale, and never change anymore. This saves an enormous amount of bandwidth and processing for them.
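A rough Python sketch of that check follows. It builds the text of a HEAD request and compares a cached Last-Modified value with the one from a fresh response; nothing is sent over the network, and the function names and the second date are invented for the example.

```python
from email.utils import parsedate_to_datetime


def build_head_request(host, path="/"):
    """The bytes a client would send to ask for headers only."""
    lines = [
        f"HEAD {path} HTTP/1.1",
        f"Host: {host}",
        "Connection: close",
        "",   # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def needs_refetch(cached_last_modified, current_last_modified):
    """True if the server's copy is newer than the cached one."""
    cached = parsedate_to_datetime(cached_last_modified)
    current = parsedate_to_datetime(current_last_modified)
    return current > cached


request = build_head_request("infinity-infinity.com")
print(request.decode("ascii"))

cached = "Sat, 27 Jun 2009 11:08:46 GMT"
print(needs_refetch(cached, cached))                           # unchanged page
print(needs_refetch(cached, "Sun, 28 Jun 2009 00:00:00 GMT"))  # hypothetical update
```

The crawler or cache pays for one small request and response, and only downloads the full page when the date has moved.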