Version 4 (modified by Sam Hocevar, 15 years ago)

bit allocation

A few notes and thoughts about compressing images to 140 characters for use on Twitter.

I first read about this "competition" here.

Bit availability

Twitter allows 140 characters per message, and any UTF-8 text is accepted.

RFC 3629 restricts UTF-8 to the formal Unicode definition, so the only legal UTF-8 code points range from U+0000 to U+10FFFF. The following code points must also be excluded:

  • The 2¹¹ high and low surrogates, used for UTF-16 encoding, which restricts the Unicode range to U+0000..U+D7FF and U+E000..U+10FFFF.
  • The 66 noncharacters.

The final size of this set is:

$(2^{20} + 2^{16}) - 2^{11} - 66 = 1111998$
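The count can be cross-checked by brute force. The following is a minimal sketch, not part of the original notes; the function names are made up for illustration. It assumes the standard definition of the 66 noncharacters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes.

```python
def is_surrogate(cp: int) -> bool:
    """High and low surrogates, reserved for UTF-16 (2**11 code points)."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def count_usable_code_points() -> int:
    """Count code points in U+0000..U+10FFFF that survive both exclusions."""
    return sum(
        1
        for cp in range(0x110000)
        if not is_surrogate(cp) and not is_noncharacter(cp)
    )


if __name__ == "__main__":
    print(count_usable_code_points())
```

The enumeration walks all 2²⁰ + 2¹⁶ code points, so it takes a moment to run, but it avoids trusting the closed-form subtraction.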

The number of bits that can be encoded using 140 such characters is computed as follows:

$n_{bits} = \mathrm{floor}\left(\dfrac{140 \log(1111998)}{\log(2)}\right) = 2811$

In theory, 2811 bits is therefore the maximum we can stuff into a Twitter message. However, many of these code points are not yet allocated or are control characters. As of Unicode 5.1 there are 100507 graphic characters, which reduces the number of encodable bits to:

$n_{bits} = \mathrm{floor}\left(\dfrac{140 \log(100507)}{\log(2)}\right) = 2326$
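Both bit counts can be reproduced with exact integer arithmetic. This is a small sketch under the assumption that every character is drawn independently from an alphabet of the given size; `encodable_bits` is a hypothetical helper name. It uses the fact that floor(log2(x)) equals `x.bit_length() - 1` for a positive integer x, which avoids floating-point rounding.

```python
def encodable_bits(n_chars: int, alphabet_size: int) -> int:
    """Largest b such that 2**b <= alphabet_size**n_chars."""
    if n_chars < 1 or alphabet_size < 2:
        raise ValueError("need at least one character and two symbols")
    # Python integers are arbitrary precision, so the power is exact.
    return (alphabet_size ** n_chars).bit_length() - 1


if __name__ == "__main__":
    print(encodable_bits(140, 1111998))  # all legal code points
    print(encodable_bits(140, 100507))   # Unicode 5.1 graphic characters
```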

We'll go on with this value of 2326 encodable bits.
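One way to actually reach that bit count is to treat the whole payload as a single large integer and write it in base N, where N is the alphabet size. The sketch below is only an illustration of that idea, not the method chosen in these notes. The names `encode_bits` and `decode_bits` are hypothetical, and the demo alphabet (the CJK ideographs U+4E00..U+9FA5) is a stand-in assumption, much smaller than the 100507 graphic characters discussed above.

```python
def encode_bits(value: int, n_chars: int, alphabet: str) -> str:
    """Write a non-negative integer as exactly n_chars base-N digits."""
    base = len(alphabet)
    if not 0 <= value < base ** n_chars:
        raise ValueError("value does not fit in the available characters")
    digits = []
    for _ in range(n_chars):
        value, digit = divmod(value, base)
        digits.append(alphabet[digit])
    # Digits were produced least significant first.
    return "".join(reversed(digits))


def decode_bits(text: str, alphabet: str) -> int:
    """Inverse of encode_bits: read base-N digits back into an integer."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    value = 0
    for ch in text:
        value = value * len(alphabet) + index[ch]
    return value


if __name__ == "__main__":
    # Stand-in alphabet (assumption): 20902 CJK ideographs.
    demo_alphabet = "".join(chr(cp) for cp in range(0x4E00, 0x9FA6))
    payload = (1 << 2000) - 1  # 2000 one-bits
    message = encode_bits(payload, 140, demo_alphabet)
    assert len(message) == 140
    assert decode_bits(message, demo_alphabet) == payload
```

Packing the payload as one integer wastes less than one bit overall, whereas assigning a whole number of bits to each character would lose the fractional part of log2(N) on every character.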

Bit allocation

A compressed image usually contains the following information:

  • The image geometry information (width and height)
  • Optional colour information (palette)
  • Elementary picture elements (encoded as pixels, triangles, vectors...)

Given the amount of compression we are doing, there is little point in compressing images larger than 512×512. This reduces image geometry information to 18 bits (9 bits each for width and height), leaving us with 2326 − 18 = 2308 bits to encode the image information.
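As an illustration of the 18-bit geometry header, here is a minimal sketch. It assumes that each dimension is stored as its value minus one, so that sizes from 1 to 512 fit in 9 bits; that convention, the field order and the function names are assumptions, not something these notes specify.

```python
DIM_BITS = 9              # 2**9 = 512 possible values per dimension
MAX_DIM = 1 << DIM_BITS   # 512
TOTAL_BITS = 2326         # encodable bits from the graphic-character estimate


def pack_geometry(width: int, height: int) -> int:
    """Pack width and height (1..512 each) into an 18-bit integer."""
    if not (1 <= width <= MAX_DIM and 1 <= height <= MAX_DIM):
        raise ValueError("dimensions must be between 1 and 512")
    return ((width - 1) << DIM_BITS) | (height - 1)


def unpack_geometry(packed: int) -> tuple:
    """Recover (width, height) from the 18-bit header."""
    width = (packed >> DIM_BITS) + 1
    height = (packed & (MAX_DIM - 1)) + 1
    return width, height


if __name__ == "__main__":
    header = pack_geometry(512, 384)
    print(unpack_geometry(header))
    print(TOTAL_BITS - 2 * DIM_BITS)  # bits left for the image data
```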

Whether to use a palette or to encode colour information into the picture elements has not been decided yet. We'll cover both options.

Storing a palette

To do.

Not storing a palette

To do.
