| 1 | A few notes and thoughts about compressing images to 140 characters for use on Twitter. |
| 2 | |
| 3 | I first read about this "competition" [http://www.flickr.com/photos/quasimondo/3518306770/in/set-72057594062596732/ here]. |
| 4 | |
| 5 | == Bit allocation discussion == |
| 6 | |
| 7 | Twitter allows for 140 characters in a message. UTF-8 is allowed. |
| 8 | |
| 9 | RFC 3629 restricts UTF-8 to the formal Unicode definition, so the only legal UTF-8 characters are code points in the range U+0000 to U+10FFFF. The following code points must also be excluded: |
| 10 | * The high and low surrogates (U+D800..U+DFFF), which are reserved for UTF-16 encoding. This restricts the Unicode range to U+0000..U+D7FF and U+E000..U+10FFFF. |
| 11 | * The 66 non-characters. |
| 12 | |
| 13 | The final size of this set is: |
| 14 | |
| 15 | {{{ |
| 16 | #!latex |
| 17 | $(2^{20} + 2^{16}) - 2^{11} - 66 = 1111998$ |
| 18 | }}} |
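
The count above can be cross-checked by brute force. The following is a minimal sketch; the function names are made up for illustration, and it assumes the 66 noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, as defined by the Unicode standard.

```python
def is_usable(cp):
    """Return True if code point cp is neither a surrogate nor a noncharacter."""
    # Surrogates, reserved for UTF-16 (2^11 code points).
    if 0xD800 <= cp <= 0xDFFF:
        return False
    # Contiguous block of 32 noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return False
    # Last two code points of every plane (2 * 17 = 34 noncharacters).
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return True


def count_usable_code_points():
    """Count usable code points in U+0000..U+10FFFF."""
    return sum(1 for cp in range(0x110000) if is_usable(cp))


if __name__ == "__main__":
    print(count_usable_code_points())
```

The loop visits all 2^20 + 2^16 code points, so it is slow compared to the closed-form expression, but it makes each exclusion explicit.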
| 19 | |
| 20 | The number of bits that can be encoded using 140 such characters is computed as follows: |
| 21 | |
| 22 | {{{ |
| 23 | #!latex |
| 24 | $n_{bits} = \left\lfloor \dfrac{140 \log(1111998)}{\log(2)} \right\rfloor = 2811$ |
| 25 | }}} |
| 26 | |
| 27 | In theory, 2811 bits is therefore the maximum we can stuff into a Twitter message. However, many of these code points are unassigned or are control characters. As of Unicode 5.1 there are 100507 graphic characters; restricting ourselves to those reduces the number of encodable bits to: |
| 28 | |
| 29 | {{{ |
| 30 | #!latex |
| 31 | $n_{bits} = \left\lfloor \dfrac{140 \log(100507)}{\log(2)} \right\rfloor = 2326$ |
| 32 | }}} |
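
Both capacities can be reproduced without floating-point logarithms. The sketch below is illustrative only (`capacity_bits` is a hypothetical helper name); it relies on the fact that for a positive integer x, floor(log2(x)) equals `x.bit_length() - 1`, applied to x = N^140.

```python
def capacity_bits(alphabet_size, length=140):
    """Whole bits encodable in `length` symbols drawn from `alphabet_size` choices.

    Computes floor(length * log2(alphabet_size)) exactly, using integer
    arithmetic: floor(log2(x)) == x.bit_length() - 1 for any integer x > 0.
    """
    return (alphabet_size ** length).bit_length() - 1


if __name__ == "__main__":
    print(capacity_bits(1111998))  # all legal code points
    print(capacity_bits(100507))   # Unicode 5.1 graphic characters only
```

Using exact integers avoids any doubt about rounding when the product lands close to a whole number of bits.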
| 33 | |
| 34 | We'll go on with this value of 2326 encodable bits. |
| 35 | |