Version 11 (modified by sam, 4 years ago)
A few notes and thoughts about compressing images to 140 characters for use on Twitter.
I first read about this "competition" here.
How it works
My goal is to reach a reasonable compromise between the following:
- Allow fast decompression
- Achieve decent reconstruction quality
- Work with various message lengths and character sets
- Do not waste a single bit of information
Here is a rough overview of the encoding process:
- The number of available bits is computed from desired message length and usable charset
- The source image is segmented into as many square cells as the available bits permit
- A fixed number of points (currently 2) is assigned to each cell, currently by selecting the darkest and brightest pixels in the cell
- The following is repeated until a quality condition is met:
  - A point is chosen at random
  - An operation is performed at random on this point (moving it inside its cell, changing its colour)
  - If the resulting image (see the decoding process below) is closer to the source image, the operation is kept
- The image size and the list of points are encoded in UTF-8
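The refinement loop above is a form of random-mutation hill climbing. Here is a minimal sketch of it, not the actual img2twit code: the names `mutate`, `optimise`, `render_flat` and `squared_error` are made up for illustration, points are `(x, y, grey)` tuples, there is one point per cell, a fixed step budget stands in for the quality condition, and the toy decoder simply fills each cell with its point's grey level.

```python
import random

def mutate(point, cell):
    """Apply one random operation: move the point inside its cell or change its colour."""
    x, y, grey = point
    x0, y0, size = cell
    if random.random() < 0.5:
        x = random.randrange(x0, x0 + size)
        y = random.randrange(y0, y0 + size)
    else:
        grey = min(255, max(0, grey + random.randint(-16, 16)))
    return (x, y, grey)

def optimise(points, cells, render, distance, source, max_steps=2000):
    """Keep a random change only when the rendered image gets closer to the source."""
    best = distance(render(points), source)
    for _ in range(max_steps):  # assumption: step budget instead of a quality test
        i = random.randrange(len(points))
        previous = points[i]
        points[i] = mutate(previous, cells[i])
        score = distance(render(points), source)
        if score < best:
            best = score          # improvement: keep the operation
        else:
            points[i] = previous  # no improvement: undo it
    return best

def render_flat(points, cells, width, height):
    """Toy decoder (assumption): paint each cell with its point's grey level."""
    image = [[0] * width for _ in range(height)]
    for (_, _, grey), (x0, y0, size) in zip(points, cells):
        for y in range(y0, y0 + size):
            for x in range(x0, x0 + size):
                image[y][x] = grey
    return image

def squared_error(a, b):
    """Sum of squared per-pixel differences between two images."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

random.seed(1)
cells = [(0, 0, 2), (2, 0, 2), (0, 2, 2), (2, 2, 2)]
source = [[10, 10, 200, 200], [10, 10, 200, 200],
          [90, 90, 40, 40], [90, 90, 40, 40]]
points = [(x0, y0, 128) for x0, y0, _ in cells]
start = squared_error(render_flat(points, cells, 4, 4), source)
final = optimise(points, cells, lambda p: render_flat(p, cells, 4, 4),
                 squared_error, source)
print(start, final)
```

Because a change is only kept when it lowers the error, the final error can never exceed the starting one. With this toy decoder, moving a point has no visible effect, so only colour changes survive. A decoder that interpolates between points would make position matter as well.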
And this is the decoding process:
- The image size and points are read from the UTF-8 stream
- For each pixel in the destination image:
  - The list of natural neighbours is computed
  - The pixel's final colour is set as a weighted average of its natural neighbours' colours
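To illustrate the weighted-average step, here is a small sketch that deliberately swaps in a simpler technique. Instead of true natural neighbours, which need a Voronoi diagram, it takes the `k` nearest points and weights them by inverse squared distance. The function name `decode`, the `(x, y, grey)` point layout and the parameter `k` are assumptions made for this example only.

```python
def decode(width, height, points, k=4):
    """Rebuild a greyscale image; each pixel is a weighted mean of nearby points.

    Assumption: the k nearest points with inverse-squared-distance weights
    stand in for natural neighbours and their weights.
    """
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            nearest = sorted(
                points, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2
            )[:k]
            total = weight_sum = 0.0
            for px, py, grey in nearest:
                d2 = (px - x) ** 2 + (py - y) ** 2
                if d2 == 0:
                    # The pixel sits exactly on a point: use its colour as is.
                    total, weight_sum = float(grey), 1.0
                    break
                w = 1.0 / d2
                total += w * grey
                weight_sum += w
            row.append(round(total / weight_sum))
        image.append(row)
    return image

print(decode(4, 1, [(0, 0, 0), (3, 0, 255)]))
```

Sorting all points for every pixel is slow but keeps the sketch short. A real decoder aiming at fast decompression would use a spatial structure such as the cell grid itself.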
UTF-8 is restricted to the formal Unicode definition by RFC 3629, meaning that once the 2¹¹ high and low surrogates and the 66 non-characters are removed from the U+0000..U+10FFFF range, the final size of the UTF-8 character set is 1111998. However, a lot of these characters are undefined, not yet allocated or are control characters. As of Unicode 5.1 there are only 100507 graphic characters.
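The 1111998 figure can be checked with a few lines of Python. This sketch assumes the usual definition of the 66 non-characters: U+FDD0..U+FDEF plus the last two code points of each of the 17 planes. The helper name `is_noncharacter` is made up for the example.

```python
CODE_SPACE = 0x110000          # U+0000..U+10FFFF
SURROGATES = 0xE000 - 0xD800   # U+D800..U+DFFF, i.e. 2**11 code points

def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

NONCHARACTERS = sum(is_noncharacter(cp) for cp in range(CODE_SPACE))
usable = CODE_SPACE - SURROGATES - NONCHARACTERS
print(SURROGATES, NONCHARACTERS, usable)
```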
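The number of bits that can be expressed in a 140-character message using this charset is 140 × log2(100507) ≈ 2326.4, that is 2326 whole bits.
If we restrict ourselves to the 20902 characters available in the CJK Unified Ideographs block, the number of bits becomes 140 × log2(20902) ≈ 2009.2, that is 2009 whole bits.
And finally, using the 94 non-spacing, printable ASCII characters, the number of bits is 140 × log2(94) ≈ 917.6, that is 917 whole bits.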
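The bit budget for each charset is the base-2 logarithm of the number of distinct messages, that is the charset size raised to the message length. The sketch below computes it exactly with Python integers. The function name `available_bits` is an assumption for this example.

```python
import math

def available_bits(message_length: int, charset_size: int) -> int:
    """Whole bits that fit in message_length symbols from a charset of charset_size."""
    # Largest b such that 2**b <= charset_size**message_length, computed exactly.
    return (charset_size ** message_length).bit_length() - 1

charsets = [
    ("Unicode 5.1 graphic characters", 100507),
    ("CJK Unified Ideographs", 20902),
    ("printable ASCII", 94),
]
for name, size in charsets:
    print(name, round(140 * math.log2(size), 2), available_bits(140, size))
```

The fractional part matters for the "do not waste a single bit" goal. Treating the whole message as one large number in base 20902 keeps the fraction of a bit that would be lost if each character were limited to 14 whole bits (140 × 14 = 1960 bits instead of 2009).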
Here are the results of img2twit using 140 characters, restricted to U+4e00..U+9fa5 (CJK Unified Ideographs). The 一一一一 characters at the end of some lines indicate wasted bits that the algorithm is unable to use efficiently yet.
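One plausible reading of the trailing 一 characters: 一 is U+4E00, the first character of the range, so it plays the role of the digit zero when the message is treated as a number in base 20902. The sketch below packs an integer into such a string, least significant digit first, so unused high-order capacity shows up as 一 at the end. This is an illustration under that assumption, not the actual img2twit bitstream format. The names `pack` and `unpack` are hypothetical.

```python
FIRST, LAST = 0x4E00, 0x9FA5   # CJK Unified Ideographs range used above
BASE = LAST - FIRST + 1        # 20902 symbols

def pack(value: int, length: int = 140) -> str:
    """Write value in base BASE, least significant digit first, one character per digit."""
    if not 0 <= value < BASE ** length:
        raise ValueError("value does not fit in the message")
    chars = []
    for _ in range(length):
        value, digit = divmod(value, BASE)
        chars.append(chr(FIRST + digit))
    return "".join(chars)

def unpack(text: str) -> int:
    """Inverse of pack: rebuild the integer from its base-BASE digits."""
    value = 0
    for ch in reversed(text):
        value = value * BASE + (ord(ch) - FIRST)
    return value

message = pack(12345678901234567890)
print(len(message), message[:8], unpack(message))
```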
- lena_std_scaled.png (115.7 KB) - added by sam 4 years ago.
- Mona_Lisa_scaled.jpg (29.4 KB) - added by sam 4 years ago.
- so-logo.png (9.6 KB) - added by sam 4 years ago.
- mandrill_scaled.jpg (12.2 KB) - added by sam 4 years ago.
- Cornell_box_scaled.png (45.6 KB) - added by sam 4 years ago.
- twitter3.png (71.5 KB) - added by sam 4 years ago.
- twitter5.png (78.8 KB) - added by sam 4 years ago.
- minimona.jpg (536 bytes) - added by sam 4 years ago.
- minimona2.png (15.0 KB) - added by sam 4 years ago.
- twitter1.png (79.9 KB) - added by sam 4 years ago.
- twitter2.png (103.5 KB) - added by sam 4 years ago.
- twitter4.png (20.0 KB) - added by sam 4 years ago.