Version 11 (modified by Sam Hocevar, 16 years ago) (diff)

--

A few notes and thoughts about compressing images to 140 characters for use on Twitter.

The first I read about this "competition" was here.

How it works

My goal is to reach a reasonable compromise between the following:

  • Allow fast decompression
  • Achieve decent reconstruction quality
  • Work with various message length and character sets
  • Do not waste a single bit of information

Here is a rough overview of the encoding process:

  • The number of available bits is computed from desired message length and usable charset
  • The source image is segmented into as many square cells as the available bits permit
  • A fixed number of points (currently 2) is affected to each cell, currently by selecting the darkest and brightest pixels in the cell
  • The following is repeated until a quality condition is met:
    • A point is chosen a random
    • An operation is performed at random on this point (moving it inside its cell, changing its colour)
    • If the resulting image (see the decoding process below) is closer to the source image, the operation is kept
  • The image size and list of points is encoded in UTF-8

And this is the decoding process:

  • The image size and points are read from the UTF-8 stream
  • For each pixel in the destination image:
    • The list of natural neigbours is computed
    • The pixel's final colour is set as a weighted average of its natural neighbours' colours

Bit allocation

UTF-8 is restricted to the formal Unicode definition by RFC 3629, meaning that once the 2¹¹ high and low surrogates and the 66 non-characters are removed from the U+0000..U+10FFFF range, the final size of the UTF-8 character set is 1111998. However, a lot of these characters are undefined, not yet allocated or are control characters. As of Unicode 5.1 there are only 100507 graphic characters.

The number of bits that can be expressed in a 140-character message using this charset is:

$n_{bits} = \dfrac{140 \log(100507)}{\log(2)} = 2326.37$

If we restrict ourselves to the 20902 characters available in the CJK Unified Ideographs block, the number of bits becomes:

$n_{bits} = \dfrac{140 \log(20902)}{\log(2)} = 2009.18$

And finally, using the 94 non-spacing, printable ASCII characters:

$n_{bits} = \dfrac{140 \log(94)}{\log(2)} = 917.64$

Optimised bitstream

Todo

Preliminary results

Here are the results of img2twit using 140 characters, restricted to U+4e00..U+9fa5 (CJK Unified Ideographs). The 一一一一 characters at the end of some lines indicate wasted bits that the algorithm is unable to use efficiently yet.

輎污涧噊訞巚戴邨姎士踤倭餜洈塉留宒督虞韀澓觀腆趝禄南栥註谎蝲啎狍麃砘焼謩熁迣菝峰嶺綇檂挀黭朿泊攻确碌埬萚鉄毉瘣璚鯱幩冠恈欘肸熈璶礴瘸蚉绋駡碮挺馮譵膀峞黦墮蠅嚓铙觞睡盧孱鳜载襳迠廅榣缹興戰髅垨衦蒺昺醦颥八圪進桊絞螐嗢盉隬岠慷鎃尜淽阨塶沛顭僚計鯬賥占牓吙硘鷸騢挓磪捤鵘坰硖肕萘饼皹侯滗

姲椟筃偡荛璻琛隞夅镬磤湋亹璠熗凋煐攪泲僶壴鋇廷砪臗旝鳑禛渂澣贱涜慘齋芐梡楉迤椻姫閴飙苟稞痡揦麲笉申檫窬偩掛炘忧阙膇样箠愍畄帠掭歜黵歫徯堠傌蕨鱏鷥軠慰糐掭辢猆孏錦戹濸巉魽嚲腫就恩沽厲測沎婲舁铧蠃犱闆醛焊茴鋈叶狝痺矹铓疿镭緄熐魆郗忤櫯嫟韥烥彸漻藉醺夝趼惻炘訶焝汒蝧潚诪躗丌一一一一一一

No image "Cornell_box.png" attached to img2twit

郚辡帅轻垟比芼瘳僺輪磀蜮箁捆婊滭九輤涇玪曃擾褗眗鯮鴚瘟糱靌軏膒跬泵庠譀奘骰偂穖蠵詟沛駮胅卛唁澥込圊澟褡怍两鋅蠤振殰芝耹漹樆蜬龆绎薄核琣捷椁桡痻擏翵峺侞骯溞淎搼曇壧屳跖忄篵雸皰堫谳物渷厸则陃妔醀垻槥彾澪烃瓾鐬轧錈宓皪瀅榦睿宖陗邱鄎巊晨誫鲛蒹瓬棟层鶾牪耣騝墙麇拫途薱鱤朤慦豹怂傍殐欜一

坍嗝昫噰碒荓奊镂胆鶊灦翵歓八罡睂釬篾媹蹛赜言齊瓰谕豭璞捘帔穼柞娀癥卢栀儻湐崽洧俅蔿尘犁倴苠鹯焋朓嗸哉姟箊燐蘫矻豣底氲浧喋婏氾藽憺鄐嵔武枣歚雛胠蓸豑逿娳繜婔涙賁醙淸鍮蕙悢龡賚厴惌鎛膟缷概狙峴鋱鮬刬杂蒟驇骧袹璹導练閦饥頖筎梋炻鼑鎄薕粮軝帨雡豄瀴诨霉窐搉裷橌漣濯超瞉秘驔包颾蘱礜鼦挈蓒諹

豪弅淶鑆斳愔耐俬秱戂孤访红艶劃嬌躑擣昗呎腆猎扭僬题猛嬰頽恓劉檭橀韮闼帣赑峈鮦宝鰢斣麙蓑騰騺鹸希關鈯穨唖秴斮圫究傲駛朘铃邂巢沿譓船櫃晒峩泪蝻鵲皲販口谹鎺侒戣耔凉蠛抏槱戛蝂荄勞攞咉闏涪彃沏全偫吒溸乎洸螕慹鳩弭蚕弣寽砰薨埻铥恣噿悏镏雈壭蒬礡靑徠鼛慗泏郄渺婥俦攨賌羢髙壶耔僪爯姉蔮蠬伣豖弫

Attachments (12)

Download all attachments as: .zip