If a picture's worth 1000 words, how much of a picture can you fit in 140 characters?
Note: That's it folks! Bounty deadline is here, and after some tough deliberation, I have decided that Boojum's entry just barely edged out Sam Hocevar's. I will post more detailed notes once I've had a chance to write them up. Of course, everyone should feel free to continue to submit solutions and improve solutions for people to vote on. Thank you to everyone who submitted and entry; I enjoyed all of them. This has been a lot of fun for me to run, and I hope it's been fun for both the entrants and the spectators.
I came across this interesting post about trying to compress images into a Twitter comment, and lots of people in that thread (and a thread on Reddit) had suggestions about different ways you could do it. So, I figure it would make a good coding challenge; let people put their money where their mouth is, and show how their ideas about encoding can lead to more detail in the limited space that you have available.
I challenge you to come up with a general purpose system for encoding images into 140 character Twitter messages, and decoding them into an image again. You can use Unicode characters, so you get more than 8 bits per character. Even allowing for Unicode characters, however, you will need to compress images into a very small amount of space; this will certainly be a lossy compression, and so there will have to be subjective judgements about how good each result looks.
Here is the result that the original author, Quasimondo, got from his encoding (image is licensed under a Creative Commons Attribution-Noncommercial license):
Can you do better?
U+0000
–U+10FFFF
, excluding non-characters (U+FFFE
, U+FFFF
, U+
nFFFE
, U+
nFFFF
where n is 1
–10
hexadecimal, and the range U+FDD0
–U+FDEF
) and surrogate code points (U+D800
–U+DFFF
). It may be output in any reasonable encoding of your choice; any encoding supported by GNU iconv
will be considered reasonable, and your platform native encoding or locale encoding would likely be a good choice. See Unicode notes below for more details.For the sake of consistency in user interface, your program must behave as follows:
encode
or decode
to set the mode.Your program must take input in one or more of the following ways (if you implement the one that takes file names, you may also read and write from stdin and stdout if file names are missing):
Take input from standard in and produce output on standard out.
my-program encode <input.png >output.txt
my-program decode <output.txt >output.png
Take input from a file named in the second argument, and produce output in the file named in the third.
my-program encode input.png output.txt
my-program decode output.txt output.png
These are basically rules that may be broken, suggestions, or scoring criteria:
As a general guide to how I will be ranking solutions when choosing my accepted solution, lets say that I'll probably be evaluating solutions on a 25 point scale (this is very rough, and I won't be scoring anything directly, just using this as a basic guideline):
Some folks have asked for some reference images. Here are a few reference images that you can try; smaller versions are embedded here, they all link to larger versions of the image if you need those:
I am offering a 500 rep bounty (plus the 50 that StackOverflow kicks in) for the solution that I like the best, based on the above criteria. Of course, I encourage everyone else to vote on their favorite solutions here as well.
This contest will run until the bounty runs out, about 6 PM on Saturday, May 30. I can't say the precise time it will end; it may be anywhere from 5 to 7 PM. I will guarantee that I'll look at all entries submitted by 2 PM, and I will do my best to look at all entries submitted by 4 PM; if solutions are submitted after that, I may not have a chance to give them a fair look before I have to make my decision. Also, the earlier you submit, the more chance you will have for voting to be able to help me pick the best solution, so try and submit earlier rather than right at the deadline.
There has also been some confusion on exactly what Unicode characters are allowed. The range of possible Unicode code points is U+0000
to U+10FFFF
. There are some code points which are never valid to use as Unicode characters in any open interchange of data; these are the noncharacters and the surrogate code points. Noncharacters are defined in the Unidode Standard 5.1.0 section 16.7 as the values U+FFFE
, U+FFFF
, U+
nFFFE
, U+
nFFFF
where n is 1
–10
hexadecimal, and the range U+FDD0
–U+FDEF
. These values are intended to be used for application-specific internal usage, and conforming applications may strip these characters out of text processed by them. Surrogate code points, defined in the Unicode Standard 5.1.0 section 3.8 as U+D800
–U+DFFF
, are used for encoding characters beyond the Basic Multilingual Plane in UTF-16; thus, it is impossible to represent these code points directly in the UTF-16 encoding, and it is invalid to encode them in any other encoding. Thus, for the purpose of this contest, I will allow any program which encodes images into a sequence of no more than 140 Unicode code points from the range U+0000
–U+10FFFF
, excluding all noncharacters and surrogate pairs as defined above.
I will prefer solutions that use only assigned characters, and even better ones that use clever subsets of assigned characters or do something interesting with the character set they use. For a list of assigned characters, see the Unicode Character Database; note that some characters are listed directly, while some are listed only as the start and end of a range. Also note that surrogate code points are listed in the database, but forbidden as mentioned above. If you would like to take advantage of certain properties of characters for making the text you output more interesting, there are a variety of databases of character information available, such as a list of named code blocks and various character properties.
Since Twitter does not specify the exact character set they support, I will be lenient about solutions which do not actually work with Twitter because certain characters count extra or certain characters are stripped. It is preferred but not required that all encoded outputs should be able to be transferred unharmed via Twitter or another microblogging service such as identi.ca. I have seen some documentation stating that Twitter entity-encodes <, >, and &, and thus counts those as 4, 4, and 5 characters respectively, but I have not tested that out myself, and their JavaScript character counter doesn't seem to count them that way.
image files and python source (version 1 and 2)
Version 1 Here is my first attempt. I will update as I go.
I have got the SO logo down to 300 characters almost lossless. My technique uses conversion to SVG vector art so it works best on line art. It is actually an SVG compressor, it still requires the original art go through a vectorisation stage.
For my first attempt I used an online service for the PNG trace however there are MANY free and non-free tools that can handle this part including potrace (open-source).
Here are the results
Original SO Logo http://www.warriorhut.org/graphics/svg_to_unicode/so-logo.png Original Decoded SO Logo http://www.warriorhut.org/graphics/svg_to_unicode/so-logo-decoded.png After encoding and decoding
Characters: 300
Time: Not measured but practically instant (not including vectorisation/rasterisation steps)
The next stage will be to embed 4 symbols (SVG path points and commands) per unicode character. At the moment my python build does not have wide character support UCS4 which limits my resolution per character. I've also limited the maximum range to the lower end of the unicode reserved range 0xD800 however once I build a list of allowed characters and a filter to avoid them I can theoretically push the required number of characters as low as 70-100 for the logo above.
A limitation of this method at present is the output size is not fixed. It depends on number of vector nodes/points after vectorisation. Automating this limit will require either pixelating the image (which removes the main benefit of vectors) or repeated running the paths through a simplification stage until the desired node count is reached (which I'm currently doing manually in Inkscape).
Version 2
UPDATE: v2 is now qualified to compete. Changes:
Characters: 133
Time: A few seconds
v2 decoded http://www.warriorhut.org/graphics/svg_to_unicode/so-logo-decoded-v2.png After encoding and decoding (version 2)
As you can see there are some artifacts this time. It isn't a limitation of the method but a mistake somewhere in my conversions. The artifacts happen when the points go outside the range 0.0 - 127.0 and my attempts to constrain them have had mixed success. The solution is simply to scale the image down however I had trouble scaling the actual points rather than the artboard or group matrix and I'm too tired now to care. In short, if your points are in the supported range it generally works.
I believe the kink in the middle is due to a handle moving to the other side of a handle it's linked to. Basically the points are too close together in the first place. Running a simplify filter over the source image in advance of compressing it should fix this and shave of some unnecessary characters.
UPDATE: This method is fine for simple objects so I needed a way to simplify complex paths and reduce noise. I used Inkscape for this task. I've had some luck with grooming out unnecessary paths using Inkscape but not had time to try automating it. I've made some sample svgs using the Inkscape 'Simplify' function to reduce the number of paths.
Simplify works ok but it can be slow with this many paths.
autotrace example http://www.warriorhut.org/graphics/svg_to_unicode/autotrace_16_color_manual_reduction.png cornell box http://www.warriorhut.com/graphics/svg_to_unicode/cornell_box_simplified.png lena http://www.warriorhut.com/graphics/svg_to_unicode/lena_std_washed_autotrace.png
Here's some ultra low-res shots. These would be closer to the 140 character limit though some clever path compression may be need as well.
groomed http://www.warriorhut.org/graphics/svg_to_unicode/competition_thumbnails_groomed.png Simplified and despeckled.
trianglulated http://www.warriorhut.org/graphics/svg_to_unicode/competition_thumbnails_triangulated.png Simplified, despeckled and triangulated.
autotrace --output-format svg --output-file cornell_box.svg --despeckle-level 20 --color-count 64 cornell_box.png
ABOVE: Simplified paths using autotrace.
Unfortunately my parser doesn't handle the autotrace output so I don't know how may points are in use or how far to simplify, sadly there's little time for writing it before the deadline. It's much easier to parse than the inkscape output though.