How does the entropy of a string of English text signify low quality?

Pandincus · Feb 22, 2011

Jeff Atwood recently tweeted a link to a CodeReview post where he wanted to know if the community could improve his "calculating entropy of a string" code snippet. He explained, "We're calculating entropy of a string a few places in Stack Overflow as a signifier of low quality."

The gist of his method seems to be that the number of unique characters in a string serves as a measure of its entropy (code taken from PieterG's answer):

int uniqueCharacterCount = input.Distinct().Count(); // input is the string being checked; needs using System.Linq

I don't understand how the unique character count signifies entropy of a string, and how the entropy of a string signifies low quality. I was wondering if someone with more knowledge in this area could explain what Mr. Atwood is trying to accomplish.

Thanks!

Answer

BlueRaja - Danny Pflughoeft · Feb 22, 2011

The confusion seems to stem from the idea that this check is used to block posts from being submitted. It is not.

It is just one of several algorithms used to find possible low-quality posts, displayed on the low quality posts tab (requires 10k rep) of the moderator tools. Actual humans still need to look at the post.

The idea is to catch posts like ~~~~~~No.~~~~~~ or FUUUUUUUU------, not to catch all low-quality posts.
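
As a rough illustration of why a distinct-character count picks out posts like these, here is a minimal sketch in Python (the snippet in the question is C#). The helper name `distinct_char_count` and the comparison sentence are made up for this example; this is not Stack Overflow's actual code, and no real threshold is implied.

```python
def distinct_char_count(text: str) -> int:
    """Count the distinct characters in text (same idea as Distinct().Count())."""
    return len(set(text))


# Two junk posts from the answer, plus a made-up ordinary sentence for contrast.
samples = [
    "~~~~~~No.~~~~~~",
    "FUUUUUUUU------",
    "Have you tried restarting the service after editing the config?",
]

for s in samples:
    print(f"{distinct_char_count(s):2d}  {s}")
```

The two junk examples contain only 4 and 3 distinct characters respectively, while an ordinary sentence usually contains many more, which is what makes the count usable as a cheap flag for human review.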


As for "How does the unique character-count signify entropy?" - it doesn't, really. The most upvoted answers completely miss the point.

See https://codereview.stackexchange.com/questions/868#878 and https://codereview.stackexchange.com/questions/868#926
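
To make the difference between a distinct-character count and entropy concrete, one can compare the count with the Shannon entropy of a string's character frequencies. The following is a minimal Python sketch, assuming the standard definition H = -sum(p * log2(p)) over per-character frequencies. The name `shannon_entropy` and the two test strings are illustrative and are not taken from either linked answer.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Both strings have exactly two distinct characters,
# but their character distributions differ.
balanced = "abababab"   # 'a' and 'b' equally frequent
skewed = "aaaaaaab"     # almost entirely 'a'

print(len(set(balanced)), shannon_entropy(balanced))  # 2 1.0
print(len(set(skewed)), round(shannon_entropy(skewed), 2))  # 2 0.54
```

Both strings have the same distinct-character count, yet the skewed one has roughly half the entropy per character. The count ignores how often each character occurs, so it is at best a crude stand-in for entropy.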