_Actual_ Twitter format for hashtags? Not your regex, not his code-- the actual one?

dethSwatch · Dec 9, 2011

Update: Use Twitter's Entities if you can; they have figured this out for you, along with other items. In my case I just have the tweet text, without entities or any of the extra metadata.

I've spent what I consider an unreasonable amount of time trying to find the actual format for hashtags.

As far as my searching can tell, Twitter has not published one.

I know that many people have come up with regexes to parse them; however, your lib's regex is not my lib's regex, and maybe I don't like yours anyway.

So I'm asking: is there any actual official spec? I don't want a regex answer, I want a BNF or something similar. Or, minimally, a complete list of delimiters.

Additional difficulty points: grabbing them from arbitrary Unicode (non-English) message text is important too.

Note: I'm quite aware of entities, and they aren't applicable to my case (months of Twitter messages stored in a db).

Answer

jball · Dec 9, 2011

Starting from Twitter's support documentation, the basic rules seem to be that a hashtag must be preceded by a space and stops at any whitespace or punctuation.


Quote from Twitter's support:

Check your hashtags for the following:

  • Is there any symbol in or after the hashtag?
    • If you write #noican't, your message will be categorized under #noican. Punctuation marks ( , . ; ' ? ! etc.) will end your hashtag wherever punctuation occurs.
  • Is there any letter preceding the #symbol?
    • If you write 23#idoittoo or word#idoittoo, your Tweets will not show in searches for the hashtag #idoittoo. Hashtags will not work with letters or numbers in front of the # symbol. The # symbol must have a space directly in front of it in order for it to show correctly in searches.

Therefore, the initial token is # preceded by a space, and the terminator is any whitespace or punctuation. The "etc" in their list of punctuation (" , . ; ' ? ! etc.") is annoying, but I'll keep digging and see if I can find something authoritative on what else counts as punctuation.
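To make those two rules concrete, here is a rough Python sketch of them and nothing more. It is not Twitter's implementation, and it rests on assumptions of mine: "punctuation" is approximated as anything that is not a Unicode letter, digit, or underscore (the support page's list ends in "etc."), and a # at the very start of the text is treated as if it were preceded by a space. The names BASIC_HASHTAG and basic_hashtags are hypothetical.

```python
import re

# Assumption: a hashtag body is a run of Unicode "word" characters
# (letters, digits, underscore); anything else ends it. The support
# page never gives the full punctuation list, so this is a guess.
# Assumption: start-of-text counts the same as a preceding space.
BASIC_HASHTAG = re.compile(r"(?:^|(?<=\s))#(\w+)")


def basic_hashtags(text):
    """Return hashtag bodies (without '#') per the support-page rules only."""
    return BASIC_HASHTAG.findall(text)


print(basic_hashtags("I said #noican't"))
print(basic_hashtags("23#idoittoo or word#idoittoo"))
```

With this sketch, #noican't yields noican, and neither 23#idoittoo nor word#idoittoo yields anything, which matches the two examples on the support page. Since \w is Unicode-aware for Python 3 strings, non-English tags are picked up as well.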

After digging around a while, I found some interesting blog articles by Terence Eden (Hashtags and Implicit Knowledge, Hashtag Standards) that provide evidence that Twitter doesn't even have a standard, given that the software it develops on different platforms seems to have different rules of what constitutes a hashtag.

They also link to the Twitter Conformance Library, which has twitter / twitter-text-conformance / autolink.yml. The hashtag section in autolink.yml has many cases matching the above rules, but also some that deviate from them. Some examples:

- description: "DO NOT Autolink all-numeric hashtags"
  text: "text #1234"
  expected: "text #1234"

- description: "Autolink hashtag preceded by a period"
  text: "text.#hashtag"
  expected: "text.<a href=\"http://twitter.com/search?q=%23hashtag\" title=\"#hashtag\" class=\"tweet-url hashtag\">#hashtag</a>"

- description: "Autolink hashtag with full-width hash (U+FF03)"
  text: "＃hashtag"
  expected: "<a href=\"http://twitter.com/search?q=%23hashtag\" title=\"#hashtag\" class=\"tweet-url hashtag\">#hashtag</a>"

Those are just a few examples that don't match the basic rules given in the first support article, and unfortunately the yml is full of other examples as well.
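For illustration only, this is how the three conformance cases quoted above could be folded into a Python sketch. It covers just those three cases, not the rest of autolink.yml, and it is my own approximation rather than Twitter's code. The assumptions: the character before the hash may be anything except a letter, digit, or underscore (which permits the period case while still rejecting word#tag), and the body is a run of Unicode word characters. CANDIDATE and conformance_hashtags are hypothetical names.

```python
import re

# ASCII '#' plus the full-width number sign U+FF03.
HASH_CHARS = "#\uFF03"

# Assumption: the hash must not directly follow a letter, digit, or
# underscore. This allows "text.#hashtag" but rejects "word#hashtag".
CANDIDATE = re.compile(r"(?<!\w)[" + HASH_CHARS + r"](\w+)")


def conformance_hashtags(text):
    """Return hashtag bodies, skipping all-numeric ones such as '#1234'."""
    return [
        m.group(1)
        for m in CANDIDATE.finditer(text)
        if not m.group(1).isdigit()
    ]


for sample in ("text #1234", "text.#hashtag", "\uFF03hashtag"):
    print(repr(sample), conformance_hashtags(sample))
```

Anything beyond these three cases would need to be checked against the other entries in the yml before trusting this on stored tweets.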