I've spent what I consider an unreasonable amount of time trying to find the actual format for hashtags.
As far as my searching can tell, Twitter has not published one.
I know that many people have come up with regexes to parse them; however, your lib's regex is not my lib's regex, and maybe I don't like yours anyway.
So I'm asking: is there any actual official spec? I don't want a regex answer; I want a BNF or something similar. Or, minimally, a complete list of delimiters.
Additional difficulty points: grabbing them from arbitrary Unicode (non-English) message text is important too.
Note: I'm quite aware of entities and they aren't applicable to my case (months of Twitter messages stored in a db).
Starting from Twitter's support article, the basic rules seem to be that a hashtag must be preceded by a space and stops at any whitespace or punctuation.
Quote from Twitter's support:
Check your hashtags for the following:
Therefore, the initial token is # preceded by a space, and the terminator is any whitespace or punctuation. The "etc" in their list of punctuation (" , . ; ' ? ! etc.") is annoying, but I'll keep digging and see if I can find something authoritative on what else counts as punctuation.
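For what it's worth, here is a minimal sketch of those support-article rules written as a plain scanner rather than a regex. It rests on assumptions of mine, not anything Twitter has published: the name `naive_hashtags` is made up, "punctuation" is read as any character in a Unicode general category starting with `P`, and a `#` at the very start of the text is treated as if it were preceded by a space.

```python
import unicodedata
from typing import List


def is_terminator(ch: str) -> bool:
    """Whitespace or Unicode punctuation (general category P*).

    Assumption: this is my reading of the "etc." in the support list.
    Note that "_" is category Pc, so it also ends a tag here.
    """
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def naive_hashtags(text: str) -> List[str]:
    """Collect tags that start with '#' after whitespace (or at the start
    of the text) and run until the next whitespace or punctuation."""
    tags = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == "#" and (i == 0 or text[i - 1].isspace()):
            j = i + 1
            while j < n and not is_terminator(text[j]):
                j += 1
            if j > i + 1:          # skip a bare '#'
                tags.append(text[i + 1:j])
            i = j                  # always advances by at least one
        else:
            i += 1
    return tags
```

Under these assumptions, `naive_hashtags("привет #мир!")` returns `["мир"]`, and `naive_hashtags("text #tag, more")` returns `["tag"]`. Because it leans on Unicode categories instead of an ASCII list, it at least behaves sensibly on non-English text.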
After digging around a while, I found some interesting blog articles by Terence Eden (Hashtags and Implicit Knowledge, Hashtag Standards) that provide evidence that Twitter doesn't even have a standard, given that the software it develops on different platforms seems to have different rules of what constitutes a hashtag.
The articles also link to the Twitter Conformance Library, which has twitter / twitter-text-conformance / autolink.yml. The hashtag section in autolink.yml has many cases matching the above rules, but also some that violate them and are still supposed to be autolinked. Some examples:
- description: "DO NOT Autolink all-numeric hashtags"
  text: "text #1234"
  expected: "text #1234"
- description: "Autolink hashtag preceded by a period"
  text: "text.#hashtag"
  expected: "text.<a href=\"http://twitter.com/search?q=%23hashtag\" title=\"#hashtag\" class=\"tweet-url hashtag\">#hashtag</a>"
- description: "Autolink hashtag with full-width hash (U+FF03)"
  text: "＃hashtag"
  expected: "<a href=\"http://twitter.com/search?q=%23hashtag\" title=\"#hashtag\" class=\"tweet-url hashtag\">#hashtag</a>"
Those are just a few examples that don't match the basic rules given in the first support article, and unfortunately the yml is full of other examples as well.
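To make the mismatch concrete, here is a rough sketch adjusted only to the three cases quoted above. Everything in it is inferred from those examples, not from any published Twitter rule: the function name `conformance_like_hashtags` is hypothetical, allowing a tag after any punctuation is my generalization from the single "preceded by a period" case, and the rest of autolink.yml may well contradict it.

```python
import unicodedata
from typing import List

# ASCII hash plus the full-width hash (U+FF03) from the quoted test case.
HASH_CHARS = {"#", "\uff03"}


def _is_boundary(ch: str) -> bool:
    """Whitespace or Unicode punctuation (general category P*)."""
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def conformance_like_hashtags(text: str) -> List[str]:
    """Scan for hashtags under rules guessed from three conformance cases."""
    tags = []
    i, n = 0, len(text)
    while i < n:
        # Assumption: a tag may follow whitespace OR punctuation
        # (generalized from the "text.#hashtag" case).
        if text[i] in HASH_CHARS and (i == 0 or _is_boundary(text[i - 1])):
            j = i + 1
            while j < n and not _is_boundary(text[j]):
                j += 1
            body = text[i + 1:j]
            # "DO NOT Autolink all-numeric hashtags"
            if body and not body.isdigit():
                tags.append(body)
            i = j
        else:
            i += 1
    return tags
```

With this version, `"text #1234"` yields no tags, while `"text.#hashtag"` and `"\uff03hashtag"` both yield `["hashtag"]`. That covers the quoted cases and nothing more, which is the underlying problem: each new conformance case means another guess.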