Is there a known XSS or other attack that makes it past a
$content = "some HTML code";
$content = strip_tags($content);
echo $content;
?
The manual has a warning:
This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users.
but that relates only to the use of the allowable_tags parameter.
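To make the scope of that warning concrete, here is a small sketch showing that an allowed tag keeps its attributes, while a call without allowed tags drops the tag entirely. The sample markup and variable names are made up for illustration.

```php
<?php
// Illustrative input only: a tag carrying an event-handler attribute.
$input = '<b onmouseover="alert(1)">hover me</b><i>italic</i>';

// With <b> allowed, the <b> tag is kept together with all of its attributes.
$withWhitelist = strip_tags($input, '<b>');

// With no allowed tags, every tag (and therefore every attribute) is dropped.
$withoutWhitelist = strip_tags($input);

echo $withWhitelist, "\n";
echo $withoutWhitelist, "\n";
```

The first line printed still carries the onmouseover attribute, which is exactly the case the manual warns about.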
With no allowed tags set, is strip_tags() vulnerable to any attack?
Chris Shiflett seems to say it's safe:
Use Mature Solutions
When possible, use mature, existing solutions instead of trying to create your own. Functions like strip_tags() and htmlentities() are good choices.
Is this correct? If possible, please quote sources.
I know about HTML Purifier, htmlspecialchars(), etc. I am not looking for the best method to sanitize HTML; I just want to know about this specific issue. This is a theoretical question that came up here.
Reference: strip_tags() implementation in the PHP source code
As its name suggests, strip_tags should remove all HTML tags. The only way we can prove it is by analyzing the source code. The following analysis applies to a strip_tags('...') call, without a second argument for whitelisted tags.
First of all, some theory about HTML tags: a tag starts with a < followed by non-whitespace characters. If this string starts with a ?, it should not be parsed. If this string starts with !--, it is considered a comment, and the following text should not be parsed either. A comment is terminated with -->; inside such a comment, characters like < and > are allowed. Attributes can occur in tags, and their values may optionally be surrounded by a quote character (' or "). If such a quote is opened, it must be closed; otherwise, a > that follows does not close the tag.
The code <a href="example>xxx</a><a href="second">text</a> is interpreted in Firefox as:
<a href="http://example.com%3Exxx%3C/a%3E%3Ca%20href=" second"="">text</a>
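For comparison, the same input can be fed to strip_tags(). The snippet below is only a sketch: given how the function tracks quotes inside a tag, I would expect the unclosed quote to swallow the rest of the input so that no tag remains, but I have not verified this on every PHP version.

```php
<?php
// The markup from above: the first href value is never closed.
$html = '<a href="example>xxx</a><a href="second">text</a>';

$stripped = strip_tags($html);

// var_dump() makes an empty or whitespace-only result visible.
var_dump($stripped);
```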
The PHP function strip_tags is referenced in line 4036 of ext/standard/string.c. That function calls the internal function php_strip_tags_ex.
Two buffers exist: one for the output, the other for "inside HTML tags". A counter named depth holds the number of open angle brackets (<).
The variable in_q contains the quote character (' or ") if any, and 0 otherwise. The last character is stored in the variable lc.
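The function has five states, three of which are mentioned in the description above the function. Based on this information and the function body, the following states can be derived:
- State 0: the output state (not inside any tag)
- State 1: inside a normal HTML tag (the tag buffer contains <)
- State 2: inside a PHP tag
- State 3: coming from the output state, the < and ! characters were encountered (the tag buffer contains <!)
- State 4: inside an HTML comment
We just need to be careful that no tag can be inserted, that is, a < followed by a non-whitespace character. Line 4326 checks a case with the < character, which is described below: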
- If inside quotes (e.g. <a href="inside quotes">), the < character is ignored (removed from the output).
- If the next character is a whitespace character, < is added to the output buffer.
- If outside an HTML tag, the state becomes 1 ("inside HTML tag") and the last character lc is set to <.
- Otherwise, if inside an HTML tag, the counter named depth is incremented and the character is ignored.
If > is met while the tag is open (state == 1), in_q becomes 0 ("not in a quote") and state becomes 0 ("not in a tag"). The tag buffer is discarded.
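The handling of < and > described above can be sketched in PHP itself. This is a simplified, hypothetical re-implementation (the name sketch_strip_tags is mine, not PHP's) that models only states 0 and 1; PHP tags, comments and the tag buffer are left out. That a > is skipped while inside a quote or while depth is non-zero is my reading of the source, so treat it as an assumption.

```php
<?php
// Simplified model of the described logic; NOT the actual PHP source.
function sketch_strip_tags(string $input): string
{
    $out = '';
    $state = 0;  // 0 = output state, 1 = inside an HTML tag
    $depth = 0;  // extra '<' characters seen while already inside a tag
    $inQ = '';   // current quote character, '' when not inside a quote
    $len = strlen($input);

    for ($i = 0; $i < $len; $i++) {
        $c = $input[$i];
        $next = $i + 1 < $len ? $input[$i + 1] : '';

        switch ($c) {
            case '<':
                if ($inQ !== '') {
                    break;                // inside quotes: ignored
                }
                if ($next !== '' && ctype_space($next)) {
                    if ($state === 0) {
                        $out .= $c;       // "< " is plain text, keep it
                    }
                    break;
                }
                if ($state === 0) {
                    $state = 1;           // a tag starts here
                } else {
                    $depth++;             // nested '<' inside a tag
                }
                break;
            case '>':
                if ($depth > 0) {
                    $depth--;             // closes a nested '<' only
                } elseif ($inQ !== '') {
                    // inside quotes: does not close the tag
                } elseif ($state === 1) {
                    $state = 0;           // tag closed, its content is dropped
                } else {
                    $out .= $c;           // stray '>' in the output state
                }
                break;
            case '"':
            case "'":
                if ($state === 1) {
                    if ($inQ === '') {
                        $inQ = $c;        // quote opens
                    } elseif ($inQ === $c) {
                        $inQ = '';        // matching quote closes
                    }
                } else {
                    $out .= $c;           // quotes outside tags are kept
                }
                break;
            default:
                if ($state === 0) {
                    $out .= $c;           // only text outside tags is emitted
                }
        }
    }
    return $out;
}

echo sketch_strip_tags('<a href="x>y">link</a> and 1 < 2'), "\n";
```

Nothing read while in state 1 ever reaches $out, which is the property the argument relies on.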
Attribute checks (for characters like ' and ") are done on the tag buffer, which is discarded. So the conclusion is:
strip_tags without a tag whitelist is safe for inclusion outside tags; no tag will be allowed.
By "outside tags", I mean not in tags as in <a href="in tag">outside tag</a>. Text may still contain < and >, as in >< a>>. The result is not valid HTML, though: <, > and & still need to be escaped, especially the &. That can be done with htmlspecialchars().
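Both caveats can be illustrated with a short sketch. The sample strings and variable names are invented for the example; the point is only that strip_tags() leaves quotes, <, > and & in place, so the result still has to be escaped for the context in which it is printed.

```php
<?php
// Illustrative attacker-controlled value: it contains no tag at all,
// so strip_tags() has nothing to remove.
$attr = strip_tags('" onmouseover="alert(1)');

// Unsafe inside a tag: the surviving double quote ends the attribute value.
$unsafe = '<a href="' . $attr . '">link</a>';

// Escaping for the attribute context neutralises the quote.
$safe = '<a href="' . htmlspecialchars($attr, ENT_QUOTES, 'UTF-8') . '">link</a>';

// Outside tags, leftover <, > and & should still be escaped for valid HTML.
$text = strip_tags('Tom & Jerry: 1 < 2 <b>bold</b>');
$escaped = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

echo $unsafe, "\n", $safe, "\n", $escaped, "\n";
```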
The description for strip_tags without a whitelist argument would be:
Makes sure that no HTML tag exists in the returned string.