remove script tag from HTML content

I-M-JM · Aug 20, 2011 · Viewed 121.8k times

I am using HTML Purifier (http://htmlpurifier.org/)

I want to remove only <script> tags. I don't want to remove inline formatting or anything else.

How can I achieve this?

One more thing: is there any other way to remove script tags from HTML?

Answer

Dejan Marjanović · Aug 20, 2011

Because of how this question is tagged, I'm going to answer with the poor man's solution for this situation:

$html = preg_replace('#<script(.*?)>(.*?)</script>#is', '', $html);

However, regular expressions are not meant for parsing HTML/XML. Even if you write the perfect expression, it will break eventually, so it's not worth it. In some cases a regex is useful for quickly fixing some markup, but as with any quick fix, forget about security. Use regex only on content/markup you trust.

Remember, anything a user inputs should be considered unsafe.
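To make the warning about untrusted input concrete, here is a small sketch of one way the pattern can be fooled. The input string is a made-up example, not something from the question: it splits a script tag around a decoy script element, and since preg_replace() makes a single pass, removing the decoys reassembles a working tag.

```php
<?php

// Same pattern as the one-liner in the answer.
$pattern = '#<script(.*?)>(.*?)</script>#is';

// Hypothetical hostile input: "<scr" and "ipt>" are glued back together
// once the decoy <script></script> pairs between them are removed.
$html = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>';

// preg_replace() does not rescan the text it has already produced.
$cleaned = preg_replace($pattern, '', $html);

echo $cleaned, PHP_EOL; // prints <script>alert(1)</script>
```

Running the replacement in a loop until nothing changes would close this particular hole, but other variations would likely remain, which is why the regex is only suitable for markup you trust.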

A better solution here is to use DOMDocument, which is designed for this. Here is a snippet that demonstrates how easy, clean (compared to regex), (almost) reliable, and (nearly) safe it is to do the same:

<?php

$html = <<<HTML
...
HTML;

$dom = new DOMDocument();

$dom->loadHTML($html);

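// getElementsByTagName() returns a live list, so collect the nodes first
// instead of removing them while iterating
$script = $dom->getElementsByTagName('script');

$remove = [];
foreach($script as $item)
{
  $remove[] = $item;
}

// Remove each collected <script> element from its parent
foreach ($remove as $item)
{
  $item->parentNode->removeChild($item);
}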

$html = $dom->saveHTML();

I have left out the sample HTML intentionally, because even this approach can break on some input.
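The DOMDocument approach above could be wrapped in a reusable function along these lines. This is a minimal sketch, not a hardened sanitizer: the function name strip_script_elements() and the sample input are hypothetical, and it assumes you are fine with saveHTML() returning a full document (doctype, html and body wrappers) rather than the original fragment. It also only removes script elements; inline event-handler attributes such as onclick are left untouched.

```php
<?php

// Hypothetical helper; not part of DOMDocument or HTML Purifier.
function strip_script_elements(string $html): string
{
    // Nothing to parse; some PHP versions also reject empty input in loadHTML().
    if (trim($html) === '') {
        return $html;
    }

    // Collect libxml parse errors internally instead of emitting PHP warnings
    // for malformed markup, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The node list is live, so copy it to an array before removing nodes.
    $scripts = iterator_to_array($dom->getElementsByTagName('script'));
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }

    return $dom->saveHTML();
}

// Made-up sample input for illustration.
$clean = strip_script_elements('<p>Hello <b>world</b></p><script>alert(1)</script>');
echo $clean;
```

In this sketch the inline formatting (the b element) survives while the script element is gone, which matches what the question asks for.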