Friday, September 2, 2011

strip_tags() is not enough

All web developers know that any kind of text which comes from user input, should be sanitized before rendering. You must remove all dangerous HTML tags to avoid HTML and Javascript injections, XSS attacks - good old well-known issues, I'm not going to bore you with their concept. When you sanitize the input, in most of the cases you don't want to remove all HTML tags, you want to give your user the freedom to use some formatting tags. PHP has a built-in function to do that called strip_tags() and it is used widely by PHP developers. As a quick reminder let's see how to use it:
$secure_text = strip_tags($original_text, '<b><i>'); // we only allow the <b> and <i> tags, everything else will be removed
Do you feel peaceful, calming sense of safety? You shouldn't ;)

The problem with strip_tags() is that it only removes the unwanted HTML elements, but it doesn't do anything with the attributes of the allowed elements; and the HTML attributes can be javascript event handlers, this way they can contain malicious code, therefore even a tame <b> tag can be dangerous. For example if the input is <b onclick="alert('PWNED')">click me</b> then the onclick attribute won't be removed if you simply sanitize the input with strip_tags($input, '<b><i>') and the attacker successfully made a javascript injection.

You may think that you can write a simple function to properly remove the dangerous attributes of the allowed elements, and use that instead of strip_tags(), but there are 2 problems with that:
  • you can't do that safely with regexp
  • if you take into account the <a> tag, then the problems become more complicated: if you allow links in the user input - why not, the user must be able to put links into his forum post - then the obviously needed href attribute can also contain malicious code. Example: <a href="javascript:alert('PWNED')">click me</a> - and this javascript will be executed by the browser. It's one more thing to take care about.

To sum up, strip_tags() is not enough. Let's see what else we have.

If you are up-to-date with the well-known PHP libraries, then you have surely heard about HTMLPurifier. It's known to be a very mature tool to securely handle the user input. This is the first thing you should have in mind. Secondly, your template engine may also be responsible for properly escape the unsecure data. I didn't want to check all template engines out there, so I picked Twig - which I have had to use recently - and tried out what it knows. I have made the testing using the following PHP script:

    <title>JavaScript Injection testing</title>

<?php echo strip_tags($_GET['txt'], '<b><a>') ?>



require 'path/to/htmlpurifier/';

$purifier = new HTMLPurifier(array(
	'HTML.AllowedElements' => array('b', 'a')

echo $purifier->purify($_GET['txt']);



require 'path/to/twig/Twig/Autoloader.php';


$loader = new Twig_Loader_String();
$twig = new Twig_Environment($loader);

$template = $twig->loadTemplate('{{ input|striptags("<b><a>")|raw }}');

$template->display(array('input' => $_GET['txt']));


<form action="">
<input name="txt"/>
<input type="submit"/>


After downloading HTMLPurifier and Twig, you can simply try it out.

For testing, I used the following form data:
<b onclick="alert('PWNED')">click me</b> | <a href="javascript:alert('PWNED')">funny pics here</a>

So I tested both the javascript event handler removal and the malicious <a href="javascript:..."> value removal on strip_tags(), HTMLPurifier and Twig. Nothing much to say about it, both strip_tags() and Twig failed to avoid both javascript injections. HTMLPurifier did a good job as expected.

  • If you want to allow some HTML in the untrusted input, then don't rely on strip_tags(), use HTMLPurifier instead.
  • If you want to use Twig then escape the input in your controller using HTMLPurifier, before rendering, and don't rely on the striptags filter of Twig, which in fact doesn't do anything else just calls strip_tags(). (Otherwise Twig is not a bad choice for templating in my opinion)
  • You may provide a custom formatting syntax for your user and completely disallow HTML, for example BB codes and wiki syntax can be an option in some cases, but due to the lack of WYSIWYG editors for those "languages" it's not possible in a lot of cases nowadays.


  1. Actually there are WYSIWYG editors for various syntaxes, see for example:
    So in my opinion this is the best solution so far.

  2. This comment has been removed by a blog administrator.