  Special Character Handling
 P van Diemen

Special characters in this context are characters that have a special meaning in the formal language in which they are used, usually a syntactic one.  Typically they include the escape character of that language.

For HTML, the special characters are single quote ', double quote ", the less than < and greater than > signs, and the ampersand & sign (HTML escape code).

For SQL the issue lies in the string handling of the scripting language:  an SQL query is passed as a command in a string, and that string itself contains quoted string parameters, so the special characters are the quotes.  The examples here use PHP as the scripting language.

HTML Forms Handling

For normal HTML text, the HTML special characters (', ", <, >, &) are usually no problem, though it is wise to encode the characters <, > and & as their escaped values &lt;, &gt; and &amp; (so-called named character references;  the numeric form, e.g. &#60;, is a Numeric Character Reference, NCR).  HTML is rather forgiving of (syntactical) errors (the browser doesn’t display error messages, but tries to make the best of it).  However, within HTML tags you must escape all the special characters in attribute data, in particular the quotes, as they may otherwise be mistaken for syntax.

This occurs in particular for text <input> in HTML-forms, e.g.

<INPUT TYPE=TEXT NAME=F1 VALUE='It's late'>
which is an HTML syntax error.  The browser takes 'It' as the value to be presented to the user, doesn’t recognise (and ignores) “s late'” and continues after the >.

When coding HTML by hand, you can often obtain a desirable result by using the appropriate quotes (single or double) to enclose the attribute value.  However, when the value is unknown beforehand (e.g. a value from a database), you have a problem:  the value may include ' and/or " (and/or any of the other special characters) which obviously interferes with the syntax.
So you must always escape such (database) values when you place them in a form.

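In PHP you can use the htmlspecialchars function (or htmlentities) for this:  it encodes the special characters (' " < > &) as character references (&apos; &quot; &lt; &gt; and &amp; respectively).  Note that the single quote is only encoded when the ENT_QUOTES flag is set (the default only since PHP 8.1), and that it becomes the named &apos; with ENT_HTML5 but the numeric &#039; with the older ENT_HTML401.  The encoding has no consequence for the display of these characters, or for editing them (the browser treats a character reference as a single character).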

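Strictly, the browser decodes the character references when it parses the attribute, so the value it submits already contains the plain characters.  Translating HTML entities back to their character equivalent is needed only when the value arriving in your script still contains entities (for instance text that was stored or pasted in encoded form);  without it you would pollute the database with entity sequences.  This can be achieved through the html_entity_decode (or htmlspecialchars_decode) function.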

The last parameter of the htmlspecialchars (and htmlentities) function, double_encode, controls whether HTML entities that are already present in the value are encoded again;  it should be false for our purposes (otherwise an entity like &eacute; is shown literally and not as é, usually an unwanted effect).
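The effect of this last parameter can be checked with a short sketch.  The sample string and the variable names are only illustrative, and the sketch assumes PHP 5.4 or later (for ENT_HTML5).

```php
<?php
// Illustrative input: it already contains one entity (&eacute;) and one bare ampersand.
$text  = 'caf&eacute; & tea';
$flags = ENT_QUOTES | ENT_HTML5;

// double_encode = false: the existing entity is left alone, only the bare & is encoded.
$once = htmlspecialchars($text, $flags, 'UTF-8', false);

// double_encode = true (the default): the & of &eacute; is encoded as well.
$twice = htmlspecialchars($text, $flags, 'UTF-8', true);

echo $once, "\n";    // caf&eacute; &amp; tea
echo $twice, "\n";   // caf&amp;eacute; &amp; tea
```

In a browser the first result displays as "café & tea", the second as "caf&eacute; & tea".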

Htmlspecialchars and htmlentities are equivalent except that htmlentities will convert all characters which have an HTML-entity equivalent.  For handling <input> values, htmlspecialchars conversion of ', ", <, > and & is adequate.
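A minimal sketch of the difference between the two functions, using a made-up value with one accented letter (assuming PHP 7 or later for the \u{...} notation):

```php
<?php
// Illustrative value: "café <b>", with the é written as a Unicode escape so that
// the example does not depend on the encoding of the script file.
$text  = "caf\u{E9} <b>";
$flags = ENT_QUOTES | ENT_HTML5;

// htmlspecialchars touches only ' " < > and &.
$special = htmlspecialchars($text, $flags, 'UTF-8', false);

// htmlentities also converts é, because it has a named entity.
$all = htmlentities($text, $flags, 'UTF-8', false);

echo $special, "\n";   // café &lt;b&gt;
echo $all, "\n";       // caf&eacute; &lt;b&gt;
```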

Similarly, html_entity_decode and htmlspecialchars_decode are equivalent except that html_entity_decode will decode all HTML-entities (not just &apos; &quot; &lt; &gt; and &amp; but also e.g. &eacute;).  It is usually desirable to convert all HTML-entities for a database (with potentially less desirable cases &nbsp; and &shy; as they are not discernible anymore).
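The two decode functions can be compared in the same way.  This is only a sketch;  the sample value is invented for the purpose.

```php
<?php
// Illustrative value containing both a "special" entity and an accented one.
$stored = 'caf&eacute; &amp; &lt;tea&gt;';
$flags  = ENT_QUOTES | ENT_HTML5;

// Decodes only &amp; &lt; &gt; &quot; and the single-quote references.
$basic = htmlspecialchars_decode($stored, $flags);

// Decodes every HTML entity, including &eacute;.
$full = html_entity_decode($stored, $flags, 'UTF-8');

echo $basic, "\n";   // caf&eacute; & <tea>
echo $full, "\n";    // café & <tea>
```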

Example (in PHP, assuming HTML5 and character set UTF-8):

$val = ...;       //e.g. "It's late"
echo "<INPUT TYPE=TEXT NAME=F1 VALUE='" . htmlspecialchars( $val, ENT_QUOTES|ENT_HTML5,
'UTF-8', false ) . "'>";

In the script which processes the form containing this <input> you use:

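$F1 = $_GET['F1'] ?? $_POST['F1'] ?? '';    // GET or POST, depending on the form's method
$F1 = html_entity_decode( $F1, ENT_QUOTES|ENT_HTML5, 'UTF-8' );    // turns any entities still in the value into plain characters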

That’s all.

The above also applies to the <textarea> tag;  not so much for the quotes as for the angle brackets.  Make sure that you apply htmlspecialchars only once to each field, otherwise the consequences are the same as with the double_encode parameter set to true:  the HTML entities become visible in the input field.
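For a <textarea> this could look as follows.  The helper textarea_field and the field name F2 are hypothetical, chosen only for this sketch.

```php
<?php
// Hypothetical helper: builds a <textarea> whose content is escaped exactly once.
function textarea_field(string $name, string $value): string
{
    $flags = ENT_QUOTES | ENT_HTML5;
    return '<textarea name="' . htmlspecialchars($name, $flags, 'UTF-8') . '">'
         . htmlspecialchars($value, $flags, 'UTF-8', false)
         . '</textarea>';
}

// A value that would otherwise close the tag prematurely.
echo textarea_field('F2', 'a </textarea> b'), "\n";
// <textarea name="F2">a &lt;/textarea&gt; b</textarea>
```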

If you want to display values containing HTML-tags (e.g. from a database) as text (show the tags and not let them be interpreted by the browser), you should use the htmlentities-function with double_encode set to true, so that entities already present in the value are shown literally as well.
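A small sketch of this case, with an invented value.  The exact output is not shown here, because with ENT_HTML5 htmlentities may also convert other characters that have a named entity;  the checks are therefore limited to what matters for display.

```php
<?php
// Illustrative stored value: markup plus an entity that must stay visible as typed.
$stored = '<b>bold</b> &amp; more';

// double_encode = true: the & of the existing &amp; is encoded too.
$shown = htmlentities($stored, ENT_QUOTES | ENT_HTML5, 'UTF-8', true);

// No raw angle brackets are left, so the browser cannot interpret the tags.
var_dump(strpos($shown, '<') === false);

// Decoding once gives the stored text back, which is what the browser displays.
var_dump(html_entity_decode($shown, ENT_QUOTES | ENT_HTML5, 'UTF-8') === $stored);
```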


SQL Query Handling

When handling SQL from a scripting language there is a basic problem in creating the query:  the SQL command is passed as a string, but that query may contain parameters having quoted strings as values.  E.g.

$name = "O'Neill";
$query = "UPDATE tbl3 SET name='$name' WHERE id=12345";
which is interpreted by SQL as:
UPDATE tbl3 SET name='O'Neill' WHERE id=12345
and leads to an SQL syntax error.

As single quotes ' are more common in text than double quotes ", one may occasionally circumvent the above problem by using double quotes as string delimiter in the query (where the SQL dialect allows this).  But if a parameter value may (also) contain double quotes, this is no solution.

One may encode the quotes as &apos; or &#39; and &quot; or &#34; respectively:  in HTML it will look the same, but it is not really the same in the database.  So not a good solution either.

MySQL (though not every SQL dialect;  standard SQL doubles the quote instead) allows an escape code in strings through the backslash \:  the next character will be taken literally in the string.  This is valid for strings enclosed by single quotes or by double quotes.  And the same escape method applies to PHP!

The point is now to add the backslash just before each occurrence of a quote in the string.  And before any occurrence of the backslash itself as it is the escape character (though it is a rare character).  And PHP offers a simple function for that:  addcslashes.

$query = "UPDATE tbl3 SET name='" . addcslashes( $name, '\'"\\' ) . "' WHERE id=12345";

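Problem solved.  And it blocks simple SQL injection attempts at the same time, though for untrusted input a prepared statement or the database driver's own escaping function is the more robust choice.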
But here as well, make sure you perform addcslashes only once, otherwise you may get backslashes in your database.
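The sketch below shows the query string that results, and what happens when the escaping is applied a second time.  It only builds strings and does not contact a database;  the variable names are illustrative.

```php
<?php
$name  = "O'Neill";
$chars = '\'"\\';    // the three characters to escape: ' " and \

// Escaped once: a backslash is placed before the quote.
$escaped = addcslashes($name, $chars);
$query   = "UPDATE tbl3 SET name='" . $escaped . "' WHERE id=12345";
echo $query, "\n";   // UPDATE tbl3 SET name='O\'Neill' WHERE id=12345

// Escaped twice by mistake: the first backslash is itself escaped,
// so the database would store O\'Neill instead of O'Neill.
$twice = addcslashes($escaped, $chars);
echo $twice, "\n";   // O\\\'Neill
```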


=O=