  Special Character Handling
 P van Diemen

Special characters in this context are characters that have a special meaning in the formal language they are used in.  Usually they have a special meaning in the syntactical sense.  Typically they include the ‘escape’ codes of that language.

Handling them is not complex, but you should be aware of the intricacies when using HTML forms to enter data into an SQL database.

HTML handling

For normal HTML texts, the HTML special characters (', ", <, >, &) are usually no problem, though it is wise to encode the characters <, > and & with their corresponding escaped values &lt;, &gt; and &amp; (their so-called character references;  the numeric form, e.g. &#60;, is a Numeric Character Reference, NCR).  HTML is rather forgiving of (syntactical) errors (the browser doesn’t display error messages, but tries to make the best of it).  However, within HTML tags you must escape all the special characters in attribute data, in particular the quotes, as they may otherwise be mistaken for syntax.

This occurs in particular for text <input> in HTML-forms, e.g.

<INPUT TYPE=TEXT NAME=F1 VALUE='It's late'>
which is an HTML syntax error.  The browser takes 'It' as the string for the value to be presented to the user, doesn’t recognise (and ignores) “s late'” and continues after the >.

When coding HTML by hand, you can often obtain the desired result by using the appropriate quotes (single or double) to enclose the attribute value.  However, when the value is unknown beforehand (e.g. a value from a database), you have a problem:  the value may include ' and/or " (and/or any of the other special characters), which obviously interferes with the syntax.
So you always have to escape (database) values that are used in a form.

When generating such forms in PHP you can use the htmlspecialchars function (or htmlentities) for this:  it will encode the special characters (' " < > &) as character references (&#039; or &apos;, &quot;, &lt;, &gt; and &amp; respectively).
The second parameter of htmlspecialchars holds the translation flags.  The exact translation depends on the ENT_HTML401/ENT_HTML5 flag (a single quote becomes &#039; with ENT_HTML401 and &apos; with ENT_HTML5), but that is hardly relevant.  What matters is that a single quote is only translated when the ENT_QUOTES flag is set;  before PHP 8.1 that flag was not part of the default, so it is safest to pass it explicitly.
The third parameter is the character set (e.g. 'UTF-8'), and the fourth and last parameter of the htmlspecialchars (and htmlentities) function, double_encode, controls double encoding of existing HTML entities;  it should be false for our purposes (otherwise the & of an entity like &eacute; is encoded again, so that the entity is shown literally and not as é, usually an unwanted effect).
Such a translation has no consequence for the display of these characters, or for editing them (the browser treats a character reference as a single character).
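
The effect of the flags and of the double encoding parameter can be checked with a small test.  The following is only a sketch;  the value in $val is an illustrative example and not taken from a real form.

```php
<?php
// Illustrative value: a single quote plus an entity that is already encoded.
$val = "It's &eacute;";

// Without ENT_QUOTES (here ENT_COMPAT) the single quote is left alone.
$compat = htmlspecialchars($val, ENT_COMPAT | ENT_HTML401, 'UTF-8', false);

// With ENT_QUOTES the quote is encoded; its form depends on the document type flag.
$html401 = htmlspecialchars($val, ENT_QUOTES | ENT_HTML401, 'UTF-8', false);
$html5   = htmlspecialchars($val, ENT_QUOTES | ENT_HTML5, 'UTF-8', false);

// With double encoding switched on, the & of &eacute; is encoded once more.
$double  = htmlspecialchars($val, ENT_QUOTES | ENT_HTML5, 'UTF-8', true);

echo $compat, "\n";   // It's &eacute;
echo $html401, "\n";  // It&#039;s &eacute;
echo $html5, "\n";    // It&apos;s &eacute;
echo $double, "\n";   // It&apos;s &amp;eacute;
```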

Consequently, the <input> should be coded as:

$val = "It's late";
echo "<INPUT TYPE=TEXT NAME=F1 VALUE='" . htmlspecialchars( $val, ENT_QUOTES, 'UTF-8', false ) . "'>";

Htmlspecialchars and htmlentities are equivalent, except that htmlentities converts all characters that have an HTML-entity equivalent.  For handling <input> values, the htmlspecialchars conversion of ', ", <, > and & is adequate.
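
A minimal sketch of the difference between the two functions, using an illustrative value that contains an é next to the special characters:

```php
<?php
// Illustrative value; "\u{e9}" is the letter é written as an escape (PHP 7+),
// so the example does not depend on the encoding of the source file.
$val = "caf\u{e9} <b>\"R&D\"</b>";

// htmlspecialchars only touches ' " < > and &.
$special = htmlspecialchars($val, ENT_QUOTES | ENT_HTML401, 'UTF-8', false);

// htmlentities also converts é, because it has a named entity.
$all = htmlentities($val, ENT_QUOTES | ENT_HTML401, 'UTF-8', false);

echo $special, "\n";  // the é is still a plain character here
echo $all, "\n";      // caf&eacute; &lt;b&gt;&quot;R&amp;D&quot;&lt;/b&gt;
```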

The above also holds for the <textarea> tag (where the value goes between the opening and closing tag rather than in an attribute);  it is more likely that such text contains single and double quotes, the angular brackets and the ampersand.  Make sure that you perform htmlspecialchars only once for each field, otherwise the consequences are the same as with the double_encode parameter set to true:  you get the HTML entities visibly in the input field.  If the string contains a <br> and you want to preserve the layout, use str_replace( "<br>", "\n", $string ) before the encoding (you may not want to reverse the substitution).
On the other hand, if you display such a text as normal HTML text, you want the newline "\n" replaced by a "<br>" (str_replace or nl2br will do that).
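
Both directions can be sketched as follows.  The field name T1 and the texts are assumptions made up for the example.

```php
<?php
// Illustrative text as it might be stored, with a <br> for the line break.
$stored = "Line 1<br>He said \"hi\" & left";

// For editing: turn <br> into a newline first, then encode exactly once.
$forEdit = htmlspecialchars(str_replace("<br>", "\n", $stored), ENT_QUOTES | ENT_HTML401, 'UTF-8', false);
echo "<TEXTAREA NAME=T1>" . $forEdit . "</TEXTAREA>\n";

// For display as normal HTML text: encode first, then turn newlines into <br>.
$typed = "Line 1\nHe said \"hi\" & left";
$forDisplay = str_replace("\n", "<br>", htmlspecialchars($typed, ENT_QUOTES | ENT_HTML401, 'UTF-8', false));
echo $forDisplay, "\n";  // Line 1<br>He said &quot;hi&quot; &amp; left
```

The order matters in the display case:  encoding after the substitution would turn the <br> itself into &lt;br&gt;.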

If you want to display values containing HTML tags (e.g. from a database) as text (show the tags and not let them be interpreted by the browser), you should use the htmlentities function with the double_encode parameter set to true.
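
A short sketch of that case, with a made-up value that contains both tags and an entity:

```php
<?php
// Illustrative stored value containing tags and an already encoded ampersand.
$stored = "<b>bold</b> &amp; <i>italic</i>";

// Double encoding on: the tags and the entity are all shown literally.
$shown = htmlentities($stored, ENT_QUOTES | ENT_HTML401, 'UTF-8', true);

echo $shown, "\n";  // &lt;b&gt;bold&lt;/b&gt; &amp;amp; &lt;i&gt;italic&lt;/i&gt;
```

In the browser this is displayed exactly as the stored text, tags and &amp; included.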

If you need to reverse the effect of htmlspecialchars or htmlentities, you can use the html_entity_decode (or htmlspecialchars_decode) function.  The html_entity_decode and htmlspecialchars_decode functions are equivalent, except that html_entity_decode will decode all HTML entities (not just &apos; &quot; &lt; &gt; and &amp; but also e.g. &eacute;).  It is usually desirable to convert all HTML entities for a database (potentially less desirable cases are &nbsp; and &shy;, as the decoded characters are no longer recognisable as such).
Apart from the string, these decode functions have one additional parameter:  the translation flags.  Usually, ENT_QUOTES | ENT_HTML5 will do.  The html_entity_decode function also accepts the character set (e.g. 'UTF-8') as a third parameter.
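
The difference between the two decode functions, sketched with an illustrative encoded string:

```php
<?php
// Illustrative encoded value.
$encoded = "It&apos;s caf&eacute; &lt;ok&gt;";

// Only the special characters are decoded; &eacute; stays as it is.
$special = htmlspecialchars_decode($encoded, ENT_QUOTES | ENT_HTML5);

// All entities are decoded, including &eacute;.
$all = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// &nbsp; becomes the no-break space character U+00A0, which looks like a normal space.
$nbsp = html_entity_decode("a&nbsp;b", ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $special, "\n";  // It's caf&eacute; <ok>
echo $all, "\n";
```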


Parameter Transfer handling

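The encoding of parameters through htmlspecialchars suggests that you have to decode these character references back to their character equivalents before entering the value in a database (or you would pollute the database with NCR sequences).
But, surprise, no decoding is needed:  the browser already translates the character references back to plain characters when it reads the page, so the parameters that arrive through the GET and POST mechanisms contain the characters themselves (PHP only undoes the URL encoding used for the transfer).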

In old PHP versions, the parameters were also processed by addslashes() (the magic_quotes_gpc setting), but that feature was removed in PHP 5.4 and is no longer applicable.
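
This can be simulated without a browser.  The sketch below assumes a field named F1 and builds the query string a browser would send for it;  parse_str then does roughly what PHP does when it fills $_GET.  The value that arrives holds the plain characters, with no character references and no added backslashes.

```php
<?php
// What the user sees and submits in field F1 (illustrative value).
$typed = "It's late & \"dark\"";

// The browser sends the characters themselves, URL-encoded for the transfer.
$queryString = "F1=" . urlencode($typed);

// PHP undoes the URL encoding when it fills $_GET / $_POST; parse_str mimics that here.
parse_str($queryString, $params);

echo $queryString, "\n";   // F1=It%27s+late+%26+%22dark%22
echo $params['F1'], "\n";  // It's late & "dark"
```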


SQL Query handling

When handling SQL from a scripting language there is a basic problem in creating the query:  the query is passed to SQL as one text string, but that query may contain parameters having quoted strings as values.  E.g.

$name = "O'Neill";
$query = "UPDATE tbl3 SET name='$name' WHERE id=12345";
which is interpreted by SQL as:
UPDATE tbl3 SET name='O'Neill' WHERE id=12345
and which leads to an SQL syntax error.

Again, as single quotes ' are more common than double quotes ", one may occasionally circumvent the above problem by using double quotes as string separator in the query.  But if a parameter value may (also) contain double quotes, it proves not to be a solution.
One may encode the quotes as &apos; or &#39; and &quot; or &#34; respectively (i.e. the htmlspecialchars() treatment as used in HTML):  in HTML it will look the same, but what is stored in the database is not the same text.  So not a good solution either.

MySQL allows an escape code in strings through the backslash \:  the next character will be taken literally in the string.  This is valid for strings enclosed by single quotes or by double quotes.  (Standard SQL instead doubles a quote inside a string, and not every DBMS accepts the backslash.)  And the same escape method applies to PHP!

The point is now to add the backslash just before each occurrence of a quote in the string.  And before any occurrence of the backslash itself as it is the escape character (though it is a rare character).  And PHP offers a simple function for that:  addslashes.

$query = "UPDATE tbl3 SET name='" . addslashes( $name ) . "' WHERE id=12345";

Problem solved (but also see next section).  And it may prevent simple SQL injections at the same time.
But here as well, make sure you perform addslashes only once, otherwise you get backslashes in your database.
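
What happens when addslashes is applied once and twice can be sketched like this, using the name from the example above:

```php
<?php
$name = "O'Neill";

// Applied once: the quote gets a backslash, which SQL removes again when it reads the string.
$once = addslashes($name);
$query = "UPDATE tbl3 SET name='" . $once . "' WHERE id=12345";

// Applied twice: the backslash itself is escaped too, so a backslash ends up in the database.
$twice = addslashes($once);

echo $query, "\n";  // UPDATE tbl3 SET name='O\'Neill' WHERE id=12345
echo $twice, "\n";  // O\\\'Neill
```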

NB:  There is also the function addcslashes (and stripcslashes) which has an extra parameter for the characters to quote.  That parameter may even include ranges of characters, which is more powerful than we need.  In our case you would use '\'"\\'.
You may also use DBMS-specific escape functions (e.g. mysqli_real_escape_string);  these are somewhat less easy to use, but they take the character set of the database connection into account, which addslashes does not.
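
A small sketch comparing addcslashes, with that character list, to addslashes.  The test string is made up;  the two results are the same as long as the string contains no NUL byte, which addslashes escapes as well.

```php
<?php
// Illustrative value with a single quote, double quotes and a backslash.
$value = "O'Neill said \"hi\" \\ bye";

$a = addslashes($value);
$b = addcslashes($value, '\'"\\');  // the list holds the three characters ' " and \

echo $a, "\n";  // O\'Neill said \"hi\" \\ bye
var_dump($a === $b);  // bool(true)

// The difference: addslashes also escapes a NUL byte, the character list above does not.
var_dump(addslashes("a\0b") === addcslashes("a\0b", '\'"\\'));  // bool(false)
```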


=O=