Home Misc Index
MySQL database conversion to UTF8
 P van Diemen

With the application of HTML5, the use of the UTF-8 character set is now common.  However many databases still use the Latin1 character set;  the developers and/or website managers may not even be aware that they do.  It works, with potentially some occasional malformed letters, but searching and sorting do not provide optimal results.  Some may have wondered how these malformed characters came into existent, and are mystified by there cause (as I was).

Others may even have tried to convert the database, but reverted to the previous state because of the dramatic results (all non-ASCII characters malformed).  And yet, such a conversion is not very complex, but it is tricky !

This page is about converting a MySQL database from Latin1 to UTF8.  Here we assume ISO-8895-1/­Latin1 as a starting point, but character sets in the series ISO 8895-n or Latinn work similarly.

Introduction

Before actually discussing the conversion, you need to know some more about character sets and their properties.

Character sets

ASCII:  The standard character set for nearly all applications –and specifically in our case for HTML and MySQL– is ASCII (see HTML Char - ASCII).  ASCII might be sufficient for English, but it is lacking for other languages.  Most Western European languages use some accented letters (e.g. â, é, ü, ñ) and so-called ligatures:  a combination of two letters to a single character (e.g. Æ, fi, ij).

ASCII extensions:  As ASCII is a 7-bit character set (codes 0··127) and most computers use 8-bit bytes, there is room for another 128 character codes.  However, 128 codes are not enough to accommodate all other characters, so multiple character sets were ‘standardized’ for specific language regions.  This led to the ISO 8895-n/Latinn series.

Latin1/CP1252:  For Western Europe, ISO 8895-1/Latin1 (see HTML Char - Latin1) is popular and we use that in this page as example.  Note that formally ISO 8895-1 defines printable symbols for the range 160··255;  the range 128··159 contains control codes.  But these control codes were rarely used, so another character set, Windows Code Page 1252 (CP1252), defined printable symbols for that range as well.  And there are other CP125x code pages for ISO 8895-n or Latinn.
When in HTML or in MySQL they say they use Latin1, they actually use CP1252 !  And what Latin1/CP1252 adds to ASCII are mainly the letters with accents (so called diacriticals).  For example the accented A’s: À, Á, Â, Ã, Ä, Å, à, á, â, ã, ä, å.

Unicode/UTF-8:  To solve the issue of foreign characters in all of these character sets, Unicode was defined.  Basically it defines a 31-bit character set (i.e. code space for more than 2 billion characters) so all kinds of foreign characters can be accommodated.  Obviously, a Unicode character requires multiple bytes.  There are several ways to encode a Unicode character;  the most important ones are the NCR-method (discussed below) and Unicode Transformation Format 8 (UTF-8, standard for HTML5).

UTF-8 is a scheme that encodes the Unicode code number into a multi-byte code.  We don’t go into the details here (for those see HTML Char - UTF8), but check the following notes.
Notes:

NCR:  All browsers accept characters denoted via the Numerical Character Reference (NCR):  &«value»; where «value» it either #decimal_code_number, #xhexadecimal_code_number or character_name.  The code_numbers and character_names are all from the Unicode standard.

NCR is an old method for HTML as it is the only way to denote characters also used in the HTML syntax without any problems:  ", ', &, < and >. 
With NCR the number of bytes is the same as the number of (ASCII) characters, but it takes multiple characters to encode a single ‘foreign’ character.  So it takes more bytes than characters in the browser.

Collation sequence

A collation sequence provides the order in which characters are sorted.  But it is also used to match characters when searching (in MySQL).

For example, when no collation sequence is applied, sorting single ASCII letters will result in the order A B C ... Z a b ... z.  And that is not what you want;  you want upper and lower case letters together, like A a B b C c ... Z z.

And when you add the diacritical characters, the desired order becomes:  A À Â Ã Ä Å Æ a à á â ã ä å æ B b C Ç c ç etc.  This is what a collation sequence takes care of.

Also when searching (in MySQL), you want to match case insensitive:  searching for an “a” should not only match “a” but also “A”.  When including diacritical characters, it should also match À, Â, Ã, Ä, Å, Æ, à, á, â, ã, ä, å and æ.  This is called accent insensitive, and is also handled by the collation sequence in MySQL.

Above sorting and search matching works perfectly when MySQL is set correctly, and the database contains the corresponding characters.  It will not work for the diacritical characters when for example the database is set to Latin1 while the contents is in UTF-8.  Or when characters are denoted through NCR.


Cause of Problem

It is very easy to get UTF-8 characters in a database set to an ISO 8895‑n/Latinn character set.  If you were using an ISO 8895‑n/Latinn character set in your web pages, e.g.

META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
and you switched your pages to UTF-8, i.e.
META http-equiv="Content-Type" content="text/html; charset=utf-8">
everything seems to work Ok:  you can enter new values with diacriticals into the database and everything comes out nicely.  Until you come across some diacritical characters in older data like �:  those characters are malformed.  At first you may thing somebody made an error when entering that data, but then all older non-ASCII characters appear malformed.  What has happened ?

If your data entry page –the web page with all the HTML <input>-statements– has UTF-8 as charset, the entered text will be in UTF-8.  And as ISO 8895-n/Latinn uses all codes in a byte, a (MySQL) database set to any ISO 8895-n/Latinn will store all bytes transparently, even if the actual text is from another character set.  So UTF8 codes are stored unchanged and retrieved unchanged, and displayed in a page with charset=utf-8.  No problem for true UTF-8 codes, but old texts using Latinn diacriticals are interpreted as UTF8 codes and prove to be invalid, and are therefore displayed malformed.

The change of a web page to charset=utf-8 may happen implicitly, for example when you start using HTML5 –i.e. when you add the HTML5 Document Type Definition (DTD: <!DOCTYPE html>)– without explicitly specifying charset=ISO-8859-1, or when you (or your hosting provider for you) upgraded to PHP v5.6 or later.  From PHP5.6 on, PHP sets charset = UTF8 in the generated page’s http header (not the HTML head section), and that overrules any <META http-equiv="Content-Type" you set !  See HTML Char - Representation of Glyphs for more details.

Now you have two problems: 

  1. you get sometimes malformed characters from the database (Latinn diacriticals entered when the input web page was still using the ISO 8895‑n character set);
  2. The search results are not optimal (an “a” will not match an UTF-8 “á”).

You can try to correct the old Latinn diacriticals to the corresponding UTF-8 diacriticals (if you still have access to the old texts, and are willing to spend the effort entering it all again), but you can’t solve the problem with poor search results.  So you want to convert everything to UTF-8, or keep everything in Latinn.  But in both cases you need to have a uniform character set for the contents of your database (either UTF-8 or Latinn).  So much of the same problem (so why not convert to UTF-8).

The conversion of database contents to UTF8 is simple when the existing contents is of a uniform character set (e.g. Latinn).  And we have just shown that that is probably not the case (as has happened to me).

As a temporary measure, you can add an accept-charset=ISO-8895‑n attribute to the input forms in your web pages (and translate database output to UTF-8 in the results pages) as that limits the impact of UTF-8 in the database.  Or use “header( 'Content-Type: text/html; charset=ISO-8895-1' );” in PHP before any HTML is generated to counteract PHP5.6 default behaviour.

So the conversion will be more complex, and it starts with a check whether the above mentioned issue (having UTF8 codes in a Latinn database) is applicable for you.  If so, you first must make the database contents uniform again.  And that all is possible via (PHP) scripts.


MySQL

General

This section is about character sets in MySQL to provide background for the actual conversion process.

MySQL has an extensive number of parameters regarding character sets and collation sequences, which makes it a bit confusing.  Here we first discuss the rather static character set and collation sequence in the database itself, later we discuss the settings controlling the dynamic interaction with a client during a query.

MySQL supports a series of character sets and corresponding collation sequences depending on the parameters in the MySQL configuration/­installation.  Commonly it includes Latin1 and UTF-8.  Use the command SHOW CHARACTER SET in MySQL to show the supported character sets on your SQL-server, and SHOW COLLATION to show the supported collation sequences (e.g. via a page as described in MySQL Access).

Note that MySQL uses the character set name “utf8” (lowercase, no hyphen) for what is formally called UTF-8 (e.g. in HTML and PHP).

Note:

Database

MySQL discerns character set and collation (sequence) settings at various levels in the database;  in descending order:

  1. Server Character set & Collation
    The character set and the collation with which the SQL server was started.  This setting acts as a default for databases which are created later.  This is a setting done by the hosting provider (who usually doesn’t like to deviate from the default in the package, often set to Latin1).
  2. Database Character Set & Collation
    The character set and the collation of an existing database.  This acts like a default for new tables created later in this database.  One can create a new database, or modify an existing database, through the command:
    CREATE / ALTER DATABASE «db_name»
       CHARACTER SET «charset_name»
            #implies default collation sequence
    and/or
       COLLATE «collation_name»            #implies the character set
  3. Table Character Set & Collation
    The character set and the collation of an existing table.  This acts as a default for new text columns created in this table (character set and collation sequence are only applicable on text of course).  One can create a new table, or modify an existing table, through the command:
    CREATE / ALTER TABLE  «table_name»
       CHARACTER SET «charset_name»
       [COLLATE «collation_name»]
  4. Column Character Set & Collation
    The character set and the collation of a text column (in a table).  One can create a (text) column through the command:
    CREATE TABLE  «table_name»
       «col_name» { Char/Varchar/Text (length) }
       CHARACTER SET «charset_name»
       [COLLATE «collation_name»]

    or modify an existing column by:
    ALTER TABLE  «table_name»
    MODIFY «col_name»
       CHARACTER SET «charset_name»
       [COLLATE «collation_name»]
      [, MODIFY «col_name» …]

    NB:  You may (create or) alter multiple columns at the same time.
     
    Use the SQL-command
        SHOW FULL COLUMNS IN «table_name»
    to show current use of (character set and) collation sequence in the columns.

Note that modifying the character set and/or the collation sequence of a database or a table changes the default setting but doesn’t change the contents.

However, changing the character set (potentially via changing the collation sequence) of a column does change the contents !  MySQL will translate the contents to the new character set.  And if the contents was not according to the old character set, it will translated incorrectly (and result in malformed characters).
In Analysis below we will go into more details of issues and solutions.

Note also that it is possible to have a table with columns using distinct character sets and/or collation sequences.  Not that you would normally want that, but it may be useful during conversion.

Interaction

For the interaction between the database and the application –the queries– SQL discerns:

  1. Client Character Set & Collation
    The character set and the collation for the commands sent to the SQL-server (e.g. from a (PHP-)application).
  2. Connection Character Set & Collation
    The character set and the collation used by the SQL-server to execute the query (i.e. after potential translation from the client, and after potential translation of text from table columns).
  3. Results Character Set & Collation
    The character set and the collation where the results of the query are translated to before sending it to the (PHP-)application.

Then there are some system variables which are set by the hosting provider, but usually have acceptable values.

  1. Character_set_system (& collation)
    The character set and the collation which the server uses for identifiers.  This should be ut8 (if it’s not you might be restricted to names with only ASCII characters – not really a problem).
  2. Character_set_filesystem (& collation)
    The character set and the collation used in the file system (relevant for commands like LOAD FILE).  This is default set to binary, meaning that there will be no conversion.  If there could be multi-byte filenames, utf8 is preferable.
  3. Character_sets_dir
    The directory where character sets (and the collations) are stored.  Not relevant as we have normally no access to that directory.

The interaction for a query can be depicted as follows:

Application   MySQL-server   Database
Client charset
Results charset


pot. translation
Connection charset
 


pot. translation
tab1.col2 charset
tab2.col3 charset

NB:  By setting above mentioned charsets appropriately, one can work with Latin1 at the client side (PHP/HTML) while the database contains UTF8.  Or the reverse.

In general, one would avoid translations as much as possible (delay/processing power, risk of mistakes or poor translations), but during conversion it can be useful.

Use the SQL commands

SHOW VARIABLES LIKE 'character_set%';
and
SHOW VARIABLES LIKE 'collation%';
to show the settings for your server.

These commands should provide the following lists (when ultimately everything is set to UTF8):

Variable_nameValue 
character_set_clientutf8 (previously latin1)see [5]
character_set_connectionutf8 (previously latin1)see [6]
character_set_databaseutf8 (previously latin1)see [2]
character_set_filesystembinarysee [9]
character_set_resultsutf8 (previously latin1)see [7]
character_set_serverlatin1 or utf8see [1]
character_set_systemutf8see [8]
character_sets_dir/usr/share/mysql/charsets/see [10]

and:

Variable_nameValue
collation_connectionutf8_general_ci (previously latin1_swedish_ci)
collation_databaseutf8_general_ci (previously latin1_swedish_ci)
collation_serverutf8_general_ci (previously latin1_swedish_ci)

The collation sequence utf8_general_ci is pretty good;  it is case insensitive and accent insensitive.
For Latin1 Latin1_swedish_ci is usually the default;  my impression is that (outside Sweden) latin1_german2_ci performs better.

Obviously one would have the Client and the Results character set the same, and equal to the character set of the application to avoid any translation (with accompanying processing and risk of mistakes or poor translations).
This can be achieved through the command:
     SET NAMES 'charset_name'   [COLLATE 'collation_name']
or by:
     SET CHARACTER SET 'charset_name'      # e.g. 'utf8'
The distinction between these variants is that the first sets Client, Connection and the Results character sets all to the indicated character set, while the second sets Client and the Results character sets to the indicated value, and the Connection character set equal to the Database character set & collation (making sure that there are no translations when collecting or setting database values).  But after conversion everything should be on UTF8 (so no difference).

The PHP-manual has a strong preference for the 2nd variant through:

mysqli_set_charset( $dblink, 'utf8' ));     //not utf-8
(example using the mysqli procedural interface).

If the Client, Connection and the Results character sets are not what you need, and the hosting provider can’t/won’t change them permanently, you have to set them each session (i.e. after each database open).

When using constant strings in MySQL (and that happens in about every query in the WHERE-clause), you can specify explicitly the character set to be used by preceding the string with _«charset», and/or specify the collation sequence by following the string with COLLATION «col_seq» (e.g. SELECT ... WHERE name=_utf8"äbç" COLLATION utf8_general_ci).
Such a construct skips the translation from client charset to connection charset for that string.  It can be useful during a transition period, but once the character sets are uniform, it is superfluous.


Analysis

Basic conversion routes

The conversion is per text column for each table (for each database).  Basically, there are 2 methods/routes:

Straightforward:  When the column is set to charset Latinn, and all the contents in that column is in the same Latinn, you can use the SQL command

ALTER TABLE «table_name»
  MODIFY «col_name»  CHARACTER SET utf8
  [, MODIFY «col2_name»  CHARACTER SET utf8 …]

MySQL will do the translation, and takes care of the fact that UTF8 is a multi-byte code (it will assume that every character takes 3 bytes – you don’t have to modify a column’s length).  Such a conversion is rather quick and can be done on a live database. 
Converting multiple columns (in the same table) at the same time is faster than converting those columns one-by-one.  For text columns which only contain simple ASCII (e.g. a ZIP-code), you can use the direct conversion route.

Indirect:  When the column is set to charset Latinn, but all the contents in that column is in UTF8, you need to change the column’s charset without translating the contents.  This is achieved in a 2-step process:

  1. Convert the column to a charset-less format (i.e. the contents stays the same, but the character set and the collation sequence attributes disappear).
  2. Convert from the charset-less format to the original text-type but now with charset set to UTF8 (again no translation done, but the character set en the collation sequence attributes are set).

The intermediate format is binary.  Use the following table for corresponding text and binary types:

Text versionBinary version(size)
CHAR(len)BINARY(len)(len+1)
VARCHAR(len)VARBINARY(len)(len+1)
TINYTEXTTINYBLOB (28)
TEXTBLOB (216)
MEDIUMTEXTMEDIUMBLOB (224)
LONGTEXTLONGBLOB (232)

For the conversion you use similar commands as in the straightforward conversion, and it can be done on an active database, but as it is a 2-step process, the conversion is a bit slower.  Preferably you do it at idle time.
And of course you may have to use a mixed approach:  straightforward conversion for the columns only containing Latinn, and the indirect process for columns only containing UTF8.

For more details consult the MySQL manual on Globalization.

In both cases you have to be sure that the column contents is strictly Latinn or strictly UTF8, not a mix.  In the straightforward process, any UTF8 in the column is interpreted as Latinn and converted to UTF8, leading to garbled text (virtually impossible to correct).  And using the indirect process, any Latinn in a text column is just copied and interpreted as (incorrect) UTF8 code (this is correctable).  Preferably you check the text columns in your database for use of Latinn and/or UTF8 before you do any conversion (see next section).

Analysing/checking the database contents should give you an inventory of the issues, and helps you determining the best route to full conversion.

Checking database content

Checking text to see whether it is Latinn or UTF8 involves the character code values >127 (otherwise it is ASCII, which Latinn and UTF8 have in common).

The nice thing about UTF8 is that an encoding follows strict rules.  Specifically:

So if a series of bytes adhere to the UTF8 encoding rules, you can assume it is an UTF8 character;  otherwise it is a Latinn character.
This can be implemented in a PHP function to be used on all the text columns (in all tables in your database).  This check provides an inventory of the issues (or the certainty that there are no issues), and helps determining the optimal solution in your case.
You also have to check whether there are any NCR-codes in the database !  (No reference yet.  Otherwise, contact me).

Actual conversion routes:

  1. If you have no mix of Latinn and UTF8 codes in a column, that column can be converted via
    1. (all in Latinn) the straightforward procedure, or
    2. (all in UTF8) the indirect procedure.
    Any NCR-codes should be converted in a post processing step.
  2. If you have a mix, you have a choice:
    1. convert the Latinn codes to UTF8, and use the indirect procedure.
    2. convert the UTF8 codes back to Latinn, and use the straightforward procedure.
      This might not be possible if special UTF8 codes are used for which there is no equivalent Latinn character (Unicode values > 255).
    3. convert the UTF8 codes to NCR codes, use the straightforward procedure to convert all to UTF8 (which leaves the NCR codes unchanged), and do post processing to convert NCR codes to UTF8 codes.
    4. a mix of [b] & [c]: convert UTF8 codes which have a Latinn counterpart to Latinn characters and the rest to NCR codes, convert all straightforward to UTF8, and do post processing to convert NCR codes to UTF8 codes.
      This option is a bit unpractical as it involves a more complex step.

The conversion of UTF8 codes to NCR and back is rather simple.

Your best choice also depends whether you also have NCR codes in the current database (which you ultimately don’t want as it impedes proper searching and sorting):  if so option [2.c] is a good choice.


Conversion

After analysing potential issues with mixed use of Latinn and UTF8 codes in text columns, you can determine the best conversion route (might be different for distinct columns).

Preparation:

  1. First make a list (per table) for each text column to show how it will be converted.
    Then check each text column whether it has any mixed use of Latinn and UTF8 codes (and potentially NCR codes).
    Determine how to convert those column (i.e. to Latinn, or UTF8 or NCR;  options [2.a], [2.b] or [2.c] respectively).
  2. Where applicable, correct any mixed use of Latinn and UTF8 codes in text columns (i.e. either convert to Latinn / UTF8 / NCR ).

Conversion:

  1. Set the default character set (or the collation sequence when the default collation sequence is not preferred) for the database pertaining to new tables and columns (to avoid later problems with new tables and/or columns):
    ALTER DATABASE «db_name»
       CHARACTER SET utf8 | COLLATE utf8 utf8_general_ci

    and for all tables:
    ALTER TABLE «table_name»
       CHARACTER SET utf8 | COLLATE utf8 utf8_general_ci
  2. For each table, do the conversion for all text columns according to your list (from [1]).
  3. Set Client, Results and Connection charsets to UTF8.
    If you can’t change these MySQL parameter settings, add the “SET NAMES 'charset_name' [COLLATE 'collation_name']” or “mysqli_set_charset( «$dblink», 'utf8' )” or its equivalent to the database open.
  4. Do any post processing (like conversion of NCR to UTF8) if required.

Ready; your database is now in UTF8.

Now you have to convert your web pages –or the PHP scripts that generates them– to make all pages UTF-8.  I.e. insert a

<META http-equiv="Content-Type" content="text/html; charset=utf-8">
in the header section of each page.  Starting with the pages that have input-forms to make sure it is UTF-8 and not something else that goes in your database.

If it is inconvenient to change all the web pages (at nearly the same time), at least add an “accept‑charset='UTF-8'” attribute to all forms.  Alternatively, you have to convert all input fields to UTF-8 in PHP before updating the database.

=O=


MySQL database access

For many purposes it is nice to have immediate access to the MySQL database, i.e. have access to the SQL prompt where you can directly enter any SQL command.  On most servers you don’t have that.  Sometimes the PHPmyAdmin tool may provide some of the required capabilities, but for many cases that is not very convenient.

On the webserver where your (PHP-)scripts run you can enter commands indirectly, for example queries through a PHP script.  So you can make a script that asks for a command to be typed in, send that to the SQL-server, and displays the response.  It is not complex;  the point is to do it in a convenient way.

You need 2 pages:  the first to accept the command, and issue it to the second page.  The second page should execute the command and display the result.  And the most convenient way to keep these 2 pages together is through an HTML frame construct (HTML5 discarded frames, but the browsers still support HTML4 including frames, so it works fine).

Note:  You don't want to give just anybody uncontrolled access to your database !  So you have to place these pages behind some firewall (login or htaccess).

So you get SQL.html:

<HTML>
  <FRAMESET ROWS='25%,*'>
    <FRAME NAME='input' SRC='SQLinput.html'>
    <FRAME NAME='output' SRC='SQLoutput.php'>
  </FRAMESET>
</HTML>

and SQLinput.html:

<HTML><HEAD>
 <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">   <!-- adapt; later UTF-8 -->
 <LINK TYPE="text/css" HREF="style.css" REL="stylesheet" TITLE="Preferred_style">
 <TITLE>Input</TITLE>
</HEAD><BODY>
<H2>SQL utility</H2>
  <FORM ACTION="SQLoutput.php" TARGET="output">
    <TABLE><TR><TD valign=top>SQL&gt;<TD>
      <TEXTAREA COLS=80 ROWS=6 NAME="cmd">SHOW TABLES</TEXTAREA>
    </TABLE>
  <BR>&nbsp;<BR><INPUT TYPE="SUBMIT" VALUE="Execute"></FORM>
</BODY></HTML>

and SQLoutput.php:

<HTML><HEAD>
 <TITLE>Output</TITLE>
 <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">   <!-- adapt; later UTF-8 -->
 <LINK TYPE="text/css" HREF="style.css" REL="stylesheet" TITLE="Preferred_style">
</HEAD><BODY>
<?PHP
$cmd = @$_GET['cmd'];
if (!empty($cmd)) {
 //Database open; adapt for your database
 //(make this an "include('dbopen.php');" )
 if (!mysqli_real_connect( $dbl, string $host, string $username, string $passwd, string $dbname )) {
   die( 'DB Connect Error(' . mysqli_connect_errno() . '): ' . mysqli_connect_error() );
 }
/* Uncomment after conversion to UTF-8
 if (!mysqli_set_charset( $dbl, 'utf8' )) {     //utf8 | latin1 | ...
  echo '<br><span style='color:red;'>Charset failed: </span>' . mysqli_error($dbc);
 };
*/
 echo "<P><I>Command</I> $cmd:<P>";
 $result = mysqli_query( $dbl, $cmd ) or die( mysqli_error( $dbl ) );
 if (gettype($result) == 'boolean') {
  if ($result) echo '<br><b>Successful;</b> '. mysqli_info( $dbl );
  else echo '<br><b style='color:red;'>Unsuccessful</b>';
 } else {
  $cnt = 0;
  echo "\n<TABLE border=1>";
  echo "\n<TR>";
  while ($finfo = mysqli_fetch_field($result)) {  /* Get field information for all fields */
   $currentfield = mysqli_field_tell($result);    /* get fieldpointer offset */
   echo '<TH>' . $finfo->name;
  }
  while ($line = mysqli_fetch_assoc($result)) {
   echo "\n<TR>";  $cnt++;
   foreach ($line as $col_value) {
    if (strlen($col_value)>260)  print "<td><pre>$col_value</pre>";
    $else  print "<td>$col_value";
  } }
  echo "</TABLE>\n";
  echo "<P>$cnt lines.";
 }
 mysqli_close($dbl);
}
?>
</BODY></HTML>

The SQLoutput.php is rather sophisticated as it tries to provide a structured display without knowing in advance what structure the data has (other than most likely some table format).

I have used this method used for years now, and find it very convenient.

=O=