| MySQL database conversion to UTF8 |
|
With the application of HTML5, the use of the UTF-8 character set is now common. However many databases still use the Latin1 character set; the developers and/or website managers may not even be aware that they do. It works, with potentially some occasional malformed letters, but searching and sorting do not provide optimal results. Some may have wondered how these malformed characters came into existent, and are mystified by there cause (as I was).
Others may even have tried to convert the database, but reverted to the previous state because of the dramatic results (all non-ASCII characters malformed). And yet, such a conversion is not very complex, but it is tricky !
This page is about converting a MySQL database from Latin1 to UTF8. Here we assume ISO-8895-1/Latin1 as a starting point, but character sets in the series ISO 8895-n or Latinn work similarly.
Before actually discussing the conversion, you need to know some more about character sets and their properties.
ASCII: The standard character set for nearly all applications –and specifically in our case for HTML and MySQL– is ASCII (see HTML Char - ASCII). ASCII might be sufficient for English, but it is lacking for other languages. Most Western European languages use some accented letters (e.g. â, é, ü, ñ) and so-called ligatures: a combination of two letters to a single character (e.g. Æ, fi, ij).
ASCII extensions: As ASCII is a 7-bit character set (codes 0··127) and most computers use 8-bit bytes, there is room for another 128 character codes. However, 128 codes are not enough to accommodate all other characters, so multiple character sets were ‘standardized’ for specific language regions. This led to the ISO 8895-n/Latinn series.
Latin1/CP1252:
For Western Europe, ISO 8895-1/Latin1 (see HTML Char - Latin1) is popular and we use that in this page as example.
Note that formally ISO 8895-1 defines printable symbols for the range 160··255; the range 128··159 contains control codes.
But these control codes were rarely used, so another character set, Windows Code Page 1252 (CP1252), defined printable symbols for that range as well.
And there are other CP125x code pages for ISO 8895-n or Latinn.
When in HTML or in MySQL they say they use Latin1, they actually use CP1252 !
And what Latin1/CP1252 adds to ASCII are mainly the letters with accents (so called diacriticals).
For example the accented A’s: À, Á, Â, Ã, Ä, Å, à, á, â, ã, ä, å.
Unicode/UTF-8: To solve the issue of foreign characters in all of these character sets, Unicode was defined. Basically it defines a 31-bit character set (i.e. code space for more than 2 billion characters) so all kinds of foreign characters can be accommodated. Obviously, a Unicode character requires multiple bytes. There are several ways to encode a Unicode character; the most important ones are the NCR-method (discussed below) and Unicode Transformation Format 8 (UTF-8, standard for HTML5).
UTF-8 is a scheme that encodes the Unicode code number into multiple bytes.
We don’t go into the details here (for those see HTML Char - UTF8), but check
the following notes.
Notes:
NCR: All browsers accept characters denoted via the Numerical Character Reference (NCR): &«value»; where «value» it either #decimal_code_number, #xhexadecimal_code_number or character_name. The code_numbers and character_names are all from the Unicode standard.
A collation sequence provides the order in which characters are sorted. But it is also used to match characters when searching (in MySQL).
For example, when no collation sequence is applied, sorting single ASCII letters will result in the order A B C ... Z a b ... z. And that is not what you want; you want upper and lower case letters together, like A a B b C c ... Z z.
And when you add the diacritical characters, the desired order becomes: A À Â Ã Ä Å Æ a à á â ã ä å æ B b C Ç c ç etc. This is what a collation sequence takes care of.
Also when searching (in MySQL), you want to match case insensitive: searching for an “a” should not only match “a” but also “A”. When including diacritical characters, it should also match À, Â, Ã, Ä, Å, Æ, à, á, â, ã, ä, å and æ. This is called accent insensitive, and is also handled by the collation sequence in MySQL.
Above sorting and search matching works perfectly when MySQL is set correctly, and the database contains the corresponding characters. It will not work for the diacritical characters when for example the database is set to Latin1 while the contents is in UTF-8. Or when characters are denoted through NCR.
It is very easy to get UTF-8 characters in a database set to an ISO 8895‑n/Latinn character set. If you were using an ISO 8895‑n/Latinn character set in your web pages, e.g.
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">and you switched your pages to UTF-8, i.e.
<META http-equiv="Content-Type" content="text/html; charset=utf-8">everything seems to work Ok: you can enter new values with diacriticals into the database and everything comes out nicely. Until you come across some diacritical characters in older data like �: those characters are malformed. At first you may thing somebody made an error when entering that data, but then all older non-ASCII characters appear malformed. What has happened ?
Inversely, if your data is in UTF-8 but your webpage uses the ISO-8859-1/Latin1 charset, you also get malformed characters. But characteristically you get 2 or 3 incorrect characters for a single UTF-8 character (e.g. ë for ë).
If your data entry page –the web page with all the HTML <input>-statements– has UTF-8 as charset, the entered text will be in UTF-8.
And as ISO 8895-n/Latinn uses all codes in a byte, a (MySQL) database set to any ISO 8895-n/Latinn will store all bytes transparently, even if the actual text is from another character set.
So UTF-8 codes are stored unchanged and retrieved unchanged, and displayed in a page with charset=utf-8.
No problem for true UTF-8 codes, but old texts using Latinn diacriticals are interpreted as UTF-8 codes and prove to be invalid UTF-8, and are therefore displayed malformed (UTF-8 has some rules a (multi-byte) UTF-8 character has to adhere to).
The other way around, when you have UTF-8 (multi-byte) characters in a Latinn page, each byte is interpreted as a Latinn character. And as Latinn has a character for about any non-ASCII byte, you get multiple incorrect characters.
The change of a web page to charset=utf-8 may happen implicitly, for example when you start using HTML5 –i.e. when you add the HTML5 Document Type Definition (DTD: <!DOCTYPE html>)– without explicitly specifying charset=ISO-8859-1, or when you (or your hosting provider for you) upgraded to PHP v5.6 or later. From PHP5.6 on, PHP sets charset = UTF8 in the generated page’s http header (not the HTML head section), and that overrules any <META http-equiv="Content-Type" you set ! See HTML Char - Representation of Glyphs for more details.
Now you have two problems:
You can try to correct the old Latinn diacriticals to the corresponding UTF-8 diacriticals (if you still have access to the old texts, and are willing to spend the effort entering it all again), but you can’t solve the problem with poor search results. So you want to convert everything to UTF-8, or keep everything in Latinn. But in both cases you need to have a uniform character set for the contents of your database (either UTF-8 or Latinn). So much of the same problem (so why not convert to UTF-8).
The conversion of database contents to UTF8 is simple when the existing contents is of a uniform character set (e.g. Latinn). And we have just shown that that is probably not the case (as has happened to me).
As a temporary measure, you can add an accept-charset=ISO-8895‑n attribute to the input forms in your web pages (and translate database output to UTF-8 in the results pages) as that limits the impact of UTF-8 in the database. Or use “header( 'Content-Type: text/html; charset=ISO-8895-1' );” in PHP before any HTML is generated to counteract PHP5.6 default behaviour.
So the conversion will be more complex, and it starts with a check whether the above mentioned issue (having UTF8 codes in a Latinn database) is applicable for you. If so, you first must make the database contents uniform again. And that all is possible via (PHP) scripts.
This section is about character sets in MySQL to provide background for the actual conversion process.
MySQL has an extensive number of parameters regarding character sets and collation sequences, which makes it a bit confusing. Here we first discuss the rather static character set and collation sequence in the database itself, later we discuss the settings controlling the dynamic interaction with a client during a query.
MySQL supports a series of character sets and corresponding collation sequences depending on the parameters in the MySQL configuration/installation. Commonly it includes Latin1 and UTF-8. Use the command SHOW CHARACTER SET in MySQL to show the supported character sets on your SQL-server, and SHOW COLLATION to show the supported collation sequences (e.g. via a page as described in MySQL Access).
Note that MySQL uses the character set name “utf8” (lowercase, no hyphen) for what is formally called UTF-8 (e.g. in HTML and PHP).
Note:
MySQL discerns character set and collation (sequence) settings at various levels in the database; in descending order:
Note that modifying the character set and/or the collation sequence of a database or a table changes the default setting but doesn’t change the contents.
However, changing the character set (potentially via changing the collation sequence) of a column does change the contents !
MySQL will translate the contents to the new character set.
And if the contents was not according to the old character set, it will translated incorrectly (and result in malformed characters).
In Analysis below we will go into more details of issues and solutions.
Note also that it is possible to have a table with columns using distinct character sets and/or collation sequences. Not that you would normally want that, but it may be useful during conversion.
For the interaction between the database and the application –the queries– SQL discerns:
Then there are some system variables which are set by the hosting provider, but usually have acceptable values.
The interaction for a query can be depicted as follows:
Application | MySQL-server | Database | ||
---|---|---|---|---|
Client charset Results charset | ⇒ ⇐ pot. translation | Connection charset | ⇔ ⇔ pot. translation | tab1.col2 charset tab2.col3 charset |
NB: By setting above mentioned charsets appropriately, one can work with Latin1 at the client side (PHP/HTML) while the database contains UTF8. Or the reverse.
In general, one would avoid translations as much as possible (delay/processing power, risk of mistakes or poor translations), but during conversion it can be useful.
Use the SQL commands
These commands should provide the following lists (when ultimately everything is set to UTF8):
Variable_name | Value | |
---|---|---|
character_set_client | utf8 (previously latin1) | see [5] |
character_set_connection | utf8 (previously latin1) | see [6] |
character_set_database | utf8 (previously latin1) | see [2] |
character_set_filesystem | binary | see [9] |
character_set_results | utf8 (previously latin1) | see [7] |
character_set_server | latin1 or utf8 | see [1] |
character_set_system | utf8 | see [8] |
character_sets_dir | /usr/share/mysql/charsets/ | see [10] |
and:
Variable_name | Value |
---|---|
collation_connection | utf8_general_ci (previously latin1_swedish_ci) |
collation_database | utf8_general_ci (previously latin1_swedish_ci) |
collation_server | utf8_general_ci (previously latin1_swedish_ci) |
The collation sequence utf8_general_ci is pretty good; it is case insensitive and accent insensitive.
For Latin1 Latin1_swedish_ci is usually the default; my impression is that (outside Sweden) latin1_german2_ci performs better.
Obviously one would have the Client and the Results character set the same, and equal to the character set of the application to avoid any translation (with accompanying processing and risk of mistakes or poor translations).
This can be achieved through the command:
SET NAMES 'charset_name' [COLLATE 'collation_name']
or by:
SET CHARACTER SET 'charset_name' # e.g. 'utf8'
The distinction between these variants is that the first sets Client, Connection and the Results character sets all to the indicated character set, while the second sets Client and
the Results character sets to the indicated value, and the Connection character set equal to the Database character set & collation (making sure that there are no translations when collecting or setting database values).
But after conversion everything should be on UTF8 (so ultimately no difference).
The PHP-manual has a strong preference for the 2nd variant through:
mysqli_set_charset( $dblink, 'utf8' ); //not utf-8
(example using the mysqli procedural interface).
If the Client, Connection and the Results character sets are not what you need, and the hosting provider can’t/won’t change them permanently, you have to set them each session (i.e. after each database open).
When using constant strings in MySQL (and that happens in about every query in the WHERE-clause), you can specify explicitly the character set to be used by preceding the string with
_«charset», and/or specify the collation sequence by following the
string with COLLATION «col_seq» (e.g. SELECT ... WHERE name=_utf8"äbç" COLLATION utf8_general_ci).
Such a construct skips the translation from client charset to connection
charset for that string.
It can be useful during a transition period, but once the character sets are uniform, it is superfluous.
The conversion is per text column for each table (for each database). Basically, there are 2 methods/routes:
Straightforward: When the column is set to charset Latinn, and all the contents in that column is in the same Latinn, you can use the SQL command
Indirect: When the column is set to charset Latinn, but all the contents in that column is in UTF8, you need to change the column’s charset without translating the contents. This is achieved in a 2-step process:
The intermediate format is binary. Use the following table for corresponding text and binary types:
Text version | Binary version | (size) |
---|---|---|
CHAR(len) | BINARY(len) | (len) |
VARCHAR(len) | VARBINARY(len) | (len+1) |
TINYTEXT | TINYBLOB | (28) |
TEXT | BLOB | (216) |
MEDIUMTEXT | MEDIUMBLOB | (224) |
LONGTEXT | LONGBLOB | (232) |
For the conversion you use similar commands as in the straightforward conversion, and it can be done on an active database, but as it is a 2-step process, the conversion is a bit slower.
Preferably you do it at idle time.
And of course you may have to use a mixed approach: straightforward conversion for the columns only containing Latinn, and the indirect process for columns only containing UTF8.
For more details consult the MySQL manual on Globalization.
In both cases you have to be sure that the column contents is strictly Latinn or strictly UTF8, not a mix. In the straightforward process, any UTF8 in the column is interpreted as Latinn and converted to UTF8, leading to garbled text (virtually impossible to correct). And using the indirect process, any Latinn in a text column is just copied and interpreted as (incorrect) UTF8 code (this is correctable). Preferably you check the text columns in your database for use of Latinn and/or UTF8 before you do any conversion (see next section).
Analysing/checking the database contents should give you an inventory of the issues, and helps you determining the best route to full conversion.
Checking text to see whether it is Latinn or UTF8 involves the character code values >127 (otherwise it is ASCII, which Latinn and UTF8 have in common).
The nice thing about UTF8 is that an encoding follows strict rules. Specifically:
So if a series of bytes adhere to the UTF8 encoding rules, you can assume it is an UTF8 character; otherwise it is a Latinn character.
This can be implemented in a PHP function to be used on all the text columns (in all tables in your database).
This check provides an inventory of the issues (or the certainty that there are no issues), and helps determining the optimal solution in your case.
You also have to check whether there are any NCR-codes in the database !
(No reference yet. Otherwise, contact me).
Actual conversion routes:
The conversion of UTF8 codes to NCR and back is rather simple.
Your best choice also depends whether you also have NCR codes in the current database (which you ultimately don’t want as it impedes proper searching and sorting): if so option [2.c] is a good choice.
After analysing potential issues with mixed use of Latinn and UTF8 codes in text columns, you can determine the best conversion route (might be different for distinct columns).
Preparation:
Conversion:
Ready; your database is now in UTF8, and everything works fine. Your webpages are still in Latinn but the interface with SQL (client & results) will translate to and from UTF-8.
To convert your website to UTF-8 you have to convert your webpages.
You can set all the SQL-interfaces to UTF-8 (if you are allowed to change these MySQL parameter settings), but that is rather inconvenient as you have to addapt all the pages having database access simultaneously to charset=UTF-8.
A potential alternative solution –but not very elegant– is to have PHP convert all input fields to UTF-8 before updating the database or add an “accept‑charset='UTF-8'” attribute to all input forms, and have PHP convert all database results from UTF-8.
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
Now your site is in UTF-8.
=O=
For many purposes it is nice to have immediate access to the MySQL database, i.e. have access to the SQL prompt where you can directly enter any SQL command. On most servers you don’t have that. Sometimes the PHPmyAdmin tool may provide some of the required capabilities, but in many cases that is not very convenient.
On the webserver where your (PHP-)scripts run you can enter commands indirectly, for example queries through a PHP script. So you can make a script that asks for a command to be typed in, send that to the SQL-server, and displays the response. It is not complex; the point is to do it in a convenient way.
You need 2 pages: the first to accept the command, and issue it to the second page. The second page should execute the command and display the result. And the most convenient way to keep these 2 pages together is through an HTML frame construct (HTML5 discarded frames, but the browsers still support HTML4 including frames, so it works fine).
Note: You don't want to give just anybody uncontrolled access to your database !
So you have to place these pages behind some firewall (login or htaccess).
And you have to be carefull what command you issue (a "DROP TABLE ABC" will delete your tabel ABC without any questions asked).
So you get SQL.html:
<HTML> <FRAMESET ROWS='25%,*'> <FRAME NAME='input' SRC='SQLinput.html'> <FRAME NAME='output' SRC='SQLoutput.php'> </FRAMESET> </HTML>
and SQLinput.html:
<HTML><HEAD> <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> <!-- adapt; later UTF-8 --> <LINK TYPE="text/css" HREF="style.css" REL="stylesheet" TITLE="Preferred_style"> <TITLE>Input</TITLE> </HEAD><BODY> <H2>SQL utility</H2> <FORM ACTION="SQLoutput.php" TARGET="output"> <TABLE><TR><TD valign=top>SQL><TD> <TEXTAREA COLS=80 ROWS=6 NAME="cmd">SHOW TABLES</TEXTAREA> </TABLE> <BR> <BR><INPUT TYPE="SUBMIT" VALUE="Execute"></FORM> </BODY></HTML>
and SQLoutput.php:
<HTML><HEAD> <TITLE>Output</TITLE> <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> <!-- adapt; later UTF-8 --> <LINK TYPE="text/css" HREF="style.css" REL="stylesheet" TITLE="Preferred_style"> </HEAD><BODY> <?PHP $cmd = @$_GET['cmd']; if (!empty($cmd)) { //Database open; adapt for your database //(make this an "include('dbopen.php');" ) if (!mysqli_real_connect( $dbl, string $host, string $username, string $passwd, string $dbname )) { die( 'DB Connect Error(' . mysqli_connect_errno() . '): ' . mysqli_connect_error() ); } /* Uncomment after conversion to UTF-8 if (!mysqli_set_charset( $dbl, 'utf8' )) { //utf8 | latin1 | ... echo '<br><span style='color:red;'>Charset failed: </span>' . mysqli_error($dbc); }; */ echo "<P><I>Command</I> $cmd:<P>"; $result = mysqli_query( $dbl, $cmd ) or die( mysqli_error( $dbl ) ); if (gettype($result) == 'boolean') { if ($result) echo '<br><b>Successful;</b> '. mysqli_info( $dbl ); else echo '<br><b style='color:red;'>Unsuccessful</b>'; } else { $cnt = 0; echo "\n<TABLE border=1>"; echo "\n<TR>"; while ($finfo = mysqli_fetch_field($result)) { /* Get field information for all fields */ $currentfield = mysqli_field_tell($result); /* get fieldpointer offset */ echo '<TH>' . $finfo->name; } while ($line = mysqli_fetch_assoc($result)) { echo "\n<TR>"; $cnt++; foreach ($line as $col_value) { if (strlen($col_value)>260) print "<td><pre>$col_value</pre>"; $else print "<td>$col_value"; } } echo "</TABLE>\n"; echo "<P>$cnt lines."; } mysqli_close($dbl); } ?> </BODY></HTML>
The SQLoutput.php is rather sophisticated as it tries to provide a structured display without knowing in advance what structure the data has (other than most likely some table format).
I have used this method for years now, and find it very convenient.
=O=