| MySQL database conversion to UTF8 |
|
With the application of HTML5, the use of the UTF-8 character set is now common. However many databases still use the Latin1 character set; the developers and/or website managers may not even be aware that they do. It works, however with potentially some occasional malformed letters, and searching and sorting that do not provide optimal results. Some may have wondered how these malformed characters came into existence, and are mystified by the cause (as I was).
Others may even have tried to convert the database, but reverted to the previous state because of the dramatic results (all non-ASCII characters malformed). And yet, such a conversion is not very complex, but it is tricky: it requires several careful steps !
This page is about converting a MySQL database from Latin1 to UTF8 (the formal denotation is UTF-8, but that hyphen is often left out). Here we assume ISO-8859-1/Latin1 as a starting point, but character sets in the series ISO‑8859-n or Latinn all are single byte charsets and work similarly.
Before actually discussing the conversion, you need to know some more about character sets and their properties.
ASCII: The standard character set for nearly all applications –and specifically in our case for HTML and MySQL– is ASCII (see HTML Char - ASCII). ASCII might be sufficient for English, but it is lacking symbols for other languages. Most Western European languages use some accented letters (e.g. â, é, ü, ñ) and so-called ligatures: a combination of two letters to a single character (e.g. Æ, fi, ij).
ASCII extensions: As ASCII is a 7-bit character set (codes 0··127) and most computers use 8-bit bytes, there is room for another 128 character codes. However, 128 codes are not enough to accommodate all other characters, so multiple character sets were ‘standardized’ for specific language regions. This led to the ISO‑8859-n/Latinn series.
Latin1/CP1252:
For Western Europe, ISO‑8859-1/Latin1 (see HTML Char - Latin1) is popular and we use that in this page as example.
Note that formally ISO‑8859-1 defines printable symbols for the range 160··255; the range 128··159 contains control codes.
But these control codes were rarely used, so another character set –the Windows Code Page 1252 (CP1252)– defined printable symbols for that range as well.
And there are other CP125x code pages for ISO‑8859-n or Latinn.
When in HTML or in MySQL they say they use Latin1, they actually use CP1252 !
And what Latin1/CP1252 adds to ASCII are mainly the letters with accents (so called diacriticals).
For example the accented A’s: À, Á, Â, Ã, Ä, Å, à, á, â, ã, ä, å.
Note: that ISO8859-1 and Latin1 are identical, but not identical to CP1252; CP1252 is an extension to ISO8859-1/Latin1.
HTML and SQL will treat ISO8859-1/Latin1 as CP1252, but PHP does not !
Unicode/UTF-8:
To solve the issue of foreign characters in all of these character sets, Unicode was defined.
Basically it defines a 31-bit character set (i.e. code space for more than 2 billion characters) so all kinds of foreign characters can be accommodated.
Obviously, a Unicode character requires multiple bytes.
If you would use Unicode straightforward for all your texts, your files would be 4 times larger even if you would only have ASCII characters in it, and that is not practical.
So it comes down to using ASCII with a clever way to 'escape' the (non-ASCII) Unicode characters.
There are several ways to encode a Unicode character; the most important ones are the Unicode Transformation Format 8 (UTF-8, standard for HTML5) and the NCR-method (discussed below).
UTF-8 is a scheme that encodes the Unicode code number into multiple bytes. The characteristic of UTF-8 –you might call it the trick of UTF-8– is that it uses a single byte for 'normal' ASCII characters, and the code space 128-255 to encode the 'special characters' in multiple bytes. We don’t go into the details here (for that see HTML Char - UTF8), but check the following notes:
NCR:
All browsers accept characters denoted via the Numerical Character Reference (NCR):
&«value»; where «value» is either #decimal_code_number, #xhexadecimal_code_number or character_name.
The code_numbers and character_names are all from the Unicode standard.
NCR is an old method for HTML as it is the only proper way to denote characters also used in the HTML syntax without any problems: single quote or apostrophe ' ('), double quote " ("), ampersand & (&), less than < (<) and greater than > (>) symbols.
A collation sequence provides the order in which characters are sorted. But it is also used to match characters when searching (in MySQL).
For example, when no collation sequence is applied, sorting ASCII text will result in the order A B C ... Z a b ... z.
And that is not what you want; you want upper and lower case letters together, like A a B b C c ... Z z.
And when you add the diacritical characters, the desired order should become: A À Â Ã Ä Å Æ a à á â ã ä å æ B b C Ç c ç etc.
That is what a collation sequence takes care of.
Also when searching (in MySQL), you want to match case insensitive: searching for an “a” should not only match “a” but also “A”. When including diacritical characters, it should also match À, Â, Ã, Ä, Å, Æ, à, á, â, ã, ä, å and æ. This is called accent insensitive, and is also handled by the collation sequence in MySQL.
Above sorting and search matching works perfectly when MySQL is set correctly, and the database contains the corresponding characters. It will not work for the diacritical characters when for example the database is set to Latin1 while the contents is in UTF-8, or the other way around. Or when characters are denoted through NCR (NCR is an HTML-thing, for SQL it is just a funny character sequence).
It is very easy to get UTF-8 characters in a database set to an ISO‑8859‑n/Latinn character set. If you were using an ISO‑8859‑n/Latinn character set in your web pages, e.g.
<META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">and you switched your pages to UTF-8, i.e.
<META http-equiv="Content-Type" content="text/html; charset=utf-8">everything seems to work Ok: you can enter new values with diacriticals into the database and everything comes out nicely. Until you come across some diacritical characters in older data appearing like �: those characters are malformed. At first you may thing somebody made an error when entering that data, but then all older non-ASCII characters appear malformed. What has happened ?
Inversely, if your data is in UTF‑8 but your webpage uses the ISO-8859-1/Latin1 charset, you also get malformed characters. But characteristically you get 2 or 3 malformed characters for a single UTF-8 character (e.g. ë for ë).
If your data entry page –the web page with all the HTML <input>-statements– has UTF‑8 as charset, the entered text will be in UTF‑8.
And as ISO‑8859‑n/Latinn uses all codes in a byte, a (MySQL) database set to any ISO‑8859‑n/Latinn will store all bytes unchanged ('transparently'), even if the actual text is from another character set.
So UTF-8 codes are stored unchanged and retrieved unchanged, and displayed in a page with charset=utf-8.
No problem for true UTF-8 codes, but old texts using Latinn diacriticals are now interpreted as UTF-8 codes and prove to be invalid UTF-8, and are therefore displayed malformed (UTF-8 has some rules a multi-byte UTF-8 character has to adhere to).
The other way around, when you have UTF-8 (multi-byte) characters in a Latinn page, each byte is interpreted as a Latinn character. And as Latinn has a character for about any non-ASCII byte, you get multiple malformed characters.
The change of a web page to charset=utf-8 may happen implicitly, for example when you start using HTML5 –i.e. when you add the HTML5 Document Type Definition (DTD: <!DOCTYPE html>)– without explicitly specifying charset=ISO-8859-1, or when you (or your hosting provider for you) upgraded to PHP v5.6 or later. From PHP5.6 on, PHP sets charset=UTF8 in the generated page’s http header (not the HTML head section), and that overrules any <META http-equiv="Content-Type" you set ! See HTML Char - Representation of Glyphs for more details.
In both cases there is a mismatch between the charset the database is set to, and the charset your HTML page uses. And now you have two problems:
You can try to correct the old Latinn diacriticals to the corresponding UTF-8 diacriticals (if you still have access to the old texts, and are willing to spend the effort entering it all again), but you can’t solve the problem with poor search results. So you want to convert everything to UTF-8, or keep everything in Latinn. But in both cases you need to have a uniform character set for the contents of your database (either UTF-8 or Latinn). Basically the same problem, so why not convert to UTF-8.
The conversion of database contents to UTF8 is simple when the existing contents is of a uniform character set (e.g. Latinn). But regrettably, as we have just shown that is probably not the case (as has happened to me).
As a temporary measure, you can add an accept-charset=ISO-8859‑N attribute to the input forms in your web pages (and translate database output to UTF-8 in the results pages) as that limits the impact of UTF-8 in the database. Or use “header( 'Content-Type: text/html; charset=ISO-8859-1' );” in PHP before any output is generated to counteract PHP5.6 default behaviour.
So the conversion will be more complex, and it starts with a check whether the above mentioned issue (having UTF8 codes in a Latinn database, or the other way around) is applicable for you. If so, you first must make the database contents uniform again. And that all is possible via (PHP) scripts.
This section is about character sets in MySQL to provide background for the actual conversion process.
MySQL has an extensive number of parameters regarding character sets and collation sequences, which makes it a bit confusing. Here we first discuss the rather static character set and collation sequence in the database itself, later we discuss the settings controlling the dynamic interaction with a client during a query.
MySQL supports a series of character sets and corresponding collation sequences depending on the parameters in the MySQL configuration/installation. Commonly it includes Latin1 and UTF-8. Use the command SHOW CHARACTER SET in MySQL to show the supported character sets on your SQL-server, and SHOW COLLATION to show the supported collation sequences (e.g. via a page as described in MySQL Access).
Note that MySQL uses the character set name “utf8” (lowercase, no hyphen) for what is formally called UTF-8 (e.g. in HTML and PHP).
And that in actual operations –including storage– SQL uses MB3 (multi-byte 3) internally for utf8 to avoid conversions depending on actual byte values.
For input and output it will use utf8 (convert it to and from the internal mb3 respectively).
Note:
MySQL discerns character set and collation (sequence) settings at various levels in the database; in descending order:
Note that modifying the character set and/or the collation sequence of a database or a table changes the default setting but doesn’t change the contents.
However, changing the character set –potentially via changing the collation sequence– of a column through ALTER TABLE «table_name» … MODIFY «col_name» … does change the contents:
MySQL will translate the contents to the new character set.
But if the contents was not according to the old character set, it will translate incorrectly (and result in malformed characters).
In Analysis below we will go into more details of issues and solutions.
Note also that it is possible to have a table with columns using distinct character sets and/or collation sequences. Not that you would normally want that, but it may be useful during conversion.
For the interaction between the database and the application –the queries– SQL discerns:
Normally you want [5] Client and [7] Results charsets to be the same.
Then there are some system variables which are set by the hosting provider, but usually have acceptable values.
The interaction for a query can be depicted as follows:
Application | MySQL-server | Database | ||||
---|---|---|---|---|---|---|
Client charset Results charset | ⇒ ⇐ pot. translation | Connection charset | ⇔ ⇔ pot. translation | tab1.col2 charset tab2.col3 charset | ||
By setting above mentioned charsets appropriately, one can work with UTF8
at the client side (PHP/HTML) while the database contains Latin1. Or the reverse.
In general, one would avoid translations as much as possible (delay/processing power, risk of mistakes or poor translations), but during conversion it is useful: the database can still be used while you change the charset table by table or even column by column.
Use the SQL commands
These commands should provide the following lists (when ultimately everything is set to UTF8):
Variable_name | Value | |
---|---|---|
character_set_client | utf8 (previously latin1) | see [5] |
character_set_connection | utf8 (previously latin1) | see [6] |
character_set_database | utf8 (previously latin1) | see [2] |
character_set_filesystem | binary | see [9] |
character_set_results | utf8 (previously latin1) | see [7] |
character_set_server | latin1 or utf8 | see [1] |
character_set_system | utf8 | see [8] |
character_sets_dir | /usr/share/mysql/charsets/ | see [10] |
and:
Variable_name | Value |
---|---|
collation_connection | utf8_general_ci (previously latin1_swedish_ci) |
collation_database | utf8_general_ci (previously latin1_swedish_ci) |
collation_server | utf8_general_ci (previously latin1_swedish_ci) |
The collation sequence utf8_general_ci is pretty good; it is case insensitive and accent insensitive.
For Latin1 Latin1_swedish_ci is usually the default; my impression is that (outside Sweden) latin1_german2_ci performs better.
Obviously one would have the Client and the Results character set the same, and equal to the character set of the application to avoid any translation (with accompanying processing and risk of mistakes or poor translations).
This can be achieved through the command:
SET NAMES 'charset_name' [COLLATE 'collation_name']
or by:
SET CHARACTER SET 'charset_name' # e.g. 'utf8'
The distinction between these variants is that the first sets Client, Connection and the Results character sets all to the indicated character set, while the second sets Client and
the Results character sets to the indicated value, and the Connection character set equal to the Database character set & collation (making sure that there are no translations when collecting or setting database values).
But after conversion everything should be on UTF8 (so ultimately no difference).
The PHP-manual has a strong preference for the 2nd variant through:
mysqli_set_charset( $dblink, 'utf8' ); //not utf-8
(example using the mysqli procedural interface).
If the Client, Connection and the Results character sets are not what you need, and the hosting provider can’t/won’t change them permanently, you have to set them each session (i.e. after each database open/connect).
NOTE: Do not change the character_set_client or the character_set_results until you have finished the database conversion; leave these charsets to the 'transparent' ISO-8859-N/ LatinN/ CP125N for conversion.
When using constant strings in MySQL (and that happens in about every query in the WHERE-clause), you can specify explicitly the character set to be used by preceding the string with
_«charset», and/or specify the collation sequence by following the
string with COLLATION «col_seq» (e.g. SELECT ... WHERE name=_utf8"äbç" COLLATION utf8_general_ci).
Such a construct skips the translation from client charset to connection
charset for that string.
It can be useful during a transition period, but once the character sets are uniform, it is superfluous.
The conversion is per text column for each table (for each database). Basically, there are 2 methods/routes:
Direct/Straightforward: When the column is set to charset Latinn, and all the contents in that column is in the same Latinn, you can use the SQL command
Indirect: When the column is set to charset Latinn, but all the contents in that column is in UTF8, you need to change the column’s charset without translating the contents. This is achieved in a 2-step process:
The intermediate format is binary. Use the following table for corresponding text and binary types:
Text version | Binary version | (size) |
---|---|---|
CHAR(len) | BINARY(len) | (len) |
VARCHAR(len) | VARBINARY(len) | (len+1) |
TINYTEXT | TINYBLOB | (28) |
TEXT | BLOB | (216) |
MEDIUMTEXT | MEDIUMBLOB | (224) |
LONGTEXT | LONGBLOB | (232) |
For the conversion you use similar commands as in the straightforward conversion, and it can be done on an active database, but as it is a 2-step process, the conversion is a bit slower.
Preferably you do it at idle time.
And of course you may have to use a mixed approach: straightforward conversion for the columns only containing Latinn, and the indirect process for columns only containing UTF8.
For more details consult the MySQL manual on Globalization.
In both cases you have to be sure that the column contents is strictly Latinn
or strictly UTF8, not a mix.
In the straightforward process, any UTF8-byte (not character) in the column is interpreted as a Latinn character and converted to UTF8, leading to garbled text
(virtually impossible to correct).
And using the indirect process, any Latinn in a text column is just
copied and interpreted as an (incorrect) UTF8 code (this is correctable but preferably avoided).
Preferably you check the text columns in your database for use of Latinn and/or UTF8 codes before you do any conversion (see next sections).
Analysing/checking the database contents should give you an inventory of the issues, and helps you determining the best route to a total conversion.
When you have a mix of Latinn and UTF-8 codes in a column in your database, which is likely, you have a problem. You may think that you can solve all problems by converting the non-ASCII codes to NCR-codes, do the database conversion either by the direct or the indirect route, and than convert the NCR-codes to UTF-8. But than you are forgetting that the PHP function htmlentities( $str, ENT_SUBSTITUTE | ENT_HTML5, '«charset»', false) also requires a consistent charset, and you don't have that (you are only shifting the problem to the htmlentities() translation).
When you suspect that you have mixed content (and regrettably that is likely given the phenomena with malformed characters), you have to scan all text in the database for non-ASCII codes, and determine for each code whether it is a Latinn-code or an UTF8-code, and than do the appropriate conversion.
But you can combine that conversion while checking the database content for problems with mixed charsets.
Checking text to see whether it is Latinn or UTF8 involves the character code values >127 (otherwise it is ASCII, which Latinn and UTF8 have in common).
The nice thing about UTF8 is that an encoding follows strict rules. Specifically:
So if a serie of bytes adheres to the UTF8 encoding rules, you can assume it is an UTF8 character; otherwise it is a Latinn character.
This can be implemented in a PHP function to be used on all the text columns (in all tables in your database); see Routine.
This check provides an inventory of the issues (or the certainty that there are no issues), and does the conversion of Latinn characters to UTF8 characters.
Note that conversion of an UTF8 character to the correcponding Latinn character is in general not a viable route as their is a multitude of UTF8 characters not representable in Latinn.
You may also have to check whether there are any NCR-codes in the database ! You may scan the text for the occcurence of & and a semicolon within 10 bytes after that, but you can also simply use the PHP-function html_entity_decode( $string, ENT_QUOTES|ENT_HTML5, 'UTF-8' ) to convert any potential NCR-code to their UTF-8 character equivalent.
Note that NCR-codes (and UTF8-codes) are longer than Latinn-codes, so your SQL column size may be insufficient if the charset of the column is still in Latinn.
The extra preparatory step also converts any NCR-codes in the current database (which you ultimately don’t want in your database as it impedes proper searching and sorting; each converted NCR-code will also reduce the actual text size).
But you can also do a separate post processing step to get rid of any NCR-codes.
After analysing potential issues with mixed use of Latinn and UTF8 codes (and NCR-codes) in text columns, you can determine the actual conversion route (might be different for distinct columns).
Preparation:
Conversion:
Ready; your database is now in UTF8, and everything works fine. The texts in your webpages may still be in Latinn but the interface with SQL (client & results) will translate to and from UTF-8.
To convert your website to UTF-8 you also have to convert your webpages.
You can set all the SQL-interfaces to UTF-8 (if you are allowed to change these MySQL parameter settings), but that is rather inconvenient as you have to adapt all the pages having database access simultaneously to charset=UTF‑8.
A potential alternative solution –but not very elegant– is to have PHP convert all input fields to UTF‑8 before updating the database or add an “accept‑charset='UTF‑8'” attribute to all input forms, and have PHP convert all database results from UTF‑8.
<META http-equiv="Content-Type" content="text/html; charset=UTF‑8">
Now your site is in UTF-8.
=O=
The following routine in PHP converts any Latin-character to UTF‑8 in a string while skipping over valid UTF‑8 codes (or alternatively converts UTF‑8 codes to their NCR equivalent).
function Latin2utf( $mixed ) { $magic = 128; //start non-Ascii $i = 0; while ($i < strlen($mixed)) { $kar = substr( $mixed, $i, 1 ); $num = ord( $kar ); if ($num >= $magic) { $mask = $magic; $len = $utf = 0; while ($mask & $num) { $len++; $mask = $mask/2; }; //Count leading 1's //$len has the claimed length of the UTF8 sequence $utf = ($mask-1) & $num; //Start calculating the Unicode value //Check whether the following len-1 bytes start with 0xb10.. $j = 1; while ($j < $len) { $kar = substr( $mixed, $i+$j, 1 ); if ((0b11000000 & ord( $kar )) != 0b10000000) break; $utf = ($utf * 64) + (63 & ord($kar)); $j++; }; //echo "\n<br>Pos=$i, len=$len,j=$j"; if (($j==$len) and ($len>=2) and ($len<=4)) { //Assume UTF (theoretical $len<=6) //$utf contains the Unicode code point value //echo "; Valid UTF8; pos=>".($i+1); //Alternatively, insert NCR equivalent $utf; //$mixed = substr($mixed,0,$i) ."$utf;" .substr($mixed,$i+$len); $i += $len-1; //Skip over UTF-codes } else { //else $utf incorrect: assume Latin1 $kar = substr( $mixed, $i, 1 ); $utf = mb_convert_encoding( $kar , 'UTF-8', 'Windows-1252' ); //You can check conversion (e.g. ($utf=='?') or ($utf=='') ) //and generate an error message if unsuccessful; else //echo "; Latin1; converted to $utf; pos=>".($i+1); } }; //fi $num < $magic $i++; }; //while i return( $mixed ); };
Note: that utf8_encode() works on ISO-8859-1/Latin1, not on CP1252.
You may have to addapt mb_convert_encoding( $kar , 'UTF-8', 'Windows-1252' ) for your specific code page.
If you only want to check an SQL table, use the above routine on all text columns (on all records) in that table, and compare the output of the routine (potentially converted) with the input: any difference indicates Latin characters in the text.
Simarlarly, you can use html_entity_decode( $string, ENT_QUOTES | ENT_HTML5, 'UTF-8' ) to check or get rid of any NCR sequence in the data.
For many purposes it is nice to have immediate access to the MySQL database, i.e. have access to the SQL prompt where you can directly enter any SQL command. On most servers you don’t have that. Sometimes the PHPmyAdmin tool may provide some of the required capabilities, but in many cases that is not very convenient.
On the webserver where your (PHP-)scripts run you can enter commands indirectly, for example queries through a PHP script. So you can make a script that asks for a command to be typed in, send that to the SQL-server, and displays the response. It is not complex; the point is to do it in a convenient way.
You need 2 pages: the first to accept the command, and issue it to the second page. The second page should execute the command and display the result. And the most convenient way to keep these 2 pages together is through an HTML frame construct (HTML5 deprecated frames, but the browsers still support HTML4 including frames, so it works fine).
Note: You don't want to give just anybody uncontrolled access to your database !
So you have to place these pages behind some firewall (login or htaccess).
And you have to be carefull what command you issue (a "DROP TABLE ABC" will delete your tabel ABC without any questions asked).
So you get SQL.html:
<HTML> <FRAMESET ROWS='25%,*'> <FRAME NAME='input' SRC='SQLinput.html'> <FRAME NAME='output' SRC='SQLoutput.php'> </FRAMESET> </HTML>
and SQLinput.html:
<HTML><HEAD> <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> <!-- adapt; later UTF-8 --> <LINK TYPE="text/css" HREF="style.css" REL="stylesheet" TITLE="Preferred_style"> <TITLE>Input</TITLE> </HEAD><BODY> <H2>SQL utility</H2> <FORM ACTION="SQLoutput.php" TARGET="output"> <TABLE><TR><TD valign=top>SQL><TD> <TEXTAREA COLS=80 ROWS=6 NAME="cmd">SHOW TABLES</TEXTAREA> </TABLE> <BR> <BR><INPUT TYPE="SUBMIT" VALUE="Execute"></FORM> </BODY></HTML>
and SQLoutput.php:
<HTML><HEAD> <TITLE>Output</TITLE> <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> <!-- adapt; later UTF-8 --> <LINK TYPE="text/css" HREF="style.css" REL="stylesheet" TITLE="Preferred_style"> </HEAD><BODY> <?PHP $cmd = @$_GET['cmd']; if (!empty($cmd)) { //Database open; adapt for your database //(make this an "include('dbopen.php');" ) if (!mysqli_real_connect( $dbl, string $host, string $username, string $passwd, string $dbname )) { die( 'DB Connect Error(' . mysqli_connect_errno() . '): ' . mysqli_connect_error() ); } /* Uncomment after conversion to UTF-8 if (!mysqli_set_charset( $dbl, 'utf8' )) { //utf8 | latin1 | ... echo '<br><span style='color:red;'>Charset failed: </span>' . mysqli_error($dbc); }; */ echo "<P><I>Command</I> $cmd:<P>"; $result = mysqli_query( $dbl, $cmd ) or die( mysqli_error( $dbl ) ); if (gettype($result) == 'boolean') { if ($result) echo '<br><b>Successful;</b> '. mysqli_info( $dbl ); else echo '<br><b style='color:red;'>Unsuccessful</b>'; } else { $cnt = 0; echo "\n<TABLE border=1>"; echo "\n<TR>"; while ($finfo = mysqli_fetch_field($result)) { /* Get field information for all fields */ $currentfield = mysqli_field_tell($result); /* get fieldpointer offset */ echo '<TH>' . $finfo->name; } while ($line = mysqli_fetch_assoc($result)) { echo "\n<TR>"; $cnt++; foreach ($line as $col_value) { if (strlen($col_value)>260) print "<td><pre>$col_value</pre>"; $else print "<td>$col_value"; } } echo "</TABLE>\n"; echo "<P>$cnt lines."; } mysqli_close($dbl); } ?> </BODY></HTML>
The SQLoutput.php is rather sophisticated as it tries to provide a structured display without knowing in advance what structure the data has (other than most likely some table format).
I have used this method for years now, and find it very convenient.
=O=