Friday, July 11, 2014

The Ultimate Guide To UTF-8 and MySQL

The Ultimate Guide To UTF-8 and MySQL

How character encodings work in MySQL seem to continue baffle people, at least based on the number of questions posted on Stackoverflow. Read on if you have an application that still outputs funny text like “señor” or “se�or” where “señor” is expected.

The truth about text in MySQL

Strings in MySQL consist of an encoding marker and a list of bytes. The CHARSET function reads the encoding while HEX can be used to read bytes:
SELECT CHARSET("€"), HEX("€")
=> latin1, 80
This enables MySQL to use different encodings while working on a query. You can write a query that compares text columns in different encodings and returns the results where the text is the same, rather than the bytes that are stored.

I have everything set to utf8 and still the database gives me garbage!

That’s because of the MySQL client encoding. One query can return text in several different encodings, it’s unreasonable to expect the user software to check the encoding of each individual string. So all text returned by the MySQL client is encoded in only one encoding.
The fast and easy solution to encoding problems: set the MySQL client encoding.
How this is done depends on what you use to connect. If you work in PHP and the old MySQL functions you should use mysql_set_charset("utf8"). If you have an old version try mysql_query("SET NAMES utf8"). If you use MySQLi it would be mysqli_set_charset($link, "utf8") (or $link->set_charset("utf8") if you use the OO API).

What are the encodings in the tables good for, then?

When you create a table you can specify a character encoding for a column. It means how the MySQL server will convert text like “abc123″ to bytes that can be written to disk. The encoding affects two things:
  1. the range of characters that can be stored,  and
  2. the number of bytes the text will occupy on disk.
Popular encodings for storage include latin1 and utf8:
  • latin1: the characters that you can store cover most European languages, and the text occupies 1 byte per character.
  • utf8: you can store any(*) Unicode characters, and the text occupies from 1 to 3 bytes per character, depending on the character.
Other encodings exist, of course. For example latin2 has characters used in Eastern European languages. If you use ucs2 you can store any Unicode characters and the text occupies 2 bytes per character. If you are storing a lot of text in Japanese you may want to use ucs2 instead of utf8 because you are saving a third in the storage.
You can also specify a collation like utf8_general_ci for a column. It means how MySQL will sort the data. This affects indices and ORDER BY clauses. Cultures have different rules for alphabetic order: for example in Swedish Ä is the second to last letter of the alphabet, while in English it’s equivalent to A. So with Swedish collation you get a < b < ä and with English collation you get a = ä < b. Use the ordering your users expect. You can read more about collations in my other post.
What if you don’t specify an encoding and collation for the column? Then MySQL will use the default specified for the table, the database, or the server.

Converting encodings

Strings in MySQL consist of an encoding marker and a list of bytes. To convert strings from one encoding to another you can use the CONVERT(… USING …) function.  What CONVERT actually does is change both the marker and the bytes so that the resulting string is the same as the original; not very useful for fixing broken text.
The solution is stripping the encoding marker from the string before passing it to CONVERT. This is done by converting the string to “binary” first:
CONVERT(BINARY broken_text USING utf8)
I’ve written a detailed case study on fixing a particularly complicated MySQL encoding problem in another post.

Practical demonstration

Let’s create a table with columns in different encodings and fill it with some data.
CREATE TABLE test_encoding (
  utf8   varchar(4) CHARACTER SET utf8,
  latin1 varchar(4) CHARACTER SET latin1,
  latin2 varchar(4) CHARACTER SET latin2
);
INSERT INTO test_encoding (utf8,latin1,latin2) values ('A','A','A');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('Ñ','Ñ','Ñ');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('Ŕ','Ŕ','Ŕ');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('☺','☺','☺');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('€','€','€');
When you select the data from this table with select * from test_encoding you get the following results:
+------+--------+--------+
| utf8 | latin1 | latin2 |
+------+--------+--------+
| A    | A      | A      |
| Ñ    | Ñ      | ?      |
| Ŕ    | ?      | Ŕ      |
| ☺    | ?      | ?      |
| €    | €      | ?      |
+------+--------+--------+
Even though each column is encoded differently everything is still readable. This is thanks to the MySQL client encoding. The characters that could not be encoded are replaced with question marks: there is no Ñ in latin2, nor is there a Ŕ in latin1. The smiley face is not present in either.
The last line is interesting: the euro sign can be encoded in latin1 although technically it’s not really a latin1 character. This shows that MySQL’s latin1 is not really the standard ISO-8859-1 encoding, commonly known as “ISO latin 1″. In reality it is Windows-1252, the “Western European” encoding that also has curly quotes and other niceties where ISO-8859-1 has unprintable control characters.
To see that MySQL really is storing the columns in different encodings you could do this:
mysql> select utf8,hex(utf8),hex(latin1) from test_encoding;
+------+-----------+-------------+
| utf8 | hex(utf8) | hex(latin1) |
+------+-----------+-------------+
| A    | 41        | 41          |
| Ñ    | C391      | D1          |
| Ŕ    | C594      | 3F          |
| ☺    | E298BA    | 3F          |
| €    | E282AC    | 80          |
+------+-----------+-------------+
What happens when we compare two columns encoded differently?
mysql> select utf8,latin1 from test_encoding where utf8=latin1;
+------+--------+
| utf8 | latin1 |
+------+--------+
| A    | A      |
| Ñ    | Ñ      |
| €    | €      |
+------+--------+
MySQL knows that the character repertoire of latin1 is a subset of utf8, so it converts the latin1 column into unicode so that it can be compared. If however you try to compare latin1 with latin2 this is what happens:
mysql> select latin1,latin2 from test_encoding where latin1=latin2;
ERROR 1267 (HY000): Illegal mix of collations (latin1_bin,IMPLICIT) and (latin2_bin,IMPLICIT) for operation '='
Since neither latin1 and latin2 is a subset of the other, MySQL cannot compare the columns. To be able to compare them you have to convert at least one to unicode, e.g. by using CONVERT(.. USING ..):
select latin1, latin2 from test_encoding
where convert(latin1 using utf8)=latin2;
I hope this information is enough to solve any character encoding problems you may have with MySQL once and for all.
(*) Did I say utf8 can store any Unicode characters? That’s not really true. MySQL’s utf8 and ucs2 only store unicode from the basic multilingual plane (BMP), which includes “only” the characters U+0000 through U+FFFF. If you have MySQL 5.5 or later you have additional encodings to choose from: utf16, utf32, utf8mb4. These encodings You can read the details, as usual, in the official MySQL documentation.

Resource: http://jonisalonen.com/2012/ultimate-guide-to-utf8-and-mysql/

No comments:

Post a Comment