How character encodings work in MySQL seem to continue baffle people,
at least based on the number of questions posted on Stackoverflow. Read
on if you have an application that still outputs funny text like
“señor” or “se�or” where “señor” is expected.
The truth about text in MySQL
Strings in MySQL consist of an encoding marker and a list of bytes. The
CHARSET
function reads the encoding while
HEX
can be used to read bytes:
SELECT CHARSET("€"), HEX("€")
=> latin1, 80
This enables MySQL to use different encodings while working on a
query. You can write a query that compares text columns in different
encodings and returns the results where the text is the same, rather
than the bytes that are stored.
I have everything set to utf8
and still the database gives me garbage!
That’s because of the MySQL client encoding. One query can return
text in several different encodings, it’s unreasonable to expect the
user software to check the encoding of each individual string. So all
text returned by the MySQL client is encoded in only one encoding.
The fast and easy solution to encoding problems: set the MySQL client encoding.
How this is done depends on what you use to connect. If you work in PHP and the old MySQL functions you should use
mysql_set_charset("utf8")
. If you have an old version try
mysql_query("SET NAMES utf8")
. If you use MySQLi it would be
mysqli_set_charset($link, "utf8")
(or
$link->set_charset("utf8")
if you use the OO API).
What are the encodings in the tables good for, then?
When you create a table you can specify a
character encoding for a
column. It means how the MySQL server will convert text like “abc123″
to bytes that can be written to disk. The encoding affects two things:
- the range of characters that can be stored, and
- the number of bytes the text will occupy on disk.
Popular encodings for storage include latin1 and utf8:
- latin1: the characters that you can store cover most European languages, and the text occupies 1 byte per character.
- utf8: you can store any(*) Unicode characters, and the text occupies from 1 to 3 bytes per character, depending on the character.
Other encodings exist, of course. For example
latin2 has characters used in Eastern European languages. If you use
ucs2
you can store any Unicode characters and the text occupies 2 bytes per
character. If you are storing a lot of text in Japanese you may want to
use ucs2 instead of utf8 because you are saving a third in the storage.
You can also specify a
collation like
utf8_general_ci
for a column. It means how MySQL will
sort the data. This affects
indices and
ORDER BY
clauses. Cultures have different rules for alphabetic order: for
example in Swedish Ä is the second to last letter of the alphabet, while
in English it’s equivalent to A. So with Swedish collation you get a
< b < ä and with English collation you get a = ä < b.
Use the ordering your users expect. You can read
more about collations in my other post.
What if you don’t specify an encoding and collation for the column?
Then MySQL will use the default specified for the table, the database,
or the server.
Converting encodings
Strings in MySQL consist of an encoding marker and a list of bytes.
To convert strings from one encoding to another you can use
the CONVERT(… USING …) function. What CONVERT actually does is change
both the marker and the bytes so that the resulting string is the same
as the original; not very useful for fixing broken text.
The solution is
stripping the encoding marker from the string before passing it to CONVERT. This is done by converting the string to “binary” first:
CONVERT(BINARY broken_text USING utf8)
I’ve written a detailed case study on
fixing a particularly complicated MySQL encoding problem in another post.
Practical demonstration
Let’s create a table with columns in different encodings and fill it with some data.
CREATE TABLE test_encoding (
utf8 varchar(4) CHARACTER SET utf8,
latin1 varchar(4) CHARACTER SET latin1,
latin2 varchar(4) CHARACTER SET latin2
);
INSERT INTO test_encoding (utf8,latin1,latin2) values ('A','A','A');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('Ñ','Ñ','Ñ');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('Ŕ','Ŕ','Ŕ');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('☺','☺','☺');
INSERT INTO test_encoding (utf8,latin1,latin2) values ('€','€','€');
When you select the data from this table with
select * from test_encoding
you get the following results:
+------+--------+--------+
| utf8 | latin1 | latin2 |
+------+--------+--------+
| A | A | A |
| Ñ | Ñ | ? |
| Ŕ | ? | Ŕ |
| ☺ | ? | ? |
| € | € | ? |
+------+--------+--------+
Even though each column is encoded differently everything is still
readable. This is thanks to the MySQL client encoding. The characters
that could not be encoded are replaced with question marks: there is no Ñ
in latin2, nor is there a Ŕ in latin1. The smiley face is not present
in either.
The last line is interesting: the euro sign
can be encoded in latin1 although technically it’s not really a latin1 character.
This shows that MySQL’s latin1 is not really the standard ISO-8859-1
encoding, commonly known as “ISO latin 1″. In reality it is
Windows-1252,
the “Western European” encoding that also has curly quotes and other
niceties where ISO-8859-1 has unprintable control characters.
To see that MySQL really is storing the columns in different encodings you could do this:
mysql> select utf8,hex(utf8),hex(latin1) from test_encoding;
+------+-----------+-------------+
| utf8 | hex(utf8) | hex(latin1) |
+------+-----------+-------------+
| A | 41 | 41 |
| Ñ | C391 | D1 |
| Ŕ | C594 | 3F |
| ☺ | E298BA | 3F |
| € | E282AC | 80 |
+------+-----------+-------------+
What happens when we compare two columns encoded differently?
mysql> select utf8,latin1 from test_encoding where utf8=latin1;
+------+--------+
| utf8 | latin1 |
+------+--------+
| A | A |
| Ñ | Ñ |
| € | € |
+------+--------+
MySQL knows that the
character repertoire
of latin1 is a subset of utf8, so it converts the latin1 column into
unicode so that it can be compared. If however you try to compare latin1
with latin2 this is what happens:
mysql> select latin1,latin2 from test_encoding where latin1=latin2;
ERROR 1267 (HY000): Illegal mix of collations (latin1_bin,IMPLICIT) and (latin2_bin,IMPLICIT) for operation '='
Since neither latin1 and latin2 is a subset of the other, MySQL
cannot compare the columns. To be able to compare them you have to
convert at least one to unicode, e.g. by using
CONVERT(.. USING ..)
:
select latin1, latin2 from test_encoding
where convert(latin1 using utf8)=latin2;
I hope this information is enough to solve any character encoding problems you may have with MySQL once and for all.
(*) Did I say
utf8
can store
any Unicode
characters? That’s not really true. MySQL’s utf8 and ucs2 only store
unicode from the basic multilingual plane (BMP), which includes “only”
the characters U+0000 through U+FFFF. If you have MySQL 5.5 or later you
have additional encodings to choose from: utf16, utf32, utf8mb4. These
encodings You can read the details, as usual, in the
official MySQL documentation.
Resource:
http://jonisalonen.com/2012/ultimate-guide-to-utf8-and-mysql/