Sunday, June 24, 2012

That's not my name! A story about character sets

When computers were still using large black text oriented screens or no screens at all, a computer only knew how to store a limited set of characters. Then it was normal to store a name with the more complicated characters replaced by more basic characters. The ASCII standard was used to make communication between multiple systems (or applications) easier. Storing characters as ASCII needs little space and is quite strait forward.

Then DOS used CP850 and CP437 and so on to make it possible to use language /location specific characters.
Then ISO8859-1, ISO8859-15 and more of these character sets were defined as standard.

And now there is Unicode: UTF-8, UTF-16, UCS2, etc. which allow you to store many different kinds of characters in the same character set.

But all those character sets only work correctly if you configure all applications correctly. Many of the character sets are very similar and seem to work correctly even if one of the systems is not correctly configured. If this happens most characters will be correct except the special ones.  And my name does contain a 'special' character, the 'ë'.

Below is a picture of two letters I received in the mail recently:





So what went wrong?

This is called Mojibake.

The first one:

mysql> CREATE TABLE t1(v1 VARCHAR(200)) ENGINE=InnoDB DEFAULT CHARACTER SET utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> SHOW SESSION VARIABLES LIKE '%char%';
+--------------------------+----------------------------------------------------+
| Variable_name            | Value                                              |
+--------------------------+----------------------------------------------------+
| character_set_client     | utf8                                               |
| character_set_connection | utf8                                               |
| character_set_database   | latin1                                             |
| character_set_filesystem | binary                                             |
| character_set_results    | utf8                                               |
| character_set_server     | latin1                                             |
| character_set_system     | utf8                                               |
| character_sets_dir       | /home/dveeden/mysql/5.5.22-mariadb/share/charsets/ |
+--------------------------+----------------------------------------------------+
8 rows in set (0.00 sec)

mysql> INSERT INTO t1 VALUES('Daniël van Eeden');
Query OK, 1 row affected (0.01 sec)

mysql> SELECT * FROM t1;
+-------------------+
| v1                |
+-------------------+
| Daniël van Eeden  |
+-------------------+
1 row in set (0.00 sec)

mysql> set session character_set_client=latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> set session character_set_connection=latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO t1 VALUES('Daniël van Eeden');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t1;
+---------------------+
| v1                  |
+---------------------+
| Daniël van Eeden    |
| Daniël van Eeden   |
+---------------------+
2 rows in set (0.00 sec)

mysql> INSERT INTO t1 VALUES('Daniël van Eeden');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t1;
+---------------------+
| v1                  |
+---------------------+
| Daniël van Eeden    |
| Daniël van Eeden   |
| Daniël van Eeden   |
+---------------------+
3 rows in set (0.00 sec)

So we can reproduce this issue by setting the client and connection charset to latin1.

mysql> SELECT v1,HEX(v1) FROM t1;
+---------------------+----------------------------------------+
| v1                  | HEX(v1)                                |
+---------------------+----------------------------------------+
| Daniël van Eeden    | 44616E69C3AB6C2076616E20456564656E     |
| Daniël van Eeden   | 44616E69C383C2AB6C2076616E20456564656E |
| Daniël van Eeden   | 44616E69C383C2AB6C2076616E20456564656E |
+---------------------+----------------------------------------+
3 rows in set (0.00 sec)

mysql> SELECT CONVERT(X'C3AB' USING utf8),CONVERT(X'C3AB' USING latin1);
+-----------------------------+-------------------------------+
| CONVERT(X'C3AB' USING utf8) | CONVERT(X'C3AB' USING latin1) |
+-----------------------------+-------------------------------+
| ë                           | ë                            |
+-----------------------------+-------------------------------+
1 row in set (0.00 sec)

The ë is stored as C383 C2AB and is rendered as two latin1 characters.

unicode C3 83 (Ã) is latin1 C3
unicode C2 AB («) is latin1 AB
C3AB = unicode ë

charutf8latin1cp850
ëC3 ABEB89
ÃC3 83C3C7
«C2 ABABAE
ÙC3 99D9EB

mysql> SELECT CONVERT(X'C3AB' USING latin1);
+-------------------------------+
| CONVERT(X'C3AB' USING latin1) |
+-------------------------------+
| ë                            |
+-------------------------------+
1 row in set (0.00 sec)

The sender of the first letter is probably storing UTF-8 from the web application in a database, but uses Latin1 when sending the letters.

The second letter renders the ë as Ù.

After some tries it seems like the sender is storing the ë as EB (latin1) and rendering it as CP850.

mysql> SELECT CONVERT(X'EB' USING cp850);
+----------------------------+
| CONVERT(X'EB' USING cp850) |
+----------------------------+
| Ù                          |
+----------------------------+
1 row in set (0.00 sec)

Anders Karslsson recently wrote a nice introduction about Unicode. And Ronald Bradford wrote a blogpost about UTF-8 With MySQL and LAMP.

1 comment:

  1. Me too!

    Hi DaieUl,
    I mean, DaniA<!,
    Umm, Daniel?

    I wrote some on charsets and collations, a few years back: <a href="http://code.openark.org/blog/mysql/mysqls-character-sets-and-collations-demystified>MySQL's character sets and collations demystified</a>.

    ReplyDelete