Showing posts with label utf8. Show all posts
Showing posts with label utf8. Show all posts

Monday, September 7, 2015

Importing the Unicode Character Database in MySQL

In Python it is easily possible to findout the name of a Unicode character and findout some properties about that character. The module which does that is called unicodedata.

An example:

>>> import unicodedata
>>> unicodedata.name('☺')
'WHITE SMILING FACE'

This module uses the data as released in the UnicodeData.txt file from the unicode.org website.

So if UnicodeData.txt is a 'database', then we should be able to import it into MySQL and use it!

I wrote a small Python script to automate this. The basic steps are:

  • Download UnicodeData.txt
  • Create a unicodedata.ucd table
  • Use LOAD DATA LOCAL INFILE to load the data

This isn't difficult especially because the file doesn't have the actual characters in it. It is just an ASCII file with codepoints and the data related to it.

But it would be useful to have those characters in the database. So what I did is to create those with CONVERT(UNHEX(value) USING utf32.

What that does is this:

  • UNHEX: get the number of the codepoint
  • CONVERT that number from utf32. In utf32 all characters are 4-byte/32-bit and are mapped 1-on-1 to their codepoint. This is what ucs2 does with 2-byte/16-bit, but then we can't use the full range of characters. That is why UTF-16 replaced ucs2. UTF-16 uses a variable length, just like UTF-8.

So now I can do this:

mysql> USE unicodedata
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> SELECT name FROM ucd WHERE `char`='☺';
+--------------------+
| name               |
+--------------------+
| WHITE SMILING FACE |
+--------------------+
1 row in set (0.00 sec)

Or this:

Screenshot of DOLPHIN character

To get the data (or just a dump of the database): github.com/dveeden/mysqlunicodedata

I'm speaking at Percona Live Amsterdam about MySQL and Unicode. And you can also cath me the Booking.com booth (#205) or at the Community Diner.

Update 1

Set your encoding to utf8mb4 in your connection properties (e.g. --default-character-set=utf8mb4) or use SET NAMES utf8mb4 to switch to the utf8mb4 character set.

On Linux you might want to install google-noto-color-emoji-fonts, gdouros-symbola-fonts and/or google-android-emoji-fonts to see the emoji characters. For other groups of characters you might need to install additional fonts specific to that script.

Emoji fonts should be present on Windows 10 and might not be present on Windows 7 and earlier. Use the unicode enable mysql option in the start menu which is installed by MySQL Installer. This is to enable unicode for cmd.exe (Powershell or cmd.exe with the correct codepage set might also work)

Thanks to Peter Laursen for testing this.

Sunday, June 24, 2012

That's not my name! A story about character sets

When computers were still using large black text oriented screens or no screens at all, a computer only knew how to store a limited set of characters. Then it was normal to store a name with the more complicated characters replaced by more basic characters. The ASCII standard was used to make communication between multiple systems (or applications) easier. Storing characters as ASCII needs little space and is quite strait forward.

Then DOS used CP850 and CP437 and so on to make it possible to use language /location specific characters.
Then ISO8859-1, ISO8859-15 and more of these character sets were defined as standard.

And now there is Unicode: UTF-8, UTF-16, UCS2, etc. which allow you to store many different kinds of characters in the same character set.

But all those character sets only work correctly if you configure all applications correctly. Many of the character sets are very similar and seem to work correctly even if one of the systems is not correctly configured. If this happens most characters will be correct except the special ones.  And my name does contain a 'special' character, the 'ë'.

Below is a picture of two letters I received in the mail recently:





So what went wrong?

This is called Mojibake.

The first one:

mysql> CREATE TABLE t1(v1 VARCHAR(200)) ENGINE=InnoDB DEFAULT CHARACTER SET utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> SHOW SESSION VARIABLES LIKE '%char%';
+--------------------------+----------------------------------------------------+
| Variable_name            | Value                                              |
+--------------------------+----------------------------------------------------+
| character_set_client     | utf8                                               |
| character_set_connection | utf8                                               |
| character_set_database   | latin1                                             |
| character_set_filesystem | binary                                             |
| character_set_results    | utf8                                               |
| character_set_server     | latin1                                             |
| character_set_system     | utf8                                               |
| character_sets_dir       | /home/dveeden/mysql/5.5.22-mariadb/share/charsets/ |
+--------------------------+----------------------------------------------------+
8 rows in set (0.00 sec)

mysql> INSERT INTO t1 VALUES('Daniël van Eeden');
Query OK, 1 row affected (0.01 sec)

mysql> SELECT * FROM t1;
+-------------------+
| v1                |
+-------------------+
| Daniël van Eeden  |
+-------------------+
1 row in set (0.00 sec)

mysql> set session character_set_client=latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> set session character_set_connection=latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO t1 VALUES('Daniël van Eeden');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t1;
+---------------------+
| v1                  |
+---------------------+
| Daniël van Eeden    |
| Daniël van Eeden   |
+---------------------+
2 rows in set (0.00 sec)

mysql> INSERT INTO t1 VALUES('Daniël van Eeden');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t1;
+---------------------+
| v1                  |
+---------------------+
| Daniël van Eeden    |
| Daniël van Eeden   |
| Daniël van Eeden   |
+---------------------+
3 rows in set (0.00 sec)

So we can reproduce this issue by setting the client and connection charset to latin1.

mysql> SELECT v1,HEX(v1) FROM t1;
+---------------------+----------------------------------------+
| v1                  | HEX(v1)                                |
+---------------------+----------------------------------------+
| Daniël van Eeden    | 44616E69C3AB6C2076616E20456564656E     |
| Daniël van Eeden   | 44616E69C383C2AB6C2076616E20456564656E |
| Daniël van Eeden   | 44616E69C383C2AB6C2076616E20456564656E |
+---------------------+----------------------------------------+
3 rows in set (0.00 sec)

mysql> SELECT CONVERT(X'C3AB' USING utf8),CONVERT(X'C3AB' USING latin1);
+-----------------------------+-------------------------------+
| CONVERT(X'C3AB' USING utf8) | CONVERT(X'C3AB' USING latin1) |
+-----------------------------+-------------------------------+
| ë                           | ë                            |
+-----------------------------+-------------------------------+
1 row in set (0.00 sec)

The ë is stored as C383 C2AB and is rendered as two latin1 characters.

unicode C3 83 (Ã) is latin1 C3
unicode C2 AB («) is latin1 AB
C3AB = unicode ë

charutf8latin1cp850
ëC3 ABEB89
ÃC3 83C3C7
«C2 ABABAE
ÙC3 99D9EB

mysql> SELECT CONVERT(X'C3AB' USING latin1);
+-------------------------------+
| CONVERT(X'C3AB' USING latin1) |
+-------------------------------+
| ë                            |
+-------------------------------+
1 row in set (0.00 sec)

The sender of the first letter is probably storing UTF-8 from the web application in a database, but uses Latin1 when sending the letters.

The second letter renders the ë as Ù.

After some tries it seems like the sender is storing the ë as EB (latin1) and rendering it as CP850.

mysql> SELECT CONVERT(X'EB' USING cp850);
+----------------------------+
| CONVERT(X'EB' USING cp850) |
+----------------------------+
| Ù                          |
+----------------------------+
1 row in set (0.00 sec)

Anders Karslsson recently wrote a nice introduction about Unicode. And Ronald Bradford wrote a blogpost about UTF-8 With MySQL and LAMP.