Monday, September 7, 2015

Importing the Unicode Character Database in MySQL

In Python it is easily possible to findout the name of a Unicode character and findout some properties about that character. The module which does that is called unicodedata.

An example:

>>> import unicodedata
>>> unicodedata.name('☺')
'WHITE SMILING FACE'

This module uses the data as released in the UnicodeData.txt file from the unicode.org website.

So if UnicodeData.txt is a 'database', then we should be able to import it into MySQL and use it!

I wrote a small Python script to automate this. The basic steps are:

  • Download UnicodeData.txt
  • Create a unicodedata.ucd table
  • Use LOAD DATA LOCAL INFILE to load the data

This isn't difficult especially because the file doesn't have the actual characters in it. It is just an ASCII file with codepoints and the data related to it.

But it would be useful to have those characters in the database. So what I did is to create those with CONVERT(UNHEX(value) USING utf32.

What that does is this:

  • UNHEX: get the number of the codepoint
  • CONVERT that number from utf32. In utf32 all characters are 4-byte/32-bit and are mapped 1-on-1 to their codepoint. This is what ucs2 does with 2-byte/16-bit, but then we can't use the full range of characters. That is why UTF-16 replaced ucs2. UTF-16 uses a variable length, just like UTF-8.

So now I can do this:

mysql> USE unicodedata
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> SELECT name FROM ucd WHERE `char`='☺';
+--------------------+
| name               |
+--------------------+
| WHITE SMILING FACE |
+--------------------+
1 row in set (0.00 sec)

Or this:

Screenshot of DOLPHIN character

To get the data (or just a dump of the database): github.com/dveeden/mysqlunicodedata

I'm speaking at Percona Live Amsterdam about MySQL and Unicode. And you can also cath me the Booking.com booth (#205) or at the Community Diner.

Update 1

Set your encoding to utf8mb4 in your connection properties (e.g. --default-character-set=utf8mb4) or use SET NAMES utf8mb4 to switch to the utf8mb4 character set.

On Linux you might want to install google-noto-color-emoji-fonts, gdouros-symbola-fonts and/or google-android-emoji-fonts to see the emoji characters. For other groups of characters you might need to install additional fonts specific to that script.

Emoji fonts should be present on Windows 10 and might not be present on Windows 7 and earlier. Use the unicode enable mysql option in the start menu which is installed by MySQL Installer. This is to enable unicode for cmd.exe (Powershell or cmd.exe with the correct codepage set might also work)

Thanks to Peter Laursen for testing this.