Monday, September 7, 2015

Importing the Unicode Character Database in MySQL

In Python it is easily possible to findout the name of a Unicode character and findout some properties about that character. The module which does that is called unicodedata.

An example:

>>> import unicodedata
>>> unicodedata.name('☺')
'WHITE SMILING FACE'

This module uses the data as released in the UnicodeData.txt file from the unicode.org website.

So if UnicodeData.txt is a 'database', then we should be able to import it into MySQL and use it!

I wrote a small Python script to automate this. The basic steps are:

  • Download UnicodeData.txt
  • Create a unicodedata.ucd table
  • Use LOAD DATA LOCAL INFILE to load the data

This isn't difficult especially because the file doesn't have the actual characters in it. It is just an ASCII file with codepoints and the data related to it.

But it would be useful to have those characters in the database. So what I did is to create those with CONVERT(UNHEX(value) USING utf32.

What that does is this:

  • UNHEX: get the number of the codepoint
  • CONVERT that number from utf32. In utf32 all characters are 4-byte/32-bit and are mapped 1-on-1 to their codepoint. This is what ucs2 does with 2-byte/16-bit, but then we can't use the full range of characters. That is why UTF-16 replaced ucs2. UTF-16 uses a variable length, just like UTF-8.

So now I can do this:

mysql> USE unicodedata
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> SELECT name FROM ucd WHERE `char`='☺';
+--------------------+
| name               |
+--------------------+
| WHITE SMILING FACE |
+--------------------+
1 row in set (0.00 sec)

Or this:

Screenshot of DOLPHIN character

To get the data (or just a dump of the database): github.com/dveeden/mysqlunicodedata

I'm speaking at Percona Live Amsterdam about MySQL and Unicode. And you can also cath me the Booking.com booth (#205) or at the Community Diner.

Update 1

Set your encoding to utf8mb4 in your connection properties (e.g. --default-character-set=utf8mb4) or use SET NAMES utf8mb4 to switch to the utf8mb4 character set.

On Linux you might want to install google-noto-color-emoji-fonts, gdouros-symbola-fonts and/or google-android-emoji-fonts to see the emoji characters. For other groups of characters you might need to install additional fonts specific to that script.

Emoji fonts should be present on Windows 10 and might not be present on Windows 7 and earlier. Use the unicode enable mysql option in the start menu which is installed by MySQL Installer. This is to enable unicode for cmd.exe (Powershell or cmd.exe with the correct codepage set might also work)

Thanks to Peter Laursen for testing this.

11 comments:

  1. Quite funny example with the dolphin. Good catch!

    I tried to make it work on both Windows (using the Unicode enabled 'mysql' command line shipipng with the MySQL Windows Installer) as well on OpenSuse Linux development version (Leap 42.1, Milestone 2) from a terminal/'mysql client'. The dolphin glyph does not display anywhere . Probably a font problem? Have you configured your terminal with some specific font option?

    I also tried a few Windows GUI clients in Windows and Wine, but it will also need a proper font for this.

    (I did not use the python script but executed the .sql for structure and data from the github repository . Simply copy/pasted into the SQL editor of one of the GUI clients and executed. It should not matter (?). Almost every character having a graphical representation do display as expected. But no dolphin in any of the clients here And I have checked that is not April 1st today!)

    ReplyDelete
    Replies
    1. You probably need to set your connection to utf8mb4.
      The easiest way to do that is "SET NAMES utf8mb4".

      A good emoji font is Google Noto (google-noto-color-emoji-fonts on Fedora)
      https://www.google.com/get/noto/

      On windows you might want to see if cmd.exe or PowerShell gives different results.
      A screenshot from Workbench on Fedora:
      https://app.box.com/s/takj4btd2bmi05c19oymp14udu9pyxy2

      Delete
  2. addition .. but executed the .sql for structure and data from the github repository *accessed from a plain browser*.

    ReplyDelete
  3. hmmm . no luck with "SET NAMES utf8mb4" and Noto Sans -font on Windows (SQLyog) .

    With "SET NAMES utf8mb4" on SuSE there is a difference though: a small box now is displaying the 'value' (beforeand without "SET NAMES utf8mb4" it was plain garbage) . I interpret this as the terminal understand the character but the "font_interpreting_layer" (can't find a better name) does not find a glyph for the value.

    I have no clue how to configure a Linux terninal (I use XFCE terminal, if it matters) for use of a specific font)

    ReplyDelete
    Replies
    1. If you see a number then indeed the last issue to solve is to get it to use a correct font.

      In the XFCE terminal you should have a advanced tab in the settings where you could select UTF-8.
      Also make sure LANG is set to a unicode locale: "export LANG=en_US.utf8"

      Delete
  4. "export LANG=en_US.utf8" did not change anything. UTF8 was already selected (from menu .. terminal .. encoding) for the terminal.

    I read some of criticism in SuSE Forums about 'native' OpenSuSE font support (mostly criticism of rendering quality, though). There should be a 3rd party repo available with better quality fonts for replacement of system fonts somewhere, as I understand.

    I have a VM with Fedora 22 as well. Will try it when I get time.

    ReplyDelete
  5. About Windows. I had tried tis on Windows 7 (my 'workhorse' machine). On a Win 10 (insider preview 10532) system it works fine in SQLyog using "Courier New" font for display. So recently and after Win7 Microsoft seem to have updated their system fonts to support emoji's but haven't backported this to Win7.

    I think we can claim the problem solved!

    ReplyDelete
  6. No luck in XFCE terminal in neither Fedora 21 nor 22. Same as in SuSE. The character is recognized
    , but the dolphin glyph does not display. Instead a square with a very small depiction inside of the HEX of the character byte sequence (as far as I can read it). Terminal encoding is UTF8.

    ReplyDelete
  7. And my absolutely last comment for now: no luck either in GNOME terminal on Fedora 21.

    @Daniël - I think you must have tweaked your system or terminal settings somehow to use a non-default font in the terminal?

    ReplyDelete
    Replies
    1. On Fedora 22 with xfce4-terminal and gnome-terminal with default font settings this works.

      The DOLPHIN character is in these fonts:
      Android Emoji
      Noto Color Emoji
      Symbola

      So install google-android-emoji-fonts google-noto-color-emoji-fonts gdouros-symbola-fonts and xfce4-terminal or gnome-terminal and it should work.

      Note that you might need to restart your terminal emulator to pickup the fonts.

      Delete
    2. Thanks!

      Wth google-android-emoji-fonts, google-noto-color-emoji-fonts and gdouros-symbola-fonts installed I finally see the dolphin in XFCE console in Fedora 22. :-)

      Please don't takei t as a personal criticism, but I think you are careless elaborting the client environment and platforms where test cases from this blog will work. Your have oen something extraardinary (installing fonts that OS's won't install as default) and that should be listed.

      If I should rewrite, I would update it with this (and IMO none of it is trivial):

      0) on all platforms* "SET NAMES utf8mb4" for the connection before querying the database. ""SET NAMES utf8" will not handle emojis for instance.

      1) On Linux using 'mysql' client and MySQL Workbench (and posibly more native Linux clients)
      * ensure that your terminal uses UTF8 encoding.
      * ensure that fonts are installed supporting emojis display (such as google-noto-color-emoji-fonts, gdouros-symbola-fonts and gdouros-symbola-fonts on Fedora 22.(on other distros identical or similar fonts may be available with other names).

      2) On Windows using GUI clients
      * native(fonts on Windows 10 are supporting emojis. On Windows 7 and earler they don't . Uncertain about Win 8/8.1 at the moment. On a Windows system where native fonts don't support emojis you will need install a font that does and configure the client to use it for display. Same applies to Windows clients running in Wine on any *nix platform.
      * same applies to the unicode-enabled 'mysql' command line client shipped with MySQL Installer 5.6 or higher and pre-installed in the Windows start menu.This client interface is not avialable with MariaDB currently (but a 'mysql' client from Oracle will connect to MariaDB, obviously).
      * Not sure about Windows Powershell on the various WIndoys version where it is available.

      On Mac OS/X
      ** currently unknown **


      Modify, expand etc. this as you want just let this comment stand for itself. and let otherr eaders fill the holes.

      Delete