Skip to main content

Posts

Showing posts with the label Unicode

Using a parser plugin for improved search results with MySQL 5.7 and InnoDB.

With Unicode it is possible for strings to look the same, but with slight differences in which codepoints are used. For example the é in Café can be <U+0065 U+0301> or <U+00E9>. The solution is to use Unicode normalization, which is supported in every major programming language. Both versions of Café will be normalized to use U+00E9. In the best situation the application inserting data into the database will do the normalization, but that often not the case. This gives the following issue: If you search for Café in the normalized form it won't return non-normalized entries. I made a proof-of-concept parser plugin which indexes the normalized version of words. A very short demo: mysql> CREATE TABLE test1 (id int auto_increment primary key, -> txt TEXT CHARACTER SET utf8mb4, fulltext (txt)); Query OK, 0 rows affected (0.30 sec) mysql> CREATE TABLE test2 (id int auto_increment primary key, -> txt TEXT CHARACTER SET utf8mb4, fulltext (txt) WITH P...

Importing the Unicode Character Database in MySQL

In Python it is easily possible to findout the name of a Unicode character and findout some properties about that character. The module which does that is called unicodedata . An example: >>> import unicodedata >>> unicodedata.name( '☺' ) 'WHITE SMILING FACE' This module uses the data as released in the UnicodeData.txt file from the unicode.org website. So if UnicodeData.txt is a 'database', then we should be able to import it into MySQL and use it! I wrote a small Python script to automate this. The basic steps are: Download UnicodeData.txt Create a unicodedata.ucd table Use LOAD DATA LOCAL INFILE to load the data This isn't difficult especially because the file doesn't have the actual characters in it. It is just an ASCII file with codepoints and the data related to it. But it would be useful to have those characters in the database. So what I did is to create those with CONVERT(UNHEX(value) USING utf32 . What that does...