Daniël's Database Blog: normalization

With Unicode, two strings can look identical while using different codepoints.

For example, the é in Café can be encoded either as <U+0065 U+0301> (an e followed by a combining acute accent) or as the single codepoint <U+00E9>.

The solution is to use Unicode normalization, which is supported in most major programming languages. With NFC normalization, both versions of Café end up using U+00E9.
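As a minimal sketch of what normalization does, here is Python's standard `unicodedata` module applied to both spellings. NFC is assumed to be the intended form, since it is the one that composes to U+00E9; NFD would produce the decomposed sequence instead.

```python
import unicodedata

composed = "Caf\u00e9"      # é as a single codepoint, U+00E9
decomposed = "Cafe\u0301"   # e (U+0065) followed by combining acute accent (U+0301)

# The strings render the same but are different codepoint sequences.
print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 4 5

# NFC composes base character + combining mark into the precomposed form.
nfc_composed = unicodedata.normalize("NFC", composed)
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
print(nfc_composed == nfc_decomposed)    # True
print([f"U+{ord(c):04X}" for c in nfc_decomposed])
# ['U+0043', 'U+0061', 'U+0066', 'U+00E9']
```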

In the best situation the application inserting data into the database will do the normalization, but that is often not the case.

This leads to the following issue: a search for Café in the normalized form won't return entries that were stored in non-normalized form.
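The mismatch can be reproduced outside the database. The sketch below decodes the same two UTF-8 byte sequences that the demo inserts as hex literals and compares them with a normalized search term. A plain equality check stands in for the index lookup here, which is a simplification of how full-text matching really works.

```python
import unicodedata

# UTF-8 bytes of the two rows used in the demo.
rows = {
    1: bytes.fromhex("436166C3A9").decode("utf-8"),    # Caf + U+00E9
    2: bytes.fromhex("43616665CC81").decode("utf-8"),  # Cafe + U+0301
}

# The search term in normalized (NFC) form.
query = unicodedata.normalize("NFC", "Caf\u00e9")

# Comparing the stored text as-is only finds the row that already uses U+00E9.
plain_hits = [row_id for row_id, txt in rows.items() if txt == query]
print(plain_hits)  # [1]
```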

I made a proof-of-concept parser plugin which indexes the normalized version of words.
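The idea of indexing normalized words can be illustrated with a toy inverted index. This is not the plugin's code (a MySQL parser plugin is written against the server's plugin API); the whitespace tokenizer, the choice of NFC, and the function names `normalized_tokens`, `build_index` and `search` are all assumptions made for this sketch.

```python
import unicodedata
from collections import defaultdict

def normalized_tokens(text):
    """Split on whitespace and yield each word in NFC form (hypothetical tokenizer)."""
    for word in text.split():
        yield unicodedata.normalize("NFC", word)

def build_index(docs):
    """Map each normalized word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in normalized_tokens(text):
            index[token].add(doc_id)
    return index

def search(index, term):
    """Normalize the search term the same way before looking it up."""
    return sorted(index.get(unicodedata.normalize("NFC", term), set()))

# One row stored in composed form, one in decomposed form.
docs = {1: "Caf\u00e9", 2: "Cafe\u0301"}
index = build_index(docs)

print(search(index, "Caf\u00e9"))   # [1, 2]
print(search(index, "Cafe\u0301"))  # [1, 2]
```

Because both the indexed words and the search term pass through the same normalization, either spelling finds both rows, while the stored text itself is left untouched.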

A very short demo:

mysql> CREATE TABLE test1 (id int auto_increment primary key,
    -> txt TEXT CHARACTER SET utf8mb4, fulltext (txt));
Query OK, 0 rows affected (0.30 sec)

mysql> CREATE TABLE test2 (id int auto_increment primary key,
    -> txt TEXT CHARACTER SET utf8mb4, fulltext (txt) WITH PARSER norm_parser);
Query OK, 0 rows affected (0.16 sec)

mysql> INSERT INTO test1(txt) VALUES(X'436166C3A9'),(X'43616665CC81');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> INSERT INTO test2(txt) VALUES(X'436166C3A9'),(X'43616665CC81');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM test1;
+----+--------+
| id | txt    |
+----+--------+
|  1 | Café   |
|  2 | Café  |
+----+--------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM test1 WHERE MATCH (txt) AGAINST ('Café');
+----+-------+
| id | txt   |
+----+-------+
|  1 | Café  |
+----+-------+
1 row in set (0.00 sec)

mysql> SELECT * FROM test2 WHERE MATCH (txt) AGAINST ('Café');
+----+--------+
| id | txt    |
+----+--------+
|  1 | Café   |
|  2 | Café  |
+----+--------+
2 rows in set (0.00 sec)

The source is here.

See also the NORMALIZE feature on the Modern SQL in MySQL page.

Sunday, December 13, 2015

Using a parser plugin for improved search results with MySQL 5.7 and InnoDB.