Sunday, December 13, 2015

Using a parser plugin for improved search results with MySQL 5.7 and InnoDB.

With Unicode it is possible for strings to look the same, but with slight differences in which codepoints are used.

For example the é in Café can be <U+0065 U+0301> or <U+00E9>.

The solution is to use Unicode normalization, which is supported in every major programming language. Both versions of Café will be normalized to use U+00E9.

In the best situation the application inserting data into the database will do the normalization, but that often not the case.

This gives the following issue: If you search for Café in the normalized form it won't return non-normalized entries.

I made a proof-of-concept parser plugin which indexes the normalized version of words.

A very short demo:
mysql> CREATE TABLE test1 (id int auto_increment primary key,
    -> txt TEXT CHARACTER SET utf8mb4, fulltext (txt));
Query OK, 0 rows affected (0.30 sec)

mysql> CREATE TABLE test2 (id int auto_increment primary key,
    -> txt TEXT CHARACTER SET utf8mb4, fulltext (txt) WITH PARSER norm_parser);
Query OK, 0 rows affected (0.16 sec)

mysql> INSERT INTO test1(txt) VALUES(X'436166C3A9'),(X'43616665CC81');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> INSERT INTO test2(txt) VALUES(X'436166C3A9'),(X'43616665CC81');
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM test1;
+----+--------+
| id | txt    |
+----+--------+
|  1 | Café   |
|  2 | Café  |
+----+--------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM test1 WHERE MATCH (txt) AGAINST ('Café');
+----+-------+
| id | txt   |
+----+-------+
|  1 | Café  |
+----+-------+
1 row in set (0.00 sec)

mysql> SELECT * FROM test2 WHERE MATCH (txt) AGAINST ('Café');
+----+--------+
| id | txt    |
+----+--------+
|  1 | Café   |
|  2 | Café  |
+----+--------+
2 rows in set (0.00 sec)

The source is here.

See also the NORMALIZE feature on the Modern SQL in MySQL page.

3 comments:

  1. The correct solution to this issue is to simply use the utf8mb4_unicode_ci collation.

    mariadb 10.1.8-MariaDB-log (root) [test]> CREATE TABLE test1 (id int auto_increment primary key, txt TEXT CHARACTER SET utf8mb4, fulltext (txt));
    Query OK, 0 rows affected (0.02 sec)

    mariadb 10.1.8-MariaDB-log (root) [test]> INSERT INTO test1(txt) VALUES(X'436166C3A9'),(X'43616665CC81');
    Query OK, 2 rows affected (0.00 sec)
    Records: 2 Duplicates: 0 Warnings: 0

    mariadb 10.1.8-MariaDB-log (root) [test]> SELECT * FROM test1 WHERE MATCH (txt) AGAINST ('Café');
    +----+-------+
    | id | txt |
    +----+-------+
    | 1 | Café |
    +----+-------+
    1 row in set (0.00 sec)

    mariadb 10.1.8-MariaDB-log (root) [test]> ALTER TABLE test1 MODIFY txt TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    Query OK, 2 rows affected (0.03 sec)
    Records: 2 Duplicates: 0 Warnings: 0

    mariadb 10.1.8-MariaDB-log (root) [test]> SELECT * FROM test1 WHERE MATCH (txt) AGAINST ('Café');
    +----+--------+
    | id | txt |
    +----+--------+
    | 1 | Café |
    | 2 | Café |
    +----+--------+
    2 rows in set (0.00 sec)

    ReplyDelete
    Replies
    1. Yes, you're right, although this basically does this for the index instead of the data and could also allow NFKD normalization etc. If a case sensitive(-like) collation like utf8mb4_bin is somehow needed or if you want to add an index without updating the (large) table this could also be useful, but I don't think those are good usecases.

      Also note that often utf8mb4_unicode_520_ci is a better choice than utf8mb4_unicode_ci as it uses UCA 5.2.0 instead of UCA 4.0.0. (my code uses UCA 8.0.0)

      mysql> SELECT 'Café' COLLATE utf8mb4_unicode_ci = 'Cafe' COLLATE utf8mb4_unicode_ci AS test1;
      +-------+
      | test1 |
      +-------+
      | 1 |
      +-------+
      1 row in set (0.00 sec)

      mysql> SELECT '🍣 ' COLLATE utf8mb4_unicode_ci = '🍺 ' COLLATE utf8mb4_unicode_ci AS test2;
      +-------+
      | test2 |
      +-------+
      | 1 |
      +-------+
      1 row in set (0.00 sec)

      mysql> SELECT '🍣 ' COLLATE utf8mb4_unicode_520_ci = '🍺 ' COLLATE utf8mb4_unicode_520_ci AS test3;
      +-------+
      | test3 |
      +-------+
      | 0 |
      +-------+
      1 row in set (0.01 sec)

      Delete
    2. https://gist.github.com/dveeden/e5fac8f206a5bf52b5d6

      Delete