With Unicode, two strings can look identical on screen while being built from different codepoints.
For example, the é in Café can be stored as <U+0065 U+0301> (an e followed by a combining acute accent) or as the single codepoint <U+00E9>.
The solution is to use Unicode normalization, which is supported in most major programming languages. With a composing normalization form such as NFC, both versions of Café will be normalized to use U+00E9.
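As a small illustration of the idea, here is a sketch using Python's standard `unicodedata` module. Python is only my choice for the example and the helper name `same_after_nfc` is made up for it. The two spellings compare as different strings until both are normalized to the same form:

```python
import unicodedata

composed = "Caf\u00e9"      # é as the single codepoint U+00E9
decomposed = "Cafe\u0301"   # e (U+0065) followed by combining acute accent (U+0301)

def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)                # False: different codepoints
print(same_after_nfc(composed, decomposed))  # True: identical once normalized

# The UTF-8 bytes of each spelling, as hex.
print(composed.encode("utf-8").hex().upper())    # 436166C3A9
print(decomposed.encode("utf-8").hex().upper())  # 43616665CC81
```

Which form to pick (NFC, NFD, NFKC or NFKD) depends on the application. NFC is used here only because it gives the U+00E9 form mentioned above.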
In the best situation the application inserting data into the database will do the normalization, but that is often not the case.
This gives the following issue: if you search for Café in the normalized form, the search won't return non-normalized entries.
I made a proof-of-concept parser plugin which indexes the normalized version of words.
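To show the general idea of indexing normalized words, here is a toy inverted index in Python. This is not the plugin's code. The function names, the whitespace tokenizer and the choice of NFC are all assumptions made for the sketch. The point is only that both the stored words and the search term pass through the same normalization step:

```python
import unicodedata
from collections import defaultdict
from typing import Dict, List, Set

def normalize_token(word: str) -> str:
    """Normalize one token; NFC is an assumption, not necessarily the plugin's form."""
    return unicodedata.normalize("NFC", word)

def build_index(rows: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each normalized word to the set of row ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.split():  # naive whitespace tokenizer
            index[normalize_token(word)].add(row_id)
    return index

def search(index: Dict[str, Set[int]], term: str) -> List[int]:
    """Look up a term after normalizing it the same way as the indexed words."""
    return sorted(index.get(normalize_token(term), set()))

rows = {1: "Caf\u00e9", 2: "Cafe\u0301"}  # composed and decomposed spellings
index = build_index(rows)

print(search(index, "Caf\u00e9"))   # [1, 2]
print(search(index, "Cafe\u0301"))  # [1, 2]
```

Without the `normalize_token` call, the two rows would end up under different keys and each search would find only one of them.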
A very short demo:

    mysql> CREATE TABLE test1 (id int auto_increment primary key,
        -> txt TEXT CHARACTER SET utf8mb4, fulltext (txt));
    Query OK, 0 rows affected (0.30 sec)

    mysql> CREATE TABLE test2 (id int auto_increment primary key,
        -> txt TEXT CHARACTER SET utf8mb4, fulltext (txt) WITH PARSER norm_parser);
    Query OK, 0 rows affected (0.16 sec)

    mysql> INSERT INTO test1(txt) VALUES(X'436166C3A9'),(X'43616665CC81');
    Query OK, 2 rows affected (0.00 sec)
    Records: 2  Duplicates: 0  Warnings: 0

    mysql> INSERT INTO test2(txt) VALUES(X'436166C3A9'),(X'43616665CC81');
    Query OK, 2 rows affected (0.00 sec)
    Records: 2  Duplicates: 0  Warnings: 0

    mysql> SELECT * FROM test1;
    +----+--------+
    | id | txt    |
    +----+--------+
    |  1 | Café   |
    |  2 | Café   |
    +----+--------+
    2 rows in set (0.00 sec)

    mysql> SELECT * FROM test1 WHERE MATCH (txt) AGAINST ('Café');
    +----+-------+
    | id | txt   |
    +----+-------+
    |  1 | Café  |
    +----+-------+
    1 row in set (0.00 sec)

    mysql> SELECT * FROM test2 WHERE MATCH (txt) AGAINST ('Café');
    +----+--------+
    | id | txt    |
    +----+--------+
    |  1 | Café   |
    |  2 | Café   |
    +----+--------+
    2 rows in set (0.00 sec)
The source is here.
See also the NORMALIZE feature on the Modern SQL in MySQL page.