Monday, May 6, 2024

MySQL Protocol: Collations

This story starts with a pull request for go-mysql to allow setting the collation in auth handshake that I was reviewing. The reason why the author wanted to do this is to speedup the connection setup as he has a latency sensitive application and a lot of connection setups and tear downs.

While looking at this I noticed that the collation would be stored in a single byte. However the list of supported collations shows collations with an ID that’s more than 255.

mysql> SELECT MIN(ID),MAX(ID) FROM information_schema.collations;
+---------+---------+
| MIN(ID) | MAX(ID) |
+---------+---------+
|       1 |     323 |
+---------+---------+
1 row in set (0.00 sec)

The protocol documentation for Protocol::HandshakeResponse41 says that the value sent here is only the lower 8-bits.

So I was wondering how do other connectors send this to the server? Are the other 8-bits sent elsewhere in the protocol?

So I used MySQL Connector/Python to try this out.

import mysql.connector
c = mysql.connector.connect(
    host='127.0.0.1',
    port=3306,
    user="test",
    password="test",
    collation="utf8mb4_ja_0900_as_cs",
    ssl_disabled=True
)
cur = c.cursor()
cur.execute("SHOW SESSION VARIABLES LIKE '%collation%'");
for col in cur:
    print(col)
cur.close()
c.close()

Here utf8mb4_ja_0900_as_cs is used which has collation ID 303 or 0x12F in hex. Another way to write this down is 0x01 0x2F

This is how this looked in Wireshark:

So the collation is set to latin1_bin (47) instead of utf8mb4_ja_0900_as_cs (303). Note that 47 in binary is 0x2F. And then we can spot the other part of this as 0x01 in the Unused field, for which the docs say it should be a filler of all zeros.

So I concluded that the protocol docs on this are probably outdated and I started a few merge/pull requests for this:

First this MR for Wireshark to have it decode the 2 bytes that we have seen as a single collation number. And then this PR for mysql-server to update the docs. I also created this PR for MySQL Connector/Python to update the comments in the code.

After doing this I also added a comment to the MariaDB docs for the protocol. And here Vladislav Vaintroub quickly responded with a conclusion which was different than my conclusion. He checked the code to see what the server (both MySQL and MariaDB) is reading and there it is only reading one byte.

After that I have created another MR for Wireshark to revert some of my changes. I also added some comments to my previous PRs/bugreports.

I also did some more testing with Connector/Python and found out that depending on whether or not the C Extension is used and also depending on the version that it might set the collation during the handshake but also after the handshake with SET NAMES. And not just once, it sometimes does it twice. And the second time it only specifies the character set and not the collation, causing the collation to reset to the default one for that character set. I’ve filed this bug for that.

The current state is that the collation can only be set during the handshake if it is <= 255. For other cases you need to send a SET NAMES statement. This isn’t great as this adds at least one roundtrip to the connection setup.

As all character sets have a default collation that’s <=255 this is only an issue with some of the non-default collations. It might also become an issue collations that might be added in the future.

mysql> SELECT MAX(id) FROM information_schema.collations WHERE IS_DEFAULT='YES';
+---------+
| MAX(id) |
+---------+
|     255 |
+---------+
1 row in set (0.01 sec)

Another indication on why this only affects few users is the MySQL C API.

This is how one can set the character set:

mysql_options(&mysql, MYSQL_SET_CHARSET_NAME, "latin1");
mysql_real_connect(&mysql, ...);

Note that MySQL doesn’t send this latin1 string to the server, it looks up the default collation and sends the ID of that to the server instead. So anything that uses the C API isn’t affected by this as it has to use SET NAMES to set non-standard collations.

So I assume that the field in the handshake packet either wasn’t considered once collations with ID’s of more than 255 were added or it was a way to send the character set in a more efficient way.

Note that character sets’s have only names and no numeric ID.

As MySQL Connector/Python takes a collation name, but sends a collation ID it needs to keep a list so it can lookup the ID before connecting to the server. So to connect with a new collation you might have to update the MySQL Connector/Python version.

No comments:

Post a Comment