Sunday, June 24, 2012

That's not my name! A story about character sets

When computers were still using large black text oriented screens or no screens at all, a computer only knew how to store a limited set of characters. Then it was normal to store a name with the more complicated characters replaced by more basic characters. The ASCII standard was used to make communication between multiple systems (or applications) easier. Storing characters as ASCII needs little space and is quite strait forward.

Then DOS used CP850 and CP437 and so on to make it possible to use language /location specific characters.
Then ISO8859-1, ISO8859-15 and more of these character sets were defined as standard.

And now there is Unicode: UTF-8, UTF-16, UCS2, etc. which allow you to store many different kinds of characters in the same character set.

But all those character sets only work correctly if you configure all applications correctly. Many of the character sets are very similar and seem to work correctly even if one of the systems is not correctly configured. If this happens most characters will be correct except the special ones.  And my name does contain a 'special' character, the 'ë'.

Below is a picture of two letters I received in the mail recently:





So what went wrong?

This is called Mojibake.

The first one:

mysql> CREATE TABLE t1(v1 VARCHAR(200)) ENGINE=InnoDB DEFAULT CHARACTER SET utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> SHOW SESSION VARIABLES LIKE '%char%';
+--------------------------+----------------------------------------------------+
| Variable_name            | Value                                              |
+--------------------------+----------------------------------------------------+
| character_set_client     | utf8                                               |
| character_set_connection | utf8                                               |
| character_set_database   | latin1                                             |
| character_set_filesystem | binary                                             |
| character_set_results    | utf8                                               |
| character_set_server     | latin1                                             |
| character_set_system     | utf8                                               |
| character_sets_dir       | /home/dveeden/mysql/5.5.22-mariadb/share/charsets/ |
+--------------------------+----------------------------------------------------+
8 rows in set (0.00 sec)

mysql> INSERT INTO t1 VALUES('Daniël van Eeden');
Query OK, 1 row affected (0.01 sec)

mysql> SELECT * FROM t1;
+-------------------+
| v1                |
+-------------------+
| Daniël van Eeden  |
+-------------------+
1 row in set (0.00 sec)

mysql> set session character_set_client=latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> set session character_set_connection=latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO t1 VALUES('Daniël van Eeden');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t1;
+---------------------+
| v1                  |
+---------------------+
| Daniël van Eeden    |
| Daniël van Eeden   |
+---------------------+
2 rows in set (0.00 sec)

mysql> INSERT INTO t1 VALUES('Daniël van Eeden');
Query OK, 1 row affected (0.00 sec)

mysql> SELECT * FROM t1;
+---------------------+
| v1                  |
+---------------------+
| Daniël van Eeden    |
| Daniël van Eeden   |
| Daniël van Eeden   |
+---------------------+
3 rows in set (0.00 sec)

So we can reproduce this issue by setting the client and connection charset to latin1.

mysql> SELECT v1,HEX(v1) FROM t1;
+---------------------+----------------------------------------+
| v1                  | HEX(v1)                                |
+---------------------+----------------------------------------+
| Daniël van Eeden    | 44616E69C3AB6C2076616E20456564656E     |
| Daniël van Eeden   | 44616E69C383C2AB6C2076616E20456564656E |
| Daniël van Eeden   | 44616E69C383C2AB6C2076616E20456564656E |
+---------------------+----------------------------------------+
3 rows in set (0.00 sec)

mysql> SELECT CONVERT(X'C3AB' USING utf8),CONVERT(X'C3AB' USING latin1);
+-----------------------------+-------------------------------+
| CONVERT(X'C3AB' USING utf8) | CONVERT(X'C3AB' USING latin1) |
+-----------------------------+-------------------------------+
| ë                           | ë                            |
+-----------------------------+-------------------------------+
1 row in set (0.00 sec)

The ë is stored as C383 C2AB and is rendered as two latin1 characters.

unicode C3 83 (Ã) is latin1 C3
unicode C2 AB («) is latin1 AB
C3AB = unicode ë

charutf8latin1cp850
ëC3 ABEB89
ÃC3 83C3C7
«C2 ABABAE
ÙC3 99D9EB

mysql> SELECT CONVERT(X'C3AB' USING latin1);
+-------------------------------+
| CONVERT(X'C3AB' USING latin1) |
+-------------------------------+
| ë                            |
+-------------------------------+
1 row in set (0.00 sec)

The sender of the first letter is probably storing UTF-8 from the web application in a database, but uses Latin1 when sending the letters.

The second letter renders the ë as Ù.

After some tries it seems like the sender is storing the ë as EB (latin1) and rendering it as CP850.

mysql> SELECT CONVERT(X'EB' USING cp850);
+----------------------------+
| CONVERT(X'EB' USING cp850) |
+----------------------------+
| Ù                          |
+----------------------------+
1 row in set (0.00 sec)

Anders Karslsson recently wrote a nice introduction about Unicode. And Ronald Bradford wrote a blogpost about UTF-8 With MySQL and LAMP.

Sunday, June 3, 2012

XA Transactions between TokuDB and InnoDB

The recently released TokuDB brings many features. One of those features is support for XA Transactions. InnoDB already has support for XA Transactions.

XA Transactions are transactions which span multiple databases and or applications. XA Transactions use 2-phase commit, which is also the same method which MySQL Cluster uses.

Internal XA Transactions are used to keep the binary log and InnoDB in sync.

Demo 1: XA Transaction on 1 node:
mysql55-tokudb6> XA START 'demo01';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> INSERT INTO xatest(name) VALUES('demo01');
Query OK, 1 row affected (0.01 sec)

mysql55-tokudb6> SELECT * FROM xatest;
+----+--------+
| id | name   |
+----+--------+
|  3 | demo01 |
+----+--------+
1 row in set (0.00 sec)

mysql55-tokudb6> XA END 'demo01';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> XA PREPARE 'demo01';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> XA COMMIT 'demo01';
Query OK, 0 rows affected (0.00 sec) 
This show a transaction with a transaction ID of 'demo01'.    

XA START Starts the transaction and puts it in ACTIVE state.  
XA END Ends the transaction and puts it IDLE state.
XA PREPARE Will prepare the transaction and puts it in PREPARED state.  
Then XA COMMIT can be used to COMMIT the transaction or XA ROLLBACK can be used to rollback the transaction.    

Demo 2: XA Transaction between 2 nodes:
mysql55-tokudb6> XA START 'tr01';
Query OK, 0 rows affected (0.00 sec)
 
mysql56-innodb> XA START 'tr01';
Query OK, 0 rows affected (0.01 sec)

mysql55-tokudb6> INSERT INTO xatest(name) VALUES('tr01');
Query OK, 1 row affected (0.00 sec)
 
mysql56-innodb> INSERT INTO xatest(name) VALUES('tr01');
Query OK, 1 row affected (0.00 sec)

mysql55-tokudb6> XA END 'tr01';
Query OK, 0 rows affected (0.00 sec)
 
mysql56-innodb> XA END 'tr01';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> XA PREPARE 'tr01';
Query OK, 0 rows affected (0.00 sec)
 
mysql56-innodb> XA PREPARE 'tr01';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> XA COMMIT 'tr01';
Query OK, 0 rows affected (0.00 sec)
 
mysql56-innodb> XA COMMIT 'tr01';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> SELECT * FROM xatest;
+----+------+
| id | name |
+----+------+
|  1 | tr01 |
+----+------+
1 row in set (0.00 sec)

mysql56-innodb> SELECT * FROM xatest;
+----+------+
| id | name |
+----+------+
|  1 | tr01 |
+----+------+
1 row in set (0.00 sec) 
 
Demo 3: XA Transaction with rollback:
mysql55-tokudb6> XA START 'tr02';
Query OK, 0 rows affected (0.00 sec)
 
mysql56-innodb> XA START 'tr02';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> INSERT INTO xatest(name) VALUES('tr02');
Query OK, 1 row affected (0.00 sec)
 
mysql56-innodb> INSERT INTO xatest(name) VALUES('tr02');
Query OK, 1 row affected (0.00 sec)

mysql55-tokudb6> XA END 'tr02';
Query OK, 0 rows affected (0.00 sec)
 
mysql56-innodb> XA END 'tr02';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> XA PREPARE 'tr02';
Query OK, 0 rows affected (0.00 sec)
 
mysql56-innodb> XA PREPARE 'tr02';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> XA ROLLBACK 'tr02';
Query OK, 0 rows affected (0.00 sec)

mysql56-innodb> XA ROLLBACK 'tr02';
Query OK, 0 rows affected (0.00 sec)
 
mysql56-innodb> XA ROLLBACK 'tr02';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> SELECT * FROM xatest;
+----+------+
| id | name |
+----+------+
|  1 | tr01 |
+----+------+
1 row in set (0.00 sec)

mysql56-innodb> SELECT * FROM xatest;
+----+------+
| id | name |
+----+------+
|  1 | tr01 |
+----+------+
1 row in set (0.00 sec)



 
 
These transactions are between two MySQL instances: One with InnoDB and one with TokuDB. It's possible to run TokuDB and InnoDB both in one database instance, but separating them between instances (and hosts) might be needed for performance or some other reason.

It's possible to run a transaction with TokuDB and a PostgreSQL database:
mysql55-tokudb6> XA START 'tr03';
Query OK, 0 rows affected (0.00 sec)

xatest=# BEGIN;
BEGIN

mysql55-tokudb6> INSERT INTO xatest(name) VALUES('tr03');
Query OK, 1 row affected (0.00 sec)

mysql55-tokudb6> DELETE FROM xatest WHERE name='tr02';
Query OK, 0 rows affected (0.00 sec)

xatest=# INSERT INTO xatest(name) VALUES('tr03');
INSERT 0 1
mysql55-tokudb6> XA END 'tr03';
Query OK, 0 rows affected (0.00 sec)

mysql55-tokudb6> XA PREPARE 'tr03';
Query OK, 0 rows affected (0.00 sec)
xatest=# PREPARE TRANSACTION 'tr03';
PREPARE TRANSACTION
mysql55-tokudb6> XA COMMIT 'tr03';
Query OK, 0 rows affected (0.00 sec)

xatest=# COMMIT PREPARED 'tr03';
COMMIT PREPARED

For PostgreSQL this only works if max_prepared_transactions is set to a non-zero value.

Transactions can not only run between databases but applications, filesystems and many other components can also be a member of the transactions. They need to support two-phase commit.

An XA Transaction is coordinated by a transaction coordinator which can be an application on a application server.