Saturday, December 26, 2015

The performance of MySQL Connector/Python with C Extension

The source of this post is in this gist on nbviewer.

After reading about the difference between MySQL Connector/Python and MySQLdb on this blog post I wondered how the C Extension option in Connector/Python would perform.

If you want to run the code yourself you'll need: Jupyter/IPython, Python 3, Requests, MySQLdb, Connector/Python, Matplotlib, Pandas and MySQL.

In [1]:
%matplotlib notebook
In [2]:
import random
import gzip
import time

import pandas as pd
import matplotlib.pyplot as plt
import requests
import mysql.connector
import MySQLdb
for imp in [mysql.connector, MySQLdb]:
    print('Using {imp} {version}'.format(imp=imp.__name__, version=imp.__version__))
print('C Extension for MySQL Connector/Python available: %s' % mysql.connector.HAVE_CEXT)
Using mysql.connector 2.1.3
Using MySQLdb 1.3.7
C Extension for MySQL Connector/Python available: True

Make sure the C Extension is available. This needs MySQL Connector/Python 2.1 or newer. On Fedora you might need to install this with dnf install mysql-connector-python3-cext if you have the mysql-connectors-community repository installed. If you compile from source then make sure to use the --with-mysql-capi option.

In [3]:
worlddb_url = 'https://downloads.mysql.com/docs/world.sql.gz'
worlddb_req = requests.get(worlddb_url)
if worlddb_req.status_code == 200:
    worldsql = gzip.decompress(worlddb_req.content).decode('iso-8859-15')
In [4]:
config = {
    'host': '127.0.0.1',
    'port': 5710,
    'user': 'msandbox',
    'passwd': 'msandbox',
}

The above is my config to connect to a MySQL Sandbox running MySQL Server 5.7.10.

Note: you might hit MySQL Bug #79780 when loading the world database into MySQL with Connector/Python with the C Extension enabled.

In [5]:
c1 = mysql.connector.connect(use_pure=False, **config)
cur1 = c1.cursor()
cur1.execute('DROP SCHEMA IF EXISTS world')
cur1.execute('CREATE SCHEMA world DEFAULT CHARACTER SET latin1')
cur1.execute('USE world')
result = [x for x in cur1.execute(worldsql, multi=True)]
cur1.close()
c1.close()
In [6]:
config['db'] = 'world'
In [7]:
perfdata = pd.DataFrame(columns=['connpy','connpy_cext','MySQLdb'], index=range(10000))

Now we're going to run 10000 queries with a random primary key between 1 and 8000. This does not use the C Extension as use_pure is set to True.

In [8]:
c1 = mysql.connector.connect(use_pure=True, **config)
cur1 = c1.cursor()
for it in range(10000):
    city_id = random.randint(1,8000)
    start = time.perf_counter()
    cur1.execute("SELECT * FROM City WHERE ID=%s", (city_id,))
    cur1.fetchone()
    perfdata.ix[it]['connpy'] = time.perf_counter() - start

Next up is Connector/Python with the C Extension (use_pure=False and HAVE_CEXT indicates we have the C Extension available)

In [9]:
c1 = mysql.connector.connect(use_pure=False, **config)
cur1 = c1.cursor()
for it in range(10000):
    city_id = random.randint(1,8000)
    start = time.perf_counter()
    cur1.execute("SELECT * FROM City WHERE ID=%s", (city_id,))
    cur1.fetchone()
    perfdata.ix[it]['connpy_cext'] = time.perf_counter() - start

And last, but not least, MySQLdb.

In [10]:
c2 = MySQLdb.connect(**config)
cur2 = c2.cursor()
for it in range(10000):
    city_id = random.randint(1,8000)
    start = time.perf_counter()
    cur2.execute("SELECT * FROM City WHERE ID=%s", (city_id,))
    cur2.fetchone()
    perfdata.ix[it]['MySQLdb'] = time.perf_counter() - start

Now let's have a look to what our data looks like

In [11]:
perfdata.head()
Out[11]:
connpy connpy_cext MySQLdb
0 0.00145918 0.000354935 0.000353173
1 0.000907707 0.000243508 0.000249597
2 0.000468397 0.000277101 0.000207893
3 0.000595066 0.000241349 0.00020754
4 0.000641848 0.000258027 0.000193182

Now let's plot that

In [12]:
plt.style.use('ggplot')
plt.scatter(perfdata.index, perfdata.connpy, s=1, c='r',
            label='Connector/Python Pure')
plt.scatter(perfdata.index, perfdata.connpy_cext, s=1, c='g',
            label='Connector/Python C Ext')
plt.scatter(perfdata.index, perfdata.MySQLdb, s=1, c='b',
            label='MySQLdb')
plt.ylim(ymin=0, ymax=0.001)
plt.xlim(xmin=0, xmax=10000)
plt.xlabel('Run #')
plt.ylabel('Runtime in seconds')
plt.legend()
Out[12]:
<matplotlib.legend.Legend at 0x7f47bd8c2518>

The performance of MySQL Connector/Python 2.1 with the C Extension is much closer to MySQLdb.

There is one serious drawback of using the C Extension: Prepared statements are not yet supported.

But there is more that performance alone. I like the MySQL Connector/Python API over the MySQLdb API and this shows that Connector/Python is flexible and can be almost as fast.

The pure Python implementation also has advantages: Easier installation (no C compiler required) and the option to use alternative implementations like PyPy