Monday, June 23, 2014

On Dolphins, Panda's and Bugs

MySQL Bugs

On Dolphins, Panda's and Bugs

Like any good OpenSource project the MySQL Bugs website is open for anyone to search through. This ofcourse doesn't include the security bugs.

There is a second collection of bugs in the My Oracle Support and these bugs are only accesseble by customers with a support contract. Even when I have access to MOS I still prefer to use the community bugs site. For service requests etc. I would use MOS.

The openness of the bugs database is one of the topic the IOUG MySQL Council discusses with Oracle.

The bugs database has more to offer than just information about initial bugs:

  • Bugs Statistics: This has a big matrix with components and states, some per developer stats and some totals (Did you known the fastest bug closure was 9 seconds?).
  • Bugs Tide This gives you per month statistics.

For both there are some filter option for version and whether to include feature requests.

You might want to read life cycle of a MySQL bug by Valeriy Kravchuk if you're not familiar with the different states of MySQL bugs.

The bugs website offers search results not only in the HTML reports as shown on the website but also offers RSS and CSV format so it can be consumed by computer programs.

To demostrate this I wrote this script. It fetches the CSV for my bugs and generates some graphs. As you might (or might not) notice this doesn't include my older bug reports as I have two different reporter-id's as I had different email accounts for my MySQL account and my My Oracle account before MySQL/Sun was merged with Oracle.

This page is made with IPython Notebook, which is a tool which allows you to combine code, output, graphs and text. You can also view this notebook here.

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

In [3]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from StringIO import StringIO
from datetime import datetime

This is the URL as found in the bugs website. Feel free to modify the parameters.

In [4]:
bugs_url = "http://bugs.mysql.com/search-csv.php?status=all&severity=all&mine=9242646"
  1. Fetch the data from the URL with the Requests library
  2. Convert the CSV data to a Pandas DataFrame (a DataFrame is a special kind of table)
In [5]:
bugs_req = requests.get(bugs_url)
csv_file = StringIO()
csv_file.write(bugs_req.text)
csv_file.seek(0)
bugsdf = pd.read_csv(csv_file)

Let's have a look at the data.

In [6]:
bugsdf.head()
Out[6]:
ID Entered Modified Type Status Severity Version OS Summary Assign First Assign Last
0 68132 2013-01-21 13:53:28 2013-03-26 16:14:20 Server: Docs Closed S3 5.6, 5.5.29 Any Documentation about verification with GnuPG is... Philip Olson
1 68199 2013-01-28 08:23:59 2013-05-13 13:10:00 bugs.mysql.com Closed S3 5.7 Any Month-by-month (tide) stats for 5.7 absent Sveta Smirnova
2 68200 2013-01-28 08:54:53 2013-03-10 16:03:34 Server: Privileges Closed S3 5.6.9-rc Any Password logging doesn't work as documented NaN NaN
3 68536 2013-03-01 10:29:46 2013-07-31 13:24:26 MySQL Workbench Closed S3 5.2.47 Any Better support IPv6 addresses for new connections NaN NaN
4 68571 2013-03-05 09:08:23 2014-02-26 09:41:24 Monitoring: Advisors/Rules Verified S3 NaN Microsoft Windows (Win7 SP1) False positive for 32-binary check on win64 NaN NaN

Now we have the data, let's make a horizontal barplot for the number of bugs per category (the Type column).

We change the size of the figure as the default is too small to be readable.

In [7]:
fig = plt.figure(figsize=(8,10), dpi=100)
bugsdf.Type.value_counts(ascending=True).plot(kind='barh')
Out[7]:
<matplotlib.axes.AxesSubplot at 0x394de90>

The Version column has a the text for the version. Let's grab the first three characters to get the major version of for the bug. This is not really perfect as it will only return 5.1 if the string is '5.1.30, 5.5.16', but it's good enough for now.

The function will be mapped to the Version column of the dataframe. And we will save the result in a new column called major_version.

In [8]:
def getversion(inputver):
    if isinstance(inputver, float):
        return inputver
    return inputver[:3]
bugsdf['major_version'] = bugsdf.Version.map(getversion)

Besides Pandas we can use matplotlib's pyplot, which is a bit like MATLAB.

Let's create a heatmap for bug status and category

In [9]:
compstat = bugsdf.groupby('major_version').Type.value_counts().unstack().T
fig = plt.figure(figsize=(15,10), dpi=100)
plt.pcolor(compstat, vmin=0, vmax=5, cmap='Blues')
plt.yticks(np.arange(0.5, len(compstat.index), 1), compstat.index)
plt.xticks(np.arange(0.5, len(compstat.columns), 1), compstat.columns)
Out[9]:
([<matplotlib.axis.XTick at 0x3c067d0>,
  <matplotlib.axis.XTick at 0x3d87410>,
  <matplotlib.axis.XTick at 0x3f2d550>,
  <matplotlib.axis.XTick at 0x3f2da50>,
  <matplotlib.axis.XTick at 0x3f2df50>,
  <matplotlib.axis.XTick at 0x3f15490>,
  <matplotlib.axis.XTick at 0x3f15990>,
  <matplotlib.axis.XTick at 0x3f15e90>,
  <matplotlib.axis.XTick at 0x40983d0>,
  <matplotlib.axis.XTick at 0x4098b10>,
  <matplotlib.axis.XTick at 0x3d8f2d0>],
 <a list of 11 Text xticklabel objects>)

Now we can create a heatmap which compares major versions and components.

In [10]:
fig = plt.figure(figsize=(8,10), dpi=100)
plt.pcolor(compstat, vmin=0, vmax=10, cmap='Blues')
plt.yticks(np.arange(0.5, len(compstat.index), 1), compstat.index)
plt.xticks(np.arange(0.5, len(compstat.columns), 1), compstat.columns)
Out[10]:
([<matplotlib.axis.XTick at 0x3f15bd0>,
  <matplotlib.axis.XTick at 0x3f176d0>,
  <matplotlib.axis.XTick at 0x42bcad0>,
  <matplotlib.axis.XTick at 0x42bcfd0>,
  <matplotlib.axis.XTick at 0x4459d50>,
  <matplotlib.axis.XTick at 0x4455710>,
  <matplotlib.axis.XTick at 0x4451310>,
  <matplotlib.axis.XTick at 0x42cdc50>,
  <matplotlib.axis.XTick at 0x42c9810>,
  <matplotlib.axis.XTick at 0x42c51d0>,
  <matplotlib.axis.XTick at 0x42bdb10>],
 <a list of 11 Text xticklabel objects>)

Now we have the major version, let's filter on MySQL 5.6 and then graph the different values for the Status field.

In [11]:
bugsdf[bugsdf.major_version == '5.6'].Status.value_counts().plot(kind='bar')
Out[11]:
<matplotlib.axes.AxesSubplot at 0x4451e10>

Or maybe check the status for all versions.

In [12]:
bugsdf.groupby('major_version').Status.value_counts().unstack().plot(kind='barh', stacked=True)
Out[12]:
<matplotlib.axes.AxesSubplot at 0x446f790>

That's not really helpful, let's remove the Closed ones.

In [13]:
bugsdf[bugsdf.Status != 'Closed'].groupby('major_version').Status.value_counts().unstack().plot(kind='bar', stacked=True)
Out[13]:
<matplotlib.axes.AxesSubplot at 0x4464650>

The Entered and Modified fields are not yet in a 'real' date format. So these must be coverted before we can use them.

In [14]:
bugsdf['Entered'] = pd.to_datetime(bugsdf.Entered)
bugsdf['Modified'] = pd.to_datetime(bugsdf.Modified)

Which are the oldest open bugs?

The - before the bugsdf.Status.isin reverses the result, so it behave like 'is not in'.

In [15]:
bugsdf[-bugsdf.Status.isin(['Closed', 'Duplicate', 'Won\'t fix', 'Can\'t repeat'])].sort(columns='Entered')[:5]
Out[15]:
ID Entered Modified Type Status Severity Version OS Summary Assign First Assign Last major_version
4 68571 2013-03-05 09:08:23 2014-02-26 09:41:24 Monitoring: Advisors/Rules Verified S3 NaN Microsoft Windows (Win7 SP1) False positive for 32-binary check on win64 NaN NaN NaN
5 68574 2013-03-05 13:42:26 2014-02-23 11:25:48 Server: InnoDB Verified S3 5.5.30, 5.6.16 Microsoft Windows (win7 sp1 ent) No instrumentation for InnoDB files in P_S on ... Marc Alff 5.5
19 68925 2013-04-11 05:57:29 2013-04-11 06:15:40 Client Verified S3 5.6.10, 5.5.29 Any Compatibility issue with mysql history ("\040"... NaN NaN 5.6
22 69147 2013-05-05 10:07:16 2013-07-28 14:43:57 Server: I_S Verified S4 5.7 Any Variable defaults in information_schema NaN NaN 5.7
33 69223 2013-05-14 06:22:48 2013-06-24 20:03:16 Server: DDL Verified S4 5.6.10 Any Give a warning on CREATE TABLE without PRIMARY... NaN NaN 5.6

Now let's find bugs which are open and not modified in a some time.

In [17]:
bugsdf[-bugsdf.Status.isin(['Closed', 'Duplicate', 'Won\'t fix', 'Can\'t repeat'])] \
      [bugsdf.Modified < datetime(2013, 6, 1)] \
      [bugsdf.Modified > datetime(1970, 1, 1)] 
Out[17]:
ID Entered Modified Type Status Severity Version OS Summary Assign First Assign Last major_version
19 68925 2013-04-11 05:57:29 2013-04-11 06:15:40 Client Verified S3 5.6.10, 5.5.29 Any Compatibility issue with mysql history ("\040"... NaN NaN 5.6
35 69226 2013-05-14 10:31:09 2013-05-14 16:09:09 Server: Options Verified S3 5.6.11 Any Status variable for SSL/TLS implementation NaN NaN 5.6
36 69314 2013-05-24 21:33:48 2013-05-25 07:55:24 Server: Options Verified S3 5.7.1 Any ignore-db-dir option doesn't work for database... NaN NaN 5.7

Besides searching for bugs by reporter you can also use this to search on other conditions.

Monday, June 9, 2014

Quick test run with infobright

After reading Jonathan Levin's article about infobright I decided I had to try Infobright.

So I downloaded the 32-bit tarball. Normally I would have gone for a 64-bit build but those only came in RPM and DEB flavour.

Then I tried to run infobright community edition (ICE) in a MySQL Sandbox, but that failed as the resolveip utility failed to give an answer for localhost.


error while creating grant tables
Neither host 'daniel-thinkpad' nor 'localhost' could be looked up with
/home/dveeden/opt/mysql/4.0.7-ice/bin/resolveip
Please configure the 'hostname' command to return a correct
hostname.
If you want to solve this at a later stage, restart this script
with the --force option

Then I used docker. This went really smooth. You can grab my docker file here.

After creating a container with infobright I tried to insert some data... as I thought only UPDATE and DELETE were prohibited with ICE. But it turned out that also INSERT and ALTER TABLE .. ENGINE=.. are prohibited.

So you have to use LOAD DATA IN FILE which is fully documented in their data loading guide.

But there is an important difference between ICE and IEE (infobright enterprise edition): ICE uses the infobright loader and IEE uses the MySQL loader by default.

Loading some test data from the distroinfo-data package:

mysql> create table distroinfo(version decimal(3,1), codename varchar(20), series varchar(20), created datetime, `release` datetime, eol datetime);
Query OK, 0 rows affected (0.00 sec)

mysql> load data infile '/usr/share/distro-info/ubuntu.csv' into table distroinfo fields terminated by ',';
Query OK, 21 rows affected (0.06 sec)
Records: 21  Deleted: 0  Skipped: 0  Warnings: 0

But If I do the same with a MyISAM table:
mysql> load data infile '/usr/share/distro-info/ubuntu.csv' into table distroinfo2 fields terminated by ',';
Query OK, 21 rows affected, 23 warnings (0.00 sec)
Records: 21  Deleted: 0  Skipped: 0  Warnings: 22

So installing Infobright in a docker container is easy. Both ICE and IEE seem to be useful if you take the limitations of each in account. And you should be extremly careful with loading data with the Infobright loader as it might silently truncate your data!

Saturday, March 29, 2014

Notes on the AES encryption in MySQL

Oracle has improved the AES encryption/decryption functions in MySQL 5.6.17. They improved it a lot and posted a blog which explains all the details.

If you would like to know more about encryption there are two resources I would recommend:
  • The Code Book by Simon Singh. This is about the history of cryptography, but it also includes a lot of information about crypto which is currently in use. This is also a very entertaining read.
  • Crypto 101, a free/opensource book which gives a intro to crypto. The webpage also has a video of the talk on which the book is based.
And if you're going to use the AES encryption functions in MySQL there are some things you should consider:
  • Connections with a UNIX socket (or shared memory on Windows) should be safe. Local TCP/IP connections are also safe in most circumstances.
  • Remote connections which use the AES encryption should use a protected connection (SSL, VPN, etc). Otherwise someone might be able to sniff the network traffice which contains your encryption key.
  • Replication traffic should also be protected (especially if you're using statement based replication. Mixed or row based might be safer).
  • There is a trick to prevent the key to be logged:
    • SET @encrypted := AES_ENCRYPT('foo', 'bar');
    • INSERT INTO t1(c1) VALUES(@encrypted);
  • MySQL 5.6 prevents logging of passwords, but this doesn't include the AES functions (yet). (Bug #72158). So be careful with your logging.
  • It is possible to use the AES_DECRYPT in a view definition, but then your probably storing your key with your data, which defeats the purpose. The same is true for storing your key in a stored procedure, function or a virtual column in case of MariaDB.
And there are alternatives to using encryption within MySQL: Your application could encrypt the data before it's sent over the network. And depending on why you need encryption it might also be possible to encrypt the whole data directory. This can be achieved with LUKS or eCryptFS if you're on Linux or ZFS if you're on Solaris. On Windows you might use the EFS feature of NTFS or use Bitlocker.

The AES functions in MySQL are a good solution if you only want to encrypt specific data (which is often the case) or if you need to search on the decrypted data (WHERE AES_DECRYPT(..)='foo')

Friday, March 14, 2014

MySQL NL Meetup March 18

The next MySQL User Group NL meetup is this Tuesday at the Oracle office in Utrecht.

For more info go to the Event page @ Meetup.com

Saturday, February 8, 2014

Unittesting your indexes

During FOSDEM PGDay I watched the "Indexes: The neglected performance all-rounder" talk by Markus Winand. Both his talk and the "SQL Performance Explained" book (which is also available online) are great.

The conclusion of the talk is that we should put more effort in carefully designing indexes. But how can we make sure the indexes are really used now and in the future? We need to write some tests for it.

So I wrote a small Python script to test index usage per query. This uses the JSON explain format available in MySQL 5.6. It's just a proof-of-concept so don't expect too much of it yet (but please sent pull requests!).

A short example:
#!/usr/bin/python3
import indextest


class tester(indextest.IndexTester):
    def __init__(self):
        dbparams = { 'user': 'msandbox',
                     'password': 'msandbox',
                     'host': 'localhost',
                     'port': '5615',
                     'database': 'imdb'}
        self.dbparams = dbparams

    def test_query1(self):
        q1 = self.query("SELECT * FROM movies WHERE rank>9.8")
        return q1.testEqual('query_block.table.access_type', 'range')

    def test_query2(self):
        q2 = self.query("SELECT * FROM actors WHERE first_name='Tom'")
        return q2.testEqual('query_block.table.access_type', 'range')

    def test_query3(self):
        q3 = self.query("SELECT * FROM actors WHERE first_name='%Tom'")
        return q3.testEqual('query_block.table.key', 'idx_first_name')

run = tester()
run.runall()
$ ./example.py 
Query: SELECT * FROM movies WHERE rank>9.8
Test: query_block.table.access_type == range
Result: range == range: True

Query: SELECT * FROM actors WHERE first_name='Tom'
Test: query_block.table.access_type == range
Result: range == ref: False

Query: SELECT * FROM actors WHERE first_name='%Tom'
Test: query_block.table.key == idx_first_name
Result: idx_first_name == idx_first_name: True

Tested 3 queries, Pass: 2, Fail: 1

The code is available on github