Monday, June 23, 2014

On Dolphins, Panda's and Bugs

MySQL Bugs

On Dolphins, Panda's and Bugs

Like any good OpenSource project the MySQL Bugs website is open for anyone to search through. This ofcourse doesn't include the security bugs.

There is a second collection of bugs in the My Oracle Support and these bugs are only accesseble by customers with a support contract. Even when I have access to MOS I still prefer to use the community bugs site. For service requests etc. I would use MOS.

The openness of the bugs database is one of the topic the IOUG MySQL Council discusses with Oracle.

The bugs database has more to offer than just information about initial bugs:

  • Bugs Statistics: This has a big matrix with components and states, some per developer stats and some totals (Did you known the fastest bug closure was 9 seconds?).
  • Bugs Tide This gives you per month statistics.

For both there are some filter option for version and whether to include feature requests.

You might want to read life cycle of a MySQL bug by Valeriy Kravchuk if you're not familiar with the different states of MySQL bugs.

The bugs website offers search results not only in the HTML reports as shown on the website but also offers RSS and CSV format so it can be consumed by computer programs.

To demostrate this I wrote this script. It fetches the CSV for my bugs and generates some graphs. As you might (or might not) notice this doesn't include my older bug reports as I have two different reporter-id's as I had different email accounts for my MySQL account and my My Oracle account before MySQL/Sun was merged with Oracle.

This page is made with IPython Notebook, which is a tool which allows you to combine code, output, graphs and text. You can also view this notebook here.

In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

In [3]:
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from StringIO import StringIO
from datetime import datetime

This is the URL as found in the bugs website. Feel free to modify the parameters.

In [4]:
bugs_url = "http://bugs.mysql.com/search-csv.php?status=all&severity=all&mine=9242646"
  1. Fetch the data from the URL with the Requests library
  2. Convert the CSV data to a Pandas DataFrame (a DataFrame is a special kind of table)
In [5]:
bugs_req = requests.get(bugs_url)
csv_file = StringIO()
csv_file.write(bugs_req.text)
csv_file.seek(0)
bugsdf = pd.read_csv(csv_file)

Let's have a look at the data.

In [6]:
bugsdf.head()
Out[6]:
ID Entered Modified Type Status Severity Version OS Summary Assign First Assign Last
0 68132 2013-01-21 13:53:28 2013-03-26 16:14:20 Server: Docs Closed S3 5.6, 5.5.29 Any Documentation about verification with GnuPG is... Philip Olson
1 68199 2013-01-28 08:23:59 2013-05-13 13:10:00 bugs.mysql.com Closed S3 5.7 Any Month-by-month (tide) stats for 5.7 absent Sveta Smirnova
2 68200 2013-01-28 08:54:53 2013-03-10 16:03:34 Server: Privileges Closed S3 5.6.9-rc Any Password logging doesn't work as documented NaN NaN
3 68536 2013-03-01 10:29:46 2013-07-31 13:24:26 MySQL Workbench Closed S3 5.2.47 Any Better support IPv6 addresses for new connections NaN NaN
4 68571 2013-03-05 09:08:23 2014-02-26 09:41:24 Monitoring: Advisors/Rules Verified S3 NaN Microsoft Windows (Win7 SP1) False positive for 32-binary check on win64 NaN NaN

Now we have the data, let's make a horizontal barplot for the number of bugs per category (the Type column).

We change the size of the figure as the default is too small to be readable.

In [7]:
fig = plt.figure(figsize=(8,10), dpi=100)
bugsdf.Type.value_counts(ascending=True).plot(kind='barh')
Out[7]:
<matplotlib.axes.AxesSubplot at 0x394de90>

The Version column has a the text for the version. Let's grab the first three characters to get the major version of for the bug. This is not really perfect as it will only return 5.1 if the string is '5.1.30, 5.5.16', but it's good enough for now.

The function will be mapped to the Version column of the dataframe. And we will save the result in a new column called major_version.

In [8]:
def getversion(inputver):
    if isinstance(inputver, float):
        return inputver
    return inputver[:3]
bugsdf['major_version'] = bugsdf.Version.map(getversion)

Besides Pandas we can use matplotlib's pyplot, which is a bit like MATLAB.

Let's create a heatmap for bug status and category

In [9]:
compstat = bugsdf.groupby('major_version').Type.value_counts().unstack().T
fig = plt.figure(figsize=(15,10), dpi=100)
plt.pcolor(compstat, vmin=0, vmax=5, cmap='Blues')
plt.yticks(np.arange(0.5, len(compstat.index), 1), compstat.index)
plt.xticks(np.arange(0.5, len(compstat.columns), 1), compstat.columns)
Out[9]:
([<matplotlib.axis.XTick at 0x3c067d0>,
  <matplotlib.axis.XTick at 0x3d87410>,
  <matplotlib.axis.XTick at 0x3f2d550>,
  <matplotlib.axis.XTick at 0x3f2da50>,
  <matplotlib.axis.XTick at 0x3f2df50>,
  <matplotlib.axis.XTick at 0x3f15490>,
  <matplotlib.axis.XTick at 0x3f15990>,
  <matplotlib.axis.XTick at 0x3f15e90>,
  <matplotlib.axis.XTick at 0x40983d0>,
  <matplotlib.axis.XTick at 0x4098b10>,
  <matplotlib.axis.XTick at 0x3d8f2d0>],
 <a list of 11 Text xticklabel objects>)

Now we can create a heatmap which compares major versions and components.

In [10]:
fig = plt.figure(figsize=(8,10), dpi=100)
plt.pcolor(compstat, vmin=0, vmax=10, cmap='Blues')
plt.yticks(np.arange(0.5, len(compstat.index), 1), compstat.index)
plt.xticks(np.arange(0.5, len(compstat.columns), 1), compstat.columns)
Out[10]:
([<matplotlib.axis.XTick at 0x3f15bd0>,
  <matplotlib.axis.XTick at 0x3f176d0>,
  <matplotlib.axis.XTick at 0x42bcad0>,
  <matplotlib.axis.XTick at 0x42bcfd0>,
  <matplotlib.axis.XTick at 0x4459d50>,
  <matplotlib.axis.XTick at 0x4455710>,
  <matplotlib.axis.XTick at 0x4451310>,
  <matplotlib.axis.XTick at 0x42cdc50>,
  <matplotlib.axis.XTick at 0x42c9810>,
  <matplotlib.axis.XTick at 0x42c51d0>,
  <matplotlib.axis.XTick at 0x42bdb10>],
 <a list of 11 Text xticklabel objects>)

Now we have the major version, let's filter on MySQL 5.6 and then graph the different values for the Status field.

In [11]:
bugsdf[bugsdf.major_version == '5.6'].Status.value_counts().plot(kind='bar')
Out[11]:
<matplotlib.axes.AxesSubplot at 0x4451e10>

Or maybe check the status for all versions.

In [12]:
bugsdf.groupby('major_version').Status.value_counts().unstack().plot(kind='barh', stacked=True)
Out[12]:
<matplotlib.axes.AxesSubplot at 0x446f790>

That's not really helpful, let's remove the Closed ones.

In [13]:
bugsdf[bugsdf.Status != 'Closed'].groupby('major_version').Status.value_counts().unstack().plot(kind='bar', stacked=True)
Out[13]:
<matplotlib.axes.AxesSubplot at 0x4464650>

The Entered and Modified fields are not yet in a 'real' date format. So these must be coverted before we can use them.

In [14]:
bugsdf['Entered'] = pd.to_datetime(bugsdf.Entered)
bugsdf['Modified'] = pd.to_datetime(bugsdf.Modified)

Which are the oldest open bugs?

The - before the bugsdf.Status.isin reverses the result, so it behave like 'is not in'.

In [15]:
bugsdf[-bugsdf.Status.isin(['Closed', 'Duplicate', 'Won\'t fix', 'Can\'t repeat'])].sort(columns='Entered')[:5]
Out[15]:
ID Entered Modified Type Status Severity Version OS Summary Assign First Assign Last major_version
4 68571 2013-03-05 09:08:23 2014-02-26 09:41:24 Monitoring: Advisors/Rules Verified S3 NaN Microsoft Windows (Win7 SP1) False positive for 32-binary check on win64 NaN NaN NaN
5 68574 2013-03-05 13:42:26 2014-02-23 11:25:48 Server: InnoDB Verified S3 5.5.30, 5.6.16 Microsoft Windows (win7 sp1 ent) No instrumentation for InnoDB files in P_S on ... Marc Alff 5.5
19 68925 2013-04-11 05:57:29 2013-04-11 06:15:40 Client Verified S3 5.6.10, 5.5.29 Any Compatibility issue with mysql history ("\040"... NaN NaN 5.6
22 69147 2013-05-05 10:07:16 2013-07-28 14:43:57 Server: I_S Verified S4 5.7 Any Variable defaults in information_schema NaN NaN 5.7
33 69223 2013-05-14 06:22:48 2013-06-24 20:03:16 Server: DDL Verified S4 5.6.10 Any Give a warning on CREATE TABLE without PRIMARY... NaN NaN 5.6

Now let's find bugs which are open and not modified in a some time.

In [17]:
bugsdf[-bugsdf.Status.isin(['Closed', 'Duplicate', 'Won\'t fix', 'Can\'t repeat'])] \
      [bugsdf.Modified < datetime(2013, 6, 1)] \
      [bugsdf.Modified > datetime(1970, 1, 1)] 
Out[17]:
ID Entered Modified Type Status Severity Version OS Summary Assign First Assign Last major_version
19 68925 2013-04-11 05:57:29 2013-04-11 06:15:40 Client Verified S3 5.6.10, 5.5.29 Any Compatibility issue with mysql history ("\040"... NaN NaN 5.6
35 69226 2013-05-14 10:31:09 2013-05-14 16:09:09 Server: Options Verified S3 5.6.11 Any Status variable for SSL/TLS implementation NaN NaN 5.6
36 69314 2013-05-24 21:33:48 2013-05-25 07:55:24 Server: Options Verified S3 5.7.1 Any ignore-db-dir option doesn't work for database... NaN NaN 5.7

Besides searching for bugs by reporter you can also use this to search on other conditions.