On Dolphins, Panda's and Bugs
Like any good OpenSource project the MySQL Bugs website is open for anyone to search through. This ofcourse doesn't include the security bugs.
There is a second collection of bugs in the My Oracle Support and these bugs are only accesseble by customers with a support contract. Even when I have access to MOS I still prefer to use the community bugs site. For service requests etc. I would use MOS.
The openness of the bugs database is one of the topic the IOUG MySQL Council discusses with Oracle.
The bugs database has more to offer than just information about initial bugs:
- Bugs Statistics: This has a big matrix with components and states, some per developer stats and some totals (Did you known the fastest bug closure was 9 seconds?).
- Bugs Tide This gives you per month statistics.
For both there are some filter option for version and whether to include feature requests.
You might want to read life cycle of a MySQL bug by Valeriy Kravchuk if you're not familiar with the different states of MySQL bugs.
The bugs website offers search results not only in the HTML reports as shown on the website but also offers RSS and CSV format so it can be consumed by computer programs.
To demostrate this I wrote this script. It fetches the CSV for my bugs and generates some graphs. As you might (or might not) notice this doesn't include my older bug reports as I have two different reporter-id's as I had different email accounts for my MySQL account and my My Oracle account before MySQL/Sun was merged with Oracle.
This page is made with IPython Notebook, which is a tool which allows you to combine code, output, graphs and text. You can also view this notebook here.
%pylab inline
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from StringIO import StringIO
from datetime import datetime
This is the URL as found in the bugs website. Feel free to modify the parameters.
bugs_url = "http://bugs.mysql.com/search-csv.php?status=all&severity=all&mine=9242646"
- Fetch the data from the URL with the Requests library
- Convert the CSV data to a Pandas DataFrame (a DataFrame is a special kind of table)
bugs_req = requests.get(bugs_url)
csv_file = StringIO()
csv_file.write(bugs_req.text)
csv_file.seek(0)
bugsdf = pd.read_csv(csv_file)
Let's have a look at the data.
bugsdf.head()
Now we have the data, let's make a horizontal barplot for the number of bugs per category (the Type column).
We change the size of the figure as the default is too small to be readable.
fig = plt.figure(figsize=(8,10), dpi=100)
bugsdf.Type.value_counts(ascending=True).plot(kind='barh')
The Version column has a the text for the version. Let's grab the first three characters to get the major version of for the bug. This is not really perfect as it will only return 5.1 if the string is '5.1.30, 5.5.16', but it's good enough for now.
The function will be mapped to the Version column of the dataframe. And we will save the result in a new column called major_version.
def getversion(inputver):
if isinstance(inputver, float):
return inputver
return inputver[:3]
bugsdf['major_version'] = bugsdf.Version.map(getversion)
Besides Pandas we can use matplotlib's pyplot, which is a bit like MATLAB.
Let's create a heatmap for bug status and category
compstat = bugsdf.groupby('major_version').Type.value_counts().unstack().T
fig = plt.figure(figsize=(15,10), dpi=100)
plt.pcolor(compstat, vmin=0, vmax=5, cmap='Blues')
plt.yticks(np.arange(0.5, len(compstat.index), 1), compstat.index)
plt.xticks(np.arange(0.5, len(compstat.columns), 1), compstat.columns)
Now we can create a heatmap which compares major versions and components.
fig = plt.figure(figsize=(8,10), dpi=100)
plt.pcolor(compstat, vmin=0, vmax=10, cmap='Blues')
plt.yticks(np.arange(0.5, len(compstat.index), 1), compstat.index)
plt.xticks(np.arange(0.5, len(compstat.columns), 1), compstat.columns)
Now we have the major version, let's filter on MySQL 5.6 and then graph the different values for the Status field.
bugsdf[bugsdf.major_version == '5.6'].Status.value_counts().plot(kind='bar')
Or maybe check the status for all versions.
bugsdf.groupby('major_version').Status.value_counts().unstack().plot(kind='barh', stacked=True)
That's not really helpful, let's remove the Closed ones.
bugsdf[bugsdf.Status != 'Closed'].groupby('major_version').Status.value_counts().unstack().plot(kind='bar', stacked=True)
The Entered and Modified fields are not yet in a 'real' date format. So these must be coverted before we can use them.
bugsdf['Entered'] = pd.to_datetime(bugsdf.Entered)
bugsdf['Modified'] = pd.to_datetime(bugsdf.Modified)
Which are the oldest open bugs?
The - before the bugsdf.Status.isin reverses the result, so it behave like 'is not in'.
bugsdf[-bugsdf.Status.isin(['Closed', 'Duplicate', 'Won\'t fix', 'Can\'t repeat'])].sort(columns='Entered')[:5]
Now let's find bugs which are open and not modified in a some time.
bugsdf[-bugsdf.Status.isin(['Closed', 'Duplicate', 'Won\'t fix', 'Can\'t repeat'])] \
[bugsdf.Modified < datetime(2013, 6, 1)] \
[bugsdf.Modified > datetime(1970, 1, 1)]
Besides searching for bugs by reporter you can also use this to search on other conditions.