Lens

Lens Tutorial

We have prepared a Lens tutorial in the form of a Jupyter notebook. A static version is reproduced below, but you can also execute it yourself by downloading the notebook file.

Notebook

Lens is a library for exploring data in Pandas DataFrames. It computes single column summary statistics and estimates the correlation between columns.

We wrote Lens when we realised that the initial steps of acquiring a new dataset were almost formulaic: what data type is in this column? How many null values are there? Which columns are correlated? What's the distribution of this value? Lens calculates all this for you, and provides convenient visualisation of this information.

You can use Lens to analyse new datasets as well as to compare how DataFrames change over time.

Using lens

To start using Lens you need to import the library:

In [1]:
import lens

Lens has two key functions: lens.summarise, which generates a Lens Summary from a DataFrame, and lens.explore, which visualises the results of a summary.

For this tutorial we are going to use Lens to analyse the Room Occupancy dataset provided in the UC Irvine Machine Learning Repository. It includes ambient measurements of a room, such as Temperature, Humidity, Light and CO2, together with whether the room was occupied. The goal is to predict occupancy based on the room measurements.

We read the training portion of the dataset into pandas directly from the UCI repository:

In [2]:
import pandas as pd
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile

# Download the zipped dataset and read the training portion into a DataFrame
remote_zip = urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00357/occupancy_data.zip')
zip_file = ZipFile(BytesIO(remote_zip.read()))
df = pd.read_csv(BytesIO(zip_file.read('datatraining.txt')))

# Split a numerical variable to have additional categorical variables
df['Humidity_cat'] = pd.cut(df['Humidity'], 5,
                            labels=['low', 'medium-low', 'medium',
                                    'medium-high', 'high']).astype('str')
In [3]:
print('Number of rows in dataset: {}'.format(len(df.index)))
df.head()
Number of rows in dataset: 8143
Out[3]:
date Temperature Humidity Light CO2 HumidityRatio Occupancy Humidity_cat
1 2015-02-04 17:51:00 23.18 27.2720 426.0 721.25 0.004793 1 medium
2 2015-02-04 17:51:59 23.15 27.2675 429.5 714.00 0.004783 1 medium
3 2015-02-04 17:53:00 23.15 27.2450 426.0 713.50 0.004779 1 medium
4 2015-02-04 17:54:00 23.15 27.2000 426.0 708.25 0.004772 1 medium
5 2015-02-04 17:55:00 23.10 27.2000 426.0 704.50 0.004757 1 medium

Creating the summary

When you have a DataFrame that you'd like to analyse, the first thing to do is to create a Lens Summary object.

In [4]:
ls = lens.summarise(df)
test_logtrans has failed for column `Temperature`: 'TDigest' object has no attribute 'quantile'
test_logtrans has failed for column `Humidity`: 'TDigest' object has no attribute 'quantile'
test_logtrans has failed for column `HumidityRatio`: 'TDigest' object has no attribute 'quantile'
test_logtrans has failed for column `CO2`: 'TDigest' object has no attribute 'quantile'

The summarise function takes a DataFrame and returns a Lens Summary object. The time this takes to run depends on both the number of rows and the number of columns in the DataFrame. It will use all cores available on the machine, so you might want to use a SherlockML instance with more cores to speed up the computation of the summary. Additional optional parameters can be passed in; details of these can be found in the summarise API docs.
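
The call below is a sketch of how these optional parameters might be passed; the argument values are illustrative rather than recommended settings:

ls = lens.summarise(df,
                    scheduler='multiprocessing',  # Dask scheduler to use
                    num_workers=4,                # size of the worker pool
                    pairdensities=False)          # skip pair densities to speed up the summary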

Given that creating the summary is computationally intensive, Lens provides a way to save a summary to a JSON file on disk and to recover it later through the to_json and from_json methods of lens.Summary. This allows you to store it for future analysis or to share it with collaborators:

In [5]:
# Saving to JSON
ls.to_json('room_occupancy_lens_summary.json')

# Reading from a file
ls_from_json = lens.Summary.from_json('room_occupancy_lens_summary.json')

The LensSummary object contains the information computed from the dataset and provides methods to access both column-wise and whole-dataset information. It is designed to be used programmatically, and information about its methods can be accessed in the LensSummary API docs.

In [6]:
print(ls.columns)
['date', 'Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio', 'Occupancy', 'Humidity_cat']

Create explorer

Lens provides a function that converts a Lens Summary into an Explorer object. This can be used to see the summary information in tabular form and to display plots.

In [7]:
explorer = lens.explore(ls)

Coming back to our room occupancy dataset, the first thing we'd like is a high-level overview of the data.

Describe

To show a general description of the DataFrame, call the describe function. This is similar to Pandas' DataFrame.describe but also shows information for non-numeric columns.

In [8]:
explorer.describe()
Out[8]:
          date    Temperature  Humidity  Light    CO2      HumidityRatio  Occupancy    Humidity_cat
desc      None    numeric      numeric   numeric  numeric  numeric        categorical  categorical
dtype     object  float64      float64   float64  float64  float64        int64        object
notnulls  8143    8143         8143      8143     8143     8143           8143         8143
nulls     0       0            0         0        0        0              0            0
unique    8143    265          1325      889      2282     3583           2            5

We can see that our dataset has 8143 rows and all the rows are complete. This is a very clean dataset! It also tells us the columns and their types, including a desc field that explains how Lens will treat this column.

Column details

To see type-specific column details, use the column_details method. Used on a numeric column such as Temperature, it provides summary statistics for the data in that column, including minimum, maximum, mean, median, and standard deviation.

In [9]:
explorer.column_details('Temperature')
Out[9]:
        Temperature
desc    numeric
dtype   float64
min     19.0
max     23.18
mean    20.6190836403
median  20.39
std     1.01691644111
sum     167901.198083
IQR     1.69

We saw in the output of explorer.describe() that Occupancy, our target variable, is a categorical column with two unique values. With explorer.column_details we can obtain a frequency table for the two categories, empty (0) and occupied (1):

In [10]:
explorer.column_details('Occupancy')
Out[10]:

desc: categorical, dtype: int64

item  frequency
0     6414
1     1729

Correlation

As a first step in exploring the relationships between the columns we can look at the correlation coefficients. explorer.correlation() returns a Spearman rank-order correlation coefficient matrix in tabular form.

In [11]:
explorer.correlation()
Out[11]:
               Humidity  HumidityRatio  Temperature    CO2  Light  Occupancy
Humidity          1.000          0.940       -0.193  0.224  0.007      0.129
HumidityRatio     0.940          1.000        0.105  0.378  0.169      0.256
Temperature      -0.193          0.105        1.000  0.637  0.565      0.533
CO2               0.224          0.378        0.637  1.000  0.473      0.657
Light             0.007          0.169        0.565  0.473  1.000      0.805
Occupancy         0.129          0.256        0.533  0.657  0.805      1.000

However, parsing a correlation table becomes difficult when there are many columns in the dataset. To get a better overview, we can plot the correlation matrix as a heatmap, which immediately highlights a group of columns correlated with Occupancy: Temperature, Light, and CO2.

In [12]:
explorer.correlation_plot()

Distribution and Cumulative Distribution

We can explore the distribution of numerical variables through the distribution_plot and cdf_plot functions:

In [13]:
explorer.distribution_plot('Temperature')
In [14]:
explorer.cdf_plot('Temperature')

Pairwise plot

Once we know that certain columns might be correlated, it is useful to explore that correlation visually. This would typically be done through a scatter plot; Lens precomputes a 2D Kernel Density Estimate that approximates the scatter plot, and it can be displayed through pairwise_density_plot.

In [15]:
explorer.pairwise_density_plot('Temperature', 'Humidity')

pairwise_density_plot can also show the relationship between a numeric column and a categorical column. In this case, a 1D KDE is computed for each of the categories in the categorical column.

In [16]:
explorer.pairwise_density_plot('Temperature', 'Occupancy')

Crosstab

The pairwise relationship between two categorical variables can also be seen as a cross-tabulation: how many observations fall into each combination of categories of the two variables. This can be shown as a table or as a plot, the latter being useful when the number of categories is very large.

In [17]:
explorer.crosstab('Occupancy', 'Humidity_cat')
Out[17]:
              0     1
high          143   243
low           2127  425
medium        1880  428
medium-high   1420  310
medium-low    844   323
In [18]:
explorer.pairwise_density_plot('Occupancy', 'Humidity_cat')

Interactive widget

An alternative way of quickly exploring the plots available in Lens is through a Jupyter widget provided by lens.interactive_explore. Creating it is as easy as running this function on a Lens Summary.

Note that if you are reading this tutorial through the online docs the output of the following cell will not be interactive as it needs to run within a notebook. Download the notebook from the links below to try out the interactive explorer!

In [19]:
lens.interactive_explore(ls)

Downloads: room_occupancy_example.ipynb; room_occupancy_example_evaluated.ipynb; room_occupancy_example.py

API documentation

lens.summarise API

Summarise a Pandas DataFrame

lens.summarise.summarise(df, scheduler='multiprocessing', num_workers=None, size=None, pairdensities=True)[source]

Create a Lens Summary for a Pandas DataFrame.

This creates a Summary instance containing many quantities of interest to a data scientist.

Parameters:

df : pd.DataFrame

DataFrame to be analysed.

scheduler : str, optional

Dask scheduler to use. Must be one of [‘multiprocessing’, ‘threaded’, ‘sync’].

num_workers : int or None, optional

Number of workers in the pool. If the environment variable NUM_CPUS is set, that number will be used; otherwise, as many workers will be used as there are CPUs available on the machine.

size : int, optional

DataFrame size on disk, which will be added to the report.

pairdensities : bool, optional

Whether to compute the pairdensity estimation between all pairs of numerical columns. For most datasets, this is the most expensive computation. Default is True.

Returns:

summary : Summary

The computed data summary.

Examples

Let’s explore the wine quality dataset.

>>> import pandas as pd
>>> import lens
>>> url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"  # noqa
>>> wines_df = pd.read_csv(url, sep=';')
>>> summary = lens.summarise(wines_df)

Now that we have a Summary instance, we can inspect the shape of the dataset:

>>> summary.columns
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality']
>>> summary.rows
4898

So far, nothing groundbreaking. Let’s look at the quality column:

>>> summary.summary('quality')
{'desc': 'categorical',
 'dtype': 'int64',
 'name': 'quality',
 'notnulls': 4898,
 'nulls': 0,
 'unique': 7}

This tells us that there are seven unique values in the quality column, and zero null values. It also tells us that Lens will treat this column as categorical. Let's look at this in more detail:

>>> summary.details('quality')
{'desc': 'categorical',
 'frequencies': {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5},
 'iqr': 1.0,
 'max': 9,
 'mean': 5.8779093507554103,
 'median': 6.0,
 'min': 3,
 'name': 'quality',
 'std': 0.88563857496783116,
 'sum': 28790}

This tells us that the median wine quality is 6 and the standard deviation is less than one. Let’s now get the correlation between the quality column and the alcohol column:

>>> summary.pair_details('quality', 'alcohol')['correlation']
{'pearson': 0.4355747154613688, 'spearman': 0.4403691816246831}

Thus, the Spearman Rank Correlation coefficient between these two columns is 0.44.

class lens.summarise.Summary(report)[source]

A summary of a pandas DataFrame.

Create a summary instance by calling lens.summarise.summarise() on a DataFrame. This calculates several quantities of interest to data scientists.

The Summary object is designed for programmatic use. For more direct visual inspection, use the lens.explorer.Explorer class in a Jupyter notebook.

cdf(column)[source]

Approximate cdf for column

This returns a function representing the cdf of a numeric column.

Parameters:

column : str

Name of the column.

Returns:

cdf: function

Function representing the cdf.

Examples

>>> cdf = summary.cdf('chlorides')
>>> min_value = summary.details('chlorides')['min']
>>> max_value = summary.details('chlorides')['max']
>>> xs = np.linspace(min_value, max_value, 200)
>>> plt.plot(xs, cdf(xs))
columns

Get a list of column names of the dataset.

Returns:

list

Column names

Examples

>>> summary.columns
['fixed acidity',
 'volatile acidity',
 'citric acid',
 'residual sugar',
 'chlorides',
 'free sulfur dioxide',
 'total sulfur dioxide',
 'density',
 'pH',
 'sulphates',
 'alcohol',
 'quality']
correlation_matrix(include=None, exclude=None)[source]

Correlation matrix for numeric columns

Parameters:

include: list of strings, optional

List of numeric columns to include. Includes all columns by default.

exclude: list of strings, optional

List of numeric columns to exclude. By default, no columns are excluded.

Returns:

columns: list of strings

List of column names

correlation_matrix: 2D array of floats

The correlation matrix, ordered such that correlation_matrix[i, j] is the correlation between columns[i] and columns[j]

Notes

The columns are ordered through hierarchical clustering. Thus, neighbouring columns in the output will be more correlated.
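
Examples

A usage sketch based on the return values described above, assuming a summary built as in the lens.summarise examples:

>>> columns, matrix = summary.correlation_matrix()
>>> matrix.shape == (len(columns), len(columns))
True
>>> r01 = matrix[0, 1]  # correlation between columns[0] and columns[1]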

details(column)[source]

Type-specific information for a column

The details method returns additional information on column, beyond that provided by the summary method. If column is numeric, this returns summary statistics. If it is categorical, it returns a dictionary of how often each category occurs.

Parameters:

column : str

Column name

Returns:

dict

Dictionary of detailed information.

Examples

>>> summary.details('alcohol')
{'desc': 'numeric',
 'iqr': 1.9000000000000004,
 'max': 14.199999999999999,
 'mean': 10.514267047774602,
 'median': 10.4,
 'min': 8.0,
 'name': 'alcohol',
 'std': 1.2306205677573181,
 'sum': 51498.880000000005}
>>> summary.details('quality')
{'desc': 'categorical',
 'frequencies':
      {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5},
 'iqr': 1.0,
 'max': 9,
 'mean': 5.8779093507554103,
 'median': 6.0,
 'min': 3,
 'name': 'quality',
 'std': 0.88563857496783116,
 'sum': 28790}
static from_json(file)[source]

Create a Summary from a report saved in JSON format.

Parameters:

file : str or buffer

Path to file containing the JSON report or buffer from which the report can be read.

Returns:

Summary

Summary object containing the summary in the JSON file.
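
Examples

A round-trip sketch (the file name is illustrative):

>>> summary.to_json('winequality_summary.json')
>>> restored = lens.Summary.from_json('winequality_summary.json')
>>> restored.columns == summary.columns
True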

histogram(column)[source]

Return the histogram for column.

This function returns a histogram for the column. The number of bins is estimated through the Freedman-Diaconis rule.

Parameters:

column: str

Name of the column

Returns:

counts: array

Counts for each of the bins of the histogram.

bin_edges : array

Edges of the bins in the histogram. Its length is len(counts) + 1.
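
Examples

A plotting sketch, assuming numpy and matplotlib are imported as np and plt:

>>> counts, bin_edges = summary.histogram('alcohol')
>>> plt.bar(bin_edges[:-1], counts, width=np.diff(bin_edges), align='edge')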

kde(column)[source]

Return a Kernel Density Estimate for column.

This function returns a KDE for the column. It is computed between the minimum and maximum values of the column and uses Scott's rule to compute the bandwidth.

Parameters:

column: str

Name of the column

Returns:

x: array

Values at which the KDE has been evaluated.

y : array

Values of the KDE.
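
Examples

A plotting sketch, assuming matplotlib is imported as plt:

>>> x, y = summary.kde('alcohol')
>>> plt.plot(x, y)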

pair_details(first, second)[source]

Get pairwise information for a column pair.

The information returned depends on the types of the two columns. It may contain the following keys.

correlation
dictionary with the Spearman rank correlation coefficient and Pearson product-moment correlation coefficient between the columns. This is returned when both columns are numeric.
pairdensity
dictionary with an estimate of the pairwise density between the columns. The density is a 2D KDE estimate if both columns are numerical; a set of 1D KDE estimates, grouped by the categorical column, if one column is categorical and the other numerical; or a cross-tabulation if both columns are categorical.
Parameters:

first : str

Name of the first column.

second : str

Name of the second column.

Returns:

dict

Dictionary of pairwise information.

Examples

>>> summary.pair_details('chlorides', 'quality')
{'correlation': {
    'pearson': -0.20993441094675602,
    'spearman': -0.31448847828244203},
 'pairdensity': {
    'density': <2d numpy array>,
    'x': <1d numpy array of x-values>,
    'y': <1d numpy array of y-values>,
    'x_scale': 'linear',
    'y_scale': 'cat'}}
>>> summary.pair_details('alcohol', 'chlorides')
{'correlation': {
    'pearson': -0.36018871210816106,
    'spearman': -0.5708064071153713},
 'pairdensity': {
    'density': <2d numpy array>,
    'x': <1d numpy array of x-values>,
    'y': <1d numpy array of y-values>,
    'x_scale': 'linear',
    'y_scale': 'linear'}}
pdf(column)[source]

Approximate pdf for column

This returns a function representing the pdf of a numeric column.

Parameters:

column : str

Name of the column.

Returns:

pdf: function

Function representing the pdf.

Examples

>>> pdf = summary.pdf('chlorides')
>>> min_value = summary.details('chlorides')['min']
>>> max_value = summary.details('chlorides')['max']
>>> xs = np.linspace(min_value, max_value, 200)
>>> plt.plot(xs, pdf(xs))
rows

Get the number of rows in the dataset.

Returns:

int

Number of rows

Examples

>>> summary.rows
4898
rows_unique

Get the number of unique rows in the dataset.

Returns:

int

Number of unique rows.

summary(column)[source]

Basic information about the column

This returns information about the number of nulls and unique values in column as well as which type this column is. This is guaranteed to return a dictionary with the same keys for every column.

The dictionary contains the following keys:

desc
the type of data: currently categorical or numeric. Lens will calculate different quantities for this column depending on the value of desc.
dtype
the type of data in Pandas.
name
column name
notnulls
number of non-null values in the column
nulls
number of null-values in the column
unique
number of unique values in the column
Parameters:

column : str

Column name

Returns:

dict

Dictionary of summary information.

Examples

>>> summary.summary('quality')
{'desc': 'categorical',
 'dtype': 'int64',
 'name': 'quality',
 'notnulls': 4898,
 'nulls': 0,
 'unique': 7}
>>> summary.summary('chlorides')
{'desc': 'numeric',
 'dtype': 'float64',
 'name': 'chlorides',
 'notnulls': 4898,
 'nulls': 0,
 'unique': 160}
tdigest(column)[source]

Return a TDigest object approximating the distribution of a column

Documentation for the TDigest class can be found at https://github.com/CamDavidsonPilon/tdigest.

Parameters:

column : str

Name of the column.

Returns:

tdigest.TDigest

TDigest instance computed from the values of the column.
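
Examples

A usage sketch; percentile is part of the tdigest package's API linked above:

>>> digest = summary.tdigest('chlorides')
>>> median_estimate = digest.percentile(50)  # approximate median of the column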

tdigest_centroids(column)[source]

Get TDigest centroids and counts for column.

Parameters:

column : str

Name of the column.

Returns:

numpy.array

Means of the TDigest centroids.

numpy.array

Counts for each of the TDigest centroids.
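
Examples

A usage sketch; one mean and one count are returned per centroid:

>>> means, counts = summary.tdigest_centroids('chlorides')
>>> len(means) == len(counts)
True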

to_json(file=None)[source]

Produce a JSON serialization of the report.

Parameters:

file : str or buffer, optional

File name or writeable buffer to save the JSON report. If omitted, a string containing the report will be returned.

Returns:

str

JSON serialization of the summary report
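
Examples

A sketch of both calling conventions (the file name is illustrative):

>>> report = summary.to_json()                   # returns the report as a JSON string
>>> summary.to_json('winequality_summary.json')  # or write it directly to a file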

lens.explorer API

Explore a Summary

class lens.explorer.Explorer(summary, plot_renderer=<function _render>)[source]

An explorer to visualise a Lens Summary

Once a Lens Summary has been generated with lens.summarise.summarise(), this class provides the methods necessary to explore the summary through tables and plots. It is best used from within a Jupyter notebook.
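
Examples

A minimal sketch: an Explorer is normally created through lens.explore, here reusing the wine summary from the lens.summarise examples above:

>>> explorer = lens.explore(summary)
>>> explorer.describe()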

cdf_plot(column)[source]

Plot the empirical cumulative distribution function of a column.

Creates a plotly plot with the empirical CDF of a column.

Parameters:

column : str

Name of the column.

column_details(column, sort=False)[source]

Show type-specific column details.

For numeric columns, this method produces a table with summary statistics, including minimum, maximum, mean, and median. For categorical columns, it produces a frequency table for each category sorted in descending order of frequency.

Parameters:

column : str

Name of the column.

sort : boolean, optional

If True, sort the frequency table of a categorical column by category name rather than by descending frequency.
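
Examples

A usage sketch, assuming the explorer created from the wine summary above; sort=True orders the frequency table of the categorical quality column by category name:

>>> explorer.column_details('quality', sort=True)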

correlation(include=None, exclude=None)[source]

Show the correlation matrix for numeric columns.

Print a Spearman rank order correlation coefficient matrix in tabular form, showing the correlation between columns. The matrix is reordered to group together columns that have a higher correlation coefficient. The columns to be shown in the table can be selected through either the include or exclude keyword arguments. Only one of them can be given.

Parameters:

include : list of str

List of columns to include in the correlation plot.

exclude : list of str

List of columns to exclude from the correlation plot.
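
Examples

A usage sketch restricting the table to a few numeric columns of the wine dataset:

>>> explorer.correlation(include=['alcohol', 'density', 'residual sugar'])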

correlation_plot(include=None, exclude=None)[source]

Plot the correlation matrix for numeric columns

Plot a Spearman rank order correlation coefficient matrix showing the correlation between columns. The matrix is reordered to group together columns that have a higher correlation coefficient. The columns to be plotted in the correlation plot can be selected through either the include or exclude keyword arguments. Only one of them can be given.

Parameters:

include : list of str

List of columns to include in the correlation plot.

exclude : list of str

List of columns to exclude from the correlation plot.

crosstab(column1, column2)[source]

Show a contingency table of two categorical columns.

Print a contingency table for two categorical variables showing the multivariate frequency distribution of the columns.

Parameters:

column1 : str

First column.

column2 : str

Second column.

describe()[source]

General description of the dataset.

Produces a table including the following information about each column:

desc
the type of data: currently categorical or numeric. Lens will calculate different quantities for this column depending on the value of desc.
dtype
the type of data in Pandas.
name
column name
notnulls
number of non-null values in the column
nulls
number of null-values in the column
unique
number of unique values in the column
distribution(column)[source]

Show properties of the distribution of values in the column.

Parameters:

column : str

Name of the column.

distribution_plot(column, bins=None)[source]

Plot the distribution of a numeric column.

Create a plotly plot with a histogram of the values in a column. The number of bins in the histogram is chosen according to the Freedman-Diaconis rule unless given by the bins parameter.

Parameters:

column : str

Name of the column.

bins : int, optional

Number of bins to use for the histogram. If not given, the Freedman-Diaconis rule will be used to estimate the best number of bins. This argument also accepts the formats taken by the bins parameter of matplotlib's pyplot.hist.
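
Examples

A usage sketch overriding the automatic binning:

>>> explorer.distribution_plot('alcohol', bins=50)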

pairwise_density_plot(column1, column2)[source]

Plot the pairwise density between two columns.

This plot is an approximation of a scatterplot through a 2D Kernel Density Estimate for two numerical variables. When one of the variables is categorical, a 1D KDE for each of the categories is shown, normalised to the total number of non-null observations. For two categorical variables, the plot produced is a heatmap representation of the contingency table.

Parameters:

column1 : str

First column.

column2 : str

Second column.
