Lens¶
Lens Tutorial¶
We have prepared a Lens tutorial in the form of a Jupyter notebook. A static
version is reproduced below, but you can also execute it yourself by downloading
the notebook file
.
Lens Tutorial¶
Lens is a library for exploring data in Pandas DataFrames. It computes single column summary statistics and estimates the correlation between columns.
We wrote Lens when we realised that the initial steps of acquiring a new dataset were almost formulaic: what data type is in this column? How many null values are there? Which columns are correlated? What's the distribution of this value? Lens calculates all this for you, and provides convenient visualisation of this information.
You can use Lens to analyse new datasets as well as using it to compare how DataFrames change over time.
Using lens¶
To start using Lens you need to import the library:
import lens
Lens has two key functions; lens.summarise
for generating a Lens Summary from a DataFrame and
lens.explore
for visualising the results of a summary.
For this tutorial we are going to use Lens to analyse the Room Occupancy dataset provided in the Machine Learning Repository of UC Irvine. It includes ambient information about a room such as Temperature, Humidity, Light, CO2 and whether it was occupied. The goal is to predict occupancy based on the room measurements.
We read the training portion of the dataset into pandas directly from the UCI repository:
import pandas as pd
from urllib.request import urlopen
from io import BytesIO
from zipfile import ZipFile
remote_zip = urlopen('https://archive.ics.uci.edu/ml/machine-learning-databases/00357/occupancy_data.zip')
df = pd.read_csv(BytesIO(ZipFile(BytesIO(remote_zip.read())).read('datatraining.txt')))
# Split a numerical variable to have additional categorical variables
df['Humidity_cat'] = pd.cut(df['Humidity'], 5,
labels=['low', 'medium-low', 'medium',
'medium-high', 'high']).astype('str')
print('Number of rows in dataset: {}'.format(len(df.index)))
df.head()
Creating the summary¶
When you have a DataFrame that you'd like to analyse the first thing to do is
to create a Lens Summary
object.
ls = lens.summarise(df)
The summarise
function takes a DataFrame and returns a Lens Summary
object. The
time this takes to run is dependent on both the number of rows and the number of
columns in the DataFrame. It will use all cores available on the machine, so you
might want to use a SherlockML instance with more cores to speed up the computation
of the summary. There are additional optional parameters that can be
passed in. Details of these can be found in the summarise API docs.
Given that creating the summary is computationally intensive, Lens provides a way to save this summary to a JSON file on disk and recover a saved summary through the to_json
and from_json
methods of lens.summary
. This allows to store it for future analysis or to share it with collaborators:
# Saving to JSON
ls.to_json('room_occupancy_lens_summary.json')
# Reading from a file
ls_from_json = lens.Summary.from_json('room_occupancy_lens_summary.json')
The LensSummary
object contains the information computed from the dataset and provides methods to access both column-wise and whole dataset information. It is designed to be used programatically, and information about the methods can be accessed in the LensSummary API docs.
print(ls.columns)
Create explorer¶
Lens provides a function that converts a Lens Summary into an Explorer
object.
This can be used to see the summary information in tabular form and to display
plots.
explorer = lens.explore(ls)
Coming back to our room occupancy dataset, the first thing that we'd like to know is a high-level overview of the data.
Describe¶
To show a general description of the DataFrame call the describe
function.
This is similar to Pandas' DataFrame.describe
but also shows information for non-numeric columns.
explorer.describe()
We can see that our dataset has 8143 rows and all the rows are complete. This
is a very clean dataset! It also tells us the columns and their types, including a desc
field that explains how Lens will treat this column.
Column details¶
To see type-specific column details, use the column_details
method. Used on a numeric column such as Temperature
, it provides summary statistics for the data in that column, including minimun, maximum, mean, median, and standard deviation.
explorer.column_details('Temperature')
We saw in the ouput of explorer.describe()
that Occupancy
, our target variable, is a categorical column with two unique values. With explorer.column_details
we can obtain a frequency table for these two categories - empty (0) or occupied (1):
explorer.column_details('Occupancy')
Correlation¶
As a first step in exploring the relationships between the columns we can look at the correlation coefficients. explorer.correlation()
returns a Spearman rank-order correlation coefficient matrix in tabular form.
explorer.correlation()
However, parsing a correlation table becomes difficult when there are many columns in the dataset. To get a better overview, we can plot the correlation matrix as a heatmap, which immediately highlights a group of columns correlated with Occupancy
: Temperature
, Light
, and CO2
.
explorer.correlation_plot()
Distribution and Cumulative Distribution¶
We can explore the distribution of numerical variables through the distribution_plot
and cdf_plot
functions:
explorer.distribution_plot('Temperature')
explorer.cdf_plot('Temperature')
Pairwise plot¶
Once we know that certain columns might be correlated, it is useful to visually explore that correlation. This would typically be done through a scatter plot, and Lens has computed a 2D Kernel Density Estimate of the scatter plot that can be accessed through pairwise_density_plot
.
explorer.pairwise_density_plot('Temperature', 'Humidity')
pairwise_density_plot
can also show the relationship between a numeric column and a categorical column. In this case, a 1D KDE is computed for each of the categories in the categorical column.
explorer.pairwise_density_plot('Temperature', 'Occupancy')
Crosstab¶
The pairwise relationship between two categorical variables can also be seen as a cross-tabulation: how many observations exist in the dataset of the combination of categories in the two variables. This can be seen as a table or as a plot, which can be useful when the number of categories is very large.
explorer.crosstab('Occupancy', 'Humidity_cat')
explorer.pairwise_density_plot('Occupancy', 'Humidity_cat')
Interactive widget¶
An alternative way of quickly exploring the plots available in Lens is through a Jupyter widget provided by lens.interactive_explore
. Creating it is as easy as running this function on a Lens Summary
.
Note that if you are reading this tutorial through the online docs the output of the following cell will not be interactive as it needs to run within a notebook. Download the notebook from the links below to try out the interactive explorer!
lens.interactive_explore(ls)
(room_occupancy_example.ipynb; room_occupancy_example_evaluated.ipynb; room_occupancy_example.py)
API documentation¶
lens.summarise API¶
Summarise a Pandas DataFrame
-
lens.summarise.
summarise
(df, scheduler='multiprocessing', num_workers=None, size=None, pairdensities=True)[source]¶ Create a Lens Summary for a Pandas DataFrame.
This creates a
Summary
instance containing many quantities of interest to a data scientist.Parameters: df : pd.DataFrame
DataFrame to be analysed.
scheduler : str, optional
Dask scheduler to use. Must be one of [‘multiprocessing’, ‘threaded’, ‘sync’].
num_workers : int or None, optional
Number of workers in the pool. If the environment variable NUM_CPUS is set that number will be used, otherwise it will use as many workers as CPUs available in the machine.
size : int, optional
DataFrame size on disk, which will be added to the report.
pairdensities : bool, optional
Whether to compute the pairdensity estimation between all pairs of numerical columns. For most datasets, this is the most expensive computation. Default is True.
Returns: summary :
Summary
The computed data summary.
Examples
Let’s explore the wine quality dataset.
>>> import pandas as pd >>> import lens >>> url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv" # noqa >>> wines_df = pd.read_csv(url, sep=';') >>> summary = lens.summarise(wines_df)
Now that we have a
Summary
instance we can inspect the shape of the dataset>>> summary.columns ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality'] >>> summary.rows 4898
So far, nothing groundbreaking. Let’s look at the
quality
column:>>> summary.summary('quality') {'desc': 'categorical', 'dtype': 'int64', 'name': 'quality', 'notnulls': 4898, 'nulls': 0, 'unique': 7}
This tells us that there are seven unique values in the quality columns, and zero null values. It also tells us that lens will treat this column as categorical. Let’s look at this in more details:
>>> summary.details('quality') {'desc': 'categorical', 'frequencies': {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}, 'iqr': 1.0, 'max': 9, 'mean': 5.8779093507554103, 'median': 6.0, 'min': 3, 'name': 'quality', 'std': 0.88563857496783116, 'sum': 28790}
This tells us that the median wine quality is 6 and the standard deviation is less than one. Let’s now get the correlation between the
quality
column and thealcohol
column:>>> summary.pair_detail('quality', 'alcohol')['correlation'] {'pearson': 0.4355747154613688, 'spearman': 0.4403691816246831}
Thus, the Spearman Rank Correlation coefficient between these two columns is 0.44.
-
class
lens.summarise.
Summary
(report)[source]¶ A summary of a pandas DataFrame.
Create a summary instance by calling
lens.summarise.summarise()
on a DataFrame. This calculates several quantities of interest to data scientists.The Summary object is designed for programmatic use. For more direct visual inspection, use the
lens.explorer.Explorer
class in a Jupyter notebook.-
cdf
(column)[source]¶ Approximate cdf for column
This returns a function representing the cdf of a numeric column.
Parameters: column : str
Name of the column.
Returns: cdf: function
Function representing the cdf.
Examples
>>> cdf = summary.cdf('chlorides') >>> min_value = summary.details('chlorides')['min'] >>> max_value = summary.details('chlorides')['max'] >>> xs = np.linspace(min_value, max_value, 200) >>> plt.plot(xs, cdf(xs))
-
columns
¶ Get a list of column names of the dataset.
Returns: list
Column names
Examples
>>> summary.columns ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
-
correlation_matrix
(include=None, exclude=None)[source]¶ Correlation matrix for numeric columns
Parameters: include: list of strings, optional
List of numeric columns to include. Includes all columns by default.
exclude: list of strings, optional
List of numeric columns to exclude. Includes all columns by default.
Returns: columns: list of strings
List of column names
correlation_matrix: 2D array of floats
The correlation matrix, ordered such that
correlation_matrix[i, j]
is the correlation betweencolumns[i]
andcolumns[j]
Notes
The columns are ordered through hierarchical clustering. Thus, neighbouring columns in the output will be more correlated.
-
details
(column)[source]¶ Type-specific information for a column
The details method returns additional information on
column
, beyond that provided by thesummary
method. Ifcolumn
is numeric, this returns summary statistics. If it is categorical, it returns a dictionary of how often each category occurs.Parameters: column : str
Column name
Returns: dict
Dictionary of detailed information.
Examples
>>> summary.details('alcohol') {'desc': 'numeric', 'iqr': 1.9000000000000004, 'max': 14.199999999999999, 'mean': 10.514267047774602, 'median': 10.4, 'min': 8.0, 'name': 'alcohol', 'std': 1.2306205677573181, 'sum': 51498.880000000005}
>>> summary.details('quality') {'desc': 'categorical', 'frequencies': {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}, 'iqr': 1.0, 'max': 9, 'mean': 5.8779093507554103, 'median': 6.0, 'min': 3, 'name': 'quality', 'std': 0.88563857496783116, 'sum': 28790}
-
static
from_json
(file)[source]¶ Create a Summary from a report saved in JSON format.
Parameters: file : str or buffer
Path to file containing the JSON report or buffer from which the report can be read.
Returns: Summary
object containing the summary in the JSON file.
-
histogram
(column)[source]¶ Return the histogram for column.
This function returns a histogram for the column. The number of bins is estimated through the Freedman-Diaconis rule.
Parameters: column: str
Name of the column
Returns: counts: array
Counts for each of the bins of the histogram.
bin_edges : array
Edges of the bins in the histogram. Length is
length(counts)+1
.
-
kde
(column)[source]¶ Return a Kernel Density Estimate for column.
This function returns a KDE for the column. It is computed between the minimum and maximum values of the column and uses Scott’s rule to compute the bandwith.
Parameters: column: str
Name of the column
Returns: x: array
Values at which the KDE has been evaluated.
y : array
Values of the KDE.
-
pair_details
(first, second)[source]¶ Get pairwise information for a column pair.
The information returned depends on the types of the two columns. It may contain the following keys.
- correlation
- dictionary with the Spearman rank correlation coefficient and Pearson product-moment correlation coefficient between the columns. This is returned when both columns are numeric.
- pairdensity
- dictionary with an estimate of the pairwise density between the columns. The density is either a 2D KDE estimate if both columns are numerical, or several 1D KDE estimates if one of the columns is categorical and the other numerical (grouped by the categorical column) or a cross-tabuluation.
Parameters: first : str
Name of the first column.
second : str
Name of the second column.
Returns: dict
Dictionary of pairwise information.
Examples
>>> summary.pair_details('chlorides', 'quality') {'correlation': { 'pearson': -0.20993441094675602, 'spearman': -0.31448847828244203}, {'pairdensity': { 'density': <2d numpy array> 'x': <1d numpy array of x-values> 'y': <1d numpy array of y-values> 'x_scale': 'linear', 'y_scale': 'cat'} }
>>> summary.pair_details('alcohol', 'chlorides') {'correlation': { 'pearson': -0.36018871210816106, 'spearman': -0.5708064071153713}, {'pairdensity': { 'density': <2d numpy array> 'x': <1d numpy array of x-values> 'y': <1d numpy array of y-values> 'x_scale': 'linear', 'y_scale': 'linear'} }
-
pdf
(column)[source]¶ Approximate pdf for column
This returns a function representing the pdf of a numeric column.
Parameters: column : str
Name of the column.
Returns: pdf: function
Function representing the pdf.
Examples
>>> pdf = summary.pdf('chlorides') >>> min_value = summary.details('chlorides')['min'] >>> max_value = summary.details('chlorides')['max'] >>> xs = np.linspace(min_value, max_value, 200) >>> plt.plot(xs, pdf(xs))
-
rows
¶ Get the number of rows in the dataset.
Returns: int
Number of rows
Examples
>>> summary.rows 4898
-
rows_unique
¶ Get the number of unique rows in the dataset.
Returns: int
Number of unique rows.
-
summary
(column)[source]¶ Basic information about the column
This returns information about the number of nulls and unique values in
column
as well as which type this column is. This is guaranteed to return a dictionary with the same keys for every column.The dictionary contains the following keys:
desc
- the type of data: currently
categorical
ornumeric
. Lens will calculate different quantities for this column depending on the value ofdesc
. dtype
- the type of data in Pandas.
name
- column name
notnulls
- number of non-null values in the column
nulls
- number of null-values in the column
unique
- number of unique values in the column
Parameters: column : str
Column name
Returns: dict
Dictionary of summary information.
Examples
>>> summary.summary('quality') {'desc': 'categorical', 'dtype': 'int64', 'name': 'quality', 'notnulls': 4898, 'nulls': 0, 'unique': 7}
>>> summary.summary('chlorides') {'desc': 'numeric', 'dtype': 'float64', 'name': 'chlorides', 'notnulls': 4898, 'nulls': 0, 'unique': 160}
-
tdigest
(column)[source]¶ Return a TDigest object approximating the distribution of a column
Documentation for the TDigest class can be found at https://github.com/CamDavidsonPilon/tdigest.
Parameters: column : str
Name of the column.
Returns: tdigest.TDigest
TDigest instance computed from the values of the column.
-
lens.explorer API¶
Explore a Summary
-
class
lens.explorer.
Explorer
(summary, plot_renderer=<function _render>)[source]¶ An explorer to visualise a Lens Summary
Once a Lens
Summary
has been generated withlens.summarise.summarise()
, this class provides the methods necessary to explore the summary though tables and plots. It is best used from within a Jupyter notebook.-
cdf_plot
(column)[source]¶ Plot the empirical cumulative distribution function of a column.
Creates a plotly plot with the empirical CDF of a column.
Parameters: column : str
Name of the column.
-
column_details
(column, sort=False)[source]¶ Show type-specific column details.
For numeric columns, this method produces a table with summary statistics, including minimum, maximum, mean, and median. For categorical columns, it produces a frequency table for each category sorted in descending order of frequency.
Parameters: column : str
Name of the column.
sort : boolean, optional
Sort frequency tables in categorical variables by category name.
-
correlation
(include=None, exclude=None)[source]¶ Show the correlation matrix for numeric columns.
Print a Spearman rank order correlation coefficient matrix in tabular form, showing the correlation between columns. The matrix is reordered to group together columns that have a higher correlation coefficient. The columns to be shown in the table can be selected through either the
include
orexclude
keyword arguments. Only one of them can be given.Parameters: include : list of str
List of columns to include in the correlation plot.
exclude : list of str
List of columns to exclude from the correlation plot.
-
correlation_plot
(include=None, exclude=None)[source]¶ Plot the correlation matrix for numeric columns
Plot a Spearman rank order correlation coefficient matrix showing the correlation between columns. The matrix is reordered to group together columns that have a higher correlation coefficient. The columns to be plotted in the correlation plot can be selected through either the
include
orexclude
keyword arguments. Only one of them can be given.Parameters: include : list of str
List of columns to include in the correlation plot.
exclude : list of str
List of columns to exclude from the correlation plot.
-
crosstab
(column1, column2)[source]¶ Show a contingency table of two categorical columns.
Print a contingency table for two categorical variables showing the multivariate frequancy distribution of the columns.
Parameters: column1 : str
First column.
column2 : str
Second column.
-
describe
()[source]¶ General description of the dataset.
Produces a table including the following information about each column:
desc
- the type of data: currently
categorical
ornumeric
. Lens will calculate different quantities for this column depending on the value ofdesc
. dtype
- the type of data in Pandas.
name
- column name
notnulls
- number of non-null values in the column
nulls
- number of null-values in the column
unique
- number of unique values in the column
-
distribution
(column)[source]¶ Show properties of the distribution of values in the column.
Parameters: column : str
Name of the column.
-
distribution_plot
(column, bins=None)[source]¶ Plot the distribution of a numeric column.
Create a plotly plot with a histogram of the values in a column. The number of bin in the histogram is decided according to the Freedman-Diaconis rule unless given by the bins parameter.
Parameters: column : str
Name of the column.
bins : int, optional
Number of bins to use for histogram. If not given, the Freedman-Diaconis rule will be used to estimate the best number of bins. This argument also accepts the formats taken by the bins parameter of matplotlib’s :function:`~matplotlib.pyplot.hist`.
-
pairwise_density_plot
(column1, column2)[source]¶ Plot the pairwise density between two columns.
This plot is an approximation of a scatterplot through a 2D Kernel Density Estimate for two numerical variables. When one of the variables is categorical, a 1D KDE for each of the categories is shown, normalised to the total number of non-null observations. For two categorical variables, the plot produced is a heatmap representation of the contingency table.
Parameters: column1 : str
First column.
column2 : str
Second column.
-