Lens Tutorial

We have prepared a Lens tutorial in the form of a Jupyter notebook. A static version is reproduced below, but you can also execute it yourself by downloading the notebook file.

Notebook

Lens is a library for exploring data in Pandas DataFrames. It computes single-column summary statistics and estimates the correlation between columns.

We wrote Lens when we realised that the initial steps of acquiring a new dataset were almost formulaic: what data type is in this column? How many null values are there? Which columns are correlated? What's the distribution of this value? Lens calculates all this for you, and provides convenient visualisation of this information.

You can use Lens to analyse new datasets as well as to compare how DataFrames change over time.

Using Lens

To start using Lens you need to import the library:

In [1]:
import lens

Lens has two key functions: lens.summarise, which generates a Lens Summary from a DataFrame, and lens.explore, which visualises the results of a summary.

For this tutorial we are going to use Lens to analyse the Room Occupancy dataset provided in the Machine Learning Repository of UC Irvine. It includes ambient information about a room such as Temperature, Humidity, Light, CO2 and whether it was occupied. The goal is to predict occupancy based on the room measurements.

To read it into pandas use:

In [2]:
import pandas as pd
df = pd.read_csv('http://asi-datasets.s3.amazonaws.com/room_occupancy/room_occupancy.csv')

# Split a numerical variable to have additional categorical variables
df['Humidity_cat'] = pd.cut(df['Humidity'], 5,
                            labels=['low', 'medium-low', 'medium',
                                    'medium-high', 'high']).astype('str')
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-2-d87b9dbd5581> in <module>()
      1 import pandas as pd
----> 2 df = pd.read_csv('http://asi-datasets.s3.amazonaws.com/room_occupancy/room_occupancy.csv')

HTTPError: HTTP Error 404: Not Found
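The pd.cut call above splits Humidity into five equal-width bins and labels them. As a rough illustration of that binning, here is a minimal standard-library sketch. The readings and the helper name equal_width_labels are invented for this example, and the function only approximates pd.cut: it uses right-closed intervals, as pd.cut does by default, but ignores other details of the pandas implementation.

```python
import math

# Hypothetical humidity readings, invented for illustration only.
humidity = [16.7, 20.1, 23.4, 27.2, 30.0, 33.5, 39.1]
labels = ['low', 'medium-low', 'medium', 'medium-high', 'high']


def equal_width_labels(values, labels):
    """Assign each value to one of len(labels) equal-width, right-closed bins.

    Assumes the values contain at least two distinct numbers.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        # A value on a bin edge belongs to the bin on its left.
        idx = math.ceil((v - lo) / width) - 1
        # Clamp so the minimum lands in the first bin and the maximum in the last.
        idx = max(0, min(idx, len(labels) - 1))
        result.append(labels[idx])
    return result


humidity_cat = equal_width_labels(humidity, labels)
print(humidity_cat)
```

The resulting list of strings plays the same role as the Humidity_cat column: a categorical view of a numeric variable.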
In [3]:
print('Number of rows in dataset: {}'.format(len(df.index)))
df.head()

Creating the summary

When you have a DataFrame that you'd like to analyse, the first thing to do is to create a Lens Summary object.

In [4]:
ls = lens.summarise(df)

The summarise function takes a DataFrame and returns a Lens Summary object. The time this takes to run is dependent on both the number of rows and the number of columns in the DataFrame. It will use all cores available on the machine, so you might want to use a SherlockML instance with more cores to speed up the computation of the summary. There are additional optional parameters that can be passed in. Details of these can be found in the summarise API docs.

Given that creating the summary is computationally intensive, Lens provides a way to save a summary to a JSON file on disk and to recover a saved summary, through the to_json and from_json methods of lens.Summary. This allows you to store the summary for future analysis or to share it with collaborators:

In [5]:
# Saving to JSON
ls.to_json('room_occupancy_lens_summary.json')

# Reading from a file
ls_from_json = lens.Summary.from_json('room_occupancy_lens_summary.json')
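The save-and-recover pattern can be sketched without Lens. The example below uses a plain dictionary as a hypothetical stand-in for a summary (it is not the real Lens structure) and Python's json module in place of to_json and from_json. The file name and values are invented.

```python
import json
import os
import tempfile

# Hypothetical stand-in for a computed summary; NOT the real Lens format.
summary = {'columns': ['Temperature', 'Humidity'], 'rows': 3}

# Write the summary to disk once...
path = os.path.join(tempfile.mkdtemp(), 'summary.json')
with open(path, 'w') as f:
    json.dump(summary, f)

# ...and read it back later without recomputing anything.
with open(path) as f:
    restored = json.load(f)

print(restored == summary)
```

The expensive computation happens once, and every later analysis starts from the file.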

The Lens Summary object contains the information computed from the dataset and provides methods to access both column-wise and whole-dataset information. It is designed to be used programmatically, and information about the methods can be accessed in the LensSummary API docs.

In [6]:
print(ls.columns)

Create explorer

Lens provides a function that converts a Lens Summary into an Explorer object. This can be used to see the summary information in tabular form and to display plots.

In [7]:
explorer = lens.explore(ls)

Coming back to our room occupancy dataset, the first thing we'd like is a high-level overview of the data.

Describe

To show a general description of the DataFrame, call the describe function. This is similar to Pandas' DataFrame.describe but also shows information for non-numeric columns.

In [8]:
explorer.describe()

We can see that our dataset has 8143 rows and all the rows are complete. This is a very clean dataset! It also tells us the columns and their types, including a desc field that explains how Lens will treat this column.
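To give a feel for what a per-column overview covers, here is a minimal standard-library sketch that reports a type, a null count and a unique count for each column, numeric or not. The rows and the helper describe_column are invented for illustration, and the output format is an assumption rather than what explorer.describe() returns.

```python
# Hypothetical columns, invented for illustration only.
columns = {
    'Temperature': [23.1, 23.0, None, 22.5],
    'Humidity_cat': ['low', 'medium', 'low', 'low'],
}


def describe_column(values):
    """Return a small overview of one column, treating None as a null."""
    present = [v for v in values if v is not None]
    return {
        'dtype': type(present[0]).__name__ if present else 'unknown',
        'nulls': len(values) - len(present),
        'unique': len(set(present)),
    }


overview = {name: describe_column(vals) for name, vals in columns.items()}
for name, info in overview.items():
    print(name, info)
```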

Column details

To see type-specific column details, use the column_details method. Used on a numeric column such as Temperature, it provides summary statistics for the data in that column, including minimum, maximum, mean, median, and standard deviation.

In [9]:
explorer.column_details('Temperature')
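The statistics listed above can be reproduced for a small sample with Python's statistics module. The temperature values are invented, and the sample standard deviation is used here as an assumption, since the tutorial does not say which estimator Lens reports.

```python
import statistics

# Hypothetical temperature readings, invented for illustration only.
temperature = [20.0, 21.0, 22.0, 23.0, 24.0]

details = {
    'min': min(temperature),
    'max': max(temperature),
    'mean': statistics.mean(temperature),
    'median': statistics.median(temperature),
    # Sample standard deviation (n - 1 denominator); an assumption.
    'std': statistics.stdev(temperature),
}
print(details)
```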

We saw in the output of explorer.describe() that Occupancy, our target variable, is a categorical column with two unique values. With explorer.column_details we can obtain a frequency table for these two categories, empty (0) or occupied (1):

In [10]:
explorer.column_details('Occupancy')
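A frequency table for a categorical column is just a count per category. A minimal sketch with collections.Counter, using invented occupancy flags:

```python
from collections import Counter

# Hypothetical occupancy flags: 0 = empty, 1 = occupied.
occupancy = [0, 0, 1, 0, 1, 0]

frequencies = Counter(occupancy)
for category, count in sorted(frequencies.items()):
    print(category, count)
```

With Pandas, Series.value_counts() gives the same kind of table.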

Correlation

As a first step in exploring the relationships between the columns we can look at the correlation coefficients. explorer.correlation() returns a Spearman rank-order correlation coefficient matrix in tabular form.

In [11]:
explorer.correlation()
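The Spearman coefficient is the Pearson correlation of the ranks of the two variables, so it measures monotonic rather than strictly linear association. The sketch below implements that definition with the standard library, giving tied values the average of their ranks. The helper names and data are invented, and this is not necessarily how Lens computes the matrix.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# A monotonic but non-linear relationship still has perfect rank correlation.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman(x, y))
```

With Pandas, DataFrame.corr(method='spearman') computes the same coefficient for every pair of numeric columns.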

However, parsing a correlation table becomes difficult when there are many columns in the dataset. To get a better overview, we can plot the correlation matrix as a heatmap, which immediately highlights a group of columns correlated with Occupancy: Temperature, Light, and CO2.

In [12]:
explorer.correlation_plot()

Distribution and Cumulative Distribution

We can explore the distribution of numerical variables through the distribution_plot and cdf_plot functions:

In [13]:
explorer.distribution_plot('Temperature')
In [14]:
explorer.cdf_plot('Temperature')
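To make the two views concrete, here is a minimal standard-library sketch of a histogram and an empirical cumulative distribution for a handful of invented temperature readings. The helper names are hypothetical, and Lens may estimate these quantities differently.

```python
def histogram(values, bins):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so the maximum falls in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def empirical_cdf(values):
    """Return (value, fraction of observations <= value) pairs in sorted order."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


# Hypothetical temperature readings, invented for illustration only.
temperature = [21.0, 19.0, 23.0, 20.0]

counts = histogram(temperature, bins=2)
cdf = empirical_cdf(temperature)
print(counts)
print(cdf)
```

The histogram shows where the observations concentrate, while the cumulative view shows what fraction of them lies below a given value.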

Pairwise plot

Once we know that certain columns might be correlated, it is useful to visually explore that correlation. This would typically be done through a scatter plot, and Lens has computed a 2D Kernel Density Estimate of the scatter plot that can be accessed through pairwise_density_plot.

In [15]:
explorer.pairwise_density_plot('Temperature', 'Humidity')

pairwise_density_plot can also show the relationship between a numeric column and a categorical column. In this case, a 1D KDE is computed for each of the categories in the categorical column.

In [16]:
explorer.pairwise_density_plot('Temperature', 'Occupancy')
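The numeric-versus-categorical case can be sketched as one 1D Gaussian kernel density estimate per category. The readings, the fixed bandwidth of 0.5 and the helper gaussian_kde below are all assumptions made for illustration. Lens may choose its kernel and bandwidth differently.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of samples with Gaussian kernels."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        return total / norm

    return density


# Hypothetical temperature readings grouped by occupancy (0 = empty, 1 = occupied).
temps_by_occupancy = {
    0: [20.1, 20.4, 20.8, 21.0],
    1: [22.0, 22.3, 22.9, 23.1],
}

# One density estimate per category; the bandwidth is an arbitrary choice.
densities = {
    category: gaussian_kde(temps, bandwidth=0.5)
    for category, temps in temps_by_occupancy.items()
}

for category, density in densities.items():
    print(category, density(20.5), density(22.5))
```

Plotting each density over a common temperature grid gives one curve per category, which makes shifts between the categories easy to spot.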

Crosstab

The pairwise relationship between two categorical variables can also be seen as a cross-tabulation: how many observations in the dataset fall into each combination of categories of the two variables. This can be seen as a table or as a plot, which can be useful when the number of categories is very large.

In [17]:
explorer.crosstab('Occupancy', 'Humidity_cat')
In [18]:
explorer.pairwise_density_plot('Occupancy', 'Humidity_cat')
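A cross-tabulation counts the observations for each pair of categories. A minimal standard-library sketch, using invented values for the two columns:

```python
from collections import Counter

# Hypothetical paired observations, invented for illustration only.
occupancy = [0, 0, 1, 1, 0, 1]
humidity_cat = ['low', 'low', 'medium', 'high', 'medium', 'medium']

# Count each (occupancy, humidity category) combination.
pair_counts = Counter(zip(occupancy, humidity_cat))

# Arrange the counts as a table; combinations never observed get 0.
rows = sorted(set(occupancy))
cols = sorted(set(humidity_cat))
table = {r: {c: pair_counts[(r, c)] for c in cols} for r in rows}

for r in rows:
    print(r, table[r])
```

With Pandas, pd.crosstab builds the same kind of table from two columns.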

Interactive widget

An alternative way of quickly exploring the plots available in Lens is through a Jupyter widget provided by lens.interactive_explore. Creating it is as easy as running this function on a Lens Summary.

Note that if you are reading this tutorial through the online docs the output of the following cell will not be interactive as it needs to run within a notebook. Download the notebook from the links below to try out the interactive explorer!

In [19]:
lens.interactive_explore(ls)

(room_occupancy_example.ipynb; room_occupancy_example_evaluated.ipynb; room_occupancy_example.py)