Lens Tutorial

We have prepared a Lens tutorial in the form of a Jupyter notebook. A static version is reproduced below, but you can also execute it yourself by downloading the notebook file.



Lens is a library for exploring data in pandas DataFrames. It computes single-column summary statistics and estimates the correlation between columns.

We wrote Lens when we realised that the initial steps of acquiring a new dataset were almost formulaic: What data type is in this column? How many null values are there? Which columns are correlated? What's the distribution of this value? Lens calculates all this for you and provides convenient visualisations of the results.
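For comparison, the manual version of those formulaic first steps might look like the sketch below in plain pandas. The `toy` DataFrame and its values are made up for illustration and are not part of the tutorial dataset. This is not a description of how Lens computes its summaries.

```python
import numpy as np
import pandas as pd

# Small synthetic frame, purely illustrative (not the tutorial dataset)
toy = pd.DataFrame({
    "temperature": [20.5, 21.0, np.nan, 22.5, 23.0],
    "co2": [400.0, 420.0, 450.0, 480.0, 500.0],
    "occupied": [0, 0, 1, 1, 1],
})

dtypes = toy.dtypes                                # What data type is in each column?
null_counts = toy.isnull().sum()                   # How many null values are there?
correlations = toy.corr()                          # Which columns are correlated?
temperature_stats = toy["temperature"].describe()  # What's the distribution of this value?

print(dtypes, null_counts, correlations, temperature_stats, sep="\n\n")
```

Each of these calls answers one question about one aspect of the data. Repeating them for every new dataset is the routine that Lens is meant to replace.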

You can use Lens to analyse new datasets as well as to compare how DataFrames change over time.

Using Lens

To start using Lens you need to import the library:

In [1]:
import lens

Lens has two key functions: lens.summarise, which generates a Lens Summary from a DataFrame, and lens.explore, which visualises the results of a summary.

For this tutorial we are going to use Lens to analyse the Room Occupancy dataset from the UC Irvine Machine Learning Repository. It includes ambient measurements of a room, such as Temperature, Humidity, Light and CO2, together with whether the room was occupied. The prediction task associated with the dataset is to infer occupancy from the room measurements.

To read it into a pandas DataFrame, use:

In [2]:
import pandas as pd
df = pd.read_csv('http://asi-datasets.s3.amazonaws.com/room_occupancy/room_occupancy.csv')

# Bin a numerical variable into five equal-width ranges to create an
# additional categorical variable
df['Humidity_cat'] = pd.cut(df['Humidity'], 5,
                            labels=['low', 'medium-low', 'medium',
                                    'medium-high', 'high']).astype('str')
(room_occupancy_example.ipynb; room_occupancy_example_evaluated.ipynb; room_occupancy_example.py)
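To see what the binning step in the cell above does, here is a minimal sketch that applies the same `pd.cut` call to a handful of hypothetical humidity values (the numbers are invented for illustration). With an integer bin count, `pd.cut` splits the observed range into equal-width intervals and assigns each value the label of its interval.

```python
import pandas as pd

# Hypothetical humidity readings, not taken from the tutorial dataset
humidity = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
labels = ['low', 'medium-low', 'medium', 'medium-high', 'high']

# Five equal-width bins over the range 10 to 60; intervals include their
# right edge, so 20.0 still falls into the first bin
binned = pd.cut(humidity, 5, labels=labels)

# Convert the categorical result to plain strings, as the tutorial cell does
as_str = binned.astype('str')

print(pd.DataFrame({'humidity': humidity, 'category': as_str}))
```

The conversion to strings mirrors the tutorial cell. Leaving the column as a pandas categorical would preserve the ordering of the labels instead.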