{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Lens Tutorial\n",
"\n",
"\n",
"*Lens* is a library for exploring data in Pandas DataFrames. It computes single\n",
"column summary statistics and estimates the correlation between columns.\n",
"\n",
"We wrote *Lens* when we realised that the initial steps of acquiring a new\n",
"dataset were almost formulaic: what data type is in this column? How many null\n",
"values are there? Which columns are correlated? What's the distribution of this\n",
"value? Lens calculates all this for you, and provides convenient visualisation\n",
"of this information.\n",
"\n",
"You can use *Lens* to analyse new datasets as well as using it to compare how\n",
"DataFrames change over time.\n",
"\n",
"## Using lens\n",
"\n",
"To start using *Lens* you need to import the library:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import lens"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Lens* has two key functions; ``lens.summarise`` for generating a Lens Summary from a DataFrame and\n",
"``lens.explore`` for visualising the results of a summary.\n",
"\n",
"For this tutorial we are going to use *Lens* to analyse the Room Occupancy\n",
"dataset provided in the [Machine Learning Repository of UC Irvine](https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+). It includes \n",
"ambient information about a room such as Temperature, Humidity,\n",
"Light, CO2 and whether it was occupied. The goal is to\n",
"predict occupancy based on the room measurements.\n",
"\n",
"To read it into pandas use:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('http://asi-datasets.s3.amazonaws.com/room_occupancy/room_occupancy.csv')\n",
"\n",
"# Split a numerical variable to have additional categorical variables\n",
"df['Humidity_cat'] = pd.cut(df['Humidity'], 5,\n",
" labels=['low', 'medium-low', 'medium',\n",
" 'medium-high', 'high']).astype('str')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of rows in dataset: 8143\n"
]
},
{
"data": {
"text/html": [
"
\n",
"
\n",
" \n",
" \n",
" | \n",
" date | \n",
" Temperature | \n",
" Humidity | \n",
" Light | \n",
" CO2 | \n",
" HumidityRatio | \n",
" Occupancy | \n",
" Humidity_cat | \n",
"
\n",
" \n",
" \n",
" \n",
" 1 | \n",
" 2015-02-04 17:51:00 | \n",
" 23.18 | \n",
" 27.2720 | \n",
" 426.0 | \n",
" 721.25 | \n",
" 0.004793 | \n",
" 1 | \n",
" medium | \n",
"
\n",
" \n",
" 2 | \n",
" 2015-02-04 17:51:59 | \n",
" 23.15 | \n",
" 27.2675 | \n",
" 429.5 | \n",
" 714.00 | \n",
" 0.004783 | \n",
" 1 | \n",
" medium | \n",
"
\n",
" \n",
" 3 | \n",
" 2015-02-04 17:53:00 | \n",
" 23.15 | \n",
" 27.2450 | \n",
" 426.0 | \n",
" 713.50 | \n",
" 0.004779 | \n",
" 1 | \n",
" medium | \n",
"
\n",
" \n",
" 4 | \n",
" 2015-02-04 17:54:00 | \n",
" 23.15 | \n",
" 27.2000 | \n",
" 426.0 | \n",
" 708.25 | \n",
" 0.004772 | \n",
" 1 | \n",
" medium | \n",
"
\n",
" \n",
" 5 | \n",
" 2015-02-04 17:55:00 | \n",
" 23.10 | \n",
" 27.2000 | \n",
" 426.0 | \n",
" 704.50 | \n",
" 0.004757 | \n",
" 1 | \n",
" medium | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" date Temperature Humidity Light CO2 HumidityRatio \\\n",
"1 2015-02-04 17:51:00 23.18 27.2720 426.0 721.25 0.004793 \n",
"2 2015-02-04 17:51:59 23.15 27.2675 429.5 714.00 0.004783 \n",
"3 2015-02-04 17:53:00 23.15 27.2450 426.0 713.50 0.004779 \n",
"4 2015-02-04 17:54:00 23.15 27.2000 426.0 708.25 0.004772 \n",
"5 2015-02-04 17:55:00 23.10 27.2000 426.0 704.50 0.004757 \n",
"\n",
" Occupancy Humidity_cat \n",
"1 1 medium \n",
"2 1 medium \n",
"3 1 medium \n",
"4 1 medium \n",
"5 1 medium "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print('Number of rows in dataset: {}'.format(len(df.index)))\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating the summary\n",
"\n",
"When you have a DataFrame that you'd like to analyse the first thing to do is\n",
"to create a Lens ``Summary`` object."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"ls = lens.summarise(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `summarise` function takes a DataFrame and returns a Lens ``Summary`` object. The\n",
"time this takes to run is dependent on both the number of rows and the number of\n",
"columns in the DataFrame. It will use all cores available on the machine, so you \n",
"might want to use a SherlockML instance with more cores to speed up the computation \n",
"of the summary. There are additional optional parameters that can be\n",
"passed in. Details of these can be found in the [summarise API docs](http://docs.sherlockml.com/lens/summarise_api.html#lens.summarise.summarise).\n",
"\n",
"Given that creating the summary is computationally intensive, *Lens* provides a way to save this summary to a JSON file on disk and recover a saved summary through the `to_json` and `from_json` methods of `lens.summary`. This allows to store it for future analysis or to share it with collaborators:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Saving to JSON\n",
"ls.to_json('room_occupancy_lens_summary.json')\n",
"\n",
"# Reading from a file\n",
"ls_from_json = lens.Summary.from_json('room_occupancy_lens_summary.json')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `LensSummary` object contains the information computed from the dataset and provides methods to access both column-wise and whole dataset information. It is designed to be used programatically, and information about the methods can be accessed in the [LensSummary API docs](http://docs.sherlockml.com/lens/summarise_api.html#lens.summarise.Summary)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['date', 'Temperature', 'Humidity', 'Light', 'CO2', 'HumidityRatio', 'Occupancy', 'Humidity_cat']\n"
]
}
],
"source": [
"print(ls.columns)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create explorer\n",
"\n",
"Lens provides a function that converts a Lens Summary into an `Explorer` object.\n",
"This can be used to see the summary information in tabular form and to display\n",
"plots."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"explorer = lens.explore(ls)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Coming back to our room occupancy dataset, the first thing that we'd like to\n",
"know is a high-level overview of the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Describe\n",
"\n",
"To show a general description of the DataFrame call the `describe` function.\n",
"This is similar to Pandas' ``DataFrame.describe`` but also shows information for non-numeric columns."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
" | date | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy | Humidity_cat |
desc | None | numeric | numeric | numeric | numeric | numeric | categorical | categorical |
dtype | object | float64 | float64 | float64 | float64 | float64 | int64 | object |
notnulls | 8143 | 8143 | 8143 | 8143 | 8143 | 8143 | 8143 | 8143 |
nulls | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
unique | 8143 | 265 | 1325 | 889 | 2282 | 3583 | 2 | 5 |
"
],
"text/plain": [
""
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"explorer.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see that our dataset has 8143 rows and all the rows are complete. This\n",
"is a very clean dataset! It also tells us the columns and their types, including a `desc` field that explains how *Lens* will treat this column."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Column details\n",
"\n",
"To see type-specific column details, use the `column_details` method. Used on a numeric column such as `Temperature`, it provides summary statistics for the data in that column, including minimun, maximum, mean, median, and standard deviation."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
" | Temperature |
desc | numeric |
dtype | float64 |
min | 19.0 |
max | 23.18 |
mean | 20.6190836403 |
median | 20.39 |
std | 1.01691644111 |
sum | 167901.198083 |
IQR | 1.69 |
"
],
"text/plain": [
""
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"explorer.column_details('Temperature')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We saw in the ouput of `explorer.describe()` that `Occupancy`, our target variable, is a categorical column with two unique values. With `explorer.column_details` we can obtain a frequency table for these two categories - empty (0) or occupied (1):"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"desc: categorical, dtype: int64
"
],
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"explorer.column_details('Occupancy')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Correlation\n",
"\n",
"As a first step in exploring the relationships between the columns we can look at the correlation coefficients. `explorer.correlation()` returns a Spearman rank-order correlation coefficient matrix in tabular form."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
" | Humidity | HumidityRatio | Temperature | CO2 | Light | Occupancy |
Humidity | 1.0 | 0.9403266291569079 | -0.19338923334162547 | 0.22351889930108668 | 0.0070653086819316094 | 0.1292350876247768 |
HumidityRatio | 0.9403266291569079 | 1.0 | 0.10476509821418747 | 0.3783569229168182 | 0.16905712790190097 | 0.25583595029547784 |
Temperature | -0.19338923334162547 | 0.10476509821418747 | 1.0 | 0.6369060817084772 | 0.5652372559460639 | 0.5328303325204367 |
CO2 | 0.22351889930108668 | 0.3783569229168182 | 0.6369060817084772 | 1.0 | 0.4731030126223473 | 0.6566512850978677 |
Light | 0.0070653086819316094 | 0.16905712790190097 | 0.5652372559460639 | 0.4731030126223473 | 1.0 | 0.8046454034169337 |
Occupancy | 0.1292350876247768 | 0.25583595029547784 | 0.5328303325204367 | 0.6566512850978677 | 0.8046454034169337 | 1.0 |
"
],
"text/plain": [
""
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"explorer.correlation()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"However, parsing a correlation table becomes difficult when there are many columns in the dataset. To get a better overview, we can plot the correlation matrix as a heatmap, which immediately highlights a group of columns correlated with `Occupancy`: `Temperature`, `Light`, and `CO2`."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"explorer.correlation_plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Distribution and Cumulative Distribution"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can explore the distribution of numerical variables through the `distribution_plot` and `cdf_plot` functions:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"explorer.distribution_plot('Temperature')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"explorer.cdf_plot('Temperature')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pairwise plot\n",
"\n",
"Once we know that certain columns might be correlated, it is useful to visually explore that correlation. This would typically be done through a scatter plot, and *Lens* has computed a 2D Kernel Density Estimate of the scatter plot that can be accessed through `pairwise_density_plot`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"explorer.pairwise_density_plot('Temperature', 'Humidity')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pairwise_density_plot` can also show the relationship between a numeric column and a categorical column. In this case, a 1D KDE is computed for each of the categories in the categorical column."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"explorer.pairwise_density_plot('Temperature', 'Occupancy')"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### Crosstab\n",
"\n",
"The pairwise relationship between two categorical variables can also be seen as a cross-tabulation: how many observations exist in the dataset of the combination of categories in the two variables. This can be seen as a table or as a plot, which can be useful when the number of categories is very large."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
" | 0 | 1 |
high | 143 | 243 |
low | 2127 | 425 |
medium | 1880 | 428 |
medium-high | 1420 | 310 |
medium-low | 844 | 323 |
"
],
"text/plain": [
""
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"explorer.crosstab('Occupancy', 'Humidity_cat')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"explorer.pairwise_density_plot('Occupancy', 'Humidity_cat')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Interactive widget\n",
"\n",
"An alternative way of quickly exploring the plots available in *Lens* is through a Jupyter widget provided by `lens.interactive_explore`. Creating it is as easy as running this function on a *Lens* `Summary`.\n",
"\n",
"Note that if you are reading this tutorial through the online docs the output of the following cell will not be interactive as it needs to run within a notebook. Download the notebook from the links below to try out the interactive explorer!"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Widget Javascript not detected. It may not be installed properly. Did you enable the widgetsnbextension? If not, then run \"jupyter nbextension enable --py --sys-prefix widgetsnbextension\"\n"
]
}
],
"source": [
"lens.interactive_explore(ls)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
},
"widgets": {
"state": {
"496016afdc01477d9f515f2c073feb4f": {
"views": [
{
"cell_index": 36
}
]
}
},
"version": "1.2.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}