lens.summarise API
Summarise a Pandas DataFrame
lens.summarise.summarise(df, scheduler='multiprocessing', num_workers=None, size=None, pairdensities=True)
Create a Lens Summary for a Pandas DataFrame.
This creates a Summary instance containing many quantities of interest to a data scientist.
Parameters: df : pd.DataFrame
DataFrame to be analysed.
scheduler : str, optional
Dask scheduler to use. Must be one of [‘multiprocessing’, ‘threaded’, ‘sync’].
num_workers : int or None, optional
Number of workers in the pool. If the environment variable NUM_CPUS is set, that number is used; otherwise, one worker is used per CPU available on the machine.
size : int, optional
DataFrame size on disk, which will be added to the report.
pairdensities : bool, optional
Whether to compute the pairdensity estimation between all pairs of numerical columns. For most datasets, this is the most expensive computation. Default is True.
Returns: summary : Summary
The computed data summary.
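The worker-count fallback described for num_workers can be sketched as follows. This only illustrates the documented behaviour and is not taken from the lens source: the helper name resolve_num_workers is hypothetical, and giving an explicit argument precedence over NUM_CPUS is an assumption.

```python
import multiprocessing
import os


def resolve_num_workers(num_workers=None, environ=None):
    """Hypothetical helper mirroring the documented NUM_CPUS fallback."""
    environ = os.environ if environ is None else environ
    # Assumption: an explicit argument takes precedence over the environment.
    if num_workers is not None:
        return int(num_workers)
    # Documented behaviour: use NUM_CPUS when it is set...
    if "NUM_CPUS" in environ:
        return int(environ["NUM_CPUS"])
    # ...otherwise use as many workers as there are CPUs on the machine.
    return multiprocessing.cpu_count()
```

Passing a plain dictionary as environ makes the fallback easy to check without touching the real environment.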
Examples
Let’s explore the wine quality dataset.
>>> import pandas as pd
>>> import lens
>>> url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"  # noqa
>>> wines_df = pd.read_csv(url, sep=';')
>>> summary = lens.summarise(wines_df)
Now that we have a Summary instance, we can inspect the shape of the dataset:
>>> summary.columns
['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
>>> summary.rows
4898
So far, nothing groundbreaking. Let’s look at the quality column:
>>> summary.summary('quality')
{'desc': 'categorical', 'dtype': 'int64', 'name': 'quality', 'notnulls': 4898, 'nulls': 0, 'unique': 7}
This tells us that there are seven unique values in the quality column, and zero null values. It also tells us that lens will treat this column as categorical. Let’s look at this in more detail:
>>> summary.details('quality')
{'desc': 'categorical', 'frequencies': {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}, 'iqr': 1.0, 'max': 9, 'mean': 5.8779093507554103, 'median': 6.0, 'min': 3, 'name': 'quality', 'std': 0.88563857496783116, 'sum': 28790}
This tells us that the median wine quality is 6 and the standard deviation is less than one. Let’s now get the correlation between the quality column and the alcohol column:
>>> summary.pair_details('quality', 'alcohol')['correlation']
{'pearson': 0.4355747154613688, 'spearman': 0.4403691816246831}
Thus, the Spearman Rank Correlation coefficient between these two columns is 0.44.
class lens.summarise.Summary(report)
A summary of a pandas DataFrame.
Create a summary instance by calling lens.summarise.summarise() on a DataFrame. This calculates several quantities of interest to data scientists.
The Summary object is designed for programmatic use. For more direct visual inspection, use the lens.explorer.Explorer class in a Jupyter notebook.
cdf(column)
Approximate cdf for column.
This returns a function representing the cdf of a numeric column.
Parameters: column : str
Name of the column.
Returns: cdf: function
Function representing the cdf.
Examples
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> cdf = summary.cdf('chlorides')
>>> min_value = summary.details('chlorides')['min']
>>> max_value = summary.details('chlorides')['max']
>>> xs = np.linspace(min_value, max_value, 200)
>>> plt.plot(xs, cdf(xs))
columns
Get a list of column names of the dataset.
Returns: list
Column names
Examples
>>> summary.columns
['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
correlation_matrix(include=None, exclude=None)
Correlation matrix for numeric columns.
Parameters: include: list of strings, optional
List of numeric columns to include. Includes all columns by default.
exclude: list of strings, optional
List of numeric columns to exclude. Excludes no columns by default.
Returns: columns: list of strings
List of column names
correlation_matrix: 2D array of floats
The correlation matrix, ordered such that correlation_matrix[i, j] is the correlation between columns[i] and columns[j].
Notes
The columns are ordered through hierarchical clustering. Thus, neighbouring columns in the output will be more correlated.
details(column)
Type-specific information for a column.
The details method returns additional information on column, beyond that provided by the summary method. If column is numeric, this returns summary statistics. If it is categorical, it returns a dictionary of how often each category occurs.
Parameters: column : str
Column name
Returns: dict
Dictionary of detailed information.
Examples
>>> summary.details('alcohol')
{'desc': 'numeric', 'iqr': 1.9000000000000004, 'max': 14.199999999999999, 'mean': 10.514267047774602, 'median': 10.4, 'min': 8.0, 'name': 'alcohol', 'std': 1.2306205677573181, 'sum': 51498.880000000005}
>>> summary.details('quality')
{'desc': 'categorical', 'frequencies': {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}, 'iqr': 1.0, 'max': 9, 'mean': 5.8779093507554103, 'median': 6.0, 'min': 3, 'name': 'quality', 'std': 0.88563857496783116, 'sum': 28790}
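For a categorical column such as quality, several of the other fields follow directly from the frequencies dictionary. The sketch below recomputes them in plain Python from the example output as a consistency check; it does not show how lens derives them, and the helper name median_from_frequencies is hypothetical.

```python
# Frequencies reported by summary.details('quality') in the example.
frequencies = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

count = sum(frequencies.values())                   # number of values
total = sum(v * n for v, n in frequencies.items())  # corresponds to 'sum'
mean = total / count                                # corresponds to 'mean'


def median_from_frequencies(freqs):
    """Median of the values encoded by a value -> count mapping."""
    n = sum(freqs.values())
    # 0-based positions of the middle element(s) in the sorted data.
    lo_pos, hi_pos = (n - 1) // 2, n // 2
    seen, lo_val, hi_val = 0, None, None
    for value in sorted(freqs):
        seen += freqs[value]
        if lo_val is None and seen > lo_pos:
            lo_val = value
        if hi_val is None and seen > hi_pos:
            hi_val = value
            break
    return (lo_val + hi_val) / 2


median = median_from_frequencies(frequencies)
print(count, total, round(mean, 4), median)  # 4898 28790 5.8779 6.0
```

The printed values agree with the row count, 'sum', 'mean' and 'median' entries of the example output.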
static from_json(file)
Create a Summary from a report saved in JSON format.
Parameters: file : str or buffer
Path to file containing the JSON report or buffer from which the report can be read.
Returns: Summary
object containing the summary in the JSON file.
histogram(column)
Return the histogram for column.
This function returns a histogram for the column. The number of bins is estimated through the Freedman-Diaconis rule.
Parameters: column: str
Name of the column
Returns: counts: array
Counts for each of the bins of the histogram.
bin_edges : array
Edges of the bins in the histogram. Length is length(counts)+1.
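The Freedman-Diaconis rule sets the bin width to 2 * IQR / n^(1/3). The following plain-Python sketch shows how such a histogram could be built and why bin_edges has one more entry than counts. The function names, the linear-interpolation percentile and the single-bin fallback are assumptions made for illustration, not the lens implementation.

```python
import math


def percentile(sorted_values, q):
    """Percentile with linear interpolation; q is in [0, 100]."""
    pos = (len(sorted_values) - 1) * q / 100
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def freedman_diaconis_histogram(values):
    """Histogram whose bin width follows 2 * IQR / n ** (1/3)."""
    data = sorted(values)
    n = len(data)
    iqr = percentile(data, 75) - percentile(data, 25)
    lo, hi = data[0], data[-1]
    width = 2 * iqr / n ** (1 / 3)
    # Assumption: fall back to a single bin when the rule gives a zero width.
    num_bins = max(1, math.ceil((hi - lo) / width)) if width > 0 else 1
    step = (hi - lo) / num_bins
    bin_edges = [lo + i * step for i in range(num_bins + 1)]
    counts = [0] * num_bins
    for x in data:
        # The last bin is closed on the right so that the maximum is counted.
        index = min(int((x - lo) / step), num_bins - 1) if step > 0 else 0
        counts[index] += 1
    return counts, bin_edges
```

For example, freedman_diaconis_histogram(range(1, 101)) yields five bins and six bin edges.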
kde(column)
Return a Kernel Density Estimate for column.
This function returns a KDE for the column. It is computed between the minimum and maximum values of the column and uses Scott’s rule to compute the bandwidth.
Parameters: column: str
Name of the column
Returns: x: array
Values at which the KDE has been evaluated.
y : array
Values of the KDE.
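As an illustration of the returned x and y arrays, here is a plain-Python sketch of a one-dimensional KDE evaluated between the column minimum and maximum. It assumes a Gaussian kernel and the form of Scott’s rule in which the bandwidth is the sample standard deviation times n^(-1/5), the convention used for one-dimensional data by SciPy’s gaussian_kde. Whether lens uses exactly this kernel, grid size or convention is not stated here, and the function names are hypothetical.

```python
import math
import statistics


def scott_bandwidth(values):
    """Sample standard deviation times n ** (-1/5) (1D Scott factor)."""
    n = len(values)
    return statistics.stdev(values) * n ** (-1 / 5)


def gaussian_kde(values, num_points=100):
    """Evaluate a Gaussian KDE on an even grid between min and max."""
    h = scott_bandwidth(values)
    lo, hi = min(values), max(values)
    step = (hi - lo) / (num_points - 1)
    x = [lo + i * step for i in range(num_points)]
    # Normalisation so that the estimate integrates to one over the real line.
    norm = 1 / (len(values) * h * math.sqrt(2 * math.pi))
    y = [
        norm * sum(math.exp(-0.5 * ((xi - v) / h) ** 2) for v in values)
        for xi in x
    ]
    return x, y
```

Because the grid stops at the minimum and maximum, the area under the returned curve is somewhat less than one: the kernel tails beyond the data range are cut off.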
pair_details(first, second)
Get pairwise information for a column pair.
The information returned depends on the types of the two columns. It may contain the following keys.
correlation
- dictionary with the Spearman rank correlation coefficient and Pearson product-moment correlation coefficient between the columns. This is returned when both columns are numeric.
pairdensity
- dictionary with an estimate of the pairwise density between the columns. The density is either a 2D KDE estimate if both columns are numerical, several 1D KDE estimates (grouped by the categorical column) if one of the columns is categorical and the other numerical, or a cross-tabulation.
Parameters: first : str
Name of the first column.
second : str
Name of the second column.
Returns: dict
Dictionary of pairwise information.
Examples
>>> summary.pair_details('chlorides', 'quality')
{'correlation': {
    'pearson': -0.20993441094675602,
    'spearman': -0.31448847828244203},
 'pairdensity': {
    'density': <2d numpy array>,
    'x': <1d numpy array of x-values>,
    'y': <1d numpy array of y-values>,
    'x_scale': 'linear',
    'y_scale': 'cat'}
}
>>> summary.pair_details('alcohol', 'chlorides')
{'correlation': {
    'pearson': -0.36018871210816106,
    'spearman': -0.5708064071153713},
 'pairdensity': {
    'density': <2d numpy array>,
    'x': <1d numpy array of x-values>,
    'y': <1d numpy array of y-values>,
    'x_scale': 'linear',
    'y_scale': 'linear'}
}
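The two entries of the correlation dictionary measure different things: Pearson captures linear association, while Spearman is Pearson’s coefficient applied to ranks and so captures any monotonic relationship. The plain-Python sketch below illustrates the difference; it is an independent illustration with hypothetical function names, not the code lens runs.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's coefficient: Pearson's coefficient computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))
```

For xs = [1, 2, 3, 4, 5] and ys = [1, 4, 9, 16, 25], the relationship is monotonic but not linear, so spearman(xs, ys) is 1.0 while pearson(xs, ys) is slightly below 1.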
pdf(column)
Approximate pdf for column.
This returns a function representing the pdf of a numeric column.
Parameters: column : str
Name of the column.
Returns: pdf: function
Function representing the pdf.
Examples
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> pdf = summary.pdf('chlorides')
>>> min_value = summary.details('chlorides')['min']
>>> max_value = summary.details('chlorides')['max']
>>> xs = np.linspace(min_value, max_value, 200)
>>> plt.plot(xs, pdf(xs))
rows
Get the number of rows in the dataset.
Returns: int
Number of rows
Examples
>>> summary.rows
4898
rows_unique
Get the number of unique rows in the dataset.
Returns: int
Number of unique rows.
summary(column)
Basic information about the column.
This returns information about the number of nulls and unique values in column as well as which type this column is. This is guaranteed to return a dictionary with the same keys for every column.
The dictionary contains the following keys:
desc
- the type of data: currently categorical or numeric. Lens will calculate different quantities for this column depending on the value of desc.
dtype
- the type of data in Pandas.
name
- column name
notnulls
- number of non-null values in the column
nulls
- number of null values in the column
unique
- number of unique values in the column
Parameters: column : str
Column name
Returns: dict
Dictionary of summary information.
Examples
>>> summary.summary('quality')
{'desc': 'categorical', 'dtype': 'int64', 'name': 'quality', 'notnulls': 4898, 'nulls': 0, 'unique': 7}
>>> summary.summary('chlorides')
{'desc': 'numeric', 'dtype': 'float64', 'name': 'chlorides', 'notnulls': 4898, 'nulls': 0, 'unique': 160}
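The count fields of this dictionary are related: notnulls plus nulls equals the number of rows. A minimal plain-Python sketch of these fields follows. The helper name basic_counts is hypothetical, nulls are represented by None only, and excluding nulls from the unique count is an assumption rather than documented lens behaviour. The desc and dtype fields are omitted because they depend on Pandas type information.

```python
def basic_counts(name, values):
    """Hypothetical helper computing the count fields of a column summary."""
    non_null = [v for v in values if v is not None]
    return {
        "name": name,
        "notnulls": len(non_null),
        "nulls": len(values) - len(non_null),
        # Assumption: nulls are not counted as a distinct value.
        "unique": len(set(non_null)),
    }


counts = basic_counts("quality", [5, 6, None, 6, 7])
# notnulls + nulls always equals the number of rows in the column.
assert counts["notnulls"] + counts["nulls"] == 5
```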
tdigest(column)
Return a TDigest object approximating the distribution of a column.
Documentation for the TDigest class can be found at https://github.com/CamDavidsonPilon/tdigest.
Parameters: column : str
Name of the column.
Returns: tdigest.TDigest
TDigest instance computed from the values of the column.