The Breast Cancer Wisconsin dataset is a Machine Learning friendly dataset.
It is widely used for demonstrating various exploratory data analysis (EDA) approaches: it is well-behaved, there is plenty of prior work to draw inspiration from, and the subject matter is important in its own right, which makes it a good starting point.
The Breast Cancer Wisconsin (Diagnostic) data is available from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.
There are numerous excellent notebooks on Kaggle; two are listed below:
https://www.kaggle.com/harikrishna9/tumor-diagnosis-exploratory-data-analysis/data
https://www.kaggle.com/asimislam/tutorial-python-subplots
In this notebook, I have compiled some methods that can serve as a starting point for exploring any tabular dataset, keeping in mind that this is rather straightforward data: only numerical values, no missing values, and no severe imbalance between the two classes.
# importation of libraries to be used
import pandas as pd
import numpy as np
import seaborn as sns # data visualization library
import matplotlib.pyplot as plt
# read the data file, in this case from a folder on the local computer
data = pd.read_csv("C:\\Data\\bcwd.csv")
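If the CSV file is not at hand, the same data also ships with scikit-learn; below is a minimal alternative load (an assumption on my part, not part of the original workflow; note that the bundled copy uses slightly different column names, e.g. 'mean radius' instead of 'radius_mean', and encodes the target the other way around, 0 = malignant):
# alternative (requires scikit-learn >= 0.23): load the bundled copy of the dataset
from sklearn.datasets import load_breast_cancer
bunch = load_breast_cancer(as_frame=True)
alt_data = bunch.frame  # 30 features plus a 'target' column (0 = malignant, 1 = benign)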
# what are the features of this dataset?
data.columns
Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean', 'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se', 'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se', 'fractal_dimension_se', 'radius_worst', 'texture_worst', 'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst', 'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'], dtype='object')
# the 'Unnamed: 32' and 'id' columns are to be excluded going forward
data = data.drop(['Unnamed: 32', 'id'], axis=1)
# set the maximum number of columns that may be displayed in a dataframe
pd.set_option('display.max_columns', data.shape[1])
# check the size of the dataset
print('Number of samples:', data.shape[0], '; Number of columns:', data.shape[1])
Number of samples: 569 ; Number of columns: 31
# this information (and more) can also be obtained with data.info()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   diagnosis                569 non-null    object
 1   radius_mean              569 non-null    float64
 2   texture_mean             569 non-null    float64
 3   perimeter_mean           569 non-null    float64
 4   area_mean                569 non-null    float64
 5   smoothness_mean          569 non-null    float64
 6   compactness_mean         569 non-null    float64
 7   concavity_mean           569 non-null    float64
 8   concave points_mean      569 non-null    float64
 9   symmetry_mean            569 non-null    float64
 10  fractal_dimension_mean   569 non-null    float64
 11  radius_se                569 non-null    float64
 12  texture_se               569 non-null    float64
 13  perimeter_se             569 non-null    float64
 14  area_se                  569 non-null    float64
 15  smoothness_se            569 non-null    float64
 16  compactness_se           569 non-null    float64
 17  concavity_se             569 non-null    float64
 18  concave points_se        569 non-null    float64
 19  symmetry_se              569 non-null    float64
 20  fractal_dimension_se     569 non-null    float64
 21  radius_worst             569 non-null    float64
 22  texture_worst            569 non-null    float64
 23  perimeter_worst          569 non-null    float64
 24  area_worst               569 non-null    float64
 25  smoothness_worst         569 non-null    float64
 26  compactness_worst        569 non-null    float64
 27  concavity_worst          569 non-null    float64
 28  concave points_worst     569 non-null    float64
 29  symmetry_worst           569 non-null    float64
 30  fractal_dimension_worst  569 non-null    float64
dtypes: float64(30), object(1)
memory usage: 137.9+ KB
# let us have a look at the data
data.tail()
| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
564 | M | 21.56 | 22.39 | 142.00 | 1479.0 | 0.11100 | 0.11590 | 0.24390 | 0.13890 | 0.1726 | 0.05623 | 1.1760 | 1.256 | 7.673 | 158.70 | 0.010300 | 0.02891 | 0.05198 | 0.02454 | 0.01114 | 0.004239 | 25.450 | 26.40 | 166.10 | 2027.0 | 0.14100 | 0.21130 | 0.4107 | 0.2216 | 0.2060 | 0.07115 |
565 | M | 20.13 | 28.25 | 131.20 | 1261.0 | 0.09780 | 0.10340 | 0.14400 | 0.09791 | 0.1752 | 0.05533 | 0.7655 | 2.463 | 5.203 | 99.04 | 0.005769 | 0.02423 | 0.03950 | 0.01678 | 0.01898 | 0.002498 | 23.690 | 38.25 | 155.00 | 1731.0 | 0.11660 | 0.19220 | 0.3215 | 0.1628 | 0.2572 | 0.06637 |
566 | M | 16.60 | 28.08 | 108.30 | 858.1 | 0.08455 | 0.10230 | 0.09251 | 0.05302 | 0.1590 | 0.05648 | 0.4564 | 1.075 | 3.425 | 48.55 | 0.005903 | 0.03731 | 0.04730 | 0.01557 | 0.01318 | 0.003892 | 18.980 | 34.12 | 126.70 | 1124.0 | 0.11390 | 0.30940 | 0.3403 | 0.1418 | 0.2218 | 0.07820 |
567 | M | 20.60 | 29.33 | 140.10 | 1265.0 | 0.11780 | 0.27700 | 0.35140 | 0.15200 | 0.2397 | 0.07016 | 0.7260 | 1.595 | 5.772 | 86.22 | 0.006522 | 0.06158 | 0.07117 | 0.01664 | 0.02324 | 0.006185 | 25.740 | 39.42 | 184.60 | 1821.0 | 0.16500 | 0.86810 | 0.9387 | 0.2650 | 0.4087 | 0.12400 |
568 | B | 7.76 | 24.54 | 47.92 | 181.0 | 0.05263 | 0.04362 | 0.00000 | 0.00000 | 0.1587 | 0.05884 | 0.3857 | 1.428 | 2.548 | 19.15 | 0.007189 | 0.00466 | 0.00000 | 0.00000 | 0.02676 | 0.002783 | 9.456 | 30.37 | 59.16 | 268.6 | 0.08996 | 0.06444 | 0.0000 | 0.0000 | 0.2871 | 0.07039 |
"""
from the above we can see that 'diagnosis' need to be converted to a more ML suitable format
where B (benign) = 0 and M (malignant) = 1 are set
"""
# map the diagnosis labels to integers
data['diagnosis'] = data['diagnosis'].replace({'B': 0, 'M': 1})
"""
although we stated that there are only numeric values and no missing values,
let us check this anyhow
"""
# exclude the target column and flag rows containing any non-numeric entry
numeric_data = data.iloc[:, 1:]
non_numeric_rows = ~numeric_data.applymap(np.isreal).all(axis=1)
print('Number of rows containing non-numeric data:', non_numeric_rows.sum())
Number of rows containing non-numeric data: 0
# the scales of the different features vary widely; this will be addressed when modelling
data.mean()
diagnosis                    0.372583
radius_mean                 14.127292
texture_mean                19.289649
perimeter_mean              91.969033
area_mean                  654.889104
smoothness_mean              0.096360
compactness_mean             0.104341
concavity_mean               0.088799
concave points_mean          0.048919
symmetry_mean                0.181162
fractal_dimension_mean       0.062798
radius_se                    0.405172
texture_se                   1.216853
perimeter_se                 2.866059
area_se                     40.337079
smoothness_se                0.007041
compactness_se               0.025478
concavity_se                 0.031894
concave points_se            0.011796
symmetry_se                  0.020542
fractal_dimension_se         0.003795
radius_worst                16.269190
texture_worst               25.677223
perimeter_worst            107.261213
area_worst                 880.583128
smoothness_worst             0.132369
compactness_worst            0.254265
concavity_worst              0.272188
concave points_worst         0.114606
symmetry_worst               0.290076
fractal_dimension_worst      0.083946
dtype: float64
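The means above span four orders of magnitude (compare area_mean at ~655 with fractal_dimension_se at ~0.004). A minimal sketch of how this is typically handled at modelling time, assuming scikit-learn's StandardScaler (in practice this belongs inside a Pipeline so the scaler is fit on training data only):
# sketch only: rescale each feature to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaled = pd.DataFrame(StandardScaler().fit_transform(data.iloc[:, 1:]),
                      columns=data.columns[1:])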
"""
we have now checked for non-numeric entries and found none;
let us now proceed to get more information on the dataset before we set up the ML model
"""
"""
-------------------------------------------------------------------------
Some helper functions for descriptive statistics, which I found useful, provided by
https://wacamlds.podia.com/courses/data-science-and-machine-learning-projects-in-python
-------------------------------------------------------------------------
"""
def get_redundant_pairs(df):
    # diagonal and lower-triangle pairs of the correlation matrix are redundant
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i + 1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=10):
    # rank the unique feature pairs by their correlation coefficient
    au_corr = df.corr().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

def corrank(X):
    # print all feature pairs sorted by correlation, highest first
    import itertools
    df = pd.DataFrame([[(i, j), X.corr().loc[i, j]]
                       for i, j in list(itertools.combinations(X.corr(), 2))],
                      columns=['pairs', 'corr'])
    print(df.sort_values(by='corr', ascending=False))
    print()
# to be able to use the helper functions
feature_names = data.columns.tolist()[1:]  # all columns except the target
target = 'diagnosis'
"""
-------------------------------------------------------------------------
descriptive statistics and correlation matrix
https://wacamlds.podia.com/courses/data-science-and-machine-learning-projects-in-python
-------------------------------------------------------------------------
"""
def data_descriptiveStats(feature_names, target, dataset):
    # Count Number of Missing Value on Each Column
    print(); print('Count Number of Missing Value on Each Column: ')
    print(); print(dataset[feature_names].isnull().sum(axis=0))
    print(); print(dataset[target].isnull().sum(axis=0))
    # Ranking of Correlation Coefficients among Variable Pairs
    print(); print("Ranking of Correlation Coefficients:")
    corrank(dataset[feature_names])
    # Print Highly Correlated Variables
    print(); print("Highly correlated variables (Absolute Correlations):")
    print(); print(get_top_abs_correlations(dataset[feature_names], 15))
    # Get Information on the target
    print(); print(dataset[target].describe())
    print(); print(dataset.groupby(target).size())
data_descriptiveStats(feature_names, target, data)
Count Number of Missing Value on Each Column:

radius_mean                0
texture_mean               0
perimeter_mean             0
area_mean                  0
smoothness_mean            0
compactness_mean           0
concavity_mean             0
concave points_mean        0
symmetry_mean              0
fractal_dimension_mean     0
radius_se                  0
texture_se                 0
perimeter_se               0
area_se                    0
smoothness_se              0
compactness_se             0
concavity_se               0
concave points_se          0
symmetry_se                0
fractal_dimension_se       0
radius_worst               0
texture_worst              0
perimeter_worst            0
area_worst                 0
smoothness_worst           0
compactness_worst          0
concavity_worst            0
concave points_worst       0
symmetry_worst             0
fractal_dimension_worst    0
dtype: int64

0

Ranking of Correlation Coefficients:
                                          pairs      corr
1                 (radius_mean, perimeter_mean)  0.997855
391             (radius_worst, perimeter_worst)  0.993708
2                      (radius_mean, area_mean)  0.987357
57                  (perimeter_mean, area_mean)  0.986507
392                  (radius_worst, area_worst)  0.984015
..                                          ...       ...
238        (fractal_dimension_mean, area_worst) -0.231854
235      (fractal_dimension_mean, radius_worst) -0.253691
63     (perimeter_mean, fractal_dimension_mean) -0.261477
89         (area_mean, fractal_dimension_mean) -0.283110
8        (radius_mean, fractal_dimension_mean) -0.311631

[435 rows x 2 columns]

Highly correlated variables (Absolute Correlations):

radius_mean      perimeter_mean     0.997855
radius_worst     perimeter_worst    0.993708
radius_mean      area_mean          0.987357
perimeter_mean   area_mean          0.986507
radius_worst     area_worst         0.984015
perimeter_worst  area_worst         0.977578
radius_se        perimeter_se       0.972794
perimeter_mean   perimeter_worst    0.970387
radius_mean      radius_worst       0.969539
perimeter_mean   radius_worst       0.969476
radius_mean      perimeter_worst    0.965137
area_mean        radius_worst       0.962746
                 area_worst         0.959213
                 perimeter_worst    0.959120
radius_se        area_se            0.951830
dtype: float64

count    569.000000
mean       0.372583
std        0.483918
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: diagnosis, dtype: float64

diagnosis
0    357
1    212
dtype: int64
There is a high correlation between the different size parameters (radius, perimeter, area), which is perhaps what we would expect.
The dataset is slightly imbalanced (357 benign vs 212 malignant), which we will have to address when performing classification analysis.
But let us continue with the EDA.
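As a hedged preview of how the imbalance can be respected at modelling time (a sketch assuming scikit-learn; the estimator choice and random_state are mine, not part of this EDA):
# sketch only: a stratified split keeps the ~63/37 class ratio in both sets,
# and class_weight='balanced' reweights classes inversely to their frequency
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(
    data[feature_names], data[target], stratify=data[target], random_state=42)
clf = LogisticRegression(class_weight='balanced', max_iter=5000)
clf.fit(X_train, y_train)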
# -------------------------------------------------------------------------
# data visualisation and correlation graph
# -------------------------------------------------------------------------
def data_visualization(feature_names, target, dataset):
    # boxplot for every feature
    b = 3                                        # number of columns
    a = int(np.ceil(len(feature_names) / b))     # number of rows
    c = 1                                        # initialize plot counter
    fig = plt.figure(figsize=(14, 30))
    for i in feature_names:
        plt.subplot(a, b, c)
        plt.title('{} (box), subplot: {}{}{}'.format(i, a, b, c))
        plt.xlabel(i)
        plt.boxplot(x=dataset[i])
        c = c + 1
    plt.show()

    # Correlation Plot using seaborn
    print(); print("Correlation plot of Numerical features")
    # Compute the correlation matrix
    corr = dataset[feature_names].corr()
    # Generate a mask for the upper triangle (np.bool is deprecated; use bool)
    mask = np.zeros_like(corr, dtype=bool)
    mask[np.triu_indices_from(mask)] = True
    # Set up the matplotlib figure
    f, ax = plt.subplots(figsize=(11, 9))
    # Generate a custom diverging colormap
    cmap = sns.diverging_palette(220, 10, as_cmap=True)
    # Draw the heatmap with the mask and correct aspect ratio
    sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.0, vmin=-1.0, center=0, square=True,
                linewidths=.5, cbar_kws={"shrink": .5})
    plt.show()

data_visualization(feature_names, target, data)
Correlation plot of Numerical features
Looking for correlation to Diagnosis
# Find all correlations and sort
correlations_data = data.corr()['diagnosis'].sort_values()
# Print the most negative correlations
print(correlations_data.head(15), '\n')
smoothness_se             -0.067016
fractal_dimension_mean    -0.012838
texture_se                -0.008303
symmetry_se               -0.006522
fractal_dimension_se       0.077972
concavity_se               0.253730
compactness_se             0.292999
fractal_dimension_worst    0.323872
symmetry_mean              0.330499
smoothness_mean            0.358560
concave points_se          0.408042
texture_mean               0.415185
symmetry_worst             0.416294
smoothness_worst           0.421465
texture_worst              0.456903
Name: diagnosis, dtype: float64
# Print the most positive correlations
print(correlations_data.tail(15), '\n')
perimeter_se            0.556141
radius_se               0.567134
compactness_worst       0.590998
compactness_mean        0.596534
concavity_worst         0.659610
concavity_mean          0.696360
area_mean               0.708984
radius_mean             0.730029
area_worst              0.733825
perimeter_mean          0.742636
radius_worst            0.776454
concave points_mean     0.776614
perimeter_worst         0.782914
concave points_worst    0.793566
diagnosis               1.000000
Name: diagnosis, dtype: float64
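Given the near-perfect correlations among the size features found above, here is a sketch of how redundant features could be pruned (the 0.95 threshold is an arbitrary choice for illustration, not a recommendation from this notebook):
# sketch only: flag one feature from every pair with |corr| > 0.95
corr_matrix = data[feature_names].corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print('Candidate features to drop:', to_drop)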
sns.clustermap(data.corr())
plt.show()
Conclusions from the first phase of EDA
- There are no missing values
- All values are numeric
- Many features are skewed and have outliers
- There is a strong correlation between some features (indicating redundancy)
- Slight imbalance between B (benign) and M (malignant) representation
- Two main clusters of features
In the next phase, 1) skewness, 2) outliers and 3) scale variance will be addressed before moving on to modelling and feature importance.
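As a hedged sketch of that next phase (the log1p transform and the 1.5*IQR clipping rule are illustrative assumptions, not the final recipe):
# sketch only: reduce right skew with log1p (all features here are non-negative)
# and clip outliers to the 1.5*IQR whiskers; scaling is left to the model pipeline
skewed = data[feature_names].skew()
to_transform = skewed[skewed > 1].index  # heuristic skewness threshold
prepped = data.copy()
prepped[to_transform] = np.log1p(prepped[to_transform])
q1 = prepped[feature_names].quantile(0.25)
q3 = prepped[feature_names].quantile(0.75)
iqr = q3 - q1
prepped[feature_names] = prepped[feature_names].clip(q1 - 1.5 * iqr,
                                                     q3 + 1.5 * iqr, axis=1)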
"Hasta luego"
SPJ