Power Transform - The Breast Cancer Wisconsin dataset¶
"A power transform is a family of functions that are applied to create a monotonic transformation of data using power functions. This is a useful data transformation technique used to stabilize variance, make the data more normal distribution-like, improve the validity of measures of association such as the Pearson correlation between variables and for other data stabilization procedures". https://en.wikipedia.org/wiki/Power_transform date: 2021-02-12
In a previous blog post, I concluded that the next phase should address: 1) skewness, 2) outliers and 3) scale variance. https://mlja.se/2021/02/09/the-breast-cancer-wisconsin-dataset-is-a-machine-learning-friendly-dataset/
At least points 1 and 3 are somewhat addressed in this post.
This post is almost entirely based on a script by Jason Brownlee - thanks a lot. Please read his blog post, linked below.
https://machinelearningmastery.com/power-transforms-with-scikit-learn/
# Import libraries
from pandas import DataFrame
import pandas as pd
from numpy import mean
from numpy import std
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from matplotlib import pyplot
# Load dataset
# read the data file, in this case from a folder on the local computer
dataset = pd.read_csv("C:\\Data\\bcwd.csv")
# drop the 'Unnamed: 32' and 'id' columns, which carry no predictive information
dataset = dataset.drop(['Unnamed: 32', 'id'], axis=1)
columns = dataset.columns
# encode the diagnosis labels: benign (B) -> 0, malignant (M) -> 1
dataset = dataset.replace({'B': 0})
dataset = dataset.replace({'M': 1})
dataset.head()
| | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
| 1 | 1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
| 2 | 1 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
| 3 | 1 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
| 4 | 1 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 31 columns
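As a side note, the two replace calls above could be collapsed into a single mapping on the diagnosis column; an equivalent one-liner, shown commented out since the labels are already encoded at this point:
# equivalent to the two replace calls: map the labels directly
# dataset['diagnosis'] = dataset['diagnosis'].map({'B': 0, 'M': 1})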
y = dataset["diagnosis"]
X = dataset.drop('diagnosis',axis=1)
Visualize a Box-Cox and Yeo-Johnson transform of the Breast Cancer Wisconsin dataset¶
# column 0 is 'diagnosis', so indices 1-6 pick the first six feature columns
list_tot = columns.tolist()
indices = [1, 2, 3, 4, 5, 6]
selected_elements = [list_tot[index] for index in indices]
selected_elements
['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean']
Histograms of the selected_elements¶
dataset.hist(figsize=[10,10],column=selected_elements)
pyplot.show()
Perform a Box-Cox transform of the dataset¶
data = X.values[:, :]
# Box-Cox requires strictly positive inputs, so first shift all features into [1, 2]
scaler = MinMaxScaler(feature_range=(1, 2))
power = PowerTransformer(method='box-cox')
pipeline = Pipeline(steps=[('s', scaler), ('p', power)])
data = pipeline.fit_transform(data)
# convert the array back to a dataframe
dataset = DataFrame(data)
# histograms of the variables
dataset.hist(figsize=[10,10],column=[0,1,2,3,4,5])
pyplot.show()
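To quantify what the histograms show, one can compare per-feature skewness before and after the transform. A small sketch using pandas' skew(), not part of the original script:
# skewness near 0 indicates a roughly symmetric distribution
before = X.iloc[:, :6].skew().reset_index(drop=True)
after = dataset.iloc[:, :6].skew().reset_index(drop=True)
print(pd.DataFrame({'before': before, 'after': after}))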
Perform a Yeo-Johnson transform of the dataset¶
data = X.values[:, :]
# Yeo-Johnson also accepts zero and negative values, so standardizing is enough
scaler = StandardScaler()
power = PowerTransformer(method='yeo-johnson')
pipeline = Pipeline(steps=[('s', scaler), ('p', power)])
data = pipeline.fit_transform(data)
# convert the array back to a dataframe
dataset = DataFrame(data)
# histograms of the variables
dataset.hist(figsize=[10,10],column=[0,1,2,3,4,5])
pyplot.show()
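It can also be instructive to inspect the exponents the transformer estimated for each feature. A short sketch that reads them from the pipeline fitted in the cell above:
# the fitted PowerTransformer exposes the per-feature exponents as lambdas_
fitted_power = pipeline.named_steps['p']
print(fitted_power.lambdas_[:6])  # lambdas for the first six features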
Evaluate Logistic Regression on the raw BCW dataset¶
# ensure inputs are floats and output is an integer label
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# define and configure the model
model = LogisticRegression()
# evaluate the model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=456)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1,
error_score='raise')
# report model performance
print('Accuracy (std): %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy (std): 0.943 (0.020)
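A practical note: depending on the scikit-learn version, LogisticRegression may emit convergence warnings on the unscaled features. If that happens, raising the iteration budget is a simple remedy (a workaround suggestion, not something the run above required):
# give the solver more iterations than the default of 100
model = LogisticRegression(max_iter=1000)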
Box-Cox power transform, MinMaxScaler¶
Box-Cox can only be applied to strictly positive data, so all features must be > 0; the MinMaxScaler below shifts every feature into the range [1, 2] before the transform.
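To see why the shift matters: some BCW features (the concavity columns, for example) contain exact zeros, and fitting Box-Cox on non-positive data raises an error. A minimal sketch of the failure mode:
# Box-Cox refuses non-positive inputs, hence the shift into [1, 2] below
import numpy as np
try:
    PowerTransformer(method='box-cox').fit(np.array([[0.0], [1.0], [2.0]]))
except ValueError as err:
    print(err)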
# define the pipeline
scaler = MinMaxScaler(feature_range=(1, 2))
power = PowerTransformer(method='box-cox')
model = LogisticRegression()
pipeline = Pipeline(steps=[('s', scaler),('p', power), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=456)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1,
error_score='raise')
# report pipeline performance
print('Accuracy (std): %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy (std): 0.974 (0.017)
Yeo-Johnson standardized BCW dataset, v1¶
In this variant the features are standardized explicitly with StandardScaler before the transform, so the PowerTransformer is created with standardize=False.
scaler = StandardScaler()
power = PowerTransformer(method='yeo-johnson', standardize=False)
model = LogisticRegression()
pipeline = Pipeline(steps=[('s', scaler), ('p', power), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=456)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1,
error_score='raise')
# report pipeline performance
print('Accuracy (std): %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy (std): 0.974 (0.017)
Yeo-Johnson standardized BCW dataset, v2¶
In this variant the PowerTransformer standardizes its own output (standardize=True), so no separate scaler is needed.
power = PowerTransformer(method='yeo-johnson', standardize=True)
model = LogisticRegression()
pipeline = Pipeline(steps=[('p', power), ('m', model)])
# evaluate the pipeline
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=456)
n_scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1,
error_score='raise')
# report pipeline performance
print('Accuracy (std): %.3f (%.3f)' % (mean(n_scores), std(n_scores)))
Accuracy (std): 0.972 (0.019)
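For completeness, the three configurations can be reproduced side by side in one compact loop; a sketch reusing the imports above (the variants dictionary name is mine):
# evaluate raw, Box-Cox and Yeo-Johnson preprocessing with the same CV setup
variants = {
    'raw': Pipeline(steps=[('m', LogisticRegression())]),
    'box-cox': Pipeline(steps=[('s', MinMaxScaler(feature_range=(1, 2))),
                               ('p', PowerTransformer(method='box-cox')),
                               ('m', LogisticRegression())]),
    'yeo-johnson': Pipeline(steps=[('p', PowerTransformer(method='yeo-johnson')),
                                   ('m', LogisticRegression())]),
}
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=456)
for name, pipe in variants.items():
    scores = cross_val_score(pipe, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    print('%s: %.3f (%.3f)' % (name, mean(scores), std(scores)))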
Conclusion¶
Making the numerical values more normal distribution-like improves the cross-validated accuracy from 0.943 on the raw dataset to 0.974 on the power-transformed dataset.
For the BCW dataset with Logistic Regression as the classifier, the Box-Cox and Yeo-Johnson power transforms gave virtually identical results.
SPJ