Classification I: training & predicting

Overview

In previous chapters, we focused solely on descriptive and exploratory data analysis questions. This chapter and the next together serve as our first foray into answering predictive questions about data. In particular, we will focus on classification, i.e., using one or more variables to predict the value of a categorical variable of interest. This chapter will cover the basics of classification, how to preprocess data to make it suitable for use in a classifier, and how to use our observed data to make predictions. The next chapter will focus on how to evaluate how accurate the predictions from our classifier are, as well as how to improve our classifier (where possible) to maximize its accuracy.

Chapter learning objectives

By the end of the chapter, readers will be able to do the following:

  • Recognize situations where a classifier would be appropriate for making predictions.

  • Describe what a training data set is and how it is used in classification.

  • Interpret the output of a classifier.

  • Compute, by hand, the straight-line (Euclidean) distance between points on a graph when there are two predictor variables.

  • Explain the \(K\)-nearest neighbor classification algorithm.

  • Perform \(K\)-nearest neighbor classification in Python using scikit-learn.

  • Use StandardScaler to preprocess data to be centered and scaled, and use resample to balance classes.

  • Combine preprocessing and model training using make_pipeline.

The classification problem

In many situations, we want to make predictions\index{predictive question} based on the current situation as well as past experiences. For instance, a doctor may want to diagnose a patient as either diseased or healthy based on their symptoms and the doctor’s past experience with patients; an email provider might want to tag a given email as “spam” or “not spam” based on the email’s text and past email text data; or a credit card company may want to predict whether a purchase is fraudulent based on the current purchase item, amount, and location as well as past purchases. These tasks are all examples of \index{classification} classification, i.e., predicting a categorical class (sometimes called a label) \index{class}\index{categorical variable} for an observation given its other variables (sometimes called features). \index{feature|see{predictor}}

Generally, a classifier assigns an observation without a known class (e.g., a new patient) to a class (e.g., diseased or healthy) on the basis of how similar it is to other observations for which we do know the class (e.g., previous patients with known diseases and symptoms). These observations with known classes that we use as a basis for prediction are called a training set; \index{training set} this name comes from the fact that we use these data to train, or teach, our classifier. Once taught, we can use the classifier to make predictions on new data for which we do not know the class.

There are many possible methods that we could use to predict a categorical class/label for an observation. In this book, we will focus on the widely used \(K\)-nearest neighbors \index{K-nearest neighbors} algorithm [Cover and Hart, 1967, Fix and Hodges, 1951]. In your future studies, you might encounter decision trees, support vector machines (SVMs), logistic regression, neural networks, and more; see the additional resources section at the end of the next chapter for where to begin learning more about these other methods. It is also worth mentioning that there are many variations on the basic classification problem. For example, we focus on the setting of binary classification \index{classification!binary} where only two classes are involved (e.g., a diagnosis of either healthy or diseased), but you may also run into multiclass classification problems with more than two categories (e.g., a diagnosis of healthy, bronchitis, pneumonia, or a common cold).

Exploring a data set

In this chapter and the next, we will study a data set of digitized breast cancer image features, created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian [Street et al., 1993]. \index{breast cancer} Each row in the data set represents an image of a tumor sample, including the diagnosis (benign or malignant) and several other measurements (nucleus texture, perimeter, area, and more). Diagnosis for each image was conducted by physicians.

As with all data analyses, we first need to formulate a precise question that we want to answer. Here, the question is predictive: \index{question!classification} can we use the tumor image measurements available to us to predict whether a future tumor image (with unknown diagnosis) shows a benign or malignant tumor? Answering this question is important because traditional, non-data-driven methods for tumor diagnosis are quite subjective and dependent upon how skilled and experienced the diagnosing physician is. Furthermore, benign tumors are not normally dangerous; the cells stay in the same place, and the tumor stops growing before it gets very large. By contrast, in malignant tumors, the cells invade the surrounding tissue and spread into nearby organs, where they can cause serious damage [Street et al., 1993]. Thus, it is important to quickly and accurately diagnose the tumor type to guide patient treatment.

Loading the cancer data

Our first step is to load, wrangle, and explore the data using visualizations in order to better understand the data we are working with. We start by loading the pandas package needed for our analysis.

import pandas as pd

In this case, the file containing the breast cancer data set is a .csv file with headers. We’ll use the read_csv function with no additional arguments, and then inspect its contents:

\index{read function!read_csv}

cancer = pd.read_csv("data/wdbc.csv")
cancer
ID Class Radius Texture Perimeter Area Smoothness Compactness Concavity Concave_Points Symmetry Fractal_Dimension
0 842302 M 1.096100 -2.071512 1.268817 0.983510 1.567087 3.280628 2.650542 2.530249 2.215566 2.253764
1 842517 M 1.828212 -0.353322 1.684473 1.907030 -0.826235 -0.486643 -0.023825 0.547662 0.001391 -0.867889
2 84300903 M 1.578499 0.455786 1.565126 1.557513 0.941382 1.052000 1.362280 2.035440 0.938859 -0.397658
3 84348301 M -0.768233 0.253509 -0.592166 -0.763792 3.280667 3.399917 1.914213 1.450431 2.864862 4.906602
4 84358402 M 1.748758 -1.150804 1.775011 1.824624 0.280125 0.538866 1.369806 1.427237 -0.009552 -0.561956
... ... ... ... ... ... ... ... ... ... ... ... ...
564 926424 M 2.109139 0.720838 2.058974 2.341795 1.040926 0.218868 1.945573 2.318924 -0.312314 -0.930209
565 926682 M 1.703356 2.083301 1.614511 1.722326 0.102368 -0.017817 0.692434 1.262558 -0.217473 -1.057681
566 926954 M 0.701667 2.043775 0.672084 0.577445 -0.839745 -0.038646 0.046547 0.105684 -0.808406 -0.894800
567 927241 M 1.836725 2.334403 1.980781 1.733693 1.524426 3.269267 3.294046 2.656528 2.135315 1.042778
568 92751 B -1.806811 1.220718 -1.812793 -1.346604 -3.109349 -1.149741 -1.113893 -1.260710 -0.819349 -0.560539

569 rows × 12 columns

Describing the variables in the cancer data set

Breast tumors can be diagnosed by performing a biopsy, a process where tissue is removed from the body and examined for the presence of disease. Traditionally these procedures were quite invasive; modern methods such as fine needle aspiration, used to collect the present data set, extract only a small amount of tissue and are less invasive. Based on a digital image of each breast tissue sample collected for this data set, ten different variables were measured for each cell nucleus in the image (items 3–12 of the list of variables below), and then the mean for each variable across the nuclei was recorded. As part of the data preparation, these values have been standardized (centered and scaled); we will discuss what this means and why we do it later in this chapter. Each image additionally was given a unique ID and a diagnosis by a physician. Therefore, the total set of variables per image in this data set is:

  1. ID: identification number

  2. Class: the diagnosis (M = malignant or B = benign)

  3. Radius: the mean of distances from center to points on the perimeter

  4. Texture: the standard deviation of gray-scale values

  5. Perimeter: the length of the surrounding contour

  6. Area: the area inside the contour

  7. Smoothness: the local variation in radius lengths

  8. Compactness: the ratio of squared perimeter and area

  9. Concavity: severity of concave portions of the contour

  10. Concave Points: the number of concave portions of the contour

  11. Symmetry: how similar the nucleus is when mirrored

  12. Fractal Dimension: a measurement of how “rough” the perimeter is

Below we use the .info() method \index{info} to preview the data frame. This method can make it easier to inspect the data when we have a lot of columns: it prints a summary with one line per column (the column name, the number of non-null entries, and the type), so the columns go down the page instead of across.

cancer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ID                 569 non-null    int64  
 1   Class              569 non-null    object 
 2   Radius             569 non-null    float64
 3   Texture            569 non-null    float64
 4   Perimeter          569 non-null    float64
 5   Area               569 non-null    float64
 6   Smoothness         569 non-null    float64
 7   Compactness        569 non-null    float64
 8   Concavity          569 non-null    float64
 9   Concave_Points     569 non-null    float64
 10  Symmetry           569 non-null    float64
 11  Fractal_Dimension  569 non-null    float64
dtypes: float64(10), int64(1), object(1)
memory usage: 53.5+ KB

From the summary of the data above, we can see that Class is of type object.

We can use the .unique() method on the Class column to see which distinct values it contains. Given that we only have two different diagnoses in our Class column (B for benign and M for malignant), we expect to get exactly two values back.

cancer['Class'].unique()
array(['M', 'B'], dtype=object)

Exploring the cancer data

Before we start doing any modeling, let’s explore our data set. Below we use the .groupby() and .count() \index{groupby}\index{count} methods to find the number and percentage of benign and malignant tumor observations in our data set. When paired with .groupby(), .count() counts the number of observations in each Class group. Then we calculate the percentage in each group by dividing by the total number of observations. We have 357 (63%) benign and 212 (37%) malignant tumor observations.

num_obs = len(cancer)
explore_cancer = pd.DataFrame()
explore_cancer['count'] = cancer.groupby('Class')['ID'].count()
explore_cancer['percentage'] = explore_cancer['count'] / num_obs * 100
explore_cancer
count percentage
Class
B 357 62.741652
M 212 37.258348

Next, let’s draw a scatter plot \index{visualization!scatter} to visualize the relationship between the perimeter and concavity variables. Rather than use altair's default palette, we select our own colorblind-friendly colors—"#efb13f" for light orange and "#86bfef" for light blue—and pass them to the scale argument of the color encoding. We also make the category labels (“B” and “M”) more readable by changing them to “Benign” and “Malignant” using the .apply() method on the Class column.

import altair as alt

colors = ["#86bfef", "#efb13f"]
cancer["Class"] = cancer["Class"].apply(
    lambda x: "Malignant" if (x == "M") else "Benign"
)
perim_concav = (
    alt.Chart(cancer)
    .mark_point(opacity=0.6, filled=True, size=40)
    .encode(
        x=alt.X("Perimeter", title="Perimeter (standardized)"),
        y=alt.Y("Concavity", title="Concavity (standardized)"),
        color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"),
    )
)
perim_concav

Fig. 1 Scatter plot of concavity versus perimeter colored by diagnosis label.

In Fig. 1, we can see that malignant observations typically fall in the upper right-hand corner of the plot area. By contrast, benign observations typically fall in the lower left-hand corner of the plot. In other words, benign observations tend to have lower concavity and perimeter values, and malignant ones tend to have larger values. Suppose we obtain a new observation not in the current data set that has all the variables measured except the label (i.e., an image without the physician’s diagnosis for the tumor class). We could compute the standardized perimeter and concavity values, resulting in values of, say, 1 and 1. Could we use this information to classify that observation as benign or malignant? Based on the scatter plot, how might you classify that new observation? If the standardized concavity and perimeter values are 1 and 1 respectively, the point would lie in the middle of the orange cloud of malignant points and thus we could probably classify it as malignant. Based on our visualization, it seems like the prediction of an unobserved label might be possible.

Classification with \(K\)-nearest neighbors

In order to actually make predictions for new observations in practice, we will need a classification algorithm. In this book, we will use the \(K\)-nearest neighbors \index{K-nearest neighbors!classification} classification algorithm. To predict the label of a new observation (here, classify it as either benign or malignant), the \(K\)-nearest neighbors classifier generally finds the \(K\) “nearest” or “most similar” observations in our training set, and then uses their diagnoses to make a prediction for the new observation’s diagnosis. \(K\) is a number that we must choose in advance; for now, we will assume that someone has chosen \(K\) for us. We will cover how to choose \(K\) ourselves in the next chapter.

To illustrate the concept of \(K\)-nearest neighbors classification, we will walk through an example. Suppose we have a new observation, with standardized perimeter of 2 and standardized concavity of 4, whose diagnosis “Class” is unknown. This new observation is depicted by the red, diamond point in Fig. 2.

Fig. 2 Scatter plot of concavity versus perimeter with new observation represented as a red diamond.

Fig. 3 shows that the nearest point to this new observation is malignant and located at the coordinates (2.1, 3.6). The idea here is that if a point is close to another in the scatter plot, then the perimeter and concavity values are similar, and so we may expect that they would have the same diagnosis.

Fig. 3 Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a malignant label.

Suppose we have another new observation with standardized perimeter 0.2 and concavity of 3.3. Looking at the scatter plot in Fig. 4, how would you classify this red, diamond observation? The nearest neighbor to this new point is a benign observation at (0.2, 2.7). Does this seem like the right prediction to make for this observation? Probably not, if you consider the other nearby points.

Fig. 4 Scatter plot of concavity versus perimeter. The new observation is represented as a red diamond with a line to the one nearest neighbor, which has a benign label.

To improve the prediction we can consider several neighboring points, say \(K = 3\), that are closest to the new observation to predict its diagnosis class. Among those 3 closest points, we use the majority class as our prediction for the new observation. As shown in Fig. 5, we see that the diagnoses of 2 of the 3 nearest neighbors to our new observation are malignant. Therefore we take majority vote and classify our new red, diamond observation as malignant.

Fig. 5 Scatter plot of concavity versus perimeter with three nearest neighbors.

Here we chose the \(K=3\) nearest observations, but there is nothing special about \(K=3\). We could have used \(K=4, 5\) or more (though we may want to choose an odd number to avoid ties). We will discuss more about choosing \(K\) in the next chapter.

Distance between points

We decide which points are the \(K\) “nearest” to our new observation using the straight-line distance (we will often just refer to this as distance). \index{distance!K-nearest neighbors}\index{straight line!distance} Suppose we have two observations \(a\) and \(b\), each having two predictor variables, \(x\) and \(y\). Denote \(a_x\) and \(a_y\) to be the values of variables \(x\) and \(y\) for observation \(a\); \(b_x\) and \(b_y\) have similar definitions for observation \(b\). Then the straight-line distance between observation \(a\) and \(b\) on the x-y plane can be computed using the following formula:

\[\mathrm{Distance} = \sqrt{(a_x -b_x)^2 + (a_y - b_y)^2}\]
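For example, here is a minimal sketch (using made-up coordinates, not data from the cancer set) of computing this straight-line distance in Python:

import numpy as np

# two hypothetical observations, each with predictor values x and y
a_x, a_y = 1.0, 2.0
b_x, b_y = 4.0, 6.0

# straight-line (Euclidean) distance between a and b
np.sqrt((a_x - b_x) ** 2 + (a_y - b_y) ** 2)  # equals 5.0 here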

To find the \(K\) nearest neighbors to our new observation, we compute the distance from that new observation to each observation in our training data, and select the \(K\) observations corresponding to the \(K\) smallest distance values. For example, suppose we want to use \(K=5\) neighbors to classify a new observation with perimeter of 0 and concavity of 3.5, shown as a red diamond in Fig. 6. Let’s calculate the distances between our new point and each of the observations in the training set to find the \(K=5\) neighbors that are nearest to our new point. In the code below, we compute the straight-line distance using the formula above: we square the differences between the two observations’ perimeter and concavity coordinates, add the squared differences, and then take the square root.

Fig. 6 Scatter plot of concavity versus perimeter with new observation represented as a red diamond.

import numpy as np

new_obs_Perimeter = 0
new_obs_Concavity = 3.5
cancer_dist = cancer.loc[:, ["ID", "Perimeter", "Concavity", "Class"]]
cancer_dist["dist_from_new"] = np.sqrt(
    (cancer_dist["Perimeter"] - new_obs_Perimeter) ** 2
    + (cancer_dist["Concavity"] - new_obs_Concavity) ** 2
)
# sort the rows in ascending order and take the first 5 rows
cancer_dist = cancer_dist.sort_values(by="dist_from_new").head(5)
cancer_dist
ID Perimeter Concavity Class dist_from_new
112 86409 0.241202 2.653051 Benign 0.880626
258 887181 0.750277 2.870061 Malignant 0.979663
351 899667 0.622700 2.541410 Malignant 1.143088
430 907914 0.416930 2.314364 Malignant 1.256806
152 8710441 -1.160091 4.039155 Benign 1.279258

In Table 1 we show in mathematical detail how we computed the dist_from_new variable (the distance to the new observation) for each of the 5 nearest neighbors in the training data.

Table 1 Evaluating the distances from the new observation to each of its 5 nearest neighbors

Perimeter   Concavity   Distance                                       Class
0.24        2.65        \(\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88\)        B
0.75        2.87        \(\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98\)        M
0.62        2.54        \(\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14\)        M
0.42        2.31        \(\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26\)        M
-1.16       4.04        \(\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28\)     B

The result of this computation shows that 3 of the 5 nearest neighbors to our new observation are malignant (M); since this is the majority, we classify our new observation as malignant. These 5 neighbors are circled in Fig. 7.
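If you prefer to let pandas do the counting, a small sketch using the cancer_dist data frame from above tallies the neighbor classes and picks the most common one:

# tally the classes of the 5 nearest neighbors
neighbor_votes = cancer_dist["Class"].value_counts()

# the most common class among the neighbors is the prediction
neighbor_votes.idxmax()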

Fig. 7 Scatter plot of concavity versus perimeter with 5 nearest neighbors circled.

More than two explanatory variables

Although the above description is directed toward two predictor variables, exactly the same \(K\)-nearest neighbors algorithm applies when you have a higher number of predictor variables. Each predictor variable may give us new information to help create our classifier. The only difference is the formula for the distance between points. Suppose we have \(m\) predictor variables for two observations \(a\) and \(b\), i.e., \(a = (a_{1}, a_{2}, \dots, a_{m})\) and \(b = (b_{1}, b_{2}, \dots, b_{m})\).

The distance formula becomes \index{distance!more than two variables}

\[\mathrm{Distance} = \sqrt{(a_{1} -b_{1})^2 + (a_{2} - b_{2})^2 + \dots + (a_{m} - b_{m})^2}.\]

This formula still corresponds to a straight-line distance, just in a space with more dimensions. Suppose we want to calculate the distance between a new observation with a perimeter of 0, concavity of 3.5, and symmetry of 1, and another observation with a perimeter, concavity, and symmetry of 0.417, 2.31, and 0.837 respectively. We have two observations with three predictor variables: perimeter, concavity, and symmetry. Previously, when we had two variables, we added up the squared difference between each of our (two) variables, and then took the square root. Now we will do the same, except for our three variables. We calculate the distance as follows

\[\mathrm{Distance} =\sqrt{(0 - 0.417)^2 + (3.5 - 2.31)^2 + (1 - 0.837)^2} = 1.27.\]
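The same arithmetic can be checked in code; this is just a quick sketch of the calculation above, not part of the main analysis:

import numpy as np

# new observation: perimeter 0, concavity 3.5, symmetry 1
new_point = np.array([0, 3.5, 1])

# existing observation: perimeter 0.417, concavity 2.31, symmetry 0.837
other_point = np.array([0.417, 2.31, 0.837])

# straight-line distance in three dimensions
np.sqrt(np.sum((new_point - other_point) ** 2))  # roughly 1.27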

Let’s calculate the distances between our new observation and each of the observations in the training set to find the \(K=5\) neighbors when we have these three predictors.

new_obs_Perimeter = 0
new_obs_Concavity = 3.5
new_obs_Symmetry = 1
cancer_dist2 = cancer.loc[:, ["ID", "Perimeter", "Concavity", "Symmetry", "Class"]]
cancer_dist2["dist_from_new"] = np.sqrt(
    (cancer_dist2["Perimeter"] - new_obs_Perimeter) ** 2
    + (cancer_dist2["Concavity"] - new_obs_Concavity) ** 2
    + (cancer_dist2["Symmetry"] - new_obs_Symmetry) ** 2
)
# sort the rows in ascending order and take the first 5 rows
cancer_dist2 = cancer_dist2.sort_values(by="dist_from_new").head(5)
cancer_dist2
ID Perimeter Concavity Symmetry Class dist_from_new
430 907914 0.416930 2.314364 0.836722 Malignant 1.267368
400 90439701 1.334664 2.886368 1.099359 Malignant 1.472326
562 925622 0.470430 2.084810 1.154075 Malignant 1.499268
68 859471 -1.365450 2.812359 1.092064 Benign 1.531594
351 899667 0.622700 2.541410 2.055065 Malignant 1.555575

Based on \(K=5\) nearest neighbors with these three predictors, we would classify the new observation as malignant since 4 out of 5 of its nearest neighbors are malignant. Fig. 8 shows what the data look like when we visualize them as a 3-dimensional scatter with lines from the new observation to its five nearest neighbors.


Fig. 8 3D scatter plot of the standardized symmetry, concavity, and perimeter variables. Note that in general we recommend against using 3D visualizations; here we show the data in 3D only to illustrate what higher dimensions and nearest neighbors look like, for learning purposes.

Summary of \(K\)-nearest neighbors algorithm

In order to classify a new observation using a \(K\)-nearest neighbor classifier, we have to do the following:

  1. Compute the distance between the new observation and each observation in the training set.

  2. Sort the data table in ascending order according to the distances.

  3. Choose the top \(K\) rows of the sorted table.

  4. Classify the new observation based on a majority vote of the neighbor classes.
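To make these steps concrete, here is a minimal sketch of the algorithm written with only pandas and numpy. It assumes a training data frame with a Class column and numeric predictor columns, such as the cancer data from earlier in the chapter, and is meant purely for illustration; in practice we will use scikit-learn, as shown in the next section.

import numpy as np
import pandas as pd

def knn_predict(train, new_obs, predictors, k):
    # 1. compute the distance from the new observation to each training observation
    distances = np.sqrt(((train[predictors] - new_obs[predictors]) ** 2).sum(axis=1))
    # 2. & 3. sort by distance and keep the K closest rows
    nearest = train.assign(dist_from_new=distances).nsmallest(k, "dist_from_new")
    # 4. majority vote among the neighbor classes
    return nearest["Class"].value_counts().idxmax()

# for example, using the cancer data and the new observation from above
new_point = pd.Series({"Perimeter": 0, "Concavity": 3.5})
knn_predict(cancer, new_point, ["Perimeter", "Concavity"], k=5)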

\(K\)-nearest neighbors with scikit-learn

Coding the \(K\)-nearest neighbors algorithm in Python ourselves can get complicated, especially if we want to handle multiple classes, more than two variables, or predict the class for multiple new observations. Thankfully, in Python, the \(K\)-nearest neighbors algorithm is implemented in the scikit-learn package along with many other models that you will encounter in this and future chapters of the book. Using the functions in the scikit-learn package will help keep our code simple, readable and accurate; the less we have to code ourselves, the fewer mistakes we will likely make. We start by importing KNeighborsClassifier from the sklearn.neighbors module.

from sklearn.neighbors import KNeighborsClassifier

Let’s walk through how to use KNeighborsClassifier to perform \(K\)-nearest neighbors classification. We will use the cancer data set from above, with perimeter and concavity as predictors and \(K = 5\) neighbors to build our classifier. Then we will use the classifier to predict the diagnosis label for a new observation with perimeter 0, concavity 3.5, and an unknown diagnosis label. Let’s pick out our two desired predictor variables and class label and store them as a new data set named cancer_train:

cancer_train = cancer.loc[:, ['Class', 'Perimeter', 'Concavity']]
cancer_train
Class Perimeter Concavity
0 Malignant 1.268817 2.650542
1 Malignant 1.684473 -0.023825
2 Malignant 1.565126 1.362280
3 Malignant -0.592166 1.914213
4 Malignant 1.775011 1.369806
... ... ... ...
564 Malignant 2.058974 1.945573
565 Malignant 1.614511 0.692434
566 Malignant 0.672084 0.046547
567 Malignant 1.980781 3.294046
568 Benign -1.812793 -1.113893

569 rows × 3 columns

Next, we create a model specification for \(K\)-nearest neighbors classification by creating a KNeighborsClassifier object, specifying that we want to use \(K = 5\) neighbors (we will discuss how to choose \(K\) in the next chapter). By default, the classifier uses the straight-line distance to decide which neighbors are nearest. The weights argument controls how those neighbors vote when classifying a new observation; the default setting, weights="uniform", gives each of the \(K\) nearest neighbors exactly 1 vote as described above. Other choices, which weigh each neighbor’s vote differently, can be found on the scikit-learn website.

knn_spec = KNeighborsClassifier(n_neighbors=5)
knn_spec
KNeighborsClassifier()
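For example, although we use the default uniform voting here, a hypothetical alternative is distance-based weighting, where closer neighbors have more influence on the vote than farther ones:

# each of the K nearest neighbors gets exactly one vote (the default)
knn_uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform")

# a hypothetical alternative: closer neighbors get a larger say than farther ones
knn_distance = KNeighborsClassifier(n_neighbors=5, weights="distance")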

In order to fit the model on the breast cancer data, we need to call fit on the classifier object and pass it the data. We also need to specify which variables to use as predictors and which variable to use as the target. Below, the X=cancer_train[["Perimeter", "Concavity"]] argument specifies that Perimeter and Concavity are to be used as the predictors, and the y=cancer_train["Class"] argument specifies that Class is the target variable (the one we want to predict).

knn_spec.fit(X=cancer_train[["Perimeter", "Concavity"]], y=cancer_train["Class"]);

Finally, we make the prediction on the new observation by calling predict \index{predict} on the classifier object, passing the new observation itself. Just as when we ran the \(K\)-nearest neighbors classification algorithm manually above, the fitted knn_spec object classifies the new observation as “Malignant”. Note that the predict function outputs a numpy array with the model’s prediction.

new_obs = pd.DataFrame({'Perimeter': [0], 'Concavity': [3.5]})
knn_spec.predict(new_obs)
array(['Malignant'], dtype=object)

Is this predicted malignant label the true class for this observation? Well, we don’t know because we do not have this observation’s diagnosis—that is what we were trying to predict! The classifier’s prediction is not necessarily correct, but in the next chapter, we will learn ways to quantify how accurate we think our predictions are.

Data preprocessing with scikit-learn

Centering and scaling

When using \(K\)-nearest neighbor classification, the scale \index{scaling} of each variable (i.e., its size and range of values) matters. Since the classifier predicts classes by identifying the observations nearest to the new observation, any variables with a large scale will have a much larger effect on the distances than variables with a small scale. But just because a variable has a large scale doesn’t mean that it is more important for making accurate predictions. For example, suppose you have a data set with two features, salary (in dollars) and years of education, and you want to predict the corresponding type of job. When we compute the neighbor distances, a difference of $1000 is huge compared to a difference of 10 years of education. But for our conceptual understanding and answering of the problem, it’s the opposite: 10 years of education is a huge difference compared to a difference of $1000 in yearly salary!

In many other predictive models, the center of each variable (e.g., its mean) matters as well. For example, if we had a data set with a temperature variable measured in degrees Kelvin, and the same data set with temperature measured in degrees Celsius, the two variables would differ by a constant shift of 273 (even though they contain exactly the same information). Likewise, in our hypothetical job classification example, we would likely see that the center of the salary variable is in the tens of thousands, while the center of the years of education variable is in the single digits. Although this doesn’t affect the \(K\)-nearest neighbor classification algorithm, this large shift can change the outcome of using many other predictive models. \index{centering}

To scale and center our data, we need to find our variables’ mean (the average, which quantifies the “central” value of a set of numbers) and standard deviation (a number quantifying how spread out values are). For each observed value of the variable, we subtract the mean (i.e., center the variable) and divide by the standard deviation (i.e., scale the variable). When we do this, the data is said to be standardized, \index{standardization!K-nearest neighbors} and all variables in a data set will have a mean of 0 and a standard deviation of 1. To illustrate the effect that standardization can have on the \(K\)-nearest neighbor algorithm, we will read in the original, unstandardized Wisconsin breast cancer data set; we have been using a standardized version of the data set up until now. To keep things simple, we will just use the Area, Smoothness, and Class variables:

unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv")
unscaled_cancer = unscaled_cancer[['Class', 'Area', 'Smoothness']]
unscaled_cancer
Class Area Smoothness
0 M 1001.0 0.11840
1 M 1326.0 0.08474
2 M 1203.0 0.10960
3 M 386.1 0.14250
4 M 1297.0 0.10030
... ... ... ...
564 M 1479.0 0.11100
565 M 1261.0 0.09780
566 M 858.1 0.08455
567 M 1265.0 0.11780
568 B 181.0 0.05263

569 rows × 3 columns

Looking at the unscaled and uncentered data above, you can see that the differences between the values for area measurements are much larger than those for smoothness. Will this affect predictions? In order to find out, we will create a scatter plot of these two predictors (colored by diagnosis) for both the unstandardized data we just loaded, and the standardized version of that same data. But first, we need to standardize the unscaled_cancer data set with scikit-learn.

In the scikit-learn framework, all data preprocessing and modeling can be built using a Pipeline, or a more convenient function make_pipeline for simple pipeline construction. Here we will initialize a preprocessor using make_column_transformer for the unscaled_cancer data above, specifying that we want to standardize the predictors Area and Smoothness:

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler

preprocessor = make_column_transformer(
    (StandardScaler(), ["Area", "Smoothness"]),
)
preprocessor
ColumnTransformer(transformers=[('standardscaler', StandardScaler(),
                                 ['Area', 'Smoothness'])])

So far, we have built a preprocessor so that each of the predictors has a mean of 0 and a standard deviation of 1.

You can now see that the preprocessor includes a scaling and centering step for the predictor variables. Note that when you add a step to a ColumnTransformer, you must specify which columns to apply the step to. Here we specified that StandardScaler should be applied to the two predictor variables, Area and Smoothness.

At this point, the data are not yet scaled and centered. To actually scale and center the data, we need to call fit and transform on the unscaled data (these two steps can be combined into a single call to fit_transform, as shown in the commented-out line below).

preprocessor.fit(unscaled_cancer)
scaled_cancer = preprocessor.transform(unscaled_cancer)
# scaled_cancer = preprocessor.fit_transform(unscaled_cancer)
scaled_cancer = pd.DataFrame(scaled_cancer, columns=['Area', 'Smoothness'])
scaled_cancer['Class'] = unscaled_cancer['Class']
scaled_cancer
Area Smoothness Class
0 0.984375 1.568466 M
1 1.908708 -0.826962 M
2 1.558884 0.942210 M
3 -0.764464 3.283553 M
4 1.826229 0.280372 M
... ... ... ...
564 2.343856 1.041842 M
565 1.723842 0.102458 M
566 0.577953 -0.840484 M
567 1.735218 1.525767 M
568 -1.347789 -3.112085 B

569 rows × 3 columns
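To see what StandardScaler is doing under the hood, the following sketch reproduces the standardized values manually by subtracting each column’s mean and dividing by its standard deviation (StandardScaler computes the standard deviation with ddof=0, so we match that here). This is only for illustration; as discussed below, doing the preprocessing manually in a real analysis is error-prone.

# center and scale Area and Smoothness by hand;
# the result should match scaled_cancer above
manual_scaled = unscaled_cancer[["Area", "Smoothness"]]
(manual_scaled - manual_scaled.mean()) / manual_scaled.std(ddof=0)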

It may seem redundant that we had to both fit and transform to scale and center the data. However, we do this in two steps so we can specify a different data set in the transform step if we want. For example, we may want to specify new data that were not part of the training set.
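For instance, once the preprocessor has been fit on unscaled_cancer, we could apply it to brand-new measurements. The sketch below uses made-up values; the placeholder Class entry is included only so that the new data frame has the same columns the preprocessor saw during fit (the preprocessor drops that column anyway).

# a hypothetical new observation with unstandardized measurements
new_unscaled = pd.DataFrame(
    {"Class": ["unknown"], "Area": [500.0], "Smoothness": [0.075]}
)

# standardize it using the means and standard deviations learned
# from unscaled_cancer, not from the new data itself
preprocessor.transform(new_unscaled)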

You may wonder why we are doing so much work just to center and scale our variables. Can’t we just manually scale and center the Area and Smoothness variables ourselves before building our \(K\)-nearest neighbor model? Well, technically yes; but doing so is error-prone. In particular, we might accidentally forget to apply the same centering / scaling when making predictions, or accidentally apply a different centering / scaling than what we used while training. Proper use of a ColumnTransformer helps keep our code simple, readable, and error-free. Furthermore, note that using fit and transform on the preprocessor is required only when you want to inspect the result of the preprocessing steps yourself. You will see further on, in the section on putting it together in a pipeline, that scikit-learn provides tools to automatically streamline the preprocessor and the model so that you can call fit and predict on the Pipeline as necessary without additional coding effort.

Fig. 9 shows the two scatter plots side-by-side—one for unscaled_cancer and one for scaled_cancer. Each has the same new observation annotated with its \(K=3\) nearest neighbors. In the original unstandardized data plot, you can see some odd choices for the three nearest neighbors. In particular, the “neighbors” are visually well within the cloud of benign observations, and the neighbors are all nearly vertically aligned with the new observation (which is why it looks like there is only one black line on this plot). Fig. 10 shows a close-up of that region on the unstandardized plot. Here the computation of nearest neighbors is dominated by the much larger-scale area variable. The plot for standardized data on the right in Fig. 9 shows a much more intuitively reasonable selection of nearest neighbors. Thus, standardizing the data can change things in an important way when we are using predictive algorithms. Standardizing your data should be a part of the preprocessing you do before predictive modeling and you should always think carefully about your problem domain and whether you need to standardize your data.

Fig. 9 Comparison of K = 3 nearest neighbors with standardized and unstandardized data.

Fig. 10 Close-up of three nearest neighbors for unstandardized data.
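If you would like to explore this numerically rather than visually, one possible sketch is to ask scikit-learn’s NearestNeighbors utility (not otherwise used in this chapter) for the three closest training points to a made-up new observation, once using the raw measurements and once using the standardized ones, and compare which training rows get picked in each case.

from sklearn.neighbors import NearestNeighbors

# a hypothetical new observation on the original (unstandardized) scale
new_point = pd.DataFrame({"Area": [400.0], "Smoothness": [0.135]})

# 3 nearest neighbors using the raw measurements
nn_unscaled = NearestNeighbors(n_neighbors=3).fit(
    unscaled_cancer[["Area", "Smoothness"]]
)
unscaled_idx = nn_unscaled.kneighbors(new_point, return_distance=False)[0]

# standardize the new observation with the training means / standard deviations
new_point_scaled = (
    new_point - unscaled_cancer[["Area", "Smoothness"]].mean()
) / unscaled_cancer[["Area", "Smoothness"]].std(ddof=0)

# 3 nearest neighbors using the standardized measurements
nn_scaled = NearestNeighbors(n_neighbors=3).fit(scaled_cancer[["Area", "Smoothness"]])
scaled_idx = nn_scaled.kneighbors(new_point_scaled, return_distance=False)[0]

# compare which training observations are chosen in each case
unscaled_idx, scaled_idx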

Balancing

Another potential issue in a data set for a classifier is class imbalance, \index{balance}\index{imbalance} i.e., when one label is much more common than another. Since classifiers like the \(K\)-nearest neighbor algorithm use the labels of nearby points to predict the label of a new point, if there are many more data points with one label overall, the algorithm is more likely to pick that label in general (even if the “pattern” of data suggests otherwise). Class imbalance is actually quite a common and important problem: from rare disease diagnosis to malicious email detection, there are many cases in which the “important” class to identify (presence of disease, malicious email) is much rarer than the “unimportant” class (no disease, normal email).

To better illustrate the problem, let’s revisit the scaled breast cancer data, cancer; except now we will remove many of the observations of malignant tumors, simulating what the data would look like if the cancer were rare. We will do this by picking only 3 observations from the malignant group, and keeping all of the benign observations. We choose these 3 observations using the .head() method, which selects the given number of rows (n) from the top of a data frame. The new imbalanced data is shown in Fig. 11.

cancer = pd.read_csv("data/wdbc.csv")
rare_cancer = pd.concat(
    (cancer.query("Class == 'B'"), cancer.query("Class == 'M'").head(3))
)
colors = ["#86bfef", "#efb13f"]
rare_cancer["Class"] = rare_cancer["Class"].apply(
    lambda x: "Malignant" if (x == "M") else "Benign"
)
rare_plot = (
    alt.Chart(
        rare_cancer
    )
    .mark_point(opacity=0.6, filled=True, size=40)
    .encode(
        x=alt.X("Perimeter", title="Perimeter (standardized)"),
        y=alt.Y("Concavity", title="Concavity (standardized)"),
        color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"),
    )
)
rare_plot

Fig. 11 Imbalanced data.

Suppose we now decided to use \(K = 7\) in \(K\)-nearest neighbor classification. With only 3 observations of malignant tumors, the classifier will always predict that the tumor is benign, no matter what its concavity and perimeter are! This is because in a majority vote of 7 observations, at most 3 will be malignant (we only have 3 total malignant observations), so at least 4 must be benign, and the benign vote will always win. For example, Fig. 12 shows what happens for a new tumor observation that is quite close to three observations in the training data that were tagged as malignant.

Fig. 12 Imbalanced data with 7 nearest neighbors to a new observation highlighted.
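To see this in code, we can fit a \(K=7\) classifier on the imbalanced rare_cancer data and predict a hypothetical new observation placed right beside the malignant points (the coordinates below are made up for illustration):

# fit K = 7 nearest neighbors on the imbalanced data
knn_rare = KNeighborsClassifier(n_neighbors=7)
knn_rare.fit(X=rare_cancer[["Perimeter", "Concavity"]], y=rare_cancer["Class"])

# a hypothetical new observation sitting near the malignant points;
# with only 3 malignant observations in total, at least 4 of the
# 7 nearest neighbors must be benign, so the vote is always benign
knn_rare.predict(pd.DataFrame({"Perimeter": [2], "Concavity": [2]}))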

Fig. 13 shows what happens if we set the background color of each area of the plot to the predictions the \(K\)-nearest neighbor classifier would make. We can see that the decision is always “benign,” corresponding to the blue color.

Fig. 13 Imbalanced data with background color indicating the decision of the classifier and the points represent the labeled data.

Despite the simplicity of the problem, solving it in a statistically sound manner is actually fairly nuanced, and a careful treatment would require a lot more detail and mathematics than we will cover in this textbook. For the present purposes, it will suffice to rebalance the data by oversampling the rare class. \index{oversampling} In other words, we will replicate rare observations multiple times in our data set to give them more voting power in the \(K\)-nearest neighbor algorithm. In order to do this, we will use the resample function from the sklearn.utils module of the scikit-learn package. \index{resample} We show below how to do this, and also use the .value_counts(), .groupby(), and .count() methods to see the class counts before and after balancing:

rare_cancer['Class'].value_counts()
Benign       357
Malignant      3
Name: Class, dtype: int64

from sklearn.utils import resample

malignant_cancer = rare_cancer[rare_cancer["Class"] == "Malignant"]
benign_cancer = rare_cancer[rare_cancer["Class"] == "Benign"]
malignant_cancer_upsample = resample(
    malignant_cancer, n_samples=len(benign_cancer), random_state=100
)
upsampled_cancer = pd.concat((malignant_cancer_upsample, benign_cancer))
upsampled_cancer.groupby(by='Class')['Class'].count()
Class
Benign       357
Malignant    357
Name: Class, dtype: int64

Now suppose we train our \(K\)-nearest neighbor classifier with \(K=7\) on this balanced data. Fig. 14 shows what happens now when we set the background color of each area of our scatter plot to the decision the \(K\)-nearest neighbor classifier would make. We can see that the decision is more reasonable; when the points are close to those labeled malignant, the classifier predicts a malignant tumor, and vice versa when they are closer to the benign tumor observations.

Fig. 14 Upsampled data with background color indicating the decision of the classifier.
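The chapter does not show the code for this retrained classifier; a minimal sketch of fitting it on the balanced data and predicting a hypothetical new observation is below (the prediction-map visualization itself is more involved; a similar example appears at the end of the chapter).

# fit K = 7 nearest neighbors on the upsampled (balanced) data
knn_balanced = KNeighborsClassifier(n_neighbors=7)
knn_balanced.fit(
    X=upsampled_cancer[["Perimeter", "Concavity"]], y=upsampled_cancer["Class"]
)

# predict a hypothetical new observation near the malignant points
knn_balanced.predict(pd.DataFrame({"Perimeter": [2], "Concavity": [2]}))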

Putting it together in a pipeline

The scikit-learn package also provides the Pipeline, a way to chain together multiple data analysis steps without a lot of otherwise necessary code for intermediate steps. To illustrate the whole pipeline, let’s start from scratch with the unscaled_wdbc.csv data. First we will load the data, create a model, and specify a preprocessor for how the data should be preprocessed:

# load the unscaled cancer data
unscaled_cancer = pd.read_csv("data/unscaled_wdbc.csv")

# create the KNN model
knn_spec = KNeighborsClassifier(n_neighbors=7)

# create the centering / scaling preprocessor
preprocessor = make_column_transformer(
    (StandardScaler(), ["Area", "Smoothness"]),
)

You will also notice that we did not call .fit() on the preprocessor; this is unnecessary when it is placed in a Pipeline.

We will now place these steps in a Pipeline using the make_pipeline function, and finally we will call .fit() to run the whole Pipeline on the unscaled_cancer data.

from sklearn.pipeline import make_pipeline

knn_fit = make_pipeline(preprocessor, knn_spec).fit(
    X=unscaled_cancer.loc[:, ["Area", "Smoothness"]], y=unscaled_cancer["Class"]
)

knn_fit
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['Area', 'Smoothness'])])),
                ('kneighborsclassifier', KNeighborsClassifier(n_neighbors=7))])

As before, the fit object lists the function that trains the model. But now the fit object also includes information about the overall workflow, including the standardizing preprocessing step. In other words, when we use the predict function with the knn_fit object to make a prediction for a new observation, it will first apply the same preprocessing steps to the new observation. As an example, we will predict the class label of two new observations: one with Area = 500 and Smoothness = 0.075, and one with Area = 1500 and Smoothness = 0.1.

new_observation = pd.DataFrame({"Area": [500, 1500], "Smoothness": [0.075, 0.1]})
prediction = knn_fit.predict(new_observation)
prediction
array(['B', 'M'], dtype=object)

The classifier predicts that the first observation is benign (“B”), while the second is malignant (“M”). Fig. 15 visualizes the predictions that this trained \(K\)-nearest neighbor model will make on a large range of new observations. Although you have seen colored prediction map visualizations like this a few times now, we have not included the code to generate them, as it is a little bit complicated. For the interested reader who wants a learning challenge, we now include it below. The basic idea is to create a grid of synthetic new observations using the numpy.meshgrid function, predict the label of each, and visualize the predictions with a colored scatter having a very high transparency (low opacity value) and large point radius. See if you can figure out what each line is doing!

Note: Understanding this code is not required for the remainder of the textbook. It is included for those readers who would like to use similar visualizations in their own data analyses.

# create the grid of area/smoothness vals, and arrange in a data frame
are_grid = np.linspace(
    unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max(), 100
)
smo_grid = np.linspace(
    unscaled_cancer["Smoothness"].min(), unscaled_cancer["Smoothness"].max(), 100
)
asgrid = np.array(np.meshgrid(are_grid, smo_grid)).reshape(2, -1).T
asgrid = pd.DataFrame(asgrid, columns=["Area", "Smoothness"])

# use the fit workflow to make predictions at the grid points
knnPredGrid = knn_fit.predict(asgrid)

# bind the predictions as a new column with the grid points
prediction_table = asgrid.copy()
prediction_table["Class"] = knnPredGrid

# plot:
# 1. the colored scatter of the original data
unscaled_plot = (
    alt.Chart(
        unscaled_cancer,
    )
    .mark_point(opacity=0.6, filled=True, size=40)
    .encode(
        x=alt.X(
            "Area",
            title="Area (standardized)",
            scale=alt.Scale(
                domain=(unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max())
            ),
        ),
        y=alt.Y(
            "Smoothness",
            title="Smoothness (standardized)",
            scale=alt.Scale(
                domain=(
                    unscaled_cancer["Smoothness"].min(),
                    unscaled_cancer["Smoothness"].max(),
                )
            ),
        ),
        color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"),
    )
)

# 2. the faded colored scatter for the grid points
prediction_plot = (
    alt.Chart(prediction_table)
    .mark_point(opacity=0.02, filled=True, size=200)
    .encode(
        x=alt.X("Area"),
        y=alt.Y("Smoothness"),
        color=alt.Color("Class", scale=alt.Scale(range=colors), title="Diagnosis"),
    )
)

unscaled_plot + prediction_plot

Fig. 15 Scatter plot of smoothness versus area where background color indicates the decision of the classifier.

Exercises

Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Classification I: training and predicting” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter @ref(move-to-your-own-machine). This will ensure that the automated feedback and guidance that the worksheets provide will function as intended.

CH67

Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.

FH51

Evelyn Fix and Joseph Hodges. Discriminatory analysis. Nonparametric discrimination: consistency properties. Technical Report, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.

SWM93

William Nick Street, William Wolberg, and Olvi Mangasarian. Nuclear feature extraction for breast tumor diagnosis. In International Symposium on Electronic Imaging: Science and Technology. 1993.