Machine Learning Applied to Mammogram Classification

A real-world Machine Learning project detailed step by step.

Source: ActiveState

This project is part of the online course Machine Learning, Data Science and Deep Learning by Sundog Education.

The objective is to develop a classification algorithm that can predict whether a mammographic mass is benign or malignant with the highest accuracy possible.

The project has been developed in Python, and the code is open source and available on my GitHub.

Methodology 🗺️

For this project, the classification problem is tackled via Supervised Learning. Training data has been collected from the “mammographic masses” public dataset from the UCI repository.

First, a data pre-processing step will be performed to explore the dataset, handle any missing or erroneous values, select relevant features and normalize the input data.

Then, several different classification models (along with multiple hyperparameter configurations) will be applied, including:

  • Logistic Regression
  • K-nearest neighbors (KNN)
  • Naïve Bayes
  • Decision Tree
  • Random Forest
  • Support Vector Machine (SVM)
  • Neural network

Model performance will be measured using K-fold cross validation, an efficient statistical procedure for testing a model’s ability to predict new data while guarding against problems such as overfitting and selection bias.

Finally, results will be compared to identify which model yields the highest accuracy.
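To make the procedure concrete, here is a minimal sketch of how K-fold cross validation partitions a dataset. The function name `k_fold_indices` is hypothetical and not taken from the project code. Real implementations usually shuffle the samples first, which is omitted here to keep the folds easy to read.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        # Each fold is held out once for testing; the rest is used for training.
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every sample is used for testing exactly once and never appears in the training set of its own fold, which is what makes the averaged score an estimate of performance on unseen data.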

Data pre-processing ⚙️

Step 1 — Data exploration

The dataset contains 961 instances of masses detected in mammograms, each described by the following attributes:

  1. BI-RADS assessment (ordinal) — Assessment of how confident the severity classification is, ranked from 1 to 5.
  2. Age (integer) — Patient’s age in years.
  3. Mass shape (nominal) — round=1; oval=2; lobular=3; irregular=4
  4. Mass margin (nominal) — circumscribed=1; micro-lobulated=2; obscured=3; ill-defined=4; spiculated=5
  5. Mass density (ordinal) — high=1; iso=2; low=3; fat-containing=4
  6. Severity (binomial) — benign=0 or malignant=1

Here are some statistics of each feature:

Fig. 1 — Features information

Step 2 — Handling missing values

In Figure 1, we can observe there are quite a few missing values in the dataset (2 for “BI-RADS”, 5 for “age”, 31 for “shape”, 48 for “margin” and 76 for “density”).
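As an illustration of how such counts can be obtained, the sketch below tallies missing values per column using only the standard library. It assumes a comma-separated file without a header row in which "?" marks a missing value. The sample rows and the `count_missing` helper are made up for the example, so the counts differ from those quoted above.

```python
import csv
import io

COLUMNS = ["BI-RADS", "age", "shape", "margin", "density", "severity"]

# Made-up rows in the assumed layout; "?" stands for a missing value.
SAMPLE = """5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,74,1,5,?,1
4,65,1,?,3,0
?,70,?,?,3,0
"""


def count_missing(text, missing_marker="?"):
    """Return a dict mapping each column name to its number of missing values."""
    counts = {name: 0 for name in COLUMNS}
    for row in csv.reader(io.StringIO(text)):
        for name, value in zip(COLUMNS, row):
            if value.strip() == missing_marker:
                counts[name] += 1
    return counts


missing = count_missing(SAMPLE)
print(missing)
```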

Before dropping every row that’s missing data, it is important to make sure we don’t bias our data by doing so.

Let’s look at how missing values are distributed (Fig. 2 shows the distribution of missing values for “age”). If the missing fields appear to be correlated with particular kinds of data, we would have to impute them with a suitable method (e.g. KNN, MICE) rather than drop the rows.

Fig. 2 — “Age” missing values distribution

In our case, missing data seems randomly distributed. We can therefore move on and drop rows containing missing values:

Fig. 3 — Features information (missing values dropped)
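The check and the drop can be sketched as follows. This is a crude stand-in for the visual inspection described above, not the author's method: it compares the share of malignant cases in complete and incomplete rows, then keeps only the complete ones. The rows are made up, `None` marks a missing value, and the helper names are hypothetical.

```python
# Column order: BI-RADS, age, shape, margin, density, severity.
# Made-up rows for illustration; None marks a missing value.
rows = [
    (5, 67, 3, 5, 3, 1),
    (4, 43, 1, 1, None, 1),
    (5, 58, 4, 5, 3, 1),
    (4, 28, 1, 1, 3, 0),
    (4, 65, 1, None, 3, 0),
    (4, 36, 2, 1, 2, 0),
]


def is_complete(row):
    """True when no field of the row is missing."""
    return all(value is not None for value in row)


def malignant_share(subset):
    """Fraction of rows whose severity (last column) is 1."""
    return sum(row[-1] for row in subset) / len(subset)


complete = [row for row in rows if is_complete(row)]
incomplete = [row for row in rows if not is_complete(row)]

# If these two shares differed a lot, dropping rows could bias the data.
print("malignant share, complete rows:  ", malignant_share(complete))
print("malignant share, incomplete rows:", malignant_share(incomplete))

# Dropping rows with missing values simply means keeping the complete ones.
rows = complete
print(len(rows), "rows kept")
```

The same comparison can be repeated for any other feature, such as mean age, before deciding that dropping is safe.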

Step 3 — Feature selection

Now, data must be split into two arrays:

  1. A multi-dimensional input array (X) containing the values of the features relevant for predicting the output. In our case, the relevant features are age, shape, margin and density. The attribute BI-RADS (assessment of how confident the severity classification is) is dropped because it is not a “predictive” attribute.
  2. A 1D array (Y) containing classification data (values of the feature ‘severity’).

Fig. 4 — Input data matrix (X) and classification data matrix (Y)
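A minimal sketch of this split, using plain Python lists and made-up rows. The column names and their order are assumptions for the example. Selecting features by name rather than by position makes it explicit that BI-RADS is left out.

```python
COLUMNS = ["bi_rads", "age", "shape", "margin", "density", "severity"]
FEATURES = ["age", "shape", "margin", "density"]  # BI-RADS deliberately excluded
TARGET = "severity"

# Made-up rows, already free of missing values.
rows = [
    [5, 67, 3, 5, 3, 1],
    [5, 58, 4, 5, 3, 1],
    [4, 28, 1, 1, 3, 0],
]

feature_idx = [COLUMNS.index(name) for name in FEATURES]
target_idx = COLUMNS.index(TARGET)

# X holds one list of feature values per row; Y holds one label per row.
X = [[row[i] for i in feature_idx] for row in rows]
Y = [row[target_idx] for row in rows]

print(X)
print(Y)
```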

Step 4 — Normalization

Finally, some models require input data to be normalized, so let’s go ahead and normalize our matrix X:

Fig. 5 — Normalized input data matrix (X)
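The article does not say which normalization it uses. One common choice is z-score standardization, which rescales each feature to mean 0 and standard deviation 1. The sketch below is written under that assumption, with a hypothetical `standardize` helper and made-up input rows.

```python
from statistics import fmean, pstdev


def standardize(X):
    """Return a copy of X with each column rescaled to mean 0 and std 1."""
    columns = list(zip(*X))
    means = [fmean(col) for col in columns]
    # A constant column has standard deviation 0; divide by 1 instead.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


# Made-up rows: age, shape, margin, density.
X = [[67, 3, 5, 3], [58, 4, 5, 3], [28, 1, 1, 3], [57, 1, 5, 3]]
X_scaled = standardize(X)

for row in X_scaled:
    print([round(v, 3) for v in row])
```

Without this step, a feature with a large range such as age would dominate distance-based models like KNN or SVM over features coded from 1 to 5.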

Models 🧠

At this stage, data has been cleaned and prepared for analysis.

The next step is to build a classification algorithm that will be able to predict the class (benign or malignant) of new inputs with the highest accuracy possible.

The idea is to assess several different classification models and hyperparameter configurations, and compare their accuracy to identify which model yields the best results.

For this project, the models and hyperparameter configurations tested are the following:

  • Logistic Regression
  • K-nearest neighbors (KNN) — values of K ranging from 1 to 50.
  • Naïve Bayes
  • Decision Tree
  • Random Forest — number of estimators (decision trees) ranging from 5 to 20.
  • Support Vector Machine (SVM) — kernel types tested: ‘linear’, ‘poly’, ‘rbf’ and ‘sigmoid’.
  • Neural network — Multi-layer perceptron (MLP) with hyperparameter tuning (network topology, activation functions, optimiser and loss function).

Model accuracy is measured using K-fold cross validation with K=10.
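As a sketch of what such an evaluation looks like for one of the models, the code below scores a small pure-Python KNN classifier with 10-fold cross validation for a few values of K. It is an illustration under stated assumptions, not the project's implementation: the data is synthetic, only three values of K are tried, and the function names are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_X, train_Y, point, k):
    """Majority vote among the k training points closest to `point`."""
    by_distance = sorted(
        zip(train_X, train_Y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def cross_val_accuracy(X, Y, k_neighbors, n_folds=10, seed=0):
    """Mean KNN accuracy over n_folds folds of shuffled data."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_Y = [Y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_Y, X[i], k_neighbors) == Y[i]
            for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Synthetic two-class data standing in for the normalized features.
rng = random.Random(42)
X = [[rng.gauss(c, 1.0), rng.gauss(c, 1.0)] for c in (0, 2) for _ in range(50)]
Y = [label for label in (0, 1) for _ in range(50)]

# Odd values of K avoid tied votes between the two classes.
results = {k: cross_val_accuracy(X, Y, k) for k in (1, 5, 15)}
best_k = max(results, key=results.get)
print(results)
print("best K:", best_k)
```

Using the same seed for every value of K means all configurations are scored on identical folds, so differences in accuracy come from the hyperparameter rather than from the split.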

Results ⭐

Results are displayed in the table below:

Fig. 6 — Model performance results

Overall, except for Decision Tree and Random Forest, the models achieved comparable results, with accuracies of 79–80%.

For more details, I invite you to have a look at my code and comments, which are open source and available on my GitHub.

I hope you enjoyed this article!

For any questions or feedback, don’t hesitate to contact me via my LinkedIn or Facebook. I will be pleased to hear from you.


Machine Learning Applied to Mammogram Classification was originally published in The Startup on Medium.
