Simple Data Analysis

2 min read · Mar 20, 2021

The easiest way to overcome cancer is early detection, that is, the timely identification of cancerous cells in the body. According to figures found through a Google search, 10–20% of people with cancer are misdiagnosed, and in one set of 583 cases, 28% were life-threatening or life-altering. To win the war against cancer, it is therefore important that we greatly cut down the rate of misdiagnosis.

In this article, I will analyze the common traits of cancerous and non-cancerous cells using a sample breast cancer dataset from Kaggle (cancer and non-cancer classification).

The data contains cell characteristics such as radius, concavity, texture, perimeter, smoothness and symmetry, and it also identifies which cells are cancerous and which are not. I used pandas and NumPy to analyse the data and Matplotlib for visualization. My Jupyter notebook containing this analysis can be accessed here:

https://github.com/lufunmbi/Data-Analysis

I used the info() function in pandas to get an overall view of the data and the describe() function to get a statistical summary of it. The outcome column contains two values, 0 and 1, which indicate whether a cell is cancerous or not. I plotted a pie chart showing the percentage of cancerous and non-cancerous cells in the dataset.
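As a rough sketch of this first look at the data, the snippet below runs info(), describe() and a pie chart on a tiny stand-in DataFrame. The values are made up and the column names ("mean radius", "mean texture", "outcome") are assumptions for illustration; the real notebook loads the Kaggle file instead.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the Kaggle data: made-up values, assumed column names.
df = pd.DataFrame(
    {
        "mean radius": [10.2, 14.5, 12.1, 17.9, 11.3, 20.4],
        "mean texture": [15.1, 20.3, 17.8, 22.6, 14.9, 25.0],
        "outcome": [0, 1, 0, 1, 0, 0],
    }
)

# info() prints the column names, dtypes and non-null counts.
df.info()

# describe() returns count, mean, std, min, quartiles and max for numeric columns.
summary = df.describe()
print(summary)

# Share of each outcome value, expressed as a percentage.
outcome_pct = df["outcome"].value_counts(normalize=True).sort_index() * 100
print(outcome_pct)

# Pie chart of the outcome split; labels come straight from the column values.
fig, ax = plt.subplots()
ax.pie(
    outcome_pct.to_numpy(),
    labels=[f"outcome {value}" for value in outcome_pct.index],
    autopct="%1.1f%%",
)
ax.set_title("Outcome distribution")
```

In a Jupyter notebook the figure is displayed automatically; in a plain script, calling plt.show() afterwards would be needed.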

I made a copy of the dataset, renamed some of the columns to prevent errors during usage, and dropped the columns that I would not be using for this analysis. Below is the statistical summary of the new dataset, as obtained from the describe() function.
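The copy, rename and drop steps could look roughly like the sketch below. It assumes the errors came from spaces in the column names, and the dropped column is a hypothetical choice; the data is again a made-up stand-in rather than the Kaggle file.

```python
import pandas as pd

# Hypothetical stand-in data with assumed column names.
raw = pd.DataFrame(
    {
        "mean radius": [10.2, 14.5, 12.1, 17.9],
        "mean texture": [15.1, 20.3, 17.8, 22.6],
        "mean fractal dimension": [0.06, 0.07, 0.06, 0.08],
        "outcome": [0, 1, 0, 1],
    }
)

# Work on a copy so the original DataFrame stays untouched.
data = raw.copy()

# Assumption: replace spaces with underscores so the names are easier to type
# and can be used as attributes (for example data.mean_radius).
data = data.rename(columns=lambda name: name.strip().replace(" ", "_"))

# Drop a column that is not needed for the analysis (hypothetical choice).
data = data.drop(columns=["mean_fractal_dimension"])

# Summary statistics for the trimmed dataset.
print(data.describe())
```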

I then compared some of the features across the two outcome groups, as seen below:

From the figures above, the non-cancerous cells (0) have larger dimensions than the cancerous cells (1).
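One way to make this kind of comparison is to group the rows by the outcome column and average each feature per group. The sketch below assumes renamed columns such as "mean_radius" and "mean_perimeter" and uses placeholder numbers, so its output does not reproduce the figures from the notebook.

```python
import pandas as pd

# Placeholder values only; they say nothing about real cells.
data = pd.DataFrame(
    {
        "mean_radius": [12.0, 11.0, 14.0, 13.0],
        "mean_perimeter": [78.0, 84.0, 82.0, 86.0],
        "outcome": [0, 1, 0, 1],
    }
)

# Average each feature within the two outcome groups (0 and 1).
group_means = data.groupby("outcome")[["mean_radius", "mean_perimeter"]].mean()
print(group_means)
```

Each row of group_means corresponds to one outcome value, so the two groups can be read side by side or passed to a bar chart.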


Written by Bi