BuildingAI: Logistic Regression (Breast Cancer Prediction) - Intermediate
In this series we will learn about real-world implementations of Artificial Intelligence. AI has grown significantly, and many of us are interested in knowing what we can do with it.
I think one of the best ways of learning this is hands-on experience with some of the libraries and algorithms.
I have started this series with one of the most basic machine learning examples, but an important one. You can use this algorithm to train a model on your own datasets.
Unlike other tutorials, we are not starting with the basics; we are going a little more advanced (but not too advanced). I think most of you will figure out how to change parts of the code by yourself.
First, a note on the tools: this part works with pandas and scikit-learn, both of which build on NumPy (Numerical Python).
We are using: Google Colab (https://colab.research.google.com)
Colab is a Google research project created to help disseminate machine learning education and research. It’s a Jupyter notebook environment that requires no setup to use and runs entirely in the cloud.
First Step: Open Colab and create a new Python notebook: File > New Notebook.
Second Step: Give your Python notebook a name. For example: BuildingAI_1.ipynb.
An IPYNB file is a notebook document used by Jupyter Notebook, an interactive computational environment designed to help scientists work with the Python language and their data.
Now import Pandas:
import pandas as pd
Pandas is a powerful and handy library for data manipulation and analysis. To learn more about it, see the pandas documentation.
Getting the dataset and polishing:
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
data.columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
                'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
                'Normal Nucleoli', 'Mitoses', 'Class']
data = data.drop(['Sample code'], axis=1)
Here, we download the dataset from the UCI repository and store it in a DataFrame called data. The data consists of measurements of different factors that are directly related to breast cancer. The file has no header row, so we have to specify the column names ourselves. We then drop the first column (the sample code), as we do not need that information for our prediction.
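To make the loading step concrete, here is a minimal sketch that parses three invented rows in the same comma-separated layout from an in-memory string instead of the URL. The rows are made up for illustration and are not records from the dataset. The na_values='?' argument is an addition that is not in the original code; it assumes that missing entries in the file are written as '?', as the UCI description reports for the Bare Nuclei column.

```python
import io
import pandas as pd

# Three invented rows in the same layout as the UCI file (not real records).
sample_csv = io.StringIO(
    "1000001,5,1,1,1,2,1,3,1,1,2\n"
    "1000002,8,10,10,8,7,?,9,7,1,4\n"
    "1000003,3,1,1,1,2,2,3,1,1,2\n"
)

columns = ['Sample code', 'Clump Thickness', 'Uniformity of Cell Size',
           'Uniformity of Cell Shape', 'Marginal Adhesion',
           'Single Epithelial Cell Size', 'Bare Nuclei', 'Bland Chromatin',
           'Normal Nucleoli', 'Mitoses', 'Class']

# header=None: the file has no header row, so we supply the names ourselves.
# na_values='?': assumption that '?' marks a missing value.
sample = pd.read_csv(sample_csv, header=None, names=columns, na_values='?')

# Drop the ID column, which carries no predictive information.
sample = sample.drop(columns=['Sample code'])

print(sample.shape)                        # (rows, columns)
print(sample['Bare Nuclei'].isna().sum())  # number of missing entries
```

Passing names= to read_csv has the same effect as assigning data.columns afterwards; either style works.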
Separate Features and Target:
predictors = ['Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape',
              'Marginal Adhesion', 'Single Epithelial Cell Size', 'Bland Chromatin',
              'Normal Nucleoli', 'Mitoses']
features = data[predictors]
target = data.Class
Here we select several columns from the dataset as features (the factors affecting breast cancer). The target is the final result that follows from the values of those factors. Since the result in this dataset is stored in the Class column, we take Class as the target. Note that Bare Nuclei is not among the predictors; in the UCI file this column contains some missing values marked with '?'.
Splitting the dataset for training and testing:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=0)  # splitting the dataset
In this code cell, we first import the necessary function from sklearn (see the scikit-learn documentation for more about the library) and then split the dataset into one part for training and one for testing. Here, test_size determines the fraction of the dataset reserved for testing. We set test_size to 0.25, so the training set is the remaining 0.75. Setting random_state to a fixed number makes the split reproducible.
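A quick way to check the 0.25 / 0.75 split is to run it on toy data. The sketch below uses 100 made-up samples (not the breast cancer data) so the sizes are easy to read; the variable names are chosen for this illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (hypothetical, for illustration).
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

print(len(X_tr), len(X_te))  # sizes of the training and test parts

# The same random_state gives the same split on every run.
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
print(np.array_equal(X_te, X_te2))
```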
Importing Logistic Regression:
from sklearn.linear_model import LogisticRegression
cancer = LogisticRegression()
cancer.fit(X_train, y_train)  # fitting the model
prediction = cancer.predict(X_test)  # making prediction
In this code cell, we first import LogisticRegression and then instantiate it. After that, we fit the instance to X_train and y_train, and use its .predict method to predict the class for the test data that we set aside earlier.
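To give a feel for what a fitted logistic regression model does at prediction time, here is a rough sketch in plain Python. It is not scikit-learn's implementation: the weights and bias are hand-picked, hypothetical numbers (a real model learns them in .fit, and scikit-learn also applies regularization by default), and only two features are used. The labels 2 (benign) and 4 (malignant) follow the class coding in the UCI description of this dataset.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Hypothetical weights and bias, chosen by hand; a fitted model learns these.
weights = [0.5, 0.3]
bias = -4.0

sample_low = [1, 1]    # small feature values
sample_high = [10, 8]  # large feature values

for x in (sample_low, sample_high):
    p = predict_proba(x, weights, bias)
    label = 4 if p >= 0.5 else 2  # 4 = malignant, 2 = benign
    print(x, round(p, 3), label)
```

The model computes a weighted sum of the features, squashes it into a probability with the sigmoid function, and picks the class by comparing that probability with 0.5.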
Using Confusion Matrix we evaluate our model:
from sklearn import metrics
cnf_matrix = metrics.confusion_matrix(y_test, prediction)  # rows: true class, columns: predicted class
print("Accuracy:", round(metrics.accuracy_score(y_test, prediction), 2))
This prints the accuracy of your trained model.
Accuracy is the fraction of all test samples that are correctly classified.
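The link between the confusion matrix and the accuracy can be worked out by hand. The sketch below uses short, made-up label lists (not model output) and plain Python; the matrix layout follows the scikit-learn convention of true classes in rows and predicted classes in columns.

```python
# Hypothetical true and predicted labels (2 = benign, 4 = malignant).
y_true = [2, 2, 4, 4, 2, 4, 2, 2]
y_pred = [2, 2, 4, 2, 2, 4, 4, 2]

labels = [2, 4]
# matrix[i][j] counts samples whose true label is labels[i]
# and whose predicted label is labels[j].
matrix = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    matrix[labels.index(t)][labels.index(p)] += 1

correct = matrix[0][0] + matrix[1][1]  # the diagonal holds correct predictions
total = len(y_true)
accuracy = correct / total

print(matrix)                           # [[4, 1], [1, 2]]
print("Accuracy:", round(accuracy, 2))  # Accuracy: 0.75
```

The off-diagonal cells show the two kinds of mistakes separately, which the single accuracy number hides.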
You can fine-tune your model, or in other words try to increase its accuracy, by changing different parameters, such as the size of the training and testing sets and the features (columns) that you select.
You can use a tree-based model such as ExtraTreesClassifier (from sklearn.ensemble) to estimate which columns are most important, and then train on those columns only, which may give a better prediction.
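As a rough sketch of that idea, the snippet below ranks columns by importance. It assumes the ensemble class ExtraTreesClassifier from sklearn.ensemble is the tool meant here, and it runs on synthetic data (not the breast cancer dataset) in which only the first column determines the label. The column names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)

# Synthetic data (not the breast cancer dataset): 200 samples, 3 features.
# Only the first feature determines the label; the other two are noise.
X = rng.randint(1, 11, size=(200, 3))
y = np.where(X[:, 0] > 5, 4, 2)

names = ['informative', 'noise_a', 'noise_b']  # hypothetical column names

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds one score per column; the scores sum to 1.
ranked = sorted(zip(names, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(name, round(score, 3))
```

On the real data you would pass features and target instead of X and y, then keep the top-ranked columns in predictors and retrain.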