Calculating the Accuracy in Machine Learning

Day 23 of #100DaysOfCode

This is the continuation of the previous blog. You can read it here: ilkecandan.hashnode.dev/learning-and-predic..

Now we should learn how to measure the accuracy of our models. First, we split our dataset into two sets: one for training and the other for testing. Generally, we allocate 70 to 80 percent of our data for training and the remaining 20 to 30 percent for testing. To do this, we need to import a couple of functions.

We should add this line to our existing code (see the previous blog):

from sklearn.model_selection import train_test_split

With this function we can split our dataset into two sets for training and testing.

Then we should call this function. We give it three arguments: X, y and a keyword argument. The keyword argument, test_size, specifies the size of our test dataset.

train_test_split(X, y, test_size=0.2)

So, we are allocating 20 percent of our data for testing. This function returns a list of four items that we can unpack into four variables. These will be called X_train, X_test, y_train and y_test. The first two are the inputs for training and testing, while the last two are the outputs for training and testing.
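Here is a minimal sketch of the unpacking step on its own. The small table below is a made-up stand-in for music.csv (the column names age and gender are assumptions for illustration, only genre comes from the code in this post), so that the sizes of the four pieces are easy to check.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for music.csv: 10 rows, two made-up input columns
# ("age", "gender") and the "genre" output column.
music = pd.DataFrame({
    "age": [20, 23, 25, 26, 29, 30, 31, 33, 35, 37],
    "gender": [1, 1, 1, 0, 0, 1, 0, 0, 1, 0],
    "genre": ["HipHop", "HipHop", "Jazz", "Jazz", "Dance",
              "Jazz", "Dance", "Acoustic", "Classical", "Classical"],
})
X = music.drop(columns=["genre"])
y = music["genre"]

# test_size=0.2 keeps 20 percent of the rows (2 out of 10) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(len(X_train), len(X_test))  # 8 2
print(len(y_train), len(y_test))  # 8 2
```

Which rows end up in the test set changes from run to run, but the sizes stay the same.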

Before going further, let's get into predictions. To calculate the accuracy, we have to compare the predictions with the actual values in our output set for testing. For this, we have to import another function.

from sklearn.metrics import accuracy_score

Now, the final code should look like this:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

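# Load the data and separate the inputs (X) from the output column (y)
music = pd.read_csv('music.csv')
X = music.drop(columns=['genre'])
y = music['genre']

# Keep 20 percent of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train on the training set only, then predict the outputs for the test inputs
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compare the predictions with the actual test outputs
score = accuracy_score(y_test, predictions)
score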

The accuracy_score function returns an accuracy score between 0 and 1.

And our result will look like this:

[Image: data set me.png]
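To see what the score means, here is a small sketch that works it out by hand in plain Python: the number of predictions that match the actual values, divided by the total number of predictions. This is the same fraction that accuracy_score reports with its default settings. The two lists are invented values for illustration, not results from the music dataset.

```python
# Hypothetical actual values and predictions for five test rows
y_true = ["Jazz", "HipHop", "Dance", "Jazz", "Classical"]
y_pred = ["Jazz", "HipHop", "Jazz", "Jazz", "Classical"]

# Count the positions where the prediction matches the actual value
correct = sum(t == p for t, p in zip(y_true, y_pred))

# Accuracy is the fraction of correct predictions
accuracy = correct / len(y_true)
print(accuracy)  # 0.8
```

Four of the five predictions match, so the score is 0.8. A score of 1 would mean every prediction was correct.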

You will probably see a different result every time you run this program. This is because the function chooses the rows for training and testing randomly, so the data is split differently on each run.
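If you want the same split on every run, one option is the random_state argument of train_test_split. It is not used in the code above, and the value 42 here is an arbitrary choice. The sketch below uses made-up toy lists instead of the music dataset to show that two calls with the same random_state pick the same test rows.

```python
from sklearn.model_selection import train_test_split

# Toy data for illustration only: ten inputs and ten outputs
X = list(range(10))
y = [value % 2 for value in X]

# Two calls with the same random_state produce the same split
first = train_test_split(X, y, test_size=0.2, random_state=42)
second = train_test_split(X, y, test_size=0.2, random_state=42)

# Index 1 of the returned list is X_test
print(first[1] == second[1])  # True
```

Without random_state, the two calls would usually pick different test rows, which is why the accuracy score changes between runs.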

A shortcut to run the current cell without adding a new cell below: 1- Activate the first cell. 2- Press CTRL and Enter.