Logistic Regression in Python – Splitting Data

We have approximately 41,010 records. If we used the entire dataset to build the model, no data would be left for testing it. So we split the dataset into two parts, for example with a 70/30 split: 70% of the data is used to build the model, and the remaining 30% is used to test the model’s predictive accuracy. You can use a different split ratio based on your requirements.
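As a rough illustration of what a 70/30 split means for a dataset of this size, the sketch below computes the two partition sizes. The record count is the approximate figure quoted above, and the variable names are chosen only for this example.

```python
# Rough size of each partition for a 70/30 split.
n_records = 41_010       # approximate record count quoted above
test_fraction = 0.30     # share of rows held out for testing

n_test = round(n_records * test_fraction)
n_train = n_records - n_test

print(f"train: {n_train}, test: {n_test}")
```

The exact counts in practice depend on how the splitting function rounds, so treat these numbers as an estimate.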

Creating Feature Arrays

The X array contains all the features (data columns) we want to analyze, and the Y array is a one-dimensional array holding the target value for each record (0 or 1), which is the output the model will learn to predict. To understand this, let’s run some code.

First, execute the following Python statement to create the X array:

In [17]: X = data.iloc[:,1:]

To examine the contents of X, call head to print the first five records:

In [18]: X.head()

The array has one row per record and 23 columns.
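The positional slice can be tried on a tiny frame first. The sketch below assumes, as in this chapter, that the target sits in column 0 and the features fill the remaining columns; the frame `toy`, its column names, and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for `data`: target in column 0, features after it.
toy = pd.DataFrame({
    "y":        [0, 1, 0, 1],
    "age":      [25, 40, 33, 58],
    "duration": [120, 300, 95, 410],
})

# All rows, every column from position 1 onward (drops the target column).
X_toy = toy.iloc[:, 1:]

print(X_toy.shape)            # (rows, feature columns)
print(list(X_toy.columns))    # names of the remaining columns
```

Checking shape after the slice is a quick way to confirm that exactly one column was left out.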

Next, we will create the output array containing the “y” values.

Creating the Output Array

To create an array for the column to be predicted, use the following Python statement:

In [19]: Y = data.iloc[:,0]

Inspect its contents by calling head. The following output shows the result:

In [20]: Y.head()
Out[20]:
0    0
1    0
2    1
3    0
4    1
Name: y, dtype: int64
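Beyond head, a couple of quick checks can confirm that the target holds only the two class labels. This is a minimal sketch on a made-up series; `y_toy` and its values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical target column; the real one comes from data.iloc[:, 0].
y_toy = pd.Series([0, 0, 1, 0, 1], name="y")

print(y_toy.ndim)                        # number of dimensions
print(sorted(y_toy.unique().tolist()))   # distinct class labels
print(y_toy.value_counts().to_dict())    # rows per class
```

Running the same three calls on Y would show how many records fall into each class before the data is split.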

Now, split the data using scikit-learn's train_test_split function (available from sklearn.model_selection):

In [21]: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

This creates four arrays: X_train, X_test, Y_train, and Y_test. As before, you can use head to examine their contents. Because no test_size is given, train_test_split falls back on its default and holds out 25% of the rows for testing; pass test_size=0.3 if you want a 70/30 split. We will use the X_train and Y_train arrays to train our model, and the X_test and Y_test arrays to test and validate it.
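To see the effect of an explicit split ratio, here is a small self-contained sketch. It assumes scikit-learn is installed; the frame `toy` and its columns are hypothetical, and test_size=0.3 is chosen here to request a 70/30 split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, target in column 0, two feature columns.
toy = pd.DataFrame({
    "y":  [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "f1": range(10),
    "f2": range(10, 20),
})
X_toy, y_toy = toy.iloc[:, 1:], toy.iloc[:, 0]

# test_size=0.3 requests a 70/30 split; random_state makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)   # feature partitions
print(y_train.shape, y_test.shape)   # matching target partitions
```

The features and targets are shuffled together, so each row of X_train keeps the same index label as the corresponding entry of y_train.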

Now, we are ready to build our classifier. We will study it in the next chapter.
