Logistic Regression in Python – Getting the Data

This chapter discusses in detail the steps for obtaining data for logistic regression in Python.

Downloading the Dataset

If you haven’t downloaded the UCI dataset mentioned earlier, download it now from the UCI Machine Learning Repository. Click on the “data” folder to see the list of files available for download.

Click on the provided link to download the bank.zip file. The zip file contains three files: bank.csv, bank-names.txt, and bank-full.csv.

We will use the bank.csv file for model development. The bank-names.txt file contains a description of the database you will need later. The bank-full.csv file contains a larger dataset you can use for more advanced development.
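If you prefer to unpack the archive from Python rather than by hand, the step can be sketched as follows. This is only an illustration using the standard zipfile module; the helper name extract_archive and the target folder are assumptions, not part of the original project code.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir="."):
    """Extract every file in zip_path into target_dir and return the member names.

    Hypothetical helper for illustration; the tutorial itself unpacks the file manually.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()  # file names stored inside the archive
        archive.extractall(target)  # unpack all of them into target_dir
    return names


# Example usage, assuming bank.zip sits in the current working directory:
# print(extract_archive("bank.zip", "data"))
```

The returned list lets you confirm that the expected CSV and text files are present before moving on.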

We have included the bank.csv file in the downloadable source code archive. This file contains comma-delimited fields. We have also made some modifications to the file. We recommend that you use the files in the project source archive for your learning.

Loading Data

To load the data from the CSV file you just copied, enter the following statement and run the code.

In [2]: df = pd.read_csv('bank.csv', header=0)
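The statement above relies on pandas having been imported as pd. To show what header=0 does, here is a minimal, self-contained sketch; the three-column in-memory sample and its column names are made up for illustration and merely stand in for bank.csv.

```python
import io

import pandas as pd

# Hypothetical three-column sample standing in for bank.csv.
sample_csv = io.StringIO(
    "age,job,y\n"
    "30,blue-collar,0\n"
    "39,services,0\n"
)

# header=0 tells pandas that the first line of the file holds the column names.
sample_df = pd.read_csv(sample_csv, header=0)
print(sample_df.columns.tolist())
```

With header=0, the first line becomes the column labels and the remaining lines become data rows.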

You can also check the loaded data by running the following code statement −

In [3]: df.head()

Once the command runs, it prints the first five rows of the loaded data. Check that 21 columns are present. We will use only a few of these columns for model development.
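The column check can also be done in code rather than by eye. The sketch below is one possible way to do it; it uses a small made-up DataFrame named df_demo, so the expected count is 3 here, whereas for the tutorial's file it would be 21.

```python
import pandas as pd

# Hypothetical miniature frame; in the tutorial, df is loaded from bank.csv.
df_demo = pd.DataFrame(
    {"age": [30, 39], "job": ["blue-collar", "services"], "y": [0, 0]}
)

expected_columns = 3  # would be 21 for the tutorial's bank.csv
n_columns = df_demo.shape[1]  # shape is (rows, columns)
if n_columns != expected_columns:
    raise ValueError(f"expected {expected_columns} columns, found {n_columns}")

print(df_demo.columns.tolist())  # names of all columns, in file order
```

Printing the column names is a convenient starting point for deciding which columns to keep.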

Next, we need to clean the data. It may contain rows with NaN values. To eliminate these rows, use the following command −

In [4]: df = df.dropna()

Luckily, bank.csv does not contain any rows with NaN values, so this step is not strictly needed in our case. In general, however, such rows are hard to spot by eye in a large dataset, so it is safer to run the above statement to clean the data.
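One way to see whether dropna() would remove anything is to count the missing values first. The sketch below uses a small made-up DataFrame (df_demo) with deliberately missing entries; the variable names are illustrative and not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two incomplete rows, for illustration only.
df_demo = pd.DataFrame(
    {"age": [30, np.nan, 41], "job": ["services", "admin.", None]}
)

missing_per_column = df_demo.isna().sum()  # count of missing values in each column
rows_with_missing = int(df_demo.isna().any(axis=1).sum())  # rows with at least one missing value

cleaned = df_demo.dropna()  # returns a new frame; df_demo itself is unchanged

print(missing_per_column)
print(rows_with_missing, df_demo.shape, cleaned.shape)
```

If the count of rows with missing values is zero, dropna() leaves the data unchanged.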

Note − You can check the data size at any point by using the following statement −

In [5]: print(df.shape)
(41188, 21)

The output, shown in the second line above, gives the number of rows followed by the number of columns.

The next thing to do is to check whether each column is suitable for the model we are trying to build.
