Logistic Regression in Python – Getting the Data
This chapter discusses in detail the steps for obtaining data for logistic regression in Python.
Downloading the Dataset
If you have not already downloaded the UCI dataset mentioned earlier, download it now from the UCI repository page and click on the “data” folder.
Click on the provided link to download the bank.zip file. It contains the files bank.csv, bank-full.csv, and bank-names.txt.
We will use the bank.csv file for model development. The bank-names.txt file contains a description of the dataset, which you will need later. The bank-full.csv file contains a larger dataset that you can use for more advanced development.
We have included the bank.csv file in the downloadable source code archive. This file contains comma-delimited fields. We have also made some modifications to the file. We recommend that you use the files in the project source archive for your learning.
Loading Data
To load the data from the CSV file you just copied, enter the following statement and run the code.
In [2]: df = pd.read_csv('bank.csv', header=0)
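The statement above assumes that pandas was imported as pd in an earlier cell. A minimal, self-contained sketch of the same load follows; the three-row sample and its column names are made up for illustration and stand in for bank.csv so that the snippet runs without the file.

```python
import io

import pandas as pd

# Hypothetical stand-in for bank.csv: a header line plus a few comma-delimited rows.
sample_csv = io.StringIO(
    "age,job,y\n"
    "30,admin.,0\n"
    "45,technician,1\n"
    "52,services,0\n"
)

# header=0 tells pandas that the first line holds the column names.
# With the real file, pass the path instead: pd.read_csv('bank.csv', header=0)
df = pd.read_csv(sample_csv, header=0)

print(df.shape)          # (rows, columns) of the loaded frame
print(list(df.columns))  # column names taken from the header line
```

With the real file, only the first argument changes; the rest of the call is the same as in the statement above.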
You can also check the loaded data by running the following code statement −
In [3]: df.head()
Once the command runs, it prints the first five rows of the loaded data. Check that there are 21 columns present. We will use only a few of these columns for model development.
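Counting columns by eye in the printed output is error-prone, so the check can also be done in code. The sketch below uses a hypothetical miniature frame in place of the loaded data; on the real df the count should come out as 21.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the loaded bank data.
df = pd.DataFrame(
    {"age": [30, 45], "job": ["admin.", "technician"], "y": [0, 1]}
)

n_columns = len(df.columns)  # number of columns in the frame
print(n_columns)
print(df.dtypes)             # data type pandas inferred for each column
```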
Next, we need to clean the data. The data may contain some rows with NaN values. To eliminate these rows, use the following command −
In [4]: df = df.dropna()
Luckily, bank.csv does not contain any rows with NaN values, so this step is not strictly needed in our case. In general, however, such rows are hard to spot by eye in a large dataset, so it is safer to run the above statement to clean the data.
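One way to find out whether dropna() will remove anything is to count the missing values first. The following is a minimal sketch on a hypothetical three-row frame with a single NaN; the column names and values are illustrative only and are not taken from bank.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value, used only for illustration.
df = pd.DataFrame(
    {"age": [30, np.nan, 52], "job": ["admin.", "technician", "services"]}
)

missing_per_column = df.isna().sum()              # NaN count for each column
rows_with_nan = int(df.isna().any(axis=1).sum())  # rows with at least one NaN

cleaned = df.dropna()  # drop every row that contains a NaN

print(missing_per_column)
print(rows_with_nan, df.shape, cleaned.shape)
```

If rows_with_nan is zero, as the text says is the case for bank.csv, dropna() leaves the shape unchanged.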
Note − You can check the size of the data at any point by using the following statement −
In [5]: print(df.shape)
(41188, 21)
The second line above is the output: the number of rows followed by the number of columns.
The next thing to do is to check whether each column is suitable for the model we are trying to build.