Logistic Regression in Python – Reshaping Data
Whenever an organization conducts a survey, it tries to gather as much information as possible from its customers, on the idea that the information will be useful to the organization at some later point. To solve the problem at hand, however, we need to pick out only the information that is directly relevant to it.
Show All Fields
Now, let’s see how to select the data fields that will be useful to us. Run the following statement in the code editor.
In [6]: print(list(df.columns))
You will see the following output:
['age', 'job', 'marital', 'education', 'default', 'housing', 'loan',
'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays',
'previous', 'poutcome', 'emp_var_rate', 'cons_price_idx', 'cons_conf_idx',
'euribor3m', 'nr_employed', 'y']
The output lists the names of all the columns in the dataset. The last column, “y,” indicates whether the customer has a fixed (term) deposit with the bank. The field is described as taking the value “y” or “n,” although the df.head() output later in this section shows it encoded as 1 or 0. You can read the description and purpose of each column in the bank-name.txt file downloaded as part of the data.
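Before choosing columns, it can help to look at how the target field is actually encoded. The following is a minimal sketch, not part of the original notebook: `sample` is a hypothetical stand-in for `df`, and its four rows are made-up values used only to show the calls.

```python
import pandas as pd

# Hypothetical stand-in for the tutorial's DataFrame; the values are made up.
sample = pd.DataFrame({
    "job": ["technician", "retired", "services", "management"],
    "y": [0, 1, 0, 1],
})

# List the column names, as done above with df.columns.
column_names = list(sample.columns)
print(column_names)

# Count how often each value of the target column occurs.
target_counts = sample["y"].value_counts()
print(target_counts)
```

Running the same two calls on the real `df` would show whether “y” holds 0/1 or text labels in your copy of the data.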
Eliminating Unnecessary Fields
By examining the column names, you will see that some fields are not relevant to the current problem. For example, fields such as month, day_of_week, and campaign are of no use to us. We will remove these fields from our DataFrame. To drop columns, we use the drop command as shown below:
In [8]: # Drop the columns that are not needed.
df.drop(df.columns[[0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]],
        axis=1, inplace=True)
This command drops the columns at positions 0, 3, 7, 8, and so on. The positions are zero-based. To make sure you have picked the right index, check it before running the drop with the following statement:
In [7]: df.columns[9]
Out[7]: 'day_of_week'
This will print out the column name for the given index.
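Positional indexes are easy to get wrong, so one alternative is to drop the columns by name. The sketch below is an illustration under stated assumptions rather than the tutorial's own method: `toy` is a hypothetical one-row DataFrame with dummy values, built only so that the column names match the list printed earlier.

```python
import pandas as pd

# Column names as printed earlier in this section.
all_columns = ['age', 'job', 'marital', 'education', 'default', 'housing',
               'loan', 'contact', 'month', 'day_of_week', 'duration',
               'campaign', 'pdays', 'previous', 'poutcome', 'emp_var_rate',
               'cons_price_idx', 'cons_conf_idx', 'euribor3m', 'nr_employed',
               'y']

# Hypothetical one-row frame with dummy values, used only for the column names.
toy = pd.DataFrame([list(range(len(all_columns)))], columns=all_columns)

# Translate the positional indexes from the drop command into names.
positions = [0, 3, 7, 8, 9, 10, 11, 12, 13, 15, 16, 17, 18, 19]
names_to_drop = [all_columns[i] for i in positions]
print(names_to_drop)

# Dropping by name and dropping by position leave the same columns.
by_name = toy.drop(columns=names_to_drop)
by_position = toy.drop(toy.columns[positions], axis=1)
print(list(by_name.columns))
```

Printing `names_to_drop` checks every index at once instead of one at a time, and both variants return a new DataFrame rather than modifying the original in place.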
After dropping the unnecessary columns, examine the data with the head method. The screen output is shown here:
In [9]: df.head()
Out[9]:
           job  marital  default  housing  loan     poutcome  y
0  blue-collar  married  unknown      yes    no  nonexistent  0
1   technician  married       no       no    no  nonexistent  0
2   management   single       no      yes    no      success  1
3     services  married       no       no    no  nonexistent  0
4      retired  married       no      yes    no      success  1
Now we have only the fields that we believe are important for our data analysis and prediction. This is where the data scientist's judgment comes into play: the data scientist must select the appropriate columns for building the model.
For example, the job type may not look like an obvious field to keep at first glance, but it will be a very useful one. Not all types of customers will open a term deposit (TD). Low-income individuals might not open a TD, while high-income individuals often park their surplus money in one. The type of job therefore becomes very important in this scenario. Likewise, carefully select the columns that you believe are relevant to your analysis.
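One way to test a hunch like this is to compare how often the target is 1 within each category of a column. The following is a rough sketch only: `toy` is a hypothetical DataFrame whose six rows are invented for illustration, so the resulting rates say nothing about the real bank data.

```python
import pandas as pd

# Hypothetical, made-up rows; not taken from the bank dataset.
toy = pd.DataFrame({
    "job": ["retired", "retired", "blue-collar",
            "blue-collar", "management", "management"],
    "y": [1, 1, 0, 0, 1, 0],
})

# With y encoded as 0/1, the mean per group is the share of customers
# in that job category who have a term deposit.
rate_by_job = toy.groupby("job")["y"].mean()
print(rate_by_job)
```

If the rates differ clearly between categories in the real data, the column is likely worth keeping; if they are nearly identical, it may add little to the model.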
In the next chapter, we will prepare the data for model building.