Bank telemarketing campaign analysis

Analyzing data to build a machine learning model

Rômulo Peixoto
Analytics Vidhya

--

Today I want to talk about my first project and its first iteration: predicting, from a client's information, whether he or she would subscribe to a term deposit, and figuring out how to get there.

I got this data set from the UCI Machine Learning Repository; it was made available alongside the article by Moro et al., 2014, which used a different set of tools to predict the subscription. The authors worked in R, and the metrics they evaluated were the area under the curve (AUC) and lift analysis, which consists of plotting the cumulative deciles of the population against the cumulative percentage of real positive answers; the area under this lift curve (ALIFT) was the other metric they evaluated.

So, as usual, I’m going to be using Python as my programming language, with all that juicy Jupyter notebooks, pandas, and seaborn extravaganza we are all used to. For this first iteration I’m just going to take a look at the data, see what information I can extract before actually tackling it with a machine learning model, and build a dashboard so we can access this data more easily. The dashboard tool of choice is Streamlit, which I discovered and learned on the fly, so check it out.

You can check the code here.

So, what’s the data for this challenge?

So, downloading the data from the web page I got three files: one with all 45211 entries (the full information), one with a 4521-entry sample, and a .txt file with some general information about the data sets. According to the authors, the smaller sample exists for heavier processing algorithms; if you want to use support vector machines, for example, you might want to work with this smaller set.

There are also additional data sets available, the "bank-additional" data sets. These contain a smaller population, 41188 entries, but, more importantly, more features, which means more variables to build our models with. The standard bank data sets carry 17 variables, have no missing values, and refer to the clients' bank information only; the additional data sets have 3 extra variables, 20 in total, covering social and economic context information, and they do have some missing values. For now, I’m trying to make it work with the bank information only.
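For reference, loading the standard set with pandas looks more or less like this; the file name and the semicolon separator below are what the UCI download ships with, so adjust them if your copy differs:

```python
import pandas as pd

# Sketch of loading the standard bank data set; the UCI files are
# semicolon-separated (bank-full.csv is the 45211-row version).
df = pd.read_csv("bank-full.csv", sep=";")

print(df.shape)          # expecting (45211, 17)
print(df.isna().sum())   # the standard bank data set has no missing values
```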

Taking a sneak peek at the population

For starters, we need to take a look at how this data is distributed across all the columns, and at first sight I saw the day and month columns, which set up a very unpleasant task: converting them into a proper date column. I found a few ways of doing this on the internet, but they were incredibly over-complicated, so I did it the old-fashioned way: I opened Google Sheets and added all the year values by hand, which wasn't that hard actually, since the data set page says the information runs from 2008 to 2010.

Note: I lost a big chunk of my time trying to figure this out, so if you want to talk to me about why I went rogue here, feel free to leave a comment.
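In case you'd rather skip the Google Sheets route, here is a rough sketch of how the date column could be built in pandas; it assumes a hand-added year column (2008 to 2010) and that month comes as lowercase three-letter abbreviations, which is how it appears in my copy of the data:

```python
import pandas as pd

# Sketch only: assumes a 'year' column was already added by hand and that
# 'month' holds lowercase three-letter abbreviations ('jan', 'may', ...).
month_map = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
             "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

df["date"] = pd.to_datetime(
    {"year": df["year"], "month": df["month"].map(month_map), "day": df["day"]}
)
```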

After this sweet start, which wasn't so sweet, I ran some basic lines of code to check that everything was in its place. First I saw that the pdays variable has a value of -1, which gets in the way of the average and, consequently, the standard deviation calculation. This value is the number of days since this particular person was contacted for a previous campaign, and -1 marks those who were never contacted by any campaign before. I changed it to zero because I counted the number of zeros in this column and it is, not ironically, also zero, so replacing this value shouldn't come back to bite me soon. I'm fine for now.
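In code, that check and replacement boil down to something like this (a sketch of my own treatment, nothing fancy):

```python
# Sanity check: the raw column has no zeros, so -1 can safely be reused as 0
print((df["pdays"] == 0).sum())   # 0 in my copy of the data

# Replace the "never contacted before" sentinel -1 with 0
df["pdays"] = df["pdays"].replace(-1, 0)
```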

After this first treatment everything seemed to be all right, and I went for some frequency calculations, which meant a lot of graphs, a real visual representation festival: every kind of one-dimensional bar-like chart you could imagine plotly express and seaborn producing.

Counting what’s going on

So I plotted 42 graphs to see the shape of this blob of data, and I decided to run through each variable and tell you what is going on there. For convenience, I split the source dataframe into two others, yes_df and no_df, so we can also see, by the response to the campaign, what the main characteristics of people are; each column therefore generated three graphs, a more general one and one for each answer.
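The split and the per-column plots boil down to something like the sketch below; I'm showing the seaborn version here, and the plotly express one is analogous:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Split the source dataframe by the answer to the campaign
yes_df = df[df["y"] == "yes"]
no_df = df[df["y"] == "no"]

# Three bar-like charts per column: overall, yes-only and no-only
col = "job"  # any categorical column gets the same treatment
for name, data in [("all", df), ("yes", yes_df), ("no", no_df)]:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=data, x=col, order=data[col].value_counts().index)
    plt.title(f"{col} - {name}")
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
```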

  • Job

So the job variable is very self-explanatory, but it's more about the kind of job you have than what you actually do. There are 12 values, and the top three for the whole data set are blue-collar, management, and technician, in this order. For yes, the top three jobs are management, technician, and blue-collar, which was quite unexpected for me actually, but OK. And for no they are blue-collar, management, and technician.

These graphs alone don't tell us much, but they are good qualitative data for learning more about what kinds of customers the bank has.

  • Age

Here I thought I would see some shift between the three graphs, so let's see what happens.

The overall graph shows that most of our customers are between 24 and 60 years old, while the total range goes from 18 to 95.

The yes graph shows values more evenly distributed over that range, but still, most of the positives fall in the 24–39 bracket.

The no graph shows a big concentration around 29–30 years and then decreases towards the 60s, where there's a huge drop because there are few people in the 60 to 95 range.

All in all, these three graphs have very similar shapes, so this is a sign that there is a low correlation between age and acceptance of the campaign.

  • Education

So education has 4 classes: unknown, primary, secondary, and tertiary. Most of the population has secondary education.

The point to notice here is the ratio between classes within each answer to the campaign: in the yes class there is a little more tertiary education than secondary.

  • Default

So, in case you don't know, default is the term for when a customer fails to pay back any kind of debt he or she has, even if only some payments are missed. Regarding the overall data, there's a very low number of customers in default.

  • Housing

In this data set, housing indicates whether someone has a loan for a house, and things got a little more interesting here. Across the whole population, there are more customers with housing loans.

But if we analyze only the people who said yes to the campaign, we see a shift: people with no housing loan were more prone to say yes, which is kind of justifiable, but it's very interesting to see how this shifted.

  • Loan

So this variable indicates whether someone has a personal loan, beyond a housing loan, and after the last variable you might expect more or less the same behavior, right? Yeah, that didn't happen: here, most people don't have a personal loan.

  • Contact

This variable tells us the means used to get in touch with customers, and it has 3 classes: unknown, cellphone, and telephone, since this is a telemarketing campaign. Again, the profile is the same in all graphs, but in the yes population more customers were contacted through cellphone.

  • Poutcome

This variable describes how successful previous campaigns were with this particular customer, and it assumes 4 classes: unknown, failure, other, and success. The overall data is dominated by unknown outcomes.

The yes population, though, has higher ratios for both failure and success; the latter is quite understandable, but the former was quite a surprise, showing this growth in relevance for this population.

  • Previous

Previous means how many contacts happened with this particular client before this campaign. For starters, most of the population was never contacted before, which explains a lot of the unknown classes we saw earlier, and this trend holds across all graphs.

  • Duration

This variable measures how many seconds the last call took. Most of the population falls between 0 and 400 seconds, and the only thing I noticed different here is the number of missing values in the yes graph, which is a coincidence. The fastest call resulting in a yes took 8 seconds; I don't know who made that call, but what a salesperson.

There is more to say about this variable, but we'll talk about it when training some models.

  • Campaign

This measures the number of contacts made with a particular customer during this campaign, and it has the same graph profile across the board, so that's pretty much it: most people were contacted at most 5 times for this campaign.

  • Last contact date

Here is an interesting feature. When I downloaded the data set, two things caught my eye: first, the data was in chronological order, and second, the two features day and month say when the last contact with that person happened. Building a date variable for this data set felt natural, but it was no easy task with my knowledge of Python.

So, after everything was in place to plot the results over time, I couldn't help seeing that there is some correlation between the number of calls per day and the number of yeses to the campaign. Well, you can sniff that from miles away, but here we have a representation of this behavior. How hard can it be to find this relationship after all? And how would our results relate to the dynamic between the number of calls per day and how good the predictions of a classification model are? These are interesting questions someone with a more business-oriented view would ask, but that's a talk for later.
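For the curious, a rough way to get at that picture, assuming the rebuilt date column is called date:

```python
import matplotlib.pyplot as plt

# Contacts per day vs. positive answers per day (assumes the rebuilt 'date' column)
daily = df.groupby("date").agg(
    calls=("y", "size"),
    yes=("y", lambda s: (s == "yes").sum()),
)

print(daily.corr())                       # how the two daily series track each other
daily.plot(y=["calls", "yes"], figsize=(10, 4))
plt.show()
```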

  • Pdays

This variable is almost last because it has no plot. Yes, more than 75% of the data set wasn't contacted previously by another campaign, so there is no number of days to count a difference from. One weird behavior, though: the website says 999 means no previous contact, but in the actual data set it is -1, so take care with that.

  • y

And here is our target. Counting the yes and no answers shows that the population is heavily unbalanced towards no, which you can expect from a telemarketing campaign actually; I mean, I'm not the biggest fan of being called in the middle of the day to hear about term deposits, but some people were.
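The imbalance is easy to quantify with a couple of lines; run it on your copy to see the exact ratio:

```python
# How skewed is the target?
print(df["y"].value_counts())
print(df["y"].value_counts(normalize=True).round(3))
```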

How much do these variables correlate with each other?

So for this part, I decided to look for graphs with more than one dimension. I plotted the correlation matrix as a heatmap with seaborn, and the results weren't great: most of these numeric variables don't correlate at all.
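The heatmap itself is just the usual seaborn one-liner over the numeric columns:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric columns only
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
```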

But there is a hint of a correlation between pdays and year, so I plotted a scatter plot, and it's actually a very weird one. There is definitely a trend there, but it's weak, and I'm not sure it's worth mentioning: the correlation only exists for at most 25% of the population, which is a very small sample if we want to see any correlation, and not a very random one either.
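And the scatter I'm describing is roughly this, restricted to the clients who actually had a previous contact and using the hand-built year column:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Only clients with a previous contact make sense here
# (pdays > 0 after the earlier -1 -> 0 replacement); 'year' is the hand-built column.
contacted = df[df["pdays"] > 0]
sns.scatterplot(data=contacted, x="year", y="pdays", hue="y", alpha=0.3)
plt.show()
```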

Conclusion

If you read all the way to here, thank you, I appreciate that you like my content. This section is going to be a mixture of learnings, information, and what to expect next.

First, learnings: I never imagined it would take so much time to do all of this, to think, interpret, and wrestle with tools so many times before having something ready. Another point is that writing down what I'm doing was so helpful; many times I found myself thinking 'wait, is this really true?' or having another idea while I was writing. So writing works, very well.

What do we know now about this data? Well, it is very unbalanced, and not only for the nos: all across the board we saw variables where one class was bigger than the sum of all the others. Why is that important? If we want to train a model with this data set, I would say today that it would probably be better to make the model learn who would say no to the campaign, since there is a lot of variability among the people saying no anyway, but let's see how I can overcome this problem.

And, hopefully, my app should be up on Streamlit by Monday, because I read a lot about how to make an app, but when you get your hands dirty, man, it's a whole other story. As for training some models, I probably won't look at ALIFT and AUC analysis at first; I want to see how this goes with more conventional classification metrics, like the confusion matrix and things like that.

Keep in mind that I'm not stopping at good metrics; my intention with this project is to actually estimate how good a model built from this data set would be in production, with some real business metrics. But that's the future; for now, stay safe and take care.
