Trendlines with Sklearn

How was my approach to the first time coding with this tool?

Rômulo Peixoto
5 min read · May 19, 2021

Continuing with my saga of working through a statistics-for-business book, today I want to talk about my first attempt at using a machine learning tool on one of the book’s exercises. It’s really not a big, complex problem, it’s fairly simple, but there are a few things I ran into that I really wanted to share with you here.

So if you haven’t read my older posts, I highly recommend them, you can find them here. I’m still working from the book Statistics for Business and Economics by David R. Anderson, Dennis J. Sweeney, and Thomas A. Williams, 11th edition, this time exercise 30 of chapter 2.

What’s the plan?

So, the idea is very simple: the exercise gives you a data set and asks you to build a scatter plot and fit a trendline to it.

Let’s be honest, that is about the easiest thing you can do. Seriously, a trendline is one line of code if you use something like Pandas, NumPy, or almost any other library. But I decided to go with scikit-learn. Why? Well, from what I found online, if you want any kind of metric on whether your model is good enough, you will need at least r2_score, which is an sklearn tool, so if I’m heading that way anyway, why not go all the way?
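Just to show what I mean by one line, here is a rough sketch with NumPy (the numbers are made up, they are not the exercise data):

import numpy as np

# made-up values, only to illustrate a one-line trendline fit
x = np.array([1, 2, 3, 4, 5])
y = np.array([5.2, 4.1, 3.3, 2.7, 1.9])

slope, intercept = np.polyfit(x, y, 1)   # the "one line": a degree-1 polynomial fit
print(slope, intercept)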

Ok, but why is it such an obvious choice?

Well, my friend, I’m going to give you the fast version, but really, go read about it properly, urgently, because you need it, seriously. Trendlines are built out of some kind of linear regression, which is a technique for finding the equation that best fits all of your data points.

So why is this so important for a data scientist? Basically, because you can predict new values out of that equation, and that’s valuable. There is a wide range of ways to do it; here I went with the least squares method. Without getting too deep into the maths, this approach builds the equation of the line that stays as close as possible to the points you have, and because the distances are squared, it doesn’t matter whether the errors are positive or negative.
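To make that a bit more concrete, here is a rough sketch of the textbook least squares formulas with made-up numbers (not my exercise code):

import numpy as np

# made-up points, only to illustrate the least squares formulas
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.9, 3.1, 2.2, 1.0])

# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# squaring the residuals is why the sign of the errors does not matter
residuals = y - (intercept + slope * x)
print(slope, intercept, np.sum(residuals ** 2))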

Sklearn also provides other, more complex and intricate ways of doing a linear regression, and it’s up to you to figure out which one fits your problem best. For now, though, I’m keeping it very simple: the point here is to learn how to use the tool, not to throw a support vector machine at an extremely simple data set.
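And part of why that swap would be easy later is that sklearn estimators share the same fit/predict interface. Here is my own toy illustration of that, again with made-up numbers, not part of the exercise:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# made-up data, just to show that the interface is the same for different models
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([4.5, 3.6, 2.8, 1.9])

for estimator in (LinearRegression(), SVR(kernel='linear')):
    estimator.fit(x, y)                              # same call for both models
    print(type(estimator).__name__, estimator.predict([[5.0]]))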

What’s the result then?

So here is the resulting code from this attempt:

# Exercise 30
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# load the exercise data set from the book's Excel file
scatter = pd.read_excel('~/Documentos/projetos_ds/livro_1/excel_files/ch_02/Scatter.xlsx', 'Data')
print(scatter)

# A) same rows for training and testing, since the goal is only a trendline
scatter_x_train = scatter.iloc[0:19, 1:2]   # the column slice keeps x two-dimensional
scatter_x_test = scatter.iloc[0:19, 1:2]
scatter_y_train = scatter.iloc[0:19, 2]
scatter_y_test = scatter.iloc[0:19, 2]

# fit a least squares linear regression and predict the y values of the line
trend = linear_model.LinearRegression()
model = trend.fit(scatter_x_train, scatter_y_train)
scatter_y_pred = trend.predict(scatter_x_test)
print('The coefficient is: {0}'.format(trend.coef_))
print('The intercept is: {0}'.format(trend.intercept_))
print('The correlation of the values is: {0}'.format(r2_score(scatter_y_test, scatter_y_pred)))

# plot the data points and the fitted trendline
plt.scatter(scatter_x_test, scatter_y_test, color='black')
plt.plot(scatter_x_test, scatter_y_pred, color='blue', linewidth=1)
plt.xticks(())
plt.yticks(())
plt.show()

If you have already read the simple linear regression example in the sklearn documentation, you will see a very close resemblance in structure, and yes, I took it from there; you can check it right here.

Now, let me tell you about my choices here, starting with the data inputs. If you read my last post, you know my view on free educational content on the internet and how much more you learn by going through the documentation. Here is what happened: in my country we don’t really have a separate word for an array, we call everything a matrix, so a 1D “matrix” is an array and a 2D array is a matrix, but in Brazil it’s all matrix. I’m pretty sure I learned it like this in some free coding content on YouTube about a year ago, and it became very confusing when applying it to a more practical problem.

Ok, because of that mixed-up terminology, things got weird in the beginning, since sklearn sees it differently: even if your x is a single pandas column, it has to be passed in as a 2D structure, and that is a requirement for any x you feed to the model. The reason is that sklearn’s linear regression reads the shape of your input as (n_samples, n_features), the number of rows in your data set and the number of x variables, which can be more than one in a multivariable analysis but here is just one. So the definition I learned wasn’t entirely wrong, it just turned into an unnecessary waste of time once it met a practical problem.
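Just to illustrate the shape issue with a toy example of my own (not the book’s data):

import numpy as np
from sklearn.linear_model import LinearRegression

x_1d = np.array([1.0, 2.0, 3.0, 4.0])    # shape (4,), sklearn refuses this as x
x_2d = x_1d.reshape(-1, 1)               # shape (4, 1): n_samples=4, n_features=1
y = np.array([5.0, 4.1, 3.0, 2.2])

LinearRegression().fit(x_2d, y)          # works
# LinearRegression().fit(x_1d, y)        # raises a ValueError asking you to reshape

That is also why the code above slices the x column as iloc[0:19, 1:2]: slicing the columns keeps the result as a 2D DataFrame, while iloc[0:19, 1] would return a 1D Series.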

Second, I kept every step of the example, training, testing and prediction, all fed with the same data. Why? I don’t want to predict anything here, just fit a trendline. I know it sounds like a misuse of such a powerful tool, but fitting it to this purpose is a way of exploring its properties, and when the time comes to actually predict something, I will know how. I’m still not 100% sure this is the right way, but it definitely worked.

It worked so well that my result was y = 4.66 - 0.95x with an R² of 0.73, a strong correlation.

The last point where I learned something new about sklearn was the prediction step, which is what gives our model a face when it’s plotted on a graph. There are two things in the code that looked like they could play that role, the model variable and scatter_y_pred: the first is just the fitted estimator returned by fit, while the second holds the y values the line predicts for our x values, and it’s the latter that gets plotted as the trendline.
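In other words, as I understand it, a minimal sketch with made-up numbers:

import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1.0], [2.0], [3.0]])      # made-up 2D x
y = np.array([4.0, 3.1, 2.2])

trend = LinearRegression()
model = trend.fit(x, y)                  # fit() returns the estimator itself
print(model is trend)                    # True: no predictions live in `model`

y_pred = trend.predict(x)                # predict() returns the line's y values for the given x
print(y_pred)                            # these are the points drawn as the trendline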

For plotting the graph I used Matplotlib, because that’s what the example uses; it would be easy to convert to Plotly, but since I’m practicing new tools, this was my way of learning. One other detail: I had to solve a problem between Jupyter Notebooks and Matplotlib, and it turned out to be a very rookie mistake. If you look closely, when importing Matplotlib I wasn’t using .pyplot, which produced that gasp only juniors give while coding.
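For reference, this is the difference that bit me, in my own minimal example:

import matplotlib                        # the top-level package: no scatter(), plot() or show() here
import matplotlib.pyplot as plt          # the module the sklearn example actually plots with

plt.scatter([1, 2, 3], [3, 2, 1])
plt.show()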

Now what?

So I decided to split my posts about this into two fronts: this one, the very technical one, and another that talks more about the statistics itself. This way I can go bananas on each front without anyone going ‘this is too long’.

What do you think? I would love to hear your thoughts on the best way to approach learning these tools, whether this seems like a good way of practicing and building foundations in statistics, and whether I did the right thing using the full example just to fit a trendline to the data set.

All in all, be safe, take care, and keep learning.
