SQLAlchemy: merging two of the greatest tools for data science

How to get started on using SQL with Python the right way!

Rômulo Peixoto
Analytics Vidhya

--

So far, studying more and more on how to use some tools to get into a data science entry job I found out that the most important thing about stacking up a lot of tech into your curriculum is to learn how to use them with synergy. In other words, if you know a lot of Python and a lot of SQL, how can you use both technologies together to make your ideas flow better.

With that in mind, I started learning more about SQL and ended up with a very important and useful toolkit to merge these two must-haves on your resumé, SQLAlchemy. In this post, you will learn how to set up the environment to run with it, how it works, and why this is the best choice.

Object Relational Mapping

First, we need to learn more about what is the nature of SQLAlchemy, Object Relational Mapping (ORM) is a technique based on object-oriented programing that converts data between two systems. This is interesting for data science since you have to pick some data in places that are not based on the language you want to make your treatment, like in this case, for SQL-based databases to a Python treatment.

And this goal is achieved by using objects, which can be variables, functions, or in this case a data structure, since, I hope you guessed, is what you have in your SQL databases. And by that, SQLAlchemy can translate and make it usable for you in a Python environment.

Nice, so how do I start?

So, I’m going to make this tutorial based on the PostgreSQL server, but, it can be used with other SQL programs, like MySQL, Microsoft SQL Server, anything that builds up somewhere based on SQL that stores your data.

This is possible because SQLAlchemy is a backend-agnostic system, which means that it doesn’t rely on a single system to work, as long as you set up your system right and know how to access it. This is most definitely true since you need to know how to access it when you start to work on that particular database just with SQL.

Right foot to merge

So first of all, you need to create your object that will set the path to your server, and that’s the object you’ll do your operations, since it is a path then you must be able to find all your data structures in the server, it goes like this:

from sqlalchemy import create_engineengine = create_engine('dialect+driver://username:password@host:port/database')

Ok, that’s not information you have right away on your mind right? Where can you find it? Well in Postgres it’s just right-clicking on the server in the pgAdmin software and going into properties, you must find this information right away under the tab ‘Connection’.

Now, let’s run some operations

So now you might be eager to run some queries on your SQL server through your code right? So let’s learn a little bit more about the method execute() .

So before you start a new table (or if you messed up the first time), you might want to drop a table here and there, so you might request this:

engine.execute('DROP TABLE IF EXISTS public.teachers_add')

Here is important to point that you have to refer to your Schema, otherwise the engine won’t work properly. This kind of work is worth noting that your main Object is a path, so any operation you work upon must have a pathway of thinking.

But, that’s not the only thing you want to do right? You want your tables made out of your so carefully thought requests to be working structures on your Python code. So here comes the part where things really become one, working with Pandas, to produce workable variables, it can be as simple as this:

import pandas as pdsql = 'SELECT * FROM public.subjects_add'
data_db1 = pd.read_sql_query((sql), engine)
data_db1

And voilà! Now what you requested in your database is a beautiful Pandas DataFrame, and you can use it like such. Another kind of way you can use your requests from SQLAlchemy is by putting it on a NumPy array, but if you need this right away, a quick fix is to turn it into a DataFrame and then into a NumPy array, this quick fix is a two-step solution you’re going to end up doing anyway, the other way would be to make a list using list comprehension and then to make it into a NumPy array, do as you wish.

So you worked hard on your analytics, and at the end of the day you want to post your results or modifications on the server, but it is a DataFrame in your code, you can put it up to the server through Python too! The snippet goes like this:

data_db1.to_sql('subjects_add', engine)

This method has a parameter that can be really useful, if you already have a table named ‘subjects_add’, then you can choose what to do with the ‘if_exists’ parameter, it accepts the values [‘fail’, ‘replace’, ‘append’], the last two are self-explanatory, but the first one will raise an error if there is an equal table on the server already, and be chill, if you forget this value, the default is ‘fail’, so it will raise a value error.

Albeit this is easy to use, there is one last red flag to take care of here, and it can get nasty, let’s say you want to append it to an existing table, very well, how are you going to append it? You don’t want to end up with a Frankenstein’s monster on your company's tables, it can get real with data engineers man. So you might want to use the ‘method’ parameter, and if you don’t pass any value, it will write using a common INSERT clause, row by row, it accepts the value ‘multi’, where it will insert multiple entries into one INSERT clause per time. But, your entries are tricky, of course, because you live in the real world, so you can call a signature method, a function that appends based on your own terms.

What about psycopg2?

That’s a good question, actually SQLAlchemy interfaces with PostgreSQL through psycopg2, and if you’re going to do a custom method to append information on your tables inside the database you will need knowledge on psycopg2.

But I could do a full tutorial on how to use psycopg2 instead right? Well yes, that was the idea, do two tools for one tutorial, since they complement each other, but, I’ve been reading that sometimes psycopg2 misbehaves in some environments, and you will need a deep understanding of the tool to solve some of these problems, even though some doesn’t have any solution as far as I know.

That’s it boys and girls

I found this information really useful, and not very accessible on the internet, that’s why I decided to do this mini-tutorial to help you, so if you want to leave a comment on which you find better SQLAlchemy or psycopg2, or if you found this tutorial useful I would appreciate.

And of course, this is just to ensure you will start with the right foot, if you need more in-depth information about this tool check out the documentation for more information.

Thank you and take care!

--

--