
Data Science Apps in the Cloud with Heroku

Category : Infrastructure

I recently had an opportunity to work with Heroku, a platform-as-a-service for deploying and running apps, to deploy Python-based data science applications in the cloud. At first, I didn’t understand why the engagement wasn’t just using AWS, since data science instances abound on the EC2 marketplace. What I learned, however, is that AWS can be a money pit for businesses without a dedicated IT team. It is a complex beast that requires competent professionals to tame. Heroku, on the other hand, seems to just work.

In this article, I’ll go through the basics of creating a Heroku application that at least loads popular data science dependencies in Python. In later articles I may take the example to the end, where I load the Iris data set, run a regression on it using the statsmodels package, and write the results into a database on Heroku. All of this can be run using Heroku’s very simple free scheduler.

Preliminaries

To get started, create a free Heroku account at www.heroku.com, install the Heroku CLI, and run the following commands in a bash shell (Windows 10 users are encouraged to use the Ubuntu 18.04 app):

git clone https://github.com/baogorek/HerokuExample.git
cd HerokuExample
heroku login
heroku create

The first line clones the repository, the second changes into the folder that contains the Heroku application, and the third opens a browser window to log into your Heroku account. Finally, heroku create registers the app with the service. The output of that line is:

Creating app... done, ⬢ mysterious-badlands-45487
https://mysterious-badlands-45487.herokuapp.com/ | https://git.heroku.com/mysterious-badlands-45487.git

which shows us that the app has been given a name, a URL, and its own git repository. If you navigate to the URL, you get a default welcome screen, but we won’t be building a web app in this article.

The git repository is interesting, because it seemed like we already had one. But this is a git remote hosted by Heroku itself, and it’s a big part of their deployment strategy. If I run git remote -v, I can see it:

heroku  https://git.heroku.com/mysterious-badlands-45487.git (fetch)
heroku  https://git.heroku.com/mysterious-badlands-45487.git (push)
origin  https://github.com/baogorek/HerokuExample.git (fetch)
origin  https://github.com/baogorek/HerokuExample.git (push)

Even though I haven’t added anything new through git, I can deploy the app as it stands by pushing to the heroku remote:

git push heroku master

Just that simple command sets off a lot of activity. Here is a selection of the output:

remote: Compressing source files... done.
remote: Building source:
remote:
remote: -----> Python app detected
remote: -----> Installing python-3.6.8
...
remote:        Installing collected packages: numpy, scipy, six, python-dateutil, pytz, pandas, patsy, cython, statsmodels, psycopg2-binary   
...
remote: -----> Launching...
remote:        Released v3
remote:        https://mysterious-badlands-45487.herokuapp.com/ deployed to Heroku

Pushing to the “heroku” remote triggered the build of a Python application with data science dependencies such as numpy, scipy, pandas, and statsmodels. We see at the end that the app was “deployed.”

Testing it out

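Since Heroku is based on containers, one quick way to test that our app has the data science dependencies that we think it does is to spin up a one-off container (a “dyno” in Heroku’s terms). It runs on Heroku’s servers, not on the local machine, with a shell attached to our terminal. We can do that with: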

heroku run bash
python

In Python 3.6.8 within our one-off Heroku container, we can import a few packages just to make sure.

import statsmodels
import pandas

If you didn’t get an error, then your cloud-deployed Heroku app has these data science dependencies installed. Good!
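If you would rather not type imports by hand, the same check can be scripted. The following is a minimal sketch, not part of the example repository: the helper name check_imports is my own, and the package list is simply copied from the build output above.

```python
import importlib


def check_imports(module_names):
    """Map each module name to its version string, or None if the import fails."""
    results = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            results[name] = None
        else:
            # Not every module defines __version__, so fall back to a placeholder.
            results[name] = getattr(module, "__version__", "unknown")
    return results


if __name__ == "__main__":
    # Package names taken from the build output shown earlier (an assumption
    # about what you want to verify; adjust to your own requirements.txt).
    packages = ["numpy", "scipy", "pandas", "statsmodels"]
    for name, version in check_imports(packages).items():
        status = "MISSING" if version is None else "ok ({})".format(version)
        print("{}: {}".format(name, status))
```

Pasting this into the Python session on the Heroku container should report each package with its installed version, while a missing package shows up as MISSING instead of raising an error.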

Getting the dependencies

After exiting the Heroku container, look inside the requirements.txt file in your local copy of the repository:

cat requirements.txt

You’ll see a very modest text file with the following lines:

numpy
scipy==1.2
pandas
patsy
cython
statsmodels
psycopg2-binary

I specified scipy to be exactly version 1.2 based on advice from a post on a problem I was having, but otherwise these are the minimum dependencies specified by statsmodels.
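To make the difference between pinned and unpinned entries concrete, here is a small sketch that reads the lines above and reports which packages are pinned. The function parse_requirements is a hypothetical helper of mine; it only understands bare names and exact == pins, and it is not how pip itself parses requirement files.

```python
def parse_requirements(text):
    """Split simple requirement lines into (name, pinned_version_or_None) pairs.

    Handles only bare names and exact `==` pins, as in the file above.
    """
    parsed = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, sep, version = line.partition("==")
        parsed.append((name.strip(), version.strip() if sep else None))
    return parsed


# Contents of requirements.txt as listed in the article.
REQUIREMENTS = """\
numpy
scipy==1.2
pandas
patsy
cython
statsmodels
psycopg2-binary
"""

if __name__ == "__main__":
    for name, version in parse_requirements(REQUIREMENTS):
        if version is None:
            print("{}: unpinned".format(name))
        else:
            print("{}: pinned to {}".format(name, version))
```

Only scipy comes back pinned. Unpinned entries resolve to whatever version pip selects when the app is built, so two builds made at different times may not install identical versions.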

Why not conda?

There are some Heroku “buildpacks” for conda online, but many of them are years old and not well maintained. Using the requirements.txt file was a breeze, and I didn’t see a reason to struggle with getting conda to work. But it clearly is possible.

Running jobs

If loading statsmodels and pandas in a one-off Heroku container didn’t send your pulse above 100, it’s not you. But we’re not too far away from making our Heroku app do things. One way to have your app act is to use the text file called Procfile (no “.txt” extension). If you look inside the Procfile for this app, it is completely blank.

Instead, I used the Heroku Scheduler add-on to run a file, like HerokuExample/herokuexample/run_glm.py. You can see how easy it is to set up by looking at the following screenshot:

[Screenshot: Heroku Scheduler add-on configuration]

Since you could potentially spin up some serious computing resources using the Heroku Scheduler add-on, you do need a credit card on file to enable it.

Running a script

Just so we see some output in this article, add the following line to the Procfile:

release: python herokuexample/say_hello.py
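Each Procfile line has the form "process type: command". Here the process type is release, and the command is the Python invocation. As an illustration of that format only, this sketch splits Procfile text into a dictionary; parse_procfile is a hypothetical helper and has nothing to do with how Heroku reads the file internally.

```python
def parse_procfile(text):
    """Map each process type to its command for simple Procfile text."""
    processes = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, so commands may contain colons.
        process_type, sep, command = line.partition(":")
        if not sep:
            raise ValueError("not a 'type: command' line: {!r}".format(raw))
        processes[process_type.strip()] = command.strip()
    return processes


if __name__ == "__main__":
    procfile_text = "release: python herokuexample/say_hello.py\n"
    for process_type, command in parse_procfile(procfile_text).items():
        print("{} -> {}".format(process_type, command))
```

An empty Procfile, like the one the repository starts with, simply yields an empty dictionary.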

Add the file to git staging, commit it, and then push to the heroku remote:

git add Procfile
git commit -m "Updating Procfile"
git push heroku master

Among the output lines you will find:

remote: Verifying deploy... done.
remote: Running release command...
remote:
remote: I loaded statsmodels

Indeed it did.

Next steps

To do something interesting without a full-blown web app, we really need a database. Fortunately, Heroku has powerful database add-ons that can complete the picture of a useful data science application that runs in the cloud. Leave a comment if you want to hear about Heroku databases in conjunction with data science apps, and I’ll add it to the queue.
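As a small preview, the Heroku Postgres add-on exposes its connection string to the app through the DATABASE_URL config var. The sketch below only breaks such a URL into its parts and does not open a connection. The fallback URL is made up so the script also runs outside Heroku, and parse_database_url is a hypothetical helper name.

```python
import os
from urllib.parse import urlparse


def parse_database_url(url):
    """Break a postgres:// URL into the pieces a database driver typically needs."""
    parts = urlparse(url)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        "password": parts.password,
        "dbname": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # Made-up example URL, used only when DATABASE_URL is not set.
    example = "postgres://someuser:secret@example-host.invalid:5432/exampledb"
    settings = parse_database_url(os.environ.get("DATABASE_URL", example))
    # Print everything except the password.
    print({key: value for key, value in settings.items() if key != "password"})
```

The resulting keys match keyword arguments accepted by psycopg2.connect, which is presumably why psycopg2-binary already appears in requirements.txt, though the actual database code is left for a later article.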