Data Science Apps in the Cloud with Heroku
Category : Infrastructure
I recently had an opportunity to work with Heroku, a platform-as-a-service for deploying and running apps, to deploy Python-based data science applications in the cloud. At first, I didn’t understand why the engagement wasn’t simply using AWS, since data-science-ready instances abound in the EC2 marketplace. What I learned, however, is that AWS can be a money pit for businesses without a dedicated IT team; it is a complex beast that requires competent professionals to tame. Heroku, on the other hand, just seems to work.
In this article, I’ll go through the basics of creating a Heroku application that at least loads popular data science dependencies in Python. In later articles I may take the example to the end: loading the Iris data set, running a regression on it using the statsmodels package, and writing the results into a database on Heroku. All of this can be run using Heroku’s very simple free scheduler.
To get started, create a free Heroku account at www.heroku.com, install the Heroku CLI, and run the following commands in a bash shell (Windows 10 users are encouraged to use the Ubuntu 18.04 app):
git clone https://github.com/baogorek/HerokuExample.git
cd HerokuExample
heroku login
heroku create
The first command clones the repository, the second changes into the folder that contains the Heroku application, and the third opens a browser window to log in to your Heroku account. Finally,
heroku create registers the app with the service. The output of that last command is:
Creating app... done, ⬢ mysterious-badlands-45487
https://mysterious-badlands-45487.herokuapp.com/ | https://git.heroku.com/mysterious-badlands-45487.git
which shows that the app is given a name, a URL, and its own git repository. If you navigate to the URL, you’ll see a default welcome screen, but we won’t be building a web app in this article.
The git repository is interesting, because it seemed like we already had one. But this is a git remote hosted by Heroku itself, and it’s a big part of their deployment strategy. If I run
git remote -v, I can see it:
heroku  https://git.heroku.com/mysterious-badlands-45487.git (fetch)
heroku  https://git.heroku.com/mysterious-badlands-45487.git (push)
origin  https://github.com/baogorek/HerokuExample.git (fetch)
origin  https://github.com/baogorek/HerokuExample.git (push)
Even though I haven’t added anything new through git, I can deploy the app I have by pushing to the heroku remote:
git push heroku master
Just that simple command sets off a lot of activity. Here is a selection of the output:
remote: Compressing source files... done.
remote: Building source:
remote:
remote: -----> Python app detected
remote: -----> Installing python-3.6.8
...
remote: Installing collected packages: numpy, scipy, six, python-dateutil, pytz, pandas, patsy, cython, statsmodels, psycopg2-binary
...
remote: -----> Launching...
remote: Released v3
remote: https://mysterious-badlands-45487.herokuapp.com/ deployed to Heroku
Pushing to the “heroku” remote triggered the build of a Python application with data science dependencies such as numpy, scipy, pandas, and statsmodels. We see at the end that the app was “deployed.”
Testing it out
Since Heroku is based on containers, one quick way to test that our app has the data science dependencies we think it does is to spin up a one-off dyno with a bash shell and start Python inside it:
heroku run bash
python
Within the one-off dyno’s Python 3.6.8 session, we can import a few packages just to make sure:
import statsmodels
import pandas
If you didn’t get an error, then your cloud-deployed Heroku app has these data science dependencies installed. Good!
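To go a step further, you could print each package’s version from that Python session. Here is a minimal sketch (it uses importlib so a missing package is reported rather than raising an error):

```python
# Minimal sketch: report the version of each data science dependency,
# or note that it is missing, from inside the app's Python session.
import importlib

for name in ("numpy", "scipy", "pandas", "statsmodels"):
    try:
        module = importlib.import_module(name)
        print(name, module.__version__)
    except ImportError:
        print(name, "is not installed")
```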
Getting the dependencies
Exiting out of the one-off dyno, look inside the requirements.txt file in the repository.
You’ll see a very modest text file with the following lines:
numpy
scipy==1.2
pandas
patsy
cython
statsmodels
psycopg2-binary
I pinned scipy to exactly version 1.2 based on advice from a post about a problem I was having; otherwise, these are the minimum dependencies specified by statsmodels.
Why not conda?
There are some Heroku “buildpacks” for conda online, but many of them are years old and not well maintained. Using the
requirements.txt file was a breeze, and I didn’t see a reason to struggle with getting conda to work. But it clearly is possible.
If loading statsmodels and pandas in a one-off dyno didn’t send your pulse above 100, it’s not you. But we’re actually not too far away from making our Heroku app do things. One way to have your app act is to use the text file called
Procfile (no “.txt” extension). If you look inside the
Procfile for this app, it is completely blank.
Instead, I used the Heroku Scheduler add-on to run a file, such as HerokuExample/herokuexample/run_glm.py. You can see how easy it is to set up in the following screenshot:
Since you could potentially spin up some serious computing resources using the Heroku Scheduler add-on, you do need a credit card on file to enable it.
Running a script
Just so we see some output in this article, add the following line to the Procfile:
release: python herokuexample/say_hello.py
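The say_hello.py script itself isn’t shown in this article; judging from the release output further down, something like this minimal sketch would do (the repo’s actual file may differ):

```python
# herokuexample/say_hello.py -- a minimal sketch; the real file may differ.
# Importing statsmodels proves the heavy dependencies load at release time.
import statsmodels

print("I loaded statsmodels")
```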
Add the file to git staging, commit it, and then push to the heroku remote:
git add Procfile
git commit -m "Updating Procfile"
git push heroku master
Among the output lines you will find:
remote: Verifying deploy... done.
remote: Running release command...
remote:
remote: I loaded statsmodels
Indeed it did.
To do something really interesting without a full-blown web app, we need a database. Fortunately, Heroku has powerful database add-ons that can complete the picture of a useful data science application running in the cloud. Leave a comment if you want to hear about Heroku databases in conjunction with data science apps, and I’ll add it to the queue.
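As a teaser, here is a rough, hypothetical sketch of the write-to-a-database step. On Heroku, the Postgres add-on exposes its connection string in the DATABASE_URL environment variable; this sketch falls back to an in-memory SQLite database when that variable is absent so it also runs on your own machine. The table and column names, and the coefficient value, are made up for illustration:

```python
# Hypothetical sketch: persist a model coefficient to the app's database.
import os
import sqlite3

db_url = os.environ.get("DATABASE_URL")
if db_url:
    import psycopg2  # psycopg2-binary is already in requirements.txt
    conn = psycopg2.connect(db_url)
    placeholder = "%s"  # psycopg2 uses %s-style parameters
else:
    conn = sqlite3.connect(":memory:")  # local stand-in when no Postgres is attached
    placeholder = "?"  # sqlite3 uses ?-style parameters

cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS model_results (coef_name TEXT, estimate REAL)")
cur.execute(
    "INSERT INTO model_results (coef_name, estimate) VALUES ({0}, {0})".format(placeholder),
    ("petal_length", 0.42),  # made-up coefficient for illustration
)
conn.commit()

cur.execute("SELECT coef_name, estimate FROM model_results")
rows = cur.fetchall()
print(rows)
conn.close()
```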