While attempting to explain the Datascroller release process to collaborators John and Kevin over Hangouts, it quickly became apparent that these steps needed to be put in writing. The process below is very manual, tedious, and constructed from a trial and error approach. So take with a grain of salt and leave a comment if you know a better way!
Verifying the code to be released
PyPi is not especially forgiving if you release code you didn’t intend to, so it’s worth a bit of time to double check. After checking out the master branch and pulling changes from the remote, I check the following:
Is version in setup.py the right one for the release?
If using download_url (which I just learned is not advised), does the suffix align with the version?
After confirming that the right code is in my local repo, the next checks concern the functionality. Datascroller only has one test script right now that is not hooked up to CI/CD, so here are our steps:
Create a virtual environment with venv (e.g., python -m venv myenvs/test) and enter it
While still in my local repo, perform a developer install using pip: pip install -e .
Run any test scripts you have
For Datascroller I like to repeat the steps above on both Windows and Linux.
Uploading to TestPyPi
With PyPi, if you successfully release version 1.2.0 and then realize something is wrong with the code, then that’s too bad because you’ll never be able to release it again. You can release version 1.2.1, but that requires a code change to setup.py and might mess up your plans. That’s where TestPyPi comes in (make an account if you don’t have one). Back in the root of your local repo, while still in the virtual environment created above, perform the following steps:
pip install twine (a package especially for publishing to PyPi)
python setup.py sdist (creates the “dist” source distribution folder)
You’ll then get a link to go look at your package on TestPyPi. “See, looks good!” I told collaborators John and Kevin. “So, what exactly looks good?” asked John. He had a point, so I created new virtual environments and ran
to ensure a key demo worked when installed through TestPyPi. Note in the command above that you have to link to the real PyPi to get third-party packages such as Pandas. This is enabled via the flag --extra-index-url.
If something when wrong on version 1.2.0, then even on TestPyPi, you cannot delete and resubmit version 1.2.0. You would need to change the version in setup.py and then change it back before submitting to PyPi. In these cases I would append a suffix to the current version (e.g., 1.2.0rc1).
Uploading to PyPi
If everything looks good and your account is in good standing on PyPi.org, then you’re ready for the very last step:
twine upload --skip-existing dist/*
Getting right with Github
After a successful release to PyPi, the final step is to go to the Releases section of the Datascroller project and draft a new release. This will snapshot the project and provide downloads of the entire codebase in zip and tar.gz form. Since we were using the download_url argument in setup.py, we’d try to match the future name of the tarball that only exists after this step. Provided we really can (or should) leave that argument out next time, that’s one less thing to worry about.
When showing this workflow to John and Kevin, the feedback was, “there has to be a way to automate this.” Creating multiple virtual environments and installations in itself a hassle, so this is something we’ll be investing in soon. Please comment below regarding any weak points in the process above and suggestions to make releasing to PyPi easier and more fun!
I’ll be trying something different with this blog post – logging my efforts while going after a goal. It might not be riveting reading but I’m looking for ways to increase my output in 2020. And Happy New Year!
I’ve been enjoying the very cheap and surprisingly functional web hosting from Siteground, but one way they keep the cost down is by using software over a decade old and denying install privileges on the machines (that I can ssh in 24/7 for less than the price a beer a month still amazes me). You won’t be able to FTP a copy of Miniconda (trust me, I tried it), so you’re stuck with the LAMP framework. But at the same time, you have powerful data science functions written in Python. What do you do?
The rest of this stream-of-consciousness style blog article explores the use of Heroku to solve this problem. Though I do not explicitly show a website requesting output from Python data science modules, I believe all the pieces are in place by the end.
Glancing at my first Heroku article, I realize need to get the Heroku CLI. Given the limitations of Gunicorn, I’ll install this on Windows Subsystem for Linux (WSL). But the default way of installing the Heroku CLI on Ubuntu, Snap, does not work with WSL at the time of writing. This command will install it on WSL:
Next, I create my own version of todo.py from Step 4 of the article, and note that the sole import is Resource from the flask_restful module. I learn that the name of the API game here is to inherit from this abstract class Resource, so that your subclasses will be able to do real things with HTTP. Here’s the docstring from the Resource class:
Represents an abstract RESTful resource. Concrete resources should extend from this class and expose methods for each supported HTTP method. If a resource is invoked with an unsupported HTTP method, the API will return a response with status 405 Method Not Allowed. Otherwise the appropriate method is called and passed all arguments from the url rule used when adding the resource to an Api instance.
When Parthiban creates the subclass Todo from Resource, it’s not immediately clear what specific functionality is coming from that parent class. After fixing some spacing issues, I was able instantiate a Todo class locally and run the get() method, but the put() method returned an error about ‘request’ not being defined. Let’s keep moving.
Realizing that I had forgotten to store my todo.py in a folder called “resources,” I created that so that app.py could import it from one level beneath (the name of the base folder holding app.py and resources doesn’t seem to matter). I ran the contents of app.py in iPython and was able to see the REST API work in my browser:
Well I’m already happy. But this won’t do much for me if I can’t host it somewhere. On to Step 5.
Getting the Rest API onto Heroku
The following text goes into a file called “Procfile” in the root directory of this app (where I can see the resources) folder.
web: gunicorn app:app
The Procfile documentation shows that “web” is the process type and “gunicorn app:app” is the command to be run. It’s probably not a coincidence that our root python program is app.py and the Api class being instantiated is named “api” as an object, but I don’t know for sure.
I only add “gunicorn” and “flask-restful” to my requirements.txt file (in the same level as app.py), and it turns out this will be enough. I also omit the runtime.txt file without consequence.
Starting with the command heroku login, I followed similar steps to Parthiban, but I diverged somewhat. Here are the steps generally laid out:
Log into Heroku via the CLI (heroku login).
Create the app with the Heroku CLI (heroku create) and save both the web address ending in “.git” and the URL.
If the folder you’re working in isn’t already a git repository, make it so with git init.
If you don’t already have a remote that’s set to address copied in the second step, create one now. (This may happen automatically if you run git init before heroku create. I should find out!)
Add all relevant files, commit, and push to your Heroku remote.
Pushing code to the Heroku remote is what kicks off the magic, and I just watched some happen in my Terminal. (It really does feel like magic to me.)
It’s time to test the API. If you forgot to save the URL you can get it with heroku apps:info in the CLI. Add “/todo/1” at the end and see what happens:
Parthiban’s tutorial deserves more than 12 “claps”; people are just missing out. It lays out an easy to follow set of steps for getting started with Python REST APIs hosted internally. Thank you Parthiban!
Combined with the techniques I used in Data Science Apps in the Cloud with Heroku, there’s no reason to think this approach wouldn’t work with packages such as statsmodels. There’s a database featured there as well. I believe I have the elements for incorporating powerful data science functionality in a cheaply hosted website, but we’ll soon see.
Despite the many successes of modern neural network toolkits like TensorFlow, one of the advantages of classical methods like linear mixed models is that they can have different levels of regularization for different subsets of variables. For example, a customer-level factor with thousands of levels would likely benefit from more regularization than a US state-level factor, and linear mixed models estimate those levels of regularization from the data. In a neural networks context, learning multiple penalties using validation sets would be “very expensive to do,” according to Geoff Hinton, co-inventor of back propagation and professor of Neural Networks for Machine Learning, Coursera course from a few years ago.
Professor Hinton’s statement comes from Lecture 9f where he introduces MacKay’s “quick and dirty” method for using empirical Bayes to bypass the validation set in neural network training. The slide from the course describing the method is shown below:
In this article, were going to implement MacKay’s method in TensorFlow 2.0, but considering theory for a moment, the law of total variance gives us a reason for concern. It’s the impetus for this Cross Validated question on why the variance of the predicted random effects fromR’s lme4 isn’t the same as the estimated random effects variance matrix. Though it feels like you’re seeing the actual random effects in lmer’s output, you’re actually seeing the predicted value of the random effect given the response, i.e., for subject-specific random effect and data vector .
From the Law of Total Variance,
which means that if we follow MacKay’s recipe for estimating , we’re going to come up short in estimating the total variance of the weights. Since our goal is effective regularization rather than weight estimation, the question is whether this is good enough.
Using lme4 on the sleepstudy data
Consider the sleepstudy example featured in R’s lme4 package:
lme1 <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), sleepstudy)
Groups Name Variance Std.Dev. Corr
Subject (Intercept) 612.09 24.740
Days 35.07 5.922 0.07
Residual 654.94 25.592
Number of obs: 180, groups: Subject, 18
Estimate Std. Error t value
(Intercept) 251.405 6.825 36.838
Days 10.467 1.546 6.771
308 2.258565 9.1989719
309 -40.398577 -8.6197032
310 -38.960246 -5.4488799
330 23.690498 -4.8143313
331 22.260203 -3.0698946
332 9.039526 -0.2721707
MacKay’s method on sleepstudy
The SleepReg class
The following examples will use the SleepReg class, an ad hoc subclass of tensorflow.Module specifically for implementing maximum likelihood (also GLS) estimation / prediction of fixed and random effects given variances for random effects and model errors. For an explanation of the TensorFlow 2.0 strategy and why inheriting from tf.Module is so important, refer to Multiple Regression in TensorFlow 2.0 using Matrix Notation.
The SleepReg class incorporates a (profiled) maximum likelihood loss of the form:
with tf.GradientTape() as gradient_tape:
y_pred = self._get_expectation(X, Z, self.beta, self.b)
loss = (self._get_sse(y, y_pred) / self.sigmasq_epsilon
+ self._get_neg_log_prior(self.b, V))
This involves the sum of squared errors divided by the error variance plus the likelihood contribution of the latent random effects in _get_neg_log_prior (referred to as a “prior” to reflect the empirical Bayes interpretation). The latter quantity is a weighted sum of squares of the random effects, where the weight matrix V is a block diagonal of the inverse random effects variance matrices.
def _get_neg_log_prior(self, b, V):
"""Get the weight pentalty from the full Gaussian distribution"""
bTV = tf.matmul(tf.transpose(b), V)
bTVb = tf.matmul(bTV, b)
Reproducing lmer’s estimates in TensorFlow
The following shows TensorFlow 2.0 code capable of reproducing both the random effect predictions and fixed effect estimates of lmer, but without the routines to estimate the unknown variances such as REML. You’ll see that the optimization routine matches lmer’s output (to a high degree of accurracy) for both fixed effects estimates and random effects predictions.
The loss function component _get_neg_log_prior in SleepReg uses a block diagonal matrix, V, which is non-diagonal if there are correlations between the random effects. MacKay’s proposed method uses the raw sum of squares of the weights, making for a very clean equation:
While we go through MacKay’s “while not yet bored” loop, we’ll zero out the non-diagonals of V that result from non-zero covariances in the empirical variance matrix of the random effect predictions. What happens if you don’t? I thought it would lead to a slightly less “quick and dirty” version of the algorithm, but the procedure actually bombs after a few iterations. You can see this yourself by commenting out the line with the diag function calls.
sleep_reg.reset_variances(np.array([[410, 10], [10, 22]]),
.25 * np.var(sleep_reg.y))
for i in range(100):
sigmasq_epsilon = sleep_reg.estimate_sigmasq_epsilon()
V = sleep_reg.get_rnd_effs_variance()
V_diag = np.diag(np.diag(V)) # comment out and watch procedure fail
--- last V_diag
[[302.9045408 0. ]
[ 0. 31.08902388]]
--- last sigmasq_epsilon
--- final estimate of fixed effect beta
<tf.Variable 'Variable:0' shape=(2, 1) dtype=float64, numpy=
--- final random effects predictions
0 2.013963 9.147986
1 -32.683526 -9.633964
2 -20.255296 0.459532
3 -10.372529 6.169903
4 3.618080 -2.851007
As foretold by the law of total variances, the random effects variance estimates from MacKay’s method are low, the variance of the random intercepts coming in at right under half of lmer’s estimated variance of 612 at 303. Whether or not it’s a coincidence, the empirically estimated variance of the random slopes was 31, much closer to the lmer estimated value of 35. The poorer random effect predictions led to a slightly larger error variance of 671 vs lmer’s 655, but still relatively close.
Even with the inadequacies in variance estimation, the fixed effects estimates produced by the MacKay method are much closer to lmer’s than to an OLS regression treating subjects as fixed factor levels. The random effect predictions themselves are shrunken down too much but are also quite close for some subjects. The procedure, true to its name, is quick and dirty, but it clearly has some value. I’m curious whether there’s a data-driven to scale up the the empirical weight variances; that also gets into the inherent uncertainty in the weight estimation.
That the procedure breaks down from even a slight deviation from an independent random coefficients model is a mystery to me.
I have a vision of a toolkit with the power of TensorFlow but with the utility of empirical Bayes for estimating hyperparameters. Parts of that vision were explored in this article. Whether or not MacKay’s method will find it into my standard modeling toolkit is yet to be seen, but my curiosity regarding the method is only enhanced by the experiments done here.
I’m excited to announce my very first package on PyPi, datascroller, a Python package for interactive terminal data scrolling. It’s available for Windows as well as *nix systems (thanks to windows-curses), and contributors to the codebase are welcome!
How it works
See the gif below for a glimpse of datascroller in action:
The syntax has changed slightly since the gif was created, but during that demo, I was pressing keys to resize the terminal viewing window and to scroll from left-to-right and up-to-down within a Pandas data frame. Currently the scrolling keys are inspired by vim but later versions will offer customization options.
You can install datascroller with pip using:
pip install datascroller
Try datascroller out in iPython with the following code:
import pandas as pd
from datascroller import scroll
train = pd.read_csv(
Why a terminal data scroller?
Scrolling a through a data set is a fundamental part of exploratory data analysis, and open-source tools let us down in this regard. SAS has had it right for a while. From my memory of around 2001, you could scroll through tens of millions of rows through what must have been a very clever paging strategy. Say what you want about SAS, but honestly no other data viewer has come close.
Moving to R in 2009, I had to accept the loss of SAS’s data set viewer and learn to accept the built-in viewer or just print slices of the data frame in the console. Soon after, I started using RStudio. They offered a nice improvement on the default viewer, but it still couldn’t hold a candle to SAS’s and didn’t handle very large data sets well at the time (to the best of my recollection).
In 2019, RStudio may very well have their data viewer tuned to perfection. But some people prefer working in the terminal, and sometimes you have to (say, a client gives you an ssh login for a particular remote machine). It is possible to hook up notebooks, or use an X-server, but often it’s easier to just print slices of your data sets in the terminal for exploratory analysis. Ehile R’s tibble and Panda’s DataFrame are smart enough to not overwhelm your console with output, they make you work to see the parts of the data that you really need to see.
The datascroller vision
The featured image is a play on the movie “Minority Report” and its very memorable scene with Tom Cruise’s character using the futuristic API to sort through information. I always wanted to move around the data set like that, and I felt that the terminal would be a good place to do it. In 2014, at Google, I took my first crack at this with an internal R package I called “terminalR.” I got helpful feedback from data scientists there, especially Tim Hesterberg. Tim convinced me of the need to implement user configuration options (still a TODO for datascroller!) and also to transition to Emacs/ESS since they came with Emacs Lisp. But, we stopped short of achieving the vision full interactivity.
The terminalR package’s original mechanism was “drumming” on the enter key while you pressed other navigation buttons, as it relied on R’s standard console input methods). With Python offering wrappers for the curses library for both *nix systems and Windows, the interactive “vision” has become a reality.
What’s next for datascroller?
The Python package datascroller, currently for use with Pandas dataframes, will become the tool “datascroller” for general purpose terminal data scrolling. Imagine interactive terminal scrolling of any csv, text, or even JSON file that can be initiated from outside of Python. My past colleague John Merfeld, who makes extensive use of low vision accessibility tools, is on the project and will help consult as to whether certain color schemes (curses offers those) help make the terminal output easier to see, thus giving datascroller an accessibility angle.
Even with TerminalR, I could get around an R data frame pretty fast, fast than any GUI viewer. It has column and row searching functionality from the keyboard, and a lot of movement options. All these options and more are coming to datascroller soon, in full interactive fashion.