Releasing to PyPi

Category : Python

While attempting to explain the Datascroller release process to collaborators John and Kevin over Hangouts, it quickly became apparent that these steps needed to be put in writing. The process below is very manual, tedious, and constructed from a trial and error approach. So take with a grain of salt and leave a comment if you know a better way!

Verifying the code to be released

PyPi is not especially forgiving if you release code you didn’t intend to, so it’s worth a bit of time to double check. After checking out the master branch and pulling changes from the remote, I check the following:

  • Is version in setup.py the right one for the release?
  • If using download_url (which I just learned is not advised), does the suffix align with the version?

After confirming that the right code is in my local repo, the next checks concern the functionality. Datascroller only has one test script right now that is not hooked up to CI/CD, so here are our steps:

  • Create a virtual environment with venv (e.g., python -m venv myenvs/test) and enter it
  • While still in my local repo, perform a developer install using pip: pip install -e .
  • Run any test scripts you have

For Datascroller I like to repeat the steps above on both Windows and Linux.

Uploading to TestPyPi

With PyPi, if you successfully release version 1.2.0 and then realize something is wrong with the code, then that’s too bad because you’ll never be able to release it again. You can release version 1.2.1, but that requires a code change to setup.py and might mess up your plans. That’s where TestPyPi comes in (make an account if you don’t have one). Back in the root of your local repo, while still in the virtual environment created above, perform the following steps:

  • pip install twine (a package especially for publishing to PyPi)
  • python setup.py sdist (creates the “dist” source distribution folder)
  • twine upload --skip-existing --repository-url https://test.pypi.org/legacy/ dist/*

You’ll then get a link to go look at your package on TestPyPi. “See, looks good!” I told collaborators John and Kevin. “So, what exactly looks good?” asked John. He had a point, so I created new virtual environments and ran

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ datascroller

to ensure a key demo worked when installed through TestPyPi. Note in the command above that you have to link to the real PyPi to get third-party packages such as Pandas. This is enabled via the flag --extra-index-url.

If something when wrong on version 1.2.0, then even on TestPyPi, you cannot delete and resubmit version 1.2.0. You would need to change the version in setup.py and then change it back before submitting to PyPi. In these cases I would append a suffix to the current version (e.g., 1.2.0rc1).

Uploading to PyPi

If everything looks good and your account is in good standing on PyPi.org, then you’re ready for the very last step:

  • twine upload --skip-existing dist/*

Getting right with Github

After a successful release to PyPi, the final step is to go to the Releases section of the Datascroller project and draft a new release. This will snapshot the project and provide downloads of the entire codebase in zip and tar.gz form. Since we were using the download_url argument in setup.py, we’d try to match the future name of the tarball that only exists after this step. Provided we really can (or should) leave that argument out next time, that’s one less thing to worry about.

Concluding thoughts

When showing this workflow to John and Kevin, the feedback was, “there has to be a way to automate this.” Creating multiple virtual environments and installations in itself a hassle, so this is something we’ll be investing in soon. Please comment below regarding any weak points in the process above and suggestions to make releasing to PyPi easier and more fun!


On Getting Python Functionality in a Simple Website

Tags :

Category : Infrastructure , Python

I’ll be trying something different with this blog post – logging my efforts while going after a goal. It might not be riveting reading but I’m looking for ways to increase my output in 2020. And Happy New Year!

I’ve been enjoying the very cheap and surprisingly functional web hosting from Siteground, but one way they keep the cost down is by using software over a decade old and denying install privileges on the machines (that I can ssh in 24/7 for less than the price a beer a month still amazes me). You won’t be able to FTP a copy of Miniconda (trust me, I tried it), so you’re stuck with the LAMP framework. But at the same time, you have powerful data science functions written in Python. What do you do?

The rest of this stream-of-consciousness style blog article explores the use of Heroku to solve this problem. Though I do not explicitly show a website requesting output from Python data science modules, I believe all the pieces are in place by the end.

Parthiban’s REST API via Heroku Tutorial

Previously I wrote about deploying data science apps on Heroku, and while I was very impressed with the service and how it let me work with common data science Python dependencies like statsmodels, I never exposed the app to the outside world. Today, I found this great looking article from Parthiban Sudhaman that promises a walk-through of developing a REST API in Python and deploying it on Heroku. Let’s try it out.

Environment

Parthiban recommends creating a virtual environment and installing the following dependencies:

Glancing at my first Heroku article, I realize need to get the Heroku CLI. Given the limitations of Gunicorn, I’ll install this on Windows Subsystem for Linux (WSL). But the default way of installing the Heroku CLI on Ubuntu, Snap, does not work with WSL at the time of writing. This command will install it on WSL:

curl https://cli-assets.heroku.com/install.sh | sh.

Trying the simple REST API

Next, I create my own version of todo.py from Step 4 of the article, and note that the sole import is Resource from the flask_restful module. I learn that the name of the API game here is to inherit from this abstract class Resource, so that your subclasses will be able to do real things with HTTP. Here’s the docstring from the Resource class:

Represents an abstract RESTful resource. Concrete resources should extend from this class and expose methods for each supported HTTP method. If a resource is invoked with an unsupported HTTP method, the API will return a response with status 405 Method Not Allowed. Otherwise the appropriate method is called and passed all arguments from the url rule used when adding the resource to an Api instance.

When Parthiban creates the subclass Todo from Resource, it’s not immediately clear what specific functionality is coming from that parent class. After fixing some spacing issues, I was able instantiate a Todo class locally and run the get() method, but the put() method returned an error about ‘request’ not being defined. Let’s keep moving.

Realizing that I had forgotten to store my todo.py in a folder called “resources,” I created that so that app.py could import it from one level beneath (the name of the base folder holding app.py and resources doesn’t seem to matter). I ran the contents of app.py in iPython and was able to see the REST API work in my browser:

Well I’m already happy. But this won’t do much for me if I can’t host it somewhere. On to Step 5.

Getting the Rest API onto Heroku

The following text goes into a file called “Procfile” in the root directory of this app (where I can see the resources) folder.

web: gunicorn app:app

The Procfile documentation shows that “web” is the process type and “gunicorn app:app” is the command to be run. It’s probably not a coincidence that our root python program is app.py and the Api class being instantiated is named “api” as an object, but I don’t know for sure.

I only add “gunicorn” and “flask-restful” to my requirements.txt file (in the same level as app.py), and it turns out this will be enough. I also omit the runtime.txt file without consequence.

Starting with the command heroku login, I followed similar steps to Parthiban, but I diverged somewhat. Here are the steps generally laid out:

  • Log into Heroku via the CLI (heroku login).
  • Create the app with the Heroku CLI (heroku create) and save both the web address ending in “.git” and the URL.
  • If the folder you’re working in isn’t already a git repository, make it so with git init.
  • If you don’t already have a remote that’s set to address copied in the second step, create one now. (This may happen automatically if you run git init before heroku create. I should find out!)
  • Add all relevant files, commit, and push to your Heroku remote.

Pushing code to the Heroku remote is what kicks off the magic, and I just watched some happen in my Terminal. (It really does feel like magic to me.)

It’s time to test the API. If you forgot to save the URL you can get it with heroku apps:info in the CLI. Add “/todo/1” at the end and see what happens:

Very cool!

Conclusions

Parthiban’s tutorial deserves more than 12 “claps”; people are just missing out. It lays out an easy to follow set of steps for getting started with Python REST APIs hosted internally. Thank you Parthiban!

Combined with the techniques I used in Data Science Apps in the Cloud with Heroku, there’s no reason to think this approach wouldn’t work with packages such as statsmodels. There’s a database featured there as well. I believe I have the elements for incorporating powerful data science functionality in a cheaply hosted website, but we’ll soon see.

To many good tutorials in 2020!


Winding and unwinding multiply differenced time series: A mechanical view

Category : Time Series

Machine Learning practitioners often advise taking first differences of the target time series prior to forecasting. “Stationary data is easier to model and will very likely result in more skillful forecasts,” writes Jason Brownlee in Machine Learning Mastery, who afterwards demonstrates the use of differences to remove trend in a time series. Differences are not limited to y(t) – y(t-1); they can be taken at various lags and stacked iteratively to create a time series that is stable and easy to work with. But after the forecasting work is done, there must be a reconstruction. This post examines the conceptual issues of this reconstruction, using a simple example and only first principles.

Statisticians and econometricians believe that over-differencing is real, so we should enter this investigation vigilant for downsides to differencing. There is no theory here, rather a purely mechanical view of the differencing and reconstitution process when there is a seasonal difference followed by a first difference. But the simplicity of the example is helpful for clarifying the exact steps taken as well as the contributions of the real data versus the forecasted series. This purely mechanical analysis will also make suggestions about forecast quality in the face of repeated differencing.

A simple example

We start with an artificially simple time series y(t):

ty(t)
01
13
26
39
410

Consider the protocol of taking the third difference followed by the first difference. The table of third differences is smaller due to the end effect:

tdiff3 y(t)
38
47

Now taking first differences (here there’s only one to take), we arrive at the final data set for model building:

tdiff1 diff 3 y(t)
4-1

Forget for a moment that you’d have a hard time building a forecasting model with one data point and imagine that [-1] is representative of a larger data set used to build a model. This data set, after all differencing operations, is what we use for building models. If we spend extensive time in the forecasting process, we may get very comfortable with this data set and forget that we are two levels deep in differencing.

The Forecast

Given our single data point’s magical ability to generate a forecasting model, we have forecasted values from time periods past 4, i.e., the unseen future. The known past will be displayed as boldface text.

tdiff1 diff3 y(t)
4-1
53
67
7-2
8-1
91

Proceeding in a last-in-first-out manner, we must undo the first differences before undoing the third differences. Again, we must reach back into our original data set, a level above in its differencing history.

tdiff 3 y(t)
38
47
510
617
715
814
915

To initiate the sequence of unraveling events, we really only needed the value from t = 4, but that data point needed the value at t = 3 to even exist given it is a first difference.

Now we unravel the third differences, which requires us to dig back into the original series for at least the last three observed periods (2 <= t <= 4). Italics will be used to show which third differences are anchored on forecasts rather than real values.

ty(t)
01
13
26
39
410
516
626
725
830
941

Takeaways

Consider only the reconstruction of of the final forecasts from the third differences. For future time periods 5 <= t <= 7, the anchor of the reconstructed forecast (on the original scale) was a real data point from the original series, with no uncertainty. Contrast that with t = 8, a forecast which must be anchored on a previous forecast. Every three time points (i.e., the number of time points used for this level of differencing), there will be an additional level of uncertainly as forecasts are added to another level of forecasts.

After the third difference, there was the first difference, meaning that all forecast values available at the last stage of reconstruction where themselves reconstructions. Being a first difference, all but one value (at t = 5) were anchored upon other forecasts, and the level of stacking forecast upon forecast increases with every additional time increment. The forecast at t = 5 is the only one anchored on real data points for both the first and the third difference.

Differencing may be a way to tame unweildy time series, but there is a cost. The alternatives may not be superior; achieving stationarity using a structural trend and seasonality removal trades compounding forecast uncertainty for the uncertainty in extrapolation of these structural element into the future.

One point is clear: if you have to difference, a larger differencing lag provides more time before the compounding effects of forecasts-upon-forecasts start to pile up. If that larger differencing lag lines up with real seasonal effects, then you’ve got both trend and seasonality removal alongside a (potentially substantial) grace period before the uncertainty begins to compound.


Linear Mixed Models in Tensorflow 2.0 via MacKay’s method

Despite the many successes of modern neural network toolkits like TensorFlow, one of the advantages of classical methods like linear mixed models is that they can have different levels of regularization for different subsets of variables. For example, a customer-level factor with thousands of levels would likely benefit from more regularization than a US state-level factor, and linear mixed models estimate those levels of regularization from the data. In a neural networks context, learning multiple penalties using validation sets would be “very expensive to do,” according to Geoff Hinton, co-inventor of back propagation and professor of Neural Networks for Machine Learning, Coursera course from a few years ago. 

Professor Hinton’s statement comes from Lecture 9f  where he introduces MacKay’s “quick and dirty” method for using empirical Bayes to bypass the validation set in neural network training. The slide from the course describing the method is shown below:


In this article, were going to implement MacKay’s method in TensorFlow 2.0, but  considering theory for a moment, the law of total variance gives us a reason for concern. It’s the impetus for this Cross Validated question on why the variance of the predicted random effects fromR’s  lme4 isn’t the same as the estimated random effects variance matrix. Though it feels like you’re seeing the actual random effects in lmer’s output, you’re actually seeing the predicted value of the random effect given the response, i.e., \text{E}(b_i \vert \mathbf{y}_i) for subject-specific random effect b_i and data vector \mathbf{y}_i.


From the Law of Total Variance,

    \[\text{Var}(b_i) = \text{E}(\text{Var}(b_i \vert \mathbf{y}_i)) + \text{Var}(\text{E}(b_i \vert \mathbf{y}_i)),\]


which means that if we follow MacKay’s recipe for estimating \text{Var}(b_i), we’re going to come up short in estimating the total variance of the weights. Since our goal is effective regularization rather than weight estimation, the question is whether this is good enough.

Using lme4 on the sleepstudy data

Consider the sleepstudy example featured in R’s lme4 package:

library(lme4)
lme1 <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), sleepstudy)
summary(lme1)
head(ranef(fm1)[[1]])
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.09   24.740
          Days         35.07    5.922   0.07
 Residual             654.94   25.592
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825  36.838
Days          10.467      1.546   6.771

> head(ranef(fm1)[[1]])
    (Intercept)       Days
308    2.258565  9.1989719
309  -40.398577 -8.6197032
310  -38.960246 -5.4488799
330   23.690498 -4.8143313
331   22.260203 -3.0698946
332    9.039526 -0.2721707

MacKay’s method on sleepstudy

The SleepReg class

The following examples will use the SleepReg class, an ad hoc subclass of tensorflow.Module specifically for implementing maximum likelihood (also GLS) estimation / prediction of fixed and random effects given variances for random effects and model errors. For an explanation of the TensorFlow 2.0 strategy and why inheriting from tf.Module is so important, refer to Multiple Regression in TensorFlow 2.0 using Matrix Notation.

The SleepReg class incorporates a (profiled) maximum likelihood loss of the form:

with tf.GradientTape() as gradient_tape:
    y_pred = self._get_expectation(X, Z, self.beta, self.b) 
    loss = (self._get_sse(y, y_pred) / self.sigmasq_epsilon
            + self._get_neg_log_prior(self.b, V))

This involves the sum of squared errors divided by the error variance plus the likelihood contribution of the latent random effects in _get_neg_log_prior (referred to as a “prior” to reflect the empirical Bayes interpretation). The latter quantity is a weighted sum of squares of the random effects, where the weight matrix V is a block diagonal of the inverse random effects variance matrices.

@tf.function
def _get_neg_log_prior(self, b, V):
    """Get the weight pentalty from the full Gaussian distribution"""
    bTV = tf.matmul(tf.transpose(b), V)                                                              
    bTVb = tf.matmul(bTV, b)
    return tf.squeeze(bTVb)

Reproducing lmer’s estimates in TensorFlow

The following shows TensorFlow 2.0 code capable of reproducing both the random effect predictions and fixed effect estimates of lmer, but without the routines to estimate the unknown variances such as REML. You’ll see that the optimization routine matches lmer’s output (to a high degree of accurracy) for both fixed effects estimates and random effects predictions.

from sleepstudy import SleepReg
import numpy as np

sleep_reg = SleepReg("/mnt/c/devl/data/sleepstudy.csv")

# Replicate lme4's result
off_diag = 24.7404 * 5.9221 * 0.066
lmer_vcov = np.array([[24.7404 ** 2, off_diag],
                      [off_diag, 5.9221 ** 2]])

sleep_reg.reset_variances(lmer_vcov, 25.5918 ** 2)

sleep_reg.train()
sleep_reg.set_optimizer(adam=True)
sleep_reg.train(epochs=300)

print(sleep_reg.beta)
print(sleep_reg.get_random_effects().head())
<tf.Variable 'Variable:0' shape=(2, 1) dtype=float64, numpy=
array([[251.40510486],
       [ 10.46728596]])>
          mu         b
0   2.262934  9.198305
1 -40.399556 -8.619793
2 -25.207478  1.172853
3 -13.065620  6.613451
4   4.575970 -3.014939

Implementing MacKay’s method

The loss function component _get_neg_log_prior in SleepReg uses a block diagonal matrix, V, which is non-diagonal if there are correlations between the random effects. MacKay’s proposed method uses the raw sum of squares of the weights, making for a very clean equation:

Lecture 9 slide describing Bayesian weight decay from Geoff Hinton’s course

While we go through MacKay’s “while not yet bored” loop, we’ll zero out the non-diagonals of V that result from non-zero covariances in the empirical variance matrix of the random effect predictions. What happens if you don’t? I thought it would lead to a slightly less “quick and dirty” version of the algorithm, but the procedure actually bombs after a few iterations. You can see this yourself by commenting out the line with the diag function calls.

sleep_reg.zero_coefficients()
sleep_reg.reset_variances(np.array([[410, 10], [10, 22]]),
                         .25 * np.var(sleep_reg.y))
sleep_reg.set_optimizer(adam=False)

for i in range(100):
    sleep_reg.train(display_beta=False)
    
    sigmasq_epsilon = sleep_reg.estimate_sigmasq_epsilon()
       
    V = sleep_reg.get_rnd_effs_variance()
    V_diag = np.diag(np.diag(V)) # comment out and watch procedure fail

    sleep_reg.reset_variances(V_diag, sigmasq_epsilon)

    print(V_diag)
    print(sigmasq_epsilon)

print(sleep_reg.beta)
print(sleep_reg.get_random_effects().head())
--- last V_diag
[[302.9045408    0.        ]
 [  0.          31.08902388]]
--- last sigmasq_epsilon
[670.8546961]
--- final estimate of fixed effect beta
<tf.Variable 'Variable:0' shape=(2, 1) dtype=float64, numpy=
array([[251.40510485],
       [ 10.46728596]])>
--- final random effects predictions
          mu         b
0   2.013963  9.147986
1 -32.683526 -9.633964
2 -20.255296  0.459532
3 -10.372529  6.169903
4   3.618080 -2.851007

Discussion

As foretold by the law of total variances, the random effects variance estimates from MacKay’s method are low, the variance of the random intercepts coming in at right under half of lmer’s estimated variance of 612 at 303. Whether or not it’s a coincidence, the empirically estimated variance of the random slopes was 31, much closer to the lmer estimated value of 35. The poorer random effect predictions led to a slightly larger error variance of 671 vs lmer’s 655, but still relatively close.

Even with the inadequacies in variance estimation, the fixed effects estimates produced by the MacKay method are much closer to lmer’s than to an OLS regression treating subjects as fixed factor levels. The random effect predictions themselves are shrunken down too much but are also quite close for some subjects. The procedure, true to its name, is quick and dirty, but it clearly has some value. I’m curious whether there’s a data-driven to scale up the the empirical weight variances; that also gets into the inherent uncertainty in the weight estimation.

That the procedure breaks down from even a slight deviation from an independent random coefficients model is a mystery to me. 

I have a vision of a toolkit with the power of TensorFlow but with the utility of empirical Bayes for estimating hyperparameters. Parts of that vision were explored in this article. Whether or not MacKay’s method will find it into my standard modeling toolkit is yet to be seen, but my curiosity regarding the method is only enhanced by the experiments done here.


Introducing “datascroller” for fast terminal data frame scrolling

Category : Python , Tools

I’m excited to announce my very first package on PyPi, datascroller, a Python package for interactive terminal data scrolling. It’s available for Windows as well as *nix systems (thanks to windows-curses), and contributors to the codebase are welcome!

How it works

See the gif below for a glimpse of datascroller in action:

datascroller allows terminal datascrolling

The syntax has changed slightly since the gif was created, but during that demo, I was pressing keys to resize the terminal viewing window and to scroll from left-to-right and up-to-down within a Pandas data frame. Currently the scrolling keys are inspired by vim but later versions will offer customization options.

You can install datascroller with pip using:

pip install datascroller

Try datascroller out in iPython with the following code:

import pandas as pd
from datascroller import scroll

train = pd.read_csv(
    'https://raw.githubusercontent.com/datasets/house-prices-uk/master/data/data.csv')

scroll(train)

Why a terminal data scroller?

Scrolling a through a data set is a fundamental part of exploratory data analysis, and open-source tools let us down in this regard. SAS has had it right for a while. From my memory of around 2001, you could scroll through tens of millions of rows through what must have been a very clever paging strategy. Say what you want about SAS, but honestly no other data viewer has come close.

Moving to R in 2009, I had to accept the loss of SAS’s data set viewer and learn to accept the built-in viewer or just print slices of the data frame in the console. Soon after, I started using RStudio. They offered a nice improvement on the default viewer, but it still couldn’t hold a candle to SAS’s and didn’t handle very large data sets well at the time (to the best of my recollection).

In 2019, RStudio may very well have their data viewer tuned to perfection. But some people prefer working in the terminal, and sometimes you have to (say, a client gives you an ssh login for a particular remote machine). It is possible to hook up notebooks, or use an X-server, but often it’s easier to just print slices of your data sets in the terminal for exploratory analysis. Ehile R’s tibble and Panda’s DataFrame are smart enough to not overwhelm your console with output, they make you work to see the parts of the data that you really need to see.

The datascroller vision

The featured image is a play on the movie “Minority Report” and its very memorable scene with Tom Cruise’s character using the futuristic API to sort through information. I always wanted to move around the data set like that, and I felt that the terminal would be a good place to do it. In 2014, at Google, I took my first crack at this with an internal R package I called “terminalR.” I got helpful feedback from data scientists there, especially Tim Hesterberg. Tim convinced me of the need to implement user configuration options (still a TODO for datascroller!) and also to transition to Emacs/ESS since they came with Emacs Lisp. But, we stopped short of achieving the vision full interactivity.

The terminalR package’s original mechanism was “drumming” on the enter key while you pressed other navigation buttons, as it relied on R’s standard console input methods). With Python offering wrappers for the curses library for both *nix systems and Windows, the interactive “vision” has become a reality.

What’s next for datascroller?

The Python package datascroller, currently for use with Pandas dataframes, will become the tool “datascroller” for general purpose terminal data scrolling. Imagine interactive terminal scrolling of any csv, text, or even JSON file that can be initiated from outside of Python. My past colleague John Merfeld, who makes extensive use of low vision accessibility tools, is on the project and will help consult as to whether certain color schemes (curses offers those) help make the terminal output easier to see, thus giving datascroller an accessibility angle.

Even with TerminalR, I could get around an R data frame pretty fast, fast than any GUI viewer. It has column and row searching functionality from the keyboard, and a lot of movement options. All these options and more are coming to datascroller soon, in full interactive fashion.

I have big plans for this tool.


Multiple Regression in TensorFlow 2.0 using Matrix Notation

Category : Tools

While the two were always friendly, TensorFlow has fully embraced the Keras API in version 2.0, making it THE high-level API and further tightening its integration into the platform’s core. From tutorials, videos, press releases, the message is resounding: use the Keras API, unless you absolutely, positively, can’t.

All signs point to the Keras API as being a world class API, but it is a neural networks API. And while many statistical models can be framed as neural networks, there is another API that some prefer: the “matrix algebra API.” The good news is that TensorFlow 2.0’s new eager execution defaults mean that working with linear models in matrix form is easier than ever. Once you know a few caveats, it doesn’t feel so different than working in numpy. And this is great news!

In this post, we’re going to do multiple regression using Fisher’s Iris data set, regressing Sepal Length on Sepal Width and Petal Length (for no particular scientific reason) using TensorFlow 2.0. Yes, there is an official linear regression tutorial for TensorFlow 2.0, but it does not feature the matrix calculations (or explain the caveats) that this article will.

In matrix notation, we’ll be fitting the following model:


    \[\left[\begin{matrix} y_1 \\y_2 \\\ldots \\y_{150}\end{matrix}\right] = \left[\begin{matrix} 1 & x_1 & z_1 \\1 & x_2 & z_2 \\\ldots \\1 & x_{150} & z_{150}\end{matrix}\right] \left[\begin{matrix} \beta_0 \\\beta_1 \\\beta_2\end{matrix}\right] + \left[\begin{matrix} \epsilon_1 \\\epsilon_2 \\\ldots \\\epsilon_{150}\end{matrix}\right],\]


Where y is Sepal Length, x is Sepal Width, z is Petal Length, and \epsilon_1, \ldots \epsilon_{150} are i.i.d. N(0, \sigma^2).

Let’s make this regression happen in Python using the statsmodels module:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Part 1: OLS Regression on two predictors using statsmodels
iris_df = sm.datasets.get_rdataset('iris').data
iris_df.columns = [name.replace('.', '_') for name in iris_df.columns]
reg_model = smf.ols(formula='Sepal_Length ~ Sepal_Width + Petal_Length',
                    data=iris_df)
fitted_model = reg_model.fit()
fitted_model.summary()

This gives us the (partial) output:

===================================================================
                   coef    std err          t      P>|t|
-------------------------------------------------------------------
Intercept        2.2491      0.248      9.070      0.000
Sepal_Width      0.5955      0.069      8.590      0.000
Petal_Length     0.4719      0.017     27.569      0.000
===================================================================

Now let’s spin up TensorFlow and convert our matrices and vectors into “tensors”:

import tensorflow as tf
import patsy                                                                                                                                                            
X_matrix = patsy.dmatrix('1 + Sepal_Width + Petal_Length', data=iris_df)

X = tf.constant(X_matrix)
y = tf.constant(iris_df.Sepal_Length.values.reshape((150, 1)))

We’re using constant tensors for our data vectors, and everything looks pretty straightforward here, but there is a spike-filled trap that we just stepped over. Caveat #1 is: if you don’t reshape your vector y into an actual column vector, the following code will run but lead to incorrect estimates. The fit leads to basically an intercept-only model, so that must mean that somehow the ordering is compromised unless there are actually two dimensions.

The next thing we’ll do is to create our variable tensor that will hold our regression weights. You could directly create a TensorFlow variable, but don’t. Instead, subclass from tf.Module:

class IrisReg(tf.Module):
    def __init__(self, starting_vector = [[0.0], [0.0], [0.0]]):
        self.beta = tf.Variable(starting_vector, dtype=tf.float64)

irisreg = IrisReg()

I don’t love this, as it feels bureaucratic and I’d rather just work with a variable called “beta.” But you really need to do this unless you want to roll your own gradient descent. Caveat #2 is not to bypass subclassing from tf.Module or else you will struggle with your optimizer’s .apply_gradients method. By subclassing from tf.Module, you get a property trainable_variables, that you can treat like the parameter vector but it is also iterable.

The matrix math for the prediction is a little anticlimactic:

@tf.function
def predict(X, beta):
    return tf.matmul(X, beta)

and the @tf.function decorator is optional for a performance benefit. The OLS regression loss is so simple that it’s also worth defining it explicitly (and I did have some trouble with the built-in losses, for full transparency):

@tf.function
def get_loss(observed, predicted):
    return tf.reduce_mean(tf.square(observed - predicted))

While the loss function can be easily coded from scratch, there are too many benefits of using a built-in optimizer, like built-in momentum for gradient decent.

sgd_optimizer = tf.optimizers.SGD(learning_rate=.01, momentum=.98)

The rest of the training is presented in the following loop:

for epoch in range(1000):

    with tf.GradientTape() as gradient_tape:
        y_pred = predict(X, irisreg.trainable_variables)
        loss = get_loss(y, y_pred)

    gradient = gradient_tape.gradient(loss, irisreg.trainable_variables)
    sgd_optimizer.apply_gradients(zip(gradient,
                                      irisreg.trainable_variables))

    print(irisreg.trainable_variables)

With eager execution enabled by default in TensorFlow 2.0, running “gradient tape” through the “forward pass” (i.e. prediction) is necessary to get the gradients. Notice that the trainable_variables property is used in place of the parameter vector in all situations. You could get away with a plain variable for every step until the optimizer’s apply_gradients method, and mixing and matching was causing trouble as well.

(<tf.Variable 'Variable:0' shape=(3, 1) dtype=float64, numpy=
array([[2.24920008],
       [0.59551132],
       [0.47195184]])>,)

It’s not the most sophisticated training loop, but starting from an awful choice of starting vector, the procedure quickly converges to the OLS regression estimates. The loss function is easy to alter to create a Ridge Regression or LASSO procedure. And being in the TensorFlow ecosystem means that these techniques would scale to big datasets, be easily ported to JavaScript using TensorFlow.js, and made available to the TensorBoard debugging utilities.

It’s not just neural network enthusiasts who can gain from TensorFlow. Statisticians and other Data Scientists who prefer matrix manipulation can now really enjoy using TensorFlow thanks to the very cool eager enhancements in TensorFlow 2.0.


Using the WordPress Template Hierarchy to Improve a Resume Plugin

Every Sole Proprietor’s website needs a resume, and given the expansiveness of the WordPress plugin ecosystem, I expected to see a half dozen to choose from. In reality, there’s only one with any amount of attention, Resume Builder from Boxy Studio.

I’m still working on the resume content, but the auto-formatted layout is not bad. The “star ratings” for skills are maybe a little corny but fun.

The plugin adds a “Resumes” section to your admin console screen, and when you add a new resume you’re treated to a very structured layout.

Don’t expect to drag and drop sections like they’re blocks in a WordPress editor. The plug in works, but doesn’t have that level of polish.

A resume that looks like a blog entry

The problem I encountered was that the resume looked like this:

Ridiculous! One copy of my mug per page is plenty. You can see from top left corner of the image that there’s a bit of WordPress debugging going on (live on the website, of course). I just couldn’t figure out why it was displaying like a blog article.

Well, if it looks like a duck… OK, when I’m editing my resume, the URL ends with post.php?post=75&action=edit. So resumes are just posts with a hard coded ID of 75. This explains why the shortcodes have 75 in them (e.g. rb-resume id="75" section="intro"). That left me with a dilemma, because I want at least the date on the blog articles (not sure about the mug), but I don’t want it on the resume.

Working harder, I echoed the post type in single.php by adding the following after the header:

<?php echo get_post_type(); ?>

When I clicked on the link for an regular blog article (which you have to do to trigger single.php. Going to the Blog link isn’t enough!), I’d see “post” echoed on the screen. But when I clicked on the resume, I saw “rb_resume.” Resume Builder is using a post type of “rb_resume.”

A Single.php for different post types

To solve the problem, I copied my single.php file into a new file called single-rb_resume.php. Since single.php calls another template part, I had to copy that file as well. And then I started hacking, and more in the machete sense than the computer programming one. Finally I was able to pull away the parts that made an individual blog post look like a blog post, and I arrived at the following:

Of course, instead of actually finishing the resume, I left satisfied with solving the templating problem; it’s still a work in progress.

The template hierarchy is a pretty scary part of WordPress. I can’t even begin to say I understand it, but I know that it’s there, and this situation has proven that exploiting the hierarchy can be a very powerful technique.


Data Science Apps in the Cloud with Heroku

Category : Infrastructure

I recently had an opportunity to work with Heroku, a platform-as-a-service for deploying and running apps, for deploying Python-based data science applications in the cloud. At first, I didn’t understand why the engagement wasn’t just using AWS, since data science related instances abound on the EC2 marketplace. What I learned however is that AWS can be a money pit for businesses without a dedicated IT team. It is a complex beast that requires competent professionals to tame. Heroku, on the other hand, just seems to just work.

In this article, I’ll go through the basics of creating a Heroku application that at least loads popular data science dependencies in python. In later articles I may take the example to the end, where I load the Iris data set, run a regression on it using the statsmodels package, and write the results into a database on Heroku. All of this can be run using Heroku’s very simple free scheduler.

Preliminaries

To get started, create a free Heroku account at www.heroku.com, install the Heroku CLI, and run the following commands in a bash shell (Windows 10 users are encouraged to use the Ubuntu 18.04 app):

git clone https://github.com/baogorek/HerokuExample.git
cd HerokuExample
heroku login
heroku create

After cloning in the first line, the second line changes directories to the folder that contains the Heroku application, and the third line opens a browser window to log into your Heroku account. Finally, heroku create registers the app with the service. The output of that line is:

Creating app... done, ⬢ mysterious-badlands-45487                              https://mysterious-badlands-45487.herokuapp.com/ | https://git.heroku.com/mysterious-badlands-45487.git

which shows us that it is given a name, a URL, and its own git repository. If you navigate to the URL, you get a default welcome screen, but we won’t be building a web app in this article.

The git repository is interesting, because it seemed like we already had one. But this is a git remote hosted by Heroku itself, and it’s a big part of their deployment strategy. If I run a git remote -v, I can see it:

heroku  https://git.heroku.com/mysterious-badlands-45487.git (fetch)
heroku  https://git.heroku.com/mysterious-badlands-45487.git (push)
origin  https://github.com/baogorek/HerokuExample.git (fetch)
origin  https://github.com/baogorek/HerokuExample.git (push)

Even though I haven’t added anything new through git, I can deploy the app that I have through pushing to the heroku remote:

git push heroku master

Just that simple command sets off a lot of activity. Here is a selection of the output:

remote: Compressing source files... done.                                      remote: Building source:                                                       remote:                                                                        remote: -----> Python app detected                                             remote: -----> Installing python-3.6.8    
...
remote:        Installing collected packages: numpy, scipy, six, python-dateutil, pytz, pandas, patsy, cython, statsmodels, psycopg2-binary   
...
remote: -----> Launching...                                                    remote:        Released v3                                                     remote:        https://mysterious-badlands-45487.herokuapp.com/ deployed to Heroku  

Pushing to the “heroku” remote triggered the build of a Python application with data science dependencies such as numpy, scipy, pandas, and statsmodels. We see at the end that the app was “deployed.”

Testing it out

Since Heroku is based on containers, one quick way to test that our app has the data science dependencies that we think it does is to spin up a local container. We can do that with:

heroku run bash
python

In Python 3.6.8 within our local Heroku container, we can import a few packages just to make sure.

import statsmodels
import pandas

If you didn’t get an error, then your cloud-deployed Heroku app has these data science dependencies installed. Good!

Getting the dependencies

Exiting out of the local Heroku container, look inside the requirements.txt file:

cat requirements.txt

You’ll see a very modest text file with the following lines:

numpy
scipy==1.2
pandas
patsy
cython
statsmodels
psycopg2-binary

I specified scipy to be exactly version 1.2 based on advice from a post on a problem I was having, but otherwise these are the minimum dependencies specified by statsmodels.

Why not conda?

There are some Heroku “buildpacks” for conda online, but many of them are years old and not well maintained. Using the requirements.txt file was a breeze, and I didn’t see a reason to struggle with getting conda to work. But it clearly is possible.

Running jobs

If loading statsmodels and pandas in a local Heroku container didn’t send your pulse above 100, it’s not you. But we’re actually not too far away from making our Heroku app do things. One way to actually have your app act is to utilize the text file called Procfile (no “.txt” extension). If you look inside the Procfile for this app, it is completely blank.

Instead, I used the Heroku scheduler add on to run a file, like HerokuExample/herokuexample/run_glm.py. You can see how easy it is to set up by looking at the following screenshot:

Since you could potentially spin up some serious computing resources using the Heroku scheduler add on, you do need your credit card to enable it.

Running a script

Just so we see some output in this article, add the following line to the Procfile:

release: python herokuexample/say_hello.py

Add the file to git staging, commit it, and then push to the heroku remote:

git add Procfile
git commit -m "Updating Procfile"
git push heroku master

Among the output lines you will find:

remote: Verifying deploy... done.
remote: Running release command...
remote:
remote: I loaded statsmodels

Indeed it did.

Next steps

To really do something interesting without a full blown web app, we really need a database. Fortunately, Heroku has powerful database add ons that can complete the picture of a useful data science application that runs in the cloud. Leave a comment if you want to hear about Heroku databases in conjunction with data science apps, and I’ll add it to the queue.


Getting a business bank account for a sole prop with a DBA

Category : The Journey

After applying for a DBA, I was confused whether or not I had actually received one, given the WI Register of Deeds sent back my filled out Registration of Firm Names form with a computer generated stamp on the top right corner. Surely that was just a receipt of payment, and something called a “Certificate of Assumed Name” was coming in the mail, right?

At least, that’s the idea I got based on interactions with “Bank A” (no reason to name names), my bank of over 10 years. Banker 1 from Bank A told me that I could only open a sole proprietor in my own name. I told him that I wanted a different name on the account, “Ogorek Data Sciences,” and that I was pursuing a Doing Business As (DBA) in order to do that. He seemed unsure but told me he could help me when “the paperwork” arrived. When the Registration of Firm Names form came back stamped, I sent Banker 1 a picture of the form. He responded that he absolutely needed the Certificate of Assumed Name to open the account.

So I waited, and watched the mailbox, and waited…

After a month of waiting, I went down to the Register of Deeds in person and asked what was going on. “We don’t give out certificates!” the woman behind (what I’m pretty sure was bulletproof) glass told me. “This is all you should need to open a bank account” and she gave be a printed Wisconsin.gov web page to prove it. Not convinced that Banker 1 knew the laws of the land, I made an appointment with Banker 2 (still of Bank A), armed with newfound confidence in my form and a printed out government web page with the exact instructions that I had followed. These instructions were clear.

Unfortunately, Banker 2 didn’t know much more than Banker 1. Again, she wanted a Certificate of Assumed Name, and I told her that I had fulfilled the requirements for the DBA according to the Wisconsin Register of Deeds and showed her the Wisconsin.gov printout. She still didn’t buy it and sent a photocopy of my form off to business documents review, which was supposed to take 24-48 hours.

Some 60 hours later, without a verdict, I went to “Bank B.” To Bank B’s credit, in an hour I had a business bank account, but a few things still bothered me. First, this third banker (“Banker 3”!) claimed that my Registration of Firm Names form was completely unnecessary, or at least he had been opening up accounts without them. He also told me that he didn’t think I could use an EIN for a sole prop (which I got a few weeks ago), but then he looked it up and realized I could use the EIN. “Hey, I learned something new!” Glad I could help.

To be fair to the first two bankers from Bank A, they were actually right about the account having to be opened in my name, but what they didn’t tell me was that I could still deposit checks written to the DBA name and the DBA name could appear on that account’s checks. If they did know this, then I suppose some blame lies with me for insisting on the account being opened in the DBA name. But come on, why else would I be so adamant about an account name?

And, to be fair to Banker 3, it turns out that the printed out Wisconsin.gov web page – that the Wisconsin Register of Deeds gave me in person – was from March of 2013. Maybe registering the Sole Prop is no longer required to open a bank account. And at least Banker 3 knew enough to convince me that I did indeed need to open the account in my own name, but with a DBA name attached to the account. (Seeing him type “Ogorek Data Sciences” after the “DBA” field in the application form, which made me feel better.)

It doesn’t seem like it should have been so hard. But, I’ve got a business bank account set up with a DBA name (“Ogorek Data Sciences”) and using an EIN. Mission accomplished.


Getting Health Insurance as a Freelancer

Category : The Journey

  • The Individual Mandate is no longer in effect as of 2019
  • Clicking the sponsored health care links on Google is bad
  • Individual health insurance that protects your wealth is inexpensive

Disclaimer: This author claims no special experience or knowledge of health insurance, other than the experience of losing corporate provided health insurance and having to buy new insurance (and screwing up a little along the way).

When you talk to someone who likes the idea of leaving their corporate job, two times out of three the person will use health insurance as the reason they can’t do it. When I did it this year, I’ll admit that I was worried about it myself, and back in 2015 when I took some time off, I paid around $300 a month for Affordable Care Act (ACA) qualifying health insurance that basically did nothing for me beyond wealth protection in the case of a really big bill. And it also let me avoid the individual mandate penalty on 2015 taxes.

Fast forward to June 1st, 2019, the first date I would lose corporate coverage, I was prepared to eat the $300 (or more) per month once again. Since I was losing coverage, I went on www.HealthCare.gov to try to apply during the enrollment off season. First, I wasted a lot of time by answering questions that could lead to discounts (they won’t for any remotely comfortable income, even if earned earlier in the year). Then, over a period of five days, the site was down and couldn’t even process my application. I thought I might have to eat one month of COBRA insurance for over $600 from my last job. Then I realized that the individual mandate is gone, at least at the federal level.

This is due to the Tax Cuts and Jobs act that was passed in December 2017 and eliminated the individual mandate penalty, effective January 1, 2019. It’d be a good idea to search for any law particular to your state, but nothing stood out as too powerful to me. It’s gone.

I have nothing to say about policy, or whether this is good or bad for the country, but I won’t buy expensive health insurance as a freelancer right after losing most of my income, especially when the site to do it isn’t even working.

After realizing I could just buy insurance from anyone, I got in a hurry and searched Google for “buying health insurance online.” Thinking that the top site was some kind of Kayak-like engine for comparison shopping health insurance, it clicked it. Do not do this. I will not write down the link out of fear that I will help these sites’ rankings, but I will show you a picture of what not to click:

The first website actually looked quite clean and professional, and made it easy for me to answer the few, very reasonable questions. I clicked submit.

Within, and I’m not kidding, 5 seconds, my phone started to ring. That’s when I knew I screwed up. Over the next three weeks, I got hundreds of phone calls and dozens of voicemails and emails. Telling them to stop doesn’t work:

It was pretty bad, but in the end it was only digital communications and not impossible to ignore (if you physically turned a few things off). Still, don’t do it.

In the end, I went to a website of a name I’ve seen before, United Health One, who’s offering non-ACA compliant “short-term” insurance. At first I didn’t understand what that meant, but when you’re buying the insurance you can specify different term lengths, and that seems to be about it. These policies are not ACA-compliant because they do almost nothing for you unless you’re out a lot of money (explaining the term “junk insurance”). But that’s okay for me.

What I really want is wealth protection in case something really bad happens and I owe a hospital $700,000, and thus I went for the highest maximum benefit of $2,000,000. The deductible situation is a little complicated because there are two of them. Under the first one the company will not pay anything, and in the sweet middle ground between the first deductible and second, they will pay some. I wasn’t that interested in that middle spot so I minimized the benefits there.

The UnitedHealthOne website (www.uhone.com) almost felt like one of those old school restaurants that offered so many million menu items by sheer combinatorics of the options, which I didn’t like so much. And it annoyed me a little bit that the company name changed from UnitedHealthOne to “Golden Rule” when it was time to buy the policy. But it was relatively easy to make the final purchase, I felt I got what I needed, and it was under $100 a month.

I’ll repeat the disclaimer: I have no idea what I’m doing. Maybe I do rack up $700,000 of medical expenses and then $400,000 of those expenses are denied by “Golden Rule.” Who knows? But I do know one thing: if you don’t click on one of those links I showed you, you can buy health insurance at a very reasonable rate without getting attacked by vultures. Sometimes, that’s all you can ask for.