Linear Mixed Models in TensorFlow 2.0 via MacKay’s method

Despite the many successes of modern neural network toolkits like TensorFlow, one of the advantages of old-school tools like R’s lme4 is that a linear model can have different levels of regularization for subsets of variables. A user-level factor with thousands of levels would likely benefit from more regularization than a US state-level factor, for instance, and linear mixed models estimate those levels of regularization directly from the data. In a neural networks context, according to Geoff Hinton, learning multiple penalties using validation sets would be “very expensive to do.”

Professor Hinton’s statement comes from Lecture 9f of Neural Networks for Machine Learning, where he introduces MacKay’s “quick and dirty” method for using empirical Bayes to bypass the validation set when training neural networks. The slide from the course describing the method is shown below:

Lecture 9 slide describing MacKay’s method from Geoff Hinton’s course


In this article, we’re going to implement MacKay’s method in TensorFlow 2.0, using linear mixed models as an example. However, the law of total variance gives the classical statistician some reason for concern. It’s the impetus for this Cross Validated question on why the variance of the predicted random effects from lme4 isn’t the same as the estimated random effects variance matrix. Though it feels like you’re seeing the actual random effects in lmer’s output, you’re actually seeing the predicted value of the random effect given the response, i.e., \text{E}(b_i \vert \mathbf{y}_i) for subject-specific random effect b_i and data vector \mathbf{y}_i.


From the Law of Total Variance,

    \[\text{Var}(b_i) = \text{E}(\text{Var}(b_i \vert \mathbf{y}_i)) + \text{Var}(\text{E}(b_i \vert \mathbf{y}_i)),\]


which means that if we follow MacKay’s recipe for estimating \text{Var}(b_i) by taking the empirical variance of the predictions \text{E}(b_i \vert \mathbf{y}_i), we’re going to come up short: that empirical variance only captures the second term on the right-hand side. But our goal is effective regularization rather than publication of random effects variance estimates in Nature. So let’s give it a try.
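
To make that gap concrete, here is a small simulation sketch of my own (not part of the original analysis) for a random-intercept-only setting with zero fixed effects, where the prediction takes the textbook shrinkage form \text{E}(b_i \vert \bar{y}_i) = \tau^2 / (\tau^2 + \sigma^2/n) \cdot \bar{y}_i; the variance values are arbitrary, just roughly sleepstudy-sized:

import numpy as np

rng = np.random.default_rng(0)
tau2, sigma2, n_per, n_subj = 600.0, 650.0, 10, 100000   # arbitrary illustration values

b = rng.normal(0.0, np.sqrt(tau2), n_subj)                   # true random intercepts
ybar = b + rng.normal(0.0, np.sqrt(sigma2 / n_per), n_subj)  # each subject's mean response

shrinkage = tau2 / (tau2 + sigma2 / n_per)                   # weight on the data in E(b_i | ybar_i)
b_pred = shrinkage * ybar                                    # predicted random effects

print(np.var(b))        # close to tau2 = 600: the true Var(b_i)
print(np.var(b_pred))   # close to shrinkage * tau2, about 541: only Var(E(b_i | ybar_i))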

Using lme4 on the sleepstudy data

Consider the sleepstudy example featured in R’s lme4 package:

library(lme4)
fm1 <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), sleepstudy)
summary(fm1)
head(ranef(fm1)[[1]])
Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 612.09   24.740
          Days         35.07    5.922   0.07
 Residual             654.94   25.592
Number of obs: 180, groups:  Subject, 18

Fixed effects:
            Estimate Std. Error t value
(Intercept)  251.405      6.825  36.838
Days          10.467      1.546   6.771

> head(ranef(fm1)[[1]])
    (Intercept)       Days
308    2.258565  9.1989719
309  -40.398577 -8.6197032
310  -38.960246 -5.4488799
330   23.690498 -4.8143313
331   22.260203 -3.0698946
332    9.039526 -0.2721707

MacKay’s method on sleepstudy

The SleepReg class

The following examples will use the SleepReg class, an ad hoc subclass of tf.Module written for this post. Given variances for the random effects and for the model errors, it produces maximum likelihood (equivalently, generalized least squares) estimates of the fixed effects and predictions of the random effects. For an explanation of the TensorFlow 2.0 strategy and why inheriting from tf.Module is so important, refer to Multiple Regression in TensorFlow 2.0 using Matrix Notation.

The SleepReg class incorporates a (profiled) maximum likelihood loss of the form:

with tf.GradientTape() as gradient_tape:
    y_pred = self._get_expectation(X, Z, self.beta, self.b) 
    loss = (self._get_sse(y, y_pred) / self.sigmasq_epsilon
            + self._get_neg_log_prior(self.b, V))

This involves the sum of squared errors divided by the error variance plus the likelihood contribution of the latent random effects in _get_neg_log_prior (referred to as a “prior” to reflect the empirical Bayes interpretation). The latter quantity is a weighted sum of squares of the random effects, where the weight matrix V is a block diagonal of the inverse random effects variance matrices.

@tf.function
def _get_neg_log_prior(self, b, V):
    """Get the weight penalty from the full Gaussian distribution"""
    bTV = tf.matmul(tf.transpose(b), V)
    bTVb = tf.matmul(bTV, b)
    return tf.squeeze(bTVb)
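
The full SleepReg source isn’t reproduced in this post, so here is a hedged sketch of my own (not the class’s actual code) of how a weight matrix like V could be assembled, assuming the stacked random effects vector groups each of the 18 subjects’ (intercept, slope) pairs together and plugging in lmer’s estimated 2x2 covariance matrix:

import numpy as np
from scipy.linalg import block_diag

n_subjects = 18
off_diag = 24.740 * 5.922 * 0.07                 # covariance implied by lmer's correlation
Sigma_b = np.array([[612.09, off_diag],
                    [off_diag, 35.07]])          # 2x2 random effects covariance matrix
Sigma_b_inv = np.linalg.inv(Sigma_b)

# One inverse-covariance block per subject, stacked along the diagonal
V = block_diag(*([Sigma_b_inv] * n_subjects))    # shape (36, 36)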

Reproducing lmer’s estimates in TensorFlow

The following TensorFlow 2.0 code reproduces both the random effect predictions and the fixed effect estimates of lmer, but without the routines (such as REML) that estimate the unknown variances. You’ll see that the optimization routine matches lmer’s output to a high degree of accuracy for both the fixed effects estimates and the random effects predictions.

from sleepstudy import SleepReg
import numpy as np

sleep_reg = SleepReg("/mnt/c/devl/data/sleepstudy.csv")

# Replicate lme4's result
off_diag = 24.7404 * 5.9221 * 0.066
lmer_vcov = np.array([[24.7404 ** 2, off_diag],
                      [off_diag, 5.9221 ** 2]])

sleep_reg.reset_variances(lmer_vcov, 25.5918 ** 2)

sleep_reg.train()
sleep_reg.set_optimizer(adam=True)
sleep_reg.train(epochs=300)

print(sleep_reg.beta)
print(sleep_reg.get_random_effects().head())
<tf.Variable 'Variable:0' shape=(2, 1) dtype=float64, numpy=
array([[251.40510486],
       [ 10.46728596]])>
          mu         b
0   2.262934  9.198305
1 -40.399556 -8.619793
2 -25.207478  1.172853
3 -13.065620  6.613451
4   4.575970 -3.014939

Implementing MacKay’s method

The loss function component _get_neg_log_prior in SleepReg uses a block diagonal matrix, V, which is non-diagonal if there are correlations between the random effects. MacKay’s proposed method uses the raw sum of squares of the weights, making for a very clean equation:

Lecture 9 slide describing Bayesian weight decay from Geoff Hinton’s course
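
As a quick numerical check (a sketch of my own, not taken from the SleepReg class), when V is diagonal with one variance per type of random effect, the quadratic form computed by _get_neg_log_prior collapses to exactly this kind of weighted sum of squares; this assumes b stacks each subject’s (intercept, slope) pair:

import numpy as np

rng = np.random.default_rng(1)
b = rng.normal(size=(36, 1))                     # stacked random effects for 18 subjects
sigma2_int, sigma2_slope = 612.0, 35.0           # one variance per type of random effect

# A diagonal V repeats the two inverse variances down the diagonal, one pair per subject
V = np.kron(np.eye(18), np.diag([1 / sigma2_int, 1 / sigma2_slope]))

quad_form = (b.T @ V @ b).item()                 # what _get_neg_log_prior computes
sum_sq = (b[0::2] ** 2).sum() / sigma2_int + (b[1::2] ** 2).sum() / sigma2_slope

print(np.isclose(quad_form, sum_sq))             # True: the penalty is a weighted sum of squares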

While we go through MacKay’s “while not yet bored” loop, we’ll zero out the non-diagonals of V that result from non-zero covariances in the empirical variance matrix of the random effect predictions. What happens if you don’t? I wondered if you’d get a slightly less “quick and dirty” version of the algorithm, but the procedure actually bombs after a few iterations. You can see this yourself by commenting out the line with the diag function calls.

sleep_reg.zero_coefficients()
sleep_reg.reset_variances(np.array([[410, 10], [10, 22]]),
                         .25 * np.var(sleep_reg.y))
sleep_reg.set_optimizer(adam=False)

for i in range(100):
    sleep_reg.train(display_beta=False)
    
    sigmasq_epsilon = sleep_reg.estimate_sigmasq_epsilon()
       
    V = sleep_reg.get_rnd_effs_variance()
    V_diag = np.diag(np.diag(V)) # comment out and watch procedure fail

    sleep_reg.reset_variances(V_diag, sigmasq_epsilon)

    print(V_diag)
    print(sigmasq_epsilon)

print(sleep_reg.beta)
print(sleep_reg.get_random_effects().head())
--- last V_diag
[[302.9045408    0.        ]
 [  0.          31.08902388]]
--- last sigmasq_epsilon
[670.8546961]
--- final estimate of fixed effect beta
<tf.Variable 'Variable:0' shape=(2, 1) dtype=float64, numpy=
array([[251.40510485],
       [ 10.46728596]])>
--- final random effects predictions
          mu         b
0   2.013963  9.147986
1 -32.683526 -9.633964
2 -20.255296  0.459532
3 -10.372529  6.169903
4   3.618080 -2.851007

Discussion

As foretold by the law of total variance, the random effects variance estimates from MacKay’s method are low: the variance of the random intercepts comes in at 303, just under half of lmer’s estimate of 612. Whether or not it’s a coincidence, the empirically estimated variance of the random slopes was 31, much closer to the lmer estimate of 35. The poorer random effect predictions led to a slightly larger error variance of 671 versus lmer’s 655, but still relatively close.

Even with the inadequacies in variance estimation, the fixed effects estimates produced by MacKay’s method are much closer to lmer’s than those from an OLS regression treating subjects as fixed factor levels would be. The random effect predictions themselves are shrunken down too much, but also quite close to lmer’s for some subjects. The procedure, true to its name, is quick and dirty, but it works.

That the procedure breaks down after even a slight deviation from independent random effects is a mystery to me, however. Underestimates of the random effects variances seem to be forgiven (at least in terms of algorithm stability) as long as there is no link between them. Perhaps another improvement to the method, one that might even allow correlations, would be to multiply the empirical variances by a scaling factor greater than one. But what would that factor be, so as not to make the method even “dirtier”?
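
For anyone who wants to experiment with that idea, here is a hedged sketch of where such a factor could plug into the loop above; the value 1.5 is arbitrary and not a recommendation:

inflate = 1.5  # arbitrary inflation factor, purely for experimentation

V = sleep_reg.get_rnd_effs_variance()
V_scaled = inflate * np.diag(np.diag(V))   # still zeroing out the off-diagonals
sleep_reg.reset_variances(V_scaled, sigmasq_epsilon)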

I have a vision of a toolkit as powerful as TensorFlow that also offers the basic inferential tools and benefits of classical statistics. That vision is still far from a reality, but this article explored some of its components. I thought it was cool how teaming stochastic gradient descent with the Adam optimizer locked in the mixed model parameter values so quickly (when starting far away from the MLE estimates, Adam on its own moves very slowly). Whether or not MacKay’s method will find its way into my standard modeling toolkit remains to be seen, but my curiosity about the method is only enhanced by these latest experiments.


Multiple Regression in TensorFlow 2.0 using Matrix Notation

While the two were always friendly, TensorFlow has fully embraced the Keras API in version 2.0, making it THE high-level API and further tightening its integration into the platform’s core. From tutorials, videos, and press releases, the message is resounding: use the Keras API unless you absolutely, positively can’t.

All signs point to the Keras API as being a world class API, but it is a neural networks API. And while many statistical models can be framed as neural networks, there is another API that some prefer: the “matrix algebra API.” The good news is that TensorFlow 2.0’s new eager execution defaults mean that working with linear models in matrix form is easier than ever. Once you know a few caveats, it doesn’t feel so different than working in numpy. And this is great news!

In this post, we’re going to do multiple regression using Fisher’s Iris data set, regressing Sepal Length on Sepal Width and Petal Length (for no particular scientific reason) using TensorFlow 2.0. Yes, there is an official linear regression tutorial for TensorFlow 2.0, but it does not feature the matrix calculations (or explain the caveats) that this article will.

In matrix notation, we’ll be fitting the following model:


    \[\left[\begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_{150} \end{matrix}\right] = \left[\begin{matrix} 1 & x_1 & z_1 \\ 1 & x_2 & z_2 \\ \vdots & \vdots & \vdots \\ 1 & x_{150} & z_{150} \end{matrix}\right] \left[\begin{matrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{matrix}\right] + \left[\begin{matrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{150} \end{matrix}\right],\]


where y is Sepal Length, x is Sepal Width, z is Petal Length, and \epsilon_1, \ldots, \epsilon_{150} are i.i.d. N(0, \sigma^2).

Let’s make this regression happen in Python using the statsmodels module:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Part 1: OLS Regression on two predictors using statsmodels
iris_df = sm.datasets.get_rdataset('iris').data
iris_df.columns = [name.replace('.', '_') for name in iris_df.columns]
reg_model = smf.ols(formula='Sepal_Length ~ Sepal_Width + Petal_Length',
                    data=iris_df)
fitted_model = reg_model.fit()
fitted_model.summary()

This gives us the (partial) output:

===================================================================
                   coef    std err          t      P>|t|
-------------------------------------------------------------------
Intercept        2.2491      0.248      9.070      0.000
Sepal_Width      0.5955      0.069      8.590      0.000
Petal_Length     0.4719      0.017     27.569      0.000
===================================================================

Now let’s spin up TensorFlow and convert our matrices and vectors into “tensors”:

import tensorflow as tf
import patsy                                                                                                                                                            
X_matrix = patsy.dmatrix('1 + Sepal_Width + Petal_Length', data=iris_df)

X = tf.constant(X_matrix)
y = tf.constant(iris_df.Sepal_Length.values.reshape((150, 1)))

We’re using constant tensors for our data vectors, and everything looks pretty straightforward here, but there is a spike-filled trap that we just stepped over. Caveat #1: if you don’t reshape your vector y into an actual column vector, the following code will run but lead to incorrect estimates. With y left as a flat vector of shape (150,), subtracting the (150, 1) predictions broadcasts the residuals into a 150 x 150 matrix of pairwise differences, and the resulting fit is basically an intercept-only model.
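
To see why (a quick NumPy sketch of my own, not from the original code), compare the shapes involved when y is left flat:

import numpy as np

y_flat = np.zeros(150)            # shape (150,): what you get without the reshape
y_col = np.zeros((150, 1))        # shape (150, 1): an actual column vector
y_pred = np.zeros((150, 1))       # predictions from tf.matmul(X, beta) have this shape

# Broadcasting turns the residuals into a 150 x 150 matrix of pairwise differences,
# so a mean of their squares is no longer a mean over matched observations.
print((y_flat - y_pred).shape)    # (150, 150)
print((y_col - y_pred).shape)     # (150, 1)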

The next thing we’ll do is to create our variable tensor that will hold our regression weights. You could directly create a TensorFlow variable, but don’t. Instead, subclass from tf.Module:

class IrisReg(tf.Module):
    def __init__(self, starting_vector = [[0.0], [0.0], [0.0]]):
        self.beta = tf.Variable(starting_vector, dtype=tf.float64)

irisreg = IrisReg()

I don’t love this, as it feels bureaucratic and I’d rather just work with a variable called “beta.” But you really need to do this unless you want to roll your own gradient descent. Caveat #2: don’t bypass subclassing from tf.Module, or you will struggle with your optimizer’s .apply_gradients method. By subclassing from tf.Module, you get a trainable_variables property that you can treat like the parameter vector, but which is also an iterable collection of the module’s variables.
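
As a small illustration using the IrisReg instance defined above, trainable_variables is a tuple containing the very same variable object:

print(len(irisreg.trainable_variables))                 # 1
print(irisreg.trainable_variables[0] is irisreg.beta)   # True: the tuple holds the beta variable itself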

The matrix math for the prediction is a little anticlimactic:

@tf.function
def predict(X, beta):
    return tf.matmul(X, beta)

and the @tf.function decorator is optional but can provide a performance benefit. The OLS regression loss is so simple that it’s also worth defining it explicitly (and, in full transparency, I did have some trouble with the built-in losses):

@tf.function
def get_loss(observed, predicted):
    return tf.reduce_mean(tf.square(observed - predicted))

While the loss function can easily be coded from scratch, a built-in optimizer has too many benefits to pass up, like built-in momentum for gradient descent.

sgd_optimizer = tf.optimizers.SGD(learning_rate=.01, momentum=.98)

The rest of the training is presented in the following loop:

for epoch in range(1000):

    # Record the forward pass on the gradient tape
    with tf.GradientTape() as gradient_tape:
        y_pred = predict(X, irisreg.trainable_variables)
        loss = get_loss(y, y_pred)

    # Differentiate the loss with respect to the trainable variables,
    # then take an SGD-with-momentum step
    gradient = gradient_tape.gradient(loss, irisreg.trainable_variables)
    sgd_optimizer.apply_gradients(zip(gradient,
                                      irisreg.trainable_variables))

    print(irisreg.trainable_variables)

With eager execution enabled by default in TensorFlow 2.0, running “gradient tape” through the “forward pass” (i.e., prediction) is necessary to get the gradients. Notice that the trainable_variables property is used in place of the parameter vector in all situations. You could get away with a plain variable at every step except the optimizer’s apply_gradients call, but in my experience mixing and matching the two caused trouble as well.

(<tf.Variable 'Variable:0' shape=(3, 1) dtype=float64, numpy=
array([[2.24920008],
       [0.59551132],
       [0.47195184]])>,)

It’s not the most sophisticated training loop, but even starting from an awful choice of starting vector, the procedure quickly converges to the OLS regression estimates. The loss function is easy to alter to create a Ridge Regression or LASSO procedure. And being in the TensorFlow ecosystem means that these techniques would scale to big datasets, could easily be ported to JavaScript using TensorFlow.js, and would be available to the TensorBoard debugging utilities.
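
For example, here is a hedged sketch of a ridge-style variant of the loss above; the penalty strength lam is arbitrary, and beta[1:] assumes the first element is the intercept:

@tf.function
def get_ridge_loss(observed, predicted, beta, lam=0.1):
    """Mean squared error plus an L2 penalty on the non-intercept coefficients."""
    mse = tf.reduce_mean(tf.square(observed - predicted))
    return mse + lam * tf.reduce_sum(tf.square(beta[1:]))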

It’s not just neural network enthusiasts who can gain from TensorFlow. Statisticians and other Data Scientists who prefer matrix manipulation can now really enjoy using TensorFlow thanks to the very cool eager enhancements in TensorFlow 2.0.