
Introducing “datascroller” for fast terminal data frame scrolling

I’m excited to announce my very first package on PyPI, datascroller, a Python package for interactive terminal data scrolling. It’s available for Windows as well as *nix systems (thanks to windows-curses), and contributors to the codebase are welcome!

How it works

See the gif below for a glimpse of datascroller in action:

[GIF: datascroller scrolling through a data frame in the terminal]

The syntax has changed slightly since the gif was created, but during that demo I was pressing keys to resize the terminal viewing window and to scroll left-to-right and top-to-bottom within a pandas data frame. Currently the scrolling keys are inspired by vim, but later versions will offer customization options.

You can install datascroller with pip using:

pip install datascroller

Try datascroller out in IPython with the following code:

import pandas as pd
from datascroller import scroll

train = pd.read_csv(
    'https://raw.githubusercontent.com/datasets/house-prices-uk/master/data/data.csv')

scroll(train)

Why a terminal data scroller?

Scrolling through a data set is a fundamental part of exploratory data analysis, and open-source tools let us down in this regard. SAS has had it right for a while: from my memory of around 2001, you could scroll through tens of millions of rows thanks to what must have been a very clever paging strategy. Say what you want about SAS, but honestly no other data viewer has come close.

Moving to R in 2009, I had to accept the loss of SAS’s data set viewer and make do with the built-in viewer, or just print slices of the data frame in the console. Soon after, I started using RStudio, which offered a nice improvement on the default viewer, but it still couldn’t hold a candle to SAS’s and (to the best of my recollection) didn’t handle very large data sets well at the time.

In 2019, RStudio may very well have its data viewer tuned to perfection. But some people prefer working in the terminal, and sometimes you have to (say, a client gives you an ssh login to a particular remote machine). It is possible to hook up notebooks or use an X server, but often it’s easier to just print slices of your data sets in the terminal for exploratory analysis. While R’s tibble and pandas’ DataFrame are smart enough not to overwhelm your console with output, they make you work to see the parts of the data that you really need to see.
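As a rough sketch of what that work looks like (using the train data frame from the snippet above; the exact slice bounds here are arbitrary), compare the manual approach with a single scroll call:

# The manual way: repeated, hand-edited slices
print(train.iloc[:10, :4])     # first rows, first columns
print(train.iloc[40:60, 2:5])  # "scroll" down and right by retyping bounds
print(train.iloc[-10:, :])     # jump to the bottom

# One interactive session instead
from datascroller import scroll
scroll(train)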

The datascroller vision

The featured image is a play on the movie “Minority Report” and its very memorable scene of Tom Cruise’s character using a futuristic interface to sort through information. I always wanted to move around a data set like that, and I felt that the terminal would be a good place to do it. In 2014, at Google, I took my first crack at this with an internal R package I called “terminalR.” I got helpful feedback from data scientists there, especially Tim Hesterberg. Tim convinced me of the need for user configuration options (still a TODO for datascroller!) and also to transition to Emacs/ESS, which came with Emacs Lisp. But we stopped short of achieving the vision of full interactivity.

The terminalR package’s original mechanism was “drumming” on the enter key while you pressed other navigation keys, since it relied on R’s standard console input methods. With Python offering wrappers for the curses library on both *nix systems and Windows, the interactive “vision” has become a reality.

What’s next for datascroller?

The Python package datascroller, currently for use with pandas data frames, will become the tool “datascroller” for general-purpose terminal data scrolling. Imagine interactive terminal scrolling of any CSV, text, or even JSON file, initiated from outside of Python. My former colleague John Merfeld, who makes extensive use of low-vision accessibility tools, is on the project and will help assess whether certain color schemes (curses offers those) make the terminal output easier to see, giving datascroller an accessibility angle.

Even with terminalR, I could get around an R data frame pretty fast, faster than any GUI viewer. It has column and row searching functionality from the keyboard, and a lot of movement options. All these options and more are coming to datascroller soon, in fully interactive fashion.

I have big plans for this tool.


Multiple Regression in TensorFlow 2.0 using Matrix Notation

While the two were always friendly, TensorFlow has fully embraced the Keras API in version 2.0, making it THE high-level API and further tightening its integration into the platform’s core. From tutorials, videos, and press releases, the message is resounding: use the Keras API, unless you absolutely, positively can’t.

All signs point to the Keras API being a world-class API, but it is a neural networks API. And while many statistical models can be framed as neural networks, there is another API that some prefer: the “matrix algebra API.” The good news is that TensorFlow 2.0’s new eager execution defaults mean that working with linear models in matrix form is easier than ever. Once you know a few caveats, it doesn’t feel so different from working in NumPy. And this is great news!

In this post, we’re going to do multiple regression using Fisher’s Iris data set, regressing Sepal Length on Sepal Width and Petal Length (for no particular scientific reason) using TensorFlow 2.0. Yes, there is an official linear regression tutorial for TensorFlow 2.0, but it does not feature the matrix calculations (or explain the caveats) that this article will.

In matrix notation, we’ll be fitting the following model:


    \[\left[\begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_{150} \end{matrix}\right] = \left[\begin{matrix} 1 & x_1 & z_1 \\ 1 & x_2 & z_2 \\ \vdots & \vdots & \vdots \\ 1 & x_{150} & z_{150} \end{matrix}\right] \left[\begin{matrix} \beta_0 \\ \beta_1 \\ \beta_2 \end{matrix}\right] + \left[\begin{matrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_{150} \end{matrix}\right],\]


where y is Sepal Length, x is Sepal Width, z is Petal Length, and \epsilon_1, \ldots, \epsilon_{150} are i.i.d. N(0, \sigma^2).

Let’s make this regression happen in Python using the statsmodels module:

import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Part 1: OLS Regression on two predictors using statsmodels
iris_df = sm.datasets.get_rdataset('iris').data
iris_df.columns = [name.replace('.', '_') for name in iris_df.columns]
reg_model = smf.ols(formula='Sepal_Length ~ Sepal_Width + Petal_Length',
                    data=iris_df)
fitted_model = reg_model.fit()
fitted_model.summary()

This gives us the (partial) output:

===================================================================
                   coef    std err          t      P>|t|
-------------------------------------------------------------------
Intercept        2.2491      0.248      9.070      0.000
Sepal_Width      0.5955      0.069      8.590      0.000
Petal_Length     0.4719      0.017     27.569      0.000
===================================================================

Now let’s spin up TensorFlow and convert our matrices and vectors into “tensors”:

import tensorflow as tf
import patsy

# Build the design matrix (intercept plus the two predictors)
X_matrix = patsy.dmatrix('1 + Sepal_Width + Petal_Length', data=iris_df)

X = tf.constant(X_matrix)
# Reshape y into a true (150, 1) column vector -- see Caveat #1 below
y = tf.constant(iris_df.Sepal_Length.values.reshape((150, 1)))

We’re using constant tensors for our data, and everything looks pretty straightforward here, but there is a spike-filled trap that we just stepped over. Caveat #1: if you don’t reshape your vector y into an actual column vector, the following code will run but lead to incorrect estimates. The culprit is broadcasting: subtracting a (150, 1) prediction from a (150,) vector silently produces a (150, 150) matrix of pairwise differences, so the loss being minimized is not the one intended, and the fit degenerates into basically an intercept-only model.
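To see the trap in isolation, here is a minimal sketch (shapes only, with made-up zero data) of how broadcasting changes the arithmetic:

import numpy as np

observed = np.zeros(150)        # shape (150,): not reshaped
predicted = np.zeros((150, 1))  # shape (150, 1): a true column vector

# Broadcasting turns the elementwise difference into an outer difference
print((observed - predicted).shape)                    # (150, 150)
print((observed.reshape((150, 1)) - predicted).shape)  # (150, 1)

The same rules apply to TensorFlow tensors, so reshaping y up front keeps every residual pairing one-to-one.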

The next thing we’ll do is create the variable tensor that will hold our regression weights. You could create a TensorFlow variable directly, but don’t. Instead, subclass tf.Module:

class IrisReg(tf.Module):
    def __init__(self, starting_vector=[[0.0], [0.0], [0.0]]):
        super().__init__()  # let tf.Module set up its variable tracking
        # float64 to match the patsy design matrix and pandas column
        self.beta = tf.Variable(starting_vector, dtype=tf.float64)

irisreg = IrisReg()

I don’t love this, as it feels bureaucratic and I’d rather just work with a variable called “beta.” But you really need it unless you want to roll your own gradient descent. Caveat #2: don’t bypass subclassing tf.Module, or you will struggle with your optimizer’s .apply_gradients method. By subclassing tf.Module, you get a trainable_variables property that you can treat like the parameter vector, with the bonus that it is also iterable.
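To see what that property holds, you can print it; the output should look something like the following, a tuple containing the single beta variable:

print(irisreg.trainable_variables)
# (<tf.Variable 'Variable:0' shape=(3, 1) dtype=float64, numpy=
# array([[0.],
#        [0.],
#        [0.]])>,)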

The matrix math for the prediction is a little anticlimactic:

@tf.function
def predict(X, beta):
    return tf.matmul(X, beta)

and the @tf.function decorator is optional, compiling the function into a graph for a performance benefit. The OLS regression loss is so simple that it’s also worth defining explicitly (and, in full transparency, I did have some trouble with the built-in losses):

@tf.function
def get_loss(observed, predicted):
    return tf.reduce_mean(tf.square(observed - predicted))

While the loss function is easy to code from scratch, there are too many benefits to using a built-in optimizer, like built-in momentum for gradient descent.

sgd_optimizer = tf.optimizers.SGD(learning_rate=.01, momentum=.98)

The rest of the training is presented in the following loop:

for epoch in range(1000):

    # Record the forward pass on the tape so gradients can be computed
    with tf.GradientTape() as gradient_tape:
        y_pred = predict(X, irisreg.trainable_variables)
        loss = get_loss(y, y_pred)

    # Differentiate the loss with respect to beta and take one SGD step
    gradient = gradient_tape.gradient(loss, irisreg.trainable_variables)
    sgd_optimizer.apply_gradients(zip(gradient,
                                      irisreg.trainable_variables))

print(irisreg.trainable_variables)

With eager execution enabled by default in TensorFlow 2.0, running the “forward pass” (i.e., the prediction) inside a GradientTape context is necessary to get the gradients. Notice that the trainable_variables property is used in place of the parameter vector in every situation: you could get away with a plain variable at every step until the optimizer’s apply_gradients method, and in my experience mixing and matching the two caused trouble as well.
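If GradientTape is new to you, here is a minimal standalone illustration of the pattern (a toy example of my own, separate from the regression):

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x * x  # the forward pass, recorded by the tape

print(tape.gradient(y, x))  # tf.Tensor(6.0, ...), since dy/dx = 2x

Back in our regression, the final print of trainable_variables shows the converged estimates: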

(<tf.Variable 'Variable:0' shape=(3, 1) dtype=float64, numpy=
array([[2.24920008],
       [0.59551132],
       [0.47195184]])>,)

It’s not the most sophisticated training loop, but even from an awful choice of starting vector, the procedure quickly converges to the OLS regression estimates. The loss function is easy to alter to create a ridge regression or LASSO procedure, as sketched below. And being in the TensorFlow ecosystem means that these techniques can scale to big data sets, be easily ported to JavaScript using TensorFlow.js, and hook into the TensorBoard debugging utilities.
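For instance, a ridge version is one extra term in the loss. A sketch (get_ridge_loss and lam are my own hypothetical names, and leaving the intercept unpenalized is a common but optional choice):

@tf.function
def get_ridge_loss(observed, predicted, beta, lam=0.1):
    mse = tf.reduce_mean(tf.square(observed - predicted))
    # Penalize only the slope coefficients, not the intercept beta[0]
    penalty = lam * tf.reduce_sum(tf.square(beta[1:]))
    return mse + penalty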

It’s not just neural network enthusiasts who can gain from TensorFlow. Statisticians and other data scientists who prefer matrix manipulation can now really enjoy using TensorFlow, thanks to the very cool eager enhancements in TensorFlow 2.0.