Bayesian Regression Review
Both the IPython notebook and code are available.
Recall that our regression coefficients \(\vec{w}\) come from a prior distribution
\[\begin{align} \vec{w} \sim \mathcal{N}\left(\vec{m}_0, \mathbf{S}_0\right) \end{align}\]and the likelihood is
\[\begin{align} p(\vec{t}|\vec{w}) = \mathcal{N}\left(\vec{t} | \mathbf{\Phi} \vec{w}, \beta^{-1}I\right) \end{align}\]We can then derive the posterior to be
\[\begin{align} p(\vec{w} | \vec{t}) &= \mathcal{N}\left( \vec{w} \Big| \vec{m}_N, \mathbf{S}_N \right) \\ \vec{m}_N &= \mathbf{S}_N\left(\mathbf{S}_0^{-1} \vec{m}_0 + \beta \mathbf{\Phi}^T \vec{t} \right) \\ \mathbf{S}_N^{-1} &= \mathbf{S}_0^{-1} + \beta \mathbf{\Phi}^T\mathbf{\Phi} \end{align}\]
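For reference, this result comes from multiplying the Gaussian prior by the Gaussian likelihood and completing the square in \(\vec{w}\) (a standard derivation; only the final expressions are needed for what follows):
\[\begin{align} \ln p(\vec{w} | \vec{t}) &= -\frac{\beta}{2}\left(\vec{t} - \mathbf{\Phi}\vec{w}\right)^T\left(\vec{t} - \mathbf{\Phi}\vec{w}\right) - \frac{1}{2}\left(\vec{w} - \vec{m}_0\right)^T \mathbf{S}_0^{-1} \left(\vec{w} - \vec{m}_0\right) + \text{const} \\ &= -\frac{1}{2}\vec{w}^T\left(\mathbf{S}_0^{-1} + \beta\mathbf{\Phi}^T\mathbf{\Phi}\right)\vec{w} + \vec{w}^T\left(\mathbf{S}_0^{-1}\vec{m}_0 + \beta\mathbf{\Phi}^T\vec{t}\right) + \text{const} \end{align}\]Matching the quadratic term against \(\mathbf{S}_N^{-1}\) and the linear term against \(\mathbf{S}_N^{-1}\vec{m}_N\) recovers the expressions above.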
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1)
# generate toy data
dim = 1
numpoints = 100
x = np.linspace(-3, 3, numpoints)
x = x[np.newaxis].T
slope = 3
eps_mu = 0
beta = 0.25  # noise precision; the noise variance is 1/beta
eps = np.random.normal(eps_mu, np.sqrt(1/beta), size = (numpoints, dim))
t = slope * x + eps
# phi is a column of ones next to the column of x's
phi = np.hstack((np.ones((numpoints, 1)), x))
# choose a simple zero-mean, unit-covariance prior
m_0 = np.array([[0], [0]])
S_0 = np.eye(2)
beta = 0.25
S_0inv = np.linalg.inv(S_0)
# posterior covariance and mean from the update equations above
S_n = np.linalg.inv(S_0inv + beta * phi.T @ phi)
m_n = S_n @ (S_0inv @ m_0 + beta * phi.T @ t)
line_x_b = np.linspace(np.min(x)-5, np.max(x) + 5, 100)
line_y_b = m_n[0] + m_n[1]*line_x_b
plt.scatter(x, t)
plt.plot(line_x_b, line_y_b, 'r')
plt.show()
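As a quick sanity check (not in the original script), we can print the posterior mean and the marginal posterior standard deviations; with this synthetic data we would expect an intercept near 0 and a slope near the true value of 3:
# quick sanity check: the posterior mean should sit near the true parameters
print("posterior mean (intercept, slope):", m_n.ravel())
print("posterior std devs:", np.sqrt(np.diag(S_n)))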
Online Bayesian Regression
Large datasets often run into computational and memory limitations, both of which motivate an online, or sequential, update scheme. Similar to stochastic gradient descent for frequentist regression, online Bayesian regression lets us update the model parameters without keeping the entire dataset in memory or storage, and it also makes the training/fitting process more computationally efficient. The update rules can be derived as follows.
Suppose we have \(N + M\) observations. Let \(\mathbf{\Phi}_{N + M}\) denote the design matrix containing the \(N + M\) feature vectors and \(\vec{t}_{N + M}\) the vector of the corresponding \(N + M\) target values. We can partition them as
\[\begin{align} \mathbf{\Phi}_{N + M} = \begin{pmatrix} \mathbf{\Phi}_N \\ \mathbf{\Phi}_M \end{pmatrix} \quad \vec{t}_{N + M} = \begin{pmatrix} \vec{t}_N \\ \vec{t}_M \end{pmatrix} \end{align}\]Thus the updates for \(\mathbf{S}_{N + M}\) and \(\vec{m}_{N + M}\) can be written as
\[\begin{align} \mathbf{S}_{N + M}^{-1} &= \mathbf{S}_0^{-1} + \beta \mathbf{\Phi}_{N + M}^T \mathbf{\Phi}_{N + M} \\ &= \mathbf{S}_0^{-1} + \beta \left(\mathbf{\Phi}_{N}^T\mathbf{\Phi}_{N} + \mathbf{\Phi}_{M}^T\mathbf{\Phi}_{M} \right) \\ &= \mathbf{S}_N^{-1} + \beta\mathbf{\Phi}_{M}^T\mathbf{\Phi}_{M}\\ \vec{m}_{N + M} &= \mathbf{S}_{N + M} \left(\mathbf{S}_0^{-1} \vec{m}_0 + \beta \mathbf{\Phi}_{N + M}^T \vec{t}_{N+M}\right) \\ &= \mathbf{S}_{N + M} \left(\mathbf{S}_0^{-1} \vec{m}_0\right) + \beta \mathbf{S}_{N + M} \left(\mathbf{\Phi}_{N}^T\vec{t}_{N} + \mathbf{\Phi}_{M}^T\vec{t}_{M}\right) \\ &= \mathbf{S}_{N + M} \left(\mathbf{S}_0^{-1} \vec{m}_0\right) + \mathbf{S}_{N + M} \left(\mathbf{S}_N^{-1} \vec{m}_N - \mathbf{S}_0^{-1}\vec{m}_0\right) + \beta \mathbf{S}_{N + M} \left(\mathbf{\Phi}_M^T \vec{t}_M\right) \\ &= \mathbf{S}_{N + M} \left(\mathbf{S}_N^{-1} \vec{m}_N\right) + \beta \mathbf{S}_{N + M} \left(\mathbf{\Phi}_M^T \vec{t}_M\right) \end{align}\]These updates do not require past data to be saved, nor do they require the repeated inclusion of the past \(N\) observations when recomputing model parameters (which can be rather costly when data is high dimensional, \(N\) is large, or both).
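To make the rules concrete, here is a minimal sketch of a single-observation update using the numpy import from above (the function name bayes_update and its signature are illustrative, not part of the original code); it applies the two formulas above with \(M = 1\):
def bayes_update(m, S_inv, phi_row, t_val, beta):
    # m: current posterior mean, shape (d, 1)
    # S_inv: current posterior precision, shape (d, d)
    # phi_row: feature vector of the new observation, shape (1, d)
    # t_val: target value of the new observation (scalar)
    # beta: noise precision
    S_inv_new = S_inv + beta * phi_row.T @ phi_row
    S_new = np.linalg.inv(S_inv_new)
    m_new = S_new @ (S_inv @ m + beta * phi_row.T * t_val)
    return m_new, S_inv_new
The loop below performs exactly this update inline, one row of the design matrix at a time.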
# we will loop through phi to mimic an incoming stream of data
# same prior as before: zero mean, identity covariance
m_n = np.array([[0.0], [0.0]])
S_n = np.eye(2)
S_ninv = np.linalg.inv(S_n)
beta = 0.25
for i in range(phi.shape[0]):
    S_ninv_old = S_ninv
    # precision update: add beta * phi_i phi_i^T for the new observation
    S_ninv = S_ninv + beta * phi[np.newaxis, i, :].T @ phi[np.newaxis, i, :]
    S_n = np.linalg.inv(S_ninv)
    # mean update: m_n = S_n (S_ninv_old m_n + beta * phi_i^T t_i)
    m_n = S_n @ (S_ninv_old @ m_n + beta * phi[np.newaxis, i, :].T * t[i])
line_x_on = np.linspace(np.min(x)-5, np.max(x) + 5, 100)
line_y_on = m_n[0] + m_n[1]*line_x_on
fig, ax = plt.subplots(1, 2, figsize=(10,5))
ax[0].title.set_text("Bayesian")
ax[0].scatter(x, t)
ax[0].plot(line_x_b, line_y_b, 'r')
ax[1].title.set_text("Online Bayesian")
ax[1].scatter(x, t)
ax[1].plot(line_x_on, line_y_on, 'r')
plt.show()
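Since both runs use the same prior, the sequential updates should arrive at the same posterior as the batch computation, up to floating-point error. A quick check (not in the original script), recomputing the batch posterior under the zero-mean, unit-covariance prior:
# the online posterior should match the batch posterior (same prior, same data)
S_N_batch = np.linalg.inv(np.eye(2) + beta * phi.T @ phi)
m_N_batch = S_N_batch @ (beta * phi.T @ t)  # the S_0^{-1} m_0 term vanishes since m_0 = 0
print(np.allclose(S_N_batch, S_n), np.allclose(m_N_batch, m_n))  # expect: True True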