Data Oblivious Sketch for Logistic Regression

This is an annotated Python implementation of Oblivious Sketching for Logistic Regression from ICML 2021. The paper is also available on arXiv.

The proposed data-oblivious method computes a data sketch in input-sparsity time, reducing a dataset of size \(n \times d\) to a sketch consisting of \(\mathrm{poly}(\mu d \log n)\) weighted points (vectors), where \(\mu\) is a data-dependent parameter measuring the complexity of the dataset.

The code is adapted from the original authors' implementation, and the styling follows labml.ai.

The implementation is structured as follows: a Sketch class that builds and populates the sketch blocks, the function get_reduced_matrix_and_weights that assembles the complete sketch and its weights, the helper only_keep_k used for the Ky Fan truncation, and functions that evaluate and minimize the weighted logistic loss.

#

Import necessary packages:

  • NumPy for vector/matrix manipulations
  • math for basic mathematical operations
  • SciPy for logistic sigmoid function and optimization
import numpy as np
import math
from scipy.special import expit
import scipy.optimize as so
#
class Sketch:
#

A Sketch object is defined by the following parameters:

  • \(h_{\mathrm{max}}\) (h_max) ~ the number of levels
  • \(b\) (b) ~ branching parameter that determines how the (expected) number of coordinates changes between different levels
  • \(N\) (N) ~ number of buckets in each level
  • \(n\) (n) ~ size of the dataset (\(n\) observations)
  • \(d\) (d) ~ dimension of the dataset (each observation is represented by a \(d\)-dimensional vector; here an intercept is included)
  def __init__(self, h_max, b, N, n, d):
    self.h_max = h_max
    self.b = b
    self.N = N
    self.n = n
    self.d = d
#

The sketching matrix \(S\) first consists of \(h_{\mathrm{max}} + 1\) blocks, \(S_i\) for \(0 \leqslant i \leqslant h_{\mathrm{max}}\); and each block contains \(N\) rows (buckets). Matrix \(S\) is then complemented by a row-sampling matrix \(T\) populated by a small uniform sample of the data at the end.

\[ \begin{bmatrix} ~ S_0 ~ \\ ~ S_1 ~ \\ ~ \vdots~ \\ ~ S_{h_{\mathrm{max}}}~ \\ ~ T ~ \end{bmatrix} \]

Here \(A''\) (reduced_matrix) represents the unweighted resulting sketch that corresponds to the first \(h_{\mathrm{max}} + 1\) blocks of the sketching matrix (hence without \(T\)).

    shape = (N * (h_max + 1), d)
    self.reduced_matrix = np.zeros(shape)
#

To obtain normalizing constant \(\beta\) (beta), we compute the sum of the geometric series:

\[ \beta = \sum_{i=0}^{h_{\mathrm{max}}} b^{-i} = \frac{1 - (1/b)^{h_{\mathrm{max}} + 1}}{1 - (1/b)} = \frac{b - b^{-h_{\mathrm{max}}}}{b - 1} \]

Next we compute the probability associated with each level (p). Specifically, level \(h\) for \(0 \leqslant h \leqslant h_{\mathrm{max}}\) has \(1 / (\beta \cdot b^{h})\) chance of getting selected.

Then the weight of each level (w) is simply the inverse of its associated probability; that is \(\beta \cdot b^{h}\) for level \(h\).

Note: in the authors' published implementation \(\beta\) is computed as the sum of \(b^{i}\)'s (instead of \(b^{-i}\)'s), which is inconsistent with the expression presented in the paper.

    self.beta = 0
    self.p = np.zeros(h_max + 1)
    for i in range(0, h_max + 1):
      self.beta += b ** (-i)
      self.p[i] = b ** (-i)
    self.p = (1 / self.beta) * self.p
    self.w = 1 / self.p
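As a quick sanity check, the closed form above can also be verified numerically; a minimal standalone sketch, using the illustrative values \(b = 2\) and \(h_{\mathrm{max}} = 3\) (not taken from the paper):

```python
import numpy as np

b, h_max = 2.0, 3                               # illustrative values
beta = (b - b ** (-h_max)) / (b - 1)            # closed form of the geometric series
p = b ** (-np.arange(h_max + 1)) / beta         # level probabilities 1 / (beta * b**h)
w = 1.0 / p                                     # level weights beta * b**h

assert np.isclose(p.sum(), 1.0)                 # the level probabilities sum to one
assert np.allclose(w, beta * b ** np.arange(h_max + 1))
```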
#

Since all buckets at one level have the same weight, we repeat level weights to obtain weights for all buckets (weights).

    self.weights = np.repeat(self.w, N)
#

Each observation \(\vec{A}_i\) from the data matrix \(A\) (A) is hashed uniformly at random to bucket \(g\) (g) at level \(h\) (h), where the level is chosen according to the level probabilities \(p\).

Note: in the authors' published implementation the random \(h\) is obtained by utilizing \(\beta\) and a log transform. Since the \(\beta\) used in that implementation differs from the one in the paper, that approach to generating random levels cannot be used if we follow the definitions in the paper. Here we opt for the more straightforward and general built-in function random.choice from NumPy.

Note: the original implementation also constructs an accelerated counterpart to speed up the operation (it instructs Numba to translate the function to optimized machine code at runtime). Here we vectorize the insertion so that all observations are processed at once, instead of being inserted one by one.

  def insert(self, A):
    n, d = A.shape
    # level of each observation, drawn according to the level probabilities p
    h = np.random.choice(self.h_max + 1, size=n, p=self.p)
    # bucket of each observation, drawn uniformly at random within its level
    g = np.random.randint(0, self.N, size=n)
    indices = g + h * self.N
    # np.add.at accumulates rows that hash to the same bucket; plain fancy-indexing
    # assignment would keep only the last row written to each bucket
    np.add.at(self.reduced_matrix, indices, A)
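For intuition, a small standalone illustration (independent of the class) of why unbuffered accumulation via np.add.at is needed: with plain fancy-indexing assignment, rows that hash to the same bucket overwrite one another instead of being summed.

```python
import numpy as np

buckets = np.zeros((2, 3))
rows = np.array([[1.0, 1.0, 1.0],
                 [2.0, 2.0, 2.0]])
indices = np.array([0, 0])                            # both rows land in bucket 0

overwritten = buckets.copy()
overwritten[indices] = overwritten[indices] + rows    # only the last row survives
np.add.at(buckets, indices, rows)                     # rows are accumulated

print(overwritten[0])   # [2. 2. 2.]
print(buckets[0])       # [3. 3. 3.]
```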
#
  def get_reduced_matrix(self):
    return self.reduced_matrix
#
  def get_weights(self):
    return self.weights
#

Here \(A \in \mathbb{R}^{n \times d}\) (A) is the data matrix (last column being the intercept), and each row \(\vec{A}_i\) of \(A\) is defined as

\[\vec{A}_i = l_i \vec{x}_i\]

where \(\vec{x}_i\) is the \(i\)-th observation (with the intercept included) and \(l_i \in \big\{-1, 1\big\}\) is the corresponding label.

Note that the labels are not \(\big\{0, 1\big\}\) as commonly seen in other contexts.

def get_reduced_matrix_and_weights(A, sketch_size, kyfan_percent, h_max):
  n, d = A.shape
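For concreteness, a minimal sketch of how such a matrix could be assembled from a feature matrix X and labels l in \(\{-1, 1\}\) (the names X and l are illustrative and not part of the implementation):

```python
import numpy as np

X = np.array([[0.5, 1.2],
              [-0.3, 0.8],
              [1.0, -0.4]])                               # three observations, two features
l = np.array([1, -1, 1])                                  # labels in {-1, 1}

X_intercept = np.hstack([X, np.ones((X.shape[0], 1))])    # append the intercept column
A = l[:, np.newaxis] * X_intercept                        # row i is l_i * x_i
```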
#

To get the number of buckets \(N\) (N), we evenly split the sketch size \(r\) (sketch_size) into \(h_{\mathrm{max}} + 2\) blocks: \(h_{\mathrm{max}} + 1\) blocks for sketching, and \(1\) block for uniform sampling at the end.

  N = max(int(sketch_size / (h_max + 2)), 1)
#

To get the branching parameter \(b\) (b), we compute \(\sqrt[h_{\mathrm{max}}]{n / N}\).

  b = (n / N) ** (1.0 / h_max)
#

The size of the sketching blocks (sketch_blocks_size) is \(N \cdot (h_{\mathrm{max}} + 1)\).

  sketch_blocks_size = N * (h_max + 1)
#

The size of the uniform sampling block (unif_block_size) is \(\max\{1, r - N \cdot (h_{\mathrm{max}} + 1)\}\).

  unif_block_size = max(sketch_size - sketch_blocks_size, 1)
#

We generate both the unweighted resulting sketch \(A''\) (sketch_reduced_matrix) and the corresponding weights (sketch_weights) for the sketch blocks.

  sketch_blocks = Sketch(h_max, b, N, n, d)
  sketch_blocks.insert(A)
  sketch_reduced_matrix = sketch_blocks.get_reduced_matrix()
  sketch_weights = sketch_blocks.get_weights()
#

We uniformly draw random samples from the data matrix \(A\) (A) and append them to \(A''\) (sketch_reduced_matrix) to form the complete unweighted sketch reduced_matrix.

Note: in the original implementation the random samples are drawn without replacement (by setting replace=False in rng.choice). Here, to ensure the samples are indeed i.i.d., we draw with replacement instead.

  rng = np.random.default_rng()
  sample_indices = rng.choice(n, size=unif_block_size)
  unif_samples = A[sample_indices]
  reduced_matrix = np.vstack([sketch_reduced_matrix, unif_samples])
#

We compute the weights for the points (vectors) in the uniform sampling block (a conclusion from Theorem 6 in the paper), and append the result to sketch_weights to form the complete weight vector weights.

We also normalize the weights.

  unif_weights = np.ones(unif_block_size) * (n / unif_block_size)
  weights = np.concatenate([sketch_weights, unif_weights])
  weights = weights / np.sum(weights)
#

Loss and gradient are evaluated only on the \(K\) (K) largest entries of each level of the sketch blocks (Ky Fan \(K\)-norm).

  K = int(N * kyfan_percent)
#
  return reduced_matrix, weights, N, K, sketch_blocks_size
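As a shape check, a minimal usage sketch of get_reduced_matrix_and_weights on synthetic placeholder data, assuming the definitions above are in scope (all sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((10_000, 11))    # synthetic placeholder data, 10,000 rows

reduced_matrix, weights, N, K, sketch_blocks_size = get_reduced_matrix_and_weights(
    A, sketch_size=700, kyfan_percent=0.25, h_max=5
)
# N = 100 buckets per level, b = (10000 / 100) ** (1 / 5) ≈ 2.51,
# sketch blocks: 6 * 100 = 600 rows, uniform block: 100 rows, K = 25
print(reduced_matrix.shape)   # (700, 11)
print(weights.shape)          # (700,)
```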
#

This helper function keeps the K largest (smallest) elements for each block in a given vector vec.

  • vec ~ vector to be processed
  • block_size ~ number of entries in each block
  • K ~ number of largest/smallest to keep from each block
  • max_size ~ if provided, the function only looks at the first max_size entries of vec (that is, vec[:max_size])
  • largest ~ if set to True, the function keeps the K largest entries of each block; otherwise it keeps the K smallest entries.

The function returns a tuple consisting of new vector, and corresponding indices.

def only_keep_k(vec, block_size, K, max_size=None, largest=True):
#

When K is the same as the block_size, there is no need to do anything.

  if K == block_size:
    return vec, np.arange(len(vec))
#

We limit the scope of the operation to entries of vec that come before max_size.

  vec_in_scope = vec
  vec_out_of_scope = np.array([])
  if max_size is not None:
    vec_in_scope = vec[:max_size]
    vec_out_of_scope = vec[max_size:]
#

We determine the number of blocks in vec_in_scope.

  in_scope_size = vec_in_scope.shape[0]
  out_of_scope_size = vec_out_of_scope.shape[0]
  num_blocks = int(in_scope_size / block_size)
#

We split vec_in_scope into num_blocks blocks.

  blocks = np.array_split(vec_in_scope, num_blocks)
#

We iteratively piece together new blocks with K largest/smallest entries from each original block.

  new_blocks = []
  keep_indices = []
  for i, block in enumerate(blocks):
    if largest:
      indices = np.argpartition(block, -K)[-K:]
    else:
      indices = np.argpartition(block, K)[:K]
    new_blocks.append(block[indices])
    keep_indices.extend(indices + i * block_size)
#

Lastly we add back any additional entries beyond max_size of the original vector vec.

  new_blocks.append(vec_out_of_scope)
  keep_indices.extend(np.arange(out_of_scope_size) + in_scope_size)
#
  return np.concatenate(new_blocks), np.array(keep_indices)
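For intuition, a tiny worked example of only_keep_k (values chosen purely for illustration) with two blocks of size 3, K = 2, and one out-of-scope entry that is always kept:

```python
import numpy as np

vec = np.array([5.0, 1.0, 3.0, 2.0, 9.0, 4.0, 7.0])
new_vec, idx = only_keep_k(vec, block_size=3, K=2, max_size=6, largest=True)

# block [5, 1, 3] keeps {5, 3}, block [2, 9, 4] keeps {9, 4}, and the trailing 7 is kept as-is
print(np.sort(new_vec))   # [3. 4. 5. 7. 9.]
print(np.sort(idx))       # [0 2 4 5 6]
```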
#

We compute the weighted logistic loss:

\[ \ell(\vec{\theta}) = \sum_{i} \vec{w}_i \cdot \ln(1 + \exp(-\vec{B}_{i} \vec{\theta})) \]

where \(\vec{w}\) (weights) is the weight vector, \(B\) (reduced_matrix) is the unweighted resulting sketch, and \(\vec{\theta}\) (theta) is the coefficient vector of the logistic regression.

def logistic_likelihood(theta, reduced_matrix, weights=None, block_size=None, K=None, max_size=None):
  v = -reduced_matrix.dot(theta)
  if (block_size is not None) and (K is not None):
    v, indices = only_keep_k(v, block_size, K, max_size=max_size, largest=True)
    if weights is not None:
      weights = weights[indices]
  if weights is None:
    weights = np.ones_like(v)
  likelihoods = np.log1p(np.exp(v))
  likelihoods = weights * likelihoods

  return np.sum(likelihoods)
#

The gradient of the weighted logistic loss is

\[ \nabla \ell(\vec{\theta}) = -\sum_{i} \vec{w}_i \cdot \sigma(-\vec{B}_{i} \vec{\theta}) \, \vec{B}_{i} \]

where \(\sigma\) (expit) is the logistic sigmoid.

def logistic_likelihood_grad(theta, reduced_matrix, weights=None, block_size=None, K=None, max_size=None):
  v = reduced_matrix.dot(theta)
  if (block_size is not None) and (K is not None):
    v, indices = only_keep_k(v, block_size, K, max_size=max_size, largest=False)
    if weights is not None:
      weights = weights[indices]
    reduced_matrix = reduced_matrix[indices, :]
  if weights is None:
    weights = np.ones_like(v)

  grad_weights = expit(-1.0 * v)
  grad_weights *= weights

  grad = -1.0 * (grad_weights.dot(reduced_matrix))

  return grad
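Since analytic gradients are easy to get wrong, a quick finite-difference check with scipy.optimize.check_grad is a useful sanity test; a minimal sketch on random data (illustrative shapes, no Ky Fan truncation), assuming the functions above are in scope:

```python
import numpy as np
import scipy.optimize as so

rng = np.random.default_rng(1)
B = rng.standard_normal((50, 4))     # a small random stand-in for the sketch
w = rng.random(50)                   # arbitrary positive weights
theta = rng.standard_normal(4)

# compares the analytic gradient with a finite-difference approximation
err = so.check_grad(logistic_likelihood, logistic_likelihood_grad, theta, B, w)
print(err)   # should be close to zero (finite-difference error only)
```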
#

Finally, optimize fits the logistic regression coefficients by minimizing the weighted (and optionally Ky Fan truncated) logistic loss over the given sketch. If no weights are provided, uniform weights are used.

def optimize(reduced_matrix, weights=None, block_size=None, K=None, max_size=None):
  if weights is None:
    weights = np.ones(reduced_matrix.shape[0])
#
  def objective_function(theta):
    return logistic_likelihood(
      theta, reduced_matrix, weights, block_size=block_size, K=K, max_size=max_size
    )
#

We define the gradient callback and call SciPy's built-in L-BFGS-B solver (scipy.optimize.minimize) to minimize the weighted logistic loss, starting from \(\vec{\theta}_0 = \vec{0}\).

  def gradient(theta):
    return logistic_likelihood_grad(
      theta, reduced_matrix, weights, block_size=block_size, K=K, max_size=max_size
    )

  theta0 = np.zeros(reduced_matrix.shape[1])

  return so.minimize(objective_function, theta0, method="L-BFGS-B", jac=gradient)
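Putting everything together, a minimal end-to-end sketch on synthetic data (all names, sizes, and parameter values here are illustrative assumptions, not taken from the paper): we build \(A\) with rows \(l_i \vec{x}_i\), sketch it, fit on the sketch, and compare against a fit on the full data.

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic data: 5,000 observations, 3 features plus an intercept column
n, d = 5_000, 3
X = rng.standard_normal((n, d))
theta_true = np.array([1.5, -2.0, 0.5])
labels = np.where(X.dot(theta_true) + rng.standard_normal(n) > 0, 1, -1)

X_intercept = np.hstack([X, np.ones((n, 1))])
A = labels[:, np.newaxis] * X_intercept          # row i is l_i * x_i

# fit on the sketch
reduced_matrix, weights, N, K, sketch_blocks_size = get_reduced_matrix_and_weights(
    A, sketch_size=600, kyfan_percent=0.5, h_max=4
)
sketch_fit = optimize(
    reduced_matrix, weights, block_size=N, K=K, max_size=sketch_blocks_size
)

# fit on the full data for comparison (uniform weights, no Ky Fan truncation)
full_fit = optimize(A)

print(sketch_fit.x)
print(full_fit.x)    # the two coefficient vectors should be roughly aligned,
                     # up to the noise introduced by the random sketch
```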