Statistics from Streaming Data
Many real-world data pipelines deal with streams; data that arrives continuously and cannot be stored in full. The challenge is to compute useful statistics using $O(1)$ or $O(k)$ memory, in a single pass.
Running Average
The naive approach is to accumulate the sum and divide but total number. But this approach leads to overflows for large streams. Instead, maintain a running average updated incrementally:
$$s_n = \frac{n-1}{n} s_{n-1} + \frac{1}{n} x_n$$
s = 0
n = 0
while True:
x = get_next()
s = (s*n)/(n+1) + x/(n+1)
n += 1
This is numerically more stable than accumulating a sum because $s$ stays in the same range as the data throughout.
Exponential Moving Average
The running average weights all samples equally. But for non-stationary streams where the distribution shifts over time, recent samples should matter more. EMA does this with a single parameter $\alpha \in (0, 1)$:
$$s_t = \alpha s_{t-1} + (1-\alpha) x_t$$
s = 0
alpha = 0.5
while True:
x = get_next()
s = alpha * s + (1-alpha) * x
Unrolling the recurrence, sample $x_{t-k}$ contributes an exponentially decaying influence $\alpha^k (1-\alpha)$. Larger $\alpha$ means a longer memory; smaller $\alpha$ reacts faster to recent changes. Unlike the running average, EMA never “forgets” old samples completely, but their influence becomes negligible.
Reservoir Sampling
Problem: sample $k$ elements uniformly at random from a stream of unknown length $n$, using $O(k)$ memory.
The key idea: when the $i$-th element arrives, include it with probability $\frac{k}{i}$, displacing a random existing element if selected.
import random
reservoir = []
for i, x in enumerate(stream()):
if i < k:
reservoir.append(x)
else:
j = random.randint(0, i) # uniform in [0, i]
if j < k:
reservoir[j] = x
Why it’s uniform: by induction. After seeing $k$ elements the reservoir holds all $k$, each with probability $1$. When the $i$-th element arrives ($i \geq k$), it is included with probability $\frac{k}{i+1}$. Each element already in the reservoir survives with probability $\frac{i}{i+1}$ (probability of not being evicted). If by induction each was present with probability $\frac{k}{i}$, after step $i$ it remains with probability:
$$\frac{k}{i} \cdot \frac{i}{i+1} = \frac{k}{i+1}$$
which matches the new element’s inclusion probability. So at all times, every element seen so far is in the reservoir with equal probability $\frac{k}{n}$.
Count-Min Sketch
Problem: estimate the frequency $f_x$ of any element $x$ in a stream over a large universe $\mathcal{U}$, using sublinear memory. Storing exact counts is infeasible when $|\mathcal{U}|$ is huge (e.g. IP addresses, n-grams).
A Count-Min Sketch maintains a $d \times w$ table of counters and $d$ independent hash functions $h_1, \ldots, h_d : \mathcal{U} \to [w]$. On seeing element $x$, increment all $d$ corresponding cells. To query, return the minimum across all $d$ cells.
d, w = 5, 2000 # depth, width
table = [[0] * w for _ in range(d)]
def update(x):
for i in range(d):
table[i][hash(x, i) % w] += 1
def query(x):
return min(table[i][hash(x, i) % w] for i in range(d))
The estimate $\hat{f}_x$ is always an overestimate ($\hat{f}_x \geq f_x$) — cells can only be inflated by hash collisions, never deflated. Taking the minimum across $d$ rows reduces this overestimation. With $w = \lceil e/\varepsilon \rceil$ columns and $d = \lceil \ln(1/\delta) \rceil$ rows:
$$P\left(\hat{f}_x \leq f_x + \varepsilon \cdot |f|_1\right) \geq 1 - \delta$$
where $|f|_1$ is the total stream length. So with $w = 2718, d = 5$ we get error within $0.1%$ of total stream size with probability $\geq 99.3%$, using only $\sim 100\text{K}$ counters regardless of $|\mathcal{U}|$.