Example 1: 1D Gaussian Mixture#

In this example we’ll take a 1D toy PDF (namely, a 3 component Gaussian mixture) and look at drawing samples from it in various ways.

It’s worth pointing out that this is, in some sense, a bad example. There are ways to easily and efficiently sample from Gaussian distributions without making the linear approximations employed by lintsampler. The point of this example is merely to demonstrate the ways lintsampler can be used with a 1D PDF.

Python preamble#

Imports

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

Matplotlib style file

plt.style.use('figstyle.mplstyle')

Set up RNG (this is optional, but it makes results reproducible)

rng = np.random.default_rng(42)

GMM model#

We’ll test a 3-component 1D Gaussian mixture model. Below we’ll define a function which evaluates the PDF as a function of \(x\). The means / widths / weights of the three Gaussian components are stored in the arrays mu, sig and w respectively. This implementation is vectorized (i.e., it takes a sequence of positions and spits out a sequence of densities), but in general lintsampler can take both vectorized and non-vectorized PDFs.

def gmm_pdf(x):
    mu = np.array([-3.0, 0.5, 2.5])
    sig = np.array([1.0, 0.25, 0.75])
    w = np.array([0.4, 0.25, 0.35])
    return np.sum([w[i] * norm.pdf(x, mu[i], sig[i]) for i in range(3)], axis=0)

Let’s visualise the PDF:

x = np.linspace(-10, 10, 200)
plt.plot(x, gmm_pdf(x), c='teal')
plt.xlabel('$x$', usetex=True);
../_images/29f6caf25e1054ef8bab1e309bd432b6a5e8491c5037df0254a87d84911ea73d.png

First approach: fixed grid#

Here we’ll take a fixed grid with 32 cells from \(x=-12\) to \(x=12\), and use LintSampler to draw a million samples. First import LintSampler:

from lintsampler import LintSampler

Set up the grid, and evaluate the PDF at the grid edges (note that the grid has 32 cells, so 33 edges):

N_cells = 32
x_min = -12
x_max = 12
fixedgrid = np.linspace(x_min,x_max,N_cells+1)

Create the LintSampler object, passing in the pdf and desired cell structure.

L = LintSampler(fixedgrid,pdf=gmm_pdf,vectorizedpdf=True,seed=rng)

Draw samples:

samples = L.sample(N=1000000)

Now we can visualise the samples, and compare with the true PDF.

bins = np.linspace(-12, 12, 200)
plt.hist(samples, bins=bins, density=True, label="Samples", fc='goldenrod')
plt.plot(bins, gmm_pdf(bins), label="True PDF", c='teal')
plt.xlabel('$x$', usetex=True)
plt.legend()
plt.xlim(-12, 12)
ylim = plt.gca().get_ylim()
for edge in fixedgrid:
    plt.plot([edge, edge], ylim, c='grey', lw=0.5)
plt.ylim(ylim);
../_images/0c7e8704b32f58aee34cf7c63c04563d5a3be20110f12b64acf5f55591dacc86.png

This does a pretty good job! Could be better, particularly close to the Gaussian peaks. The problem here is insufficient grid resolution. One could either increase the number of grid cells, or adopt a variable grid with more resolution where needed. We’ll try the latter.

Second approach: Variable grid#

Now we can set up a variable grid, again with 32 cells, but now with more resolution in the relevant regions and less resolution elsewhere. Here I just hard-code the array of edges, but of course one could imagine doing something smarter.

variablegrid = np.hstack([
    np.linspace(-12, -5, 4, endpoint=False),
    np.linspace(-5, -0, 8, endpoint=False),
    np.linspace(0, 5, 16, endpoint=False),
    np.linspace(5, 12, 5, endpoint=True),
])

L = LintSampler(variablegrid,pdf=gmm_pdf,vectorizedpdf=True,seed=rng)

Draw samples:

samples = L.sample(N=1000000)

Visualise:

bins = np.linspace(-12, 12, 200)
plt.hist(samples, bins=bins, density=True, label="Samples", fc='goldenrod')
plt.plot(bins, gmm_pdf(bins), label="True PDF", c='teal')
plt.xlabel('$x$', usetex=True)
plt.legend()
plt.xlim(-12, 12)
ylim = plt.gca().get_ylim()
for edge in variablegrid:
    plt.plot([edge, edge], ylim, c='grey', lw=0.5)
plt.ylim(ylim);
../_images/6459f3af4ca19775e5bbfcb17d9a3490d151e8fd5de4b57be4a2de1c14f485bb.png

That’s better!

Third Approach: Separate grids#

We are not required to have one continuous grid as the input to LintSampler. In fact, in this case, with three distinct peaks, we might want to consider grids that only sample around the peaks. We’ll pass a list of three grids, centred at the peaks.

peakgrids = [
    np.linspace(-5., -1., 12),
    np.linspace(0., 1., 12),
    np.linspace(1.25, 3.75, 12),
]

L = LintSampler(peakgrids,pdf=gmm_pdf,vectorizedpdf=True,seed=rng)

Draw samples:

samples = L.sample(N=1000000)

Visualise:

bins = np.linspace(-12, 12, 200)
plt.hist(samples, bins=bins, density=True, label="Samples", fc='goldenrod')
plt.plot(bins, gmm_pdf(bins), label="True PDF", c='teal')
plt.xlabel('$x$', usetex=True)
plt.legend()
plt.xlim(-12, 12)
ylim = plt.gca().get_ylim()
for edge in np.hstack(peakgrids):
    plt.plot([edge, edge], ylim, c='grey', lw=0.5)
plt.ylim(ylim);
../_images/ae4d4f6645fc7c01eed8c74d2df60029d20882cbd11287d8cf19e122fc01b513.png

It isn’t a better result than before, but it does demonstrate another way we could increase sampling near peaks in the pdf.

A completely different approach: Using a refined tree of cells#

lintsampler also includes a tree-based method to adaptively partition a domain. The basic idea is that the user provides an absolute error tolerance (as a fraction of the ‘mass’ of the tree), and a constructor builds an adaptive tree to meet the criteria. Let’s see a simple example in action:

from lintsampler import DensityTree

Now we construct the initial tree. The parameter min_openings sets the number of initial leaves on the tree: 2**min_openings in 1D.

tree = DensityTree(mins=x_min, maxs=x_max, pdf=gmm_pdf,vectorizedpdf=True,min_openings=4)    

Now we set a target error for the whole tree, and let the refinement take over (here 1%). We’ll turn on the verbose reporting to watch the tree construction in real time.

tree.refine(0.01,verbose=True)
Pre-loop: 16 leaves on tree. Total mass=0.8287939239124822
End of leaf iteration 1: 16 leaves on tree. Total mass=0.6827500800716951, with mean leaf mass=0.042671880004480946
End of tree iteration 1: 17 leaves on tree. Total mass=0.6827500800716951. Fractional error=0.07974421366700009.
End of tree iteration 2: 18 leaves on tree. Total mass=0.6785798437579177. Fractional error=0.0697335745395283.
End of tree iteration 3: 19 leaves on tree. Total mass=0.7024047133877867. Fractional error=0.05737445632694595.
End of tree iteration 4: 20 leaves on tree. Total mass=0.7434576142762839. Fractional error=0.044392266864005096.
End of tree iteration 5: 21 leaves on tree. Total mass=0.7202190360355167. Fractional error=0.0379954140957464.
End of tree iteration 6: 22 leaves on tree. Total mass=0.7513863158747637. Fractional error=0.03149506951718544.
End of tree iteration 7: 23 leaves on tree. Total mass=0.8339889812827559. Fractional error=0.030195154834775778.
End of tree iteration 8: 24 leaves on tree. Total mass=0.8306430529820299. Fractional error=0.027319816534651625.
End of tree iteration 9: 25 leaves on tree. Total mass=0.8214879324863464. Fractional error=0.024022305068063077.
End of tree iteration 10: 26 leaves on tree. Total mass=0.8212655371123205. Fractional error=0.020170878417001905.
End of tree iteration 11: 27 leaves on tree. Total mass=1.036600982070309. Fractional error=0.04561275498839591.
End of tree iteration 12: 28 leaves on tree. Total mass=0.9403724149369803. Fractional error=0.040115556753693306.
End of tree iteration 13: 29 leaves on tree. Total mass=1.0153738746357548. Fractional error=0.029641893611943086.
End of tree iteration 14: 30 leaves on tree. Total mass=0.9920301782839762. Fractional error=0.02550056875411536.
End of tree iteration 15: 31 leaves on tree. Total mass=0.9996099358258586. Fractional error=0.020629065815581528.
End of tree iteration 16: 32 leaves on tree. Total mass=0.9991518181346064. Fractional error=0.017421704112385743.
End of tree iteration 17: 33 leaves on tree. Total mass=1.007481377910349. Fractional error=0.014025866173345615.
End of tree iteration 18: 34 leaves on tree. Total mass=0.9991211437871177. Fractional error=0.012837793452107775.
End of tree iteration 19: 35 leaves on tree. Total mass=1.003017441434549. Fractional error=0.011304669166462753.
End of tree iteration 20: 36 leaves on tree. Total mass=1.006244913436298. Fractional error=0.010227186296841696.

As we know the ‘mass’ of this pdf (it’s 1!), we can see that the tree construction got us quite close.

Now, let’s sample.

samples = LintSampler(domain=tree,seed=rng).sample(N=1000000)

And visualise. Here, we’ll plot the histogram, as well as a background showing the tree refinement, from light (less refined), to dark (more refined.)

from matplotlib.patches import Rectangle
import matplotlib.cm as cm
fig = plt.figure()
ax = fig.add_subplot(111)
bins = np.linspace(-12, 12, 200)


for leafnum,leaf in enumerate(tree.leaves):
    rect = Rectangle((leaf.x[0], 0.), leaf.dx[0], 1., lw=1, facecolor=cm.Blues(leaf.level/10,1.), ec='none', alpha=0.6)
    ax.add_patch(rect)

ax.plot(bins, gmm_pdf(bins), label="True PDF", c='grey')
ax.hist(samples, bins=bins, density=True, label="Samples", fc='goldenrod')

ax.set_xlim(x_min,x_max)
ax.set_ylim(0.,0.4)
ax.set_xlabel('$x$', usetex=True)
Text(0.5, 0, '$x$')
../_images/a9104bef5cbbb348a46941b043fc2c9fb92bbe315906bcb7f3ec2805f161fcfa.png

Looks good!

Final checks#

Let’s just check that we do get near-perfect results if we increase the number of grid cells a lot (we could achieve the same by dramatically decreasing the error tolerance in the DensityTree method above). Here is outcome of a fixed grid with 512 cells:

N_cells = 512
x_min = -12
x_max = 12
fixedgridhires = np.linspace(x_min,x_max,N_cells)

L = LintSampler(fixedgridhires,pdf=gmm_pdf,vectorizedpdf=True,seed=rng)
samples = L.sample(N=1000000)
bins = np.linspace(-12, 12, 200)
plt.hist(samples, bins=bins, density=True, label="Samples", fc='goldenrod')
plt.plot(bins, gmm_pdf(bins), label="True PDF", c='teal')
plt.xlabel('$x$', usetex=True)
plt.legend()
plt.xlim(-12, 12);
../_images/4c9f9bb98e903470477a62a5919ca7f72aecba2bee1c936a1d480d81fcfce2e1.png

Just for fun, we can take the samples and fit a Gaussian mixture model just to check that we get results that roughly match the ‘truths’.

from sklearn.mixture import GaussianMixture
gm = GaussianMixture(n_components=3).fit(samples[:, None])
order = np.argsort(gm.means_.squeeze())
print(f"Fitted means:\n {gm.means_[order].squeeze()}")
print(f"True means:\n {[-3.0, 0.5, 2.5]}\n")

print(f"Fitted widths:\n {np.sqrt(gm.covariances_[order]).squeeze()}")
print(f"True widths:\n {[1.0, 0.25, 0.75]}\n")

print(f"Fitted weights:\n {gm.weights_[order]}")
print(f"True weights:\n {[0.4, 0.25, 0.35]}\n")
Fitted means:
 [-3.00061308  0.50435644  2.51712703]
True means:
 [-3.0, 0.5, 2.5]

Fitted widths:
 [0.99814541 0.2559285  0.73659458]
True widths:
 [1.0, 0.25, 0.75]

Fitted weights:
 [0.40016087 0.25339665 0.34644248]
True weights:
 [0.4, 0.25, 0.35]

So, the fitted GMM parameters very closely match the truths.

Comparison with importance sampling#

When you have an expensive PDF in low (say, 1 or 2) dimensions, lintsampler can be advantageous over alternative techniques because you can get a comparably representative sample with fewer PDF evaluations.

In this section we will demonstrate this by comparing lintsampler with importance sampling on the GMM example from above.

Throughout we’ll take 100000 samples:

N_SAMPLES = 100000

Let’s set up a function to draw samples from an arbitrary PDF via importance sampling (using a wide normal proposal distribution). There are a few points worth noting about this implementation of importance sampling (these points arise from the discussion here). First, in our implementation below we generate proposals, get their weights, then resample proposals from the weighted list (using a ‘systematic sampling’ technique). This process of resampling introduces additional variance. If, for instance, we wanted to compute expectations from the target PDF, we could attain lower-variance estimates by simply returning the proposals and weights (without resampling), and incorporating the weights in the Monte Carlo summation. We include this resampling step to provide a fair comparison with lintsampling; this way, both methods output a set of equal-weight samples. Note: unlike lintsampler, the importance sampling here uses a log-PDF (to guard against numerical over-/underflow).

There are also other ways we could improve this implementation of importance sampling. For example, we could choose proposal distribution more closely matching the target distribution. For a fair comparison with lintsampler we stick with a relatively agnostic PDF.

from scipy.special import logsumexp

def norm_logpdf(x, mu, sigma):
    return -((x - mu) / sigma)**2 / 2 - np.log(sigma) - np.log(2 * np.pi) / 2
    
def gmm_logpdf(x):
    mu = np.array([-3.0, 0.5, 2.5])
    sig = np.array([1.0, 0.25, 0.75])
    w = np.array([0.4, 0.25, 0.35])
    return logsumexp(np.log(w)[:, None] + norm_logpdf(np.asarray(x)[None], mu[:, None], sig[:, None]), axis=0)
def systematic_resampling(weights, rng, n_sample = None):
    if n_sample is None:
        n_sample = len(weights)
    u = rng.uniform() / n_sample
    indices = []
    cumulative_weights = np.cumsum(weights)
    i = 0
    while u <= 1:
        if u < cumulative_weights[i]:
            indices.append(i)
            u += 1 / n_sample
        else:
            i += 1
    return np.array(indices)

def generate_importance_samples(log_pdf, N, rng, mean, std):
    dist = norm(loc=mean, scale=std)
    proposals = dist.rvs(N)
    log_weights = log_pdf(proposals) - dist.logpdf(proposals)
    weights = np.exp(log_weights - logsumexp(log_weights))
    indices = systematic_resampling(weights, rng, N)
    return proposals[indices]

To test this out, let’s apply it to the GMM PDF above:

x_imp = generate_importance_samples(gmm_logpdf, N_SAMPLES, np.random.default_rng(42), 0, 6)

We can visualise the distribution of samples, alongside the true underlying distribution:

x_plot = np.linspace(-12, 12, 500)
plt.hist(x_imp, 80, density=True);
plt.plot(x_plot, gmm_pdf(x_plot));
../_images/c369dc7d7896f7b165243a9d698ccc3da99615c8acd809edcb8c7fda568292d9.png

For use below, let’s also compute the average log probability density of the samples:

lnp_imp = np.log(gmm_pdf(x_imp)).mean()

Now, we’ll draw samples from the GMM PDF with lintsampler. We’ll take a fixed grid from \(x=-12\) to \(x=12\) with evenly sized cells (although a variable grid would be easy to set up and better suited to the problem, as we show in the example notebook in the docs). We have to choose a grid size for this, and we want to find a grid size where the ‘performance’ of the samples (as quantified by their average log probability density) is comparable to the performance of importance sampling. So, let’s loop over grid sizes:

N_grid_arr = np.logspace(1, 17, 9, base=2, dtype=int) # 2**1, 2**3, 2**5, ... 2**17
lnp_ls = np.zeros(len(N_grid_arr))
for i, N_grid in enumerate(N_grid_arr):
    fixedgrid = np.linspace(-12, 12, N_grid)
    x_ls = LintSampler(fixedgrid, pdf=gmm_pdf, vectorizedpdf=True, seed=np.random.default_rng(42)).sample(N=N_SAMPLES)
    lnp_ls[i] = np.log(gmm_pdf(x_ls)).mean()

plt.plot(N_grid_arr, lnp_ls, label='Lintsampling')
plt.plot(N_grid_arr, lnp_imp * np.ones_like(N_grid_arr), ls='dashed', label='Importance Sampling')
plt.xscale('log')
plt.xlabel("LintSampler grid size")
plt.ylabel("Average log probability density")
plt.legend();
../_images/6ecb681b167a6d334bc9c01a55f83e12cf7268bea64211aadf9d0c4ac5cbffff.png

As expected, the performance of lintsampler is bad with coarse grids, but quickly improves, becoming comparable with importance sampling at around \(10^2\) grid points. This is a key point: drawing \(10^5\) samples with importance sampling required \(10^5\) PDF evaluations, but the same sample size only required \(10^2\) PDF evaluations with lintsampler to give just as ‘good’ a sample.

Of course, a caveat here is that the user needs to know what grid size will yield a good sample, but we argue that in most scenarios, the user has an a priori intuition for this. For example, in this (admittedly simplistic) test case, the width of the narrowest Gaussian is around 1/100th of the domain size, suggesting a natural grid size.

In this example, this reduction in evaluation count is not so important as the PDF is very cheap to evaluate, but one frequently encounters situations where this is not the case. In such situations, it is highly desirable to use a sampling technique which minimises the number of PDF evaluations.