Calibrated and Sharp Uncertainties in Deep Learning via Density Estimation

ICML 2022
Figure 1. Top: Credible intervals (90% and 99%) from a Bayesian neural network on a time series forecasting task. The 90% credible interval around the forecast (red) is miscalibrated and inaccurate: half of the points fall outside it. Middle: Quantile recalibration (Kuleshov et al., 2018) relabels the 99% credible interval as a 90% interval, which now correctly contains 9/10 points (orange). Bottom: Our recalibration method enforces distribution calibration: the 90% interval is unchanged at points where it is narrow and expands where it is wide. This yields an improved and narrower calibrated 90% interval (green).

Abstract

Accurate probabilistic predictions can be characterized by two properties: calibration and sharpness. However, standard maximum likelihood training yields models that are poorly calibrated and thus inaccurate: a 90% confidence interval typically does not contain the true outcome 90% of the time. This paper argues that calibration is important in practice and is easy to maintain by performing low-dimensional density estimation. We introduce a simple training procedure based on recalibration that yields calibrated models without sacrificing overall performance; unlike previous approaches, ours ensures the most general property of distribution calibration and applies to any model, including neural networks. We formally prove the correctness of our procedure assuming that we can estimate densities in low dimensions, and we establish uniform convergence bounds. Our results yield empirical performance improvements on linear and deep Bayesian models and suggest that calibration should be increasingly leveraged across machine learning.

Accurate Predictive Uncertainties in Machine Learning

An accurate probabilistic forecast from a machine learning model should be both calibrated and sharp. Calibration means that if a model predicts a 90% confidence interval, the true outcome should indeed fall within that interval 90% of the time. Sharpness refers to how narrow or tight these intervals are. Unfortunately, standard training methods based on maximizing likelihood often result in models that are poorly calibrated – their confidence levels don't match real-world probabilities. In safety-critical applications like medicine or autonomous systems, where decisions rely on accurate uncertainty estimates, calibration is essential. Our work, implemented in the DistCal library, demonstrates that calibration can be achieved through simple low-dimensional density estimation, implemented in a few lines of code without modifying the original model.

Quantile calibration requires that whenever a model predicts the $p$-th quantile of a continuous outcome, the true value falls below this prediction a fraction $p$ of the time. Distribution calibration is stronger: the entire predicted probability distribution over the outcome must match the true distribution of outcomes given that predicted distribution. Our DistCal library implements our methods for achieving distribution calibration, which in turn implies quantile calibration.
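
Stated more formally, in notation close to the paper's (where $F_X = H(X)$ denotes the predicted CDF for input $X$), the two notions read roughly as follows:

$$\text{Quantile calibration:}\quad \mathbb{P}\big(Y \le F_X^{-1}(p)\big) = p \quad \text{for all } p \in (0, 1),$$

$$\text{Distribution calibration:}\quad \mathbb{P}\big(Y \le y \mid H(X) = F\big) = F(y) \quad \text{for all forecasts } F \text{ and all } y.$$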

Our Approach: Recalibration as Density Estimation

The core idea behind DistCal is to treat recalibration as a simple density estimation problem. Given a base model $H$ that produces probabilistic forecasts $F_x = H(x)$, we aim to learn a recalibration model $R$ such that the composite model $R \circ H$ is distribution-calibrated. This is achieved by training $R$ to estimate the true conditional distribution $P(Y \mid H(X)=F)$.

To make this tractable, we first define a featurization $\phi(F_x)$ that represents the original forecast $F_x$ with a low-dimensional vector (e.g., using its quantiles or parameters of the outcome distribution). The recalibrator $R$ then learns the mapping from these features $\phi(F_x)$ to a calibrated distribution over $Y$. This learning process minimizes a proper scoring rule on a dedicated calibration dataset, separate from the dataset used to train the model $H$. A proper scoring rule is minimized in expectation when the predicted distribution precisely matches the true data-generating distribution.

This general framework (Algorithm 1 in our paper) is then specialized to accommodate regression (continuous outcomes) and classification (discrete outcomes) settings.
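
To make the idea concrete, here is a rough sketch of recalibration as low-dimensional density estimation. It is not the DistCal API: the class and function names are hypothetical, and the recalibrated output is restricted to a Gaussian for simplicity, whereas the actual method is not limited to Gaussian outputs.

import torch
import torch.nn as nn

class SketchRecalibrator(nn.Module):
    """Maps low-dimensional forecast features phi(F_x) to a recalibrated Gaussian over Y."""
    def __init__(self, n_features=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # outputs (mean, log_std) of the recalibrated distribution
        )

    def forward(self, phi):
        out = self.net(phi)
        mean, log_std = out[:, 0], out[:, 1]
        return torch.distributions.Normal(mean, log_std.exp())

def fit_recalibrator(phi_cal, y_cal, epochs=200, lr=1e-2):
    # phi_cal: features of the base forecasts on a held-out calibration set,
    #          e.g. predicted mean and std, shape (n, 2); y_cal: outcomes, shape (n,).
    model = SketchRecalibrator(n_features=phi_cal.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        # Negative log-likelihood is a proper scoring rule: it is minimized in
        # expectation by the true conditional density P(Y | phi(F_x)).
        loss = -model(phi_cal).log_prob(y_cal).mean()
        loss.backward()
        opt.step()
    return model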

Compared to previous methods, our approach is simpler, more broadly applicable (e.g., not limited to Gaussian outputs), and directly targets the strong notion of distribution calibration.

Using DistCal: A Demo

The following Python snippets demonstrate how to use the core calibrators from our DistCal library, showing a simplified end-to-end flow from a base model's predictions to calibrated outputs. For a more detailed walkthrough with actual dataset loading and comprehensive evaluation, please see the full demo notebook.

Installation

Clone the repository and create the conda environment:

git clone https://github.com/shachideshpande/DistCal.git
cd DistCal
conda env create -f env_nobuilds.yml
conda activate distcal

This makes the DistCal package available within the distcal environment. Ensure that torchuq (shipped as part of DistCal) is on your PYTHONPATH if you are not installing it globally.
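
If you would rather not set environment variables, one option is to add the clone location to sys.path at the top of your script. This assumes the torchuq package sits at the root of the DistCal repository; the path below is a placeholder.

import sys

# Placeholder path: point this at your local clone of the DistCal repository
# so that `import torchuq` resolves without a global install.
sys.path.insert(0, "/path/to/DistCal")

import torchuq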

1. Discrete Distribution Calibration (Classification)

This example shows how to train a simple base classifier (Logistic Regression on UCI Digits) and then use DiscreteDistCalibrator to recalibrate its probabilistic outputs.

First, import necessary libraries:


import torch
import numpy as np
from sklearn.linear_model import LogisticRegression
from torchuq.transform.distcal_discrete import DiscreteDistCalibrator
from torchuq.dataset.classification import get_classification_datasets
from torchuq.evaluate.distribution_cal import discrete_cal_score

1. Load UCI Digits dataset (train, calibration, and test splits):


dataset = get_classification_datasets('digits', val_fraction=0.2, test_fraction=0.2, split_seed=0, normalize=True, verbose=False)
(X_train_d, y_train_d) = dataset[0][:][0], dataset[0][:][1]
(X_cal_d, y_cal_d) = dataset[1][:][0], dataset[1][:][1]
(X_test_d, y_test_d) = dataset[2][:][0], dataset[2][:][1]

2. Train a simple base classification model (Logistic Regression):


base_model_d = LogisticRegression(max_iter=200, solver='lbfgs', random_state=0).fit(X_train_d, y_train_d)

3. Get uncalibrated probabilities from the base model for the calibration and test sets:


probs_cal_discrete = torch.tensor(base_model_d.predict_proba(X_cal_d), dtype=torch.float32)
probs_test_discrete = torch.tensor(base_model_d.predict_proba(X_test_d), dtype=torch.float32)

4. Initialize and train the DiscreteDistCalibrator on the calibration data:


discrete_calibrator = DiscreteDistCalibrator(verbose=False)
discrete_calibrator.train(probs_cal_discrete, y_cal_d.long())

5. Apply the trained calibrator to test probabilities and calculate calibration scores. You can print these scores to observe the improvement.


calibrated_probs_test = discrete_calibrator(probs_test_discrete)
score_before = discrete_cal_score(y_test_d, probs_test_discrete)
score_after = discrete_cal_score(y_test_d, calibrated_probs_test)
# To see the scores:
# print(f"Calibration Score Before: {score_before:.4f}")
# print(f"Calibration Score After (DistCal): {score_after:.4f}")

2. Continuous Distribution Calibration (Regression)

This example shows how to train a simple base regression model (Bayesian Ridge on California Housing), convert its output to quantiles, and then use DistCalibrator to recalibrate these quantiles.

First, import necessary libraries:


import torch
import numpy as np
from sklearn.linear_model import BayesianRidge
from torchuq.transform.distcal_continuous import DistCalibrator
from torchuq.transform.calibrate import convert_normal_to_quantiles
from torchuq.dataset.regression import get_regression_datasets
from torchuq.evaluate import quantile as q_eval

1. Load California Housing dataset (train, calibration, and test splits) and define the number of quantile buckets:


dataset_c = get_regression_datasets('cal_housing', val_fraction=0.2, test_fraction=0.2, split_seed=0, normalize=True, verbose=False)
(X_train_c, y_train_c) = dataset_c[0][:][0], dataset_c[0][:][1]
(X_cal_c, y_cal_c) = dataset_c[1][:][0], dataset_c[1][:][1]
(X_test_c, y_test_c) = dataset_c[2][:][0], dataset_c[2][:][1]
num_quantile_buckets = 20

2. Train a simple base regression model (Bayesian Ridge):


base_model_c = BayesianRidge().fit(X_train_c, y_train_c)

3. Get uncalibrated predictions (mean, std) and convert them to quantiles for both calibration and test sets:


mean_cal_c, std_cal_c = base_model_c.predict(X_cal_c.numpy(), return_std=True)
quantiles_cal_c = convert_normal_to_quantiles(
    torch.tensor(mean_cal_c, dtype=torch.float32),
    torch.tensor(std_cal_c, dtype=torch.float32).clamp(min=1e-3),
    num_buckets=num_quantile_buckets
)

mean_test_c, std_test_c = base_model_c.predict(X_test_c.numpy(), return_std=True)
quantiles_test_c = convert_normal_to_quantiles(
    torch.tensor(mean_test_c, dtype=torch.float32),
    torch.tensor(std_test_c, dtype=torch.float32).clamp(min=1e-3),
    num_buckets=num_quantile_buckets
)
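
For context, converting a Gaussian forecast to quantiles amounts to evaluating its inverse CDF at a grid of levels, so convert_normal_to_quantiles presumably computes something along the lines of

$$q_{p_k} = \mu(x) + \sigma(x)\,\Phi^{-1}(p_k), \qquad k = 1, \dots, K,$$

where $\Phi^{-1}$ is the standard normal quantile function, $\mu(x)$ and $\sigma(x)$ are the predicted mean and standard deviation, and $K$ is num_buckets (the exact choice of levels $p_k$ is up to the library).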

4. Initialize and train the continuous DistCalibrator on the calibration quantiles:


continuous_calibrator = DistCalibrator(num_buckets=num_quantile_buckets, quantile_input=True, verbose=False)
continuous_calibrator.train(quantiles_cal_c, y_cal_c.float(), num_epochs=10)

5. Apply the trained calibrator to test quantiles and calculate average check scores. You can print these scores to observe the improvement.


calibrated_quantiles_test = continuous_calibrator(quantiles_test_c)
quantiles_for_eval = torch.linspace(0.05, 0.95, 19) # Define quantile levels for evaluation
score_before = q_eval.check_score(quantiles_test_c, y_test_c.unsqueeze(-1), quantiles_for_eval).mean()
score_after = q_eval.check_score(calibrated_quantiles_test, y_test_c.unsqueeze(-1), quantiles_for_eval).mean()
# To see the scores:
# print(f"Avg. Check Score Before: {score_before:.4f}")
# print(f"Avg. Check Score After (DistCal): {score_after:.4f}")

As shown in our paper and the full demo notebook, applying these DistCal calibrators significantly improves standard calibration metrics (such as discrete calibration scores or continuous check scores), often without negatively impacting task-specific accuracy or error metrics.

Concluding Thoughts

Accurate predictive uncertainties should maximize sharpness subject to being calibrated. Our work and the DistCal library provide a practical and theoretically grounded approach to achieve robust distribution calibration for a wide range of machine learning models. By making distribution recalibration easy to implement for both classification and regression, we aim to promote its broader adoption.

BibTeX

@InProceedings{pmlr-v162-kuleshov22a,
  title     = {Calibrated and Sharp Uncertainties in Deep Learning via Density Estimation},
  author    = {Kuleshov, Volodymyr and Deshpande, Shachi},
  booktitle = {Proceedings of the 39th International Conference on Machine Learning},
  pages     = {11683--11693},
  year      = {2022},
  editor    = {Chaudhuri, Kamalika and Jegelka, Stefanie and Song, Le and Sabato, Sivan and Niu, Gang and Szepesvari, Csaba},
  volume    = {162},
  series    = {Proceedings of Machine Learning Research},
  month     = {17--23 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v162/kuleshov22a/kuleshov22a.pdf},
  url       = {https://proceedings.mlr.press/v162/kuleshov22a.html}
}