CVVDP

Via the CVVDP GitHub README:

ColorVideoVDP is a full-reference visual quality metric that predicts the perceptual difference between pairs of images or videos. Similar to popular metrics like PSNR, SSIM, and DeltaE 2000 it is aimed at comparing a ground truth reference against a distorted (e.g. blurry, noisy, color-shifted) version.

This metric is unique because it is the first color-aware metric that accounts for spatial and temporal aspects of vision.

The description of the metric process below is borrowed from the fcvvdp documentation.

Metric Process

Initialization and Display Modeling

Before processing pixels, the metric models the viewing environment through a specified display's angular resolution (pixels per degree).

Using specified display parameters (resolution, diagonal size, viewing distance), CVVDP calculates Pixels Per Degree (PPD). This determines how large a pixel appears to the eye.
The Contrast Sensitivity Function (CSF) determines how sensitive the eye is to specific spatial frequencies. It maps the display's frequency bands (derived from PPD) to sensitivity values.
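
As a rough sketch of the geometry involved, the snippet below derives PPD from a display's resolution, diagonal size, and viewing distance. The helper name `pixels_per_degree` and the 0.6 m viewing distance are illustrative assumptions, not values taken from CVVDP or fcvvdp.

```python
import math

def pixels_per_degree(width_px, height_px, diagonal_in, distance_m):
    """Approximate angular resolution at the screen center (hypothetical helper)."""
    aspect = width_px / height_px
    diagonal_m = diagonal_in * 0.0254              # inches to meters
    height_m = diagonal_m / math.sqrt(1.0 + aspect ** 2)
    width_m = aspect * height_m
    pixel_m = width_m / width_px                   # physical width of one pixel
    # Visual angle subtended by a single pixel, in degrees
    pixel_deg = 2.0 * math.degrees(math.atan(0.5 * pixel_m / distance_m))
    return 1.0 / pixel_deg

# A 24" FullHD panel viewed from an assumed distance of 0.6 m
ppd = pixels_per_degree(1920, 1080, 24.0, 0.6)
print(f"{ppd:.1f} pixels per degree")
```

Sitting farther from the screen raises PPD, so the same pixel pattern lands at a higher spatial frequency (in cycles per degree) when the CSF is evaluated.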

Available display model presets include:

Model	Description
`fhd`	24" FullHD monitor, 200 cd/m², office lighting (default)
`4k`	30" 4K monitor, 200 cd/m², office lighting
`hdr_pq`	30" 4K HDR, 1500 cd/m², low light
`hdr_hlg`	30" 4K HDR HLG, 1500 cd/m², low light
`hdr_linear`	30" 4K HDR linear, 1500 cd/m², low light
`hdr_dark`	30" 4K HDR, 1500 cd/m², dark room
`hdr_zoom`	30" 4K HDR, 10000 cd/m², close viewing

Input Loading and Display Mapping

Input images (uint8, uint16, or float) are converted to linear float RGB. If the input is integer-based, sRGB gamma decoding (approx. 2.4 power) is applied.
The linear RGB values are converted into absolute physical light units (cd/m², i.e. nits) based on the display model:
- SDR: clips values to the range 0 to 1, scales by the display's maximum luminance, then adds the black level and reflected ambient light.
- HDR: applies the PQ or HLG transfer function, clips to the display's peak luminance, then adds the black level and reflected ambient light.
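
The SDR path can be sketched for a single pixel value as follows. This is a minimal illustration under stated assumptions: it uses the standard piecewise sRGB decoding, and the function names as well as the black-level and reflection values are placeholders rather than parameters read from a real display preset.

```python
def srgb_to_linear(v):
    """Standard piecewise sRGB decoding for an encoded value in [0, 1]."""
    if v <= 0.04045:
        return v / 12.92
    return ((v + 0.055) / 1.055) ** 2.4

def sdr_to_luminance(linear, peak_nits=200.0, black_nits=0.2, reflected_nits=0.5):
    """Map a linear value to cd/m² (black and reflection levels are assumed)."""
    clipped = min(max(linear, 0.0), 1.0)           # clip to the displayable range
    return clipped * peak_nits + black_nits + reflected_nits

pixel_8bit = 128
linear = srgb_to_linear(pixel_8bit / 255.0)        # integer input to linear light
nits = sdr_to_luminance(linear)
print(f"linear={linear:.4f}, luminance={nits:.2f} cd/m²")
```

Because of the black level and reflections, even a fully black pixel maps to a small non-zero luminance, which matters for the contrast computed in later steps.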

Color Space Conversion

Linear RGB is converted to the CIE XYZ color space.
XYZ is transformed to DKL (Derrington-Krauskopf-Lennie), an opponent color space that models how the human visual system encodes color:
- Y: Luminance, i.e. achromatic brightness (L+M cones)
- RG: Chromatic difference (L-M cones)
- YV: S-cone opponent channel (S - (L+M))
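
The chain from linear RGB to opponent channels can be sketched like this. Note the assumptions: the sRGB (D65) primaries matrix and the Hunt-Pointer-Estevez XYZ-to-LMS matrix are stand-ins chosen because they are widely published, and the opponent channels are left unscaled. CVVDP's own matrices and normalization may differ.

```python
# Linear sRGB (D65) to CIE XYZ
SRGB_TO_XYZ = (
    (0.4124564, 0.3575761, 0.1804375),
    (0.2126729, 0.7151522, 0.0721750),
    (0.0193339, 0.1191920, 0.9503041),
)
# Hunt-Pointer-Estevez XYZ to LMS (assumed stand-in, not CVVDP's matrix)
XYZ_TO_LMS_HPE = (
    (0.4002, 0.7076, -0.0808),
    (-0.2263, 1.1653, 0.0457),
    (0.0, 0.0, 0.9182),
)

def mat_vec(matrix, vector):
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(a * b for a, b in zip(row, vector)) for row in matrix)

def linear_rgb_to_opponent(rgb):
    """Return (achromatic, red-green, yellow-violet) cone-opponent signals."""
    xyz = mat_vec(SRGB_TO_XYZ, rgb)
    l, m, s = mat_vec(XYZ_TO_LMS_HPE, xyz)
    achromatic = l + m                 # L+M
    red_green = l - m                  # L-M
    yellow_violet = s - (l + m)        # S-(L+M)
    return achromatic, red_green, yellow_violet

for name, rgb in (("red", (1, 0, 0)), ("green", (0, 1, 0)), ("blue", (0, 0, 1))):
    print(name, [round(c, 3) for c in linear_rgb_to_opponent(rgb)])
```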

Temporal Decomposition

If the input is video (FPS > 0), the metric analyzes how pixel values change over time. It maintains a ring buffer (TemporalRingBuf) of previous frames.

The code applies Finite Impulse Response (FIR) filters to the history of DKL frames, i.e. it filters each pixel along the time axis.
Low temporal frequency information (static or slow-moving content) is stored in the sustained channels, which are calculated for luminance (Y), red-green (RG), and yellow-violet (YV).
High temporal frequency information (flicker or fast motion) is stored in the transient channel, which is calculated only for luminance.

We now have 4 channels to process spatially: Y_sus, RG_sus, YV_sus, Y_trans.
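
To make the sustained/transient split concrete, here is a toy version that filters one luminance value per frame. The class `ToyTemporalBuffer` and both kernels are made up for illustration; they are not the filters or the buffer API used by CVVDP or fcvvdp.

```python
from collections import deque

# Illustrative kernels, newest frame first (assumed values, not CVVDP's filters)
SUSTAINED_KERNEL = (0.4, 0.3, 0.2, 0.1)     # low-pass: weights sum to 1
TRANSIENT_KERNEL = (0.5, 0.0, -0.5, 0.0)    # band-pass: weights sum to 0

class ToyTemporalBuffer:
    """Keeps the most recent frames and applies FIR filters over them."""

    def __init__(self, length=len(SUSTAINED_KERNEL)):
        self.frames = deque(maxlen=length)

    def push(self, frame_value):
        if not self.frames:
            # Pad the history with the first frame so filtering can start at once
            self.frames.extend([frame_value] * self.frames.maxlen)
        else:
            self.frames.append(frame_value)  # the oldest frame drops out

    def filter(self, kernel):
        newest_first = reversed(self.frames)
        return sum(k * f for k, f in zip(kernel, newest_first))

buf = ToyTemporalBuffer()
for luminance in (5.0, 5.0, 5.0, 9.0):       # a sudden jump on the last frame
    buf.push(luminance)
y_sus = buf.filter(SUSTAINED_KERNEL)
y_trans = buf.filter(TRANSIENT_KERNEL)
print(y_sus, y_trans)
```

With a static input the transient output is zero, since that kernel's weights sum to zero. Only a change over time, such as the jump on the last frame, produces a transient response.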

Spatial Decomposition

The visual system processes different sizes of features (frequencies) independently. CVVDP implements a Gaussian pyramid to simulate this.

The image is repeatedly downscaled (blurred and subsampled).
At each level in the pyramid, local contrast is computed in two steps.
The upscaled version of the next coarser level is subtracted from the current level (a difference of Gaussians).
This difference is normalized by the local background luminance (L_BKG) to get Weber contrast.
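
The steps above can be sketched on a 1-D signal, which keeps the example short. The [1, 2, 1]/4 blur kernel, the nearest-neighbor upscaling, and the use of the upscaled coarser level as the background luminance are all simplifying assumptions; the real implementations work on 2-D images with their own kernels.

```python
def downscale(signal):
    """Blur with a [1, 2, 1]/4 kernel (edges clamped), then keep every 2nd sample."""
    n = len(signal)
    blurred = [
        (signal[max(i - 1, 0)] + 2 * signal[i] + signal[min(i + 1, n - 1)]) / 4
        for i in range(n)
    ]
    return blurred[::2]

def upscale(signal, length):
    """Nearest-neighbor upscaling back to the finer level's length."""
    return [signal[min(i // 2, len(signal) - 1)] for i in range(length)]

def weber_contrast_bands(signal, levels):
    """Return one band of Weber contrast per pair of adjacent pyramid levels."""
    gaussian = [list(signal)]
    for _ in range(levels - 1):
        gaussian.append(downscale(gaussian[-1]))
    bands = []
    for fine, coarse in zip(gaussian, gaussian[1:]):
        background = upscale(coarse, len(fine))      # stands in for L_BKG
        bands.append([(f - b) / max(b, 1e-4) for f, b in zip(fine, background)])
    return bands

edge = [10.0] * 8 + [100.0] * 8                      # luminance step in cd/m²
bands = weber_contrast_bands(edge, levels=3)
print([round(c, 3) for c in bands[0]])
```

Flat regions produce zero contrast at every level, while the step produces a response concentrated around the edge.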

CSF Weighting & Difference Calculation

For every pixel at every pyramid level, the contrast is multiplied by the CSF sensitivity. This scaling depends on:
- Spatial Frequency, which is determined by the pyramid level
- Background Luminance, where brighter areas generally have lower sensitivity to absolute differences
- Channel, because the eye is less sensitive to chroma changes (RG/YV) than luma changes.
The absolute difference between the Reference and Distorted contrast values is calculated.
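
In code form, this step amounts to scaling both contrast maps by the same sensitivity and taking the absolute difference. The sketch below takes the sensitivity as an argument because computing it requires the actual CSF model; the numbers in the usage lines are made-up placeholders, not CSF outputs.

```python
def csf_weighted_difference(ref_contrast, dist_contrast, sensitivity):
    """Per-pixel |S * c_dist - S * c_ref| for one channel at one pyramid level."""
    return [
        abs(sensitivity * d - sensitivity * r)
        for r, d in zip(ref_contrast, dist_contrast)
    ]

ref = [0.10, 0.00, -0.05]
dist = [0.20, 0.00, -0.05]
# Placeholder sensitivities: higher for luma than for chroma, as described above
luma_diff = csf_weighted_difference(ref, dist, sensitivity=10.0)
chroma_diff = csf_weighted_difference(ref, dist, sensitivity=2.0)
print(luma_diff, chroma_diff)
```

The same contrast error therefore counts for more in a channel and frequency band where the eye is more sensitive.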

Visual Masking

This is the most complex step. It is designed to account for the fact that artifacts are harder to see in textured areas.

The code computes the per-pixel minimum of the reference and distorted activity.
A Gaussian blur is applied to this activity map to simulate the spatial extent of masking.
Activity in one channel can mask errors in another. The code computes a masking denominator using weighted sums of activity from all 4 channels.
The final difference d is compressed using a non-linear, sigmoid-like function (not covered here; the code is the best reference for the details).
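
The structure of the masking step, for a single pixel, might look like the sketch below. Everything numeric here is an assumption made for illustration: the exponents `p` and `q`, the cross-channel weights, the `d_max * d / (d_max + d)` saturation curve, and the omission of the Gaussian blur. Consult the source code for the actual formulation.

```python
CHANNELS = ("Y_sus", "RG_sus", "YV_sus", "Y_trans")

def masked_differences(diff, act_ref, act_dist, weights, p=2.0, q=2.0, d_max=10.0):
    """Apply mutual, cross-channel masking and soft saturation to one pixel."""
    # Mutual activity: the smaller of the two activities in each channel
    activity = {c: min(abs(act_ref[c]), abs(act_dist[c])) for c in CHANNELS}
    result = {}
    for c in CHANNELS:
        # Masking denominator: weighted sum of activity from all four channels
        mask = sum(weights[c][k] * activity[k] ** q for k in CHANNELS)
        d = diff[c] ** p / (1.0 + mask)
        result[c] = d_max * d / (d_max + d)          # saturates towards d_max
    return result

weights = {c: {k: 0.1 for k in CHANNELS} for c in CHANNELS}   # placeholder weights
diff = {c: 1.0 for c in CHANNELS}
smooth = {c: 0.0 for c in CHANNELS}                  # no texture to hide errors
textured = {c: 5.0 for c in CHANNELS}                # busy region
print(masked_differences(diff, smooth, smooth, weights)["Y_sus"])
print(masked_differences(diff, textured, textured, weights)["Y_sus"])
```

The same difference yields a smaller masked value in the textured case, which is the behavior this step is meant to capture.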

Pooling & Scoring

The masked differences are aggregated across the image using a Minkowski norm (exponent 4), then averaged.
Scores from all pyramid levels and all four channels are summed.
Scores are accumulated over frames using a power sum.
The final raw quality metric (Q) is mapped to JOD (Just Objectionable Difference), which is a more meaningful perceptual score. A score of 10.0 is a perfect match (no visible difference), and lower scores indicate worse quality.
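
A compact sketch of this pooling chain is shown below. The spatial exponent of 4 comes from the description above; the temporal exponent, the `10 - a * Q^b` form of the JOD mapping, and the constants `a` and `b` are assumptions chosen only to show the shape of the computation.

```python
def minkowski_mean(values, p=4.0):
    """Power mean: large errors dominate more as p grows."""
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def frame_score(band_maps):
    """Pool each band spatially, then sum over pyramid levels and channels."""
    return sum(minkowski_mean(diff_map, p=4.0) for diff_map in band_maps)

def video_score(frame_scores, p_temporal=4.0):
    """Accumulate per-frame scores over time (exponent is an assumed value)."""
    return minkowski_mean(frame_scores, p=p_temporal)

def to_jod(q, a=0.05, b=1.0):
    """Map raw quality Q to a JOD-like scale (assumed form and constants)."""
    return 10.0 - a * q ** b

# Two frames, each with two bands of masked differences
frames = [
    [[0.0, 0.0, 0.0, 4.0], [1.0, 1.0]],
    [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0]],
]
q = video_score([frame_score(f) for f in frames])
print(f"Q={q:.3f}, JOD={to_jod(q):.3f}")
```

With an exponent of 4, a single large localized error raises the pooled score far more than the plain average would, so small but severe artifacts are not washed out.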

Score	Interpretation
10.0	Images are identical
9.0 - 10.0	Barely visible difference
8.0 - 9.0	Slight visible difference
7.0 - 8.0	Noticeable but acceptable
5.0 - 7.0	Clearly visible, somewhat annoying
3.0 - 5.0	Very visible, annoying difference
< 3.0	Large, unacceptable difference

Implementations

Several different implementations of CVVDP are available. Third-party implementations tend to run faster than the reference implementation.

CVVDP

CVVDP (ColorVideoVDP) is the first-party reference implementation, developed at the University of Cambridge. It is implemented in Python on top of PyTorch, so it can run on either the CPU or the GPU.

Vship

Vship is a GPU-accelerated metrics toolkit compatible with VapourSynth. It also ships a standalone FFVship binary that can be used independently of VapourSynth. Vship's CVVDP implementation is an order of magnitude faster than the reference implementation.

fcvvdp

fcvvdp is a fast CPU-based CVVDP implementation by Halide Compression. It claims to be over 221% faster than the reference implementation with multithreading, making it useful in environments without access to GPUs.

Visualization

The graph below (from fcvvdp's docs) visualizes the CVVDP metric process:

[Figure: diagram of the CVVDP metric process]
