What's In a JPEG?

Published December 29, 2022 • more posts

Recently, I set out to create my own photo organizer. In the process, I spent a lot of time experimenting with various image formats, which led me to an epiphany: JPEGs are ridiculously good at compressing image data, far beyond what you'd expect! Take this picture, for example:

a cormorant
This poor cormorant has no idea what he's about to go through.

This image is 299 by 400 pixels. Each pixel consists of three components, red, green, and blue. The brightness of each component is encoded as an 8-bit value, so each pixel contains three bytes of information. Multiply that by the number of pixels, and we get a file size of around 350 kilobytes. Yet the image shown above is actually only 43kB in size, just 12% of the value we just calculated. In other words, by encoding the image using the JPEG format, we can achieve a compression ratio of roughly 8:1. How does JPEG accomplish this astonishing feat? Let's dive in.

Chroma Subsampling §

The first trick that JPEG employs to remove unnecessary information is a technique called chroma subsampling. This step exploits the fact that our eyes are much more sensitive to changes in brightness than changes in color. After all, our eyes have about 100 million brightness-sensitive rod cells versus just 5 million color-sensitive cones. Thus, we can store the color info in the image at a lower resolution without losing too much quality.

In order to chroma subsample, we must first separate the color of a pixel from its brightness. JPEG accomplishes this by converting images to a color space called YCbCr. Like RGB, YCbCr pixels also consist of three values; however, the meaning of the values are different. Y is the luminance (brightness) of the pixel, and Cb and Cr together represent the color of the pixel without any luminance info.

But how is RGB mapped to YCbCr? In my opinion, the relationship between the two color spaces is best explained visually. Imagine a coordinate space where the x, y, and z axes represented R, G, and B, respectively. This would form a cube containing all possible RGB colors.

This cube has one important property: there exists a line through the cube where the R, G, and B values are all equal. One can imagine a coordinate system where we align the cube such that the luminance component (Y) extends along this line. We can then extract a slice of the cube for any given luminance, and assign the remaining two degrees of freedom to Cb and Cr.

This is essentially how YCbCr works, except the RGB cube is slightly deformed in order to fit all the values into a range of [0, 1]. Here's a demo that shows the Cb-Cr planes as we adjust Y.

No, that Y is definitely not backwards.

If you want a better view of what the Cb-Cr plane looks like, here is a slice of the YCbCr gamut at Y = 0.5.

ycbcr diagram

When an image is stored as JPEG, the first thing that happens is that the RGB colors are converted to YCbCr. The Cb and Cr channels are stored at half the resolution of the full image, a scheme which is referred to as 4:2:0.

Let's compare what the components of the image look like in the two color spaces. Here's what the image looks like in RGB.

rgb components of the image

And here's what the image looks like in YCbCr:

image in ycbcr color space

As you can see, the importance of the luminance channel really shines through here. There is very little appreciable detail in the Cb and Cr channels, unlike in RGB space, where each channel is perceived roughly equally in the final image. We can safely discard much of the detail in the chrominance channels without sacrificing too much quality in the final image. However, this still doesn't bring us to the astounding compression ratios that JPEG achieves on a regular basis. For that, we'll need to go deeper into the compression process.

Discrete Cosine Transform §

In the previous step, we reduced size by getting rid of color information. However, we can also drastically reduce size by getting rid of unnecessary spatial information. What does that mean? Consider this photo of a flower.

macro photograph of a pepper flower

If we zoom in close on the two highlighted regions, it becomes clear that not all image data is created equal. The region on the left contains much more detail than the one on the right, yet they are currently represented in the same number of bytes.

zoom in on sections of flower image

Like last time, the problem now becomes representing the image data in a way that lets us separate the important parts from the unimportant parts. JPEG accomplishes this by converting the image to the frequency domain.

Right now, the image exists in the spatial domain. We can think of it as a function that takes a pixel position and outputs an intensity value. However, thanks to the work of the French mathematician Joseph Fourier, we know that signals of this type can also be expressed as the sum of an infinite series of sinusoids with varying frequencies. From his insight came the concept of the Fourier transform, which converts our function from the spatial domain to the frequency domain (where it would map the frequency of each component to its amplitude). This allows us to discard the high-frequency components, which aren't as important in reconstructing the signal.

JPEG does not use the Fourier transform; instead, it uses the closely related discrete cosine transform, whose characteristics make it better suited for use in lossy compression. Before we go further, try drawing a waveform using your mouse in the box on the left. The DCT of the signal will be taken and the signal will be reconstructed and displayed on the right.

You can use the slider to change how many DCT coefficients are used to reconstruct the signal.

Notice how removing high-frequency components from the signal affects the quality much less than the lower-frequency components. JPEG essentially does the same thing, but in 2D. The image is broken up into 8x8 blocks, which are then decomposed into a matrix of 64 coefficients.

JPEG DCT matrix
Every 8x8 block in a JPEG is expressed as a sum of these components.

Note that DCT itself doesn't actually save us any bytes. We started with 64 pixels, and ended with 64 DCT coefficients.

Quantization §

Instead of simply removing all high frequency components after a certain cutoff like we did in our 1D DCT example, JPEG uses a strategy called quantization. Essentially, each DCT coefficient is divided by the corresponding value in an 8x8 quantization table, and then rounded down. The result is that many components of the quantized DCT coefficients will become zero, making the data more conducive for lossless compression.

Let's see how quantization works for ourselves. Say we've just performed DCT on some image data:

example of DCT on a cat's eye

We now have a table of DCT coefficients.

DCT Coefficients

-13868192269-24-7-29
-4731-745-6180
16613-40-60-7320-1642
103-38-234829-22-7
95-1031-614926-10
-5-71924-179-42
49-6-683130-913-16
9-6-1115-3-7-2-3

Next, it's time to quantize. We will be using the Independent JPEG Group's recommended quantization table for this example.

IJG Quantization Table

1611101624405161
1212141926586055
1413162440576956
1417222951878062
182237566810910377
243555648110411392
49647887103121120101
7292959811210010399

Dividing our matrix of coefficients and rounding down the results yields the following quantized values:

Quantized DCT Coefficients

-861902000
060-30000
111-2-2-1000
00-100000
00000000
00000000
10000000
00000000

As you can see, many of the high-frequency components have become zero. This is what is accomplished by quantization; it hasn't actually reduced the size of our data yet (we still have 64 numbers), but by getting rid of unnecessary precision it has made the DCT coefficients more easily compressed by a lossless compression algorithm that will be applied next.

Lossless Compression §

jpeg zigzag pattern

In the final step of JPEG compression, the DCT matrix is unraveled according to a zigzag pattern, and Huffman coding is applied to compress the data. How Huffman coding works is beyond the scope of this article, but if you want a concise, accessible explanation of the technique, I encourage you to check out Tom Scott's video on the subject.

Conclusion §

If you've reached this point, congratulations; you now have a high-level understanding of the processes that take place every time you save an image as a JPEG.

There is a lot that isn't covered here; if you want to know more about the mathematics of the Fourier transform, or explore the actual implementation details of the JPEG standard, I've included some links that you can follow to continue learning about the fascinating enigma that is the JPEG format.

Further Reading §