
Add jxl chunked save #4174

Merged
merged 46 commits into from May 8, 2025

Conversation

jcupitt
Member

@jcupitt jcupitt commented Sep 29, 2024

This PR adds support for the new libjxl chunked save API.

We were preparing a whole frame of pixels and then writing it in one shot, which is obviously not great for large images. This adds a new save path using JxlEncoderAddChunkedFrame(). With a 30k x 30k sample image I see:

$ /usr/bin/time -f %M:%e vips copy st-francis.jpg x.jxl
5719092:586.74
$ /usr/bin/time -f %M:%e vips copy st-francis.jpg x.jxl[chunked]
854360:573.48

It's very slightly faster, since it can overlap JPEG decode with JXL encode, and it needs 7x less memory.
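For reference, this is roughly how the chunked input fits together: libjxl pulls pixel rectangles on demand through a JxlChunkedFrameInputSource, so the caller never has to materialise the whole frame. A minimal sketch, not this PR's actual code -- TileSource, generate_tile(), tiles and frame_settings are hypothetical stand-ins for the libvips pipeline state:

#include <stdlib.h>
#include <stdint.h>
#include <jxl/encode.h>

typedef struct {
    uint32_t bands;
    size_t sizeof_pixel;
    /* ... handle to the source image ... */
} TileSource;

/* Hypothetical helper: render just the requested rectangle. */
void *generate_tile(TileSource *tiles,
    size_t xpos, size_t ypos, size_t xsize, size_t ysize);

static void
tile_pixel_format(void *opaque, JxlPixelFormat *format)
{
    TileSource *tiles = (TileSource *) opaque;

    format->num_channels = tiles->bands;
    format->data_type = JXL_TYPE_UINT8;
    format->endianness = JXL_NATIVE_ENDIAN;
    format->align = 0;
}

static const void *
tile_data_at(void *opaque, size_t xpos, size_t ypos,
    size_t xsize, size_t ysize, size_t *row_offset)
{
    TileSource *tiles = (TileSource *) opaque;
    void *buf = generate_tile(tiles, xpos, ypos, xsize, ysize);

    /* Bytes between the starts of successive rows in buf. */
    *row_offset = xsize * tiles->sizeof_pixel;

    return buf;
}

static void
tile_release(void *opaque, const void *buf)
{
    free((void *) buf);
}

With those in place, the old one-shot JxlEncoderAddImageFrame() call becomes something like:

/* Extra channel callbacks omitted; they're needed once the image
 * has more than four bands.
 */
JxlChunkedFrameInputSource input = {
    .opaque = tiles,
    .get_color_channels_pixel_format = tile_pixel_format,
    .get_color_channel_data_at = tile_data_at,
    .release_buffer = tile_release,
};

if (JxlEncoderAddChunkedFrame(frame_settings, JXL_TRUE, input) !=
    JXL_ENC_SUCCESS)
    return -1;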

This is version 3 of this PR, I think :( but it's quite simple. The NOTES file outlines the earlier (and crazily complicated) schemes I tried.

TODO:

  • this uses the iteration in JxlEncoderAddChunkedFrame() to drive libvips evaluation (and not vips_sink_*()), so it's missing things like the preeval and eval signals -- we should add these
  • the old sequential save path can go once testing is over
  • we'll need to bump the min libjxl version to at least 0.11
  • add chunked load as well
  • add extra channel support (more than 4 bands)

wrote some NOTES on what we need to do next
instead, try to drive libvips from data_at

it should be A LOT simpler
same speed, about half the memory use for a 10k x 10k pixel image

still missing the signals you'd expect for a sink, eg. "preeval", "eval"
etc.
@jcupitt jcupitt marked this pull request as draft September 29, 2024 14:28
Add preeval, eval, posteval, minimise signals.

We have to move vips_image_preeval(), vips_image_eval(), vips_image_posteval()
to the public API.
@jcupitt
Member Author

jcupitt commented Sep 29, 2024

It seems (from #libjxl on Discord) that we'll need to make a threadpool and generate libjxl tiles in parallel ourselves -- we can't rely on _data_at() being called in parallel.

experiment with various caching strategies ...  but it looks like none will
work, since we will always throw away the cache on the original jpegload
on minimise

we're going to have to use the super-complicated sink_disc plan
@jcupitt
Member Author

jcupitt commented Oct 1, 2024

We are doing parallel pixel generation by running a libvips pipeline for each 2k x 2k JXL tile. But this means we emit a minimise signal at the end of each crop, and this drops the cache from the sequential() attached to the jpegload at the end of each tile.
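Each tile render is conceptually just a small pipeline like this (a sketch, not the PR's exact code; render_tile is a hypothetical helper):

#include <vips/vips.h>

/* Render one tile of the source image with its own little pipeline.
 * When tile is unreffed at the end, a "minimise" signal propagates
 * upstream and drops the line cache on the jpegload.
 */
static void *
render_tile(VipsImage *image, int left, int top, int width, int height,
    size_t *length)
{
    VipsImage *tile;
    void *buf;

    if (vips_crop(image, &tile, left, top, width, height, NULL))
        return NULL;

    buf = vips_image_write_to_memory(tile, length);
    g_object_unref(tile);

    return buf;
}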

This PR works, but only by making sequential() persist over minimise. This will cause huge memory use.

We should probably bite the bullet and implement the incredibly complicated jxlsave outlined in NOTES.

@jcupitt
Member Author

jcupitt commented Oct 1, 2024

Ah, a better fix! Don't emit minimise for sub-evaluations, just as we don't emit preeval, eval, etc.
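Something like this guard in the sink teardown (a sketch -- the save struct and its sub_evaluation flag are hypothetical):

/* Only the top-level evaluation emits lifecycle signals; per-tile
 * sub-evaluations stay quiet, so upstream caches survive between
 * tiles.
 */
if (!save->sub_evaluation)
    vips_image_minimise_all(save->in);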

@jcupitt
Member Author

jcupitt commented Oct 5, 2024

Sample JXL HDR image for testing:

https://github.com/immich-app/immich/files/14755028/JPEG.XL.HDR.Test.zip

libvips loads this as ushort and it looks fine; I guess the HDR element is in the transfer function. There's a synthetic profile attached, but it only supports perceptual intent (libvips uses relative by default).

$ vipsheader -f icc-profile-data JPEG_XL_HDR.jxl | base64 -d > hdr.icc
$ file hdr.icc 
hdr.icc: ColorSync color profile 4.4, type jxl, RGB/Lab-mntr device by jxl, 4344 bytes, 1-12-2019, 0xaa027a64eabe3514 MD5 'RGB_D65_DCI_Per_PeQ'

You can import to make a LAB image:

$ vips icc_import JPEG_XL_HDR.jxl x.v --intent perceptual
$ vipsheader x.v
x.v: 6000x4000 float, 3 bands, lab, jxlload

It looks nice:

(screenshot of the imported image omitted)

But I can't see any out-of-range brightnesses:

$ vips stats x.v .csv
-75.852142333984375	94.802780151367188	2672661272.8074112	136308783384.3548	37.120295455658486	22.699366058239452	5923	3900	2379	2770
0	94.802780151367188	1401669949.0775206	89046135071.592117	58.402914544896689	17.301884669527556	4779	3376	2379	2770
-75.852142333984375	76.128402709960938	430761739.44445068	11741448443.999996	17.948405810185445	12.926010766460045	5923	3900	5390	3426
-58.478599548339844	89.867706298828125	840229584.28544009	35521199868.76268	35.009566011893334	15.949303834018592	4781	3346	2876	3245

(the columns are minimum, maximum, sum, sum of squares, mean, deviation, and the x/y positions of the minimum and maximum; the first row is the whole image, then one row per band. The second column in the second row is peak L*, 94.8 here -- I'd expect to see values over 100 for specular highlights)

You can get the LAB->sRGB conversion automatically like this:

$ vips icc_import JPEG_XL_HDR.jxl x.jpg --intent perceptual

Making this image:

(converted sRGB image omitted)

That will use the libvips LAB->sRGB conversion. You could also write:

$ vips icc_transform JPEG_XL_HDR.jxl x.jpg srgb --intent perceptual 

That will convert to sRGB using LCMS and the srgb profile bundled with libvips.

@jcupitt
Member Author

jcupitt commented Oct 5, 2024

We should probably make libvips ICC import fall back to perceptual intent if relative is not supported.

@jcupitt
Member Author

jcupitt commented May 7, 2025

I've been looking into the macOS crashes, and it seems to be g_mutex_clear().

For example, running vips magickload will make a magick7 object, then dispose it without running _build(). This makes this code run:

    ...
    VIPS_FREEF(magick_destroy_exception, magick7->exception);
    g_mutex_clear(&magick7->lock);

And g_mutex_clear() will crash inside pthread_mutex_destroy:

vips_foreign_load_magick7_dispose: 0x600002318700
Process 59255 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000018b08e2d0 libsystem_pthread.dylib`pthread_mutex_destroy + 40
libsystem_pthread.dylib`pthread_mutex_destroy:
->  0x18b08e2d0 <+40>: casa   w8, w9, [x19]

magick7->lock has the value:

  lock = {
    p = 0x0000000000000000
    i = ([0] = 0, [1] = 0)
  }

ie. just zero (as you'd expect).

If I move the g_mutex_init() from vips_foreign_load_magick7_build() to vips_foreign_load_magick7_init(), breaking on _dispose() shows that lock now has the value:

  lock = {
    p = 0x000060000000e980
    i = ([0] = 59776, [1] = 24576)
  }

And g_mutex_clear() now runs successfully.

The docs say:

Notice that the GMutex is not initialised to any particular value. Its placement in static storage ensures that it will be initialised to all-zeros, which is appropriate.

Which doesn't seem to be working. I'll try to make a small reproducer and report upstream.

@jcupitt
Member Author

jcupitt commented May 7, 2025

This seems to reproduce it:

// compile with
//  cc mutex.c `pkg-config glib-2.0 --cflags --libs`

#include <stdio.h>
#include <glib.h>

GMutex lock_static;

int
main(void)
{   
    GMutex lock_stack = { 0 };

    printf("clearing lock in static storage ...\n");
    g_mutex_clear(&lock_static);

    printf("clearing lock on stack ...\n");
    g_mutex_clear(&lock_stack);

    return 0;
}

I see:

17:25 $ cc mutex.c `pkg-config glib-2.0 --cflags --libs`
✔ ~/try 
17:27 $ ./a.out
clearing lock in static storage ...
Segmentation fault: 11
✘-SEGV ~/try 

System:

17:27 $ brew info glib
==> glib: stable 2.84.1 (bottled)
17:28 $ uname -a
Darwin cheese-2.local 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:45:49 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T6020 arm64

@jcupitt
Member Author

jcupitt commented May 7, 2025

I made an issue here: https://gitlab.gnome.org/GNOME/glib/-/issues/3682

@jcupitt
Member Author

jcupitt commented May 7, 2025

I found out a bit more: it happens when glib is NOT configured to use the native glib mutex and instead uses the platform POSIX mutex -- it doesn't check for a null pointer before calling g_mutex_impl_free().

It looks like GCond etc. probably have the same issue.

@jcupitt
Member Author

jcupitt commented May 7, 2025

The glib devs clarified the docs: if you have a mutex in static storage, it's ok to not init or clear it. It'll leak with some mutex backends, but since it's in static storage, the leak is bounded and doesn't matter.

For dynamically allocated memory (ie. stack or heap), you must init exactly once and you must clear exactly once. Init to zero is not enough.

I think we're going to have to double-check all our GMutex usage :(

@jcupitt
Member Author

jcupitt commented May 8, 2025

It looks like we must only call g_mutex_clear() once, so it should be called from _finalize(), not from _dispose() :(
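For a GObject, that means: g_mutex_init() exactly once in instance init, g_mutex_clear() exactly once in finalize. A minimal sketch (a hypothetical MyThing object, not the actual magick7 code):

#include <glib-object.h>

typedef struct {
    GObject parent_instance;
    GMutex lock;
} MyThing;

typedef struct {
    GObjectClass parent_class;
} MyThingClass;

G_DEFINE_TYPE(MyThing, my_thing, G_TYPE_OBJECT)

static void
my_thing_finalize(GObject *gobject)
{
    MyThing *thing = (MyThing *) gobject;

    /* finalize runs exactly once, so the mutex is cleared exactly
     * once, matching the single g_mutex_init() in _init(). dispose
     * may run more than once, so the clear can't live there.
     */
    g_mutex_clear(&thing->lock);

    G_OBJECT_CLASS(my_thing_parent_class)->finalize(gobject);
}

static void
my_thing_class_init(MyThingClass *klass)
{
    G_OBJECT_CLASS(klass)->finalize = my_thing_finalize;
}

static void
my_thing_init(MyThing *thing)
{
    /* Instance init always runs, even if _build() never does, so
     * the mutex is guaranteed to be initialised before finalize.
     */
    g_mutex_init(&thing->lock);
}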

@kleisauke
Member

Thanks for clarifying this upstream and fixing this!

It looks like we must only call g_mutex_clear() once, so it should be called from _finalize(), not from _dispose() :(

Ah, bummer. This reminds me of this suboptimal approach:

if (fits->line)
    g_mutex_clear(&fits->lock);

@jcupitt
Member Author

jcupitt commented May 8, 2025

Yes, I saw that haha. Safe, but a little ugly.

@jcupitt jcupitt marked this pull request as ready for review May 8, 2025 14:09
@jcupitt
Member Author

jcupitt commented May 8, 2025

I've not revised jxlload or added support for more than four channels, but let's try to merge; this PR has taken far too long.

We can do another round of work in a follow-up.

Comment on lines 969 to 988
/* Tinkering with output transforms. Experiment with this plus
 *
 * https://sneyers.info/hdrtest/
 *
 * and see if we can get values out of SDR range
 */

JxlColorEncoding enc = {
    .color_space = JXL_COLOR_SPACE_RGB,
    .white_point = JXL_WHITE_POINT_D65,
    .primaries = JXL_PRIMARIES_SRGB,
    .transfer_function = JXL_TRANSFER_FUNCTION_SRGB,
    .rendering_intent = JXL_RENDERING_INTENT_RELATIVE,
};
if (JxlDecoderSetPreferredColorProfile(jxl->decoder, &enc)) {
    vips_foreign_load_jxl_error(jxl,
        "JxlDecoderSetOutputColorProfile");
    return -1;
}
if (JxlDecoderSetDesiredIntensityTarget(jxl->decoder, 10000)) {
Member

Should we revert this to its original?

Member Author

You're right, let's do the HDR stuff in a follow up PR.

@kleisauke
Member

With libjxl 0.11.1 in release mode and the current version of this PR I see:

$ /usr/bin/time -f %M:%e vips copy st-francis.jpg x.jxl
memory: high-water mark 3.99 GB
5475176:59.19
$ /usr/bin/time -f %M:%e vips copy st-francis.jpg x.jxl[chunked]
memory: high-water mark 1.69 GB
2151556:66.88

FWIW, I see similar results on my AMD Ryzen 9 7900 workstation.

master branch:

$ /usr/bin/time -f %M:%e vips copy ~/st-francis.jpg x.jxl
5533800:50.35

This PR:

$ /usr/bin/time -f %M:%e vips copy st-francis.jpg x.jxl
1877692:53.78

So, ~3x less memory, similar performance and byte-identical images. Nice!

Member

@kleisauke kleisauke left a comment

LGTM! I noticed a couple of unused things, but that can also be cleaned up in a follow-up PR.

@jcupitt jcupitt merged commit 214a23f into master May 8, 2025
14 checks passed
@jcupitt
Member Author

jcupitt commented May 8, 2025

Phew! Thanks for the review!
