Skip to content

colour: use floating-point APIs of lcms #3373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Mar 13, 2023

Conversation

kleisauke
Copy link
Member

Avoids the need to pack the buffer of floats into lcms's fixed-point formats and ensures full precision internally in lcms.

Resolves: #3150.
Supersedes: #3368.

Note: this PR targets the 8.14 branch.

Avoids the need to pack the buffer of floats into lcms's fixed-point
formats and ensures full precision internally in lcms.

Resolves: libvips#3150.
Supersedes: libvips#3368.
Copy link
Member

@jcupitt jcupitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm amazed lcms has such different behaviour on this code path.

Well done for finding this Kleis.

@jcupitt
Copy link
Member

jcupitt commented Mar 5, 2023

I tested this PR with:

$ vips icc_import input.jpg t1.v --pcs xyz
$ vips icc_export t1.v output.jpg

And it now round-trips correctly, though there's some gamut clipping in the extreme reds, as you'd expect.

Open these in two browser tabs and flip. Before:

input

After round-trip:

output

@jcupitt
Copy link
Member

jcupitt commented Mar 5, 2023

I benchmarked like this. Before:

$ time vips icc_import input.jpg t1.v --pcs xyz
real	0m0.096s
user	0m0.112s
sys	0m0.076s

After:

$ time vips icc_import input.jpg t1.v --pcs xyz
real	0m0.097s
user	0m0.345s
sys	0m0.090s

So we're maybe 4x more CPU with this change, which is unfortunate :( However, the accuracy improvement is so dramatic, I don't think we have a choice.

The high CPU cost seems alarming, but it's interesting that overall run time is unaffected, since libvips hides the extra processing with extra concurrency (phew!).

@lovell
Copy link
Member

lovell commented Mar 5, 2023

I've done some quick performance testing/profiling and can confirm the CPU cost (ignoring concurrency) of both approaches is ~4x, with ~30% of the time spent calculating pow (gamma calc?). Perhaps the original cmsFLAGS_NOOPTIMIZE approach is a better/simpler change after all, especially if we can make it conditional somehow e.g. XYZ-only, or maybe via a new argument?

Copy link
Member

@lovell lovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having slept on it, I agree this PR is the better solution, especially as it doesn't prevent lcms from internally optimising the icc_transform operation.

@kleisauke
Copy link
Member Author

Ah, I was unaware of the additional CPU cost (I only tested overall run time). Let me check if there's another solution.

I remembered that one of the libjxl maintainers said that it did a pow() per pixel, which is probably the reason why libjxl seems to prefer skcms over lcms. This was tracked in issue mm2/Little-CMS#2 and the fast float plugin resolves that (unfortunately GPL-licensed).

@jcupitt
Copy link
Member

jcupitt commented Mar 6, 2023

I had a quick go with fast float using this diff to this patch:

diff --git a/libvips/colour/icc_transform.c b/libvips/colour/icc_transform.c
index 31f0c954c..e7ed52d32 100644
--- a/libvips/colour/icc_transform.c
+++ b/libvips/colour/icc_transform.c
@@ -91,6 +91,7 @@
 /* Has to be before VIPS to avoid nameclashes.
  */
 #include <lcms2.h>
+#include <lcms2_fast_float.h>
 
 #include <vips/vips.h>
 
diff --git a/libvips/iofuncs/init.c b/libvips/iofuncs/init.c
index 93f0b17ba..25089d0bc 100644
--- a/libvips/iofuncs/init.c
+++ b/libvips/iofuncs/init.c
@@ -114,6 +114,11 @@
 #include <vips/vips7compat.h>
 #endif
 
+#ifdef HAVE_LCMS2
+#include <lcms2.h>
+#include <lcms2_fast_float.h>
+#endif /*HAVE_LCMS2*/
+
 /* abort() on the first warning or error.
  */
 int vips__fatal = 0;
@@ -469,6 +474,10 @@ vips_init( const char *argv0 )
                return( 0 );
        started = TRUE;
 
+#ifdef HAVE_LCMS2
+       cmsPlugin(cmsFastFloatExtensions());
+#endif /*HAVE_LCMS2*/
+
        /* Try to set a minimum stacksize, default 2mb. We need to do this
         * before any threads start.
         */

But I don't see a speedup :(

$ time vips icc_import input.jpg t1.v --pcs xyz
real	0m0.113s
user	0m0.360s
sys	0m0.085s
$ ldd $(which vips) | grep fast
	liblcms2_fast_float.so => /home/john/vips/lib/liblcms2_fast_float.so (0x00007fd96cbe2000)

I've probably messed up somewhere.

@jcupitt
Copy link
Member

jcupitt commented Mar 6, 2023

Heh callgrind shows 50% of CPU spent evaluating tone curves:

image

I must have missed something.

@kleisauke
Copy link
Member Author

kleisauke commented Mar 7, 2023

You're right, I don't see any speedup here either with the fast float plugin :(

I did some digging and it looks like this is being caused by the OptimizeByResampling builtin optimization in lcms. With this patch:

--- a/src/cmsopt.c
+++ b/src/cmsopt.c
@@ -1803,8 +1803,7 @@ static _cmsOptimizationCollection DefaultOptimization[] = {
 
     { OptimizeByJoiningCurves,            &DefaultOptimization[1] },
     { OptimizeMatrixShaper,               &DefaultOptimization[2] },
-    { OptimizeByComputingLinearization,   &DefaultOptimization[3] },
-    { OptimizeByResampling,               NULL }
+    { OptimizeByComputingLinearization,   NULL },
 };
 
 // The linked list head

I see (click to enlarge):

$ vipsthumbnail php-vips-192.jpg -s 1536x1024 -e srgb -o x.jpg
Original 8.14 8.14 + patch This PR Commit kleisauke@3e93bb0 + patch
php-vips-192 8.14 lcms-patch pr-3373 lcms-simplify-2-patch

And:

$ vipsthumbnail php-vips-169.jpg --linear -s 1920x1080 -o x.jpg
Original 8.14 8.14 + patch This PR Commit kleisauke@3e93bb0 + patch
php-vips-169 8.14 lcms-patch pr-3373 lcms-simplify-2-patch

So, it looks like commit kleisauke@3e93bb0 might also help here, but only when that builtin lcms optimization is disabled.

@jcupitt
Copy link
Member

jcupitt commented Mar 7, 2023

Oh very cool! And does changing the lcms optimisation table improve performance significantly?

@kleisauke
Copy link
Member Author

It seems that the OptimizeByResampling optimization decreases CPU cost (i.e., disabling it makes the timings more in line with the TYPE_XYZ_FLT API).

For example:
8.14:

$ time vips icc_import php-vips-169.jpg t1.v --pcs xyz

real	0m0.106s
user	0m0.105s
sys	0m0.055s

8.14 + lcms patch:

$ time vips icc_import php-vips-169.jpg t1.v --pcs xyz

real	0m0.122s
user	0m0.247s
sys	0m0.043s

This PR:

$ time vips icc_import php-vips-169.jpg t1.v --pcs xyz

real	0m0.124s
user	0m0.257s
sys	0m0.046s

Commit kleisauke@3e93bb0 + lcms patch:

$ time vips icc_import php-vips-169.jpg t1.v --pcs xyz

real	0m0.123s
user	0m0.267s
sys	0m0.034s

@lovell
Copy link
Member

lovell commented Mar 7, 2023

Great detective work Kleis. Did you see that the accuracy of the LUTs in OptimizeByResampling is based on nGridPoints which is optionally increased by the cmsFLAGS_HIGHRESPRECALC flag?

@kleisauke
Copy link
Member Author

Yes, I also noticed that. I initially tried these two patches:

--- a/libvips/colour/icc_transform.c
+++ b/libvips/colour/icc_transform.c
@@ -444,6 +444,9 @@ vips_icc_build( VipsObject *object )
 	if( icc->black_point_compensation ) 
 		flags |= cmsFLAGS_BLACKPOINTCOMPENSATION;
 
+	if( icc->pcs == VIPS_PCS_XYZ )
+		flags |= cmsFLAGS_HIGHRESPRECALC;
+
 	if( !(icc->trans = cmsCreateTransform( 
 		icc->in_profile, icc->in_icc_format,
 		icc->out_profile, icc->out_icc_format, 
--- a/libvips/colour/icc_transform.c
+++ b/libvips/colour/icc_transform.c
@@ -444,6 +444,9 @@ vips_icc_build( VipsObject *object )
 	if( icc->black_point_compensation ) 
 		flags |= cmsFLAGS_BLACKPOINTCOMPENSATION;
 
+	if( icc->pcs == VIPS_PCS_XYZ )
+		flags |= cmsFLAGS_GRIDPOINTS( 49 );
+
 	if( !(icc->trans = cmsCreateTransform( 
 		icc->in_profile, icc->in_icc_format,
 		icc->out_profile, icc->out_icc_format, 

But both didn't work, unfortunately.

@jcupitt
Copy link
Member

jcupitt commented Mar 7, 2023

I tried removing OptimizeByResampling() but it didn't reduce the number of pow() calls (according to callgrind). It's still doing 3 pow() calls per pixel processed :(

@jcupitt
Copy link
Member

jcupitt commented Mar 7, 2023

libvips uses a LUT for tone curves -- 1500 elements (from memory?) then linear piecewise interpolation between them, which is accurate enough, I think.

Maybe we could make our own lcms plugin that does something similar? Or perhaps we need to swap to skcms :(

@jcupitt
Copy link
Member

jcupitt commented Mar 7, 2023

Oh heh no we don't, it's much more complex, I'd forgotten.

@kleisauke
Copy link
Member Author

IIRC, skcms doesn't support conversion to LUT-based or CMYK profiles, so switching to that is not an option.

@jcupitt
Copy link
Member

jcupitt commented Mar 7, 2023

Ah shame.

@kleisauke
Copy link
Member Author

Commit 4e16c13 fixes a small color regression with the php-vips-192.jpg testcase.

$ vipsthumbnail php-vips-192.jpg -s 1536x1024 -e srgb -o x.jpg
Original Before (caf5aa4) After (4e16c13)
php-vips-192 pr-3373 pr-3373

@kleisauke
Copy link
Member Author

It looks like this issue occurs only when exporting (vips_icc_export()) an image from CIE XYZ PCS to device space, this means that vips_icc_import() can still make use of the 16-bit XYZ/LAB APIs. Commit 1a4cdbe incorporates this, PTAL.

Here are some vipsthumbnail timings:

Details

8.14:

$ time vipsthumbnail php-vips-169.jpg --linear -s 1920x1080 -o x.jpg

real	0m0.107s
user	0m0.168s
sys	0m0.024s
$ time vipsthumbnail php-vips-169.jpg -s 1920x1080 -e srgb -o x.jpg

real	0m0.095s
user	0m0.094s
sys	0m0.028s
$ time vipsthumbnail php-vips-192.jpg -s 1536x1024 -e srgb -o x.jpg

real	0m0.099s
user	0m0.121s
sys	0m0.025s
$ time vipsthumbnail php-vips-192.jpg -s 1536x1024 -i srgb -e srgb -o x.jpg

real	0m0.091s
user	0m0.086s
sys	0m0.023s

This PR:

$ time vipsthumbnail php-vips-169.jpg --linear -s 1920x1080 -o x.jpg

real	0m0.132s
user	0m0.344s
sys	0m0.023s
$ time vipsthumbnail php-vips-169.jpg -s 1920x1080 -e srgb -o x.jpg

real	0m0.095s
user	0m0.094s
sys	0m0.029s
$ time vipsthumbnail php-vips-192.jpg -s 1536x1024 -e srgb -o x.jpg

real	0m0.104s
user	0m0.174s
sys	0m0.027s
$ time vipsthumbnail php-vips-192.jpg -s 1536x1024 -i srgb -e srgb -o x.jpg

real	0m0.091s
user	0m0.083s
sys	0m0.026s

So, only processing in --linear space or exporting untagged images (i.e. those without ICC profile or when -i is omitted) are slightly worse in terms of performance.

Ensures compilers may optimise.
@kleisauke kleisauke merged commit 2903876 into libvips:8.14 Mar 13, 2023
@kleisauke kleisauke deleted the lcms-simplify branch March 13, 2023 10:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants