Skip to content

out of order read error #639

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jcupitt opened this issue Apr 12, 2017 · 26 comments
Closed

out of order read error #639

jcupitt opened this issue Apr 12, 2017 · 26 comments

Comments

@jcupitt
Copy link
Member

jcupitt commented Apr 12, 2017

Since migrating from 8.4.4 to 8.5.2 we see this issue (can you re-open it?), it isn't reproducible in test environments, not even in production when requesting the same images. It looks like it has to do with high load and lots of concurrent requests. Errors outputted look like

VipsJpeg: out of order read at line 3016
VipsJpeg: out of order read at line 3016
VipsJpeg: out of order read at line 3016
VipsJpeg: out of order read at line 3016

for ~0.03% of the requests (hence the not reproducible part). Only for JPEG's it seems. Really strange, and we cannot debug it at this time..

@jcupitt
Copy link
Member Author

jcupitt commented Apr 12, 2017

Related issue: #261

Test program from the previous issue: https://gist.github.com/jcupitt/7de81323573665d8aafb

@jcupitt
Copy link
Member Author

jcupitt commented Apr 12, 2017

I've updated the gist slightly, but it runs cleanly here. That's 50 parallel workers, each running vips_resize() 20 times, the whole thing repeated 10 times.

@andrieslouw, what is your workload? Is there some other operator we could test?

@kleisauke
Copy link
Member

After some debugging and testing, I could make a reproducible test case (in PHP) which gives sporadic out of order read errors. Sometimes it outputs this as log:

[2017-04-12T17:22:01+00:00] debug: reducev {"instance":"{\"width\":768,\"height\":1024,\"bands\":3,\"format\":\"uchar\",\"interpretation\":\"srgb\"}","arguments":[1.4422535211267606,{"kernel":"lanczos3","centre":true}]}
[2017-04-12T17:22:01+00:00] debug: reducev {"result":"{\"width\":768,\"height\":710,\"bands\":3,\"format\":\"uchar\",\"interpretation\":\"srgb\"}"}
[2017-04-12T17:22:01+00:00] debug: reduceh {"instance":"{\"width\":768,\"height\":710,\"bands\":3,\"format\":\"uchar\",\"interpretation\":\"srgb\"}","arguments":[1.4422535211267606,{"kernel":"lanczos3","centre":true}]}
[2017-04-12T17:22:01+00:00] debug: reduceh {"result":"{\"width\":533,\"height\":710,\"bands\":3,\"format\":\"uchar\",\"interpretation\":\"srgb\"}"}
[2017-04-12T17:22:01+00:00] error: VipsJpeg: out of order read at line 1024
VipsJpeg: out of order read at line 1024
VipsJpeg: out of order read at line 1024
VipsJpeg: out of order read at line 1024
 {"exception":{}}

And we see these errors in our PHP-FPM log:

WARNING: [pool www] child 6768 said into stderr: "NOTICE: PHP message: PHP Fatal error:  Uncaught Jcupitt\Vips\Exception: VipsJpeg: out of order read at line 1024"
WARNING: [pool www] child 6768 said into stderr: "VipsJpeg: out of order read at line 1024"
WARNING: [pool www] child 6768 said into stderr: "VipsJpeg: out of order read at line 1024"
WARNING: [pool www] child 6768 said into stderr: "VipsJpeg: out of order read at line 1024"
WARNING: [pool www] child 6768 said into stderr: " in /var/www/t0.nl/public_html/test-bug/vendor/jcupitt/vips/src/Image.php:673"
WARNING: [pool www] child 6768 said into stderr: "Stack trace:"
WARNING: [pool www] child 6768 said into stderr: "#0 /var/www/t0.nl/public_html/test-bug/vendor/jcupitt/vips/src/Image.php(857): Jcupitt\Vips\Image::errorVips()"
WARNING: [pool www] child 6768 said into stderr: "#1 /var/www/t0.nl/public_html/test-bug/test.php(82): Jcupitt\Vips\Image->writeToBuffer('.jpg')"
WARNING: [pool www] child 6768 said into stderr: "#2 {main}"
WARNING: [pool www] child 6768 said into stderr: "  thrown in /var/www/t0.nl/public_html/test-bug/vendor/jcupitt/vips/src/Image.php on line 673"
WARNING: [pool www] child 6758 said into stderr: ""
WARNING: [pool www] child 6758 said into stderr: "(banana:6758): VIPS-WARNING **: error in tile 0 x 104"
WARNING: [pool www] child 6758 said into stderr: ""
WARNING: [pool www] child 6758 said into stderr: "(banana:6758): VIPS-WARNING **: error in tile 0 x 112"
WARNING: [pool www] child 6758 said into stderr: ""
WARNING: [pool www] child 6758 said into stderr: "(banana:6758): VIPS-WARNING **: error in tile 0 x 120"
WARNING: [pool www] child 6758 said into stderr: ""
WARNING: [pool www] child 6758 said into stderr: "(banana:6758): VIPS-WARNING **: error in tile 0 x 128"
WARNING: [pool www] child 6758 said into stderr: ""
WARNING: [pool www] child 6758 said into stderr: "(banana:6758): VIPS-WARNING **: error in tile 0 x 80"
WARNING: [pool www] child 6758 said into stderr: ""
WARNING: [pool www] child 6758 said into stderr: "(banana:6758): VIPS-WARNING **: error in tile 0 x 90"

Attachment: reproducible test case
test-bug.zip

@andrieslouw
Copy link

@jcupitt: The testcase from @kleisauke is based on our workload. Production does 45 req/s average, with 26 simultaneous threads running (~500ms response time, including fetching of images). Peakload is 80 req/s with 85 simultaneous requests (~1100ms response time including fetching). Processing by libvips takes ~100ms at average, and issue looks independent of load (percentage wise). I guess it could be reproduced with 40 requests per second, and 4 or 5 simultaneous threads.

@jcupitt
Copy link
Member Author

jcupitt commented Apr 12, 2017

Hey great, I'll test here. Thanks!

@jcupitt
Copy link
Member Author

jcupitt commented Apr 12, 2017

I tried:

$ parallel "php test-loop.php > /dev/null" ::: {1..10}

Which (on this laptop) will run four copies of test-loop.php in parallel and it seems OK. I'll try a few more cases.

@jcupitt
Copy link
Member Author

jcupitt commented Apr 13, 2017

I think I've found a problem in buffer sizing. Hang on, I'll try a patch.

jcupitt referenced this issue Apr 13, 2017
we had output buffers too large, input caches too small

see https://github.com/jcupitt/libvips/issues/639
@jcupitt
Copy link
Member Author

jcupitt commented Apr 13, 2017

Could you try with HEAD in the 8.5 branch? I've revised the cache sizing, it might fix your issue.

@kleisauke
Copy link
Member

Tested it with the 8.5 branch (in our test environment) and it seems that this problem has been fixed, thanks John! 🎉

We can't test this on our production server (yet) because we depend on the libvips provided by Remi's RPM repository.

I've also updated the reproducible test case (to only output if there's a libvips exception). You can test it with:

parallel "php test-loop.php" ::: {1..10}

With the 8.5 branch there are no errors and with the 8.5.2 release there are many error in tile and out of order read errors.

Attachment: updated reproducible test case
test-bug.zip

@jcupitt
Copy link
Member Author

jcupitt commented Apr 13, 2017

\o/ Hey, that's great! I pushed another small improvement, but your test still passes. I'll do 8.5.3 tomorrow.

Have you looked at using the new thumbnail operator instead of shrink / reduce? You should see a good improvement in speed, eg.:

$ time vips resize wtc.jpg x.jpg 0.01
real	0m0.707s
user	0m0.680s
sys	0m0.024s
$ time vips thumbnail wtc.jpg x.jpg 94
real	0m0.282s
user	0m0.268s
sys	0m0.012s

That's shrinking a 9400 x 9400 pixel image to 94 x 94. The quality is the same.

@kleisauke
Copy link
Member

Thanks for releasing 8.5.3! The test case still passes in our test environment.

I'll test it out in production if 8.5.3 appears in the RPM repository.

We've considered the thumbnail operator but it’s not possible to implement it in our current setting. Because it's always cropping in the centre and it always maintains the aspect ratio.

I think the speed up (you've mentioned) is due to shrink-on-load. We've already implemented this in our resize class, it's indeed a nice improvement in speed.

@jcupitt
Copy link
Member Author

jcupitt commented Apr 15, 2017

Ah, OK, good to hear you're doing that.

I don't see code for doing the last 200%+ with reduce, do you have that? You need to do at least the final 200% with reduce or you'll get aliasing, perhaps you saw the section on the wiki:

https://github.com/jcupitt/libvips/wiki/HOWTO----Image-shrinking#block-shrink

Or maybe you have this and I'm being dumb.

Do you do auto-rotate? That has a slightly tricky thing where you'll need to render to a memory buffer before the rotate operation. Same for smartcrop, if you do that.

@kleisauke
Copy link
Member

kleisauke commented Apr 15, 2017

Ah I see, we're not leaving at least a factor of two for the final resize. The thumbnail operator does. sharp doesn't do this either. I'll do some experimenting to see if there's a big difference.

About the auto-rotate:
We're doing auto-rotate, but without using vips_autorot (because it doesn't support rotate + mirror EXIF). Instead we've build our own desired function. Also, when we detect a rotation we're setting the access method to random (see here). We're doing the same thing if we detect a smart crop or if the user wants to trim his image (see here).

The trick to render to a memory buffer is currently not possible for us because the vips_image_copy_memory operator is not in the PHP binding.

@jcupitt
Copy link
Member Author

jcupitt commented Apr 15, 2017

You're right, the php API would need some more things to allow render to memory, I'll try to think of the simplest. That test image of the house is a good one for exposing aliasing problems.

You don't need to render to memory for trim with 8.5, the new sequential mode can do this case.

@kleisauke
Copy link
Member

There's indeed an aliasing problem due to shrink-on-load (see here and the fixed one here). Not sure why we haven't previously encountered this problem. Thanks for the heads-up.

I've tried to implement the thumbnail operator in our current setting (on this branch). I found a few problems (should I open a new issue?):

  • Our trim operator occurs before any resize operation. Is it possible to thumbnail on an existing image (and thereby skipping the shrink-on-load)? Writing the existing image to a buffer and do the thumbnail_buffer operation on the buffer fixes this. But that feels a little hackish (and I don't know if this is wise in terms of performance).
  • I couldn't find a page property on the thumbnail operator. For example, this can be useful when thumbnailing a specific PDF page.
  • Ignoring the aspect ratio isn't possible with thumbnail operator. Do you plan to implement this feature in the thumbnail operator? We still have many people who use it.

For the rest, thanks for moving the vips_thumbnail operator into libvips core. I think it's a lot easier for us to maintain it, after we have implemented this.

@jcupitt
Copy link
Member Author

jcupitt commented Apr 16, 2017

You could trim the output of thumbnail, scaling the coordinates appropriately. Would that work?

For pages, you could embed the page number in the filename, for example:

$shrunk_image = Vips\Image\thumbnail($filename + "[page=] + $page_no + "]", 100);

Not ideal I guess. You'd need to sniff the file format before doing this, of course.

You're right, there's no way to break the aspect ratio. Could you explain why people want to do this? It seems puzzling to me.

@jcupitt
Copy link
Member Author

jcupitt commented Apr 17, 2017

I added $new_image = $image->copyMemory(); to php, so you should be able to do that just before smartcrop or rotate for a nice speed up.

@jcupitt
Copy link
Member Author

jcupitt commented Apr 18, 2017

I think we're done, I'll close. Thanks for reporting this!

@jcupitt jcupitt closed this as completed Apr 18, 2017
@kleisauke
Copy link
Member

@jcupitt Today we've updated libvips to 8.5.3 on our production server. Unfortunately we're still seeing out of order read errors for some images. I was able to reproduce this error and create a reproducible test case that you can run with:

parallel "php test-loop.php" ::: {1..10}

You'll see this appear in the shell:

(banana:11044): VIPS-WARNING **: error in tile 0 x 60
(banana:11044): VIPS-WARNING **: error in tile 0 x 70
(banana:11044): VIPS-WARNING **: error in tile 0 x 64
(banana:11044): VIPS-WARNING **: error in tile 0 x 72
(banana:11044): VIPS-WARNING **: error in tile 0 x 80
(banana:11044): VIPS-WARNING **: error in tile 0 x 60
(banana:11044): VIPS-WARNING **: error in tile 0 x 70
(banana:11044): VIPS-WARNING **: error in tile 0 x 32
(banana:11044): VIPS-WARNING **: error in tile 0 x 40
(banana:11044): VIPS-WARNING **: error in tile 0 x 48
(banana:11044): VIPS-WARNING **: error in tile 0 x 30
(banana:11044): VIPS-WARNING **: error in tile 0 x 40

(looks like x coordinate is always 0)

If you want, I can send you all images and target dimensions (which generate these errors) by email.

Attachment: updated reproducible test case
test-bug.zip

@jcupitt
Copy link
Member Author

jcupitt commented Apr 21, 2017

Argh! Sorry @kleisauke, I was sure this had been nailed.

Yes, I see the error here too, thank you for the test case. I'll have a look.

@jcupitt jcupitt reopened this Apr 21, 2017
@jcupitt
Copy link
Member Author

jcupitt commented Apr 22, 2017

I've found the problem: two bits of vips are disagreeing about cache sizing, so one ends up smaller than the other expects. I should have a fix in a few hours.

jcupitt referenced this issue Apr 23, 2017
we were sizing buffers partly by image width, which could cause caches
to be too small if width changed down a pipeline

see https://github.com/jcupitt/libvips/issues/639
@jcupitt
Copy link
Member Author

jcupitt commented Apr 23, 2017

OK, pushed a fix, it seems to work now. I've added a lot more tests too, so hopefully this can't happen again.

You could trigger the problem on a 4 thread machine with just:

$ vips reduce image.jpg x.v 1.5 1.5

Where image.jpg is your 539 x 960 pixel image. The problem was:

  • the output image was 359 pixels across
  • vips would allocate an output buffer two tiles high in order to be able to fit the four 128 x 128 threads in
  • the input image was 539 pixels across, so vips would allocate a buffer only one tile high, since it was wide enough for all four tiles to fit across
  • during computation, caches are 3 x buffer height to allow for double buffered output plus some non-local reference
  • if the first tile in a buffer was delayed sufficiently, non-local references would grow to 2 x 2 x tile height, since output is double-buffered
  • ... which is obviously more than the 3x the input cache is seeing, hence sequentiality breaks

To fix this, caches are now sized the same everywhere, independent of image width.

I've added a new thing too, VIPS_STALL. If this environment variable is set, the first tile in every buffer will sleep for 0.5s before starting. This puts terrible stress on the caching and buffering system: if it can survive this, it should survive anything.

The tests for make check run with stall enabled, so any regressions on this should show up immediately in future.

I'll do an 8.5.4 later today.

@jcupitt
Copy link
Member Author

jcupitt commented Apr 23, 2017

... and thanks for the great bug report! It was (relatively) easy to track down thanks to your testcase!

@jcupitt
Copy link
Member Author

jcupitt commented Apr 23, 2017

OK, 8.5.4 is up, I'll nudge homebrew too.

@jcupitt jcupitt closed this as completed Apr 23, 2017
@andrieslouw
Copy link

Remi-repo has been updated, code is running in production, @kleisauke is monitoring our production-servers for errors. Huge thanks already!

@kleisauke
Copy link
Member

After monitoring our production server for errors, it seems that this issue has been solved. Thanks John!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants