segv under heavy load #161

Closed
jcupitt opened this issue Apr 23, 2015 · 13 comments
Labels
defect, format-vms (Hamamatsu VMS), format-vmu (Hamamatsu VMU)

@jcupitt
Contributor

jcupitt commented Apr 23, 2015

Hi, I'm seeing a segv with 3.4.0 under heavy load. This is 50 threads all calling openslide_detect_vendor() in parallel:

https://gist.github.com/jcupitt/8ec613adcd6c000e82d7

I'm seeing:

$ for i in {1..10}; do ./a.out ~/pics/k2.jpg; done
Segmentation fault (core dumped)
Segmentation fault (core dumped)
*** Error in `./a.out': free(): invalid next size (normal): 0x00007f47000036e0 ***
Aborted (core dumped)
... etc.

I'll try with master as well.

@jcupitt
Contributor Author

jcupitt commented Apr 23, 2015

I see similar signs of memory corruption with master.

$ for i in {1..10}; do ./a.out ~/pics/k2.jpg; done

(a.out:32152): GLib-ERROR **: creating thread '': Error creating thread: Resource temporarily unavailable
Trace/breakpoint trap (core dumped)

(a.out:310): GLib-ERROR **: creating thread '': Error creating thread: Resource temporarily unavailable
Trace/breakpoint trap (core dumped)
*** Error in `./a.out': free(): invalid next size (normal): 0x00007f725c0036c0 ***
Aborted (core dumped)

Not sure what's causing the glib error.

@agoode
Member

agoode commented Apr 23, 2015

Hi John,

Can you try building with asan? Try adding this gcc flag:
-fsanitize=address

@agoode
Member

agoode commented Apr 23, 2015

You can also do tsan (it may need to be in a separate run):
-fsanitize=thread

For both of these, you have to pass the flag during both the compile and link steps.

@bgilbert
Member

Hi John, thanks for your report. I can't reproduce the segfault on Fedora 21 x86_64, with either 3.4.0 or master, but I am seeing the SIGTRAP (from inside glib, on a failed pthread_setspecific() call). However, the trap also happens if I comment out the openslide_detect_vendor() call.

Your test program is leaking GThread *. If I modify it to g_thread_unref() each new thread, it fixes the SIGTRAP for me. Do you still see the problem with that modification in place?
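The leak described above has a close analogue in plain POSIX threads: a joinable thread that is never joined or detached keeps its resources allocated, and thread creation can eventually fail with "Resource temporarily unavailable". The following is a minimal sketch of the cleanup, written with pthreads rather than GLib so that it builds without extra libraries. `probe_worker`, `spawn_and_join` and `MAX_WORKERS` are hypothetical names and are not taken from the test program in the gist.

```c
#include <pthread.h>

#define MAX_WORKERS 64

/* Hypothetical worker. The real test program would call
 * openslide_detect_vendor() here; this one only records that it ran. */
static void *probe_worker(void *arg)
{
    int *slot = arg;
    *slot = 1;
    return NULL;
}

/* Start up to n threads (capped at MAX_WORKERS) and join every one that
 * was started, so that no thread is left unreclaimed.
 * Returns the number of workers that ran to completion. */
int spawn_and_join(int n)
{
    pthread_t threads[MAX_WORKERS];
    int ran[MAX_WORKERS] = {0};
    int started = 0;
    int finished = 0;

    if (n > MAX_WORKERS)
        n = MAX_WORKERS;

    for (int i = 0; i < n; i++) {
        /* Creation can fail, for example with EAGAIN, when thread
         * resources are exhausted. Stop and clean up what exists. */
        if (pthread_create(&threads[i], NULL, probe_worker, &ran[i]) != 0)
            break;
        started++;
    }

    for (int i = 0; i < started; i++) {
        /* Joining waits for the thread and releases its resources. */
        if (pthread_join(threads[i], NULL) == 0)
            finished += ran[i];
    }

    return finished;
}
```

In a GLib program the corresponding calls are g_thread_join(), which waits for the thread and consumes the caller's reference, or g_thread_unref(), which drops the reference without waiting.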

@jcupitt
Contributor Author

jcupitt commented Apr 24, 2015

Oops, you're right. I've updated the test program to remove the vips dependency and clean up. It fixes the glib errors, but I still see random heap corruption:

$ for i in {1..10}; do ./a.out ~/pics/k2.jpg; done
*** Error in `./a.out': free(): invalid next size (fast): 0x00007fc504001580 ***
Aborted (core dumped)
*** Error in `./a.out': free(): invalid next size (normal): 0x00007f58ec003220 ***
Aborted (core dumped)
*** Error in `./a.out': double free or corruption (out): 0x00007fb9c40015a0 ***
Aborted (core dumped)

Roughly one run in three crashes. That's with git master openslide. The crashes go away if I comment out openslide_detect_vendor().

I'll try with asan, as you suggested @agoode.

@jcupitt
Contributor Author

jcupitt commented Apr 24, 2015

I got an error from asan:

$ for i in {1..10}; do ../a.out ~/pics/k2.jpg; done
=================================================================
==4314== ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x60240002fac0
    #0 0x7fd1ac27133a (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x1533a)
    #1 0x7fd1abf8ba87 (/lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0+0x37a87)
0x60240002fac0 is located 0 bytes inside of 183328-byte region [0x60240002fac0,0x60240005c6e0)
==4314== AddressSanitizer CHECK failed: ../../../../src/libsanitizer/asan/asan_thread_registry.cc:153 "((tid < n_threads_)) != (0)" (0x0, 0x0)
    #0 0x7fd1ac26e31d (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x1231d)
    #1 0x7fd1ac275133 (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x19133)
    #2 0x7fd1ac26f8c8 (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x138c8)
    #3 0x7fd1ac272e01 (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x16e01)
    #4 0x7fd1ac273733 (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x17733)
    #5 0x7fd1ac264a3f (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x8a3f)
    #6 0x7fd1ac271367 (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x15367)
    #7 0x7fd1abf8ba87 (/lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0+0x37a87)
    #8 0x7fd1abf8c003 (/lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0+0x38003)
    #9 0x7fd1abf8c106 (/lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0+0x38106)
    #10 0x7fd1abf7a1ff (/lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0+0x261ff)
    #11 0x7fd1abf931fb (/lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0+0x3f1fb)
    #12 0x7fd1abf94cfc (/lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0+0x40cfc)
    #13 0x7fd1abd07461 (/home/john/vips/lib/libopenslide.so.0.4.1+0x36461)
    #14 0x7fd1abce3ffd (/home/john/vips/lib/libopenslide.so.0.4.1+0x12ffd)
    #15 0x7fd1abce43f8 (/home/john/vips/lib/libopenslide.so.0.4.1+0x133f8)
    #16 0x400ac2 (/home/john/try/a.out+0x400ac2)
    #17 0x7fd1abfc1f04 (/lib/x86_64-linux-gnu/libglib-2.0.so.0.4002.0+0x6df04)
    #18 0x7fd1ac274b97 (/usr/lib/x86_64-linux-gnu/libasan.so.0.0.0+0x18b97)
    #19 0x7fd1ab6f6181 (/lib/x86_64-linux-gnu/libpthread-2.19.so+0x8181)
    #20 0x7fd1aba0647c (/lib/x86_64-linux-gnu/libc-2.19.so+0xfa47c)

Not sure how helpful that is.

@jcupitt
Contributor Author

jcupitt commented Apr 24, 2015

I forgot to say, I'm on Ubuntu 14.04, the current long-term support release.

@jcupitt
Contributor Author

jcupitt commented Apr 24, 2015

I tried with valgrind, but that ran cleanly; it's not very good with threads. -fsanitize=thread doesn't seem to work on 14.04.

@bgilbert
Member

Reproduced on Fedora 21; thanks for the update. The problem is inside g_get_language_names(), called during GKeyFile initialization via the VMS/VMU sniffer. Trying to track it down.

@jcupitt
Contributor Author

jcupitt commented Apr 25, 2015

Oh, possibly a glib error? That's interesting. It took me a week to track it down this far :-( good luck.

@bgilbert
Member

It turns out that a helper function of g_get_language_names() performs unsynchronized initialization of a static GHashTable the first time it is called, and has been doing so since the code was added to glib in 2004. The bug can be avoided by calling g_get_language_names() before any threads are started. I will add that call to OpenSlide's shared library constructor, but you may also want to add it to VIPS to avoid the bug with older OpenSlide.
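The workaround described above can be sketched without GLib. This is an illustration under stated assumptions, not OpenSlide's or glib's actual code: `lookup()`, `warm_up()`, `run_lookups()`, `table_build_count()` and the table itself are hypothetical stand-ins for g_get_language_names() and its static hash table, and plain POSIX threads are used so that it builds without extra libraries. `__attribute__((constructor))` is a GCC/Clang extension.

```c
#include <pthread.h>
#include <stddef.h>

#define TABLE_SIZE 16
#define MAX_THREADS 64

/* Hypothetical stand-in for the lazily built static table. */
static int table_storage[TABLE_SIZE];
static int *table = NULL;
static int table_builds = 0;    /* counts initializations, for checking */

/* Unsynchronized lazy initialization. This is only safe if the first
 * call completes before any other thread can call the function. */
static int lookup(unsigned key)
{
    if (table == NULL) {
        for (int i = 0; i < TABLE_SIZE; i++)
            table_storage[i] = i * i;
        table_builds++;
        table = table_storage;
    }
    return table[key % TABLE_SIZE];
}

/* The workaround: force the first call while the process is still
 * single-threaded. A constructor runs before main(), or when the
 * shared library containing it is loaded. */
__attribute__((constructor))
static void warm_up(void)
{
    (void) lookup(0);
}

static void *lookup_worker(void *arg)
{
    int *result = arg;
    *result = lookup(3);        /* the table already exists: read-only */
    return NULL;
}

int table_build_count(void)
{
    return table_builds;
}

/* Start up to n threads (capped at MAX_THREADS) that all call lookup().
 * Returns how many of them saw the expected value. */
int run_lookups(int n)
{
    pthread_t threads[MAX_THREADS];
    int results[MAX_THREADS];
    int started = 0;
    int ok = 0;

    if (n > MAX_THREADS)
        n = MAX_THREADS;

    for (int i = 0; i < n; i++) {
        results[i] = -1;
        if (pthread_create(&threads[i], NULL, lookup_worker, &results[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(threads[i], NULL);
        if (results[i] == 3 * 3)
            ok++;
    }
    return ok;
}
```

Because the table is built before any worker exists, the threads only ever read it. In a GLib program the equivalent is a single call to g_get_language_names() at startup, before the first thread is created.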

jcupitt added a commit to libvips/libvips that referenced this issue Apr 26, 2015
works around a crash under heavy load, see openslide/openslide#161

@jcupitt
Contributor Author

jcupitt commented Apr 26, 2015

I've added a g_get_language_names() to vips_init() and it's fixed my crash too. Thanks for finding this @bgilbert.

bgilbert added the defect, format-vms (Hamamatsu VMS), and format-vmu (Hamamatsu VMU) labels Sep 24, 2015

@bgilbert
Member

glib 2.48.1 fixes the thread safety of g_get_language_names().
