The story of a hard system segfault


March 2015.
One of my servers suddenly fell victim to a lot of random segfaults this month.

The problem was easy to reproduce. Using vi, vim, or many other programs would trigger the fault.

Let's open the dump and see who's the culprit.

# gdb vim vim.core
[...]
[New process 101653]
Core was generated by `vim'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000000080142c39a in kill () from /lib/libc.so.7
(gdb) bt
#0 0x000000080142c39a in kill () from /lib/libc.so.7
#1 0x00000000004ceeac in ?? ()
#2 0x00007ffffffff193 in ?? ()
#3 0x00000000004cdb50 in ?? ()
#4 0x0000000000000000 in ?? ()

Huh? A problem in libc? The system had been working perfectly for a while, and its system binaries are mounted read-only, so there's no way the file was changed. How can there be a problem there?

I tried rebuilding both the system and all the ports (twice, once with gcc and once with clang, in case it was a compiler problem). Nothing changed.

After days of searching and compiling, I found the culprit: OpenSSL 1.0.2. I uninstalled the port version and rebuilt my ports against the OpenSSL from the base system (which was up to date, since I had just recompiled the base system while chasing the problem). Everything worked fine again afterwards.

But then why would vi crash? It's not linked to OpenSSL in any way.
It was hard to pin down that problem, but once you know the answer, it's absolutely obvious.

I was using LDAP to sync my users across my systems. To look up information about the owner of a file, the system goes through NSS_LDAP, which uses OpenLDAP, which in turn uses OpenSSL. That's how OpenSSL ends up involved when you run vi.
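
To make that chain a bit more concrete, here is a minimal sketch of the kind of owner lookup involved. It is not what vi does internally, only an illustration: the helper name owner_name is made up for this example, and it assumes a Unix-like system where Python's pwd module is available. The lookup goes through the system's password database, so on a machine configured for LDAP users it may be answered by the LDAP module rather than the local passwd file.

```python
import os
import pwd


def owner_name(path):
    """Return the login name of the owner of `path`.

    Falls back to the numeric UID (as text) when no user database
    knows that UID. `owner_name` is a hypothetical helper for this post.
    """
    uid = os.stat(path).st_uid
    try:
        # pwd.getpwuid() asks the system password database; which sources
        # are consulted (local files, LDAP, ...) depends on the host's
        # name service configuration, not on this script.
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No configured source has an entry for this UID.
        return str(uid)


if __name__ == "__main__":
    print(owner_name("/"))
```

Nothing in this script mentions LDAP or OpenSSL, which is the point: a program that only asks "who owns this file?" can still end up running code from whatever name service modules the system is configured to use.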

I think there's a lesson here.

Edit (2015-09-01): this bug is related: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=198788