Fix memory corruption when unbinding properties
Using g_signal_connect_data() to attach a destroy notifier that gets called when either the channel or object gets destroyed is clever, but the code is a bit convoluted, as disconnecting one triggers the other. I can't figure out exactly why, but I get memory corruption when unbinding a bunch of properties. For some reason, for one of the properties, the object's destroy notifier appears to run twice. Valgrind reports a bunch of invalid reads and writes.
This new code uses g_object_weak_ref() to watch for the channel and object themselves to be destroyed, and cleans things up that way. The code is more straightforward and I don't get any more memory errors or g_critical messages.
Merge request reports
Activity
Could you quote Nick's commit that you talked about on IRC? Also, could you give a Valgrind output and some steps to reproduce the problem?
I haven't really looked at the code yet, but just a question first:
g_object_weak_ref()
is known to be not thread-safe, can't that be a problem in this context? (I see there areG_LOCK()
…)Sure, let me repro this and will post logs. Regarding:
I haven't really looked at the code yet, but just a question first:
g_object_weak_ref()
is known to be not thread-safe, can't that be a problem in this context? (I see there areG_LOCK()
…)That should be fine. The user of xfconf already shouldn't be calling
xfconf_g_property_unbind()
(or a variant) from a thread that doesn't own the object that the property is bound to (or has to be using their own locking) -- that's just not safe regardless (as gobject itself isn't thread-safe when dealing with individual objects). Same for whatever might happen to cause the object to get finalized, which would trigger the weak ref handler. And right, there is already locking around__bindings
, so multiple threads can end up in the unbind code on different objects without issues.Here's Nick's commit: f5cbc11c
And here's the valgrind log: valgrind-issue.log
Just to track here other stuff:
With the original code,
xfconf_g_property_object_disconnect()
is for some reason triggered an extra time (in my case, 8 properties needed to be unbound, but it gets called 9 times). The extra call is where the valgrind memory errors come from, and there's a g_warning printed for the assertion in that function failing (XFCONF_IS_CHANNEL(binding->channel)
). I also tried running in gdb, breaking inxfconf_g_property_object_disconnect()
. For the first 8 calls, everything inside theXfconfGBinding
looks fine, but on the "extra" call, the setting and property name strings are filled with garbage.With this MR, none of that happens: I get only 8 calls into the new disconnection code, no g_warnings, and no valgrind memory errors at all.
I do wonder if there's something specific to what xfdesktop is doing here that triggers this, where usual auto-unbinds don't do it. Essentially what is happening is this:
- In the settings dialog, I change the icon style from "minimized app icons" to "none".
- Xfconfd sends a
PropertyChanged
over DBus to xfdesktop. - Libxfconf in xfdesktop sees the property change and runs
xfconf_g_property_channel_notify()
. - That calls
g_object_set_property()
on theXfceDesktop
instance to update itsicons-style
property. - That calls
xfce_desktop_set_icon_style()
, which unrefs theXfdesktopWindowIconManager
instance. - This is the bit that's new with my in-progress refactor: the manager instance now manages the
XfdesktopIconView
instance: the manager callsgtk_widget_destroy()
on the icon view instance. - Libxfconf notices the icon view instance signal handlers being disconnected, and starts removing 8 property bindings.
- But the property unbind code runs 9 times, the last of which has corrupt data.
So I'm wondering if there's something that isn't properly handled when the property binding auto-disconnect code is run in response to a notification on a different property binding. But after staring at the code for a while, I can't identify what that could be... but I'm also just not super comfortable with the auto-unbind code, as it's a little too "clever" for my taste.
Oh, another thing: I also tried manually each property in the
XfdesktopIconView
finalizer (and in dispose, to see if that would make a difference), which works (8 properties get unbound), but then there is still somehow an extra "automatic" call through the samexfconf_g_property_object_disconnect()
code path for a 9th call, which causes an assertion failure in gobject (something like "signal handler id 112 does not exist", presumably when something is trying to disconnect it). I don't recall if there are memory errors there too, but I expect there probably are.Edited by Brian TarriconeIt seems I need more info to reproduce the problem. I tried running
xfdesktop
,xfdesktop-settings
andxfconfd
invalgrind
and changing the icon style with no result. As in the above logs you seem to run the development version in xfdesktop!43 (merged), I also tried it on X11 and Wayland without result either.Do you see what I am missing?
Ah, there's no MR for it because it's nowhere near ready. It's this branch: https://gitlab.xfce.org/kelnos/xfdesktop/-/tree/refactor-icon-view (a lot of things are broken in it, but enough should work to repro the issue). All you should need to do is start it up with "Minimized App Icons" selected, and after it starts, switch it to "None".
Except... I just double-checked, and now I can't reproduce this against either xfconf master or 4.16.0. So maybe this really was a memory corruption issue in xfdesktop itself, but somehow wasn't triggering any issues until the unbind code got activated? And at some point in the past day and a half I accidentally fixed what was wrong in xfdesktop? Hrm...
Edited by Brian TarriconeOk, I can repro again -- sorta.
I can't see any valgrind memory errors, and, strangely, when running through valgrind I don't see any g_warning assertion failures either. But when running standalone or in gdb, I can get(xfdesktop:2968222): xfconf-CRITICAL **: 17:08:26.465: xfconf_g_property_channel_disconnect: assertion 'XFCONF_IS_CHANNEL(binding->channel)' failed
to happen pretty reliably by switching from Min Apps to None. Interestingly though, it only seems to happen once, the first time I switch, and then doesn't happen again.
And it's also strange I don't see it under valgrind anymore.Also it only seems to happen if I start xfdesktop when Minimized App Icons mode is already selected, and then change it to None. If I start in None, an then change to Minimized App icons, and then back to None, nothing bad seems to happen.
Can't repro this at all with this patch.
Update: not sure what I was doing before, but the valgrind memory errors are actually still there. Just I don't see the g_warning assertion failure when running through valgrind for some reason.
Edited by Brian TarriconeOk I can reproduce the trace in Valgrind as well as the critical now, using your repo at commit https://gitlab.xfce.org/kelnos/xfdesktop/-/commit/73ccebe616df2c6204e375a6685e049c86891f0c.
Trace in gdb for the critical:
(xfdesktop:170670): xfconf-CRITICAL **: 12:34:13.295: xfconf_g_property_channel_disconnect: assertion 'XFCONF_IS_CHANNEL(binding->channel)' failed Thread 1 "xfdesktop" received signal SIGTRAP, Trace/breakpoint trap. 0x00007ffff6d258e5 in g_logv () from /usr/lib/libglib-2.0.so.0 #0 0x00007ffff6d258e5 in g_logv () at /usr/lib/libglib-2.0.so.0 #1 0x00007ffff6d25b64 in g_log () at /usr/lib/libglib-2.0.so.0 #2 0x00007ffff6e2cb08 in g_closure_unref () at /usr/lib/libgobject-2.0.so.0 #3 0x00007ffff6e432b3 in () at /usr/lib/libgobject-2.0.so.0 #4 0x00007ffff6e5adef in () at /usr/lib/libgobject-2.0.so.0 #5 0x00007ffff6e4af75 in g_signal_emit_valist () at /usr/lib/libgobject-2.0.so.0 #6 0x00007ffff6e4b204 in g_signal_emit () at /usr/lib/libgobject-2.0.so.0 #7 0x00007ffff6e2d210 in g_closure_invoke () at /usr/lib/libgobject-2.0.so.0 #8 0x00007ffff6e5aea8 in () at /usr/lib/libgobject-2.0.so.0 #9 0x00007ffff6e4af75 in g_signal_emit_valist () at /usr/lib/libgobject-2.0.so.0 #10 0x00007ffff6e4b204 in g_signal_emit () at /usr/lib/libgobject-2.0.so.0 #11 0x00007ffff7de5084 in xfconf_cache_handle_property_changed (parameters=<optimized out>, cache=0x555555969680) at xfconf-cache.c:523 #12 xfconf_cache_proxy_signal_received_cb (proxy=<optimized out>, sender_name=<optimized out>, signal_name=<optimized out>, parameters=<optimized out>, user_data=0x555555969680) at xfconf-cache.c:574 #13 0x00007ffff6e2d210 in g_closure_invoke () at /usr/lib/libgobject-2.0.so.0 #14 0x00007ffff6e5aea8 in () at /usr/lib/libgobject-2.0.so.0 #15 0x00007ffff6e4af75 in g_signal_emit_valist () at /usr/lib/libgobject-2.0.so.0 #16 0x00007ffff6e4b204 in g_signal_emit () at /usr/lib/libgobject-2.0.so.0 #17 0x00007ffff6f874f6 in () at /usr/lib/libgio-2.0.so.0 #18 0x00007ffff6f74218 in () at /usr/lib/libgio-2.0.so.0 #19 0x00007ffff6d1c87b in g_main_context_dispatch () at /usr/lib/libglib-2.0.so.0 #20 0x00007ffff6d73299 in () at /usr/lib/libglib-2.0.so.0 #21 0x00007ffff6d1bddf in g_main_loop_run () at /usr/lib/libglib-2.0.so.0 #22 0x00007ffff75d8ebf in gtk_main () at /usr/lib/libgtk-3.so.0 #23 0x0000555555586c70 in xfdesktop_application_start (app=0x555555606950) at xfdesktop-application.c:815 #24 0x0000555555586e5e in cb_wait_for_window_manager_destroyed (data=0x7fffe4012980) at xfdesktop-application.c:608 #25 0x00007ffff6d1719d in () at /usr/lib/libglib-2.0.so.0 #26 0x00007ffff6d19830 in () at /usr/lib/libglib-2.0.so.0 #27 0x00007ffff6d1c9b0 in g_main_context_dispatch () at /usr/lib/libglib-2.0.so.0 #28 0x00007ffff6d73299 in () at /usr/lib/libglib-2.0.so.0 #29 0x00007ffff6d1b132 in g_main_context_iteration () at /usr/lib/libglib-2.0.so.0 #30 0x00007ffff6f5810e in g_application_run () at /usr/lib/libgio-2.0.so.0 #31 0x000055555558821d in xfdesktop_application_run (app=<optimized out>, argc=<optimized out>, argv=<optimized out>) at xfdesktop-application.c:824 #32 0x0000555555574978 in main (argc=1, argv=0x7fffffffdf68) at main.c:54
I'm not sure what to do here. I can't figure out where the problem really comes from, and I don't see anything problematic in the current Xfconf code. So I'm hesitant to validate a change that, although it seems to be good and fix the problem, doesn't seem to be necessary (or I can't convince myself that it is).
Don't you think that the problem could come from your refactoring, since it's the only one causing this problem? (of course it's not a proof, it can also reveal a problem in Xfconf)
I noticed for example that this line was duplicated, which seems to me to be a mistake. But anyway, deleting it doesn't change anything... (at least concerning the critical)
Ah, man, when I first read about the duplicate binding I thought that would be it, but you're right that it doesn't fix the issue.
I'm struggling with the same thing: I can't find anything that looks wrong about the existing xfconf code either. I'm definitely a little concerned about it; the whole "one signal handler disconnect triggers the other" bit gives me a bad feeling that it could cause problems (but nothing I can prove). But, as you said, we don't see this issue anywhere else except in my refactor.
Ok, here's something I just thought of to try (spoiler: it worked!). My gut feeling was that there was something weird going on because I was removing bindings inside another binding/property-change handler (the new code unbinds those properties as a result of the xfconf icon style setting changing, when the binding mechanism changes a gobject icon-style property on
XfceDesktop
). I changed it so instead of immediately changing the icon style ("inside" the xfconf setting change handler), it just usesg_idle_add()
to queue the change (which will then remove the old icon view manager and icon view and apply the new setting). And this worked! No more warning, no more memory errors, no more "extra" unbind.To me, this is definitely suggestive that there is an issue, but it only triggers when removing bindings when we're already inside a handler for another binding. I think the next step would be to prove it by writing a standalone, minimal reproducing case.
Success! Created a minimal reproducer: https://gitlab.xfce.org/-/snippets/12 (instructions are in the snippet).
Thanks, that helps a lot in tracing the problem. Adding some traces also in the Xfconf code, it shows that this code is based on the false assumption that the GClosureNotify is always called immediately after disconnecting a signal handler connected via
g_signal_connect_data()
.For some reason (we would have to dig into the GObject code to see if this is normal or not…), in a nested case like your test case, the GClosureNotify invocation is delayed. So the binding is freed in
xfconf_g_property_object_disconnect()
before enteringxfconf_g_property_channel_disconnect()
, hence the corrupted data.The change you propose in this MR certainly fixes the problem and I'd be willing to validate it, but here's a less invasive one that more specifically targets the problem at hand. What do you think of it? (I only tested it on your minimal test case, not on your Xfdesktop branch, but I think it's good too)
Edited by Gaël BonithonOh interesting! I wonder if there is some data integrity issue internal to gobject when that particular case is hit, so they queue the second invocation with
g_idle_add()
or similar to work around that.Your change seems much better -- it should keep the same semantics as the current code, but only frees the binding struct once it's sure both closures have run. If you've tested it on my minimal test case, I think it's safe to say that will work with xfdesktop as well. I can give it a look later today or this evening, though, if that helps (just about lunchtime here now!).
Oh, regarding the patch, maybe you noticed, but if not: there is a comment above one of the
g_signal_handler_disconnect()
calls that is maybe no longer quite exactly correct after your patch, so probably good to delete it or change it.If you've tested it on my minimal test case, I think it's safe to say that will work with xfdesktop as well.
I actually also tested with Xfdesktop in between, at the same commit as above, first reproducing the critical and then verifying that it disappears with this patch (and verifying by the traces added to Xfconf that the same process takes place). So yes, I think it's good.
Oh, regarding the patch, maybe you noticed, but if not: there is a comment above
Good catch, patch updated. I think it's safe to push this for 4.18, do you agree? @alexxcons is this good for you too?
Your patch is much simpler and easier to review than mine is, too
😄 Uh, I first reviewed your patch, and then found the one from @Tamaranch. Yea, that one is indeed much more simple :D .... I think it would be nice to as well have a small comment inside the code, to make things more clear E.g:
if(binding->channel) { g_signal_handler_disconnect(G_OBJECT(binding->channel), binding->channel_handler); } else { /* Only release the binding if 'channel' already was unset in xfconf_g_property_channel_disconnect */ g_free(binding->xfconf_property); g_free(binding->object_property); g_slice_free(XfconfGBinding, binding);
... if(binding->object) { g_signal_handler_disconnect(G_OBJECT(binding->object), binding->object_handler); } else { /* Only release the binding if 'object' already was unset in xfconf_g_property_object_disconnect */ g_free(binding->xfconf_property); g_free(binding->object_property); g_slice_free(XfconfGBinding, binding);
Besides that, fine for me to merge it, so we have it in 4.18.0.
mentioned in commit 59747b46