[RFC,v2,06/13] Fix: unregister app notify socket on sessiond tear down
Commit Message
A race between the sessiond tear down and applications initialization
can lead to a deadlock.
Applications try to communicate via the notify sockets while sessiond
does not listen anymore on these sockets since the thread responsible
for reception/response is terminated (ust_thread_manage_notify). These
sockets are never closed hence an application could hang on
communication.
Sessiond hang happen during call to cmd_destroy_session during
sessiond_cleanup. Sessiond is trying to communicate with the app while
the app is waiting for a response on the app notification socket.
To prevent this situation a call to ust_app_notify_sock_unregister is
performed on all entry of the ust_app_ht_by_notify_sock hash table at
the time of termination. This ensure that any pending communication
initiated by the application will be terminated since all sockets will
be closed at the end of the grace period via call_rcu inside
ust_app_notify_sock_unregister. The use of ust_app_ht_by_notify_sock
instead of the ust_app_ht prevent a double call_rcu since entries are
removed from ust_app_ht_by_notify_sock during ust_app_notify_sock_unregister.
This can be reproduced using the sessiond_teardown_active_session
scenario provided by [1].
[1] https://github.com/PSRCode/lttng-stress
Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>
---
src/bin/lttng-sessiond/ust-thread.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
Comments
On 18 September 2017 at 18:51, Jonathan Rajotte
<jonathan.rajotte-julien at efficios.com> wrote:
> A race between the sessiond tear down and applications initialization
> can lead to a deadlock.
>
> Applications try to communicate via the notify sockets while sessiond
the sessiond
> does not listen anymore on these sockets since the thread responsible
> for reception/response is terminated (ust_thread_manage_notify). These
> sockets are never closed hence an application could hang on
> communication.
>
> Sessiond hang happen during call to cmd_destroy_session during
The sessiond hang happens during the call to cmd_destroy_session initiated
by sessiond_cleanup().
> sessiond_cleanup. Sessiond is trying to communicate with the app while
> the app is waiting for a response on the app notification socket.
>
> To prevent this situation a call to ust_app_notify_sock_unregister is
> performed on all entry of the ust_app_ht_by_notify_sock hash table at
entry -> entries
> the time of termination. This ensure that any pending communication
ensure -> ensures
> initiated by the application will be terminated since all sockets will
> be closed at the end of the grace period via call_rcu inside
> ust_app_notify_sock_unregister. The use of ust_app_ht_by_notify_sock
> instead of the ust_app_ht prevent a double call_rcu since entries are
prevent -> prevents
Do I understand the "big picture" of what you are trying to solve here?
- App communicates on its 'notify' socket and waits for a reply
- Sessiond never sends that reply since apps_notify_thread (the
thread that manages application 'notify' sockets) is dead.
- Sessiond is sending a command through the application 'command'
socket during its clean-up
- Sessiond never receives a reply since the application is still
waiting for a reply on the 'notify' socket.
The 'simple' fix I see would be that the apps_notify_thread should
close all the FDs it manages (and set them to -1) as it is the only
thread interacting with those FDs (to my knowledge). That would
unblock the applications and prevent further notification from
being sent.
What I understand you are trying to do is to perform the clean-up
in the same way it would have occurred had the application simply
died. At first glance, that sounds like a clean way to handle the
problem.
Going back to your patch:
Why does using ust_app_ht_by_notify_sock instead of ust_app_ht
prevent a double call_rcu?
I see how this thread performs a deferred clean-up using call_rcu().
What is causing the second one?
> removed from ust_app_ht_by_notify_sock during ust_app_notify_sock_unregister.
>
> This can be reproduced using the sessiond_teardown_active_session
> scenario provided by [1].
>
> [1] https://github.com/PSRCode/lttng-stress
>
> Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>
> ---
> src/bin/lttng-sessiond/ust-thread.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/src/bin/lttng-sessiond/ust-thread.c b/src/bin/lttng-sessiond/ust-thread.c
> index 1e7a8229..8f11133a 100644
> --- a/src/bin/lttng-sessiond/ust-thread.c
> +++ b/src/bin/lttng-sessiond/ust-thread.c
> @@ -27,6 +27,19 @@
> #include "health-sessiond.h"
> #include "testpoint.h"
>
> +
> +static
> +void notify_sock_unregister_all()
> +{
> + struct lttng_ht_iter iter;
> + struct ust_app *app;
Missing empty line here.
> + rcu_read_lock();
> + cds_lfht_for_each_entry(ust_app_ht_by_notify_sock->ht, &iter.iter, app, notify_sock_n.node) {
This line exceeds 80 chars.
> + ust_app_notify_sock_unregister(app->notify_sock);
> + }
> + rcu_read_unlock();
> +}
> +
> /*
> * This thread manage application notify communication.
> */
> @@ -53,7 +66,7 @@ void *ust_thread_manage_notify(void *data)
>
> ret = lttng_poll_create(&events, 2, LTTNG_CLOEXEC);
> if (ret < 0) {
> - goto error;
> + goto error_poll_create;
> }
>
> /* Add quit pipe */
> @@ -197,6 +210,8 @@ error_poll_create:
> error_testpoint:
> utils_close_pipe(apps_cmd_notify_pipe);
> apps_cmd_notify_pipe[0] = apps_cmd_notify_pipe[1] = -1;
> + notify_sock_unregister_all();
> +
> DBG("Application notify communication apps thread cleanup complete");
> if (err) {
> health_error();
> --
> 2.11.0
>
@@ -27,6 +27,19 @@
#include "health-sessiond.h"
#include "testpoint.h"
+
+static
+void notify_sock_unregister_all()
+{
+ struct lttng_ht_iter iter;
+ struct ust_app *app;
+ rcu_read_lock();
+ cds_lfht_for_each_entry(ust_app_ht_by_notify_sock->ht, &iter.iter, app, notify_sock_n.node) {
+ ust_app_notify_sock_unregister(app->notify_sock);
+ }
+ rcu_read_unlock();
+}
+
/*
* This thread manage application notify communication.
*/
@@ -53,7 +66,7 @@ void *ust_thread_manage_notify(void *data)
ret = lttng_poll_create(&events, 2, LTTNG_CLOEXEC);
if (ret < 0) {
- goto error;
+ goto error_poll_create;
}
/* Add quit pipe */
@@ -197,6 +210,8 @@ error_poll_create:
error_testpoint:
utils_close_pipe(apps_cmd_notify_pipe);
apps_cmd_notify_pipe[0] = apps_cmd_notify_pipe[1] = -1;
+ notify_sock_unregister_all();
+
DBG("Application notify communication apps thread cleanup complete");
if (err) {
health_error();