diff mbox series

[RFC,v2,10/13] Teardown apps_notify_thread before thread_manage_apps

Message ID 20170918225206.17725-11-jonathan.rajotte-julien@efficios.com
State Superseded, archived
Delegated to: Jérémie Galarneau
Headers show
Series Sessiond teardown overhaul | expand

Commit Message

Jonathan Rajotte Sept. 18, 2017, 10:52 p.m. UTC
RFC:
This is necessary since we use the ust_app_ht_by_notify_sock to
perform the cleanup operation. Since both ust_app_unregister and
ust_app_ust_app_notify_unregister perform a remove of the app on the
ust_ust_app_by_notify_sock but only ust_app_notify_unregister actually
end up performing a close(call_rcu) on the socket.

Other way to do fix this problem?

Could we simply not remove it on a ust_app_unregister? And always defer
to the apps_notify_thread for cleanup?

Update the value in the hash table to -1 and emit a close and remove
from the hash table if the value is -1?

We could also keep a local list of fd in apps_notify_thread and use it for
cleanup instead of relying on ust_ust_app_by_notify_sock.

I'm not sure what is the best/elegant solution here. I am not a fan of
the current solution but it working.

Obviously this commit will be reworded and modified accordingly.

Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>
---
 src/bin/lttng-sessiond/main.c | 55 ++++++++++++++++++++++++-------------------
 1 file changed, 31 insertions(+), 24 deletions(-)

Comments

Jérémie Galarneau Dec. 13, 2017, 5:58 p.m. UTC | #1
On 18 September 2017 at 18:52, Jonathan Rajotte
<jonathan.rajotte-julien at efficios.com> wrote:
> RFC:
> This is necessary since we use the ust_app_ht_by_notify_sock to
> perform the cleanup operation. Since both ust_app_unregister and
> ust_app_ust_app_notify_unregister perform a remove of the app on the
> ust_ust_app_by_notify_sock but only ust_app_notify_unregister actually

There are some extra ust_ and ust_app_ on the last two lines ;)

> end up performing a close(call_rcu) on the socket.

a close(call_rcu) ?

>

It took me a while to wrap my head around what the problem is given this
description. Please, correct me if I'm wrong.

Two threads are performing a clean-up step based on the content of the
ust_app_ht_by_notify_sock hash table.

The apps_notify_thread, on shutdown:
  - invokes notify_sock_unregister_all()
    - iterates through the ust_app_ht_by_notify_sock hash table
      - calls ust_app_notify_sock_unregister() on every application
        notification socket
        - removes the app from the ust_app_ht_by_notify_sock hash table
        - closes the notify socket through a deferred call_rcu()


The apps_thread, on shutdown:
  - iterates through the ust_app_ht hash table
    - calls ust_app_unregister() on every application notification
      socket
      - This function's header comment mentions "The socket is already
        closed at this point so no close to sock.", by which the author
        most likely meant that "there is no need to close the socket".

        This is no longer true with your patch (#09). This thread
        only reacts to errors on the application sockets and then
 tears
        down

      - Flushes that app's metadata and buffers
      - Removes the app from the ust_app_ht_by_sock hash table
      - Removes the app from the ust_app_ht_by_notify_sock hash table
      - Removes the app from the ust_app_ht hash table


Given this, it already seems weird that both threads try to remove
the entry from the ust_app_ht_by_notify_sock hash table.

In the case where we are not shutting down,


> Other way to do fix this problem?
>
> Could we simply not remove it on a ust_app_unregister? And always defer
> to the apps_notify_thread for cleanup?
>
> Update the value in the hash table to -1 and emit a close and remove
> from the hash table if the value is -1?
>
> We could also keep a local list of fd in apps_notify_thread and use it for
> cleanup instead of relying on ust_ust_app_by_notify_sock.
>
> I'm not sure what is the best/elegant solution here. I am not a fan of
> the current solution but it working.
>
> Obviously this commit will be reworded and modified accordingly.
>
> Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>
> ---
>  src/bin/lttng-sessiond/main.c | 55 ++++++++++++++++++++++++-------------------
>  1 file changed, 31 insertions(+), 24 deletions(-)
>
> diff --git a/src/bin/lttng-sessiond/main.c b/src/bin/lttng-sessiond/main.c
> index 4a2a661f..216d0da6 100644
> --- a/src/bin/lttng-sessiond/main.c
> +++ b/src/bin/lttng-sessiond/main.c
> @@ -6209,16 +6209,6 @@ int main(int argc, char **argv)
>         }
>         notification_thread_running = true;
>
> -       /* Create thread to manage application notify socket */
> -       ret = pthread_create(&apps_notify_thread, default_pthread_attr(),
> -                       ust_thread_manage_notify, (void *) NULL);
> -       if (ret) {
> -               errno = ret;
> -               PERROR("pthread_create notify");
> -               retval = -1;
> -               stop_threads();
> -               goto exit_apps_notify;
> -       }
>
>         /* Create thread to manage application socket */
>         ret = pthread_create(&apps_thread, default_pthread_attr(),
> @@ -6231,6 +6221,17 @@ int main(int argc, char **argv)
>                 goto exit_apps;
>         }
>
> +       /* Create thread to manage application notify socket */
> +       ret = pthread_create(&apps_notify_thread, default_pthread_attr(),
> +                       ust_thread_manage_notify, (void *) NULL);
> +       if (ret) {
> +               errno = ret;
> +               PERROR("pthread_create notify");
> +               retval = -1;
> +               stop_threads();
> +               goto exit_apps_notify;
> +       }
> +
>         /* Create thread to dispatch registration */
>         ret = pthread_create(&dispatch_thread, default_pthread_attr(),
>                         thread_dispatch_ust_registration, (void *) NULL);
> @@ -6358,20 +6359,6 @@ exit_reg_apps:
>         }
>
>  exit_dispatch:
> -       /* Instruct the apps thread to quit */
> -       ret = notify_thread_pipe(thread_apps_teardown_trigger_pipe[1]);
> -       if (ret < 0) {
> -               ERR("write error on thread quit pipe");
> -       }
> -
> -       ret = pthread_join(apps_thread, &status);
> -       if (ret) {
> -               errno = ret;
> -               PERROR("pthread_join apps");
> -               retval = -1;
> -       }
> -
> -exit_apps:
>         /* Instruct the apps_notify thread to quit */
>         ret = notify_thread_pipe(thread_apps_notify_teardown_trigger_pipe[1]);
>         if (ret < 0) {
> @@ -6386,6 +6373,26 @@ exit_apps:
>         }
>
>  exit_apps_notify:
> +       /*
> +        * The barrier ensure that all previous resources, notify sockets in
> +        * particular, are freed/closed.
> +        */
> +       rcu_barrier();
> +
> +       /* Instruct the apps thread to quit */
> +       ret = notify_thread_pipe(thread_apps_teardown_trigger_pipe[1]);
> +       if (ret < 0) {
> +               ERR("write error on thread quit pipe");
> +       }
> +
> +       ret = pthread_join(apps_thread, &status);
> +       if (ret) {
> +               errno = ret;
> +               PERROR("pthread_join apps");
> +               retval = -1;
> +       }
> +
> +exit_apps:
>  exit_notification:
>  exit_health:
>  exit_init_data:
> --
> 2.11.0
>
Jérémie Galarneau Dec. 14, 2017, 1:57 a.m. UTC | #2
Sorry, I fat-finger-sent an incomplete reply, disregard it. This e-mail
contains all the comments.

On 18 September 2017 at 18:52, Jonathan Rajotte
<jonathan.rajotte-julien at efficios.com> wrote:
> RFC:
> This is necessary since we use the ust_app_ht_by_notify_sock to
> perform the cleanup operation. Since both ust_app_unregister and
> ust_app_ust_app_notify_unregister perform a remove of the app on the
> ust_ust_app_by_notify_sock but only ust_app_notify_unregister actually

There are some extra ust_ and ust_app_ on the last two lines ;)

> end up performing a close(call_rcu) on the socket.

a close(call_rcu) ?

>
> Other way to do fix this problem?
>

Let me walk through the problem as I understand it and
we'll see if/where I am going off-course.

An ust_app has two sockets:
  - command (through which the sessiond sends commands to the app)
  - notify (through which the application notifies the sessiond of
    various conditions such as new events being available, etc.)

These two sockets are received from applications by the
reg_apps_thread [1] which posts them to the dispatch_thread [2].
Once the dispatch thread is aware of both sockets for a given app,
the two sockets are "dispatched" to their respective management
threads.

* The "command" socket is sent to the "apps_thread" [3]
* The "notify" socket is sent to the "apps_notify_thread" [4]

Let's look at what happens when an application dies.

The "apps_thread" (handles application command sockets):
- wakes up from its poll() and notices an error/hang-up has
  occurred on an application's command socket
  - calls ust_app_unregister() on with that socket's FD
    - retrieves the ust_app from the socket's FD through
      ust_app_ht_by_sock
      - flushes that application's buffers and metadata
      - removes the app from ust_app_ht_by_sock
      - removes the app from ust_app_ht_by_notify_sock
        (it's okay for this to fail)
      - removes the app from ust_app_ht (pid <-> ust_app)
      - enqueues a call_rcu() job to close the command socket

The "apps_notify_thread" (handles application notification sockets):
- wakes up from its poll() and notices an error/hand-up has
  occurred on an application's notification socket
  - calls ust_app_notify_sock_unregister()
    - removes the app from ust_app_ht_by_notify_sock
      (it's okay for this to fail since both threads are trying
       to clean this up)
    - call_rcu() to close the notify socket


Now, provided that I understand the problem correctly
(as stated in my reply to patch #06), you want to clean-up
the command and notify sockets in the same way that they
would be if their associated apps had died. That's fair.

However, as seen above, cleaning up the "apps_thread" will cause
it to empty all the ust_app data structures:
  * ust_app_ht_by_sock
  * ust_app_ht_by_notify_sock
  * ust_app_ht

However, the "apps_notify_thread" needs at least one of those
data structures to be present to iterate on all of the
applications and perform its clean-up.

Hence, you want to make sure that the "apps_notify_thread" can
complete its shutdown before the "apps_thread" starts its own
clean-up. Correct?

I would be okay with reordering the threads' teardown to
provide that guarantee. My only gripe is that it needs to
be documented _extensively_ as it is not obvious at all that
such a dependency exists between those threads.

On the other hand, it seems to me that it would be simpler
to _not_ perform the clean-up you added in the "apps_thread"
and leave that to the sessiond_cleanup(), once all threads
have been torn down.

Not performing the clean-up in the "apps_thread" allows the
clean-up (that you added) to occur in the "apps_notify_thread"
as the data structures (such as ust_app_ht) are still in place.

As a result of the "apps_notify_thread" clean-up, the notify
sockets would eventually be closed by through the call_rcu()'s
during the next grace period. This will then unblock the
applications that are stuck waiting on their notify socket.

Would that solve the problem or am I missing something?

Jérémie

[1] https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-sessiond/main.c#L2005
[2] https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-sessiond/main.c#L1749
[3] https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-sessiond/main.c#L1450
[4] https://github.com/lttng/lttng-tools/blob/master/src/bin/lttng-sessiond/ust-thread.c#L33


> Could we simply not remove it on a ust_app_unregister? And always defer
> to the apps_notify_thread for cleanup?
>
> Update the value in the hash table to -1 and emit a close and remove
> from the hash table if the value is -1?
>
> We could also keep a local list of fd in apps_notify_thread and use it for
> cleanup instead of relying on ust_ust_app_by_notify_sock.
>
> I'm not sure what is the best/elegant solution here. I am not a fan of
> the current solution but it working.
>
> Obviously this commit will be reworded and modified accordingly.
>
> Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>
> ---
>  src/bin/lttng-sessiond/main.c | 55 ++++++++++++++++++++++++-------------------
>  1 file changed, 31 insertions(+), 24 deletions(-)
>
> diff --git a/src/bin/lttng-sessiond/main.c b/src/bin/lttng-sessiond/main.c
> index 4a2a661f..216d0da6 100644
> --- a/src/bin/lttng-sessiond/main.c
> +++ b/src/bin/lttng-sessiond/main.c
> @@ -6209,16 +6209,6 @@ int main(int argc, char **argv)
>         }
>         notification_thread_running = true;
>
> -       /* Create thread to manage application notify socket */
> -       ret = pthread_create(&apps_notify_thread, default_pthread_attr(),
> -                       ust_thread_manage_notify, (void *) NULL);
> -       if (ret) {
> -               errno = ret;
> -               PERROR("pthread_create notify");
> -               retval = -1;
> -               stop_threads();
> -               goto exit_apps_notify;
> -       }
>
>         /* Create thread to manage application socket */
>         ret = pthread_create(&apps_thread, default_pthread_attr(),
> @@ -6231,6 +6221,17 @@ int main(int argc, char **argv)
>                 goto exit_apps;
>         }
>
> +       /* Create thread to manage application notify socket */
> +       ret = pthread_create(&apps_notify_thread, default_pthread_attr(),
> +                       ust_thread_manage_notify, (void *) NULL);
> +       if (ret) {
> +               errno = ret;
> +               PERROR("pthread_create notify");
> +               retval = -1;
> +               stop_threads();
> +               goto exit_apps_notify;
> +       }
> +
>         /* Create thread to dispatch registration */
>         ret = pthread_create(&dispatch_thread, default_pthread_attr(),
>                         thread_dispatch_ust_registration, (void *) NULL);
> @@ -6358,20 +6359,6 @@ exit_reg_apps:
>         }
>
>  exit_dispatch:
> -       /* Instruct the apps thread to quit */
> -       ret = notify_thread_pipe(thread_apps_teardown_trigger_pipe[1]);
> -       if (ret < 0) {
> -               ERR("write error on thread quit pipe");
> -       }
> -
> -       ret = pthread_join(apps_thread, &status);
> -       if (ret) {
> -               errno = ret;
> -               PERROR("pthread_join apps");
> -               retval = -1;
> -       }
> -
> -exit_apps:
>         /* Instruct the apps_notify thread to quit */
>         ret = notify_thread_pipe(thread_apps_notify_teardown_trigger_pipe[1]);
>         if (ret < 0) {
> @@ -6386,6 +6373,26 @@ exit_apps:
>         }
>
>  exit_apps_notify:
> +       /*
> +        * The barrier ensure that all previous resources, notify sockets in
> +        * particular, are freed/closed.
> +        */
> +       rcu_barrier();
> +
> +       /* Instruct the apps thread to quit */
> +       ret = notify_thread_pipe(thread_apps_teardown_trigger_pipe[1]);
> +       if (ret < 0) {
> +               ERR("write error on thread quit pipe");
> +       }
> +
> +       ret = pthread_join(apps_thread, &status);
> +       if (ret) {
> +               errno = ret;
> +               PERROR("pthread_join apps");
> +               retval = -1;
> +       }
> +
> +exit_apps:
>  exit_notification:
>  exit_health:
>  exit_init_data:
> --
> 2.11.0
>
diff mbox series

Patch

diff --git a/src/bin/lttng-sessiond/main.c b/src/bin/lttng-sessiond/main.c
index 4a2a661f..216d0da6 100644
--- a/src/bin/lttng-sessiond/main.c
+++ b/src/bin/lttng-sessiond/main.c
@@ -6209,16 +6209,6 @@  int main(int argc, char **argv)
 	}
 	notification_thread_running = true;
 
-	/* Create thread to manage application notify socket */
-	ret = pthread_create(&apps_notify_thread, default_pthread_attr(),
-			ust_thread_manage_notify, (void *) NULL);
-	if (ret) {
-		errno = ret;
-		PERROR("pthread_create notify");
-		retval = -1;
-		stop_threads();
-		goto exit_apps_notify;
-	}
 
 	/* Create thread to manage application socket */
 	ret = pthread_create(&apps_thread, default_pthread_attr(),
@@ -6231,6 +6221,17 @@  int main(int argc, char **argv)
 		goto exit_apps;
 	}
 
+	/* Create thread to manage application notify socket */
+	ret = pthread_create(&apps_notify_thread, default_pthread_attr(),
+			ust_thread_manage_notify, (void *) NULL);
+	if (ret) {
+		errno = ret;
+		PERROR("pthread_create notify");
+		retval = -1;
+		stop_threads();
+		goto exit_apps_notify;
+	}
+
 	/* Create thread to dispatch registration */
 	ret = pthread_create(&dispatch_thread, default_pthread_attr(),
 			thread_dispatch_ust_registration, (void *) NULL);
@@ -6358,20 +6359,6 @@  exit_reg_apps:
 	}
 
 exit_dispatch:
-	/* Instruct the apps thread to quit */
-	ret = notify_thread_pipe(thread_apps_teardown_trigger_pipe[1]);
-	if (ret < 0) {
-		ERR("write error on thread quit pipe");
-	}
-
-	ret = pthread_join(apps_thread, &status);
-	if (ret) {
-		errno = ret;
-		PERROR("pthread_join apps");
-		retval = -1;
-	}
-
-exit_apps:
 	/* Instruct the apps_notify thread to quit */
 	ret = notify_thread_pipe(thread_apps_notify_teardown_trigger_pipe[1]);
 	if (ret < 0) {
@@ -6386,6 +6373,26 @@  exit_apps:
 	}
 
 exit_apps_notify:
+	/*
+	 * The barrier ensure that all previous resources, notify sockets in
+	 * particular, are freed/closed.
+	 */
+	rcu_barrier();
+
+	/* Instruct the apps thread to quit */
+	ret = notify_thread_pipe(thread_apps_teardown_trigger_pipe[1]);
+	if (ret < 0) {
+		ERR("write error on thread quit pipe");
+	}
+
+	ret = pthread_join(apps_thread, &status);
+	if (ret) {
+		errno = ret;
+		PERROR("pthread_join apps");
+		retval = -1;
+	}
+
+exit_apps:
 exit_notification:
 exit_health:
 exit_init_data: