
Re: Re: Re: Re: Re: Re: shm leak in traced application?

Message ID 401d796b-8f3c-453f-82f3-bf79e01a25d5.zhenyu.ren@aliyun.com
State New

Commit Message

zhenyu.ren March 10, 2022, 4:24 a.m. UTC
Oh, I see. I have an old ust(2.7). So I have no FD_CLOEXEC in ustcomm_recv_fds_unix_sock(). 

Thanks very much!!!
zhenyu.ren
------------------------------------------------------------------
From: zhenyu.ren via lttng-dev <lttng-dev at lists.lttng.org>
Sent: Thursday, March 10, 2022, 11:24
To: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
Cc: lttng-dev <lttng-dev at lists.lttng.org>
Subject: [lttng-dev] Re: Re: Re: Re: Re: shm leak in traced application?

> When this happens, is the process holding a single (or very few) shm file references, or references to many shm files?

It is holding references to "all" of the shm files, not just one or a few.
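
One way to answer that question on a live process is to count the file descriptors whose link target looks like an unlinked shm file. The following is a minimal sketch, assuming Linux with /proc mounted; the function name `count_deleted_shm_fds` is hypothetical and not part of lttng-ust.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the fds of `pid` whose link target looks like an unlinked POSIX
 * shm file, e.g. "/dev/shm/ust-shm-consumer-81132 (deleted)".
 * Returns -1 if /proc/<pid>/fd cannot be opened.
 */
static int count_deleted_shm_fds(pid_t pid)
{
	static const char prefix[] = "/dev/shm/";
	static const char suffix[] = " (deleted)";
	const size_t plen = sizeof(prefix) - 1;
	const size_t slen = sizeof(suffix) - 1;
	char dirpath[64], linkpath[512], target[4096];
	struct dirent *ent;
	int count = 0;
	DIR *dir;

	snprintf(dirpath, sizeof(dirpath), "/proc/%ld/fd", (long) pid);
	dir = opendir(dirpath);
	if (!dir)
		return -1;
	while ((ent = readdir(dir)) != NULL) {
		ssize_t n;

		if (ent->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		snprintf(linkpath, sizeof(linkpath), "%s/%s", dirpath, ent->d_name);
		n = readlink(linkpath, target, sizeof(target) - 1);
		if (n < 0)
			continue;	/* fd was closed while we were scanning */
		target[n] = '\0';
		if ((size_t) n >= plen + slen &&
		    strncmp(target, prefix, plen) == 0 &&
		    strcmp(target + n - slen, suffix) == 0)
			count++;
	}
	closedir(dir);
	return count;
}
```

Calling `count_deleted_shm_fds(pid)` before and after killing lttng-sessiond would show whether the traced application released its references. Reading another user's /proc/<pid>/fd requires sufficient privileges.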

In fact, yesterday I tried the following fix, and it seems to work.

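------------------------------------------------------------------
From: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
Sent: Thursday, March 10, 2022, 00:46
To: zhenyu.ren <zhenyu.ren at aliyun.com>
Cc: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>; lttng-dev <lttng-dev at lists.lttng.org>
Subject: Re: Re: [lttng-dev] Re: Re: Re: shm leak in traced application?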

When this happens, is the process holding a single (or very few) shm file references, or references to many
shm files?

I wonder if you end up in a scenario where an application very frequently performs exec(), and therefore
sometimes the exec() happens in the window between the reception of the unix socket file descriptor and
the call to fcntl() that sets FD_CLOEXEC.

Thanks,

Mathieu

----- On Mar 8, 2022, at 8:29 PM, zhenyu.ren <zhenyu.ren at aliyun.com> wrote:
Thanks a lot for the reply. I did not reply in the bug tracker because I have not yet found a reliable way to reproduce the leak.
------------------------------------------------------------------
From: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
Sent: Tuesday, March 8, 2022, 23:26
To: zhenyu.ren <zhenyu.ren at aliyun.com>
Cc: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>; lttng-dev <lttng-dev at lists.lttng.org>
Subject: Re: [lttng-dev] Re: Re: Re: shm leak in traced application?



----- On Mar 8, 2022, at 12:18 AM, lttng-dev lttng-dev at lists.lttng.org wrote:

> Hi,
> In shm_object_table_append_shm()/alloc_shm(), why not call fcntl() with FD_CLOEXEC
> on the shm fds? I guess this omission leads to the shm fd leak.

Those file descriptors are created when received by ustcomm_recv_fds_unix_sock, and
immediately after creation they are set as FD_CLOEXEC.

We should continue this discussion in the bug tracker as suggested by Jonathan.
It would greatly help if you can provide a small reproducer.

Thanks,

Mathieu
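
For reference, marking a descriptor close-on-exec after the fact is usually done with a read-modify-write of the descriptor flags. The sketch below is a generic illustration, not the lttng-ust implementation; `set_cloexec` and `is_cloexec` are hypothetical names.

```c
#include <fcntl.h>

/*
 * Set FD_CLOEXEC on fd while preserving any other descriptor flags.
 * Returns 0 on success, -1 on error (errno is set by fcntl()).
 */
static int set_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	if (flags & FD_CLOEXEC)
		return 0;	/* already set, nothing to do */
	return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Returns 1 if FD_CLOEXEC is set, 0 if not, -1 on error. */
static int is_cloexec(int fd)
{
	int flags = fcntl(fd, F_GETFD);

	if (flags < 0)
		return -1;
	return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

A descriptor handled this way is closed automatically in the new program image after exec(), so a child that calls exec() does not keep the shm file referenced.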


> Thanks
> zhenyu.ren

>> ------------------------------------------------------------------
>> From: Jonathan Rajotte-Julien <jonathan.rajotte-julien at efficios.com>
>> Sent: Friday, February 25, 2022, 22:31
>> To: zhenyu.ren <zhenyu.ren at aliyun.com>
>> Cc: lttng-dev <lttng-dev at lists.lttng.org>
>> Subject: Re: [lttng-dev] Re: Re: shm leak in traced application?

>> Hi zhenyu.ren,

>> Please open a bug on our bug tracker and provide a reproducer against the latest
>> stable version (2.13.x).

>> https://bugs.lttng.org/

>> Please follow the guidelines: https://bugs.lttng.org/#Bug-reporting-guidelines

>> Cheers

>> On Fri, Feb 25, 2022 at 12:47:34PM +0800, zhenyu.ren via lttng-dev wrote:
>> > Hi, lttng-dev team
>> > When lttng-sessiond exits, the ust applications should call
>> > lttng_ust_objd_table_owner_cleanup() and clean up all shm resources (unmap and
>> > close). However, I do find that the ust applications keep "all" of the
>> > shm fds ("/dev/shm/ust-shm-consumer-81132 (deleted)") open and do NOT free the shm.
>> > If we run lttng-sessiond again, the ust applications get a new piece of shm and
>> > a new list of shm fds, so shm usage doubles. Then, if we kill lttng-sessiond,
>> > what most likely happens is that the ust applications close the new list of shm fds
>> > and free the new shm resources but still keep the old shm. In other words, we cannot
>> > free this piece of shm unless we kill the ust applications!!!
>> > So is it possible that the ust applications failed to call
>> > lttng_ust_objd_table_owner_cleanup()? Have you ever seen this problem? Do you
>> > have any advice on freeing the shm without killing the ust applications (I tried to
>> > dig into kernel shm_open and /dev/shm, but did not find any ideas)?

>> > Thanks in advance
>> > zhenyu.ren



>> > ------------------------------------------------------------------
>> > From: zhenyu.ren via lttng-dev <lttng-dev at lists.lttng.org>
>> > Sent: Wednesday, February 23, 2022, 23:09
>> > To: lttng-dev <lttng-dev at lists.lttng.org>
>> > Subject: [lttng-dev] Re: shm leak in traced application?

>> > > "I found these items also exist in a traced application which is a long-time
>> > > running daemon"
>> > Even if lttng-sessiond has been killed!!

>> > Thanks
>> > zhenyu.ren
>> > ------------------------------------------------------------------
>> > From: zhenyu.ren via lttng-dev <lttng-dev at lists.lttng.org>
>> > Sent: Wednesday, February 23, 2022, 22:44
>> > To: lttng-dev <lttng-dev at lists.lttng.org>
>> > Subject: [lttng-dev] shm leak in traced application?

>> > Hi,
>> > There are many items such as "/dev/shm/ust-shm-consumer-81132 (deleted)"
>> > in the lttng-sessiond fd space. I know they are the result of shm_open() and
>> > shm_unlink() in create_posix_shm().
>> > However, today, I found these items also exist in a traced application which is
>> > a long-time running daemon. The most important thing I found is that there
>> > seems to be no reliable way to release the shared memory.
>> > I tried killing lttng-sessiond, but that does not always release the shared memory. Sometimes I
>> > need to kill the traced application to free the shared memory... But it is not a
>> > good idea to kill these applications.
>> > My questions are:
>> > 1. Is there any way to release the shared memory without killing any traced
>> > application?
>> > 2. Is it normal that many items such as "/dev/shm/ust-shm-consumer-81132
>> > (deleted)" exist in the traced application?

>> > Thanks
>> > zhenyu.ren



>> > _______________________________________________
>> > lttng-dev mailing list
>> > lttng-dev at lists.lttng.org
>> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>> --
>> Jonathan Rajotte-Julien
>> EfficiOS
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
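
The original report attributes the "(deleted)" entries to shm_open() followed by shm_unlink(). A minimal sketch of that pattern, assuming Linux with /dev/shm and /proc available (older glibc versions may need -lrt at link time). The object name "/demo-shm-<pid>" and both helper names are made up for this illustration and are not taken from lttng-ust.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a POSIX shm object, unlink its name right away, and return the
 * still-open fd. The memory stays allocated until every fd and mapping
 * that refers to it is gone.
 */
static int open_unlinked_shm(size_t size)
{
	char name[64];
	int fd;

	snprintf(name, sizeof(name), "/demo-shm-%ld", (long) getpid());
	fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
	if (fd < 0)
		return -1;
	(void) shm_unlink(name);	/* the name disappears, the object does not */
	if (ftruncate(fd, (off_t) size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Copy the link target of /proc/self/fd/<fd> into buf (Linux only). */
static int fd_link_target(int fd, char *buf, size_t len)
{
	char path[64];
	ssize_t n;

	if (len == 0)
		return -1;
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	n = readlink(path, buf, len - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	return 0;
}
```

For a descriptor returned by `open_unlinked_shm()`, the link target reported by `fd_link_target()` ends in " (deleted)", which is what ls -l /proc/<pid>/fd shows for the descriptors discussed in this thread. Only closing the descriptor (and unmapping any mapping) releases the memory.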

Comments

Mathieu Desnoyers March 10, 2022, 2:31 p.m. UTC | #1
Hi Zhenyu, 

This is exactly why Jonathan and I asked you to file a bug report on the bug tracker
and follow the bug reporting guidelines (https://lttng.org/community/#bug-reporting-guidelines).

This saves time for everyone. 

Thanks, 

Mathieu 

----- On Mar 9, 2022, at 11:24 PM, zhenyu.ren <zhenyu.ren at aliyun.com> wrote: 

> Oh, I see. I have an old ust(2.7). So I have no FD_CLOEXEC in
> ustcomm_recv_fds_unix_sock().

> Thanks very much!!!
> zhenyu.ren

>> ------------------------------------------------------------------
>> From: zhenyu.ren via lttng-dev <lttng-dev at lists.lttng.org>
>> Sent: Thursday, March 10, 2022, 11:24
>> To: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
>> Cc: lttng-dev <lttng-dev at lists.lttng.org>
>> Subject: [lttng-dev] Re: Re: Re: Re: Re: shm leak in traced application?

>> > When this happens, is the process holding a single (or very few) shm file
>> > references, or references to many shm files?

>> It is holding references to "all" of the shm files, not just one or a few.

>> In fact, yesterday I tried the following fix, and it seems to work.

>> --- a/lttng-ust/libringbuffer/shm.c
>> +++ b/lttng-ust/libringbuffer/shm.c
>> @@ -32,7 +32,6 @@
>>  #include <lttng/align.h>
>>  #include <limits.h>
>>  #include <helper.h>
>> -
>>  /*
>>   * Ensure we have the required amount of space available by writing 0
>>   * into the entire buffer. Not doing so can trigger SIGBUS when going
>> @@ -122,6 +121,12 @@ struct shm_object *_shm_object_table_alloc_shm(struct shm_object_table *table,
>>         /* create shm */
>>
>>         shmfd = stream_fd;
>> +       if (shmfd >= 0) {
>> +               ret = fcntl(shmfd, F_SETFD, FD_CLOEXEC);
>> +               if (ret < 0) {
>> +                       PERROR("fcntl shmfd FD_CLOEXEC");
>> +               }
>> +       }
>>         ret = zero_file(shmfd, memory_map_size);
>>         if (ret) {
>>                 PERROR("zero_file");
>> @@ -272,15 +277,22 @@ struct shm_object *shm_object_table_append_shm(struct shm_object_table *table,
>>         obj->shm_fd = shm_fd;
>>         obj->shm_fd_ownership = 1;
>>
>> +       if (shm_fd >= 0) {
>> +               ret = fcntl(shm_fd, F_SETFD, FD_CLOEXEC);
>> +               if (ret < 0) {
>> +                       PERROR("fcntl shmfd FD_CLOEXEC");
>> +                       //goto error_fcntl;
>> +               }
>> +       }
>>         ret = fcntl(obj->wait_fd[1], F_SETFD, FD_CLOEXEC);
>>         if (ret < 0) {

>> As shown, wait_fd[1] is set FD_CLOEXEC by fcntl(), but shm_fd is not. Why
>> does your patch handle wait_fd but not shm_fd? As far as I know, wait_fd is just a
>> pipe and does not seem related to the shm resource.

>> ------------------------------------------------------------------
>> From: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
>> Sent: Thursday, March 10, 2022, 00:46
>> To: zhenyu.ren <zhenyu.ren at aliyun.com>
>> Cc: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>; lttng-dev
>> <lttng-dev at lists.lttng.org>
>> Subject: Re: Re: [lttng-dev] Re: Re: Re: shm leak in traced application?

>> When this happens, is the process holding a single (or very few) shm file
>> references, or references to many shm files?

>> I wonder if you end up in a scenario where an application very frequently
>> performs exec(), and therefore sometimes the exec() happens in the window
>> between the reception of the unix socket file descriptor and the call to
>> fcntl() that sets FD_CLOEXEC.

>> Thanks,

>> Mathieu

>> ----- On Mar 8, 2022, at 8:29 PM, zhenyu.ren <zhenyu.ren at aliyun.com> wrote:
>> Thanks a lot for the reply. I did not reply in the bug tracker because I have not yet
>> found a reliable way to reproduce the leak.
>> ------------------------------------------------------------------
>> From: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
>> Sent: Tuesday, March 8, 2022, 23:26
>> To: zhenyu.ren <zhenyu.ren at aliyun.com>
>> Cc: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>; lttng-dev
>> <lttng-dev at lists.lttng.org>
>> Subject: Re: [lttng-dev] Re: Re: Re: shm leak in traced application?

>> ----- On Mar 8, 2022, at 12:18 AM, lttng-dev lttng-dev at lists.lttng.org wrote:

>> > Hi,
>> > In shm_object_table_append_shm()/alloc_shm(), why not call fcntl() with FD_CLOEXEC
>> > on the shm fds? I guess this omission leads to the shm fd leak.

>> Those file descriptors are created when received by ustcomm_recv_fds_unix_sock,
>> and immediately after creation they are set as FD_CLOEXEC.

>> We should continue this discussion in the bug tracker as suggested by Jonathan.
>> It would greatly help if you can provide a small reproducer.

>> Thanks,

>> Mathieu

>> > Thanks
>> > zhenyu.ren

>> >> ------------------------------------------------------------------
>> >> From: Jonathan Rajotte-Julien <jonathan.rajotte-julien at efficios.com>
>> >> Sent: Friday, February 25, 2022, 22:31
>> >> To: zhenyu.ren <zhenyu.ren at aliyun.com>
>> >> Cc: lttng-dev <lttng-dev at lists.lttng.org>
>> >> Subject: Re: [lttng-dev] Re: Re: shm leak in traced application?

>> >> Hi zhenyu.ren,

>> >> Please open a bug on our bug tracker and provide a reproducer against the latest
>> >> stable version (2.13.x).

>> >> https://bugs.lttng.org/

>> >> Please follow the guidelines: https://bugs.lttng.org/#Bug-reporting-guidelines

>> >> Cheers

>> >> On Fri, Feb 25, 2022 at 12:47:34PM +0800, zhenyu.ren via lttng-dev wrote:
>> >> > Hi, lttng-dev team
>> >> > When lttng-sessiond exits, the ust applications should call
>> >> > lttng_ust_objd_table_owner_cleanup() and clean up all shm resources (unmap and
>> >> > close). However, I do find that the ust applications keep "all" of the
>> >> > shm fds ("/dev/shm/ust-shm-consumer-81132 (deleted)") open and do NOT free the shm.
>> >> > If we run lttng-sessiond again, the ust applications get a new piece of shm and
>> >> > a new list of shm fds, so shm usage doubles. Then, if we kill lttng-sessiond,
>> >> > what most likely happens is that the ust applications close the new list of shm fds
>> >> > and free the new shm resources but still keep the old shm. In other words, we cannot
>> >> > free this piece of shm unless we kill the ust applications!!!
>> >> > So is it possible that the ust applications failed to call
>> >> > lttng_ust_objd_table_owner_cleanup()? Have you ever seen this problem? Do you
>> >> > have any advice on freeing the shm without killing the ust applications (I tried to
>> >> > dig into kernel shm_open and /dev/shm, but did not find any ideas)?

>> >> > Thanks in advance
>> >> > zhenyu.ren

>> >> > ------------------------------------------------------------------
>> >> > From: zhenyu.ren via lttng-dev <lttng-dev at lists.lttng.org>
>> >> > Sent: Wednesday, February 23, 2022, 23:09
>> >> > To: lttng-dev <lttng-dev at lists.lttng.org>
>> >> > Subject: [lttng-dev] Re: shm leak in traced application?

>> >> > > "I found these items also exist in a traced application which is a long-time
>> >> > > running daemon"
>> >> > Even if lttng-sessiond has been killed!!

>> >> > Thanks
>> >> > zhenyu.ren
>> >> > ------------------------------------------------------------------
>> >> > From: zhenyu.ren via lttng-dev <lttng-dev at lists.lttng.org>
>> >> > Sent: Wednesday, February 23, 2022, 22:44
>> >> > To: lttng-dev <lttng-dev at lists.lttng.org>
>> >> > Subject: [lttng-dev] shm leak in traced application?

>> >> > Hi,
>> >> > There are many items such as "/dev/shm/ust-shm-consumer-81132 (deleted)"
>> >> > in the lttng-sessiond fd space. I know they are the result of shm_open() and
>> >> > shm_unlink() in create_posix_shm().
>> >> > However, today, I found these items also exist in a traced application which is
>> >> > a long-time running daemon. The most important thing I found is that there
>> >> > seems to be no reliable way to release the shared memory.
>> >> > I tried killing lttng-sessiond, but that does not always release the shared memory. Sometimes I
>> >> > need to kill the traced application to free the shared memory... But it is not a
>> >> > good idea to kill these applications.
>> >> > My questions are:
>> >> > 1. Is there any way to release the shared memory without killing any traced
>> >> > application?
>> >> > 2. Is it normal that many items such as "/dev/shm/ust-shm-consumer-81132
>> >> > (deleted)" exist in the traced application?

>> >> > Thanks
>> >> > zhenyu.ren

>> >> > _______________________________________________
>> >> > lttng-dev mailing list
>> >> > lttng-dev at lists.lttng.org
>> >> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

>> >> --
>> >> Jonathan Rajotte-Julien
>> >> EfficiOS
>> > _______________________________________________
>> > lttng-dev mailing list
>> > lttng-dev at lists.lttng.org
>> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>> --
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> http://www.efficios.com

Patch

--- a/lttng-ust/libringbuffer/shm.c
+++ b/lttng-ust/libringbuffer/shm.c
@@ -32,7 +32,6 @@ 
 #include <lttng/align.h>
 #include <limits.h>
 #include <helper.h>
-
 /*
  * Ensure we have the required amount of space available by writing 0
  * into the entire buffer. Not doing so can trigger SIGBUS when going
@@ -122,6 +121,12 @@  struct shm_object *_shm_object_table_alloc_shm(struct shm_object_table *table,
        /* create shm */

        shmfd = stream_fd;
+       if (shmfd >= 0) {
+               ret = fcntl(shmfd, F_SETFD, FD_CLOEXEC);
+               if (ret < 0) {
+                       PERROR("fcntl shmfd FD_CLOEXEC");
+               }
+       }
        ret = zero_file(shmfd, memory_map_size);
        if (ret) {
                PERROR("zero_file");
@@ -272,15 +277,22 @@  struct shm_object *shm_object_table_append_shm(struct shm_object_table *table,
        obj->shm_fd = shm_fd;
        obj->shm_fd_ownership = 1;

+       if (shm_fd >= 0) {
+               ret = fcntl(shm_fd, F_SETFD, FD_CLOEXEC);
+               if (ret < 0) {
+                       PERROR("fcntl shmfd FD_CLOEXEC");
+                       //goto error_fcntl;
+               }
+       }
        ret = fcntl(obj->wait_fd[1], F_SETFD, FD_CLOEXEC);
        if (ret < 0) {

    As shown, wait_fd[1] is set FD_CLOEXEC by fcntl(), but shm_fd is not. Why does your patch handle wait_fd but not shm_fd? As far as I know, wait_fd is just a pipe and does not seem related to the shm resource.

------------------------------------------------------------------
From: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>