Discussion:
[CI] DLM -CI integration
Aneesh Kumar K.V
2002-03-19 13:32:03 UTC
Hi,

Today I ran a small test case that acquires a lock, sleeps for
some time, and then unlocks. I ran it simultaneously on both nodes and it
appears to work fine. This makes it clear that the basic functionality of
the DLM (which is distributed locking) is there with the DLM CI-Linux
integration :).
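
To give an idea of the shape of the test, here is a minimal sketch;
dlm_grab_lock()/dlm_drop_lock() and EXCLUSIVE_MODE are stand-ins for the
real DLM lock/unlock calls and lock mode, not the actual API names:

  #include <stdio.h>
  #include <unistd.h>

  /* hypothetical wrappers around the DLM user-space lock API */
  extern int dlm_grab_lock(const char *resname, int mode, void **handle);
  extern int dlm_drop_lock(void *handle);
  #define EXCLUSIVE_MODE 1        /* placeholder for the exclusive mode */

  int main(void)
  {
          void *lk;

          /* both nodes ask for the same resource name */
          if (dlm_grab_lock("test-res-1", EXCLUSIVE_MODE, &lk) != 0) {
                  fprintf(stderr, "lock request failed\n");
                  return 1;
          }
          printf("lock granted, sleeping\n");
          sleep(30);              /* the second node blocks here */

          if (dlm_drop_lock(lk) != 0) {
                  fprintf(stderr, "unlock failed\n");
                  return 1;
          }
          printf("lock released\n");
          return 0;
  }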

Now comes the serious part: nodeup and nodedown. When I reboot the
second node, my node down routine gets called properly, and thereafter it
calls the handle new topology routine as well. The following is part of
the dmesg output.


Node 2 has gone down!!!
Aneesh inside node down
Aneesh before get_upnode_list
[haDLM] Allocated [52/0xc629aca0/Aneesh CMGR topology block]
[haDLM] Allocated [20/0xc62c5500/Aneesh WorkUnit for topology block]
Aneesh before dlm_workqueue_put_work
Aneesh before clms callback
Error 512 on socket read
--------------------------------------------------------------

See this error. Details are below the log.

-----------------------------------------------
cccp_poll_loop termination, CCCP shutdown
Read MQ_DLM_TOP_INFO_MSG(12) from CM
[/proc/haDLM:handle_new_topology] node count 2 this nodeid 1 event
0x1 msg ver 1
1 0.1 16.138.251.48
2 0.0 0.0.0.0
new node list: nnodes 2 low_ver 1 high_ver 1
1 1 16.138.251.48
2 0 0.0.0.0
new node list: nnodes 2 low_ver 1 high_ver 1
1 1 16.138.251.48
2 0 0.0.0.0
[/proc/haDLM:cccp_node_down] id [2] failed [1]
[/proc/haDLM:sched_queue] clms start node down (code=3,node=2)
current node list: nnodes 2 low_ver 1 high_ver 1
1 1 16.138.251.48
2 0 0.0.0.0


The error I indicated above comes from readPacket, which is called
from cccp_poll_loop. I guess the DLM is trying to read some data from
node2, and I guess what is happening here is the difference in the
frequency (FREQ) at which CI-Linux and the DLM do their respective jobs.
Correct me if I am wrong: when node2 reboots, the DLM or the CLMS master
is not notified immediately, so the DLM tries to read a packet, which
obviously fails (because the node is down), and since it failed it goes
further and shuts down CCCP.
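
What I think is happening in the receive path, as a simplified sketch
(the CCCP_SOCK_RECVMSG call is from dlmcccp/cccp_udp.c; the surrounding
loop structure is my guess, not the literal source):

  /* simplified sketch of the cccp_poll_loop receive path */
  for (;;) {
          bytes = CCCP_SOCK_RECVMSG(cccp_our_sock, &msg_header, bufferSize, 0);
          if (bytes < 0) {
                  /* Any negative return -- including the -ERESTARTSYS (512)
                   * produced when a signal interrupts the blocking read --
                   * is treated as "time to shut down", so the loop exits
                   * and "cccp_poll_loop termination, CCCP shutdown" gets
                   * printed. */
                  break;
          }
          /* ... otherwise dispatch the received packet ... */
  }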

Here I have another doubt. Towards the end of the dmesg output there is
something like:


[HSM:hsm_execute_transition] DLM recovery state machine (CLM_ST_TOP ->
CLM_ST_RUN)
[HSM:hsm_process_event] exit
[/proc/haDLM:dlm_recov_cleanup] destroying DLM recovery msg cache (0)
Entering kill_clients
[/proc/haDLM:sched_queue_cleanup] destroying DLM transition cache
free_all no function for now.
Empty clm_delete_heap for now.
DLM Exiting
-----------------------------------------------------------
Note the above message.
----------------------
cccp_msg_delivery_loop termination, CCCP shutdown.
cccp_retransmit_loop termination, CCCP shutdown.


What is it that I am missing? Any guesses?

After this, the state of the DLM on node1 is:
DLM recovery state: CLM_ST_RUN

Now, after rebooting node2 and starting the DLM, the DLM on node2 is in
the state:
DLM recovery state: RC_DIR_INIT

I would be really thankful if someone could help me understand what is
happening inside the DLM. (I guess I am missing some function call, or is
it all due to the node monitoring frequency (FREQ)?)

NOTE: I modified dlmdu to ignore SIGCLUSTER.
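
For reference, ignoring SIGCLUSTER in dlmdu amounts to something along
these lines (a sketch only; SIGCLUSTER itself comes from the CI headers):

  #include <signal.h>
  #include <string.h>

  static void ignore_sigcluster(void)
  {
          struct sigaction sa;

          memset(&sa, 0, sizeof(sa));
          sa.sa_handler = SIG_IGN;        /* drop cluster event signals */
          sigemptyset(&sa.sa_mask);
          sigaction(SIGCLUSTER, &sa, NULL);
  }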

-aneesh
Bruce Walker
2002-03-19 21:50:02 UTC
Post by Aneesh Kumar K.V
Hi,
Today I ran a small test case that acquires a lock, sleeps for
some time, and then unlocks. I ran it simultaneously on both nodes and it
appears to work fine. This makes it clear that the basic functionality of
the DLM (which is distributed locking) is there with the DLM CI-Linux
integration :).
great
Post by Aneesh Kumar K.V
Now comes the serious part: nodeup and nodedown. When I reboot the
second node, my node down routine gets called properly, and thereafter it
calls the handle new topology routine as well. The following is part of
the dmesg output.
Kai and I started to look at this yesterday. It may not be a coincidence
that 512 is the value of ERESTARTSYS, which is generated if a system call
is signalled while it is sleeping. Kai was going to try to determine if
signals could be a factor or whether it is just natural to get the error
because the other node is down. DLM is using UDP, I understand, so it is
less likely that an error would be generated on a nodedown. Maybe DLM
is doing something on purpose to the socket to create the error to get
the reader out of the read?
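
As a minimal illustration of the kernel-side pattern (not the DLM's actual
code): in 2.4 a blocking socket read in a kernel thread comes back with
-ERESTARTSYS when a non-blocked, non-ignored signal is delivered while it
sleeps, so a stray signal would produce exactly the "Error 512" line:

  bytes = sock_recvmsg(sock, &msg, size, 0);
  if (bytes == -ERESTARTSYS && signal_pending(current)) {
          /* a signal, not a network error, woke up the reader */
  }
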
Post by Aneesh Kumar K.V
Node 2 has gone down!!!
Aneesh inside node down
Aneesh before get_upnode_list
[haDLM] Allocated [52/0xc629aca0/Aneesh CMGR topology block]
[haDLM] Allocated [20/0xc62c5500/Aneesh WorkUnit for topology block]
Aneesh before dlm_workqueue_put_work
Aneesh before clms callback
Error 512 on socket read
--------------------------------------------------------------
See this error. Details below the log
-----------------------------------------------
cccp_poll_loop termination, CCCP shutdown
Read MQ_DLM_TOP_INFO_MSG(12) from CM
[/proc/haDLM:handle_new_topology] node count 2 this nodeid 1 event
0x1 msg ver 1
1 0.1 16.138.251.48
2 0.0 0.0.0.0
new node list: nnodes 2 low_ver 1 high_ver 1
1 1 16.138.251.48
2 0 0.0.0.0
new node list: nnodes 2 low_ver 1 high_ver 1
1 1 16.138.251.48
2 0 0.0.0.0
[/proc/haDLM:cccp_node_down] id [2] failed [1]
[/proc/haDLM:sched_queue] clms start node down (code=3,node=2)
current node list: nnodes 2 low_ver 1 high_ver 1
1 1 16.138.251.48
2 0 0.0.0.0
The error I indicated above comes from readPacket, which is called
from cccp_poll_loop. I guess the DLM is trying to read some data from
node2, and I guess what is happening here is the difference in the
frequency (FREQ) at which CI-Linux and the DLM do their respective jobs.
I'm not sure what you mean here w.r.t. FREQ. Hopefully DLM is not doing
nodedown detection and is relying on CI. Otherwise there could be some
conflict.
Post by Aneesh Kumar K.V
Correct me if I am wrong: when node2 reboots, the DLM or the CLMS master
is not notified immediately, so the DLM tries to read a packet, which
obviously fails (because the node is down), and since it failed it goes
further and shuts down CCCP.
I believe we need to understand how the DLM expects to join new nodes.
I wouldn't think they would need a clms_nodeup routine to do it because
the joining node should be sending messages to one or more other nodes, which
can then co-ordinate the joining. Kai is out today but we will look at
it more tomorrow if we don't hear from the DLM folks.
bruce
Kai-Min Sung
2002-03-21 05:59:12 UTC
After some more debugging today I think I've determined the problem.
Basically, there is an incompatibility between CI and applications built
with libpthread. Both attempt to use the real-time signal space (signals
<=32) for their own purposes. libpthread applications register signal
handlers for signals 32, 32, 33 to schedule their threads, and CI uses these
for SIGMIGRATE (32) and SIGCLUSTER (33). During a nodedown/nodeup event
CLMS will send SIGCLUSTERs to all processes/kernel threads. Normally these
signals are ignored, __unless__ there is a signal handler installed. So,
all libpthread applications are going to receive the SIGCLUSTER by default,
unless you explicitly ignore it. Because the dlmdu user daemon is built
with pthreads, it has handlers installed for the SIGCLUSTER (SIGMIGRATE)
signals. When dlmdu loads the DLM kernel modules, these kernel modules and
the kernel threads they spawn also will inherit the same signal handlers.
So, on a nodedown/nodeup event the cccp_poll_thread() kernel thread gets
interrupted by SIGCLUSTER while reading from the UDP socket. The
cccp_poll_thread() doesn't like receiving any signals, so it bombs out
immediately. That's why we're seeing the behaviour Aneesh described.
I've come up with a quick fix to the problem, which is to make
cccp_poll_thread() smarter about incoming signals. Looking at the code, it
seems like the only signal that cccp_poll_thread expects to get is SIGINT,
a signal telling it to die. I think it's safe to say that all other
signals should be ignored and upon receiving one, cccp_poll_thread should
flush the signal (if it isn't SIGINT) and loop back to the socket read.
I've attached a patch file which implements this behaviour. DLM folks,
please take a look at it and let me know what you think.

Thanks,
-Kai
Bruce Walker
2002-03-21 17:47:05 UTC
Post by Kai-Min Sung
After some more debugging today I think I've determined the problem.
Basically, there is an incompatibility between CI and applications built
with libpthread. Both attempt to use the real-time signal space (signals
<=32) for their own purposes. libpthread applications register signal
>=32
handlers for signals 32, 32, 33 to schedule their threads, and CI uses these
33, 34
Post by Kai-Min Sung
for SIGMIGRATE (32) and SIGCLUSTER (33).
We redefined SIGRTMIN in the kernel headers to accommodate SIGMIGRATE
and SIGCLUSTER but glibc and libpthread have their own header.
Post by Kai-Min Sung
During a nodedown/nodeup event
CLMS will send SIGCLUSTERs to all processes/kernel threads. Normally these
signals are ignored, __unless__ there is a signal handler installed. So,
all libpthread applications are going to receive the SIGCLUSTER by default,
unless you explicitly ignore it. Because the dlmdu user daemon is built
with pthreads, it has handlers installed for the SIGCLUSTER (SIGMIGRATE)
signals. When dlmdu loads the DLM kernel modules, these kernel modules and
the kernel threads they spawn also will inherit the same signal handlers.
My understanding is that kernel daemons don't actually have signal handlers
but they are woken up from interruptible sleeps if they aren't ignoring
a signal that is sent to them.
Post by Kai-Min Sung
So, on a nodedown/nodeup event the cccp_poll_thread() kernel thread gets
interrupted by SIGCLUSTER while reading from the UDP socket. The
cccp_poll_thread() doesn't like receiving any signals, so it bombs out
immediately. That's why we're seeing the behaviour Aneesh described.
I've come up with a quick fix to the problem, which is to make
cccp_poll_thread() smarter about incoming signals. Looking at the code, it
seems like the only signal that cccp_poll_thread expects to get is SIGINT,
a signal telling it to die. I think it's safe to say that all other
signals should be ignored and upon receiving one, cccp_poll_thread should
flush the signal (if it isn't SIGINT) and loop back to the socket read.
I've attached a patch file which implements this behaviour. DLM folks,
please take a look at it and let me know what you think.
I suspect the DLM developers had no intention of having the kernel thread
see signals 32,33,34, so another option is to just ignore those signals
early in the kernel thread.
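
In 2.4 the usual idiom for that, at the top of the kernel thread, looks
something like the sketch below (not a patch against the DLM source;
blocking the signals has the same practical effect here as ignoring them,
since the thread's interruptible sleeps are no longer interrupted):

  /* block every signal except SIGINT before entering the receive loop */
  spin_lock_irq(&current->sigmask_lock);
  siginitsetinv(&current->blocked, sigmask(SIGINT));
  recalc_sigpending(current);
  spin_unlock_irq(&current->sigmask_lock);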

Longer term we must either get assigned signals that don't conflict
with anything or eliminate the need for new signals. W.r.t. SIGCLUSTER, I
am leaning toward allowing cluster-aware processes to specify what
signal they want to receive, rather than having a dedicated signal.
How do others feel about that?
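
Purely as a strawman for that discussion, the opt-in could look something
like the sketch below; clms_set_event_signal() is a hypothetical call,
nothing like it exists today:

  /* Hypothetical: a cluster-aware process asks CLMS to deliver cluster
   * transition events on a signal of its own choosing (0 = none),
   * instead of a hard-coded SIGCLUSTER. */
  int clms_set_event_signal(int signo);

  /* e.g. in a cluster-aware daemon's init path: */
  clms_set_event_signal(SIGRTMIN + 4);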

For SIGMIGRATE, we could "migrate" to the notification strategy used
in MOSIX.

bruce
Post by Kai-Min Sung
Thanks,
-Kai
*** old-dlm/source/dlmcccp/cccp_udp.c Tue Oct 9 13:23:31 2001
--- dlm/source/dlmcccp/cccp_udp.c Wed Mar 20 16:18:59 2002
***************
*** 163,171 ****
--- 163,184 ----
  msg_header.msg_controllen = 0;
  msg_header.msg_flags = 0;
  bytes = CCCP_SOCK_RECVMSG( cccp_our_sock, & msg_header, bufferSize, 0 );
  if ( bytes < 0 )
  {
+     /* If we get any signals, other than SIGINT, just flush them and
+      * restart the socket read.
+      */
+     if ( ((-bytes == EINTR) || (-bytes == ERESTARTSYS))
+          && !sigismember(&current->pending.signal, SIGINT)) {
+             unsigned long _flags;
+             spin_lock_irqsave(&current->sigmask_lock, _flags);
+             flush_signals(current);
+             spin_unlock_irqrestore(&current->sigmask_lock, _flags);
+             goto recvmsg;
+     }
+
      /* The poll task gets hit with a signal and cccp_time_to_die gets
       * set when it's time for CCCP to shutdown. In this case, we should
       * NOT log message that the recvmsg call failed. Otherwise, log a