Aneesh Kumar K.V
2002-03-19 13:32:03 UTC
Hi,
Today I ran a small test case that acquires a lock, sleeps for
some time, and then unlocks. I ran it simultaneously on both
nodes, and it appears to work fine. This shows that the basic
functionality of the DLM (which is distributed locking) is there with the
DLM CI-Linux integration :).
Now comes the serious part: node up and node down. When I reboot the
second node, my node-down routine gets called properly, and it then
calls the handle-new-topology routine as well. The following is part
of the dmesg output.
Node 2 has gone down!!!
Aneesh inside node down
Aneesh before get_upnode_list
[haDLM] Allocated [52/0xc629aca0/Aneesh CMGR topology block]
[haDLM] Allocated [20/0xc62c5500/Aneesh WorkUnit for topology block]
Aneesh before dlm_workqueue_put_work
Aneesh before clms callback
Error 512 on socket read
--------------------------------------------------------------
See this error; details below the log.
-----------------------------------------------
cccp_poll_loop termination, CCCP shutdown
Read MQ_DLM_TOP_INFO_MSG(12) from CM
[/proc/haDLM:handle_new_topology] node count 2 this nodeid 1 event
0x1 msg ver 1
1 0.1 16.138.251.48
2 0.0 0.0.0.0
new node list: nnodes 2 low_ver 1 high_ver 1
1 1 16.138.251.48
2 0 0.0.0.0
new node list: nnodes 2 low_ver 1 high_ver 1
1 1 16.138.251.48
2 0 0.0.0.0
[/proc/haDLM:cccp_node_down] id [2] failed [1]
[/proc/haDLM:sched_queue] clms start node down (code=3,node=2)
current node list: nnodes 2 low_ver 1 high_ver 1
1 1 16.138.251.48
2 0 0.0.0.0
The error I indicated above comes from readPacket, which is called
from cccp_poll_loop. I guess the DLM is trying to read some data from
node2. What I think is happening here is a difference in the frequency
at which CI-Linux and the DLM do their respective jobs; correct me if I
am wrong. When node2 reboots, the DLM or CLMS master is not notified
immediately, so the DLM tries to read the packet, which obviously
fails (because the node is down), and since it failed it goes ahead and
shuts down CCCP.
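For reference, error 512 in the Linux kernel is ERESTARTSYS (from include/linux/errno.h): the call was interrupted by a signal and is supposed to be restarted, which would fit the SIGCLUSTER note at the end of this mail. A minimal sketch of classifying such errors as transient (so the poll loop could retry instead of shutting down CCCP); transient_read_error is a hypothetical helper, not an actual DLM function:

```c
/* Kernel-internal "restart" errno values from include/linux/errno.h;
 * they indicate a call interrupted by a signal, not a dead peer. */
#define ERESTARTSYS     512
#define ERESTARTNOINTR  513
#define ERESTARTNOHAND  514

/* Hypothetical helper: returns 1 if the error from a socket read is
 * transient and the read should simply be retried, 0 if it should be
 * treated as a real failure (e.g. the node really is down). */
static int transient_read_error(int err)
{
    return err == ERESTARTSYS ||
           err == ERESTARTNOINTR ||
           err == ERESTARTNOHAND;
}
```

If readPacket returned one of these, tearing down CCCP would be the wrong reaction; only a genuine read failure should trigger the shutdown path.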
I have another doubt here. Towards the end of the dmesg output there is
something like:
[HSM:hsm_execute_transition] DLM recovery state machine (CLM_ST_TOP ->
CLM_ST_RUN)
[HSM:hsm_process_event] exit
[/proc/haDLM:dlm_recov_cleanup] destroying DLM recovery msg cache (0)
Entering kill_clients
[/proc/haDLM:sched_queue_cleanup] destroying DLM transition cache
free_all no function for now.
Empty clm_delete_heap for now.
DLM Exiting
-----------------------------------------------------------
Note the above message.
----------------------
cccp_msg_delivery_loop termination, CCCP shutdown.
cccp_retransmit_loop termination, CCCP shutdown.
What is it that I am missing? Any guesses?
After this, the state of the DLM on node1 is:
DLM recovery state: CLM_ST_RUN
Now, after rebooting node2 and starting the DLM, the DLM on node2 is
in state:
DLM recovery state: RC_DIR_INIT
I would be really thankful if someone could help me understand what is
happening inside the DLM. (Am I missing some function call, or is it
all due to the node-monitoring frequency?)
NOTE: I changed dlmdu to ignore SIGCLUSTER.
-aneesh