Discussion:
[CI] Re: [SSI] Converting a non-shared root SSI cluster to use CFS
David B. Zafman
2002-06-10 16:00:13 UTC
Below is something I wrote up last week, but was waiting for Bruce to
comment on it before sending it out. Now that I see what you've done
with /etc/init.d/clusterinit, I thought I'd send this out. I will
examine what you've done today.

Last week I did something similar. I wasn't as concerned with the
dependent-node networking, but I wanted to replace rc.sysinit for
dependent nodes only. I copied the Red Hat rc.sysinit to
rc.sysinit.nodeup and removed everything that the dependent nodes
should not be duplicating. I also removed the execution of rc for run
level 3 from rc.nodeup. Keep in mind that only the first booting node
runs rc.sysinit, just as in base Linux. Since only dependent nodes run
rc.nodeup, only the dependent nodes run rc.sysinit.nodeup.
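
For illustration, a minimal sketch of what the tail end of rc.nodeup
might do (the actual layout of rc.nodeup isn't shown in this thread, so
the paths here are assumptions):

# Dependent (joining) nodes run the trimmed-down sysinit script only;
# the first booting node already ran the full rc.sysinit.
if [ -x /etc/rc.d/rc.sysinit.nodeup ]; then
        /etc/rc.d/rc.sysinit.nodeup
fi
# Note: the full runlevel 3 processing (rc 3) is deliberately not run here.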

---------

You've brought up an important architectural issue. Once there is a
single root, running duplicate services requires clusterization. One
way to clusterize things is to add context-dependent links (e.g.
/var/run, as you proposed for the *.pid files).

The current set-up of having rc.nodeup call rc.sysinit and then run
complete runlevel 3 processing was fine when we had a non-shared
root. Now with CFS and GFS we really need to NOT do this. Looking at
rc.sysinit on a Red Hat install, I see that it does all sorts of things
which should NOT be done again on a joining node in the shared-root case.

In a cluster there are generally two kinds of services. The first
kind is a single instance of the service (a single process or set of
processes on one node) running under keepalive, which restarts it on
node failure. The second kind is a cluster-aware service, whose
processes can exist on multiple nodes but cooperate with each
other. In NonStop Clusters we parallelized inetd, for example: it
maintained processes on all nodes and kept a list of pids which it
updated as nodes came and went.

The whole /var/run/service_name.pid mechanism I would propose is only
used for non-cluster-aware services which are restricted to running on
the root node but may be restarted on node failure. It is assumed that
to restart the service we might have to remove the .pid file first; on
(re)start the service would then create the file again with the new pid.
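
As a rough sketch of that restart path (the service name and paths here
are placeholders, not an actual SSI script):

# keepalive-style restart of a non-cluster-aware service after the node
# running it has failed:
SERVICE=somed
PIDFILE=/var/run/$SERVICE.pid
# Remove the stale pid file left behind by the failed node ...
rm -f $PIDFILE
# ... then (re)start the service; its init script writes a fresh pid file.
/etc/init.d/$SERVICE start
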
Hi,
I guess we need to have a node-specific /var/run directory as well.
Otherwise on Debian some services may not come up on node 2. They check
the /var/run/service_name.pid file to see whether the service is already
running or not.
To do that on Debian, /etc/init.d/rcS would add these lines before the
for loop shown below:
#
# Cluster specific remounts: give each node its own view of
# /etc/network and /var/run, keyed by its cluster node number.
#
mount --bind /etc/network-`/usr/sbin/clusternode_num` /etc/network
mount --bind /run-`/usr/sbin/clusternode_num` /var/run
#
# Call all parts in order.
#
for i in /etc/rcS.d/S??*
-aneesh
_______________________________________________
ssic-linux-devel mailing list
https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel
--
David B. Zafman | Hewlett-Packard Company
Linux Kernel Developer | Open SSI Clustering Project
mailto:***@hp.com | http://www.hp.com
"Thus spake the master programmer: When you have learned to snatch
the error code from the trap frame, it will be time for you to leave."
Greg Freemyer
2002-06-10 18:06:03 UTC
David,

If you guys have any political clout, you might want to support Robin
Holt's proposal to Red Hat for restructuring rc.sysinit.

I just quoted his/her summary e-mail in a response to Aneesh. The basic
idea is to restructure rc.sysinit in a way similar to how the rc*.d
directories are done.

If it were in place, it would be relatively easy to patch it to support
an overall cluster sysinit.d that gets invoked only on the first booting
node, plus a separate sysinit.nodeup.d containing scripts that get
invoked on each node that comes up.
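
A rough sketch of how that might look (the directory names and the
first-node test are my assumptions, not part of Robin's proposal):

# Whole-cluster initialization fragments run only on the first booting node.
if [ "$FIRST_BOOTING_NODE" = "yes" ]; then
        for i in /etc/sysinit.d/S??* ; do
                [ -x "$i" ] && "$i"
        done
fi
# Per-node initialization fragments run on every node that comes up.
for i in /etc/sysinit.nodeup.d/S??* ; do
        [ -x "$i" ] && "$i"
done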

Greg Freemyer
Internet Engineer
Deployment and Integration Specialist
Compaq ASE - Tru64
Compaq Master ASE - SAN Architect
The Norcross Group
www.NorcrossGroup.com
Aneesh Kumar K.V
2002-06-11 04:05:03 UTC
Hi,

OK, that sounds like a good idea. For the cluster we will provide an
upgrade to the initscripts package on Red Hat and the sysvinit package
on Debian (perhaps initscripts-cluster and sysvinit-cluster). We can
also ask the user to install this as part of creating the cluster.

Regarding /var/run/service_name.pid: the general logic used by many of
the xinetd-style services is

to check whether service_name.pid exists; if it does, read the PID and
do kill(pid, 0) to see whether the service is really running. If it is,
exit; if not, recreate the service_name.pid file with the new pid.
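
In shell terms that logic is roughly the following (file and daemon
names are placeholders):

PIDFILE=/var/run/service_name.pid
if [ -f $PIDFILE ]; then
        pid=`cat $PIDFILE`
        # kill -0 sends no signal; it only checks that the pid exists.
        if kill -0 $pid 2>/dev/null; then
                exit 0          # service already running
        fi
fi
# Not running (or stale pid file): start the daemon and record the new pid.
/usr/sbin/some_daemon &
echo $! > $PIDFILE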

In our case, with cluster-wide signaling we will be able to signal any
application running on other nodes. Now if I start a service that is
multi-instance (that is, running on all the nodes, say a load-balanced
web server), we need to make sure that the server on node 1 and the
server on node 2 read different service_name.pid files, or else change
the logic explained above (which I guess is going to be a tough job).

I guess we need to make sure that each node sees a different /var/run/
directory, so that we can start these servers on all the nodes at the
same time without modifying any server code.

-aneesh
John Hughes
2002-06-11 07:14:04 UTC
Post by Aneesh Kumar K.V
I guess we need to make sure that each node sees a different /var/run/
directory, so that we can start these servers on all the nodes at the
same time without modifying any server code.
And how do we deal with servers that need to run only one copy on
the cluster?
Aneesh Kumar K.V
2002-06-11 07:18:02 UTC
Hi,

In that case the service will be started by its own script, not by the
run levels, I guess. That means we won't have an
/etc/rc3.d/Sxxserver_name file.
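
For example, such a script might guard itself with the node number
(assuming here that the single instance should run on node 1; the
service name is a placeholder, and clusternode_num is the same utility
used for the bind mounts above):

#!/bin/sh
# Start exactly one copy of the service cluster-wide, on node 1 only.
if [ "`/usr/sbin/clusternode_num`" = "1" ]; then
        /etc/init.d/some_service start
fi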

-aneesh
Post by John Hughes
Post by Aneesh Kumar K.V
I guess we need to make sure that each node sees a different /var/run/
directory, so that we can start these servers on all the nodes at the
same time without modifying any server code.
And how do we deal with servers that need to run only one copy on
the cluster?