Cluster Maintenance¶
Cluster Status¶
There are various commands and log files that you can use to assess the cluster’s health & status.
Pacemaker¶
The controller nodes run pacemaker to share the virtual IP address for the
stack-master-controller.acorn domain. Only one controller node will be the
master-controller at once. Pacemaker also ensures various services like the
load balancer haproxy are running.
To check the status of the Pacemaker cluster, run sudo pcs status on any
controller node. This will display the node with the master-controller virtual
IP and the status of services on each controller node.
To restart the pacemaker service on a controller node, run sudo systemctl
restart pacemaker. Log files for the service can be found at
/var/log/pcsd/ and you can run sudo journalctl -u pacemaker for
additional service information.
MySQL¶
The MySQL database is replicated on each controller node. You can check a
node’s status in the cluster by running sudo mysql to open an SQL shell,
and then run a SHOW STATUS LIKE 'wsrep_local_state%'; query. A state
comment of Synced means the node is synced with the rest of the cluster.
You can check the logs by running sudo journalctl -u mariadb.
Ceph¶
The controller nodes are the monitors & manager for the Ceph storage cluster used by OpenStack.
To check the health of the storage cluster, run ceph health. For more
detailed information, run ceph status. While the cluster is being
re-balanced or repaired, you can run ceph -w to print the status and then
monitor any health-related messages. Run ceph iostat to continuously print
a table of the cluster’s I/O activity.
The storage nodes have a Ceph OSD for each storage drive. To print the layout
of storage nodes & OSDs, run ceph osd tree. For more detailed information
like the used & available space per-OSD, run ceph osd status.
Ceph log files are found in /var/log/ceph/ on the controller & storage
nodes. You can view the ceph services running on each node by running
systemctl. To restart all ceph services on a node, run sudo systemctl
restart ceph.target. You can start a specific OSD by SSHing into the OSDs
storage node and running something like sudo systemctl start
ceph-osd@2.service.
OpenStack¶
You can check the status of openstack services & VMs by using the openstack
command from any controller node. You will need to source one of the credential
files on the node, e.g.: . ~/admin-openrc.sh.
Check the available services using openstack service list. Check the
hypervisor services with openstack compute service list and the networking
services with openstack network agent list. Check the hypervisor stats with
openstack hypervisor stats show and the per-hypervisor resources with
openstack hypervisor list --long.
Source the ~/acorn-openrc.sh credential file to display details of the
acorn project. openstack server list will show all VMs and their status.
Similarly, run openstack volume list for all volumes, image list for
all images, & flavor list for all VM flavors.
See openstack --help for all available commands and flags.
Automated Maintenance¶
There is a Fabric file that can be used to automatically update and upgrade the cluster servers:
fab upgrade
TODO: Fabric command to check & bootstrap inactive galera cluster?
Adding OS Images¶
You can download pre-made images or create your own image for Linux & Windows VMs using a virtualizer like KVM or VirtualBox.
Once you have an image file, you need to convert it to the raw format using
qemu-img:
qemu-img convert -O raw my-src-image.qcow2 my-target-image.raw
Then you can add the image to the cluster via the Dashboard:
Login to the Dashboard using the admin credentials.
Under the
Adminmenu, selectComputeandImages.Click
Create Image, give it a name and description, select your raw image file, and change the format toRaw.Hit
Create Imageto upload the image file.Once complete, you should be able to switch to the
acornproject & launch a VM using your new image.
Adding VM Flavors¶
Flavors let you set the available resources for a VM. You can customize the CPU count, RAM, swap, & OS drive size.
Login to the Dashboard using the admin credentials.
Under the
Adminmenu, selectComputeandFlavors.Hit the
Create Flavorbutton.Name the flavor and specify the resources sizs, then hit
Create Flavor.
Adding / Replacing Storage Drives¶
When a storage node’s OS drive fails, you need to replace the drive, create a new 1 volume RAID array using the Adapatec Configuration boot utility.
When a storage drive fails, you will need to shutdown the node, swap the drive out, & initialize the new JBOD drive:
SSH into a controller node and run
ceph osd set nooutto prevent rebalancing when you take the storage node offline.Run
sudo poweroffon the storage node to shut it off.Swap out the HDD with a replacement drive. We use 3TB SAS drives.
Power the node back on.
During the boot process, you will receive an error from the RAID card stating the drive configuration has changed. Press
Control-ato enter the RAID setup.Select the RAID adapter that controls the new drive.
Select the
Initialize Drivesoption & select the new drive withSpaceand then pressEnterto confirm.Press
Escapeto go back to the menu and selectCreate JBOD.Select the new drive and confirm the JBOD creation.
Exit the menu to boot into the OS.
Once the storage node has booted up, SSH into
stack-controller-1.acorn.Enter the Ceph Deploy directory with
cd ceph-clusterand deploy an OSD to the replacement drive by runningceph-deploy osd create stack-storage-<node-number> --data /dev/<new-drive>.Run
ceph osd unset nooutto enable data rebalancing on drive failure.Login to the Dashboard using the admin credentials.
Ensure you are viewing the
adminproject. Under theIdentitymenu, navigate to theProjectspage.Click the dropdown for the
acornproject and selectModify Quotas.Click the
Volumetab and enter the newTotal Sizeof the cluster, using theSafe Cluster Sizevalue from the Ceph Calculator.Hit the
Savebutton.
Extending a Volume¶
If a VM’s volume is running out of free space, you can extend the volume to increase it’s total size:
SSH into the VM & shut it down.
Login to the Dashboard & navigate to the
Instancespage under theProject > Computemenu.Click the dropdown menu for the VM and click the
Detach Volumeoption and select the desired volume.Navigate to the
Volumespage under theVolumesmenu.Use the dropdown menu for the volume to resize the volume.
Once the volume is resized, re-attach it to the VM and boot up the VM.
SSH into the VM.
Unmount the volume by running
sudo umount <mount-point>orsudo umount /dev/<volume-disk>. Runmountto see a list of all mounted drives on the system.Run
gdisk /dev/<volume-disk>to partition the new drive.We need to delete the existing partition and create an identical one that fills up the entire disk partition. If there is only one partition, you can enter
dto delete it, thennto create a new one and accept the defaults in the following prompts by hitting<Enter>to make it take the entire disk. Then enterwto write the new parition table.Run
sudo partprobeto pickup the partition table changes.Run
sudo ressize2fs /dev/<volume-partition>to extend the filesystem to the new partition size.
Shutting Down¶
To shutdown a cluster:
Shutdown all the VMs using the web UI or with the
openstack server stop <server1> <server2> ...command from a controller node.SSH into the Compute nodes and shut them down by running
sudo poweroff.On a controller node, disable storage rebalancing so we can take the storage cluster offline by running
ceph osd set noout.SSH into each Storage node and shut them down.
On a controller node, put the pacemaker cluster into maintenance mode by running
pcs property set maintenance-mode=true.Shut off the controller nodes in a staggered fashion from node 3 to node 1. E.g., shutdown
stack-controller-3, wait a minute, shutdown 2, wait a minute, shutdown 1.
Starting Up¶
If you ever need to start a stopped cluster:
Start up the Master Controller Node
If this was the last node shutdown, run
sudo galera_new_cluster.If you don’t know which controller shutdown last, check
/var/lib/mysql/grastate.daton each controller forsafe_to_bootstrap: 1. Runsudo galera_new_clusteron this controller.Once the first MySQL server starts, start mysql on the other nodes by running
systemctl start mysql.Now start the Storage nodes. Verify all disks are up by running
ceph osd treeon a controller node. Check the health of the storage cluster by runningceph status.On a controller node, re-enable drive re-balancing by running
ceph osd unset noout.Start the Compute nodes.
Once everything has booted up, you should be able to start the VMs from the dashboard.
Shutting Down¶
If you need to shutdown the cluster(e.g., in case of a power outage), do so in the following order:
VMs
Compute Nodes
Storage Nodes
Backup Controller Nodes
Master Controller Node
Cluster Expansion¶
Adding additional controller, compute, or storage nodes to a cluster is fairly straightforward.
For every node, you should first follow the Node Setup section. Then add
the host to a group in the cluster-servers file & add a config file in
host_vars/ (base it off of the configs for other hosts in that group).
Then run the full ansible playbook:
ansible-playbook acorn.yml
Controller¶
New controllers require some manual configuration due to the high availability setup.
MySQL¶
The new controller should automatically connect to the MySQL cluster. You can verify this by checking the cluster size:
echo "SHOW STATUS LIKE '%cluster_size';" | mysql -u root -p
RabbitMQ¶
The ansible playbook will have copied an erlang cookie to all the controller hosts. Restart the new node in clustering mode:
sudo rabbitmqctl stop_app
sudo rabbitmqctl join_cluster rabbit@stack-controller-1
sudo rabbitmqctl start_app
Pacemaker¶
You’ll need to authenticate the new node from the master controller:
# On stack-controller-1
sudo pcs cluster auth -u hacluster stack-controller-4
Next, remove the default cluster from the new node:
# On stack-controller-4
sudo pcs cluster destroy
Add the new node using the master controller and start the service on the new node:
# On stack-controller-1
sudo pcs cluster node add stack-controller-4
# On stack-controller-4
sudo pcs cluster start
sudo pcs cluster enable
Ceph¶
Copy the SSH key from the master controller to the new controller:
# On stack-controller-1
ssh-copy-id stack-controller-4
Install & deploy Ceph on the new controller node:
# On stack-controller-1
cd ~/storage-cluster
ceph-deploy install --repo-url http://download.ceph.com/debian-luminous stack-controller-4
ceph-deploy admin stack-controller-4
Setup the new controller as a Ceph monitor:
ceph-deploy mon add stack-controller-4
Copy the Glance Key to the new controller node:
# On stack-controller-1
ceph auth get-or-create client.glance | ssh stack-controller-4 sudo tee /etc/ceph/ceph.client.glance.keyring
ssh stack-controller-4 sudo chown glance:glance /etc/ceph/ceph.client.glance.keyring
Extra Deploy Node
Copy the SSH key from each existing controller to the new controller:
ssh-copy-id stack-controller-4
Then initialize a key on the new server & copy it to the existing controller and storage nodes:
ssh-keygen -t ecdsa -b 521
ssh-copy-id stack-controller-1
ssh-copy-id stack-controller-2
ssh-copy-id stack-controller-3
ssh-copy-id stack-compute-1
ssh-copy-id stack-compute-2
ssh-copy-id stack-compute-3
ssh-copy-id stack-storage-1
ssh-copy-id stack-storage-2
ssh-copy-id stack-storage-3
And finally, copy over the ~/ceph-cluster folder from an existing
controller node to a new one:
# On stack-controller-1
rsync -avhz ~/ceph-cluster stack-controller-4:~/
Neutron¶
Add the new controller as a DHCP agent for the private network:
cd ~
. admin-openrc.sh
# Run this & find the ID of the `DHCP agent` on the new controller
openstack network agent list
# Then add the agent as a DHCP server
neutron dhcp-agent-network-add <dhcp-agent-id> private
Compute¶
The ansible playbook should handle all the required setup and OpenStack should pickup the additional compute node afterwards.
You can verify this by running openstack compute service list on a
controller node. The list should include the new compute host.
Storage¶
Follow the Setup & Configuration instructions, then add the hostname to the
storage group in the cluster-servers file and run the ansible playbook.
This will install Ceph and setup Cinder, but you’ll need to manually add the new node and any new storage drives to our Ceph cluster.
Start by pushing the SSH key from the master controller to the new node:
# On stack-controller-1
ssh-copy-id stack-storage-4
Then use ceph-deploy on the master controller to install Ceph on the new
node:
cd ~/storage-cluster
ceph-deploy install --repo-url http://download.ceph.com/debian-luminous stack-storage-4
Note that we use --repo-url here instead of the --release flag, so that
packages are downloaded through HTTP instead of HTTPS, which allows them to be
cached by our web proxy.
Deploy an OSD to each new storage disk. It’s recommended to split the journals out on a separate SSD with a partition for each OSD:
ceph-deploy disk list stack-storage-4
ceph-deploy osd create stack-storage-4:/dev/sdc:/dev/sdb1 stack-storage-4:/dev/sdd:/dev/sdb2
You can monitor the rebalancing progress by running ceph -w on
stack-controller-1.