IPL1A RC35 Mechid Artifacts


17. Pre-Maintenance Check, Precautions and Preparations

Tenant Impact: No

NOTE: All terminal input/output must be logged during the change.

** Important ** : Execute MOP-355 Update aiclcm packages

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-355-
Update_aic_lcm_packages.docx?
d=w787973068da2491288f93d6b4a0ea17e&csf=1&web=1&e=4Z0SyR

17.a.1 Add OPSFIX repository


On the OpsSimple node, make a backup of sources.list:
sudo cp -p /etc/apt/sources.list /var/tmp/sources.list

sudo chmod 600 /var/tmp/sources.list

If it is a lab environment, add these repos to /etc/apt/sources.list (for 3.0.3):


deb http://mirrors-aic.it.att.com/opssimple/opsfix3.0.3-dev/ trusty main

deb http://mirrors-aic.it.att.com/opssimple/ops3.0.3/ trusty main

If it is a Prod environment, add these repos to /etc/apt/sources.list (for 3.0.3):


deb http://mirrors.it.att.com/opssimple/opsfix3.0.3/ trusty main

deb http://mirrors.it.att.com/opssimple/ops3.0.3/ trusty main

If it is a lab environment, add this repo to /etc/apt/sources.list (for 3.0.2):


deb http://mirrors-aic.it.att.com/opssimple/opsfix3.0.2-dev/ trusty main

If it is a Prod environment, add this repo to /etc/apt/sources.list (for 3.0.2):


deb http://mirrors.it.att.com/opssimple/opsfix3.0.2/ trusty main

Run the repository metadata update command:


$ sudo apt-get update

17.a.2 Pre-Maintenance Check Tools/System

 Install/Update aic-lcm packages:

$ ssh <opscvm>

$ sudo apt-get install aic-lcm aic-opssimple-plugins-aiclcm


$ [ $(apt-cache policy aic-opssimple-plugins-aiclcm | grep Candidate: | egrep -o "[0-9]+$") -lt 40023 ] && \
  sudo apt-get install -y aic-lcm=3.0.3-40023 aic-opssimple-plugins-aiclcm=40023

 Install ldap-utils package:

$ ssh <opscvm>

$ sudo apt-get install -y ldap-utils


 Install python3-openssl package:

$ ssh <opscvm>

$ sudo apt-get install -y python3-openssl

 Update OPSFIX-0064. To do that on OpsSimple node run:


 $ ssh <opscvm>

 $ sudo apt-cache search aic-opsfix-cmc-0064

If output contains information about the package, run:


$ sudo apt-get install aic-opsfix-cmc-0064
Otherwise, download and install it manually using the steps below:
$ pck=$(curl -s http://mirrors-aic.it.att.com/aic-mos/review/40972/opssimple/pool/main/a/aic-opsfix-cmc-0064/ | egrep -o "aic-opsfix-cmc-0064.[0-9]*.deb" | sort | tail -1)

$ # Download the package, for example with curl (wget works too):
$ curl -O http://mirrors-aic.it.att.com/aic-mos/review/40972/opssimple/pool/main/a/aic-opsfix-cmc-0064/$pck

$ sudo dpkg -i $pck

$ rm $pck

Check installed package (it should output a non-empty list of files):


$ sudo dpkg -L aic-opsfix-cmc-0064

Run the next command to propagate the contrail check script to the contrail controllers:


$ cd ~/aic

$ ansible-playbook -i inventory/ roles/aic_opsfix_0064/playbooks/aic_opsfix_deploy_0064.yml


Copy diff_struct.py script to fuel node:
$ cd ~/aic

$ ansible -i inventory/ fuel_host -m copy -a "src=~/rotatemechid/diff_struct.py dest=/tmp/"

 Fuel status

Check that the environment status is operational and that there are no error nodes and no offline nodes.

On OpsSimple node run:


ssh <fuelvm>

fuel env
Check that status is operational
ssh <fuelvm>

fuel nodes | grep -v ready

fuel nodes | tail -n +3 | awk -F"|" '{ if ($9 !=1) print $1 "Offline" $3 $9}'

 LCM status

This step verifies that the LCM infrastructure is in place and operational. If the LCM infrastructure is
not operational, it will not be able to deploy the changes properly. The Puppet code required to
deploy the OpenStack packages needs to be downloaded and assigned to the nodes. This is done by
R10K, which downloads the Puppet code and installs it into the Puppet Master. The "environment"
column in Foreman indicates what version of the Puppet code a particular node is running. The "last
report" column indicates when the Puppet code was applied for the last time.

Verify that the nodes are green; if they are not in sync, review the node reports to assess severity. Based
on severity, raise an itrack ticket (you need to assign it to TigerTeam members)
- https://itrack.web.att.com/secure/Dashboard.jspa?selectPageId=19152

Attach node report to the ticket

In the hosts tab, check that all the nodes except the non-openstack nodes are pointing at the
previous release_candidate. The only nodes which can be pointing at production are the non-
openstack nodes.
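A minimal sketch for spot-checking this from the command line via the Foreman API (it assumes the admin credentials described later in the Preliminary Implementation section; the exact field names can vary by Foreman version, so treat it as a convenience check, not a replacement for the UI review):

export FOREMAN_USER=admin
export FOREMAN_PASSWORD=<foreman_admin_password>
export FOREMAN_URL=https://<foreman_url>
curl -s -k -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} "${FOREMAN_URL}/api/v2/hosts?per_page=1000" | \
python -c 'import json,sys; [sys.stdout.write("%s %s %s\n" % (h.get("name"), h.get("environment_name"), h.get("last_report"))) for h in json.load(sys.stdin)["results"]]'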

Results/Description:
 UMASK and GSTools

GSTools installation changes the UMASK on the LCM nodes and causes quite a lot of interference
with R10K. Notice: you need to forward your SSH key when logging in to Fuel to be able to log in to the LCM
nodes.
Login to fuelvm as attuid:

ssh <fuelvm> -A

Run this to get the LCM node IP/hostname:


sudo fuel node | grep lcm

Login to lcm node as attuid:

ssh <lcm_vm>

cd /etc/puppet/environments

pwd

sudo chmod -R ugo+rx /etc/puppet/environments

Update automation metadata

 Verify that test_mechids.py and update_mechids.py are available on OpsSimple VM


(see /home/m96722/rotatemechid/ folder).
 You need to prepare newmechids.yaml . For that purpose you can use the
template new_mechids.yaml.templ in the folder /home/m96722/rotatemechid/ (see the sketch after the commands below)
 Populate it on the OpsSimple node, or transfer a prepopulated newmechids.yaml into the
OpsSimple VM if needed:
$ scp newmechids.yaml m96722@<opscvm>:/home/m96722/rotatemechid/newmechids.yaml

$ sudo chown m96722:mechid newmechids.yaml
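A minimal sketch of preparing newmechids.yaml in place from the template (assuming the template and folder named above; the fields to fill in come from the template itself):

$ ssh <opscvm>
$ cd /home/m96722/rotatemechid/
$ cp new_mechids.yaml.templ newmechids.yaml
$ vi newmechids.yaml            # fill in the new MechIDs/passwords
$ chmod 600 newmechids.yaml     # keep the credentials readable only by the owner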

 Test the new mechIDs with test_mechids.py . The new set of MechIDs will be tested against
both LDAPs or either of them: ITServices and ATTTest. Use the parameter
--medium or --large depending on the cloud type you are working on.

$ ssh <opscvm>

$ cd /home/m96722/rotatemechid/

$ /var/www/aic_opssimple/backend/manage.py export -f ./opssimple_site.yaml <envname> yaml

For medium:

$ cloud_type=medium

For large:
$ cloud_type=large

$ python3 ./test_mechids.py --file ./newmechids.yaml --site-file ./opssimple_site.yaml --$cloud_type                # test against both servers

$ python3 ./test_mechids.py --file ./newmechids.yaml --site-file ./opssimple_site.yaml --$cloud_type --atttest      # test against ATTTest

$ python3 ./test_mechids.py --file ./newmechids.yaml --site-file ./opssimple_site.yaml --$cloud_type --itservices   # test against ITSERVICES

Note: Please check in the output above that all mechids are members of the LDAP groups: AP-
AIC_Prod_Users, AP-AIC-Mobility, AP-365-DAY-PASSWORD-EXPIRATION . If the output shows a missing group,
the DE lead must reach out to the T3 managers to have them added on priority. T3 Contact
- https://wiki.web.att.com/pages/viewpage.action?pageId=429599467
If there is an error in the output like ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1) for all mechid
tests, please check that the CA certificates are installed: ls /etc/ssl/certs | egrep -i '(test-)?sbc'

The output should list certificate files; if not, please proceed to the Troubleshooting section to download
and install them, then test the new mechIDs again.

 Update/Install OPSFIX-0186. To do that on OpsSimple node run:


 $ ssh <opscvm>

 $ sudo apt-get install aic-opsfix-cmc-0186

 $ sudo dpkg -L aic-opsfix-cmc-0186


It should output a non-empty list of files.
Optional: find /home/m96722/aic/roles/aic_opsfix_0186/files/conf_list.yaml and append lines with the needed
conf files and services accordingly; please keep the yaml format while adding.

 On the OpsSimple node, revisit the inventory/hosts and inventory/openstack files and populate them with the
appropriate hostnames (typical location: /home/m96722/aic/ ). These files are used by ansible for every run.

17.b. Pre-Maintenance Check Manual (Non-Automated Requirements)


Step 17.b.1: Request storage operations to back up all vLCP LUNs

Time Line: 72 hours prior to CW.

Action: Submit a request to back up all vLCP LUNs in the site.

SRTS Ticket needs to be created 72 hours ahead of CW

AOTS http://ushportal.it.att.com/step3.cfm?home=ush&app=3962&is_sbc=&prob=52341

Please note that if you revert the vLCP nodes for any reason, it may leave orphaned VMs if those VMs
were created after the point the backup was taken.

Step 17.b.2: Backup Fuel Nailgun database.

Time Line: 5 minutes.

Action: On the fuel VM, as fuel admin execute -


ssh <fuelvm>

Check that there is sufficient disk space in /var/tmp for the backup; if there is not enough space, please
use /var/log/tmp/ instead of /var/tmp/ in the backup/revert steps that follow.
df -h

For mediums, recommend a minimum of 10GB free space for backup.

For large sites, recommend a minimum of 20GB free space for backup.

Check that the following folders are empty or absent:


ls /root/rotatemechid/before /root/rotatemechid/after /var/tmp/fuelbackuppre

If not, clean them.

Create the backup folder:


mkdir /var/tmp/fuelbackuppre
cd /var/tmp/fuelbackuppre
sudo -u postgres pg_dump -d nailgun -f nailgun.dump
gzip nailgun.dump
chmod 700 /var/tmp/fuelbackuppre && chmod 600 /var/tmp/fuelbackuppre/*

Results/Descriptions: Nailgun dump file is created in the backup folder.

Please make a copy of /var/tmp/fuelbackuppre somewhere else (like the OpsSimple or MAAS node) in
case Fuel gets corrupted during the procedure and the local backups become unavailable.
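A minimal sketch of copying the backup off the Fuel VM (the destination host and path are examples only; adjust them to your site):

scp -r /var/tmp/fuelbackuppre <attuid>@<opscvm>:/var/tmp/fuelbackuppre_<site_name>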

Step 17.b.3: Backup Fuel settings

Time Line: 5 minutes

Action: On the fuel VM, as fuel admin execute -


ssh <fuelvm>
mkdir -p /root/rotatemechid/before
cd /root/rotatemechid/before

Identify the Fuel env id and set the variable:


fuel env
export envid=<envid>
fuel settings download --env $envid
fuel network download --env $envid
chmod 600 network_${envid}.yaml settings_${envid}.yaml

Results/Descriptions:
Log in to the OpsSimple node and run:
ssh <opscvm>

fuel env

/var/www/aic_opssimple/backend/manage.py export -f /home/m96722/rotatemechid/beforemechidrotate.yaml <env_name> yaml

18. ATS Bulletin

Procedure 1 - ATS Bulletin Check (Action / Results/Descriptions / Time Line)

Access the ATS Bulletin site at the following URL: ATS Bulletin

1. Click on the List/Search button.
2. Click on the Search Menu.
3. Determine a keyword associated with the activity that you are performing and type it into the Keyword (Hint) field.
4. Click on the Submit Query button.
5. Review each resulting bulletin by clicking on the bulletin number to determine whether any of the bulletins have warnings/actions associated with the activity that you are implementing.
6. List all associated bulletins in the following table. If no associated bulletin is found, place N/A in the following table.

Layer 3 - AIC

General AIC

19. Emergency Contacts

Contact (Team or Person) - Contact Information:
AIC Fuel Team - Vladimir Maliaev (vm321d)
AIC Fuel Team - Alexey Odinokov (ao241c)
CPVT - Nageswara Rao Guddeti (ng4707)
AIC T2/T3 - Via Q chat: qto://meeting/q_rooms_tg79241332875301882/EMOC+-+AIC+Tier+2+Team+Chat

20. Preliminary Implementation

NOTE: All terminal input/output must be logged during the change.


Permissions for Openstack configuration file backups should be set to 640, and other files 640 or
600. The rule of least privilege should apply, but we need to make sure the DE has access to the files
for restore / backout purposes. This should not be a group that non-SA and non-DE users would be
in.
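A minimal sketch of applying these permissions to a backup file (<admin_group> and <backup_file> are placeholders; use a group that is limited to SA/DE users on your nodes):

sudo chown root:<admin_group> <backup_file>
sudo chmod 640 <backup_file>     # or 600 if group access is not required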

Pre-check tasks are completed the night of the cutover, at least one hour prior to cutover activities.
1. Login to OpsSimple. # For the OpsSimple IP, refer to the AIC 3.x site list page in the References below. Switch
user to m96722:
2. sudo su - m96722 or sudo -iu m96722

3. ##References##

AIC 3.x site list: https://wiki.web.att.com/pages/viewpage.action?spaceKey=CCPdev&title=AIC+OpsSimple+3.x+VMs

4. Ping all the nodes in the inventory to make sure all nodes are reachable from the
OpsSimple node.

Verify that the DE has ssh access to the required nodes from the jump host node prior to the
changes (not using ansible).

From jump host run the following:


ansible lcp_cluster -i inventory/ -m ping
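To additionally verify direct SSH access without ansible, a minimal sketch (it assumes the inventory hosts file lists one hostname per line; adjust the path and the parsing to your inventory layout):

for h in $(grep -v '^\[\|^#\|^$' ~/aic/inventory/hosts | awk '{print $1}'); do
  echo -n "$h: "; ssh -q -o BatchMode=yes -o ConnectTimeout=5 $h hostname || echo "SSH FAILED"
done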

If the above steps don't work, CPVT or the DE should create an AOTS ticket and post it in the group chat
below:
qto://meeting/q_rooms_tg79241332875301882/EMOC+-+AIC+Tier+2+Team+Chat

Create an AOTS ticket specifying the IP and FQDN of the compute node(s) and the VM info, and skip deployment on those
compute node(s) (http://ushportal.it.att.com/index.cfm, put "AIC" in the search field, then select "AIC DATA
CENTER/NTC MOBILITY - REPORT A PROBLEM").

3. Disable Nagios Alerts:

On each nagios server:

Login using your UAM account and execute toor.

Perform the commands below:


/usr/local/bin/nagios-sched-downtime-zone.sh <zone> <days>

for example
/usr/local/bin/nagios-sched-downtime-zone.sh pdk5 0.5

The above example will put every node, and all service checks for each node, into scheduled downtime for
12 hours (0.5 days).

After completion of post checks, enable alarms.


For enabling alarming, the process is: "nagios-del-downtime-zone.sh [zone NAME]" or
"nagios-del-downtime-host.sh [host NAME]".
4. If the SolidFire (or any other storage backend) user/password is going to be rotated,
coordination between the Storage team and the DE must take place; otherwise cinder will
fail! Check with the Storage team, get the new user/password (or just a password if
the user stays the same), and make sure they switched the user/password on the storage
backend for the current zone/site. Also note that in the case of SolidFire, the
password for the 'sfopenstack' user should not be changed, because that would break other
sites that use the same user.
5. Remove all users authorized by "LDAP-LDAP-server" from the Foreman UI (relevant for
RCs less than RC25).
Use "admin/password" to login to the Foreman UI, NOT your att_uid! You can obtain the admin password
from /var/lib/puppet/foreman_cache_data/admin_password on the LCM nodes. If someone has already reset
the password, it will be different from what you find in the cache file, so you need to try
checking this file on every LCM node (see the sketch below). If it still does not work, run foreman-rake
permissions:reset on any LCM node. Use the freshly generated password to login to the Foreman UI as the admin user.
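A minimal sketch for checking the cached admin password on every LCM node from the Fuel VM (it assumes the forwarded SSH key from the UMASK/GSTools step and that column 5 of `fuel node` holds the node address, as in the other loops in this MOP):

ssh <fuelvm> -A
for i in $(sudo fuel node | awk -F"|" '/lcm/ {print $5}'); do
  echo -n "$i: "; ssh -q -o StrictHostKeyChecking=no $i 'sudo cat /var/lib/puppet/foreman_cache_data/admin_password'; echo
done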

5.1. Check whether "Automatically create accounts in Foreman" is enabled via API request.
export FOREMAN_USER=admin

export FOREMAN_PASSWORD=foreman_password

export FOREMAN_URL=https://foreman_url

curl -s -k -H "Content-Type:application/json" -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} \

${FOREMAN_URL}/api/v2/auth_source_ldaps | \

python -c 'import json,sys;print json.load(sys.stdin)["results"][0]["onthefly_register"]'

If the response is True , go to section 5.2, otherwise, execute the following commands to enable it:
FOREMAN_LDAP_ID=$(curl -s -k -X GET -H "Content-Type:application/json" \

-u ${FOREMAN_USER}:${FOREMAN_PASSWORD} ${FOREMAN_URL}/api/v2/auth_source_ldaps | \

python -c 'import json,sys;print json.load(sys.stdin)["results"][0]["id"]')

echo $FOREMAN_LDAP_ID # Make sure you have an integer ID

curl -s -k -X PUT -H "Content-Type:application/json" -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} \

-d '{"onthefly_register":true}' ${FOREMAN_URL}/api/v2/auth_source_ldaps/${FOREMAN_LDAP_ID} | \

python -m json.tool | grep onthefly_register

Check that onthefly_register is set to true


5.2. Export environment variables if you didn't do it in section 5.1
export FOREMAN_USER=admin

export FOREMAN_PASSWORD=foreman_password

export FOREMAN_URL=https://foreman_url

Save the code below as remove_users_foreman.py and execute the script python
remove_users_foreman.py to remove all users authorized by "LDAP-LDAP-server" in Foreman.
import os
import requests
import json
from requests.auth import HTTPBasicAuth

user = os.getenv('FOREMAN_USER')
password = os.environ.get('FOREMAN_PASSWORD')
url = os.environ.get('FOREMAN_URL')
URL = "{0}/api/v2/users/".format(url)

response = requests.get(URL, verify=False,
                        auth=HTTPBasicAuth(user, password)).content
for i in json.loads(response)['results']:
    if i['auth_source_name'] == 'LDAP-server':
        del_url = URL + str(i["id"])
        print "Removing user {0} {1} with ID {2}".format(
            i["firstname"], i["lastname"], i["id"])
        requests.delete(del_url, verify=False,
                        auth=HTTPBasicAuth(user, password))

5.3. Check in the Foreman UI that no more users authorized by LDAP exist (from now until the end
of the whole procedure below, log in to Foreman as admin ).

5.4. Add roles for the new admin tenant mechid to OpenStack to support LDAP

The steps need to be executed on the Fuel node, and the DE needs to source an rc file for executing OpenStack
commands. The DE can download their own openrc from Horizon, or it can be created by becoming
root, making a copy of the openrc_v2 file, chowning it to the DE user, editing it to substitute
their ATTUID for the mechid, and replacing the OS_PASSWORD line with:
# With Keystone you pass the keystone password.

echo "Please enter your OpenStack Password: "

read -sr OS_PASSWORD_INPUT

export OS_PASSWORD=$OS_PASSWORD_INPUT

Run:
grep "ldap.Identity" /etc/keystone/keystone.conf
Execute the following if ldap.Identity is present in above output:
source <your_openrc_file>

openstack role add --user <newAdminTenantMechid> --project admin admin

6. Check contrail vRouter connections.

6.1. Get the number of compute nodes in the env and remember it for further steps:
ssh <fuelvm>

compute_n=`fuel node| grep compute | wc -l` ; echo $compute_n

6.2. Perform Contrail pre checks (20.2 vRouter connection count checks) before proceeding
- https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-
195_Contrail_Control_Plane_Generic_checks.docx?
d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4
For every CCNT node, the number of connections must be equal to the output of 6.1 (see above). If not,
the number of vRouters is not equal to the number of computes. In that case you should stop the MOP
and go to item 6.4.

6.3. Check that every vRouter has at least two Established connections.

Get a list of contrail nodes:


ssh <fuelvm>

fuel node| grep contrail-control| awk -F"|" '/ready/ {print $5}'

Login to ccnt01 contrail-controller node and check:

Visual check:
ssh m96722@<CCNT01_node_IP>

export mesh_ip=`ip a | awk '/inet.*br-mesh/ {print $2}'| egrep -o "([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}"`; echo $mesh_ip

for each in $(python /var/tmp/ccnt_scripts/ist.py --host $mesh_ip ctr xmpp conn | awk '/att/ {print $2}');do python /var/tmp/ccnt_scripts/ist.py --host $each vr xmpp;done

If you see any issues of any type (e.g. Failed to reach destination , Error or any others), stop the MOP
and go to 6.4.

Successful output example:

Error output example:

If everything looks good, run the next command on the same CCNT node to make sure that there are at
least 2 Established connections and no Alerts:

Automatic check:
for each in $(python /var/tmp/ccnt_scripts/ist.py --host $mesh_ip ctr xmpp conn | awk '/att/ {print $2}');do echo -en "\nHost: $each ";python /var/tmp/ccnt_scripts/ist.py --host $each vr xmpp | grep Established |wc -l | while read l ;do echo -n $l; if [ $l -lt '2' ] ; then echo " ALERT!";fi;done;done; echo ''

Successful output example:

Error output example:

If you see any ALERT, run the visual-inspection command for the bad node(s) and check the issue. If you
confirm the issue, stop the MOP and go to 6.4.

6.4. Alert! If the above steps don't work, stop the MOP, create an AOTS ticket, and report the
issue(s). (See item 2 in the Preliminary section for details of the ticket creation.)

21. Implementation

NOTE: All terminal input/output must be logged during the change.

Step 21.1: Suspend Puppet Agent

Time Line: 10 minutes

Tenant Impact: No

Action: On the Fuel VM, as the fuel admin user, execute the following:


ssh <fuelvm>

Retrieve the Fuel env id and set it:


fuel env
export envid=<envid>

Upload the Fuel stop_puppet graph:


fuel2 graph upload --env $envid --type stop_puppet --file
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/stop_puppet.yaml

Execute the stop_puppet graph:


fuel2 graph execute --env $envid --type stop_puppet

Check that the graph execution is complete:


fuel2 task list

All tasks executed in the previous steps must be ready/skipped. If not,


check the specific task with the <task_id> from the above command's output:

fuel2 task history show <task_id> , then login to the node with the error task state and check the puppet log on
it: /var/log/puppet.log

Results/Descriptions: Check that the puppet agent is stopped on all nodes.

On the Fuel VM, as the fuel admin, execute the following to check the status of the puppet agents on all nodes:
for i in `sudo fuel nodes |grep ready | awk '{print $9}'`; do echo -n "$i: "; ssh -q -o StrictHostKeyChecking=no $i 'sudo service puppet status'; done


Change meta-data in OpsSimple, propagate to Fuel

This step gets the current metadata from OpsSimple into a yaml file, then changes the mechid-related items and
uploads the new yaml back to OpsSimple and Fuel.
$ ssh <opscvm>

$ /var/www/aic_opssimple/backend/manage.py export -f /home/m96722/rotatemechid/current.yaml <env_name> yaml

Here you can choose Option 1 for an automatic update of the OpsSimple yaml with new mechids, or Option 2 if
something goes wrong with Option 1.
Option 1: using the ./update_mechids.py script:
$ python3 /home/m96722/rotatemechid/update_mechids.py --mechid-file /home/m96722/rotatemechid/newmechids.yaml --site-file /home/m96722/rotatemechid/current.yaml > /home/m96722/rotatemechid/latest.yaml

Update the att_user_role_mapping item with the appropriate value(s) and check the format (the format is critical) in
the next file:
$ vi /home/m96722/rotatemechid/latest.yaml

Option 2: manually:
$ cp /home/m96722/rotatemechid/current.yaml /home/m96722/rotatemechid/latest.yaml

$ vi /home/m96722/rotatemechid/latest.yaml

* Replace `old mechids/passwords` with `new mechids/passwords`

* Update `att_user_role_mapping` with the appropriate value(s) and check the format (the format is **critical**)

* The first key section is the ldap section in the aic-fuel-plugin

* Non-openstack sections such as RO, LMA... must be updated according to the new MechIDs

* For a Solidfire credentials change, make the changes in the SOLIDFIRE section under the cinder block:

sf_san_login: <user>

sf_san_password: <password>

 The DE should get the changes reviewed by a peer and make sure that the new MechID(s) and
passwords passed the test (11.a.2 "Update automation meta-data" step) before
proceeding.

Here you can make a visual comparison between current.yaml and latest.yaml to make sure that only
MechIDs/Passwords have been changed:
ssh <opscvm>

$ cd /home/m96722/rotatemechid

$ diff current.yaml latest.yaml

This step gets OpsSimple to push the list of repos and new packages to Fuel.
ssh <opscvm>

cd /var/www/aic_opssimple/backend

./manage.py import /home/m96722/rotatemechid/latest.yaml yaml

./manage.py generate_scripts <env_name>


cd ~/aic/files/env_xxx/fuel-client

python ./script.py update_plugin

(Starting RC24) Rotate db_passwords for openstack services:


python ./script.py rotate_service_dbpassword

Example of a successful run:


2019-11-21 23:46:09,241 - INFO script.py:926 -- Changing db_password for nova

2019-11-21 23:46:09,242 - INFO script.py:929 -- gxrF9Qyvh7Qr1DIX1htXXXXX -> DDQQzxndkNFYWPInYCRXXXXX

old_password:gxrF9Qyvh7Qr1DIX1htXXXXX->new_password:DDQQzxndkNFYWPInYCRXXXXX

2019-11-21 23:46:09,242 - INFO script.py:926 -- Changing db_password for heat

2019-11-21 23:46:09,243 - INFO script.py:929 -- aB8NCEYhpcDbWRvSjc7XXXXX ->

q0pTKtZTamuGP8DPDweXXXXX

old_password:aB8NCEYhpcDbWRvSjc7XXXXX->new_password:q0pTKtZTamuGP8DPDXXXXX

Run Audit Fuel

From Fuel node run:


ssh <fuelvm>

cd /root/rotatemechid/after

fuel env

envid=<fuel env id>

fuel --env $envid settings --download

fuel --env $envid network --download

chmod 600 network_${envid}.yaml settings_${envid}.yaml

The DE must check that the only changes are mechid and passwords. The DE MUST get the diff
reviewed by a peer. To compare line by line:
cd /root/rotatemechid/after

diff -r network_<envid>.yaml ../before/network_<envid>.yaml

diff -r settings_<envid>.yaml ../before/settings_<envid>.yaml

To compare structure difference:


python /tmp/diff_struct.py network_<envid>.yaml ../before/network_<envid>.yaml

python /tmp/diff_struct.py settings_<envid>.yaml ../before/settings_<envid>.yaml

If something is wrong, the DE must go back, adjust the latest.yaml OpsSimple file, and rerun the
procedure. ANY MISTAKE HERE WILL RESULT IN TENANT IMPACT DURING PUPPET AGENT
RESTART, so the DE needs to take their time and double check.

Propagate meta-data changes through the system.

Propagate the meta-data change through the LCP, and switch the repositories and the code environment on
every node.

 During the check of the LCM infrastructure, we ensured that the latest Puppet code
has been downloaded.
 Fuel manifests switch the repositories on the OpenStack nodes.
 Fuel updates hiera and configdb afterwards.

fuel env

envid=<your env ID>

fuel2 graph upload --env $envid --type update_config_db --file

/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/update_config_db.yaml

fuel2 graph execute --env $envid --type update_config_db


# Remember the id of the last executed task

$ fuel2 task list

# All tasks should be ready

$ fuel2 task history show <task_id>

At this point the Puppet Agents operations have been disabled, but no changes have been applied
to OpenStack yet.

Apply the mechid change through LCM


Tenant Impact: YES

Up to this point there should not have been any tenant impact, since the Puppet agents have not run
yet. The DE must be aware that the OpenStack changes will be applied by the Puppet agents when they
are re-enabled.

Replace MechId for nova-compute service to avoid further vfd restart

Note: this step is mandatory only for versions older than 3.0.3 RC14, otherwise it can be skipped.

 Upload and execute rotate_mechid_compute graph: rotate_mechid_compute.yaml.


Open the file in the browser and copy-paste into the /root/mechidrotate directory.

ssh <fuelvm>

fuel env

envid=<fuel env id>

cd /root/mechidrotate/

Copy-paste the rotate_mechid_compute.yaml file from the location listed above


fuel2 graph upload --env $envid --type rotate_mechid --file rotate_mechid_compute.yaml

nodes_ids=`fuel node | grep -i compute | awk '{print $1}' | tr "\n" " "`

fuel2 graph execute --env $envid --type rotate_mechid --node $nodes_ids

fuel2 graph delete --env $envid --type rotate_mechid


Install start_puppet_now graph

 Upload start_puppet_now graph: start_puppet_now.yaml. Open the file in the browser


and copy-paste into the /root/mechidrotate directory.

ssh <fuelvm>

fuel env

envid=<fuel env id>

cd /root/mechidrotate/

Copy-paste the start_puppet_now.yaml file from the location listed above


sudo fuel2 graph upload --env $envid --type start_puppet_now --file /root/mechidrotate/start_puppet_now.yaml

Apply changes on primary aic-identity node

Execute Puppet on the primary aic-identity node and wait for the report in Foreman. Please use the
following approach for all steps from 9 to 14. This will create the roles in keystone if necessary.
# Pick primary identity node using the procedure below then run the start_puppet_now graph

$ sudo fuel env

$ node_id=$(for i in `sudo fuel node | awk -F"|" ' /identity/ {print $5}'`;do p=$(ssh -q -o StrictHostKeyChecking=no $i 'sudo hiera roles') ; echo -n $p" - " ;sudo fuel node | grep $i|awk -F"|" ' {print $1}';done | grep primary | rev|cut -d'-' -f1|rev)

$ echo $node_id

$ envid=$(sudo fuel env | awk '/operational/ {print $1}')

$ echo $envid

$ sudo fuel2 graph execute --env $envid --type start_puppet_now --node $node_id

wait for reports in Foreman.


Apply changes on other aic-identity nodes

Execute Puppet on second and third aic-identity nodes and wait for reports in Foreman.
$ node_ids=`fuel node | grep -i identity | awk '{print $1}' | tr "\n" " "`

$ fuel2 graph execute --env $envid --type start_puppet_now --node $node_ids

wait for reports in Foreman.

Apply changes on aic-controller nodes

Execute Puppet on all aic-controller nodes and wait for reports in Foreman.
$ node_ids=`fuel node | grep -i aic-controller | awk '{print $1}' | tr "\n" " "`

$ fuel2 graph execute --env $envid --type start_puppet_now --node $node_ids

wait for reports in Foreman.

Apply changes on Contrail controller nodes

Timeline: takes 30-45 minutes.

This step needs to be done manually, one node at a time, followed by service checks, before
moving on to the next Contrail node (02, and so on).
Be sure that the uptimes of the Contrail processes are in sync before proceeding with CCNT02 (see the
uptime-check sketch below). In case the services were not started at the same time, the DE needs to raise an AOTS ticket to T2.
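A minimal sketch for eyeballing Contrail process uptimes on a CCNT node (generic ps usage; the exact process names vary by release, and contrail-status is only used if it is present on the node):

ssh m96722@<CCNT_node_IP>
ps -eo etime,cmd | grep -i '[c]ontrail'   # elapsed time since each contrail process started
sudo contrail-status                      # if available, shows per-service state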

Create AOTS ticket specifying IP, FQDN of Contrail server(s) and output of Contrail service status
(http://ushportal.it.att.com/index.cfm, put "AIC" in the search field, then select "AIC DATA
CENTER/NTC MOBILITY - REPORT A PROBLEM")

Perform Contrail checks (section 20) before you proceed


- https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-
195_Contrail_Control_Plane_Generic_checks.docx?
d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4
Execute Puppet on the contrail-control nodes one by one, checking the reports in Foreman before each
next node.
$ fuel node | grep -i contrail-control

Start with CCNT01 node!

From fuel node run:


$ fuel2 graph execute --env $envid --type start_puppet_now --node <node_ccnt1>

wait for a report in Foreman


Verification: Usually it takes 5-7 minutes to align with the other services, but sometimes it might take up to 25 minutes
for the vRouters to re-establish a connection. Perform Contrail checks (section 21 - step1 thru step7) before proceeding -

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-

195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4

If everything is successful above, switch to the ccnt02 node and run:

From fuel node run:


$ fuel2 graph execute --env $envid --type start_puppet_now --node <node_ccnt2>

wait for a report in Foreman

Verification: Usually it takes 5-7 minutes to align with the other services, but sometimes it might take up to 25 minutes
for the vRouters to re-establish a connection. Perform Contrail checks (section 21 - step1 thru step7) before proceeding -

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-

195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4

If everything is successful above, switch to the ccnt03 node and run:

From fuel node run:


$ fuel2 graph execute --env $envid --type start_puppet_now --node <node_ccnt3>

wait for a report in Foreman.

Verification: Perform Contrail checks (section 21 - step1 thru step8) before proceeding -

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-

195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4

Apply changes on core LCP and swift nodes

Execute Puppet on the rest of LCP nodes.

Loop through LCP roles (DO NOT include 'aic-compute' role in the list):
$ nodes_ids=''

$ roleslist=$(fuel node | tail -n+3| awk -F"|" '{print $7}' |sort|uniq|egrep -v "^[[:space:]]*$|compute|identity|aic-controller|contrail-control" | while read i; do echo -n $i\| ;done); role=${roleslist%?}

$ nodes_ids=`fuel node | egrep -i "$role" | awk '{print $1}' | tr "\n" " "`

$ fuel2 graph execute --env $envid --type start_puppet_now --node $nodes_ids

wait for reports in Foreman.


Apply changes on compute nodes

Execute Puppet on all compute nodes and wait for reports in Foreman.

Due to performance implications it is not possible to run Puppet on all computes right away. Keeping
that in mind, it is better to run Puppet either rack-by-rack or in batches of 10, 20, or 30 (depending on
the number of CPUs on the LCM nodes). The vital part of running Puppet is checking the Puppet reports in
Foreman, restarting Puppet on a node in case of failure, and decreasing the batch size to reduce the
number of failed nodes.

Loop through the racks within the aggregation zone:

Note: if the procedure below does not work due to site naming conventions (i.e. no "rXXc" in the name),
then obtain all of the computes for an availability zone and execute the start_puppet_now graph in
batches (see the batching sketch after the rack loop below). Then do the same for the other availability zone(s).
$ rack_id='r10c'

$ node_ids=`fuel node | grep -i compute| grep -i $rack_id | awk '{print $1}' | tr "\n" " "`

$ fuel2 graph execute --env $envid --type start_puppet_now --node $node_ids

wait for reports in Foreman.
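If rack-based selection is not possible, a minimal batching sketch (the batch size of 20 is only an example; wait for the Foreman reports of one batch to turn green before launching the next):

$ all_computes=$(fuel node | grep -i compute | awk '{print $1}')
$ echo $all_computes | tr ' ' '\n' | xargs -n 20 | while read batch; do
    echo "Batch: $batch"
    fuel2 graph execute --env $envid --type start_puppet_now --node $batch
    read -p "Press Enter when the Foreman reports for this batch are green..." </dev/tty
  done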

Update mechIDs/passwords in Glance Database

Here you need two pairs of credentials: <old_glance_mechid>,
<old_glance_password> and <new_glance_mechid>, <new_glance_password> . Normally, in the case of a single
MechId, these are just the old MechId/password being rotated and the new MechId/password.
ssh <fuelvm>

fuel node | grep dbng

#Logout of root, and:

ssh <any dbng node>

export OLD_MECHID_USER=<old_glance_mechid>

export OLD_MECHID_PASSWORD=<old_glance_password>

export NEW_MECHID_USER=<new_glance_mechid>

export NEW_MECHID_PASSWORD=<new_glance_password>

mysql glance -e "update image_locations set value=replace(value,'${OLD_MECHID_USER}:$

{OLD_MECHID_PASSWORD}', \

'${NEW_MECHID_USER}:${NEW_MECHID_PASSWORD}') where status = 'active' and value like '%$

{OLD_MECHID_USER}:${OLD_MECHID_PASSWORD}%';"
Check that there no more old MechId entries left in Glance DB:
mysql glance -e "select value from image_locations where status = 'active' and value like '%${OLD_MECHID_USER}%';"

Output should be empty

Make sure that there are no very old, forsaken MechId entries left in the Glance DB:
mysql glance -e "select value from image_locations where status = 'active' and value NOT like '%${NEW_MECHID_USER}%';"

NOTE: If any older mechids (older than the previous mechid itself) are found that are associated with
an image, then open an AOTS ticket for Operations to correct the issues.

Expire outdated mechids in mysqlDB

Login to the OpsSimple node and run the playbook with the specified tags to see the mechIDs in mysqlDB to be
expired, then expire them and show the expired mechids.

As a result you get a list of expired user(s). Save the list in order to remove the users from mysqlDB
completely later (see the steps below).
$ ssh <opscvm>

$ cd aic/

$ ansible-playbook /home/m96722/aic/playbooks/expire_old_users.yml -i inventory/ -e 'opssimple_file=/home/m96722/rotatemechid/latest.yaml' --tags display

$ ansible-playbook /home/m96722/aic/playbooks/expire_old_users.yml -i inventory/ -e 'opssimple_file=/home/m96722/rotatemechid/latest.yaml' --tags expire

$ ansible-playbook /home/m96722/aic/playbooks/expire_old_users.yml -i inventory/ -e 'opssimple_file=/home/m96722/rotatemechid/latest.yaml' --tags show_expired

Example of show_expired tag output:


...

TASK: [debug msg="User(s) {{ mysql_expired.stdout_lines }} are expired. You can delete them further"] ***

ok: [zmtn11dbng01.mtn11.cci.att.com] => {

"msg": "User(s) ['m11111@localhost', 'm11111@localhost1', 'm11111@localhost2', 'm11112@localhost',

'm11113@localhost', 'm11114@localhost', 'm11114@localhost1', 'm11114@localhost2', 'm11114@localhost3',

'm11114@localhost4'] are expired. You can delete them further"

...

Expire the password for previous MechID

The previous step should have expired all outdated mechIDs, including the previous
mechID. In case the previous step ran unsuccessfully, and also to make sure that the previous mechID is
expired, please execute the following:
$ ssh <fuelvm>

$ sudo fuel node | grep dbng

$ ssh <any dbng node>


$ execute toor

$ mysql

> select user,host from mysql.user where user = "<old_mechid>";

> alter user "<old_mechid>"@"<host>" password expire;  # <-- Run this command for each pair (old_mechid, host) from the output of the previous command

Delete expired users from mysqlDB

Please use AIC-MOP-197 Delete absolute DB users and Drop Databases to delete expired users that
you got in Expire outdated mechids in mysqlDB step above.
Restart cmha services

Login to Fuel using your attuid, with SSH agent forwarding:


$ ssh <fuelvm> -A
$ for i in `sudo fuel node | awk -F\| '/aic-controller/ {print $5}'`; do ssh $i "sudo crm resource restart cmha ; sudo service cmha_restapi restart" ;done

CMC node update

This step can be processed in parallel with the Test plan, because it is executed on the DCP's CMC server.

AIC OpsSimple 3.x CD & LCM: https://wiki.web.att.com/pages/viewpage.action?pageId=517384070

AIC DCP/LCP Site Matrix: https://wiki.web.att.com/pages/viewpage.action?pageId=493569807

If the DEs are not aware of the CMC update, then they should check with peers who usually
perform this activity.

You need to change credentials in CMC node (DCP) for site inventory (as in example below):
1. orm.<SITE_NAME>.<LCP_TYPE>

example: /home/m96722/orm/inventory/host_vars/orm.dpa2b.large

2. opsc.<SITE_NAME>.<LCP_TYPE>

example: /home/m96722/orm/inventory/host_vars/opsc.dpa2b.large

File Name: orm.site_type.environment

Example: orm.mck1b.medium
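A minimal sketch of updating those inventory files on the CMC node (<cmc_node> is a placeholder; the exact variable names holding the MechID/password depend on the inventory, so review the files before editing):

ssh <cmc_node>
sudo -iu m96722
vi /home/m96722/orm/inventory/host_vars/orm.<SITE_NAME>.<LCP_TYPE>    # update the MechID/password entries
vi /home/m96722/orm/inventory/host_vars/opsc.<SITE_NAME>.<LCP_TYPE>   # update the MechID/password entries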

Install aiclcm package (Version : aic-opssimple-plugins-aiclcm.39142.deb)

Execute MOP-355 Update aiclcm packages https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP


%20Document%20Library/AIC-MOP-355-Update_aic_lcm_packages.docx?
d=w787973068da2491288f93d6b4a0ea17e&csf=1&web=1&e=4Z0SyR

Update new mechid settings for Fuel if LDAP enabled in keystone.conf on Fuel node

Check if LDAP is enabled in keystone.conf on the Fuel node. Login to the Fuel VM:


ssh <attuid>@<fuelvm>

Execute:
grep "ldap.Identity" /etc/keystone/keystone.conf

Example of output: It means LDAP is enabled


# grep "ldap.Identity" /etc/keystone/keystone.conf

driver= keystone.identity.backends.ldap.Identity
If LDAP is enabled, proceed and execute Steps 20, 21.b, and 22 of AIC-MOP-379 (MOP to
enable LDAP as a backend for keystone on the Fuel node).
(https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-379%20-
%20MOP%20to%20enable%20LDAP%20as%20a%20backend%20in%20keystone%20on%20Fuel
%20node.docx?d=w6c76d99a236343e78299a39c9d8aceb8&csf=1&web=1&e=v0a1zd)

Make sure that "aic-opsfix-cmc-0175" has verson:86666 or above is installed during Step 20 of AIC-
MOP-379

If a current site's RC > RC22.xx - during execution of MOP-379, skip opsfix-0175 execution

Non-Openstack components update

On some sites Trove has been removed. If you get an error from the following ansible-playbook execution,
please delete lines 132-160 from
/home/m96722/aic/playbooks/aic_non_openstack_mechid_update.yml on the OpsSimple node and run it
again (fixed in 3.0.3 RC15); see the sketch below for removing those lines.
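A minimal sketch for removing those lines (back up the playbook first; the line numbers are the ones given in the note above and may shift between releases, so verify them in an editor before deleting):

$ cp /home/m96722/aic/playbooks/aic_non_openstack_mechid_update.yml /home/m96722/aic/playbooks/aic_non_openstack_mechid_update.yml.bak
$ sed -i '132,160d' /home/m96722/aic/playbooks/aic_non_openstack_mechid_update.yml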
$ ssh <opscvm>

$ cd aic/

$ ansible-playbook -i inventory -e 'mechid_file=/home/m96722/rotatemechid/newmechids.yaml' playbooks/aic_non_openstack_mechid_update.yml

$ ansible-playbook -i inventory -e 'mechid_file=/home/m96722/rotatemechid/newmechids.yaml' playbooks/aic_nagios_mechid_update.yml
Proceed with approved AIC-MOP-245 to swap MechID for RO related components (Implementation
takes about 15 minutes).

Nagios credentials update

On OpsSimple node execute below steps:


Step 1: Take backup of api_monitoring.conf
sudo cp -p /etc/nagios/nrpe.d/api_monitoring.conf /etc/nagios/nrpe.d/api_monitoring.conf_backup

Step 2: Please follow approved AIC-MOP-215 to get Nagios credentials updated

Step 3: verify nagios MechId updated


grep username /etc/nagios/nrpe.d/api_monitoring.conf

Note: if the MechId was not updated, skip steps 4-7; otherwise, continue with steps 4-7 below.

Step 4: Add the user in keystone with admin role and ccp-monitoring tenant if not already added

From opssimple node:


cd /home/m96722/aic
ansible-playbook -i inventory/ playbooks/openrc_automation/keystone_user_commands.yml --tag user_role --extra-vars "mechid=<mechid> tenant=ccp-monitoring"

ansible-playbook -i inventory/ playbooks/openrc_automation/keystone_user_commands.yml --tag user_add --extra-vars "mechid=<mechid> role=admin tenant=ccp-monitoring"

Step 5: Remove any occurrences of api_monitoring.conf except on the OpsSimple node


ansible all -i inventory/ -m shell -sa "ls -l /etc/nagios/nrpe.d/api_monitoring.conf"

Take note of the successful runs of the above command, except opssimple.

Execute the below ansible command to remove the file api_monitoring.conf


ansible all:\!jump_host -i inventory/ -m shell -sa "rm -f /etc/nagios/nrpe.d/api_monitoring.conf"

Step 6: Edit api_monitoring.conf with mechid and password


sudo vim /etc/nagios/nrpe.d/api_monitoring.conf

Step 7: Execute next playbook on OpsSimple node to create MechId user in DB for Nagios:
ansible-playbook -i inventory/ playbooks/deploy_nagios_agent.yml --limit dbng_host
Step 8: Restart nagios-nrpe-server service on OpsSimple
sudo service nagios-nrpe-server restart
Step 9: Verify nagios-nrpe-server service on OpsSimple
sudo service nagios-nrpe-server status

Service should be in started status

22. Test Plan

NOTE: All terminal input/output must be logged during the change.

Wait until puppet finishes its work and check for new Foreman reports (green status). About 30 mins.

Foreman reports from vLCP nodes should have entries with username and password changes and
some services should be restarted. Please login to Foreman UI and check reports for all nodes.

Login to the OpsSimple VM as your att uid (UAM). Refer to the jump host for sites as per the wiki links below:
https://wiki.web.att.com/display/CCPdev/AIC+OpsSimple+3.x+VMs and
https://wiki.web.att.com/display/CCPdev/Environments

sudo su - m96722 or sudo -iu m96722


Run the next playbook to find expired mechids in conf files (see the references
in /home/m96722/aic/roles/aic_opsfix_0186/files/conf_list.yaml ). Use latest.yaml with the new mechids:
cd /home/m96722/aic

ls /home/m96722/rotatemechid/latest.yaml

ansible-playbook -i inventory/ -e 'mechid_templ=/home/m96722/rotatemechid/latest.yaml' playbooks/validate_mechids.yml

Please check that there are no ansible errors related to access thrown during execution. It is
supposed to be fixed in 20. Preliminary Implementation, item 2.

Example of playbook output (you can find warnings on suspicious mechids):


...

TASK: [debug ] ****************************************************************

ok: [zmtn12mosc01.mtn12.cci.att.com] => {

"msg": "Suspicious mechids: m19437 m19438\n !!! Please check verbose /home/m96722/output for details."

...

For detailed playbook output see: cat /home/m96722/output . You may want to look for suspicious mechids,
conf files and nodes in the output file. If there are any, please check the Foreman reports for those affected nodes
and ensure that the puppet agent is running. If the issue is still in place even after a successful report, check
latest.yaml and ensure that OpsSimple pushed all the changes to Fuel.

Please run the next playbook to find services that were not restarted after their conf files were
changed:
ansible-playbook -i inventory/ playbooks/validate_services.yml

Please check that there are no ansible errors thrown during execution
For results see: cat /home/m96722/output_service

Example of playbook output(you can find OK and FAILED examples in the output_service file):
m96722@zmtn12opsc01:~/aic$ cat /home/m96722/output_service

...

zmtn12fuel01.mtn12.cci.att.com /etc/keystone/keystone.conf
zmtn12rosv02.mtn12.cci.att.com /opt/installer/ro.conf

zmtn12mosc02.mtn12.cci.att.com /etc/neutron/neutron.conf

neutron-server ok

mtn12r01c015.mtn12.cci.att.com /etc/neutron/neutron.conf

mtn12r01c009.mtn12.cci.att.com /etc/ceilometer/ceilometer.conf

ceilometer-polling ok

zmtn12mosc01.mtn12.cci.att.com /etc/keystone/keystone.conf

mtn12r01c015.mtn12.cci.att.com /etc/ceilometer/ceilometer.conf

ceilometer-polling ok

mtn12r01c009.mtn12.cci.att.com /etc/nova/nova.conf

nova-compute ok

zmtn12mosc01.mtn12.cci.att.com /etc/ceilometer/ceilometer.conf

ceilometer-polling ok

ceilometer-api ok

ceilometer-alarm-evaluator ok

ceilometer-alarm-notifier ok

ceilometer-agent-notification ok

ceilometer-collector ok

ceilometer-agent-notification ok

mtn12r03s002.mtn12.cci.att.com /etc/swift/account-server.conf

swift-account-server ok

swift-account-auditor ok

swift-account-replicator ok

swift-account-reaper ok
mtn12r01c015.mtn12.cci.att.com /etc/nova/nova.conf

nova-compute ok

zmtn12mosc01.mtn12.cci.att.com /etc/glance/glance-registry.conf

glance-registry ok

zmtn12mosc01.mtn12.cci.att.com /etc/heat/heat.conf

heat-engine ok

heat-api-cfn FAILED!!! Restart is needed

heat-api-cloudwatch FAILED!!! Restart is needed

zmtn12mosc02.mtn12.cci.att.com /etc/heat/heat.conf

heat-api FAILED!!! Restart is needed

heat-engine ok

heat-api-cfn FAILED!!! Restart is needed

heat-api-cloudwatch FAILED!!! Restart is needed

...

If there are services with the FAILED!!! Restart is needed message, please go to the node and restart them
manually (use sudo service <service_name> restart or sudo crm resource restart <crm_resource_name> or sudo
service apache restart if the service runs under apache). The only exception is the heat-
<services> (see: https://jira.web.labs.att.com/browse/DEFECT-6362 ); they are supposed to be restarted at
the end of the playbook. Then run the playbook again.

22.1 Verify that the sORM inventory credentials on the DCP OpsSimple CMC node
corresponding to the LCP have been updated

AIC OpsSimple 3.x CD & LCM: https://wiki.web.att.com/pages/viewpage.action?pageId=517384070

AIC DCP/LCP Site Matrix: https://wiki.web.att.com/pages/viewpage.action?pageId=493569807

To verify the credentials in the CMC node (DCP) for the site inventory (as in the example below), check the
date of the file; if it is older than the day of the change, the verification fails.
1. orm.<SITE_NAME>.<LCP_TYPE>

example: /home/m96722/orm/inventory/host_vars/orm.dpa2b.large
2. opsc.<SITE_NAME>.<LCP_TYPE>

example: /home/m96722/orm/inventory/host_vars/opsc.dpa2b.large

File Name: orm.site_type.environment

Example: orm.mck1b.medium

Login to fuel host and verify if the env is operational.

ssh <fuel_vm>
fuel env

Check if the above step shows the environment as operational

Request CPVT to perform complete regression and clean up artifacts

23. Backout Procedure

NOTE: All terminal input/output must be logged during the change.

The following sections detail the rollback/back-out procedures that can be used during MechID rotation. These
procedures will eventually be incorporated into a separate backup and restore guide that can be
exercised on AIC sites as part of standard operational activity. They cover restoring:
1. Specific components in the vLCP, independent of the remaining nodes in the vLCP. These need
to be exercised only if the component has to be restored to the last known state / the
state at which it was backed up.
2. The entire LCP to a pre-upgrade state.

Compute nodes backup & restore procedure shall be included separately in operations backup and
restore guide.

Revert site to old MechID(s)

The old configuration is still in /home/m96722/rotatemechid/current.yaml . You need to copy it
to /home/m96722/rotatemechid/latest.yaml :
$ ssh <opscvm>

$ cp /home/m96722/rotatemechid/current.yaml /home/m96722/rotatemechid/latest.yaml

and repeat the steps above starting from "Change meta-data in OpsSimple, propagate to Fuel" (Option 1
or Option 2).

Restore Fuel State from backup

 To restore just a nailgun database on Fuel node:


cd /var/tmp/fuelbackuppre
gunzip nailgun.dump.gz                 # the backup step compressed the dump with gzip
sudo -u postgres dropdb nailgun
sudo -u postgres createdb nailgun      # recreate the database; the plain pg_dump output does not include CREATE DATABASE
sudo -u postgres psql nailgun < nailgun.dump

Restore all vLCP nodes for LUN backup

If the previous actions did not restore the system, please request the storage team to restore the vLCP LUN backups
to revert to the previous version of the components in the vLCP.

Revert Opsfix 0136

To revert, run:
ansible-playbook -i inventory/ playbooks/aic_opsfix_revert_0136.yml

24. POST Activities (during maintenance window)

NOTE: All terminal input/output must be logged during the change.

To comply with RIM and audit findings, every MOP must include steps to remove backups and
artifacts created during the deployment of that MOP.

The removal process may be at a later date to allow for potential back outs. Any artifacts for this
must be listed in Post Change activities section and include instructions on scheduling the future
removal.

Any backups and artifacts created during the MOP execution which will not be needed for backout,
should be removed in the POST implementation activities section executed before the end of the
change window.

Permissions for Openstack configuration file backups should be set to 640, and other files 640 or
600. The rule of least privilege should apply, but we need to make sure the DE has access to the files
for restore / backout purposes. This should not be a group that non-SA and non-DE users would be
in.

Restore OpsSimple repo Source

 To restore repo sources on OpsSimple node:

sudo cp -p /var/tmp/sources.list /etc/apt/sources.list


sudo chmod 644 /etc/apt/sources.list
sudo apt-get update

Cleanup

NOTE: All terminal input/output must be logged during the change.

Since newmechids.yaml, beforemechidrotate.yaml, and latest.yaml contain sensitive data, it is
important to delete them when the procedure is over.
$ ssh <opscvm>

$ rm -f /home/m96722/rotatemechid/beforemechidrotate.yaml

$ rm -f /home/m96722/rotatemechid/latest.yaml

$ rm -f /home/m96722/rotatemechid/newmechids.yaml

$ rm -f /home/m96722/aic-opsfix-cmc-0064.*.deb

$ rm -f /var/tmp/sources.list

$ rm -f /home/m96722/output

$ rm -f /home/m96722/output_service

$ ssh <fuelvm>

$ cd /root/rotatemechid/

Clean up all the artifacts in the ./before and ./after folders.

$ cd /var/tmp/fuelbackuppre

Clean up all the artifacts here in ./

$ cd /tmp/fuelbackuppre

Clean up all the artifacts here in ./

Clean up the place where you copied /var/tmp/fuelbackuppre in Step 17.b.2.

--
CPVT to perform Sanity Checks

CPVT to cleanup artifacts

25. Post Maintenance Work

NOTE: All terminal input/output must be logged during the change.

NA

26. Appendix and Tables (IF REQUIRED)


26.1. Troubleshooting
Test of new mechids fails: TLS certificates not installed.
If you get the error ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1) during the new mechid test, you need
to check that the CA certificates are installed on the OpsSimple node.

Example:
Testing <service>

Checking ***ATTTest LDAP***: its-ad-ldap.atttest.com

ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)

Check creds and parameters...

Status: Fail

There are two different sets of certs, one for each ldap domain: ITSERVICES and TESTITSERVICES.

To install them:
 Download zip https://workspace.web.att.com/sites/WDS/Lists/IP%20Addresses
%20for%20DNS%20%20WINS%20%20LDAP%20%20AD%20Time%20Sources/
Attachments/6/TrustedRootCerts.zip , unzip, scp to OpsSimple node and place
all *.cer files from ITSERVICES or TESTITSERVICES folders to /usr/local/share/ca-
certificates/
 Note: if you are granted enough permissions, try directly: `sudo scp

<your_attuid>@199.37.162.36:/staging/ldap_cacerts/<domain>/* /usr/local/share/ca-certificates/`

 Use `ITSERVICES` or `TESTITSERVICES` instead of `<domain>`.

 Rename *.cer to *.crt:


for i in `sudo ls /usr/local/share/ca-certificates/*.cer` ; do sudo mv $i ${i%.*}.crt ;done

 Run: sudo /usr/sbin/update-ca-certificates

Manually changed parameters in config files


Only works for PROD sites on RC22 and above

If you need to change parameters in config files, you need to find them in the Fuel UI -> Fuel plugin ->
additional_config and change the values there BEFORE executing the custom graph update_config_db .

Example:
Notice: /Stage[main]/Attcinder::Controller::Aic_cinder_volume/Attcinder::Controller::Solidfire[SOLIDFIRE]/

Cinder_config[SOLIDFIRE/sf_svip]/value: current_value 32.50.209.243:3260, should be 32.50.209.244:3260 (noop)

After update_config_db is executed, hiera will get updated with new mechIDs and all changes made
in additional_config .
Deployment fails with "Could not evaluate: LDAP source LDAP-server delete error:
(422 Unprocessable Entity)" error
If the deployment fails with the HTTP 422 error:
/Stage[main]/Plugin_lcm::Tasks::Foreman/Foreman_ldap_auth[foreman_ldap_auth_source] (err): Could not evaluate: LDAP

source LDAP-server delete error: (422 Unprocessable Entity):

in /var/log/puppet/puppet.log on LCM nodes, please do the following:


1. Login to Foreman UI as admin user and CHECK the box "Automatically create

accounts in Foreman" if unchecked.


2. In Foreman UI switch to "Users" and delete all accounts that have "LDAP-LDAP-
server" in the column "Authorized by". Do not touch those that have "INTERNAL" in

that column.
3. Wait until Puppet applies the Day2 catalog once again.

Immutable config files


Sometimes we notice that config files are set to immutable and Puppet is stopped on a few nodes or
shows errors:
 check file attributes: lsattr <file>
 unset immutable: chattr -i <file>
 wait until puppet applies catalog

OpsSimple node shows error (APIError)


 Check the client configuration: /home/m96722/.config/aiccliopssimple/OpsSimple.yaml
 For some versions the pykwalify module is not installed but still used; comment out
lines 8 and 9 in /var/www/aic_opssimple/backend/api/base/uploaders.py , then restart the
opssimple backend: /home/m96722/install/restartserver.sh
 If APIError: "Error Authenticating. Please use a UAM ATT or Mech ID to login." appears, that
probably means the /etc/shadow file is inaccessible to some authorization tools. You
may set read permission on this file for the group ( sudo chmod g+r /etc/shadow ) or
set enable_pam: False in ~/.config/aiccliopssimple/OpsSimple.yaml , then restart the opssimple
backend: /home/m96722/install/restartserver.sh

Stuck Fuel graph in pending state


Sometimes when we launch fuel graph execute, it ends up in a pending state.

Symptoms:

 fuel graph execute doesn't return the console for 10-15 minutes
 fuel2 task list shows last task in pending state

Solution:
1) task_id=<task_id of pending task>

2) $ fuel task --delete --force --task-id $task_id

3) $ service postgresql restart

4) $ service nailgun restart

Login on nodes to troubleshoot


 In case you need to login to any OpenStack node to troubleshoot, use the method below
to login to the nodes.
 To login to the fuel node:

 Login with your att uid (UAM)

 sudo fuel node

 ssh <attuid>@<node ip>


 In case you need to login to any non-OpenStack node to troubleshoot, use the method below
to login to the nodes.
 To login to the opssimple node:

 Login with your att uid (UAM)

ssh <node ip>
