IPL1A RC35 Mechid Artifacts


17. Pre-Maintenance Check, Precautions and Preparations

Tenant Impact: No

NOTE: All terminal input/output must be logged during the change.

** Important ** : Execute MOP-355 Update aiclcm packages

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-355-
Update_aic_lcm_packages.docx?
d=w787973068da2491288f93d6b4a0ea17e&csf=1&web=1&e=4Z0SyR

17.a.1 Add OPSFIX repository


On the OpsSimple node, make a backup of sources.list:
sudo cp -p /etc/apt/sources.list /var/tmp/sources.list

sudo chmod 600 /var/tmp/sources.list

If it is a lab environment, add these repos to /etc/apt/sources.list (for 3.0.3):


deb http://mirrors-aic.it.att.com/opssimple/opsfix3.0.3-dev/ trusty main

deb http://mirrors-aic.it.att.com/opssimple/ops3.0.3/ trusty main

If it is a Prod environment, add these repos to /etc/apt/sources.list (for 3.0.3):


deb http://mirrors.it.att.com/opssimple/opsfix3.0.3/ trusty main

deb http://mirrors.it.att.com/opssimple/ops3.0.3/ trusty main

If it is a lab environment, add this repo to /etc/apt/sources.list (for 3.0.2):


deb http://mirrors-aic.it.att.com/opssimple/opsfix3.0.2-dev/ trusty main

If it is a Prod environment, add this repo to /etc/apt/sources.list (for 3.0.2):


deb http://mirrors.it.att.com/opssimple/opsfix3.0.2/ trusty main

Run the repository metadata update command:


$ sudo apt-get update

17.a.2 Pre-Maintenance Check Tools/System

 Install/Update aic-lcm packages:

$ ssh <opscvm>

$ sudo apt-get install aic-lcm aic-opssimple-plugins-aiclcm


$ [ $(apt-cache policy aic-opssimple-plugins-aiclcm | grep Candidate: | egrep -o "[0-9]+$") -lt 40023 ] && \
  sudo apt-get install -y aic-lcm=3.0.3-40023 aic-opssimple-plugins-aiclcm=40023

 Install ldap-utils package:

$ ssh <opscvm>

$ sudo apt-get install -y ldap-utils


 Install python3-openssl package:

$ ssh <opscvm>

$ sudo apt-get install -y python3-openssl

 Update OPSFIX-0064. To do that on OpsSimple node run:


 $ ssh <opscvm>

 $ sudo apt-cache search aic-opsfix-cmc-0064

If output contains information about the package, run:


$ sudo apt-get install aic-opsfix-cmc-0064
Otherwise, download and install it manually using the steps below:
$ pck=$(curl -s http://mirrors-aic.it.att.com/aic-mos/review/40972/opssimple/pool/main/a/aic-opsfix-cmc-0064/ | egrep -o "aic-opsfix-cmc-0064.[0-9]*.deb" | sort | tail -1)

$ # Download the package, for example with curl (wget works too):
$ curl -O http://mirrors-aic.it.att.com/aic-mos/review/40972/opssimple/pool/main/a/aic-opsfix-cmc-0064/$pck

$ sudo dpkg -i $pck

$ rm $pck

Check installed package (it should output a non-empty list of files):


$ sudo dpkg -L aic-opsfix-cmc-0064

Run the next command to propagate the contrail check script to the contrail controllers:


$ cd ~/aic

$ ansible-playbook -i inventory/ roles/aic_opsfix_0064/playbooks/aic_opsfix_deploy_0064.yml


Copy diff_struct.py script to fuel node:
$ cd ~/aic

$ ansible -i inventory/ fuel_host -m copy -a "src=~/rotatemechid/diff_struct.py dest=/tmp/"

 Fuel status

Check that the environment status is operational and that there are no error nodes and no offline nodes.

On OpsSimple node run:


ssh <fuelvm>

fuel env
Check that status is operational
ssh <fuelvm>

fuel nodes | grep -v ready

fuel nodes | tail -n +3 | awk -F"|" '{ if ($9 !=1) print $1 "Offline" $3 $9}'

 LCM status

This step verifies that the LCM infrastructure is in place and operational. If the LCM infrastructure is
not operational, it will not be able to deploy the changes properly. The Puppet code required to
deploy the OpenStack packages needs to be downloaded and assigned to the nodes. This is done by
R10K, which downloads the Puppet code and installs it into the Puppet Master. The "environment"
column in Foreman indicates what version of the Puppet code a particular node is running. The "last
report" column indicates when the Puppet code was applied for the last time.

Verify that the nodes are green; if they are not in sync, review the node reports to assess severity. Based
on severity, raise an itrack ticket (you need to assign it to TigerTeam members)
- https://itrack.web.att.com/secure/Dashboard.jspa?selectPageId=19152

Attach node report to the ticket

In the hosts tab, check that all the nodes except the non-openstack nodes are pointing at the
previous release_candidate. The only nodes which can be pointing at production are the non-
openstack nodes.
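A minimal sketch for spot-checking this from the command line via the Foreman API (it assumes the admin credentials described later in the Preliminary Implementation section; the exact field names can vary by Foreman version, so treat it as a convenience check, not a replacement for the UI review):

export FOREMAN_USER=admin
export FOREMAN_PASSWORD=<foreman_admin_password>
export FOREMAN_URL=https://<foreman_url>
curl -s -k -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} "${FOREMAN_URL}/api/v2/hosts?per_page=1000" | \
python -c 'import json,sys; [sys.stdout.write("%s %s %s\n" % (h.get("name"), h.get("environment_name"), h.get("last_report"))) for h in json.load(sys.stdin)["results"]]'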

Results/Description:
 UMASK and GSTools

GSTools installation changes the UMASK on the LCM nodes and causes quite a lot of interference
with R10K. Notice: you need to forward your SSH key when logging in to Fuel to be able to log in to the LCM
nodes.
Login to fuelvm as attuid:

ssh <fuelvm> -A

Run this to get the LCM node IP/hostname:


sudo fuel node | grep lcm

Login to lcm node as attuid:

ssh <lcm_vm>

cd /etc/puppet/environments

pwd

sudo chmod -R ugo+rx /etc/puppet/environments

Update automation metadata

 Verify that test_mechids.py and update_mechids.py are available on OpsSimple VM


(see /home/m96722/rotatemechid/ folder).
 You need to prepare newmechids.yaml . For that purpose you can use the
template new_mechids.yaml.templ in the folder /home/m96722/rotatemechid/ (see the sketch after the commands below)
 Populate it on the OpsSimple node, or transfer a prepopulated newmechids.yaml into the
OpsSimple VM if needed:
$ scp newmechids.yaml m96722@<opscvm>:/home/m96722/rotatemechid/newmechids.yaml

$ sudo chown m96722:mechid newmechids.yaml
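A minimal sketch of preparing newmechids.yaml in place from the template (assuming the template and folder named above; the fields to fill in come from the template itself):

$ ssh <opscvm>
$ cd /home/m96722/rotatemechid/
$ cp new_mechids.yaml.templ newmechids.yaml
$ vi newmechids.yaml            # fill in the new MechIDs/passwords
$ chmod 600 newmechids.yaml     # keep the credentials readable only by the owner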

 Test the new mechIDs with test_mechids.py . The new set of MechIDs will be tested against
both LDAPs or either of them: ITServices and ATTTest. Use the parameter
--medium or --large depending on the cloud type you are working on.

$ ssh <opscvm>

$ cd /home/m96722/rotatemechid/

$ /var/www/aic_opssimple/backend/manage.py export -f ./opssimple_site.yaml <envname> yaml

For medium:

$ cloud_type=medium

For large:
$ cloud_type=large

$ python3 ./test_mechids.py --file ./newmechids.yaml --site-file ./opssimple_site.yaml --$cloud_type                # test against both servers

$ python3 ./test_mechids.py --file ./newmechids.yaml --site-file ./opssimple_site.yaml --$cloud_type --atttest      # test against ATTTest

$ python3 ./test_mechids.py --file ./newmechids.yaml --site-file ./opssimple_site.yaml --$cloud_type --itservices   # test against ITSERVICES

Note: Please check in the output above that all mechids are members of the LDAP groups: AP-
AIC_Prod_Users, AP-AIC-Mobility, AP-365-DAY-PASSWORD-EXPIRATION . If the output shows a missing group,
the DE lead must reach out to the T3 managers to have them added on priority. T3 Contact
- https://wiki.web.att.com/pages/viewpage.action?pageId=429599467
If there is an error in the output like ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1) for all mechid
tests, please check that the CA certificates are installed: ls /etc/ssl/certs | egrep -i '(test-)?sbc'

The output should list certificate files; if not, please proceed to the Troubleshooting section to download
and install them, then test the new mechIDs again.

 Update/Install OPSFIX-0186. To do that on OpsSimple node run:


 $ ssh <opscvm>

 $ sudo apt-get install aic-opsfix-cmc-0186

 $ sudo dpkg -L aic-opsfix-cmc-0186


It should output a non-empty list of files.
Optional: find /home/m96722/aic/roles/aic_opsfix_0186/files/conf_list.yaml and append lines with the needed
conf files and services accordingly; please keep the yaml format while adding.

 On the OpsSimple node, revisit the inventory/hosts and inventory/openstack files and populate them with the
appropriate hostnames (typical location: /home/m96722/aic/ ). These files are used by ansible for every run.

17.b. Pre-Maintenance Check Manual (Non-Automated Requirements)


Step 17.b.1: Request storage operations to back up all vLCP LUNs

Time Line: 72 hours prior to CW.

Action: Submit a request to back up all vLCP LUNs in the site.

SRTS Ticket needs to be created 72 hours ahead of CW

AOTS http://ushportal.it.att.com/step3.cfm?home=ush&app=3962&is_sbc=&prob=52341

Please note that if you revert the vLCP nodes for any reason, it may leave orphaned VMs if those VMs
were created after the point the backup was taken.

Step 17.b.2: Backup Fuel Nailgun database.

Time Line: 5 minutes.

Action: On the fuel VM, as fuel admin execute -


ssh <fuelvm>

Check that there is sufficient disk space in /var/tmp for the backup; if there is not enough space, please
use /var/log/tmp/ instead of /var/tmp/ in the backup/revert steps that follow.
df -h

For mediums, recommend a minimum of 10GB free space for backup.

For large sites, recommend a minimum of 20GB free space for backup.

Check that the following folders are empty or absent:


ls /root/rotatemechid/before /root/rotatemechid/after /var/tmp/fuelbackuppre

If not, clean them.

Create the backup folder:


mkdir /var/tmp/fuelbackuppre
cd /var/tmp/fuelbackuppre
sudo -u postgres pg_dump -d nailgun -f nailgun.dump
gzip nailgun.dump
chmod 700 /var/tmp/fuelbackuppre && chmod 600 /var/tmp/fuelbackuppre/*

Results/Descriptions: Nailgun dump file is created in the backup folder.

Please make a copy of /var/tmp/fuelbackuppre somewhere else (like the OpsSimple or MAAS node) in
case Fuel gets corrupted during the procedure and the local backups become unavailable.
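A minimal sketch of copying the backup off the Fuel VM (the destination host and path are examples only; adjust them to your site):

scp -r /var/tmp/fuelbackuppre <attuid>@<opscvm>:/var/tmp/fuelbackuppre_<site_name>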

Step 17.b.3: Backup Fuel settings

Time Line: 5 minutes

Action: On the fuel VM, as fuel admin execute -


ssh <fuelvm>
mkdir -p /root/rotatemechid/before
cd /root/rotatemechid/before

Identify the Fuel env id and set the variable:


fuel env
export envid=<envid>
fuel settings download --env $envid
fuel network download --env $envid
chmod 600 network_${envid}.yaml settings_${envid}.yaml

Results/Descriptions:
Log in to the OpsSimple node and run:
ssh <opscvm>

fuel env

/var/www/aic_opssimple/backend/manage.py export -f /home/m96722/rotatemechid/beforemechidrotate.yaml <env_name> yaml

18. ATS Bulletin

Procedure 1 - ATS Bulletin Check (Action / Results/Descriptions / Time Line)

Access the ATS Bulletin site at the following URL: ATS Bulletin

1. Click on the List/Search button.
2. Click on the Search Menu.
3. Determine a keyword associated with the activity that you are performing and type it into the Keyword (Hint) field.
4. Click on the Submit Query button.
5. Review each resulting bulletin by clicking on the bulletin number to determine whether any of the bulletins have warnings/actions associated with the activity that you are implementing.
6. List all associated bulletins in the following table. If no associated bulletin is found, place N/A in the following table.

Layer 3 - AIC

General AIC

19. Emergency Contacts

Contact (Team or Person) - Contact Information:
AIC Fuel Team - Vladimir Maliaev (vm321d)
AIC Fuel Team - Alexey Odinokov (ao241c)
CPVT - Nageswara Rao Guddeti (ng4707)
AIC T2/T3 - Via Q chat: qto://meeting/q_rooms_tg79241332875301882/EMOC+-+AIC+Tier+2+Team+Chat

20. Preliminary Implementation

NOTE: All terminal input/output must be logged during the change.


Permissions for Openstack configuration file backups should be set to 640, and other files 640 or
600. The rule of least privilege should apply, but we need to make sure the DE has access to the files
for restore / backout purposes. This should not be a group that non-SA and non-DE users would be
in.
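A minimal sketch of applying these permissions to a backup file (<admin_group> and <backup_file> are placeholders; use a group that is limited to SA/DE users on your nodes):

sudo chown root:<admin_group> <backup_file>
sudo chmod 640 <backup_file>     # or 600 if group access is not required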

Pre-check tasks are completed the night of the cutover, at least one hour prior to cutover activities.
1. Login to OpsSimple. # For the OpsSimple IP, refer to the AIC 3.x site list page in the References below. Switch
user to m96722:
2. sudo su - m96722 or sudo -iu m96722

3. ##References##

AIC 3.x site list: https://wiki.web.att.com/pages/viewpage.action?spaceKey=CCPdev&title=AIC+OpsSimple+3.x+VMs

4. Ping all the nodes in the inventory to make sure all nodes are reachable from the
OpsSimple node.

Verify that the DE has ssh access to the required nodes from the jump host node prior to the
changes (not using ansible).

From jump host run the following:


ansible lcp_cluster -i inventory/ -m ping
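To additionally verify direct SSH access without ansible, a minimal sketch (it assumes the inventory hosts file lists one hostname per line; adjust the path and the parsing to your inventory layout):

for h in $(grep -v '^\[\|^#\|^$' ~/aic/inventory/hosts | awk '{print $1}'); do
  echo -n "$h: "; ssh -q -o BatchMode=yes -o ConnectTimeout=5 $h hostname || echo "SSH FAILED"
done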

If the above steps don't work, CPVT or the DE should create an AOTS ticket and post it in the group chat
below:
qto://meeting/q_rooms_tg79241332875301882/EMOC+-+AIC+Tier+2+Team+Chat

Create an AOTS ticket specifying the IP and FQDN of the compute node(s) and the VM info, and skip deployment on those
compute node(s) (http://ushportal.it.att.com/index.cfm, put "AIC" in the search field, then select "AIC DATA
CENTER/NTC MOBILITY - REPORT A PROBLEM").

3. Disable Nagios Alerts:

On each nagios server:

Login using your UAM account and execute toor.

Perform the commands below:


/usr/local/bin/nagios-sched-downtime-zone.sh <zone> <days>

for example
/usr/local/bin/nagios-sched-downtime-zone.sh pdk5 0.5

The above example will put every node, and all service checks for each node, into scheduled downtime for
12 hours (0.5 days).

After completion of post checks, enable alarms.


For enabling alarming, the process is: "nagios-del-downtime-zone.sh [zone NAME]" or
"nagios-del-downtime-host.sh [host NAME]".
4. If the SolidFire (or any other storage backend) user/password is going to be rotated,
coordination between the Storage team and the DE must take place; otherwise cinder will
fail! Check with the Storage team, get the new user/password (or just a password if
the user stays the same), and make sure they switched the user/password on the storage
backend for the current zone/site. Also note that in the case of SolidFire, the
password for the 'sfopenstack' user should not be changed, because that would break other
sites that use the same user.
5. Remove all users authorized by "LDAP-LDAP-server" from the Foreman UI (relevant for
RCs less than RC25).
Use "admin/password" to login to the Foreman UI, NOT your att_uid! You can obtain the admin password
from /var/lib/puppet/foreman_cache_data/admin_password on the LCM nodes. If someone has already reset
the password, it will be different from what you find in the cache file, so you need to try
checking this file on every LCM node (see the sketch below). If it still does not work, run foreman-rake
permissions:reset on any LCM node. Use the freshly generated password to login to the Foreman UI as the admin user.
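A minimal sketch for checking the cached admin password on every LCM node from the Fuel VM (it assumes the forwarded SSH key from the UMASK/GSTools step and that column 5 of `fuel node` holds the node address, as in the other loops in this MOP):

ssh <fuelvm> -A
for i in $(sudo fuel node | awk -F"|" '/lcm/ {print $5}'); do
  echo -n "$i: "; ssh -q -o StrictHostKeyChecking=no $i 'sudo cat /var/lib/puppet/foreman_cache_data/admin_password'; echo
done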

5.1. Check whether "Automatically create accounts in Foreman" is enabled via API request.
export FOREMAN_USER=admin

export FOREMAN_PASSWORD=foreman_password

export FOREMAN_URL=https://foreman_url

curl -s -k -H "Content-Type:application/json" -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} \

${FOREMAN_URL}/api/v2/auth_source_ldaps | \

python -c 'import json,sys;print json.load(sys.stdin)["results"][0]["onthefly_register"]'

If the response is True , go to section 5.2, otherwise, execute the following commands to enable it:
FOREMAN_LDAP_ID=$(curl -s -k -X GET -H "Content-Type:application/json" \

-u ${FOREMAN_USER}:${FOREMAN_PASSWORD} ${FOREMAN_URL}/api/v2/auth_source_ldaps | \

python -c 'import json,sys;print json.load(sys.stdin)["results"][0]["id"]')

echo $FOREMAN_LDAP_ID # Make sure you have an integer ID

curl -s -k -X PUT -H "Content-Type:application/json" -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} \

-d '{"onthefly_register":true}' ${FOREMAN_URL}/api/v2/auth_source_ldaps/${FOREMAN_LDAP_ID} | \

python -m json.tool | grep onthefly_register

Check that onthefly_register is set to true


5.2. Export environment variables if you didn't do it in section 5.1
export FOREMAN_USER=admin

export FOREMAN_PASSWORD=foreman_password

export FOREMAN_URL=https://foreman_url

Save the code below as remove_users_foreman.py and execute the script python
remove_users_foreman.py to remove all users authorized by "LDAP-LDAP-server" in Foreman.
import os
import requests
import json
from requests.auth import HTTPBasicAuth

user = os.getenv('FOREMAN_USER')
password = os.environ.get('FOREMAN_PASSWORD')
url = os.environ.get('FOREMAN_URL')
URL = "{0}/api/v2/users/".format(url)

response = requests.get(URL, verify=False,
                        auth=HTTPBasicAuth(user, password)).content
for i in json.loads(response)['results']:
    if i['auth_source_name'] == 'LDAP-server':
        del_url = URL + str(i["id"])
        print "Removing user {0} {1} with ID {2}".format(
            i["firstname"], i["lastname"], i["id"])
        requests.delete(del_url, verify=False,
                        auth=HTTPBasicAuth(user, password))

5.3. Check in the Foreman UI that no more users authorized by LDAP exist (from now until the end
of the whole procedure below, log in to Foreman as admin ).

5.4. Add roles for the new admin tenant mechid to OpenStack to support LDAP

The steps need to be executed on the Fuel node, and the DE needs to source an rc file for executing OpenStack
commands. The DE can download their own openrc from Horizon, or it can be created by becoming
root, making a copy of the openrc_v2 file, chowning it to the DE user, editing it to substitute
their ATTUID for the mechid, and replacing the OS_PASSWORD line with:
# With Keystone you pass the keystone password.

echo "Please enter your OpenStack Password: "

read -sr OS_PASSWORD_INPUT

export OS_PASSWORD=$OS_PASSWORD_INPUT

Run:
grep "ldap.Identity" /etc/keystone/keystone.conf
Execute the following if ldap.Identity is present in above output:
source <your_openrc_file>

openstack role add --user <newAdminTenantMechid> --project admin admin

6. Check contrail vRouter connections.

6.1. Get the number of compute nodes in the env and remember it for further steps:
ssh <fuelvm>

compute_n=`fuel node| grep compute | wc -l` ; echo $compute_n

6.2. Perform Contrail pre checks (20.2 vRouter connection count checks) before proceeding
- https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-
195_Contrail_Control_Plane_Generic_checks.docx?
d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4
For every CCNT node, the number of connections must be equal to the output of 6.1 (see above). If not,
the number of vRouters is not equal to the number of computes. In that case you should stop the MOP
and go to item 6.4.

6.3. Check that every vRouter has at least two Established connections.

Get a list of contrail nodes:


ssh <fuelvm>

fuel node| grep contrail-control| awk -F"|" '/ready/ {print $5}'

Login to ccnt01 contrail-controller node and check:

Visual check:
ssh m96722@<CCNT01_node_IP>

export mesh_ip=`ip a | awk '/inet.*br-mesh/ {print $2}'| egrep -o "([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}"`; echo $mesh_ip

for each in $(python /var/tmp/ccnt_scripts/ist.py --host $mesh_ip ctr xmpp conn | awk '/att/ {print $2}');do python /var/tmp/ccnt_scripts/ist.py --host $each vr xmpp;done

If you see any issues of any type (e.g. Failed to reach destination , Error or any others), stop the MOP
and go to 6.4.

Successful output example:

Error output example:

If everything looks good, run the next command on the same CCNT node to make sure that there are at
least 2 Established connections and no Alerts:

Automatic check:
for each in $(python /var/tmp/ccnt_scripts/ist.py --host $mesh_ip ctr xmpp conn | awk '/att/ {print $2}');do echo -en "\nHost: $each ";python /var/tmp/ccnt_scripts/ist.py --host $each vr xmpp | grep Established |wc -l | while read l ;do echo -n $l; if [ $l -lt '2' ] ; then echo " ALERT!";fi;done;done; echo ''

Successful output example:

Error output example:

If you see any ALERT, run the visual-inspection command for the bad node(s) and check the issue. If you
confirm the issue, stop the MOP and go to 6.4.

6.4. Alert! If the above steps don't work, stop the MOP, create an AOTS ticket, and report the
issue(s). (See item 2 in the Preliminary section for details of the ticket creation.)

21. Implementation

NOTE: All terminal input/output must be logged during the change.

Step 21.1: Suspend Puppet Agent

Time Line: 10 minutes

Tenant Impact: No

Action: On the Fuel VM, as the fuel admin user, execute the following:


ssh <fuelvm>

Retrieve the Fuel env id and set it:


fuel env
export envid=<envid>

Upload the Fuel stop_puppet graph:


fuel2 graph upload --env $envid --type stop_puppet --file
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/stop_puppet.yaml

Execute the stop_puppet graph:


fuel2 graph execute --env $envid --type stop_puppet

Check that the graph execution is complete:


fuel2 task list

All tasks executed in the previous steps must be ready/skipped. If not,


check the specific task with the <task_id> from the above command's output:

fuel2 task history show <task_id> , then login to the node with the error task state and check the puppet log on
it: /var/log/puppet.log

Results/Descriptions: Check that the puppet agent is stopped on all nodes.

On the Fuel VM, as the fuel admin, execute the following to check the status of the puppet agents on all nodes:
for i in `sudo fuel nodes |grep ready | awk '{print $9}'`; do echo -n "$i: "; ssh -q -o StrictHostKeyChecking=no $i 'sudo service puppet status'; done


Change meta-data in OpsSimple, propagate to Fuel

This step gets the current metadata from OpsSimple into a yaml file, then changes the mechid-related items and
uploads the new yaml back to OpsSimple and Fuel.
$ ssh <opscvm>

$ /var/www/aic_opssimple/backend/manage.py export -f /home/m96722/rotatemechid/current.yaml <env_name> yaml

Here you can choose Option 1 for an automatic update of the OpsSimple yaml with new mechids, or Option 2 if
something goes wrong with Option 1.
Option 1: using the ./update_mechids.py script:
$ python3 /home/m96722/rotatemechid/update_mechids.py --mechid-file /home/m96722/rotatemechid/newmechids.yaml --site-file /home/m96722/rotatemechid/current.yaml > /home/m96722/rotatemechid/latest.yaml

Update the att_user_role_mapping item with the appropriate value(s) and check the format (the format is critical) in
the next file:
$ vi /home/m96722/rotatemechid/latest.yaml

Option 2: manually:
$ cp /home/m96722/rotatemechid/current.yaml /home/m96722/rotatemechid/latest.yaml

$ vi /home/m96722/rotatemechid/latest.yaml

* Replace `old mechids/passwords` with `new mechids/passwords`

* Update `att_user_role_mapping` with the appropriate value(s) and check the format (the format is **critical**)

* The first key section is the ldap section in the aic-fuel-plugin

* Non-openstack sections such as RO, LMA... must be updated according to the new MechIDs

* For a Solidfire credentials change, make the changes in the SOLIDFIRE section under the cinder block:

sf_san_login: <user>

sf_san_password: <password>

 The DE should get the changes reviewed by a peer and make sure that the new MechID(s) and
passwords passed the test (11.a.2 "Update automation meta-data" step) before
proceeding.

Here you can make a visual comparison between current.yaml and latest.yaml to make sure that only
MechIDs/Passwords have been changed:
ssh <opscvm>

$ cd /home/m96722/rotatemechid

$ diff current.yaml latest.yaml

This step gets OpsSimple to push the list of repos and new packages to Fuel.
ssh <opscvm>

cd /var/www/aic_opssimple/backend

./manage.py import /home/m96722/rotatemechid/latest.yaml yaml

./manage.py generate_scripts <env_name>


cd ~/aic/files/env_xxx/fuel-client

python ./script.py update_plugin

(Starting RC24) Rotate db_passwords for openstack services:


python ./script.py rotate_service_dbpassword

Example of a successful run:


2019-11-21 23:46:09,241 - INFO script.py:926 -- Changing db_password for nova

2019-11-21 23:46:09,242 - INFO script.py:929 -- gxrF9Qyvh7Qr1DIX1htXXXXX -> DDQQzxndkNFYWPInYCRXXXXX

old_password:gxrF9Qyvh7Qr1DIX1htXXXXX->new_password:DDQQzxndkNFYWPInYCRXXXXX

2019-11-21 23:46:09,242 - INFO script.py:926 -- Changing db_password for heat

2019-11-21 23:46:09,243 - INFO script.py:929 -- aB8NCEYhpcDbWRvSjc7XXXXX ->

q0pTKtZTamuGP8DPDweXXXXX

old_password:aB8NCEYhpcDbWRvSjc7XXXXX->new_password:q0pTKtZTamuGP8DPDXXXXX

Run Audit Fuel

From Fuel node run:


ssh <fuelvm>

cd /root/rotatemechid/after

fuel env

envid=<fuel env id>

fuel --env $envid settings --download

fuel --env $envid network --download

chmod 600 network_${envid}.yaml settings_${envid}.yaml

The DE must check that the only changes are mechid and passwords. The DE MUST get the diff
reviewed by a peer. To compare line by line:
cd /root/rotatemechid/after

diff -r network_<envid>.yaml ../before/network_<envid>.yaml

diff -r settings_<envid>.yaml ../before/settings_<envid>.yaml

To compare structure difference:


python /tmp/diff_struct.py network_<envid>.yaml ../before/network_<envid>.yaml

python /tmp/diff_struct.py settings_<envid>.yaml ../before/settings_<envid>.yaml

If something is wrong, the DE must go back, adjust the latest.yaml OpsSimple file, and rerun the
procedure. ANY MISTAKE HERE WILL RESULT IN TENANT IMPACT DURING PUPPET AGENT
RESTART, so the DE needs to take their time and double check.

Propagate meta-data changes through the system.

Propagate the meta-data change through the LCP, and switch the repositories and the code environment on
every node.

 During the check of the LCM infrastructure, we ensured that the latest Puppet code
has been downloaded.
 Fuel manifests switch the repositories on the OpenStack nodes.
 Fuel updates hiera and configdb afterwards.

fuel env

envid=<your env ID>

fuel2 graph upload --env $envid --type update_config_db --file

/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/update_config_db.yaml

fuel2 graph execute --env $envid --type update_config_db


# Remember the id of the last executed task

$ fuel2 task list

# All tasks should be ready

$ fuel2 task history show <task_id>

At this point the Puppet Agents operations have been disabled, but no changes have been applied
to OpenStack yet.

Apply the mechid change through LCM


Tenant Impact: YES

Up to this point there should not have been any tenant impact, since the Puppet agents have not run
yet. The DE must be aware that the OpenStack changes will be applied by the Puppet agents when they
are re-enabled.

Replace MechId for nova-compute service to avoid further vfd restart

Note: this step is mandatory only for versions older than 3.0.3 RC14, otherwise it can be skipped.

 Upload and execute rotate_mechid_compute graph: rotate_mechid_compute.yaml.


Open the file in the browser and copy-paste into the /root/mechidrotate directory.

ssh <fuelvm>

fuel env

envid=<fuel env id>

cd /root/mechidrotate/

Copy-paste the rotate_mechid_compute.yaml file from the location listed above


fuel2 graph upload --env $envid --type rotate_mechid --file rotate_mechid_compute.yaml

nodes_ids=`fuel node | grep -i compute | awk '{print $1}' | tr "\n" " "`

fuel2 graph execute --env $envid --type rotate_mechid --node $nodes_ids

fuel2 graph delete --env $envid --type rotate_mechid


Install start_puppet_now graph

 Upload start_puppet_now graph: start_puppet_now.yaml. Open the file in the browser


and copy-paste into the /root/mechidrotate directory.

ssh <fuelvm>

fuel env

envid=<fuel env id>

cd /root/mechidrotate/

Copy-paste the start_puppet_now.yaml file from the location listed above


sudo fuel2 graph upload --env $envid --type start_puppet_now --file /root/mechidrotate/start_puppet_now.yaml

Apply changes on primary aic-identity node

Execute Puppet on the primary aic-identity node and wait for the report in Foreman. Please use the
following approach for all steps from 9 to 14. This will create the roles in keystone if necessary.
# Pick primary identity node using the procedure below then run the start_puppet_now graph

$ sudo fuel env

$ node_id=$(for i in `sudo fuel node | awk -F"|" ' /identity/ {print $5}'`;do p=$(ssh -q -o StrictHostKeyChecking=no $i 'sudo hiera roles') ; echo -n $p" - " ;sudo fuel node | grep $i|awk -F"|" ' {print $1}';done | grep primary | rev|cut -d'-' -f1|rev)

$ echo $node_id

$ envid=$(sudo fuel env | awk '/operational/ {print $1}')

$ echo $envid

$ sudo fuel2 graph execute --env $envid --type start_puppet_now --node $node_id

wait for reports in Foreman.


Apply changes on other aic-identity nodes

Execute Puppet on second and third aic-identity nodes and wait for reports in Foreman.
$ node_ids=`fuel node | grep -i identity | awk '{print $1}' | tr "\n" " "`

$ fuel2 graph execute --env $envid --type start_puppet_now --node $node_ids

wait for reports in Foreman.

Apply changes on aic-controller nodes

Execute Puppet on all aic-controller nodes and wait for reports in Foreman.
$ node_ids=`fuel node | grep -i aic-controller | awk '{print $1}' | tr "\n" " "`

$ fuel2 graph execute --env $envid --type start_puppet_now --node $node_ids

wait for reports in Foreman.

Apply changes on Contrail controller nodes

Timeline: takes 30-45 minutes.

This step needs to be done manually, one node at a time, followed by service checks, before
moving on to the next Contrail node (02, and so on).
Be sure that the uptimes of the Contrail processes are in sync before proceeding with CCNT02 (see the
uptime-check sketch below). In case the services were not started at the same time, the DE needs to raise an AOTS ticket to T2.
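A minimal sketch for eyeballing Contrail process uptimes on a CCNT node (generic ps usage; the exact process names vary by release, and contrail-status is only used if it is present on the node):

ssh m96722@<CCNT_node_IP>
ps -eo etime,cmd | grep -i '[c]ontrail'   # elapsed time since each contrail process started
sudo contrail-status                      # if available, shows per-service state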

Create AOTS ticket specifying IP, FQDN of Contrail server(s) and output of Contrail service status
(http://ushportal.it.att.com/index.cfm, put "AIC" in the search field, then select "AIC DATA
CENTER/NTC MOBILITY - REPORT A PROBLEM")

Perform Contrail checks (section 20) before you proceed


- https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-
195_Contrail_Control_Plane_Generic_checks.docx?
d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4
Execute Puppet on the contrail-control nodes one by one, checking the reports in Foreman before each
next node.
$ fuel node | grep -i contrail-control

Start with CCNT01 node!

From fuel node run:


$ fuel2 graph execute --env $envid --type start_puppet_now --node <node_ccnt1>

wait for a report in Foreman


Verification: Usually it takes 5-7 minutes to align with the other services, but sometimes it might take up to 25 minutes
for the vRouters to re-establish a connection. Perform Contrail checks (section 21 - step1 thru step7) before proceeding -

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-

195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4

If everything is successful above, switch to the ccnt02 node and run:

From fuel node run:


$ fuel2 graph execute --env $envid --type start_puppet_now --node <node_ccnt2>

wait for a report in Foreman

Verification: Usually it takes 5-7 minutes to align with the other services, but sometimes it might take up to 25 minutes
for the vRouters to re-establish a connection. Perform Contrail checks (section 21 - step1 thru step7) before proceeding -

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-

195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4

If everything is successful above, switch to the ccnt03 node and run:

From fuel node run:


$ fuel2 graph execute --env $envid --type start_puppet_now --node <node_ccnt3>

wait for a report in Foreman.

Verification: Perform Contrail checks (section 21 - step1 thru step8) before proceeding -

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-

195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4

Apply changes on core LCP and swift nodes

Execute Puppet on the rest of LCP nodes.

Loop through LCP roles (DO NOT include 'aic-compute' role in the list):
$ nodes_ids=''

$ roleslist=$(fuel node | tail -n+3| awk -F"|" '{print $7}' |sort|uniq|egrep -v "^[[:space:]]*$|compute|identity|aic-controller|contrail-control" | while read i; do echo -n $i\| ;done); role=${roleslist%?}

$ nodes_ids=`fuel node | egrep -i "$role" | awk '{print $1}' | tr "\n" " "`

$ fuel2 graph execute --env $envid --type start_puppet_now --node $nodes_ids

wait for reports in Foreman.


Apply changes on compute nodes

Execute Puppet on all compute nodes and wait for reports in Foreman.

Due to performance implications it is not possible to run Puppet on all computes right away. Keeping
that in mind, it is better to run Puppet either rack-by-rack or in batches of 10, 20, or 30 (depending on
the number of CPUs on the LCM nodes). The vital part of running Puppet is checking the Puppet reports in
Foreman, restarting Puppet on a node in case of failure, and decreasing the batch size to reduce the
number of failed nodes.

Loop through the racks within the aggregation zone:

Note: if the procedure below does not work due to site naming conventions (i.e. no "rXXc" in the name),
then obtain all of the computes for an availability zone and execute the start_puppet_now graph in
batches (see the batching sketch after the rack loop below). Then do the same for the other availability zone(s).
$ rack_id='r10c'

$ node_ids=`fuel node | grep -i compute| grep -i $rack_id | awk '{print $1}' | tr "\n" " "`

$ fuel2 graph execute --env $envid --type start_puppet_now --node $node_ids

wait for reports in Foreman.
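If rack-based selection is not possible, a minimal batching sketch (the batch size of 20 is only an example; wait for the Foreman reports of one batch to turn green before launching the next):

$ all_computes=$(fuel node | grep -i compute | awk '{print $1}')
$ echo $all_computes | tr ' ' '\n' | xargs -n 20 | while read batch; do
    echo "Batch: $batch"
    fuel2 graph execute --env $envid --type start_puppet_now --node $batch
    read -p "Press Enter when the Foreman reports for this batch are green..." </dev/tty
  done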

Update mechIDs/passwords in Glance Database

Here you need two pairs of credentials: <old_glance_mechid>,
<old_glance_password> and <new_glance_mechid>, <new_glance_password> . Normally, in the case of a single
MechId, these are just the old MechId/password being rotated and the new MechId/password.
ssh <fuelvm>

fuel node | grep dbng

#Logout of root, and:

ssh <any dbng node>

export OLD_MECHID_USER=<old_glance_mechid>

export OLD_MECHID_PASSWORD=<old_glance_password>

export NEW_MECHID_USER=<new_glance_mechid>

export NEW_MECHID_PASSWORD=<new_glance_password>

mysql glance -e "update image_locations set value=replace(value,'${OLD_MECHID_USER}:$

{OLD_MECHID_PASSWORD}', \

'${NEW_MECHID_USER}:${NEW_MECHID_PASSWORD}') where status = 'active' and value like '%$

{OLD_MECHID_USER}:${OLD_MECHID_PASSWORD}%';"
Check that there no more old MechId entries left in Glance DB:
mysql glance -e "select value from image_locations where status = 'active' and value like '%${OLD_MECHID_USER}%';"

Output should be empty

Make sure that there are no very old, forsaken MechId entries left in the Glance DB:
mysql glance -e "select value from image_locations where status = 'active' and value NOT like '%${NEW_MECHID_USER}%';"

NOTE: If any older mechids (older than the previous mechid itself) are found that are associated with
an image, then open an AOTS ticket for Operations to correct the issues.

Expire outdated mechids in mysqlDB

Login to the OpsSimple node and run the playbook with the specified tags to see the mechIDs in mysqlDB to be
expired, then expire them and show the expired mechids.

As a result you get a list of expired user(s). Save the list in order to remove the users from mysqlDB
completely later (see the steps below).
$ ssh <opscvm>

$ cd aic/

$ ansible-playbook /home/m96722/aic/playbooks/expire_old_users.yml -i inventory/ -e 'opssimple_file=/home/m96722/rotatemechid/latest.yaml' --tags display

$ ansible-playbook /home/m96722/aic/playbooks/expire_old_users.yml -i inventory/ -e 'opssimple_file=/home/m96722/rotatemechid/latest.yaml' --tags expire

$ ansible-playbook /home/m96722/aic/playbooks/expire_old_users.yml -i inventory/ -e 'opssimple_file=/home/m96722/rotatemechid/latest.yaml' --tags show_expired

Example of show_expired tag output:


...

TASK: [debug msg="User(s) {{ mysql_expired.stdout_lines }} are expired. You can delete them further"] ***

ok: [zmtn11dbng01.mtn11.cci.att.com] => {

"msg": "User(s) ['m11111@localhost', 'm11111@localhost1', 'm11111@localhost2', 'm11112@localhost',

'm11113@localhost', 'm11114@localhost', 'm11114@localhost1', 'm11114@localhost2', 'm11114@localhost3',

'm11114@localhost4'] are expired. You can delete them further"

...

Expire the password for previous MechID

The previous step should have expired all outdated mechIDs, including the previous
mechID. In case the previous step ran unsuccessfully, and also to make sure that the previous mechID is
expired, please execute the following:
$ ssh <fuelvm>

$ sudo fuel node | grep dbng

$ ssh <any dbng node>


$ execute toor

$ mysql

> select user,host from mysql.user where user = "<old_mechid>";

> alter user "<old_mechid>"@"<host>" password expire;  # <-- Run this command for each pair (old_mechid, host) from the output of the previous command

Delete expired users from mysqlDB

Please use AIC-MOP-197 Delete absolute DB users and Drop Databases to delete expired users that
you got in Expire outdated mechids in mysqlDB step above.
Restart cmha services

Login to Fuel using your attuid, with SSH agent forwarding:


$ ssh <fuelvm> -A
$ for i in `sudo fuel node | awk -F\| '/aic-controller/ {print $5}'`; do ssh $i "sudo crm resource restart cmha ; sudo service cmha_restapi restart" ;done

CMC node update

This step can be processed in parallel with the Test plan, because it is executed on the DCP's CMC server.

AIC OpsSimple 3.x CD & LCM: https://wiki.web.att.com/pages/viewpage.action?pageId=517384070

AIC DCP/LCP Site Matrix: https://wiki.web.att.com/pages/viewpage.action?pageId=493569807

If the DEs are not aware of the CMC update, then they should check with peers who usually
perform this activity.

You need to change credentials in CMC node (DCP) for site inventory (as in example below):
1. orm.<SITE_NAME>.<LCP_TYPE>

example: /home/m96722/orm/inventory/host_vars/orm.dpa2b.large

2. opsc.<SITE_NAME>.<LCP_TYPE>

example: /home/m96722/orm/inventory/host_vars/opsc.dpa2b.large

File Name: orm.site_type.environment

Example: orm.mck1b.medium
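A minimal sketch of updating those inventory files on the CMC node (<cmc_node> is a placeholder; the exact variable names holding the MechID/password depend on the inventory, so review the files before editing):

ssh <cmc_node>
sudo -iu m96722
vi /home/m96722/orm/inventory/host_vars/orm.<SITE_NAME>.<LCP_TYPE>    # update the MechID/password entries
vi /home/m96722/orm/inventory/host_vars/opsc.<SITE_NAME>.<LCP_TYPE>   # update the MechID/password entries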

Install aiclcm package (Version : aic-opssimple-plugins-aiclcm.39142.deb)

Execute MOP-355 Update aiclcm packages https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP


%20Document%20Library/AIC-MOP-355-Update_aic_lcm_packages.docx?
d=w787973068da2491288f93d6b4a0ea17e&csf=1&web=1&e=4Z0SyR

Update new mechid settings for Fuel if LDAP enabled in keystone.conf on Fuel node

Check if LDAP is enabled in keystone.conf on the Fuel node. Login to the Fuel VM:


ssh <attuid>@<fuelvm>

Execute:
grep "ldap.Identity" /etc/keystone/keystone.conf

Example of output: It means LDAP is enabled


# grep "ldap.Identity" /etc/keystone/keystone.conf

driver= keystone.identity.backends.ldap.Identity
If LDAP is enabled, proceed and execute Steps 20, 21.b, and 22 of AIC-MOP-379 (MOP to
enable LDAP as a backend for keystone on the Fuel node).
(https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-379%20-
%20MOP%20to%20enable%20LDAP%20as%20a%20backend%20in%20keystone%20on%20Fuel
%20node.docx?d=w6c76d99a236343e78299a39c9d8aceb8&csf=1&web=1&e=v0a1zd)

Make sure that "aic-opsfix-cmc-0175" has verson:86666 or above is installed during Step 20 of AIC-
MOP-379

If a current site's RC > RC22.xx - during execution of MOP-379, skip opsfix-0175 execution

Non-Openstack components update

On some sites Trove has been removed. If you get an error from the following ansible-playbook execution,
please delete lines 132-160 from
/home/m96722/aic/playbooks/aic_non_openstack_mechid_update.yml on the OpsSimple node and run it
again (fixed in 3.0.3 RC15); see the sketch below for removing those lines.
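A minimal sketch for removing those lines (back up the playbook first; the line numbers are the ones given in the note above and may shift between releases, so verify them in an editor before deleting):

$ cp /home/m96722/aic/playbooks/aic_non_openstack_mechid_update.yml /home/m96722/aic/playbooks/aic_non_openstack_mechid_update.yml.bak
$ sed -i '132,160d' /home/m96722/aic/playbooks/aic_non_openstack_mechid_update.yml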
$ ssh <opscvm>

$ cd aic/

$ ansible-playbook -i inventory -e 'mechid_file=/home/m96722/rotatemechid/newmechids.yaml' playbooks/aic_non_openstack_mechid_update.yml

$ ansible-playbook -i inventory -e 'mechid_file=/home/m96722/rotatemechid/newmechids.yaml' playbooks/aic_nagios_mechid_update.yml
Proceed with approved AIC-MOP-245 to swap MechID for RO related components (Implementation
takes about 15 minutes).

Nagios credentials update

On OpsSimple node execute below steps:


Step 1: Take backup of api_monitoring.conf
sudo cp -p /etc/nagios/nrpe.d/api_monitoring.conf /etc/nagios/nrpe.d/api_monitoring.conf_backup

Step 2: Please follow approved AIC-MOP-215 to get Nagios credentials updated

Step 3: verify nagios MechId updated


grep username /etc/nagios/nrpe.d/api_monitoring.conf

Note: if the MechId was not updated, skip steps 4-7; otherwise, continue with steps 4-7 below.

Step 4: Add the user in keystone with admin role and ccp-monitoring tenant if not already added

From opssimple node:


cd /home/m96722/aic
ansible-playbook -i inventory/ playbooks/openrc_automation/keystone_user_commands.yml --tag user_role --extra-vars "mechid=<mechid> tenant=ccp-monitoring"

ansible-playbook -i inventory/ playbooks/openrc_automation/keystone_user_commands.yml --tag user_add --extra-vars "mechid=<mechid> role=admin tenant=ccp-monitoring"

Step 5: Remove any occurrences of api_monitoring.conf except on the OpsSimple node


ansible all -i inventory/ -m shell -sa "ls -l /etc/nagios/nrpe.d/api_monitoring.conf"

Take note of the successful runs of the above command, except opssimple.

Execute the below ansible command to remove the file api_monitoring.conf


ansible all:\!jump_host -i inventory/ -m shell -sa "rm -f /etc/nagios/nrpe.d/api_monitoring.conf"

Step 6: Edit api_monitoring.conf with mechid and password


sudo vim /etc/nagios/nrpe.d/api_monitoring.conf

Step 7: Execute next playbook on OpsSimple node to create MechId user in DB for Nagios:
ansible-playbook -i inventory/ playbooks/deploy_nagios_agent.yml --limit dbng_host
Step 8: Restart nagios-nrpe-server service on OpsSimple
sudo service nagios-nrpe-server restart
Step 9: Verify nagios-nrpe-server service on OpsSimple
sudo service nagios-nrpe-server status

Service should be in started status

22. Test Plan

NOTE: All terminal input/output must be logged during the change.

Wait until puppet finishes its work and check for new Foreman reports (green status). About 30 mins.

Foreman reports from vLCP nodes should have entries with username and password changes and
some services should be restarted. Please login to Foreman UI and check reports for all nodes.

Login to the OpsSimple VM as your att uid (UAM). Refer to the jump host for sites as per the wiki links below:
https://wiki.web.att.com/display/CCPdev/AIC+OpsSimple+3.x+VMs and
https://wiki.web.att.com/display/CCPdev/Environments

sudo su - m96722 or sudo -iu m96722


Run the next playbook to find expired mechids in conf files (see the references
in /home/m96722/aic/roles/aic_opsfix_0186/files/conf_list.yaml ). Use latest.yaml with the new mechids:
cd /home/m96722/aic

ls /home/m96722/rotatemechid/latest.yaml

ansible-playbook -i inventory/ -e 'mechid_templ=/home/m96722/rotatemechid/latest.yaml' playbooks/validate_mechids.yml

Please check that there are no ansible errors related to access thrown during execution. It is
supposed to be fixed in 20. Preliminary Implementation, item 2.

Example of playbook output (you can find warnings on suspicious mechids):


...

TASK: [debug ] ****************************************************************

ok: [zmtn12mosc01.mtn12.cci.att.com] => {

"msg": "Suspicious mechids: m19437 m19438\n !!! Please check verbose /home/m96722/output for details."

...

For detailed playbook output see: cat /home/m96722/output . You may want to look for suspicious mechids,
conf files and nodes in the output file. If there are any, please check the Foreman reports for those affected nodes
and ensure that the puppet agent is running. If the issue is still in place even after a successful report, check
latest.yaml and ensure that OpsSimple pushed all the changes to Fuel.

Please run the next playbook to find services that were not restarted after their conf files were
changed:
ansible-playbook -i inventory/ playbooks/validate_services.yml

Please check that there are no ansible errors thrown during execution
For results see: cat /home/m96722/output_service

Example of playbook output(you can find OK and FAILED examples in the output_service file):
m96722@zmtn12opsc01:~/aic$ cat /home/m96722/output_service

...

zmtn12fuel01.mtn12.cci.att.com /etc/keystone/keystone.conf
zmtn12rosv02.mtn12.cci.att.com /opt/installer/ro.conf

zmtn12mosc02.mtn12.cci.att.com /etc/neutron/neutron.conf

neutron-server ok

mtn12r01c015.mtn12.cci.att.com /etc/neutron/neutron.conf

mtn12r01c009.mtn12.cci.att.com /etc/ceilometer/ceilometer.conf

ceilometer-polling ok

zmtn12mosc01.mtn12.cci.att.com /etc/keystone/keystone.conf

mtn12r01c015.mtn12.cci.att.com /etc/ceilometer/ceilometer.conf

ceilometer-polling ok

mtn12r01c009.mtn12.cci.att.com /etc/nova/nova.conf

nova-compute ok

zmtn12mosc01.mtn12.cci.att.com /etc/ceilometer/ceilometer.conf

ceilometer-polling ok

ceilometer-api ok

ceilometer-alarm-evaluator ok

ceilometer-alarm-notifier ok

ceilometer-agent-notification ok

ceilometer-collector ok

ceilometer-agent-notification ok

mtn12r03s002.mtn12.cci.att.com /etc/swift/account-server.conf

swift-account-server ok

swift-account-auditor ok

swift-account-replicator ok

swift-account-reaper ok
mtn12r01c015.mtn12.cci.att.com /etc/nova/nova.conf

nova-compute ok

zmtn12mosc01.mtn12.cci.att.com /etc/glance/glance-registry.conf

glance-registry ok

zmtn12mosc01.mtn12.cci.att.com /etc/heat/heat.conf

heat-engine ok

heat-api-cfn FAILED!!! Restart is needed

heat-api-cloudwatch FAILED!!! Restart is needed

zmtn12mosc02.mtn12.cci.att.com /etc/heat/heat.conf

heat-api FAILED!!! Restart is needed

heat-engine ok

heat-api-cfn FAILED!!! Restart is needed

heat-api-cloudwatch FAILED!!! Restart is needed

...

If there are services with the FAILED!!! Restart is needed message, please go to the node and restart them
manually (use sudo service <service_name> restart or sudo crm resource restart <crm_resource_name> or sudo
service apache restart if the service runs under apache). The only exception is the heat-
<services> (see: https://jira.web.labs.att.com/browse/DEFECT-6362 ); they are supposed to be restarted at
the end of the playbook. Then run the playbook again.

22.1 Verify that the sORM inventory credentials on the DCP OpsSimple CMC node
corresponding to the LCP have been updated

AIC OpsSimple 3.x CD & LCM: https://wiki.web.att.com/pages/viewpage.action?pageId=517384070

AIC DCP/LCP Site Matrix: https://wiki.web.att.com/pages/viewpage.action?pageId=493569807

To verify the credentials in the CMC node (DCP) for the site inventory (as in the example below), check the
date of the file; if it is older than the day of the change, the verification fails.
1. orm.<SITE_NAME>.<LCP_TYPE>

example: /home/m96722/orm/inventory/host_vars/orm.dpa2b.large
2. opsc.<SITE_NAME>.<LCP_TYPE>

example: /home/m96722/orm/inventory/host_vars/opsc.dpa2b.large

File Name: orm.site_type.environment

Example: orm.mck1b.medium

Login to fuel host and verify if the env is operational.

ssh <fuel_vm>
fuel env

Check if the above step shows the environment as operational

Request CPVT to perform complete regression and clean up artifacts

23. Backout Procedure

NOTE: All terminal input/output must be logged during the change.

The following sections detail the rollback/back-out procedures that can be used during MechID rotation. These
procedures will eventually be incorporated into a separate backup and restore guide that can be
exercised on AIC sites as part of standard operational activity. They cover restoring:
1. Specific components in the vLCP, independent of the remaining nodes in the vLCP. These need
to be exercised only if the component has to be restored to the last known state / the
state at which it was backed up.
2. The entire LCP to a pre-upgrade state.

Compute nodes backup & restore procedure shall be included separately in operations backup and
restore guide.

Revert site to old MechID(s)

The old configuration is still in /home/m96722/rotatemechid/current.yaml . You need to copy it
to /home/m96722/rotatemechid/latest.yaml :
$ ssh <opscvm>

$ cp /home/m96722/rotatemechid/current.yaml /home/m96722/rotatemechid/latest.yaml

and repeat the steps above starting from "Change meta-data in OpsSimple, propagate to Fuel" (Option 1
or Option 2).

Restore Fuel State from backup

 To restore just a nailgun database on Fuel node:


cd /var/tmp/fuelbackuppre
gunzip nailgun.dump.gz                 # the backup step compressed the dump with gzip
sudo -u postgres dropdb nailgun
sudo -u postgres createdb nailgun      # recreate the database; the plain pg_dump output does not include CREATE DATABASE
sudo -u postgres psql nailgun < nailgun.dump

Restore all vLCP nodes for LUN backup

If the previous actions did not restore the system, please request the storage team to restore the vLCP LUN backups
to revert to the previous version of the components in the vLCP.

Revert Opsfix 0136

To revert, run:
ansible-playbook -i inventory/ playbooks/aic_opsfix_revert_0136.yml

24. POST Activities (during maintenance window)

NOTE: All terminal input/output must be logged during the change.

To comply with RIM and audit findings, every MOP must include steps to remove backups and
artifacts created during the deployment of that MOP.

The removal process may be at a later date to allow for potential back outs. Any artifacts for this
must be listed in Post Change activities section and include instructions on scheduling the future
removal.

Any backups and artifacts created during the MOP execution which will not be needed for backout,
should be removed in the POST implementation activities section executed before the end of the
change window.

Permissions for Openstack configuration file backups should be set to 640, and other files 640 or
600. The rule of least privilege should apply, but we need to make sure the DE has access to the files
for restore / backout purposes. This should not be a group that non-SA and non-DE users would be
in.

Restore OpsSimple repo Source

 To restore repo sources on OpsSimple node:

sudo cp -p /var/tmp/sources.list /etc/apt/sources.list


sudo chmod 644 /etc/apt/sources.list
sudo apt-get update

Cleanup

NOTE: All terminal input/output must be logged during the change.

Since newmechids.yaml, beforemechidrotate.yaml, and latest.yaml contain sensitive data, it is
important to delete them when the procedure is over.
$ ssh <opscvm>

$ rm -f /home/m96722/rotatemechid/beforemechidrotate.yaml

$ rm -f /home/m96722/rotatemechid/latest.yaml

$ rm -f /home/m96722/rotatemechid/newmechids.yaml

$ rm -f /home/m96722/aic-opsfix-cmc-0064.*.deb

$ rm -f /var/tmp/sources.list

$ rm -f /home/m96722/output

$ rm -f /home/m96722/output_service

$ ssh <fuelvm>

$ cd /root/rotatemechid/

Clean up all the artifacts in the ./before and ./after folders.

$ cd /var/tmp/fuelbackuppre

Clean up all the artifacts here in ./

$ cd /tmp/fuelbackuppre

Clean up all the artifacts here in ./

Clean up the place where you copied /var/tmp/fuelbackuppre in Step 17.b.2.

--
CPVT to perform Sanity Checks

CPVT to cleanup artifacts

25. Post Maintenance Work

NOTE: All terminal input/output must be logged during the change.

NA

26. Appendix and Tables (IF REQUIRED)


26.1. Troubleshooting
Test of new mechids fails: TLS certificates not installed.
If you get the error ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1) during the new mechid test, you need
to check that the CA certificates are installed on the OpsSimple node.

Example:
Testing <service>

Checking ***ATTTest LDAP***: its-ad-ldap.atttest.com

ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1)

Check creds and parameters...

Status: Fail

There are two different sets of certs, one for each ldap domain: ITSERVICES and TESTITSERVICES.

To install them:
 Download zip https://workspace.web.att.com/sites/WDS/Lists/IP%20Addresses
%20for%20DNS%20%20WINS%20%20LDAP%20%20AD%20Time%20Sources/
Attachments/6/TrustedRootCerts.zip , unzip, scp to OpsSimple node and place
all *.cer files from ITSERVICES or TESTITSERVICES folders to /usr/local/share/ca-
certificates/
 Note: if you are granted enough permissions, try directly: `sudo scp

<your_attuid>@199.37.162.36:/staging/ldap_cacerts/<domain>/* /usr/local/share/ca-certificates/`

 Use `ITSERVICES` or `TESTITSERVICES` instead of `<domain>`.

 Rename *.cer to *.crt:


for i in `sudo ls /usr/local/share/ca-certificates/*.cer` ; do sudo mv $i ${i%.*}.crt ;done

 Run: sudo /usr/sbin/update-ca-certificates

Manually changed parameters in config files


Only works for PROD sites on RC22 and above

If you need to change parameters in config files, you need to find them in the Fuel UI -> Fuel plugin ->
additional_config and change the values there BEFORE executing the custom graph update_config_db .

Example:
Notice: /Stage[main]/Attcinder::Controller::Aic_cinder_volume/Attcinder::Controller::Solidfire[SOLIDFIRE]/

Cinder_config[SOLIDFIRE/sf_svip]/value: current_value 32.50.209.243:3260, should be 32.50.209.244:3260 (noop)

After update_config_db is executed, hiera will get updated with new mechIDs and all changes made
in additional_config .
Deployment fails with "Could not evaluate: LDAP source LDAP-server delete error:
(422 Unprocessable Entity)" error
If the deployment fails with the HTTP 422 error:
/Stage[main]/Plugin_lcm::Tasks::Foreman/Foreman_ldap_auth[foreman_ldap_auth_source] (err): Could not evaluate: LDAP

source LDAP-server delete error: (422 Unprocessable Entity):

in /var/log/puppet/puppet.log on LCM nodes, please do the following:


1. Login to Foreman UI as admin user and CHECK the box "Automatically create

accounts in Foreman" if unchecked.


2. In Foreman UI switch to "Users" and delete all accounts that have "LDAP-LDAP-
server" in the column "Authorized by". Do not touch those that have "INTERNAL" in

that column.
3. Wait until Puppet applies the Day2 catalog once again.

Immutable config files


Sometimes we notice that config files are set to immutable and Puppet is stopped on a few nodes or
shows errors:
 check file attributes: lsattr <file>
 unset immutable: chattr -i <file>
 wait until puppet applies catalog

OpsSimple node shows error (APIError)


 Check the client configuration: /home/m96722/.config/aiccliopssimple/OpsSimple.yaml
 For some versions the pykwalify module is not installed but still used; comment out
lines 8 and 9 in /var/www/aic_opssimple/backend/api/base/uploaders.py , then restart the
opssimple backend: /home/m96722/install/restartserver.sh
 If APIError: "Error Authenticating. Please use a UAM ATT or Mech ID to login." appears, that
probably means the /etc/shadow file is inaccessible to some authorization tools. You
may set read permission on this file for the group ( sudo chmod g+r /etc/shadow ) or
set enable_pam: False in ~/.config/aiccliopssimple/OpsSimple.yaml , then restart the opssimple
backend: /home/m96722/install/restartserver.sh

Stuck Fuel graph in pending state


Sometimes when we launch fuel graph execute, it ends up in a pending state.

Symptoms:

 fuel graph execute doesn't return the console for 10-15 minutes
 fuel2 task list shows last task in pending state

Solution:
1) task_id=<task_id of pending task>

2) $ fuel task --delete --force --task-id $task_id

3) $ service postgresql restart

4) $ service nailgun restart

Login on nodes to troubleshoot


 In case you need to login to any OpenStack node to troubleshoot, use the method below
to login to the nodes.
 To login to the fuel node:

 Login with your att uid (UAM)

 sudo fuel node

 ssh <attuid>@<node ip>


 In case you need to login to any non-OpenStack node to troubleshoot, use the method below
to login to the nodes.
 To login to the opssimple node:

 Login with your att uid (UAM)

ssh <node ip>
