IPL1A RC35 Upgrade Artifacts


17.b. Pre-Maintenance Check Manual (Non-Automated Requirements)

Step 17.b.1: Check Nodes Status

Time Line: 5 minutes.

Action: Execute the commands below on the Fuel node


Log in to the Fuel node with ssh {fuel IP} using your ATT UAM user and become root by typing toor.
Run fuel node and confirm every node shows "ready", no nodes are in "discover" or "error", and the Online field is 1.
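A quick way to spot problem nodes (a minimal sketch; adjust the grep patterns if your Fuel CLI output format differs):

fuel node | grep -Ei 'discover|error'   # should print nothing
fuel node | grep -c ready               # should match the total node count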

Step 17.b.2: Check Fuel Environment

Time Line: 5 minutes.

Action: Execute the commands below on the Fuel node


Log in to the Fuel node with ssh {fuel IP} using your ATT UAM user and become root by typing toor.
Run fuel env and confirm the environment shows "operational" (dummy_env can be ignored).
Step 17.b.3: Request storage operations to back up all vLCP LUNs (Large Sites Only)

Time Line: 72 hours prior to CW.

Action: Submit a request to back up all vLCP LUNs in the site.

An SRTS ticket needs to be created 72 hours ahead of the CW with AIC STORAGE OPERATIONS SUPPORT: http://ushportal.it.att.com/step2.cfm?app=3962&home=ush

Select AIC SNAPSHOT BACKUP & RESTORE and then fill out the information requested by the form.

Step 17.b.4: Backup contrail controller and config database.

Time Line: 20 minutes.

Action: On the Contrail controller VM, as fuel admin execute -


ssh <contrail-controller-vm>

create backup folder -


mkdir -p /var/tmp/contrailbackuppre/$(date +"%Y%m%d%H%M")
cd /var/tmp/contrailbackuppre/<foldercreatedabove>

check if there is sufficient space under /var/tmp for backup


df -h

Note -

For Large sites - make sure there is 60 GB of free space on the /var partition.

For Medium sites - make sure there is 25 GB of free space on the /var partition.

initiate backup -
sudo python /usr/lib/python2.7/dist-packages/cfgm_common/db_json_exim.py --export-to backup.json
chmod -Rf 640 /var/tmp/contrailbackuppre/
Results/Descriptions:

Backup tar files per contrail node are created in the backup folder.

Step 17.b.5: Backup xml of vms on the kvms

Time Line: 30 minutes.

Action: On OpsSimpleVM execute -

ssh <attuid>@<opscvm>
sudo su - m96722 OR sudo -iu m96722
cd /home/m96722/aic/

Check if there is sufficient disk space in /var/tmp for backup

ansible kvm_host -i inventory -m shell -sa "df -h"

Take the dump of the xml files of all the VMs and copy it to the jump_host

ansible-playbook -i inventory/ playbooks/infraxml_dump.yaml

Step 17.b.6 Check for Contrail Zookeeper Cassandra Discrepancies

From the Fuel VM, log in to the Contrail controller node:


ssh <contrail-controller-node-IP>
sudo /usr/localcw/bin/eksh
python /usr/lib/python2.7/dist-packages/vnc_cfg_api_server/db_manage.py --api-conf /etc/contrail/contrail-api.conf --verbose check

If any of the checkers report failures, escalate to Contrail Ops. Do NOT proceed with MOP execution until data inconsistencies are remediated.
Step 17.b.7 Pre-upgrade checks

https://codecloud.web.att.com/projects/ST_CCP/repos/aic-docs/browse/docs/dg/preupgrade.md

Step 17.b.8 Contrail checks

Contrail Checks - https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/


AIC-MOP-195_Contrail_Control_Plane_Generic_checks.docx?
d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=rhhr6W

Step 17.b.9: Backup Fuel Nailgun database.

Time Line: 5 minutes.

Action: Execute the commands below on the Fuel node


Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor

Check if there is sufficient disk space in /var/log for backup


df -h

For mediums, recommend a minimum of 10GB free space for backup.

For large sites, recommend a minimum of 20GB free space for backup.

create backup folder -


mkdir /var/log/fuelbackuppre
cd /var/log/fuelbackuppre
sudo -u postgres pg_dump nailgun > nailgun.dump
gzip nailgun.dump
chmod -f 640 /var/log/fuelbackuppre/*
chmod -f 740 /var/log/fuelbackuppre

Results/Descriptions: Nailgun dump file is created in the backup folder.


Step 17.b.10: Backup Fuel settings

Time Line: 5 minutes

Action: Execute the commands below on the Fuel node


Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor
mkdir /var/log/fuelbackuppre
cd /var/log/fuelbackuppre

identify fuel env id and set variable


fuel env
export envid=<envid>
fuel settings download --env $envid
fuel network download --env $envid

Results/Descriptions:

Step 17.b.11: LCM status

Time Line: 15 minutes

Action: Login to Foreman UI


 Only LDAP accounts added to the groups (AP-AIC_ForemanViewerGroup or AP-AIC_ForemanAdminGroup) can be used to access the Foreman UI.
 Verify that the nodes are green. If they are not in sync, review node reports to assess severity; based on severity, raise a Jira - https://jira.web.labs.att.com/issues
 Attach the node report to the ticket.

In the hosts tab, check that all the nodes except the non-openstack nodes are pointing at the previous release_candidate.

 The only nodes which can be pointing at production are the non-openstack nodes.

Results/Description:
 Remove LCM nodes from Foreman:

In Hosts -> All hosts find and select all aic-influxdb,aic-elasticsearch and aic-alerting nodes

Remove them: Select Action -> Delete hosts

Step 17.b.12: Update OpsSimple packages

Time Line: 5 minutes

Action: Execute the commands below on the Opssimple node


Connect to Opssimple node with your personal attuid ssh {opsc IP} and become root using toor
vi /etc/apt/sources.list

verify /etc/apt/sources.list contents is as per SIL/Production or IST sites from links below:

For all SIL and Production Sites - prod.repo

For IST sites dev.repo


sudo apt-get update
sudo apt-get install aic-opssimple-addons-large
sudo apt-get install aic-opssimple-backend
sudo apt-get install aic-opssimple-core-lcp
sudo apt-get install aic-opssimple-plugins-foreman
sudo apt-get install aic-opssimple-plugins-gstools
sudo apt-get install aic-opssimple-plugins-knownstate
sudo apt-get install aic-opssimple-plugins-mosaudit
sudo apt-get install aic-opssimple-plugins-nagios
sudo apt-get install aic-opssimple-plugins-releaseupgrade
sudo apt-get install aic-opssimple-plugins-ro
sudo apt-get install aic-opssimple-plugins-security
sudo apt-get install aic-opssimple-plugins-vipr (Only for Medium sites)
sudo apt-get install aic-opssimple-plugins-astra
sudo apt-get install aic-opssimple-plugins-contrail

Results/Descriptions:

Make sure the plugins below are updated from the latest release repo by checking them with the apt-cache policy command.

Here is plugins list - Opssimple plugins list
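As an illustration, the installed and candidate versions can be compared in one pass (a sketch only; extend the loop with the full package names from the plugins list above):

for p in aic-opssimple-backend aic-opssimple-core-lcp aic-opssimple-plugins-foreman; do
  echo "== $p"; apt-cache policy "$p" | grep -E 'Installed|Candidate'
done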

Note: aic-lcm packages may be at a higher version; they can be updated separately to the latest versions via MOP-355 (aic-opssimple-plugins-aiclcm, aic-lcm).

Results/Descriptions: Verify OpsSimple Packages are updated.


#####Install latest aiclcm package:

Execute MOP-355 Update aiclcm packages

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-355-
Update_aic_lcm_packages.docx?
d=w787973068da2491288f93d6b4a0ea17e&csf=1&web=1&e=QKCsYL

#####Install certain version of packages on apollo node:


Execute AIC-MOP-563-MOP to pin debian packages to certain versions (See the link on the MOP in #16
Materials requirement)

Step 17.b.13: Create the Fuel dummy_env prior to export/import Opssimple_site if it does not
already exist

Time Line: 5 minutes

Action: Execute the commands below on the Fuel node


Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor
fuel env | grep dummy_env

If dummy_env was not present in the output of the above command then execute the
commands below to create the Fuel dummy_env.
fuel release
fuel env create --name dummy_env --release <Release id from above command>

Verify that the dummy_env was created in Fuel


fuel env | grep dummy_env
Results/Descriptions: Commands ran successfully without errors and dummy_env exists in Fuel.

Step 17.b.14: Check Foreman SSL certificates.

Time Line: 10 minutes

NOTE: from RC24 - A new certificate for the Foreman UI should exist (foreman-aic/foreman-aic.includesprivatekey.pem) with Common Name: puppet.<site>.cci.att.com, and Alternative Names: lcma01.<site>.cci.att.com, lcma02.<site>.cci.att.com, lcma03.<site>.cci.att.com (see the MTN11 sample below).
cd /home/m96722/aic/files/certificates

openssl x509 -in foreman-aic/foreman-aic.includesprivatekey.pem -text -noout

Sample output for MTN11:


.....

Subject: C=US, ST=Michigan, L=Southfield, O=AT&T Services, Inc., CN=puppet.mtn11.cci.att.com

.....

X509v3 Subject Alternative Name:

DNS:zmtn11lcma01.mtn11.cci.att.com, DNS:zmtn11lcma02.mtn11.cci.att.com,

DNS:zmtn11lcma03.mtn11.cci.att.com, DNS:puppet.mtn11.cci.att.com
Results/Descriptions: for MTN11 - Common Name: puppet.mtn11.cci.att.com and Alternative Names: DNS:zmtn11lcma01.mtn11.cci.att.com, DNS:zmtn11lcma02.mtn11.cci.att.com, DNS:zmtn11lcma03.mtn11.cci.att.com, DNS:puppet.mtn11.cci.att.com

Step 17.b.15: Verify Astra logging (Only required for Large sites with Astra.)

Time Line: 5 minutes

Action: On Opssimple VM, check the golden configuration -


Set use_syslog to 1 in Large sites which have an Astra node, and to 0 in sites which do not have Astra installed.

Results/Descriptions: Verify use_syslog is set to 1 for Large sites which have an Astra node and to 0 for sites which do not have Astra installed.
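A minimal way to confirm the value, assuming use_syslog is carried in the exported OpsSimple site yaml (the temporary filename below is only an example):

python3 /var/www/aic_opssimple/backend/manage.py export -f /tmp/ops_syslog_check.yaml <fuelenvname> yaml
grep -n 'use_syslog' /tmp/ops_syslog_check.yaml   # expect 1 on Astra sites, 0 otherwise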

Step 17.b.16: Merge RC specific content into OpsSimple_site_yaml.

Time Line: 20 minutes

Action: On Opssimple VM, as m96722 user export current site yaml


ssh <opscvm>
cd /home/m96722/install
python3 /var/www/aic_opssimple/backend/manage.py export -f /home/m96722/before_rc.yaml <fuelenvname> yaml

Check /home/m96722/before_rc.yaml in case any errors occurred. If any errors are found then stop
deployment and escalate to the appropriate release management chat specified in section 19.
./resetdb.sh
python3 /var/www/aic_opssimple/backend/manage.py import /home/m96722/before_rc.yaml yaml
cd /var/www/aic_opssimple/backend/
./manage.py export -f /home/m96722/latest.yaml <siteid> yaml

where siteid represents the site short code, obtained from the environment details wiki.

Merge the site yaml prepared in step 6 with the above file and verify the following:

 To ensure the right list of repos: replace the Repo section in the opssimple_site.yaml with the Repo section from the Repo golden configuration.
 To ensure the right list of packages: the list of packages for a release candidate will be provided by the CI/CD team as a yaml file. The *: present entries in the at&t plugin must be replaced by *: latest.
 Ensure that any new parameter has been added to the att-plugin and lcm-plugin sections.
 Ensure that node_env in the lcm_plugin section is pointing at 3_0_3_RC<35>_<stable|prod>.
 If node_env is pointing at version 3.0.3_RC15_<stable|prod> or older, ensure that r10k_do_init is set to true in the lcm_plugin section.
 If node_env is pointing at version 3.0.3_RC16_<stable|prod> or newer, ensure that r10k_do_init is set to false in the lcm_plugin section:

   r10k_do_init:
     description: If r10k should create puppet env after deployment
     label: Create puppet environment (R10k)
     restrictions:
       - action: hide
         condition: settings:fuel-plugin-lcm.metadata.enabled != true
     type: checkbox
     value: false
     weight: 135

 Ensure that for each compute the ignore flag is set to false.
 Ensure that all passwords have been decrypted. If you see encrypted passwords, you can decrypt them with the aiclcm tool (aic-lcm repository), as shown below:

aiclcm opssimplesite decrypt_passwd --mechid --path path/to/your/opssimple_site.yaml

Update the Environment section to kernel version ubuntu4Q21:

Environment:
  domain_name: {{ localaliasnvp }}.cci.att.com
  ubuntu_repo: http://ubuntumirror.it.att.com/ubuntu4Q21/ubuntu/
  kernel_version: 3.13.0-187.generic
  enable_root_access: true
  enable_automation_mechid: false
  verify_image_checksum: true
  image_checksum_file_url: http://mirrors.it.att.com/images/infra/MD5SUMS
  automation_uam_role: AIC Automation Prod
  is_local_storage: false
  name: {{ opsenvname }}
  site_code: {{ site_code }}
  template: production
  type: idc_edc

Please, check the output of the commands above. If you see any issues, please resolve them before
you proceed further. You can find an example of the output below (truncated):
cd /var/www/aic_opssimple/backend/
./manage.py import /home/m96722/latest.yaml yaml

Results/Descriptions: Verify the command ran successfully without errors.


Step 17.b.16.1: Generate a random Redis Password and update OpsSimple_site_yaml.

Time Line: 10 minutes

Action: On Opssimple VM, as m96722 user export current site yaml


ssh <opscvm>

Export the ops file that was imported (RC35) earlier using the command below.
python3 /var/www/aic_opssimple/backend/manage.py export -f /home/m96722/ops_redis.yaml <fuelenvname> yaml

Note: Make sure the above file is latest RC35 file.

Generate a random password and capture the output.

date +%s%N | md5sum | sha256sum | base64 | head -c 14 ; echo

Update the redis_password in the "aic-fuel-plugin" section of the exported /home/m96722/ops_redis.yaml file and import it using the command below.

python3 /var/www/aic_opssimple/backend/manage.py import /home/m96722/ops_redis.yaml yaml
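For the redis_password edit itself (the step just above the import), a minimal sketch using sed (it assumes redis_password appears exactly once, in the aic-fuel-plugin block; review the diff before importing):

cp /home/m96722/ops_redis.yaml /home/m96722/ops_redis.yaml.bak
sed -i 's/redis_password:.*/redis_password: <generated_password>/' /home/m96722/ops_redis.yaml
diff /home/m96722/ops_redis.yaml.bak /home/m96722/ops_redis.yaml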

Results/Descriptions: Verify the command ran successfully without errors.

Step 17.b.16.2: Check if db_password for cmha in Fuelpluginconfig block is default

Time Line: 10 minutes

Action: On Opssimple VM, as m96722 user export current site yaml


ssh <opscvm>

Export the ops file that was imported (RC35) earlier using the command below.
python3 /var/www/aic_opssimple/backend/manage.py export -f /home/m96722/ops_cmha.yaml <fuelenvname> yaml

Only update the cmha db_password if the password is the default.
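A quick way to inspect the current value before deciding whether to change it (a sketch; the exact key nesting in the exported yaml may differ):

grep -n -A3 'cmha' /home/m96722/ops_cmha.yaml | grep db_password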


Generate a random password and capture the output.

date +%s%N | md5sum | sha256sum | base64 | head -c 14 ; echo

Update the db_password for cmha in the "Fuelpluginconfig block" of the exported /home/m96722/ops_cmha.yaml file and import it using the command below.

python3 /var/www/aic_opssimple/backend/manage.py import /home/m96722/ops_cmha.yaml yaml

Results/Descriptions: Verify the command ran successfully without errors.

Step 17.b.16.3: Refresh aic-lcm Configs

Time Line: 20 minutes

Action: On OpsSimple VM, Login using UAM and switch as m96722


ssh <attuid>@<opscvm>
sudo su - m96722 OR sudo -iu m96722

Execute the following:

1. /var/www/aic_opssimple/backend/manage.py export <ENV_NAME> -f /tmp/.yml yaml  ## Make sure the Foreman section in the .yaml has proper Credentials.
2. aiclcm setup factory_reset
3. aiclcm setup check
4. aiclcm setup create --opssimplefile=/tmp/.yml  ## aiclcm.setup.yaml will be created
5. aiclcm setup install --setupfile aiclcm.setup.yaml  ## Select y, when prompted to confirm the config replace
6. opssimple-backend restart
7. sudo service nginx restart
8. aiclcm opscops export_site --env <ENV_NAME> --path /tmp/_exported_useing_aiclcm.yml
Step 17.b.17: Generate Scripts

Time Line: 5 minutes

Action: On Opssimple VM, as m96722 execute the commands -


ssh <opscvm>
cd /var/www/aic_opssimple/backend
./manage.py generate_scripts <siteid>

Results/Descriptions: Verify there are no differences listed in allchecks.txt file.

Step 17.b.18: Restart the haproxy service on fuel

Time Line: 2 mins

Action: On opssimple VM, as m96722 execute following,


cd /home/m96722/aic
ansible fuel_host -i inventory -m shell -sa "service haproxy restart"

Step 17.b.19: Update Repo Paths

Time Line: 2 mins

Action: On opssimple VM, as m96722 execute following,


./setup.sh configure_repo kvm_host

./setup.sh configure_repo jump_host


./setup.sh configure_repo nagios_host
./setup.sh configure_repo astra_host
./setup.sh configure_repo maas_host

Results/Descriptions: Steps completed without errors

Step 17.b.20: Update Apollo Package

Time Line: 10 hours prior to CW.

Action: On seed node, as ubuntu user execute following,


toor
eksh -l -o emacs
su - ubuntu
sudo apt-get update
sudo apt-get install --upgrade aic-opssimple-apollo
sudo apt-get install python-prettytable
cd apollo/
sudo apollo maas configure_bind

Results/Descriptions: Steps completed without errors


Step 17.b.21: Update the foreman.ini file (for OAuth authentication)

Time Line: 2 minutes

Action: On opssimple VM, as m96722 execute following to backup, delete and regenerate the
foreman.ini file:
sudo apt-get install python-requests-oauthlib
cp ~/aic/foreman_api/foreman.ini ~/aic/foreman_api/foreman.ini.bak.$(date +"%Y-%m-%dT%H%M")
rm ~/aic/foreman_api/foreman.ini
ansible-playbook -i inventory/ playbooks/deploy_puppet_agent_hosts.yml --tags foremanini

Results/Descriptions: Steps completed without errors


18. ATS Bulletin

1. Access the ATS Bulletin site at the following URL: ATS Bulletin
2. Click on the List/Search button
3. Click on the "Search" Menu
4. Determine a key word associated with the activity that you are performing and type it into the Keyword: (Hint) field
5. Click on the Submit Query button
6. Review each search result bulletin by clicking on the bulletin number to determine if any of the bulletins have warnings/actions associated with the activity that you are implementing
7. Associated bulletins are listed in the following table:

Bulletins Table

Layer 3 - AIC

General AIC

19. Emergency Contacts

Contact (Team or Person) Contact Information

AIC Deployment Upgrades - Shashank Gupta (sg944h) 972-415-3973
AIC Platform Engineering - Hari Om Singh (hs571j) 469-731-6556
Release Management Chat - Large: qto://meeting/q_rooms_cb12191468270764278/AIC+RC26.2%2C+RC29%2C+RC31+Large+Deployments
Release Management Chat - Medium: qto://meeting/q_rooms_cb12191534195381761/AIC+3.0.3+RC18.1+Medium+Deployments+%28see+Meeting+Attibutes+for+current+zones+%26+CWs%29

20. Preliminary Implementation

Permissions for Openstack configuration file backups should be set to 640, and other files 640 or
600. The rule of least privilege should apply, but we need to make sure the DE has access to the files
for restore / backout purposes. This should not be a group that non-SA and non-DE users would be
in.
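For example, a minimal sketch of applying this policy to one of the backup directories created later in this MOP (the group name is a placeholder; use whatever group holds your SAs/DEs):

chgrp -R <sa_de_group> /var/log/fuelbackuppre
chmod 740 /var/log/fuelbackuppre
chmod 640 /var/log/fuelbackuppre/*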
NOTE: All terminal input/output must be logged during the change.

Step 20.1.a: Schedule Downtime to Disable Nagios Alerts

Time Line: 5 minutes

For work covering the whole zone: (Please run below commands on Both Nagios servers
ngos01,ngos02 for Large Environments)

Action: Connect to Opssimple node using your personal attuid, switch to m96722, and then connect
to one of the Nagios hosts and become root
ssh {opsc IP}
sudo -iu m96722
ssh {nagios IP}
toor
/usr/local/bin/nagios-DE-sched-downtime-zone.sh [zone NAME] [HOURS] [Comment]

Example: Say I have an approved CR for PDK5 starting shortly expecting it to take 8 hrs: log onto
zpdk5ngos01.pdk5.cci.att.com
sudo /usr/localcw/bin/eksh v
/usr/local/bin/nagios-DE-sched-downtime-zone.sh pdk5 8 CR 4342837 // RM Notification would have CR
details

and on nagios GUI (downtime link, lower left under ‘System’ section)

Once the Schedule downtime is run on both Nagios servers, verify all servers in the zone have
expected downtime using the script zone-downtime-validation.sh and also login to Nagios UI and
validate downtime as per below screenshot.

Example of script zone-downtime-validation.sh usage for medium and large sites : This script is in
/usr/local/bin , should be run as root, and requires 2 inputs from command line – the first being the
‘zone’ (make sure it’s what Nagios recognizes - for example if you used ‘IPL1b’ it would fail to find
that since nagios only knows ipbin1b). the second value is the comment you expect to see from
setting the downtime.
root@dsvtxvcngos03.infra.aic.att.net:/home/sc3998# ./zone-downtime-validation.sh ipbin1b "IPBIN1B RC35 Upgrade"

ipbinrsv114.ipbin1b.infra.aic.att.net - all checks in downtime

ipbinrsv115.ipbin1b.infra.aic.att.net - all checks in downtime

ipbinrsv116.ipbin1b.infra.aic.att.net - all checks in downtime

ipbinrsv117.ipbin1b.infra.aic.att.net - all checks in downtime

ipbinvmopsc01.ipbin1b.infra.aic.att.net - all checks in downtime

ALL servers in the zone have expected downtime


here’s an example from large chg1b (ngos02)

root@zchg1bngos02:/home/sc3998# ./zone-downtime-validation.sh chg1b "bgscollect does not stay running"

chg1r03a004.chg1b.cci.att.com - WARNING - service checks greater than number of downtimes posted

chg1r22o003.chg1b.cci.att.com - WARNING - service checks greater than number of downtimes posted

chg1r22o004.chg1b.cci.att.com - WARNING - service checks greater than number of downtimes posted

chg1r22o005.chg1b.cci.att.com - WARNING - service checks greater than number of downtimes posted

chg1r22o006.chg1b.cci.att.com - WARNING - service checks greater than number of downtimes posted

zchg1bopsc01.chg1b.cci.att.com - WARNING - service checks greater than number of downtimes posted

zchg1bngos01.chg1b.cci.att.com - WARNING - service checks greater than number of downtimes posted

zchg1bngos02.chg1b.cci.att.com - WARNING - service checks greater than number of downtimes posted

zchg1brosv01.chg1b.cci.att.com - all checks in downtime

zchg1brosv02.chg1b.cci.att.com - all checks in downtime

NOT all the zone is in downtime. Examine output above for WARNING messages

NOTE: If all the zone is not in downtime , re run the nagios-DE-sched-downtime-zone.sh again

NOTE: hostname MUST match exactly as defined in nagios /etc/nagios3/conf.d/.cfg file for nagios3
version and /usr/local/nagios/etc/conf.d/host.* for nagios4

NOTE: Release Management Chat to be communicated via Q chat for duration Nagios alert would
be disabled

Release Management Chat - Large


- https://teams.microsoft.com/l/channel/19%3a660c634c071749efab96f2c58b853103%40thread.tacv
2/AIC%2520Large%2520Prod%2520Deployments?groupId=9f716cdc-d9de-45cf-8171-
f74c1e6f3bc5&tenantId=e741d71c-c6b6-47b0-803c-0f3b32b07556

Release Management Chat - Medium


- https://teams.microsoft.com/l/channel/19%3a26525ea7678c4642ab3aea18a373990f
%40thread.tacv2/AIC%2520Medium%2520Prod%2520Deployments?groupId=9f716cdc-d9de-45cf-
8171-f74c1e6f3bc5&tenantId=e741d71c-c6b6-47b0-803c-0f3b32b07556

Results/Descriptions: Nagios dashboard disabled and release management notified

Step 20.1.a.1: The following additional validation needs to be done using the Nagios GUI by the DE.

Time Line: 5 minutes

Action: Log onto the Nagios web GUI, select hosts, and change the server count from '100' in the dropdown to 'all'. Verify downtime is added for all hosts; if downtime does not exist on any nodes, fix it by rerunning the script in Step 20.1.a to schedule downtime on those nodes. Validate downtime as per the screenshot below.

Results/Descriptions: All hosts on Nagios dashboard downtime is scheduled.

Step 20.1.b: Create Tower Ticket to supress TOA Alarm

Time Line: 10 minutes

Action: During the upgrade, if a compute reboot is required (as part of the Change window), the DE needs to create a Tower Ticket using the steps below for the complete CR duration.

 Use TOWER to open a request to O&T: http://tower.web.att.com/#/


 Select the relevant Issue type: Maintenance Request - General alerting/ticketing problems and eSet rule changes (eDART/routing etc.).

In the Maintenance Request form:

 Input EMOS in the Filter Application List field



 Then click in the Select Application field and select the area that best fits your issue.

o For eDart requests select EMOS EDART / Universal Work Flow


o For eSet alarm/ticket routing select EMOS ESET Rule Assistance
(Infrastructure, DBA, Storage)
 Skip the Select Assignee field

 Select appropriate Request Type. If unsure, choose Information Request.

 Provide a Brief Summary that describes your issue

 Provide a Detailed Description. Be sure to include example AOTS ticket numbers, server names, alert info, etc., as appropriate. This helps speed up the investigation.


 Confirm the I am the Primary Contact statement. Default is Yes.

 Answer the Is there Associated Capital Labor? question. Default is No.

 Add Attachment if needed (optional).

 Submit.

Feel free to Q or email me (ln8367), to confirm I received your request, or for assistance in
completing the form.
Inputting a request in TOWER will result in an iTrack Issue being created. TOWER is simply the Front-
door issue entry tool.

Step 20.1: Fix ssh config for admin user on Fuel node

Time Line: 3 minutes


Action: Connect to Opssimple node with your personal attuid ssh {opsc IP} , become m96722
using sudo -iu m96722 and execute the following:
cd aic
ansible-playbook -i inventory -s playbooks/deploy_ssh_keys.yml --tags=install --limit=fuel_host

Results/Descriptions: It will switch User from fuel to m96722 in ssh config (needed for MOP
automation)

Step 20.2: Execute AIC-MOP-586 : MOP to regenerate LCM certificates

https://codecloud.web.att.com/projects/ST_CCP/repos/aic-docs/browse/docs/mops/
MOP_LCM_certs.md?until=cb27d032aae46bf4aaa9249a7d03605b1af96816&untilPath=docs
%2Fmops%2FMOP_LCM_certs.md

Step 20.3: Suspend Puppet Agent


Time Line: 10 minutes
Action: Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor

Retrieve Fuel env id and set


fuel env
export envid=<envid>

upload fuel stop puppet graph


fuel2 graph upload --env $envid --type stop_puppet --file
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/stop_puppet.yaml

Execute stop puppet graph


fuel2 graph execute --env $envid --type stop_puppet

check graph execution is complete


fuel2 task list

Results/Descriptions: Check puppet agent is stopped on all nodes.

On Fuel VM, logged in using ATTUID, execute following to check status of puppet agents on all
nodes:
for i in $(sudo /usr/localcw/bin/eksh -c "fuel nodes" | grep ready | awk '{print $5}')
do
  echo -n "$i "
  ssh -qt -o StrictHostKeyChecking=no $i 'sudo /usr/localcw/bin/eksh -c "sudo service puppet status"'
done

Results/Descriptions: Commands executed without errors and puppet agents stopped


Step 20.4: Disable weak ciphers on OpsSimple

Time Line: 2 minutes

Action: On OpsSimple VM, Login using UAM and switch as m96722 execute following to update
weak ciphers (security requirement).
cd aic
./setup.sh disable_ops_weakciphers
Step 20.5: Update Fuel repo

Time Line: 2 minutes


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor ,
update /etc/yum.repos.d/update_repo_RCXY.repo as per

prod.repo stable.repo

Results/Descriptions:

Step 20.6: Upgrade Fuel components

Time Line: 2 minutes


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} , become root using toor and
execute:
sudo yum --disablerepo=* --enablerepo=update_repo_RC35_1_dep --enablerepo=update_repo_RC35_1_rpm --enablerepo=update_repo_RC35_1_plugins update fuel-nailgun -y

nailgun_syncdb
nailgun_fixtures
systemctl daemon-reload && service nailgun reload

Check and remove duplicate designate packages:
rpm -qa| grep designate
rpm -ev <older version of designate> --noscripts

NOTE: In the case you run into issues with below command, kill the process and rerun to remove any
duplicate designate packages and continue.
$ sudo yum --disablerepo=* --enablerepo=update_repo_RC35_1_dep --enablerepo=update_repo_RC35_1_rpm --enablerepo=update_repo_RC35_1_plugins update -y --exclude fuel-octane
Results/Descriptions:

Step 20.7: Check for duplicate yum packages.

Time Line: 2 minutes


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} , become root using toor and
execute:
rpm -qa --qf "%{NAME} %{ARCH}\n" | sort | uniq -c | grep -v '1 '
If there are duplicates, remove the unnecessary packages and re-run above command again until it
shows no duplicates. To remove duplicates execute -
rpm -e <package_name>
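If yum-utils is installed on the Fuel node, an alternative sketch that removes the older copy of each duplicate in one pass (review the proposed removals before confirming):

package-cleanup --cleandupes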

Results/Descriptions:

Step 20.8: Update Fuel Services configuration.

Time Line: 10 minutes


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} , become root using toor and
execute:

#####a) hiera data preparation deployment graph:


We have to keep this graph execution until the next prod RC coming after RC35.

fuel env

export envid=<envid>

fuel2 graph upload --env $envid --type hiera_prep --file

/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/fuel/hiera-preparation.yaml

fuel2 graph execute --env $envid --type hiera_prep

/etc/puppet/kilo-9.0/modules/fuel/examples/settings.sh $envid RC35_before_upgrade

#####b) Apply changes on Fuel node by execution update.sh :


/etc/puppet/kilo-9.0/modules/fuel/examples/update.sh
Note: There should be tasks:
hiera,sshkeygen,keystone,ldap,keystone_token_disable_nginx_services,nailgun,ostf,client,logindefs,fue
l_tasks_cleaner,security, backup.

If all these are not present then please re-execute command above.
The exit code could be either 0 or 2.

If there are any other exit codes then it may require troubleshooting.

Results/Descriptions: Check /var/log/remote/127.0.0.1/puppet-apply.log in case any errors occurred.

Step 20.9: Restart Fuel Services.

Time Line: 5 minutes


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} , become root using toor and
execute:
service mcollective restart
service nailgun restart
service astute restart
fuel plugins --sync
fuel plugins --list

Results/Descriptions: Verify the latest packages are installed. The list of fuel packages is listed here: fuelpluginlist.yaml

Step 20.10: Update Fuel meta data

Time Line: 5 minutes


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} , become root using toor and
execute:
fuel env
export envid=<envid>
python /var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/update_plugins_metadata.py -e $envid -u
Note - if there are issues with python command, execute fuel plugins --sync one more time and re-run
the python command to update metadata.

Check for Contrail Alarm Metadata by downloading the settings file and seeing if it was updated. If the Contrail alarm metadata was not updated, re-run the python command to update the metadata.
fuel env
export envid=<envid>
fuel settings --env ${envid} --download
fgrep alarm_list $(pwd)/settings_${envid}.yaml

Results/Descriptions: Verify alarm and other plugin metadata is in sync by examining settings_${envid}.yaml.

Remove the settings files once verified


rm $(pwd)/settings_${envid}.yaml

Step 20.11: Propagate changes from Opssimple to Fuel.

Time Line: 5 minutes

Action: On Opssimple VM, as m96722 export current site yaml -


ssh <opscvm>
cd /var/www/aic_opssimple/backend
./manage.py generate_scripts <site_name>
cd ~/aic/files/env_<siteid>/fuel-client
python ./script.py update_repository
python ./script.py update_plugin

Results/Descriptions: Verify update repository and update plugin returns a result code of 200.
cd /home/m96722/aic
ansible-playbook -i inventory/ playbooks/prepare_authorized_keys.yml
Verify that a file named authorized_keys got created within /home/m96722/aic/files/env_<siteid>/fuel-client/. It will be a copy of the id_rsa.pub from the /home/m96722/.ssh folder with a "from" restriction for the Fuel, Opssimple & Seed nodes added.
cd ~/aic/files/env_<sitename>/fuel-client
python ./script.py restrict_os_user
The above command will push the authorized_keys of operator_user ( m96722 ) to Fuel.

Verify that proper configuration was uploaded:

Action: On Fuel VM, execute -


fuel env
export envid=<envid>
python /var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/update_plugins_metadata.py -e $envid -u

The command output should not contain any "key is missing" lines.
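A simple scripted check (a sketch; it just counts occurrences in the command output):

python /var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/update_plugins_metadata.py -e $envid -u 2>&1 | grep -ci 'key is missing'   # expect 0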
Step 20.12: Install opsfix0064 and copy diff_struct.py to fuel node.

Time Line: 10 minutes

Action: On OpsSimpleVM execute -

Install opsfix0064:

sudo su - m96722 OR sudo -iu m96722
sudo apt-get update
sudo apt-get install aic-opsfix-cmc-0064

Copy diff_struct.py script to fuel node:

cd ~/aic
ansible -i inventory/ fuel_host -m copy -a "src=~/rotatemechid/diff_struct.py dest=/tmp/"
Step 20.13: Validate fuel settings.

Time Line: 10 minutes


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} , become root using toor and
execute:
ssh <fuelvm>
mkdir /var/log/new_RC35
cd /var/log/new_RC35
fuel env
export envid=<envid>
fuel settings --download --env $envid
fuel network --download --env $envid
/etc/puppet/kilo-9.0/modules/fuel/examples/settings.sh $envid RC35_new_config

Results/Descriptions: verify following,

No differences in network yaml.


python /tmp/diff_struct.py network_<envid>.yaml ../fuelbackuppre/network_<envid>.yaml

RC specific updates should show up differences in settings yaml.


python /tmp/diff_struct.py settings_<envid>.yaml ../fuelbackuppre/settings_<envid>.yaml

Step 20.14: Touch nova-common file in computes and controllers


Time Line: 5 minutes

Action: On Opssimple VM, as m96722 execute the commands


ssh <opscvm>
cd aic
ansible compute_host:controller_host -i inventory/openstack -m shell -sa "cp
/usr/localcw/opt/sudo/sudoers.d/aic_nova_sudoers /usr/localcw/opt/sudo/sudoers.d/nova-common"
ansible compute_host:controller_host -i inventory/openstack -m shell -sa "cp /etc/sudoers.d/aic_nova_sudoers
/etc/sudoers.d/nova-common"

21. Implementation

NOTE: All terminal input/output must be logged during the change.

Step 21.1: Remove the StackLight Collector Plugin. (Only required for Large sites, not needed for
medium sites.)

Time Line: 10 minutes

Action: On Fuel VM, as fuel admin execute -

fuel plugins | grep lma_collector

for example:


[root@nailgun ~]# fuel plugins | grep lma_collector

4 | lma_collector | 0.10.1188 | 4.0.0 | ubuntu (kilo-9.0, liberty-8.0, liberty-9.0, mitaka-9.0)

fuel plugins --remove lma_collector==<version>

for example:

fuel plugins --remove lma_collector==0.10.1188

Step 21.2: Execute Fuel release candidate graph.

Time Line: 10 minutes

Action: On Fuel VM, as fuel admin execute -


fuel env
export envid=<envid>
fuel2 graph delete --env $envid --type release_candidate
fuel2 graph upload --env $envid --type release_candidate --file
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/release_candidate.yaml

fuel2 graph execute --env $envid --type release_candidate


fuel2 graph delete --env $envid --type custom_step_1
fuel2 graph delete --env $envid --type custom_step_2
fuel2 graph delete --env $envid --type custom_step_3
fuel2 graph delete --env $envid --type custom_step_4
fuel2 graph upload --env $envid --type custom_step_1 --file
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/from_rc27/01-remove-
lma_art.yaml
fuel2 graph upload --env $envid --type custom_step_2 --file
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/from_rc32/01-
backup_mysql.yaml
fuel2 graph upload --env $envid --type custom_step_4 --file
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/from_rc32/03-
clean_lcm_and_dbng_nodes.yaml
fuel2 graph execute --env $envid --type custom_step_1
fuel2 graph execute --env $envid --type custom_step_2

Check that backup was successful on all nodes:


fuel2 task list

The Percona upgrade is a very disruptive procedure. To prevent mysql cluster failures, take these additional steps:
Login to all 3 dbng (lcm) nodes, and update mysql configuration:
add "innodb_fast_shutdown = 0" to the '[mysqld]' section of the /etc/mysql/my.cnf

From any dbng (lcm) node restart the mysql cluster:


crm resource restart clone_p_mysqld

Wait until all three nodes are in the 'started' state:


crm resource list

Execute mysqlcheck on all 3 dbng (lcm) nodes:


mysqlcheck --auto-repair --optimize --all-databases

You should not have any errors.

If you see any errors - stop the upgrade and create a ticket for the tiger team for investigation.

Update pacemaker timeout. Login to all 3 dbng (lcm) nodes, and update pacemaker configuration:

Open the '/usr/lib/ocf/resource.d/fuel/mysql-wss' file and find this line in the mysql_stop() function:

proc_stop "${OCF_RESKEY_pid}" "mysqld.*${OCF_RESKEY_datadir}" SIGTERM 5 $(( $shutdown_timeout/5 ))

Add '1205' between "mysqld.*${OCF_RESKEY_datadir}" and SIGTERM to have:


proc_stop "${OCF_RESKEY_pid}" "mysqld.*${OCF_RESKEY_datadir}" 1205 SIGTERM 5 $
(( $shutdown_timeout/5 ))

Save changes. Please note that 1205 is 2 hours of trying to gracefully terminate the mysql process. Be
patient.

Stop the mysql cluster:


crm resource stop clone_p_mysqld

Wait until all three nodes are in the 'stopped' state:


crm resource list

Check that there are no mysql and mysql-related processes on the node (on all 3 nodes):
ps aux | grep mysql

Login to all 3 nodes and check the state of DB one by one:

Check that .my.cnf is pointing to the /root/.my.localhost.cnf:


ls -la /root/ | grep .my.cnf
lrwxrwxrwx 1 root root 23 Jan 12 19:41 .my.cnf -> /root/.my.localhost.cnf

If not, execute:
ln -sf /root/.my.localhost.cnf /root/.my.cnf

Ban p_mysqld resource in pacemaker one by one:


crm_resource --resource clone_p_mysqld -B --host $(hostname)

Start the Mysql locally:


mysqld --log-error=/var/log/mysql/mysqld_custom.log --user=mysql --wsrep-provider='none' --innodb-read-only=OFF &

Check that you can login to mysql:


mysql>

If you are not able to log in, stop the upgrade and ask the tiger team to investigate.

Check the DB status:


mysqlcheck --auto-repair --optimize --all-databases
mysqladmin shutdown
ps aux | grep mysql

Remove innodb_fast_shutdown and add 'show_compatibility_56 = on' to all 3 nodes:


remove "innodb_fast_shutdown = 0" from the '[mysqld]' section of the /etc/mysql/my.cnf
add "show_compatibility_56 = on" to the '[mysqld]' section of the /etc/mysql/my.cnf

Remove '1205' from the '/usr/lib/ocf/resource.d/fuel/mysql-wss' file on all 3 dbng nodes to have:


proc_stop "${OCF_RESKEY_pid}" "mysqld.*${OCF_RESKEY_datadir}" SIGTERM 5 $(( $shutdown_timeout/5 ))

Repeat the same for LCM nodes.

Percona Upgrade procedure (for DBNG and LCM nodes)

On the Fuel node:

Create upgrade.yaml file with content:

- id: stop_mysql_cluster
  type: shell
  role: ['primary-lcm']
  version: 2.1.0
  required_for: [check-clusters-status]
  parameters:
    cmd: crm resource stop clone_p_mysqld
    retries: 3
    interval: 20
    timeout: 300

- id: stop_rabbit_cluster
  type: shell
  role: ['primary-aic-dbng']
  version: 2.1.0
  requires: [stop_mysql_cluster]
  parameters:
    cmd: crm resource stop master_p_rabbitmq-server
    retries: 3
    interval: 20
    timeout: 300

- id: check-clusters-status
  version: 2.1.0
  type: shell
  role: ['primary-aic-dbng', 'aic-dbng', 'primary-lcm', 'lcm']
  requires: [stop_rabbit_cluster, stop_mysql_cluster]
  cross-depends:
    - name: stop_mysql_cluster
  parameters:
    cmd: sleep 30; crm resource list | grep Stopped:| grep $(hostname)
    retries: 12
    interval: 30
    timeout: 1800
  strategy:
    type: one_by_one

- id: update_config
  type: shell
  role: ['primary-aic-dbng', 'aic-dbng', 'primary-lcm', 'lcm']
  version: 2.1.0
  requires: [check-clusters-status]
  parameters:
    cmd: |
      sed -i '/myisam_recover/d' /etc/mysql/my.cnf
      puppet resource file_line home-dir path=/etc/mysql/my.cnf line='innodb-data-home-dir = /var/lib/mysql/' match=innodb-data-home-dir
      puppet resource file_line databases-exclude path=/etc/mysql/my.cnf line='databases-exclude = lost+found' after=parallel
    retries: 3
    interval: 20
    timeout: 180

- id: update_packages
  type: shell
  role: ['primary-aic-dbng', 'aic-dbng', 'primary-lcm', 'lcm']
  version: 2.1.0
  requires: [update_config]
  parameters:
    cmd: |
      apt-get remove percona-xtradb-cluster-server-5.6 percona-xtradb-cluster-client-5.6 percona-xtrabackup percona-xtradb-cluster-common-5.6 -y
      puppet resource package percona-xtradb-cluster-server-5.7 ensure=present
      puppet resource package percona-xtradb-cluster-client-5.7 ensure=present
    retries: 3
    interval: 20
    timeout: 3600

Upload and execute this graph:


fuel2 graph upload --env $envid --type package_update --file ./upgrade.yaml

fuel2 graph execute --env $envid --type package_update


On all 3 DBNG (LCM) nodes one by one

Login to all 3 dbng nodes and execute:


eval "/usr/bin/mysqld_safe --log-error=/var/log/mysql/mysqld_custom.log --skip-grant-tables --skip-networking --
user=mysql --wsrep-provider='none' 2>&1 > /dev/null" & disown

Wait at least 20-30 seconds and execute:


mysql_upgrade

If successful, stop MySQL on this node and execute the same for the next node:
service mysql stop

If not successful, try:


mv /var/lib/mysql/ib_logfile0 /root/
mv /var/lib/mysql/ib_logfile1 /root/
eval "/usr/bin/mysqld_safe --log-error=/var/log/mysql/mysqld_custom.log --skip-grant-tables --skip-networking --
user=mysql --wsrep-provider='none' 2>&1 > /dev/null" & disown

And execute mysql_upgrade again:


mysql_upgrade

If successful, stop MySQL on this node and execute the same for the next node:
service mysql stop

If you see any errors - stop the upgrade and create a ticket for the tiger team for investigation.

Repeat the same steps for LCM nodes

If all 3 nodes have been successfully upgraded, start the Mysql cluster:

Starting from the primary-dbng (primary-lcm) node:


crm_resource --resource clone_p_mysqld -U --host $(hostname)
crm resource start clone_p_mysqld
Try to login to mysql and remove constraints for other nodes:

From the primary-dbng (primary-lcm) node login to the mysql:


mysql

If you are not able to log in, stop the upgrade and ask the tiger team to investigate.

Start mysql on other 2 nodes:


crm_resource --resource clone_p_mysqld -U --host $(hostname)

The Percona upgrade is a very disruptive procedure. After this procedure we have to log in to the dbng and lcm nodes and check the cluster status.

From the fuel node as fuel admin:


fuel node | grep dbng

Login to any dbng and lcm node and login to the mysql. If mysql is not working on this node use
another dbng (lcm) node.

Check cluster status and uninstall validate_password plugin


(https://itrack.web.att.com/browse/AICDEFECT-1990):
mysql> uninstall plugin validate_password;
mysql> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';

wsrep_cluster_size should be 3. If not, create a ticket for the tiger team for investigation; otherwise, repeat the same steps for the lcm nodes.

Start the rabbitmq cluster on any dbng node:
crm resource start master_p_rabbitmq-server

Proceed with upgrade procedure. From the fuel node:


fuel2 graph execute --env $envid --type custom_step_4

Results/Descriptions:
Step 21.2.1: Run glance db sync

Time Line: 10 minutes

Action: on mosc, as fuel user run


ssh <moscvm>
sudo glance-manage -d db_sync

Results/Descriptions:

Step 21.2.2: Update LCM hosts

Time Line: 10 minutes


Action: on any LCM node, as foreman admin execute:

ssh <lcmvm>
lcm_update_hosts -f

Step 21.2.3: Disable THP on all GV nodes

Time Line: 10 minutes

Action:

On OpsSimple VM, as m96722 execute -


cd aic
ansible controller_host[0] -i inventory/openstack -m shell -a ". /home/m96722/openrc_v2; openstack aggregate list |
grep 'gv'"
ansible controller_host[0] -i inventory/openstack -m shell -a ". /home/m96722/openrc_v2; openstack aggregate
show <previous_command_output> -f json | python -mjson.tool"
ansible <host_list_from_previous_command (host1:host2:...)> -i inventory/openstack -m shell -s -a "puppet apply
/etc/puppet/modules/fuel/examples/grubupdate.pp"

Check if 'transparent_hugepage=never' is present on all needed nodes:


ansible <host_list (host1:host2:...)> -i inventory/openstack -m shell -s -a "cat /etc/default/grub | grep
GRUB_CMDLINE_LINUX"

Reboot all GV nodes:


ansible <host_list (host1:host2:...)> -i inventory/openstack -m shell -s -a "reboot"

Step 21.3: Designate DDNS service removal and cron update. - for (Large Sites only)

Time Line: 5 minutes


Action: On each of the TMDG nodes, log in using UAM, execute toor, and run:
apt-get remove aic-designate-ddns

Also update the crontab, commenting out the entries below to stop them:

#*/1 * * * * python /usr/lib/python2.7/dist-packages/aicddns/ddnsupdate.py

#30 4 * * * python /usr/lib/python2.7/dist-packages/aicddns/axfrupdate.py


crontab -e
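If you prefer to comment the entries non-interactively rather than via crontab -e, a sketch (it assumes the two jobs are present and not yet commented; verify with crontab -l afterwards):

crontab -l | sed -e '\|aicddns/ddnsupdate.py| s|^[^#]|#&|' -e '\|aicddns/axfrupdate.py| s|^[^#]|#&|' | crontab -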

[DEFECT-17627]: Also make sure the designate server is up and running

designate server-list

+--------------------------------------+-----------------------+

| id | name |

+--------------------------------------+-----------------------+

| 7abc8cc1-031f-4c66-bab2-8bb0bd3ae8f6 | ns1.zone.tci.att.com. |

+--------------------------------------+-----------------------+

Check the designate server existence (only for Large sites):

ansible-playbook -i inventory/ playbooks/openrc_automation/get_designate_server_list.yml

If the above command results in a null output, please create a designate server.
Run the below script


/etc/designate/create_nameserver.sh

Step 21.4: Check LCM actions on limited set of nodes.

Time Line: 20 minutes

Action: on primary LCM nodes, as fuel admin execute following -

facter -p environment
puppet agent --test --verbose --debug --trace --evaltrace --summarize --detailed-exitcodes --noop
cat /var/log/puppet/puppet.log

Check the result in the console, in /var/log/puppet/puppet.log and Foreman; verify that this is not an "empty catalog", check that Foreman is green, and check that the last report for that node in Foreman is around 1 minute old. If something wrong is detected (unwanted mechid changes, unwanted timer change, etc.), the DE needs to go back and adjust the opssimple_site.yaml.

The following command will force the puppet agent to run immediately.

service puppet start ; sleep 5 && kill -USR1 $(ps aux | grep 'puppet agent' | grep -v configuration | grep -v grep | awk '{ print $2 }')
cat /var/log/puppet/puppet.log

Check /var/log/puppet/puppet.log

Results/Descriptions: Commands executed without errors and that nothing wrong detected in log
or Foreman
Step 21.5: Re-enable Puppet Agents.
Time Line: 20 minutes

The puppet agents are currently in a suspended state. The following graph will re-enable them. The
"actual" run of the puppet agents is spread over a 30 min window.
Action: Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor

Re-enable the Puppet Agents


fuel2 graph upload --env $envid --type start_puppet --file
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/start_puppet.yaml

fuel2 graph execute --env $envid --type start_puppet

check the graph execution is complete.

Results/Descriptions: On Fuel VM, logged in using ATTUID, execute following to to verify puppet
agents are started on all nodes:
for i in $(sudo /usr/localcw/bin/eksh -c "fuel nodes" | grep ready | awk '{print $5}')
do
  echo -n "$i "
  ssh -qt -o StrictHostKeyChecking=no $i 'sudo /usr/localcw/bin/eksh -c "sudo service puppet status"'
done
Step 21.6: GSTools Update

Time Line: 120 minutes

Action:

On OpsSimple VM, as m96722 execute -


cd aic
ansible all -i inventory -s -m shell -a "killall bgsagent "
ansible all -i inventory -s -m shell -a "killall bgssd.exe "

Execute GSTools update procedure per - GSTools MOP

Results/Descriptions: Commands executed without errors


Re-run on failed nodes:
Health check:
Health check at :
/opt/aic-health-check/cache/hc_ipbinvlopsc01.ipbin1a.infra.aic.att.net_20220427_030732.csv

Step 21.6.1: Steps to restart bgssd and bgscollect

Time Line: 20 minutes

Action:

On OpsSimple VM, as m96722 execute -


cd aic
ansible all -i inventory/ -m shell -sa "pkill -9 -u m95031"
ansible all -i inventory/ -m shell -sa "/etc/init.d/bgssd start"
ansible all -i inventory/ -m shell -sa "yes | /opt/tools/bpa/b1config10700.sh &"

NOTE: Please wait for 10 mins before executing the next steps
ansible all -i inventory/ -m shell -sa "/usr/adm/best1_default/bgs/bin/best1collect.exe -I noInstance -B
/usr/adm/best1_10.7.00"

Verify if all 4 bgs process are running


ansible all -i inventory/ -m shell -sa "ps -ef |grep bgs"

Verify nagios alert using cli


ansible all -i inventory/ -m shell -sa "/usr/lib/nagios/plugins/check_procs_multi -p bgsagent,bgscollect,bgssd.exe -a
bgsioconfigcollect"
Results/Descriptions: Commands executed without errors

Step 21.7: Steps to disable McAfee agent

Time Line: 20 minutes

Action: On OpsSimple VM, as m96722 execute -

sudo apt-get update

sudo apt-get install aic-opssimple-plugins-gstools

cd aic

ansible-playbook -i inventory/ playbooks/mcafee_disable.yml


Validate the changes
ansible-playbook -i inventory/ playbooks/mcafee_disable_validate.yml

Results/Descriptions: Commands executed without errors

Step 21.8: Revert Astra config on openstack nodes (Only required for Large sites, not needed for
medium sites.)

Time Line: 30 mins

Action: On OpsSimple VM, as m96722 execute -

Execute AIC-MOP-611

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/_layouts/15/Doc.aspx?sourcedoc=%7B91A7FE22-
C104-45E9-BE3F-F2853478AF9E%7D&file=AIC-MOP-
611_MOP_FOR_OPSFIX_0211.docx&action=default&mobileredirect=true&cid=f7fc2067-2fc9-417d-
be18-c4bfe646467f

Results/Descriptions:

Step 21.9: Update Nagios.

Time Line: 20 minutes

Action: On OpsSimple VM, as m96722 execute -

Execute AIC-MOP-601:

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/_layouts/15/Doc.aspx?sourcedoc=%7B0BA3044B-
0FA8-4A4C-A54A-51FC62FA80F1%7D&file=AIC-MOP-
601_MOP_FOR_OPSFIX_0212.docx&action=default&mobileredirect=true
ansible-playbook -i inventory playbooks/deploy_nagios_server.yml
ansible-playbook -i inventory playbooks/deploy_nagios_agent.yml

To upgrade nagios to the Nagios core 4.4.6 version (applicable only for the RC 29 release), execute
ansible nagios_host -i inventory/ -m shell -s -a "sudo apt-get install --upgrade nagios4"

To add startup links to init scripts for nagios, execute


ansible nagios_host -i inventory/ -m shell -s -a "update-rc.d nagios defaults"

Results/Descriptions:
Removal of duplicate pacemaker log rotation config file - AICDEFECT-901:

ansible -i inventory --limit tmdg_hosts -s -m shell -a "rm /etc/logrotate.d/pacemaker"
Step 21.10: Restore Fuel config to normal state

Time Line: 20 minutes

Action: On Opssimple VM, as m96722 user execute the following:


cd /var/www/aic_opssimple/backend/
./manage.py export -f /home/m96722/latest.yaml <siteid> yaml
a) Change *: latest to *: present
To avoid unwanted changes outside a maintenance window, the packages list needs to be changed from *: latest to *: present in /home/m96722/latest.yaml.
b) Add / change enable_puppet_agent: true in the fuel-plugin-lcm section of /home/m96722/latest.yaml

c) Then execute the following to send data to a Fuel API:


cd /var/www/aic_opssimple/backend
./manage.py import /home/m96722/latest.yaml yaml
./manage.py generate_scripts <site_name>
cd ~/aic/files/env-<siteid>/fuel-client
python ./script.py update_plugin

d) Propagate changes through Fuel, Connect to Fuel node with your personal attuid ssh {fuel
IP} and become root using toor
fuel env
export envid={envid}

where envid is obtained from results for fuel env. above.


fuel2 graph execute --env $envid --type stop_puppet

/etc/puppet/kilo-9.0/modules/fuel/examples/settings.sh $envid RC35_before_config_db


fuel2 graph delete --env $envid --type update_config_db
fuel2 graph upload --env $envid --type update_config_db --file
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/update_config_db.yaml
fuel2 graph execute --env $envid --type update_config_db

On OpsSimple VM, as m96722, please execute the following to check that *: latest has changed to *: present :
ssh <opscvm>
cd aic/

Execute the ansible command:


ansible all -i inventory/openstack -m shell -a "sudo hiera packages|grep -q present"

Expected result: command finished with SUCCESS status.
e) Start puppet agent on all nodes:
fuel2 graph execute --env $envid --type start_puppet

Results/Descriptions: Once graph execution is complete, on Fuel VM, logged in using ATTUID,
execute following to check status of puppet agents on all nodes:
for i in $(sudo /usr/localcw/bin/eksh -c "fuel nodes" | grep ready | awk '{print $5}')
do
  echo -n "$i "
  ssh -qt -o StrictHostKeyChecking=no $i 'sudo /usr/localcw/bin/eksh -c "sudo service puppet status"'
done

Results/Descriptions: Commands executed without errors

Step 21.11: Remove python3 from fuel node.

Time Line: 30 minutes

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-
549_MOP_to_remove_python3_from_fuel.docx?
d=w910975145346489093628711b7c70f14&csf=1&web=1&e=N9Jwot
Step 21.12: Update MongoDB auth-scheme version to 5

Time Line: 20 minutes

Action: On opssimpleVM, as m96722 execute following,


Validate current version is 3:

ansible-playbook -i inventory/ playbooks/mongodb_auth_scheme_enabling.yml --tags validate

Update AUTH-schema version to 5:

ansible-playbook -i inventory/ playbooks/mongodb_auth_scheme_enabling.yml --tags deploy

Validate current version is now 5:

ansible-playbook -i inventory/ playbooks/mongodb_auth_scheme_enabling.yml --tags validate

Results/Descriptions: Commands executed without errors


Step 21.13: Remove DHCP client from RO, Astra VMs

Time Line: 5 minutes

Action: On opssimpleVM, as m96722 execute following,


ansible-playbook -i inventory/ playbooks/remove_dhcp.yml

Results/Descriptions: Commands executed without errors

Step 21.14: Restart the nova-novncproxy service if the VNC console is not accessible

Time Line: 5 minutes

Action: On each MOS Controller, as fuel admin execute following,


service nova-novncproxy restart

Results/Descriptions: Commands executed without errors


Step 21.15: Update puppet config on non Openstack nodes (Non-Fuel managed hosts with tags
deployagtastra,deployagtfuel,deployagtjump,deployagtkvm,deployagtmaas,deployagtnagios,deploy
agtro)

Time Line: 30 minutes

Action: On Opssimple VM, as m96722 user execute following,


ssh <opscvm>
cd aic

Stop puppet on the Non Openstack hosts:


ansible all -i inventory/hosts -m shell -s -a 'service puppet stop'

Remove the puppet certificates for the Non Openstack hosts:


ansible all -i inventory/hosts -m shell -s -a 'find /var/lib/puppet/ssl/ -name "*.pem" -delete'

Remove the Non Openstack host certificates from Fuel:


ssh <fuelvm>
toor
cd /var/lib/fuel/keys/<fuel_env_id>/puppet_ssl_certs
Note: Replace the string "<non_openstack_fqdn>" in the statements below with the fqdn name of the certificate.

Example (provided for illustration only): find astra-aic.mtn16b.cci.att.com.pem

find -name <non_openstack_fqdn>.pem
find -name <non_openstack_fqdn>.pem -delete

On Opssimple VM, as m96722 user run the following playbooks to generate and deploy the new puppet certificates for the Non Openstack hosts:

ssh <opscvm>
cd aic

Generate and deliver non-openstack nodes certificates to LCM nodes


ansible-playbook -i inventory/ playbooks/update_fuel_puppetcerts_hosts.yml --tags "deployagtastra,deployagtfuel,deployagtjump,deployagtkvm,deployagtmaas,deployagtnagios,deployagtro" -v --sudo

ansible lcm_host -i inventory/ -m shell -s -a 'chown -R puppet:puppet /var/lib/puppet/ssl/*'

Generate hiera data for non-openstack nodes:


ansible-playbook -i inventory/ playbooks/deploy_puppet_agent_hosts.yml --tags "addhierarole" -v –sudo
Install puppet agent configuration for non-openstack nodes:
ansible-playbook -i inventory/ playbooks/deploy_puppet_agent_hosts.yml --tags "deployagtastra,deployagtfuel,deployagtjump,deployagtkvm,deployagtmaas,deployagtnagios,deployagtro" -v --sudo

Results/Descriptions: Commands executed without errors

Step 21.16: Enable RBAC for Load Balancer As A Service (LBaaS) objects by migrating Contrail
internal object called service_appliance_set
Time Line: 20 minutes

Action: On primary contrail controller, as root execute following,


python /opt/contrail/utils/chmod2.py --os-username *** --os-password *** --os-tenant-name *** --server
localhost:9100 --type service-appliance-set --name default-global-system-config:opencontrail --global-access 5

Reference defect - https://sdp.web.att.com/ccm/web/projects/Defect


%20Management#action=com.ibm.team.workitem.viewWorkItem&id=462503

Step 21.17: Disabling VNC Server Unauthenticated Access on all sites

Time Line: 30 minutes

IMPORTANT!

First check the already installed version of opsfix_200:


sudo apt-cache policy aic-opsfix-cmc-0200

If the installed version is lower than 40343, perform the rollback procedure. Only after that upgrade
the opsfix and apply the deploy procedure.

Action: For update firewall rules on compute node on all sites execute the following MOP:
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-
510_MOP_for_opsfix_0200.docx?
d=w527ee9a050494ea8a6121d82e7b00423&csf=1&web=1&e=v9m9nF
Step 21.18: Execute MOP-064 MVMD Security Scanning

Time Line: 60 minutes

Action: Execute MOP-064 MVMD Security Scanning

https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-
064_FOR_OPSFIX_0110.docx?d=w286a593bcfd9406391fbf81980534bc1&csf=1&web=1&e=GbCMJB
Revert:

Applying the MOP:


Test Plan:

If errors are encountered while running:

ansible-playbook -i inventory/ playbooks/aic_opsfix_deploy_0110.yml --tags deploy

stating errors such as:

checkdir error: cannot create /var/mvmd/ .....
File exists

please ignore them, as the files are being re-copied or recreated.

Results/Descriptions: Commands executed without errors

Step 21.19: Create database user account to support MVMD security scanning of AIC OpenStack
MySQL databases on lcm and fuel nodes

Time Line: 20 minutes.

Action: Create database user account to support MVMD security scanning:
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-598_MOP_FOR_OPSFIX-0210.docx?d=w473a1bd65c264759823af01a00a22e5b&csf=1&web=1&e=AUkOmq&isSPOFile=1
Step 21.20: Execute Step 21.4 Install DHCP from MOP-194 MaaS Update to update DHCP packages

Time Line: 20 minutes.

Action: Execute Step 21.4 Install DHCP from MOP-194 MaaS Update to update DHCP
packages https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-
MOP-194%20MaaS%20Update%201.docx?
d=wc3306ea8e0dd4b44a5fdea78e84b375e&csf=1&web=1&e=fKu9av
Step 21.sec.1: Disable libvirt daemon's listen mode on operational KVM

Time Line: 10 minutes

Action: On Opssimple VM, as m96722 user execute following,


ssh <opscvm>
cd aic
ansible-playbook -i inventory/ playbooks/disable_webvirt.yaml -e "variable_host=kvm_host"

Results/Descriptions: Commands executed without errors

Step 21.sec.2: Ubuntu password Update on KVM and MaaS Node

Time Line: 10 minutes

Action: Execute MoP https://codecloud.web.att.com/projects/ST_CCP/repos/aic-docs/browse/docs/mops/password_rotate.md?until=c60880424fa671dc9e2f27cd9ec9faa8d0afca6c&untilPath=docs%2Fmops%2Fpassword_rotate.md&at=refs%2Fheads%2Fmaster

Results/Descriptions: Commands executed without errors


Step 21.sec.3: Disable ssh-access to Fuel using passwords (only ssh-access using public/private keys
will be possible). For all Large and medium sites

Time Line: 5 minutes

Action: On OpsSimpleVM, as m96722 user execute


ssh <opscvm>
cd aic
The playbook below sets the sshd_config parameter PasswordAuthentication to no and restarts the sshd service on the Fuel node to apply the change. The full FQDN of the Fuel node can be obtained with:
grep '\[fuel_host\]' inventory/hosts -A 1 | sed -n 2p
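
For illustration only, a hypothetical run of that command against the example site used below might return:

$ grep '\[fuel_host\]' inventory/hosts -A 1 | sed -n 2p
zmtn12fuel01.zmtn12.datacenter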
ansible-playbook -i inventory/hosts playbooks/disable_password_login.yml --limit <full fqdn name of Fuel node>

Example:
$ ansible-playbook -i inventory/hosts playbooks/disable_password_login.yml --limit zmtn12fuel01.zmtn12.datacenter
PLAY [Disable password authentication] ****************************************

GATHERING FACTS ***************************************************************

ok: [zmtn12fuel01.zmtn12.datacenter]

TASK: [Changing ssh configuration and restarting ssh service] *****************

changed: [zmtn12fuel01.zmtn12.datacenter]

TASK: [restart sshd redhat] ***************************************************

changed: [zmtn12fuel01.zmtn12.datacenter]

TASK: [restart sshd ubuntu] ***************************************************

skipping: [zmtn12fuel01.zmtn12.datacenter]

TASK: [restart sshd CentOS] ***************************************************

changed: [zmtn12fuel01.zmtn12.datacenter]

PLAY RECAP ********************************************************************

zmtn12fuel01.zmtn12.datacenter : ok=4 changed=3 unreachable=0 failed=0

Results/Descriptions: Commands executed without errors

Step 21.sec.4: Block MAAS UI access from outside of LCP

Time Line: 5 minutes

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic

To block MAAS API for the outside-of-LCP world, use the following playbook
ansible-playbook -i inventory/ playbooks/iptables_maas.yml

To block MAAS proxy


ssh <uamid>@<MAASVM>
execute toor
su - ubuntu
Check MAAS proxy is running.
ps -ef | grep -i squid

To disable MAAS proxy


sudo squid3 -k shutdown

Within a couple of minutes the proxy will be disabled.

Verify the MAAS proxy is no longer running:


ps -ef | grep -i squid

If, for whatever reason, the old iptables rules need to be restored, perform the following on the
MAAS node:
sudo iptables-save | grep -v '--dport 5240' | sudo iptables-restore
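
To confirm whether the MAAS API block is currently in place, the port 5240 rules can be listed first (a minimal sketch; the exact rules will vary by site):

sudo iptables -S | grep 5240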

To enable MAAS proxy


ssh <uamid>@<MAASVM>
execute toor
su - ubuntu
sudo /usr/sbin/squid3 -N -f /etc/maas/maas-proxy.conf &

Verify the MAAS proxy is running again:


ps -ef | grep -i squid

Results/Descriptions: Commands executed without errors

Step 21.sec.5: Fix the opssimple log file permission

Time Line: 5 minutes

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
ansible-playbook -i inventory/ playbooks/fix_opssimple_log_permission.yml
Results/Descriptions: Commands executed without errors

Step 21.sec.6: Cleanup unwanted directories and files on opssimple, fuel and seed node

Time Line: 5 minutes

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
ansible-playbook -i inventory/ playbooks/preventative_maintenance.yml

After this step, all the unwanted directories and files will be moved to trash on the respective nodes
1. opssimple => /home/m96722/trash

2. fuel => /root/trash

3. seed => /home/ubuntu/trash

Results/Descriptions: Commands executed without errors

Step 21.sec.7: Fix the opssimple nginx conf permission


Time Line: 5 minutes

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
ansible-playbook -i inventory/ playbooks/fix_opssimple_nginx_conf_permission.yml

Results/Descriptions: Commands executed without errors

Step 21.sec.8: Remove devops user from the KVM hosts.

Time Line: 5 minutes

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
ansible-playbook -i inventory/ playbooks/disable_passwd_login_ubuntu_user.yml --tags devops_removal

Results/Descriptions: Commands executed without errors

Step 21.sec.9: Rename sudoers

Time Line: 5 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
sudo apt-get update
sudo apt-get install aic-opssimple-plugins-security
ansible-playbook -i inventory/ playbooks/rename_sudoers.yml

Results/Descriptions: Commands executed without errors

Step 21.sec.10: Execute opsfix 0132 for openstack sudoers file integrity

Time Line: 30 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
sudo apt-get install aic-opsfix-cmc-0132

Use only inventory/openstack while running the playbook


ansible-playbook -i inventory/openstack playbooks/aic_opsfix_backup_0132.yml
ansible-playbook -i inventory/openstack playbooks/aic_opsfix_deploy_0132.yml

Results/Descriptions: Commands executed without errors

Step 21.sec.11: Disable Ubuntu and lock the password.

Time Line: 5 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
sudo apt-get update
sudo apt-get install aic-opssimple-plugins-security
ansible-playbook -i inventory/ playbooks/disable_ubuntu_login.yml

Results/Descriptions: Verify if the playbook is executed successfully

Step 21.sec.12: Remove unauthorized public keys present on the nodes

Time Line: 5 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
sudo apt-get update
sudo apt-get install aic-opssimple-plugins-security
ansible-playbook -i inventory/ playbooks/remove_unauthorized_pubkey.yml

Results/Descriptions: Verify if the playbook is executed successfully

Step 21.sec.13: Remove m96722 90-cloud-init-user from sudoers

Time Line: 5 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
sudo apt-get update
sudo apt-get install aic-opssimple-plugins-security
ansible-playbook -i inventory/ playbooks/remove_unwanted_sudoers.yaml
Results/Descriptions: Commands executed without errors

Step 21.sec.14: Update Octane


Time Line: 5 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


ssh <fuelvm> sudo yum --disablerepo=* --enablerepo=update_repo_RC35_1_dep --enablerepo=update_repo_RC35_1_rpm --enablerepo=update_repo_RC35_1_plugins update fuel-octane -y

Results/Descriptions: Commands executed without errors

Step 21.sec.15: Update world readable file permissions

Time Line: 5 minutes

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
sudo apt-get update
sudo apt-get install aic-opssimple-plugins-security
ansible-playbook -i inventory/ playbooks/world_readable_file_permission.yml
Results/Descriptions: Commands executed without errors

Step 21.sec.16: Remove Multiple ssh keys present on the non-openstack nodes

Time Line: 5 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
sudo apt-get update
sudo apt-get install aic-opssimple-plugins-security
ansible-playbook -i inventory/ playbooks/remove_multiple_keys.yml

Step 21.sec.17: Lock Nagios user on Nagios host

Time Line: 5 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
sudo apt-get update
sudo apt-get install aic-opssimple-plugins-security
ansible-playbook -i inventory/ playbooks/lock_nagios_user.yml

Results/Descriptions: Commands executed without errors

Step 21.sec.18: Remove nova-common if present

Time Line: 5 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
ansible compute_host:controller_host -i inventory/openstack -m shell -sa "rm -rf /usr/localcw/opt/sudo/sudoers.d/nova-common"
ansible compute_host:controller_host -i inventory/openstack -m shell -sa "rm -rf /etc/sudoers.d/nova-common"

Step 21.sec.19: Remove HP upgrade manager, which will remove DISCAGNT daemon

Time Line: 20 minutes.

Action: On OpsSimpleVM, as m96722 user execute following


cd ~/aic
sudo apt-get update
sudo apt-get install aic-opssimple-plugins-security
ansible-playbook -i inventory/ playbooks/uninstall_hped.yml

Results/Descriptions: Commands executed without errors

Step 21.sec.20: Check for LCM VM's available RAM

Time Line: 3 minutes.

Action: On OpsSimpleVM, as m96722 user execute following

cd ~/aic

ansible lcm_host -i inventory/ -s -m shell -a "free -g"

If the output is less than 16GB, like the example below:


m96722@zmtn16aopsc01:~/aic$ ansible lcm_host -i inventory/openstack -s -m shell -a "free -g"

zmtn16alcma03.mtn16a.cci.att.com | success | rc=0 >>

total used free shared buffers cached

Mem: 15 15 0 0 1 5

-/+ buffers/cache: 8 7

Swap: 7 0 7

zmtn16alcma01.mtn16a.cci.att.com | success | rc=0 >>

total used free shared buffers cached

Mem: 15 14 0 0 1 5

-/+ buffers/cache: 7 7

Swap: 7 0 7

zmtn16alcma02.mtn16a.cci.att.com | success | rc=0 >>

total used free shared buffers cached

Mem: 15 14 0 0 1 5

-/+ buffers/cache: 7 8

Swap: 7 0 7

Please follow MOP-475 (https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-475_MOP_FOR_MEM_CPU_UPDATE_LCM.docx?d=wc404662ddf684f38a94cba5871f96697&csf=1&web=1&e=cul6zT)

Results/Descriptions: Commands executed without errors

Step 21.sec.21: Check if any SSH weak algorithms are supported in Non Openstack nodes

Log in to the OpsSimple VM with your AT&T UID (UAM). Refer to the Jump host for the site as per the wiki link below:
https://wiki.web.att.com/display/AICP/AIC+Production+Environments

sudo su - m96722 or sudo -iu m96722

cd /home/m96722/aic/
Check if the MAC algorithms were updated in sshd_config:

ansible all -i inventory/hosts -m shell -sa "cat /etc/ssh/sshd_config | grep -i 'MACs hmac-sha2-512,hmac-sha2-256,hmac-ripemd160'"

Check if the ciphers were updated in sshd_config:

ansible all -i inventory/hosts -m shell -sa "cat /etc/ssh/sshd_config | grep -i 'Ciphers aes256-ctr,aes192-ctr,aes128-ctr'"

Check if the KexAlgorithms were updated in sshd_config:

ansible all -i inventory/hosts -m shell -sa "cat /etc/ssh/sshd_config | grep -i 'KexAlgorithms ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256'"
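
As an additional spot check on any single node, the effective runtime configuration can also be dumped directly from sshd (a minimal sketch; requires root, and the sshd path may vary by distribution):

sudo /usr/sbin/sshd -T | grep -iE '^(macs|ciphers|kexalgorithms)'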
Note: Please ignore the Fuel node, as it is managed by Puppet and will have a different md5sum. If the md5sum does not match, please execute the MOP below.
Please follow MOP-561: (https://att.sharepoint.com/:w:/r/sites/NCOMOPS/_layouts/15/Doc.aspx?sourcedoc=%7BB677C89A-DAC8-476D-B58A-48A62D680275%7D&file=AIC-MOP-561%20MOP%20for%20OPSFIX_0206.docx&action=default&mobileredirect=true)
22. Test Plan

NOTE: All terminal input/output must be logged during the change.

Inform the CPVT team via email and the appropriate Release Management Chat to perform full PVT once
the uplift is complete. PVT should clean up artifacts.

Release Management Chat -


Large qto://meeting/q_rooms_cb12191468270764278/AIC+RC26.2%2C+RC29%2C+RC32+Large+Deployments

Release Management Chat -


Medium qto://meeting/q_rooms_cb12191534195381761/AIC+3.0.3+RC18.1+Medium+Deployments+%28see+Meeting+Attibutes+for+current+zones+%26+CWs%29

23. Backout Procedure

NOTE: All terminal input/output must be logged during the change.


To restore vLCP control plane from backup execute steps 23.1, 23.2 and 23.3.

Step 23.1: Shutdown the LCP VMs which need to be restored

Time Line: 20 minutes.

Action: Access the control plane KVM where the VM is running and shut down the VM:
ssh <attuid>@<kvm>
virsh list --all
virsh shutdown <vmhostname>
virsh undefine <vmhostname>
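
Optionally, confirm the VM is no longer defined on that KVM host (a minimal sketch):

virsh list --all | grep -i <vmhostname> || echo "VM no longer defined"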

Results/Descriptions: Control plane VM is shutdown and undefined.

Step 23.2: Storage team to restore control plane LUNs from backup (Large Sites Only)

Time Line: 60 minutes.

Action: Submit a request to restore all vLCP LUNs from backup.

For Production: AIC STORAGE OPERATIONS SUPPORT: http://ushportal.it.att.com/step2.cfm?app=3962&home=ush

Select AIC SNAPSHOT BACKUP & RESTORE and then fill out the information requested by the
form.

For Labs and SIL: http://labticketing.web.att.com/LabTicketing/(S(mmrztkbhhyshix45mbp0tmqk))/AicTrouble.aspx?Lep=0

Callout Storage team qto://meeting/q_rooms_tg79241332875301882/EMOC+-+AIC+Tier+2+Team+Chat

Also notify storage team through email DL-STORAGEOPS-REPLICATION@att.com

Step 23.3: Restore vms on the kvms from dump xml backup (Medium Sites Only)

Time Line: 30 minutes.

Action: On OpsSimpleVM execute -

ssh <attuid>@<opscvm>

sudo su - m96722 OR sudo -iu m96722

cd /home/m96722/aic/

Copy the xml files and define the vms using xml files.
ansible-playbook -i inventory/ playbooks/restore_vms.yml

Results/Descriptions: Commands executed without errors


Step 23.4: Startup all LCP VMs

Time Line: 20 minutes.

Action: Access each control plane KVM and start up the VMs. Start the DBNG nodes first, then the MOSC
controller nodes, and then the rest of the nodes.
ssh <attuid>@<kvm>
virsh list
virsh start <vmhostname>

Results/Descriptions: All control plane VMs are running.

Step 23.5: To rollback Nagios, rerun the playbooks below once steps 23.1, 23.2 and 23.3 have been completed. These
playbooks will re-install the previous RC packages.

Time Line: 20 minutes

Action: On OpsSimple VM, as m96722 execute -


ssh <opssimplevm>
ansible-playbook -i inventory playbooks/deploy_lma.yml
ansible-playbook -i inventory playbooks/deploy_nagios_server.yml
ansible-playbook -i inventory playbooks/deploy_nagios_agent.yml

Results/Descriptions: Commands executed without errors

Step 23.6: To rollback Astra, rerun the playbooks below once steps 23.1, 23.2 and 23.3 have been completed. These
playbooks will re-install the previous RC packages.

Time Line: 20 minutes

Action: On OpsSimple VM, as m96722 execute -


ansible-playbook -i inventory playbooks/deploy_astra_identity.yml
ansible-playbook -i inventory playbooks/deploy_astra_controller.yml
ansible-playbook -i inventory playbooks/deploy_astra_compute.yml

Results/Descriptions: Commands executed without errors


Step 23.7: Restore OpsSimple Configuration (Medium Sites Only)

Time Line: 20 minutes

Action: On Opssimple VM, as m96722 user import the original site yaml created in Step 17.b.15.
ssh <opscvm>
cd /home/m96722/install
./resetdb.sh
python3 /var/www/aic_opssimple/backend/manage.py import /home/m96722/before_rc.yaml yaml

Results/Descriptions: Commands executed without errors

Step 23.8: Restore Fuel from backup (Medium Sites Only)

Follow this step only if fuel needs to be restored from backups taken during preupgrade checks. This
step is not necessary if the control plane is restored from backups.

Time Line: 20 minutes

Action: From the Fuel VM, execute the following as the m96722 user -


octane --debug -v fuel-repo-restore --from repos_and_images.tar.gz
octane --debug -v fuel-restore --from master_node_state.tar.gz --admin-password admin
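
Before running the restore, it may help to confirm that the backup archives taken during the pre-upgrade checks are present in the working directory (a minimal sketch; adjust the path if the backups were stored elsewhere):

ls -lh repos_and_images.tar.gz master_node_state.tar.gz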

Results/Descriptions: Fuel is restored from backup.

Step 23.9: Enable Ubuntu and unlock the password.

Time Line: 3 minutes

Action: On Opssimple VM, as m96722 user execute the following


cd ~/aic
sudo apt-get update
sudo apt-get install aic-opssimple-plugins-security
ansible-playbook -i inventory/ playbooks/revert_disable_ubuntu_login.yml

Results/Descriptions: Verify if the playbook is executed successfully

24. POST Activities (during maintenance window)

To comply with RIM and audit findings, every MOP must include steps to remove backups and
artifacts created during the deployment of that MOP. The removal may take place at a later date to
allow for potential backouts. Any such artifacts must be listed in the Post Change Activities section,
together with instructions for scheduling their future removal.

Any backups and artifacts created during MOP execution that will not be needed for backout
should be removed in the POST implementation activities section, executed before the end of the
change window.

Permissions for OpenStack configuration file backups should be set to 640, and other files to 640 or
600. The rule of least privilege applies, but the DE must retain access to the files for restore/backout
purposes. The owning group should not be one that non-SA and non-DE users belong to.
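
As an illustration of the intent above (hypothetical path and group names; adapt to the actual backup artifacts and to a group containing only SA/DE users):

chmod 640 /var/tmp/<backup_artifact>
chown root:<sa_de_group> /var/tmp/<backup_artifact>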

NOTE: All terminal input/output must be logged during the change.

Check 24.1: Perform post upgrade verification

Time Line: 1 hour.

Action: Execute the post-upgrade checklist - https://codecloud.web.att.com/projects/ST_CCP/repos/aic-docs/browse/docs/dg/common/postupgradeverification.md?at=refs%2Fheads%2F3.0.3_RC35_stable

Results/Descriptions: All post-upgrade checks are successful.

Check 24.2: Check Nodes Status

Time Line: 5 minutes.


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor
fuel node shows "ready". No nodes in discover or error. Online field should be 1.

Results/Descriptions: Commands executed without errors and status is ready

Check 24.3: Check Fuel Environment

Time Line: 5 minutes.


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor
fuel env shows the environment "operational" (dummy_env can be ignored)

Results/Descriptions: Commands executed without errors and environment operational

Check 24.4: Perform Contrail verification


Time Line: 1 hour.

Action: Execute contrail checklist - https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=rhhr6W

Results/Descriptions: All post-upgrade checks are successful.

Check 24.5: Check if the ruby package is upgraded

Time Line: 5 minutes.


Action: Validate that the ruby package is upgraded to 2.0.0.484-1ubuntu2.13.

On OpsSimple VM:
cd /home/m96722/aic
ansible all -i inventory/openstack -m shell -sa " dpkg -l | grep ruby2"
Results/Descriptions: All the OpenStack nodes should be upgraded to version 2.0.0.484-1ubuntu2.13

Check 24.6: Cleanup upgrade related artifacts

Time Line: 5 minutes.

Action: Remove upgrade related artifacts created on OpsSimple, Fuel and Contrail Controller VMs.
Connect to OpsSimple node with your personal attuid ssh {opsc IP} and become m96722 using sudo -iu m96722, then execute:
rm -rf /var/tmp/xml_dump/
rm -rf /var/tmp/aic_opsfix_*/
Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor , then
execute:
rm -rf /var/log/fuelbackuppre/nailgun.dump.gz
From OpsSimple host as m96722, Connect to one of the Contrail Controller VM and become root
using toor , then execute:
rm -rf /var/tmp/contrailbackuppre

Results/Descriptions: Commands executed without errors

Check 24.7: Request storage operations to backup of all vLCP LUNs (Large Sites Only)

Time Line: 1 hour.

Action: Submit request to backup of all vLCP LUNs in site.


For Production: AIC STORAGE OPERATIONS SUPPORT: http://ushportal.it.att.com/step2.cfm?
app=3962&home=ush

Select AIC SNAPSHOT BACKUP & RESTORE and then fill out the information requested by the
form.

For Labs and SIL: http://labticketing.web.att.com/LabTicketing/(S(mmrztkbhhyshix45mbp0tmqk))/AicTrouble.aspx?Lep=0

Check 24.8: Backup xml of vms on the kvms

Time Line: 30 minutes.

Action: On OpsSimpleVM execute -

ssh <attuid>@<opscvm>

sudo su - m96722 OR sudo -iu m96722

cd /home/m96722/aic/

ansible-playbook -i inventory/ playbooks/infraxml_dump.yml

Results/Descriptions: Commands executed without errors

Check 24.9: Delete Scheduled Downtime to Enable Nagios alerts

NOTE: Repeat steps for both nagios01 and nagios02.

Time Line: 5 minutes.

Action: Connect to the Opssimple node using your personal attuid, switch to m96722, and then connect
to one of the Nagios hosts and become root
ssh {opsc IP}
sudo -iu m96722
ssh {nagios IP}
toor

Time Line: 1 hour.

Action: Re-enable the Nagios dashboard

nagios-del-downtime-host.sh [HOST exactly as in the cfg file for nagios]

Example -

/usr/local/bin/nagios-del-downtime-zone.sh pdk5

Note: After enabling Nagios alerts, notify the Release Management Chat via the Q chat specified in
Section 19.
Results/Descriptions: Commands executed without errors

25. Post Maintenance Work

NOTE: All terminal input/output must be logged during the change.

Check 25.1: Remove update graphs.

Time Line: 5 minutes.


Action: Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor and
execute:
fuel2 graph delete --env $envid --type release_candidate
fuel2 graph delete --env $envid --type start_puppet
fuel2 graph delete --env $envid --type stop_puppet
fuel2 graph delete --env $envid --type update_config_db
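
In the commands above, $envid is the Fuel environment ID. If it is not already set in the shell, it can be taken from the environment listing first (a minimal sketch, assuming a single operational environment):

fuel env          # note the id of the operational environment
envid=<id from the output above>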

Results/Descriptions: Commands executed without errors
