IPL1A RC35 Upgrade Artifacts
Select AIC SNAPSHOT BACKUP & RESTORE and then fill out the information requested by the
form.
Note - initiate the backup:
sudo python /usr/lib/python2.7/dist-packages/cfgm_common/db_json_exim.py --export-to backup.json
chmod -Rf 640 /var/tmp/contrailbackuppre/
Results/Descriptions:
A backup tar file is created per Contrail node in the backup folder.
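A quick sanity check (a sketch; the path is taken from the backup commands above) to confirm the backup files exist with the restricted 640 permissions:
~~
ls -l /var/tmp/contrailbackuppre/
~~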
~~
ssh @
cd /home/m96722/aic/
Take a dump of the XML files of all the VMs and copy them to the jump host.
If any of the checkers report failure, escalate to Contrail Ops. Do NOT proceed with MOP execution until data inconsistencies are remediated.
Step 17.b.7 Pre-upgrade checks
https://codecloud.web.att.com/projects/ST_CCP/repos/aic-docs/browse/docs/dg/preupgrade.md
For large sites, a minimum of 20 GB of free space is recommended for the backup.
Results/Descriptions:
In the hosts tab, check that all the nodes except the non-openstack nodes are pointing at the
previous release_candidate.
The only nodes which can be pointing at production are the non-openstack nodes.
Results/Descriptions:
Remove LCM nodes from Foreman:
In Hosts -> All hosts, find and select all aic-influxdb, aic-elasticsearch and aic-alerting nodes
Verify that the /etc/apt/sources.list contents match the SIL/Production or IST site configuration from the links below:
Results/Descriptions:
Make sure the plugins below are updated from the latest release repo by checking with the apt-cache policy command (an example check is shown after the MOP-355 link below).
Note: aic-lcm packages may be at a higher version; they can be updated separately to the latest versions via MOP-355 (aic-opssimple-plugins-aiclcm, aic-lcm):
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-355-Update_aic_lcm_packages.docx?d=w787973068da2491288f93d6b4a0ea17e&csf=1&web=1&e=QKCsYL
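For illustration, a hedged example of the apt-cache policy check (package names taken from the note above; extend to the full plugin list for this RC):
~~
apt-cache policy aic-opssimple-plugins-aiclcm aic-lcm
~~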
Step 17.b.13: Create the Fuel dummy_env prior to export/import Opssimple_site if it does not
already exist
If dummy_env was not present in the output of the above command then execute the
commands below to create the Fuel dummy_env.
fuel release
fuel env create --name dummy_env --release <Release id from above command>
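A minimal sketch combining the check and the create (assumes the standard fuel CLI shown above; <release_id> is the id reported by fuel release):
~~
fuel env | grep -q dummy_env || fuel env create --name dummy_env --release <release_id>
~~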
NOTE: from RC24 - A new certificate for the Foreman UI should exist (foreman-aic/foreman-aic.includesprivatekey.pem) with Common Name: puppet.<site>.cci.att.com and Alternative Names: lcma01.<site>.cci.att.com, lcma02.<site>.cci.att.com, lcma03.<site>.cci.att.com
cd /home/m96722/aic/files/certificates
.....
DNS:zmtn11lcma01.mtn11.cci.att.com, DNS:zmtn11lcma02.mtn11.cci.att.com,
DNS:zmtn11lcma03.mtn11.cci.att.com, DNS:puppet.mtn11.cci.att.com
Results/Descriptions: for MTN11 - Common Name: puppet.mtn11.cci.att.com and Alternative Names: DNS:zmtn11lcma01.mtn11.cci.att.com, DNS:zmtn11lcma02.mtn11.cci.att.com, DNS:zmtn11lcma03.mtn11.cci.att.com, DNS:puppet.mtn11.cci.att.com
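One way to double-check the CN and SANs of the deployed certificate (a sketch; assumes openssl can read the certificate block in the combined PEM file):
~~
cd /home/m96722/aic/files/certificates
openssl x509 -in foreman-aic/foreman-aic.includesprivatekey.pem -noout -subject
openssl x509 -in foreman-aic/foreman-aic.includesprivatekey.pem -noout -text | grep -A1 'Subject Alternative Name'
~~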
Step 17.b.15: Verify Astra logging (Only required for Large sites with Astra.)
Results/Descriptions: Verify that use_syslog is set to 1 for large sites that have an Astra node, and to 0 for sites that do not have Astra installed.
Check /home/m96722/before_rc.yaml in case any errors occurred. If any errors are found then stop
deployment and escalate to the appropriate release management chat specified in section 19.
./resetdb.sh
python3 /var/www/aic_opssimple/backend/manage.py import /home/m96722/before_rc.yaml yaml
cd /var/www/aic_opssimple/backend/
./manage.py export -f /home/m96722/latest.yaml <siteid> yaml
where siteid represents the site short code, obtained from the environment details wiki.
Merge the site yaml prepared in step 6 with the above file and verify the following:
To ensure the right list of repos: replace the Repo section in the opssimple_site.yaml with the Repo section of the Repo golden configuration.
To ensure the right list of packages: the list of packages for a release candidate will be provided by the CI/CD team as a yaml file. The *: present in the at&t plugin must be replaced by *: latest
Ensure that any new parameter has been added to the att-plugin, lcm-plugin sections.
Ensure that node_env in the lcm_plugin section is pointing at 3_0_3_RC<35>_<stable|prod>
If node_env is pointing at version 3.0.3_RC15_<stable|prod> or older, ensure that r10k_do_init is set to true in the lcm_plugin section.
If node_env is pointing at version 3.0.3_RC16_<stable|prod> or newer, ensure that r10k_do_init is set to false in the lcm_plugin sections:
~~
r10k_do_init:
  description: If r10k should create puppet env after deployment
  label: Create puppet environment (R10k)
  restrictions:
    - action: hide
      condition: settings:fuel-plugin-lcm.metadata.enabled != true
  type: checkbox
  value: false
  weight: 135
~~
Ensure that for each compute the ignore flag is set to false.
Ensure that all passwords have been decrypted. If you see encrypted passwords, you can decrypt them with the aiclcm tool (aic-lcm repository):
~~
Please check the output of the commands above. If you see any issues, resolve them before you proceed further. You can find an example of the output below (truncated):
cd /var/www/aic_opssimple/backend/
./manage.py import /home/m96722/latest.yaml yaml
Export the ops file which was imported earlier (RC35) using the command below.
python3 /var/www/aic_opssimple/backend/manage.py export -f /home/m96722/ops_redis.yaml <fuelenvname>
yaml
Export the ops file which was imported earlier (RC35) using the command below.
python3 /var/www/aic_opssimple/backend/manage.py export -f /home/m96722/ops_cmha.yaml <fuelenvname>
yaml
Action: On the OpsSimple VM, as m96722, execute the following to back up, delete, and regenerate the foreman.ini file:
sudo apt-get install python-requests-oauthlib
cp ~/aic/foreman_api/foreman.ini ~/aic/foreman_api/foreman.ini.bak.$(date +"%Y-%m-%dT%H%M")
rm ~/aic/foreman_api/foreman.ini
ansible-playbook -i inventory/ playbooks/deploy_puppet_agent_hosts.yml --tags foremanini
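A quick sanity check (sketch) that the backup exists and the file was regenerated:
~~
ls -l ~/aic/foreman_api/foreman.ini ~/aic/foreman_api/foreman.ini.bak.*
~~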
1. Access the ATS Bulletin site at the following URL: ATS Bulletin
2. Click on the List/Search button
3. Click on the "Search" Menu
4. Determine a key word associated with the activity that you are performing and type it into the Keyword: (Hint) field
5. Click on the Submit Query button
6. Review each search result bulletin by clicking on the bulletin number to determine if any of the bulletins have warnings/actions associated with the activity that you are implementing
7. Associated bulletins are listed in the following table:
Bulletins Table
Layer 3 - AIC
General AIC
AIC Deployment Upgrades - Shashank Gupta sg944h 972-415-3973
AIC Platform Engineering - Hari Om Singh hs571j 469-731-6556
Release Management Chat - Large: qto://meeting/q_rooms_cb12191468270764278/AIC+RC26.2%2C+RC29%2C+RC31+Large+Deployments
Release Management Chat - Medium: qto://meeting/q_rooms_cb12191534195381761/AIC+3.0.3+RC18.1+Medium+Deployments+%28see+Meeting+Attibutes+for+current+zones+%26+CWs%29
Permissions for Openstack configuration file backups should be set to 640, and other files 640 or
600. The rule of least privilege should apply, but we need to make sure the DE has access to the files
for restore / backout purposes. This should not be a group that non-SA and non-DE users would be
in.
NOTE: All terminal input/output must be logged during the change.
For work covering the whole zone: (Please run below commands on Both Nagios servers
ngos01,ngos02 for Large Environments)
Action: Connect to Opssimple node using your personal attuid, switch to m96722, and then connect
to one of the Nagios hosts and become root
ssh {opsc IP}
sudo -iu m96722
ssh {nagios IP}
toor
/usr/local/bin/nagios-DE-sched-downtime-zone.sh [zone NAME] [HOURS] [Comment]
Example: Say I have an approved CR for PDK5 starting shortly and expect it to take 8 hrs: log onto zpdk5ngos01.pdk5.cci.att.com
sudo /usr/localcw/bin/eksh v
/usr/local/bin/nagios-DE-sched-downtime-zone.sh pdk5 8 CR 4342837 // RM Notification would have CR details
and on nagios GUI (downtime link, lower left under ‘System’ section)
Once the Schedule downtime is run on both Nagios servers, verify all servers in the zone have
expected downtime using the script zone-downtime-validation.sh and also login to Nagios UI and
validate downtime as per below screenshot.
Example of script zone-downtime-validation.sh usage for medium and large sites: This script is in /usr/local/bin, should be run as root, and requires 2 inputs from the command line. The first is the 'zone' (make sure it is the name Nagios recognizes; for example, if you used 'IPL1b' it would fail to find that since nagios only knows ipbin1b). The second value is the comment you expect to see from setting the downtime.
root@dsvtxvcngos03.infra.aic.att.net:/home/sc3998# ./zone-downtime-validation.sh ipbin1b "IPBIN1B RC35 Upgrade"
NOT all the zone is in downtime. Examine output above for WARNING messages
NOTE: If the entire zone is not in downtime, re-run the nagios-DE-sched-downtime-zone.sh script again
NOTE: The hostname MUST match exactly as defined in the nagios /etc/nagios3/conf.d/.cfg file for the nagios3 version and in /usr/local/nagios/etc/conf.d/host.* for nagios4
NOTE: The Release Management Chat is to be notified via Q chat of the duration for which Nagios alerts will be disabled
20.1.a.1: The following additional validation needs to be done using the nagios GUI by the DE. Time Line: 5 minutes. Action: Log onto the nagios web GUI, select hosts, and change the server count from '100' in the dropdown to 'all'. Verify downtime is added for all hosts; if downtime does not exist on any nodes, fix it by rerunning the script in Step 20.1.a to schedule downtime on those nodes. Validate downtime as per the screenshot below.
Action: During the upgrade, if a compute reboot is required (as part of the change window), the DE needs to create a Tower ticket for the complete CR duration using the steps below.
Confirm the I am the Primary Contact statement. Default is Yes.
Answer the Is there Associated Capital Labor? question. Default is No.
Add Attachment if needed (optional).
Submit.
Feel free to Q or email me (ln8367), to confirm I received your request, or for assistance in
completing the form.
Inputting a request in TOWER will result in an iTrack Issue being created. TOWER is simply the front-door issue entry tool.
Step 20.1: Fix ssh config for admin user on Fuel node
Results/Descriptions: This will switch the User from fuel to m96722 in the ssh config (needed for MOP automation).
https://codecloud.web.att.com/projects/ST_CCP/repos/aic-docs/browse/docs/mops/MOP_LCM_certs.md?until=cb27d032aae46bf4aaa9249a7d03605b1af96816&untilPath=docs%2Fmops%2FMOP_LCM_certs.md
On Fuel VM, logged in using ATTUID, execute following to check status of puppet agents on all
nodes:
~~
for i in `sudo /usr/localcw/bin/eksh -c "fuel nodes" | grep ready | awk '{print $5}'`
do
  echo -n "$i "
  ssh -qt -o StrictHostKeyChecking=no $i 'sudo /usr/localcw/bin/eksh -c "sudo service puppet status"'
done
~~
Action: On the OpsSimple VM, log in using UAM, switch to m96722, and execute the following to update weak ciphers (security requirement).
cd aic
./setup.sh disable_ops_weakciphers
Step 20.5: Update Fuel repo
prod.repo stable.repo
Results/Descriptions:
NOTE: If you run into issues with the command below, kill the process and rerun it to remove any duplicate designate packages, then continue.
$ sudo yum --disablerepo=* --enablerepo=update_repo_RC35_1_dep --enablerepo=update_repo_RC35_1_rpm --enablerepo=update_repo_RC35_1_plugins update -y --exclude fuel-octane
Results/Descriptions:
Results/Descriptions:
fuel env
export envid=<envid>
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/fuel/hiera-preparation.yaml
If all of these are not present, please re-execute the command above.
The exit code could be either 0 or 2.
Results/Descriptions: Verify the latest packages are installed. The list of fuel packages is listed here: fuelpluginlist.yaml
Check for the Contrail Alarm Metadata by downloading the settings file and verifying it was updated. If the Contrail alarm metadata was not updated, re-run the python command to update the metadata.
fuel env
export envid=<envid>
fuel settings --env ${envid} --download
fgrep alarm_list $(pwd)/settings_${envid}.yaml
Results/Descriptions: Verify alarm and other plugin metadata is in sync by examining settings_${envid}.yaml.
Results/Descriptions: Verify update repository and update plugin returns a result code of 200.
cd /home/m96722/aic
ansible-playbook -i inventory/ playbooks/prepare_authorized_keys.yml
Verify that a file named authorized_keys got created within /home/m96722/aic/files/env_/fuel-client/. It will be a copy of the id_rsa.pub from the /home/m96722/.ssh folder, with a "from" restriction for the Fuel, Opssimple & Seed nodes added.
cd ~/aic/files/env_<sitename>/fuel-client
python ./script.py restrict_os_user
The above command will push the authorized_keys of the operator_user (m96722) to Fuel.
The command output should not contain any "key is missing" lines.
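A quick check (sketch; the path pattern is taken from the step above) that the pushed key carries the expected from= restriction:
~~
grep '^from=' /home/m96722/aic/files/env_<sitename>/fuel-client/authorized_keys
~~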
Step 20.12: Install opsfix0064 and copy diff_struct.py to fuel node.
Install opsfix0064:
~~
sudo su - m96722   # OR: sudo -iu m96722
sudo apt-get update
sudo apt-get install aic-opsfix-cmc-0064
~~
21. Implementation
Step 21.1: Remove the StackLight Collector Plugin. (Only required for Large sites, not needed for medium sites.)
The Percona upgrade is a very disruptive procedure. To prevent mysql cluster failures, take these additional steps:
Log in to all 3 dbng (lcm) nodes and update the mysql configuration:
add "innodb_fast_shutdown = 0" to the '[mysqld]' section of /etc/mysql/my.cnf
If you see any errors - stop the upgrade and create a ticket for the tiger team for investigation.
Update the pacemaker timeout. Log in to all 3 dbng (lcm) nodes and update the pacemaker configuration:
Open the '/usr/lib/ocf/resource.d/fuel/mysql-wss' file and find this line in the mysql_stop() function:
proc_stop "${OCF_RESKEY_pid}" "mysqld.*${OCF_RESKEY_datadir}" SIGTERM 5 $(( $shutdown_timeout/5 ))
Save the changes. Please note that 1205 is 2 hours of trying to gracefully terminate the mysql process. Be patient.
Check that there are no mysql or mysql-related processes on the node (on all 3 nodes):
ps aux | grep mysql
If not, execute:
ln -sf /root/.my.localhost.cnf /root/.my.cnf
If you are not able to log in, stop the upgrade and ask the tiger team to investigate.
- id: stop_mysql_cluster
  type: shell
  role: ['primary-lcm']
  version: 2.1.0
  required_for: [check-clusters-status]
  parameters:
    retries: 3
    interval: 20
    timeout: 300
- id: stop_rabbit_cluster
  type: shell
  role: ['primary-aic-dbng']
  version: 2.1.0
  requires: [stop_mysql_cluster]
  parameters:
    retries: 3
    interval: 20
    timeout: 300
- id: check-clusters-status
  version: 2.1.0
  type: shell
  cross-depends:
    - name: stop_mysql_cluster
  parameters:
    cmd: sleep 30; crm resource list | grep Stopped:| grep $(hostname)
    retries: 12
    interval: 30
    timeout: 1800
    strategy:
      type: one_by_one
- id: update_config
  type: shell
  version: 2.1.0
  requires: [check-clusters-status]
  parameters:
    cmd: |
      sed -i '/myisam_recover/d' /etc/mysql/my.cnf
      data-home-dir
    retries: 3
    interval: 20
    timeout: 180
- id: update_packages
  type: shell
  version: 2.1.0
  requires: [update_config]
  parameters:
    cmd: |
      cluster-common-5.6 -y
    retries: 3
    interval: 20
    timeout: 3600
If successful, stop MySQL on this node and execute the same for the next node:
service mysql stop
If you see any errors - stop the upgrade and create a ticket for the tiger team for investigation.
If all 3 nodes have been successfully upgraded, start the Mysql cluster:
If you are not able to log in, stop the upgrade and ask the tiger team to investigate.
# The Percona upgrade is a very disruptive procedure. After this procedure we have to log in to the dbng and lcm nodes and check the cluster status.
Log in to any dbng and lcm node and log in to mysql. If mysql is not working on this node, use another dbng (lcm) node. A health-check sketch is shown below.
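A minimal sketch of a Galera/Percona cluster health check (standard wsrep status variables; adjust credentials to the site's convention):
~~
mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
mysql -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"
~~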
Results/Descriptions:
Step 21.2.1: Run glance db sync
Results/Descriptions:
Action:
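The MOP leaves the exact command to the Action above; a commonly used invocation is sketched here as an assumption (run on a glance controller node and confirm against the site's standard procedure):
~~
su -s /bin/sh -c "glance-manage db_sync" glance
~~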
Step 21.3: Designate DDNS service removal and cron update (Large Sites only)
designate server-list
+--------------------------------------+-----------------------+
| id | name |
+--------------------------------------+-----------------------+
| 7abc8cc1-031f-4c66-bab2-8bb0bd3ae8f6 | ns1.zone.tci.att.com. |
+--------------------------------------+-----------------------+
Check the designate server existence (only for Large sites): ansible-playbook -i inventory/ playbooks/openrc_automation/get_designate_server_list.yml
If the above command results in a null output, please create a designate server (a sketch is shown below).
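A sketch of the create command (designate v1 CLI, consistent with the server-list output above; the name shown is illustrative):
~~
designate server-create --name ns1.zone.tci.att.com.
~~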
~~
facter -p environment
puppet agent --test --verbose --debug --trace --evaltrace --summarize --detailed-exitcodes --noop
cat /var/log/puppet/puppet.log
~~
Check the result in the console, in /var/log/puppet/puppet.log and in Foreman; verify that this is not an "empty catalog", check that Foreman is green, and check that the last report for that node in Foreman is around 1 minute ago. If something wrong is detected (unwanted mechid changes, unwanted timer changes, etc.), the DE needs to go back and adjust the opssimple_site.yaml.
The following command will force the puppet agent to run immediately.
~~
service puppet start ; sleep 5 && kill -USR1 `ps aux | grep 'puppet agent' | grep -v configuration | grep -v grep | awk '{ print $2 }'`
cat /var/log/puppet/puppet.log
~~
Check /var/log/puppet/puppet.log
Results/Descriptions: Commands executed without errors and that nothing wrong detected in log
or Foreman
Step 21.5: Re-enable Puppet Agents.
Time Line: 20 minutes
The puppet agents are currently in a suspended state. The following graph will re-enable them. The
"actual" run of the puppet agents is spread over a 30 min window.
Action: Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor
Results/Descriptions: On the Fuel VM, logged in using your ATTUID, execute the following to verify that puppet agents are started on all nodes:
~~
for i in `sudo /usr/localcw/bin/eksh -c "fuel nodes" | grep ready | awk '{print $5}'`
do
  echo -n "$i "
  ssh -qt -o StrictHostKeyChecking=no $i 'sudo /usr/localcw/bin/eksh -c "sudo service puppet status"'
done
~~
Step 21.6: GSTools Update
Action:
Action:
NOTE: Please wait for 10 mins before executing the next steps
ansible all -i inventory/ -m shell -sa "/usr/adm/best1_default/bgs/bin/best1collect.exe -I noInstance -B /usr/adm/best1_10.7.00"
cd aic
Step 21.8: Revert Astra config on openstack nodes (Only required for Large sites, not needed for
medium sites.)
Execute AIC-MOP-611
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/_layouts/15/Doc.aspx?sourcedoc=%7B91A7FE22-C104-45E9-BE3F-F2853478AF9E%7D&file=AIC-MOP-611_MOP_FOR_OPSFIX_0211.docx&action=default&mobileredirect=true&cid=f7fc2067-2fc9-417d-be18-c4bfe646467f
Results/Descriptions:
Execute AIC-MOP-601:
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/_layouts/15/Doc.aspx?sourcedoc=%7B0BA3044B-0FA8-4A4C-A54A-51FC62FA80F1%7D&file=AIC-MOP-601_MOP_FOR_OPSFIX_0212.docx&action=default&mobileredirect=true
ansible-playbook -i inventory playbooks/deploy_nagios_server.yml
ansible-playbook -i inventory playbooks/deploy_nagios_agent.yml
To upgrade nagios to the Nagios Core 4.4.6 version (applicable only for the RC 29 release), execute:
ansible nagios_host -i inventory/ -m shell -s -a "sudo apt-get install --upgrade nagios4"
Results/Descriptions:
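A sketch of confirming the upgraded package version, reusing the same ansible pattern as above:
~~
ansible nagios_host -i inventory/ -m shell -s -a "dpkg -l nagios4 | grep 4.4.6"
~~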
Removal of duplicate pacemaker log rotation config file - AICDEFECT-901:
ansible -i inventory --limit tmdg_hosts -s -m shell -a "rm /etc/logrotate.d/pacemaker"
Step 21.10: Restore Fuel config to normal state
d) Propagate changes through Fuel, Connect to Fuel node with your personal attuid ssh {fuel
IP} and become root using toor
fuel env
export envid={envid}
On the OpsSimple VM, as m96722, please execute the following to check that *: latest has changed to *: present :
ssh <opscvm>
cd aic/
Results/Descriptions: Once graph execution is complete, on the Fuel VM, logged in using your ATTUID, execute the following to check the status of puppet agents on all nodes:
~~
for i in `sudo /usr/localcw/bin/eksh -c "fuel nodes" | grep ready | awk '{print $5}'`
do
  echo -n "$i "
  ssh -qt -o StrictHostKeyChecking=no $i 'sudo /usr/localcw/bin/eksh -c "sudo service puppet status"'
done
~~
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-549_MOP_to_remove_python3_from_fuel.docx?d=w910975145346489093628711b7c70f14&csf=1&web=1&e=N9Jwot
Step 21.12: Update MongoDB auth-scheme version to 5
Example (provided for illustration only):
find -name astra-aic.mtn16b.cci.att.com.pem
find -name <non_openstack_fqdn>.pem
find -name <non_openstack_fqdn>.pem -delete
On the Opssimple VM, as the m96722 user, run the following playbooks to generate and deploy the new puppet certificates for the Non-Openstack hosts:
ssh <opscvm>
cd aic
Step 21.16: Enable RBAC for Load Balancer As A Service (LBaaS) objects by migrating Contrail
internal object called service_appliance_set
Time Line: 20 minutes
IMPORTANT!
If the installed version is lower than 40343, perform the rollback procedure. Only after that should you upgrade the opsfix and apply the deploy procedure.
Action: To update firewall rules on the compute nodes on all sites, execute the following MOP:
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-510_MOP_for_opsfix_0200.docx?d=w527ee9a050494ea8a6121d82e7b00423&csf=1&web=1&e=v9m9nF
Step 21.18: Execute MOP-064 MVMD Security Scanning
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-064_FOR_OPSFIX_0110.docx?d=w286a593bcfd9406391fbf81980534bc1&csf=1&web=1&e=GbCMJB
Revert:
File exists
Step 21.19: Create database user account to support MVMD security scanning of AIC OpenStack
MySQL databases on lcm and fuel nodes
Action: Execute Step 21.4 Install DHCP from MOP-194 MaaS Update to update DHCP packages:
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-194%20MaaS%20Update%201.docx?d=wc3306ea8e0dd4b44a5fdea78e84b375e&csf=1&web=1&e=fKu9av
Step 21.sec.1: Disable libvirt daemon's listen mode on operational KVM
Example:
$ ansible-playbook -i inventory/hosts playbooks/disable_password_login.yml --limit zmtn12fuel01.zmtn12.datacenter
PLAY [Disable password authentication] ****************************************
ok: [zmtn12fuel01.zmtn12.datacenter]
changed: [zmtn12fuel01.zmtn12.datacenter]
changed: [zmtn12fuel01.zmtn12.datacenter]
skipping: [zmtn12fuel01.zmtn12.datacenter]
changed: [zmtn12fuel01.zmtn12.datacenter]
To block MAAS API for the outside-of-LCP world, use the following playbook
ansible-playbook -i inventory/ playbooks/iptables_maas.yml
If, for whatever reason, old iptables rules need to be restored, you can perform the following on the
maas node
sudo iptables-save | grep -v '--dport 5240' | sudo iptables-restore
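Either way, a quick spot check (sketch) of whether the port 5240 rule is currently present:
~~
sudo iptables -L -n | grep 5240
~~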
Step 21.sec.6: Cleanup unwanted directories and files on opssimple, fuel and seed node
After this step, all the unwanted directories and files will be moved to trash on the respective nodes
1. opssimple => /home/m96722/trash
Step 21.sec.10: Execute opsfix 0132 for openstack sudoers file integrity
Step 21.sec.16: Remove Multiple ssh keys present on the non-openstack nodes
Step 21.sec.19: Remove HP upgrade manager, which will remove DISCAGNT daemon
cd ~/aic
Mem: 15 15 0 0 1 5
-/+ buffers/cache: 8 7
Swap: 7 0 7
Mem: 15 14 0 0 1 5
-/+ buffers/cache: 7 7
Swap: 7 0 7
Mem: 15 14 0 0 1 5
-/+ buffers/cache: 7 8
Swap: 7 0 7
Step 21.sec.21: Check if any SSH weak algorithms are supported in Non Openstack nodes
Login to the Opssimple VM as your att uid (UAM). Refer to the Jump host for sites as per the wiki link below: https://wiki.web.att.com/display/AICP/AIC+Production+Environments
cd /home/m96722/aic/
Check if the MAC algorithms were updated in sshd_config:
ansible all -i inventory/hosts -m shell -sa "cat /etc/ssh/sshd_config | grep -i 'MACs hmac-sha2-512,hmac-sha2-256,hmac-ripemd160'"
Check if the ciphers were updated in sshd_config:
ansible all -i inventory/hosts -m shell -sa "cat /etc/ssh/sshd_config | grep -i 'Ciphers aes256-ctr,aes192-ctr,aes128-ctr'"
Check if the KexAlgorithms were updated in sshd_config:
ansible all -i inventory/hosts -m shell -sa "cat /etc/ssh/sshd_config | grep -i 'KexAlgorithms ecdh-sha2-nistp256,ecdh-sha2-nistp384,ecdh-sha2-nistp521,diffie-hellman-group-exchange-sha256'"
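An alternative spot check on a single node (sketch; requires root) that prints the effective sshd values directly:
~~
sudo sshd -T | grep -Ei '^(ciphers|macs|kexalgorithms)'
~~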
Note: Please ignore fuel as it is managed by puppet and will have a different md5sum. If the md5sum doesn't match, please execute the MOP below.
Please follow MOP-561: https://att.sharepoint.com/:w:/r/sites/NCOMOPS/_layouts/15/Doc.aspx?sourcedoc=%7BB677C89A-DAC8-476D-B58A-48A62D680275%7D&file=AIC-MOP-561%20MOP%20for%20OPSFIX_0206.docx&action=default&mobileredirect=true
22. Test Plan
Inform the CPVT team via email and the appropriate Release Management Chat to perform a full PVT once the uplift is complete. PVT should clean up artifacts.
Action: Access control plane KVM where the VM is running and shutdown VM
ssh <attuid>@<kvm>
virsh list --all
virsh shutdown <vmhostname>
virsh undefine <vmhostname>
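Note: virsh shutdown is asynchronous; before running virsh undefine, it is worth confirming the VM is actually shut off (a sketch):
~~
virsh list --all | grep <vmhostname>   # expect the state "shut off" before undefining
~~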
Step 23.2: Storage team to restore control plane LUNs from backup (Large Sites Only)
Select AIC SNAPSHOT BACKUP & RESTORE and then fill out the information requested by the
form.
Step 23.3: Restore vms on the kvms from dump xml backup (Medium Sites Only)
~~
ssh @
cd /home/m96722/aic/
Copy the xml files and define the VMs using the xml files.
ansible-playbook -i inventory/ playbooks/restore_vms.yml
~~
Action: Access each control plane KVM and start up the VMs. Proceed by starting the DBNG nodes, then the MOSC controller nodes, and then the rest of the nodes.
ssh <attuid>@<kvm>
virsh list
virsh start <vmhostname>
Step 23.5: To rollback Nagios, rerun playbooks once step 23.1, 23.2, 23.3 are restored. These
playbooks will re-install previous RC packages.
Step 23.6: To rollback Astra, rerun playbooks once step 23.1, 23.2, 23.3 are restored. These
playbooks will re-install previous RC packages.
Action: On Opssimple VM, as m96722 user import the original site yaml created in Step 17.b.15.
ssh <opscvm>
cd /home/m96722/install
./resetdb.sh
python3 /var/www/aic_opssimple/backend/manage.py import /home/m96722/before_rc.yaml yaml
Follow this step only if fuel needs to be restored from backups taken during preupgrade checks. This
step is not necessary if the control plane is restored from backups.
To comply with RIM and audit findings, every MOP must include steps to remove backups and artifacts created during the deployment of that MOP.
The removal process may be at a later date to allow for potential back outs. Any artifacts for this
must be listed in Post Change activities section and include instructions on scheduling the future
removal.
Any backups and artifacts created during the MOP execution which will not be needed for backout,
should be removed in the POST implementation activities section executed before the end of the
change window.
Permissions for Openstack configuration file backups should be set to 640, and other files 640 or
600. The rule of least privilege should apply, but we need to make sure the DE has access to the files
for restore / backout purposes. This should not be a group that non-SA and non-DE users would be
in.
On OpsSimple VM:
cd /home/m96722/aic
ansible all -i inventory/openstack -m shell -sa " dpkg -l | grep ruby2"
Results/Descriptions: All the openstack nodes should be upgraded to version 2.0.0.484-1ubuntu2.13
Action: Remove upgrade related artifacts created on OpsSimple, Fuel and Contrail Controller VMs.
Connect to the OpsSimple node with your personal attuid using ssh {opsc IP} and become m96722 using sudo -iu m96722, then execute:
rm -rf /var/tmp/xml_dump/
rm -rf /var/tmp/aic_opsfix_*/
Connect to Fuel node with your personal attuid ssh {fuel IP} and become root using toor , then
execute:
rm -rf /var/log/fuelbackuppre/nailgun.dump.gz
From OpsSimple host as m96722, Connect to one of the Contrail Controller VM and become root
using toor , then execute:
rm -rf /var/tmp/contrailbackuppre
Check 24.7: Request storage operations to backup of all vLCP LUNs (Large Sites Only)
Select AIC SNAPSHOT BACKUP & RESTORE and then fill out the information requested by the
form.
~~
ssh @
cd /home/m96722/aic/
Action: Connect to the Opssimple node using your personal attuid, switch to m96722, and then connect to one of the Nagios hosts and become root
ssh {opsc IP}
sudo -iu m96722
ssh {nagios IP}
toor
Example -
/usr/local/bin/nagios-del-downtime-zone.sh pdk5
Note - The Release Management Chat is to be notified via the Q chat specified in Section 19 after re-enabling Nagios alerts
Results/Descriptions: Commands executed without errors