IPL1A RC35 Mechid Artifacts
Tenant Impact: No
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-355-Update_aic_lcm_packages.docx?d=w787973068da2491288f93d6b4a0ea17e&csf=1&web=1&e=4Z0SyR
$ ssh <opscvm>
$ rm $pck
Fuel status
Check that the environment status is operational and that there are no error nodes and no offline nodes.
fuel env
Check that status is operational
ssh <fuelvm>
fuel nodes | tail -n +3 | awk -F"|" '{ if ($9 !=1) print $1 "Offline" $3 $9}'
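If you prefer a scripted check of the environment status, here is a minimal sketch (assuming the standard tabular fuel env output, where the status column contains the word operational):
fuel env | tail -n +3 | grep -q operational && echo "environment operational" || echo "environment NOT operational - investigate before proceeding"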
LCM status
This step verifies that the LCM infrastructure is in place and operational. If the LCM infrastructure is
not operational, it will not be able to deploy the changes properly. The Puppet code required to
deploy the OpenStack packages needs to be downloaded and assigned to the nodes. This is done with
R10K, which downloads the Puppet code and installs it into the Puppet Master. The "environment"
column in Foreman indicates which version of the Puppet code a particular node is running. The "last
report" column indicates when the Puppet code was last applied.
Verify that the nodes are green. If they are not in sync, review the node reports to assess severity. Based
on the severity, raise an iTrack ticket (assign it to Tiger Team members)
- https://itrack.web.att.com/secure/Dashboard.jspa?selectPageId=19152
In the hosts tab, check that all the nodes except the non-openstack nodes are pointing at the
previous release_candidate. The only nodes which can be pointing at production are the non-
openstack nodes.
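If checking the Hosts tab manually is inconvenient, the same information can be pulled over the Foreman API; a minimal sketch, assuming the FOREMAN_USER/FOREMAN_PASSWORD/FOREMAN_URL variables used later in this MOP, python3 on the node you run it from, and the standard /api/v2/hosts response fields:
curl -s -k -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} "${FOREMAN_URL}/api/v2/hosts?per_page=1000" | \
  python3 -c 'import sys, json; [print(h["name"], h.get("environment_name")) for h in json.load(sys.stdin)["results"]]'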
Results/Description:
UMASK and GSTools
The GSTools installation changes the UMASK on the LCM nodes and causes a lot of interference
with R10K. Notice: you need to forward your SSH key when logging in to Fuel to be able to log in to the LCM
nodes.
Login to fuelvm as attuid:
ssh <fuelvm> -A
ssh <lcm_vm>
cd /etc/puppet/environments
pwd
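A quick way to spot the interference described above is to check the effective umask and the permissions R10K left on the deployed environments (a minimal sketch; the expectation that environment directories are at least group/other readable is an assumption based on the note above):
umask                                  # a GSTools-tightened umask (e.g. 0077) is the usual culprit
ls -ld /etc/puppet/environments/*/     # directories created by R10K should not be owner-only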
Test the new mechIDs with test_mechids.py . The new set of MechIDs will be tested against
both LDAPs, or either of them: ITServices and ATTTest. Use the --medium or --large parameter
depending on the cloud type you are working on.
$ ssh <opscvm>
$ cd /home/m96722/rotatemechid/
For medium:
$ cloud_type=medium
For large:
$ cloud_type=large
$ python3 ./test_mechids.py --file ./newmechids.yaml --site-file ./opssimple_site.yaml --$cloud_type --atttest  # test against ATTTest
$ python3 ./test_mechids.py --file ./newmechids.yaml --site-file ./opssimple_site.yaml --$cloud_type --itservices  # test against ITSERVICES
Note: please check in the output above that all mechids are members of the LDAP groups AP-AIC_Prod_Users,
AP-AIC-Mobility, and AP-365-DAY-PASSWORD-EXPIRATION. If a group is missing from the output,
the DE lead must reach out to the T3 managers to have it added on priority. T3 Contact
- https://wiki.web.att.com/pages/viewpage.action?pageId=429599467
If there is an error in the output like ldap_sasl_bind(SIMPLE): Can't contact LDAP server (-1) for all mechid
tests, check that the CA certificates are installed: ls /etc/ssl/certs | egrep -i '(test-)?sbc'
The output should list the certificate files; if not, proceed to the Troubleshooting section to download
and install them, then test the new mechIDs again.
AOTS http://ushportal.it.att.com/step3.cfm?home=ush&app=3962&is_sbc=&prob=52341
Please note that if you need to revert the vLCP node changes for any reason, it may leave
orphaned VMs if they were created after the point at which the backup was created.
Check that there is sufficient disk space in /var/tmp for the backup; if there is not enough space,
use /var/log/tmp/ instead of /var/tmp/ in the backup/revert steps below.
df -h
For large sites, recommend a minimum of 20GB free space for backup.
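A scripted variant of the same space check (the 20 GB threshold comes from the recommendation above; GNU df with -BG/--output is assumed):
free_gb=$(df -BG --output=avail /var/tmp | tail -n 1 | tr -dc '0-9')
[ "$free_gb" -ge 20 ] && echo "/var/tmp: ${free_gb}G free - OK" || echo "/var/tmp: only ${free_gb}G free - use /var/log/tmp/ instead"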
Please make a copy of /var/tmp/fuelbackuppre somewhere else (such as opssimple, maas, etc.) in
case Fuel becomes corrupted during the procedure and the backups become unavailable.
Results/Descriptions:
Log in to the OpsSimple node and run:
ssh <opscvm>
fuel env
yaml
Procedure - 1: ATS Bulletin Check (Action / Results/Descriptions / Time Line)
1. Access the ATS Bulletin site at the following URL: ATS Bulletin
2. Click on the List/Search button.
3. Click on the Search Menu.
4. Determine a key word associated with the activity that you are performing and type it into the Keyword field.
5. Click on the Submit Query button.
6. Review each search result bulletin by clicking on the bulletin number to determine if any of the bulletins have warnings/actions associated with the activity that you are implementing.
7. List all associated bulletins in the following table. If no associated bulletin is found, place N/A in the following table.
Layer 3 - AIC
General AIC
Contact (Team or Person) - Contact Information:
- AIC Fuel Team: Vladimir Maliaev (vm321d)
- AIC Fuel Team: Alexey Odinokov (ao241c)
- CPVT: Nageswara Rao Guddeti (ng4707)
- AIC T2/T3: via Q chat qto://meeting/q_rooms_tg79241332875301882/EMOC+-+AIC+Tier+2+Team+Chat
Pre-check tasks are completed the night of the cutover at least one hour prior to cutover activities.
1. Login to Opssimple. # For the Opssimple IP, refer to this page for the AIC 3.x site list.
2. Switch user to m96722: sudo su - m96722 or sudo -iu m96722
##References##
Verify that the DE has SSH access to the required nodes from the Jump host node prior to the
changes (not using Ansible).
If the above steps do not work, the CPVT or DE should create an AOTS ticket and post it in the group chat
below:
qto://meeting/q_rooms_tg79241332875301882/EMOC+-+AIC+Tier+2+Team+Chat
Create an AOTS ticket specifying the IP and FQDN of the compute node(s) and the VM info, and skip deployment on those compute
node(s) (http://ushportal.it.att.com/index.cfm, put "AIC" in the search field, then select "AIC DATA
CENTER/NTC MOBILITY - REPORT A PROBLEM").
For example:
/usr/local/bin/nagios-sched-downtime-zone.sh pdk5 0.5
The example above will put every node, and all service checks for each node, in scheduled downtime for
12 hours (0.5 days).
5.1. Check whether "Automatically create accounts in Foreman" is enabled via API request.
export FOREMAN_USER=admin
export FOREMAN_PASSWORD=foreman_password
export FOREMAN_URL=https://foreman_url
curl -s -k -X GET -H "Content-Type:application/json" -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} \
  ${FOREMAN_URL}/api/v2/auth_source_ldaps | \
  python -c 'import sys, json; print(json.load(sys.stdin)["results"][0]["onthefly_register"])'  # JSON parsing reconstructed; adjust to your Foreman response if needed
If the response is True, go to section 5.2; otherwise, execute the following commands to enable it:
FOREMAN_LDAP_ID=$(curl -s -k -X GET -H "Content-Type:application/json" \
  -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} ${FOREMAN_URL}/api/v2/auth_source_ldaps | \
  python -c 'import sys, json; print(json.load(sys.stdin)["results"][0]["id"])')  # id extraction reconstructed
curl -s -k -X PUT -H "Content-Type:application/json" -u ${FOREMAN_USER}:${FOREMAN_PASSWORD} \
  -d '{"onthefly_register":true}' ${FOREMAN_URL}/api/v2/auth_source_ldaps/${FOREMAN_LDAP_ID}
export FOREMAN_PASSWORD=foreman_password
export FOREMAN_URL=https://foreman_url
Save the code below as remove_users_foreman.py and execute it with python
remove_users_foreman.py to remove all users authorized by the "LDAP-server" auth source in Foreman.
import os
import json
import requests
from requests.auth import HTTPBasicAuth

user = os.getenv('FOREMAN_USER')
password = os.environ.get('FOREMAN_PASSWORD')
url = os.environ.get('FOREMAN_URL')
URL = "{0}/api/v2/users/".format(url)
# Fetch all users from Foreman (SSL verification disabled, as elsewhere in this MOP)
response = requests.get(URL, verify=False,
                        auth=HTTPBasicAuth(user, password)).content
for i in json.loads(response)['results']:
    if i['auth_source_name'] == 'LDAP-server':
        # Delete each user that was auto-created from the LDAP auth source
        del_url = URL + str(i['id'])
        requests.delete(del_url, verify=False,
                        auth=HTTPBasicAuth(user, password))
5.3. Check in the Foreman UI that no more users authorized by LDAP exist (from now until the end
of the whole procedure below, log in to Foreman as admin).
5.4. Add roles for the new admin tenant mechid to OpenStack to support LDAP
These steps need to be executed on the Fuel node; the DE needs to source an rc file to execute OpenStack
commands. The DE can download their own openrc from Horizon, or create one by becoming
root, making a copy of the openrc_v2 file, chowning it to the DE user, editing it to substitute their ATTUID
for the mechid, and replacing the OS_PASSWORD line with:
# With Keystone you pass the keystone password.
export OS_PASSWORD=$OS_PASSWORD_INPUT
Run:
grep "ldap.Identity" /etc/keystone/keystone.conf
Execute the following if ldap.Identity is present in the above output:
source <your_openrc_file>
6.1. Get the number of compute nodes in the environment and note it for the steps that follow:
ssh <fuelvm>
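The counting command itself is not shown in this MOP; a minimal sketch, assuming the standard fuel node listing where compute roles appear in the roles column:
$ sudo fuel node | grep -ic compute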
6.2. Perform the Contrail pre-checks (20.2 vRouter connection count checks) before proceeding
- https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4
For every CCNT node, the number of connections must be equal to the output of 6.1 (see above). If it is not,
the number of vRouters is not equal to the number of computes. In that case, stop the MOP
and go to item 6.4.
6.3. Check that every vRouter has at least two Established connections.
Visual check:
ssh m96722@<CCNT01_node_IP>
export mesh_ip=`ip a | awk '/inet.*br-mesh/ {print $2}'| egrep -o "([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}"`; echo $mesh_ip
for each in $(python /var/tmp/ccnt_scripts/ist.py --host $mesh_ip ctr xmpp conn | awk '/att/ {print $2}');do python
If you see any issue of any type (e.g. Failed to reach destination, Error, or any other), stop the MOP
and go to 6.4.
Error output:
If everything looks good, run the next command on the same CCNT node to make sure that there are at
least two Established connections and no Alerts:
Automatic check:
for each in $(python /var/tmp/ccnt_scripts/ist.py --host $mesh_ip ctr xmpp conn | awk '/att/ {print $2}');do echo -en "\nHost:
$each ";python /var/tmp/ccnt_scripts/ist.py --host $each vr xmpp | grep Established |wc -l | while read l ;do echo -n $l; if
Error output:
If you see any ALERT, run the visual-inspection command against the bad node(s) and check the issue. If the
issue is confirmed, stop the MOP and go to 6.4.
6.4. Alert! If the above steps do not work, stop the MOP, create an AOTS ticket, and report the
issue(s). (See item 2 in the Preliminary section for details of the ticket creation.)
21. Implementation
Tenant Impact: No
Run fuel2 task history show <task_id>, then log in to the node with the errored task state and check the Puppet log on
it: /var/log/puppet.log
On the Fuel VM, as the fuel admin, execute the following to check the status of the Puppet agents on all nodes.
for i in `sudo fuel nodes |grep ready | awk '{print $9}'`; do echo -n "$i: "; ssh -q -o StrictHostKeyChecking=no $i 'sudo
This step dumps the current metadata from OpsSimple to a YAML file, changes the mechid-related items, and
uploads the new YAML back to OpsSimple and Fuel.
$ ssh <opscvm>
Here you can choose Option 1 to update the OpsSimple YAML with the new mechids automatically, or Option 2 if
something goes wrong with Option 1.
Option 1: Using the ./update_mechids.py script:
$ python3 /home/m96722/rotatemechid/update_mechids.py --mechid-file
/home/m96722/rotatemechid/latest.yaml
Update the att_user_role_mapping item with the appropriate value(s) and check the format (the format is critical) in the
next file:
$ vi /home/m96722/rotatemechid/latest.yaml
Option2: Manually:
$ cp /home/m96722/rotatemechid/current.yaml /home/m96722/rotatemechid/latest.yaml
$ vi /home/m96722/rotatemechid/latest.yaml
* Update `att_user_role_mapping` with appropriate value(s) and check format (format is **critical**)
* Non-openstack sections such as RO, LMA... must be updated according to new MechIDs
* For Solidfire credentials change, make changes to SOLIDFIRE section under cinder block
sf_san_login: <user>
sf_san_password: <password>
The DE must get the changes reviewed by a peer and make sure that the new MechID(s) and
passwords passed the test (step 11.a.2, "Update automation meta-data") before
proceeding.
Here you can make a visual comparison between current.yaml and latest.yaml to make sure that only
MechIDs/passwords have been changed:
ssh <opscvm>
$ cd /home/m96722/rotatemechid
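A plain diff is enough for this comparison; only mechid and password lines should show up in the output (any other difference means the update went wrong):
$ diff -y --suppress-common-lines current.yaml latest.yaml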
This step gets OpsSimple to push the list of repos and new packages to Fuel.
ssh <opscvm>
cd /var/www/aic_opssimple/backend
old_password:gxrF9Qyvh7Qr1DIX1htXXXXX->new_password:DDQQzxndkNFYWPInYCRXXXXX
q0pTKtZTamuGP8DPDweXXXXX
old_password:aB8NCEYhpcDbWRvSjc7XXXXX->new_password:q0pTKtZTamuGP8DPDXXXXX
cd /root/rotatemechid/after
fuel env
The DE must check that the only changes are mechids and passwords. The DE MUST get the diff
reviewed by a peer. To compare line by line:
cd /root/rotatemechid/after
If something is wrong, the DE must go back, adjust the latest.yaml OpsSimple file, and rerun the
procedure. ANY MISTAKE HERE WILL RESULT IN TENANT IMPACTS DURING PUPPET AGENT
RESTART. The DE needs to take their time and double-check.
Propagate the meta-data change through the LCP, switching the repositories and the code environment on
every node.
During the check of the LCM infrastructure, we ensured that the latest Puppet code
has been downloaded.
Fuel manifests switch the repositories on OpenStack nodes
Fuel updates hiera and configdb afterwards
fuel env
/var/www/nailgun/plugins/aic-fuel-plugin-3.0/utils/custom_graphs/standard_operation/update_config_db.yaml
At this point the Puppet Agents operations have been disabled, but no changes have been applied
to OpenStack yet.
Up to this point there should not have been any tenant impact, since the Puppet agents have not run
yet. The DE must be aware that the OpenStack changes will be applied by the Puppet agents once they
are re-enabled.
Note: this step is mandatory only for versions older than 3.0.3 RC14, otherwise it can be skipped.
ssh <fuelvm>
fuel env
cd /root/mechidrotate/
nodes_ids=`fuel node | grep -i compute | awk '{print $1}' | tr "\n" " "`
ssh <fuelvm>
fuel env
cd /root/mechidrotate/
Execute Puppet on the primary aic-identity node and wait for the report in Foreman. Use the
following approach for all steps from 9 to 14. This will create the roles in Keystone if necessary.
# Pick primary identity node using the procedure below then run the start_puppet_now graph
$ node_id=$(for i in `sudo fuel node | awk -F"|" ' /identity/ {print $5}'`;do p=$(ssh -q -o StrictHostKeyChecking=no $i 'sudo
hiera roles') ; echo -n $p" - " ;sudo fuel node | grep $i|awk -F"|" ' {print $1}';done | grep primary | rev|cut -d'-' -f1|rev)
$ echo $node_id
$ echo $envid
$ sudo fuel2 graph execute --env $envid --type start_puppet_now --node $node_id
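Note that $envid is not set by the commands above; a minimal sketch for deriving it, assuming the usual tabular fuel env output with the environment id in the first column:
$ envid=$(sudo fuel env | tail -n +3 | awk -F"|" '{print $1}' | tr -d ' ' | head -n 1)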
Execute Puppet on second and third aic-identity nodes and wait for reports in Foreman.
$ node_ids=`fuel node | grep -i identity | awk '{print $1}' | tr "\n" " "`
Execute Puppet on all aic-controller nodes and wait for reports in Foreman.
$ node_ids=`fuel node | grep -i aic-controller | awk '{print $1}' | tr "\n" " "`
This step needs to be done manually, one node at a time, followed by service checks, before
performing the next Contrail node (CCNT02, and so on).
Be sure that the uptimes of the Contrail processes are in sync before proceeding with CCNT02. In case the
services did not start at the same time, the DE needs to raise an AOTS ticket to T2.
Create an AOTS ticket specifying the IP and FQDN of the Contrail server(s) and the output of the Contrail service status
(http://ushportal.it.att.com/index.cfm, put "AIC" in the search field, then select "AIC DATA
CENTER/NTC MOBILITY - REPORT A PROBLEM").
Verification: Usually it takes 5-7 minutes to align with the other services, but sometimes it might take up to 25 minutes for
the vRouters to re-establish a connection. Perform the Contrail checks (section 21 - step 1 thru step 7) before proceeding -
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4
Verification: Perform the Contrail checks (section 21 - step 1 thru step 8) before proceeding -
https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-195_Contrail_Control_Plane_Generic_checks.docx?d=w6469387ef976478ab2ea94cbdc1652dd&csf=1&web=1&e=LVaaY4
Loop through LCP roles (DO NOT include 'aic-compute' role in the list):
$ nodes_ids=''
$ roleslist=$(fuel node | tail -n+3| awk -F"|" '{print $7}' |sort|uniq|egrep -v "^[[:space:]]*$|compute|identity|aic-controller|
$ nodes_ids=`fuel node | egrep -i "$role" | awk '{print $1}' | tr "\n" " "`
Due to performance implications it is not possible to run Puppet on all computes right away. Keeping
that in mind, it is better to run Puppet either rack by rack or in batches of 10, 20, or 30 (depending on the
number of CPUs on the LCM nodes). The vital part of running Puppet is checking the Puppet reports in
Foreman, restarting Puppet on a node in case of failure, and decreasing the batch size to reduce the number
of failed nodes.
Note: if the procedure below won't work due to site naming conventions (i.e. no "rXXc" in the name),
then obtain all of the computes for an availability zone and execute the start_puppet_now graph in
batches. Then do the same for the other availability zone(s).
$ rack_id='r10c'
$ node_ids=`fuel node | grep -i compute| grep -i $rack_id | awk '{print $1}' | tr "\n" " "`
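The execution command is not repeated here for every batch; a sketch following the same start_puppet_now pattern used for the identity and controller nodes above (assumes $envid is set as earlier and that the node option accepts the space-separated id list):
$ sudo fuel2 graph execute --env $envid --type start_puppet_now --node $node_ids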
export OLD_MECHID_USER=<old_glance_mechid>
export OLD_MECHID_PASSWORD=<old_glance_password>
export NEW_MECHID_USER=<new_glance_mechid>
export NEW_MECHID_PASSWORD=<new_glance_password>
{OLD_MECHID_PASSWORD}', \
{OLD_MECHID_USER}:${OLD_MECHID_PASSWORD}%';"
Check that there are no more old MechId entries left in the Glance DB:
mysql glance -e "select value from image_locations where status = 'active' and value like '%${OLD_MECHID_USER}%';"
Make sure that there are no older, abandoned MechId entries left in the Glance DB:
mysql glance -e "select value from image_locations where status = 'active' and value NOT like '%$
{NEW_MECHID_USER}%';"
NOTE: If any older mechids (older than the previous mechid itself) are found associated with
an image, then open an AOTS ticket for Operations to correct the issue.
Log in to the OpsSimple node and run the playbook with the specified tags to see which mechIDs in the MySQL DB are to be
expired, then expire them and show the expired mechids.
As a result you will get a list of expired user(s). Save the list so you can remove the users from the MySQL DB
completely (see the steps below).
$ ssh <opscvm>
$ cd aic/
TASK: [debug msg="User(s) {{ mysql_expired.stdout_lines }} are expired. You can delete them further"] ***
...
The previous step should expire all outdated mechIDs, including the previous
mechID. In case the previous step ran unsuccessfully, and also to make sure that the previous mechID is
expired, execute the following:
$ ssh <fuelvm>
$ mysql
> alter user "<old_mechid>"@"<host>" password expire; #<--- Run this command for each pair (old_mechid,host) from
Please use AIC-MOP-197 Delete absolute DB users and Drop Databases to delete the expired users that
you got in the "Expire outdated mechids in mysqlDB" step above.
Restart cmha services
This step can be processed in parallel with the Test plan, because it is executed on the DCP's CMC server.
If the DEs are not aware of the CMC update, they should check with peers who usually
perform this activity.
You need to change the credentials on the CMC node (DCP) for the site inventory (as in the example below):
1. orm.<SITE_NAME>.<LCP_TYPE>
example: /home/m96722/orm/inventory/host_vars/orm.dpa2b.large
2. opsc.<SITE_NAME>.<LCP_TYPE>
example: /home/m96722/orm/inventory/host_vars/opsc.dpa2b.large
Example: orm.mck1b.medium
Update the new mechid settings for Fuel if LDAP is enabled in keystone.conf on the Fuel node
Execute:
grep "ldap.Identity" /etc/keystone/keystone.conf
driver= keystone.identity.backends.ldap.Identity
If LDAP is enabled, proceed and execute Steps 20, 21.b, and 22 of AIC-MOP-379 (MOP to
enable LDAP as a backend for keystone on Fuel node).
(https://att.sharepoint.com/:w:/r/sites/NCOMOPS/MOP%20Document%20Library/AIC-MOP-379%20-
%20MOP%20to%20enable%20LDAP%20as%20a%20backend%20in%20keystone%20on%20Fuel
%20node.docx?d=w6c76d99a236343e78299a39c9d8aceb8&csf=1&web=1&e=v0a1zd)
Make sure that "aic-opsfix-cmc-0175" has verson:86666 or above is installed during Step 20 of AIC-
MOP-379
If a current site's RC > RC22.xx - during execution of MOP-379, skip opsfix-0175 execution
On some sites Trove has been removed. If you get an error from the following ansible-playbook execution,
delete lines 132-160 from
/home/m96722/aic/playbooks/aic_non_openstack_mechid_update.yml on the OpsSimple node and run it
again (fixed in 3.0.3 RC15).
$ ssh <opscvm>
$ cd aic/
$ ansible-playbook -i inventory -e 'mechid_file=/home/m96722/rotatemechid/newmechids.yaml'
playbooks/aic_non_openstack_mechid_update.yml
$ ansible-playbook -i inventory -e 'mechid_file=/home/m96722/rotatemechid/newmechids.yaml'
playbooks/aic_nagios_mechid_update.yml
Proceed with approved AIC-MOP-245 to swap MechID for RO related components (Implementation
takes about 15 minutes).
Step 4: Add the user in Keystone with the admin role and the ccp-monitoring tenant, if not already added:
"mechid=<mechid> tenant=ccp-monitoring"
Take note of the successful runs of the above command, except on opssimple.
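The exact command that consumes these variables is not reproduced above; a minimal equivalent sketch using the OpenStack CLI (an assumption - it requires an admin openrc to be sourced and the role/tenant names to match your environment):
$ openstack role add --user <mechid> --project ccp-monitoring admin
$ openstack role assignment list --user <mechid> --project ccp-monitoring --names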
Step 7: Execute next playbook on OpsSimple node to create MechId user in DB for Nagios:
ansible-playbook -i inventory/ playbooks/deploy_nagios_agent.yml --limit dbng_host
Step 8: Restart nagios-nrpe-server service on OpsSimple
sudo service nagios-nrpe-server restart
Step 9: Verify nagios-nrpe-server service on OpsSimple
sudo service nagios-nrpe-server status
Wait until Puppet finishes its work and check for new Foreman reports (green status) - about 30 minutes.
The Foreman reports from the vLCP nodes should have entries with username and password changes, and
some services should be restarted. Please log in to the Foreman UI and check the reports for all nodes.
Log in to the Opssimple VM as your att uid (UAM). Refer to the Jump host for sites as per the wiki links below:
https://wiki.web.att.com/display/CCPdev/AIC+OpsSimple+3.x+VMs
https://wiki.web.att.com/display/CCPdev/Environments
ls /home/m96722/rotatemechid/latest.yaml
Please check that no Ansible errors related to access are thrown during execution. This is
supposed to be fixed in 20. Preliminary Implementation, item 2.
"msg": "Suspicious mechids: m19437 m19438\n !!! Please check verbose /home/m96722/output for details."
...
For detailed playbook output see: cat /home/m96722/output . You may want to look for suspicious mechids,
conf files, and nodes in the output file. If there are any, check the Foreman reports for those affected nodes
and ensure that the Puppet agent is running. If the issue is still present even after a successful report, check
latest.yaml and ensure that OpsSimple pushed all the changes to Fuel.
Please run the next playbook to find services that were not restarted after their conf files were
changed:
ansible-playbook -i inventory/ playbooks/validate_services.yml
Please check that there are no ansible errors thrown during execution
For results see: cat /home/m96722/output_service
Example of playbook output (you can find OK and FAILED examples in the output_service file):
m96722@zmtn12opsc01:~/aic$ cat /home/m96722/output_service
...
zmtn12fuel01.mtn12.cci.att.com /etc/keystone/keystone.conf
zmtn12rosv02.mtn12.cci.att.com /opt/installer/ro.conf
zmtn12mosc02.mtn12.cci.att.com /etc/neutron/neutron.conf
neutron-server ok
mtn12r01c015.mtn12.cci.att.com /etc/neutron/neutron.conf
mtn12r01c009.mtn12.cci.att.com /etc/ceilometer/ceilometer.conf
ceilometer-polling ok
zmtn12mosc01.mtn12.cci.att.com /etc/keystone/keystone.conf
mtn12r01c015.mtn12.cci.att.com /etc/ceilometer/ceilometer.conf
ceilometer-polling ok
mtn12r01c009.mtn12.cci.att.com /etc/nova/nova.conf
nova-compute ok
zmtn12mosc01.mtn12.cci.att.com /etc/ceilometer/ceilometer.conf
ceilometer-polling ok
ceilometer-api ok
ceilometer-alarm-evaluator ok
ceilometer-alarm-notifier ok
ceilometer-agent-notification ok
ceilometer-collector ok
ceilometer-agent-notification ok
mtn12r03s002.mtn12.cci.att.com /etc/swift/account-server.conf
swift-account-server ok
swift-account-auditor ok
swift-account-replicator ok
swift-account-reaper ok
mtn12r01c015.mtn12.cci.att.com /etc/nova/nova.conf
nova-compute ok
zmtn12mosc01.mtn12.cci.att.com /etc/glance/glance-registry.conf
glance-registry ok
zmtn12mosc01.mtn12.cci.att.com /etc/heat/heat.conf
heat-engine ok
zmtn12mosc02.mtn12.cci.att.com /etc/heat/heat.conf
heat-engine ok
...
If there are services with a FAILED!!! Restart is needed message, go to the node and restart the service
manually (use sudo service <service_name> restart, or sudo crm resource restart <crm_resource_name>, or sudo
service apache restart if the service runs under Apache). The only exception is the heat-
<services> (see: https://jira.web.labs.att.com/browse/DEFECT-6362 ); they are supposed to be restarted at
the end of the playbook. Please run the playbook again.
To verify the credentials on the CMC node (DCP) for the site inventory (as in the example below), check the
date of each file; if it is older than the day of the change, the verification fails.
1. orm.<SITE_NAME>.<LCP_TYPE>
example: /home/m96722/orm/inventory/host_vars/orm.dpa2b.large
2. opsc.<SITE_NAME>.<LCP_TYPE>
example: /home/m96722/orm/inventory/host_vars/opsc.dpa2b.large
Example: orm.mck1b.medium
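A minimal sketch of that date check (paths are the example paths above; the file modification time must not be older than the change date):
ls -l --time-style=long-iso /home/m96722/orm/inventory/host_vars/orm.<SITE_NAME>.<LCP_TYPE> \
      /home/m96722/orm/inventory/host_vars/opsc.<SITE_NAME>.<LCP_TYPE>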
ssh <fuel_vm>
fuel env
The following sections detail the rollback/back-out procedure that can be used during MechID rotation. This
procedure will eventually be incorporated into a separate backup and restore guide that can be
exercised on AIC sites as part of standard operational activity. It covers restoring:
1. Specific components in the vLCP, independent of the remaining nodes in the vLCP. These need
to be exercised only if the component has to be restored to the last known state / the
state at which it was backed up.
2. The entire LCP to a pre-upgrade state.
The compute node backup & restore procedure shall be included separately in the operations backup and
restore guide.
$ cp /home/m96722/rotatemechid/current.yaml /home/m96722/rotatemechid/latest.yaml
and repeat the steps above starting from "Change meta-data in OpsSimple, propagate to Fuel", Option 1
or Option 2.
If the previous actions did not restore the environment, request the storage team to restore the vLCP LUN backups
to revert to the previous version of the components in the vLCP.
Revert
ansible-playbook -i inventory/ playbooks/aic_opsfix_revert_0136.yml
To comply with RIM and audit findings, every MOP must include steps to remove the backups and
artifacts created during the deployment of that MOP.
The removal may take place at a later date to allow for potential back-outs. Any such artifacts
must be listed in the Post Change activities section, with instructions on scheduling the future
removal.
Any backups and artifacts created during the MOP execution that will not be needed for backout
should be removed in the POST implementation activities section, executed before the end of the
change window.
Permissions for OpenStack configuration file backups should be set to 640, and other files to 640 or
600. The rule of least privilege should apply, but we need to make sure the DE has access to the files
for restore/backout purposes. The owning group should not be one that non-SA and non-DE users belong
to.
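A minimal sketch of applying those permissions (the backup path matches the one used earlier in this MOP; the group name is illustrative and must be replaced with the restricted DE/SA group on the site):
sudo chown root:<de_sa_group> /var/tmp/fuelbackuppre/*   # restrict group access to DE/SA users only
sudo chmod 640 /var/tmp/fuelbackuppre/*                  # owner read/write, group read, no access for others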
Cleanup
Since newmechids.yaml, beforerotation.yaml, and latest.yaml contain sensitive data, it is
important to delete them when the procedure is over.
$ ssh <opscvm>
$ rm -f /home/m96722/rotatemechid/beforemechidrotate.yaml
$ rm -f /home/m96722/rotatemechid/latest.yaml
$ rm -f /home/m96722/rotatemechid/newmechids.yaml
$ rm -f /home/m96722/aic-opsfix-cmc-0064.*.deb
$ rm -f /var/tmp/sources.list
$ rm -f /home/m96722/output
$ rm -f /home/m96722/output_service
$ ssh <fuelvm>
$ cd /root/rotatemechid/
$ cd /var/tmp/fuelbackuppre
$ cd /tmp/fuelbackuppre
--
CPVT to perform Sanity Checks
NA
Example:
Testing <service>
Status: Fail
There are two different sets of certs, one for each LDAP domain: ITSERVICES and TESTITSERVICES.
To install them:
Download the zip https://workspace.web.att.com/sites/WDS/Lists/IP%20Addresses%20for%20DNS%20%20WINS%20%20LDAP%20%20AD%20Time%20Sources/Attachments/6/TrustedRootCerts.zip ,
unzip it, scp it to the OpsSimple node, and place all *.cer files from the ITSERVICES or TESTITSERVICES
folders in /usr/local/share/ca-certificates/
Note: if you are granted sufficient permissions, try directly:
sudo scp <your_attuid>@199.37.162.36:/staging/ldap_cacerts/<domain>/* /usr/local/share/ca-certificates/
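After the .cer files are in place, the system trust store still has to be refreshed; a minimal sketch, assuming an Ubuntu-style ca-certificates setup (update-ca-certificates only picks up files with a .crt extension, so the .cer files may need to be copied to .crt names):
cd /usr/local/share/ca-certificates/
for f in *.cer; do sudo cp "$f" "${f%.cer}.crt"; done   # provide .crt copies for update-ca-certificates
sudo update-ca-certificates
ls /etc/ssl/certs | egrep -i '(test-)?sbc'               # re-run the earlier verification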
If you need to change parameters in config files, find them in the Fuel UI -> Fuel plugin ->
additional_config and change the values there BEFORE executing the custom graph update_config_db .
Example:
Notice: /Stage[main]/Attcinder::Controller::Aic_cinder_volume/Attcinder::Controller::Solidfire[SOLIDFIRE]/
After update_config_db is executed, hiera will be updated with the new mechIDs and all changes made
in additional_config .
Deployment fails with "Could not evaluate: LDAP source LDAP-server delete error:
(422 Unprocessable Entity)" error
If the deployment fails with the HTTP 422 error:
/Stage[main]/Plugin_lcm::Tasks::Foreman/Foreman_ldap_auth[foreman_ldap_auth_source] (err): Could not evaluate: LDAP
that column.
3. Wait until Puppet applies the Day2 catalog once again.
Symptoms:
fuel graph execute does not return to the console for 10-15 minutes
fuel2 task list shows the last task in the pending state
Solution:
1) task_id=<task_id of pending task>
2) $ fuel task --delete --force --task-id $task_id