AI Operations
May 5, 2020
Artificial Intelligence (AI), combined with and empowered by advanced analytics, big data and
virtualized computing power, will drive the automation and enhancement of CSPs’ network, IT
service and business operations.
AI capabilities will be gradually infused into IT, network, and business systems and services through the
implementation and deployment of AI models and AI components in all layers of CSPs’ systems
architecture.
Systems running in IT & network operations will provide AI capabilities through AI models and
AI components embedded in them (BSS, OSS, Data Analytics, ERP, third-party applications,
digital applications etc.), supporting all sorts of business and operational processes.
AI deployment in systems will bring tremendous opportunities to improve business processes,
business services and CSPs’ overall performance, but it will also create some challenges. In order to
face the challenges created by large-scale deployments of AI models in CSPs’ operations, service
management processes will have to be redesigned and adapted to manage the new AI-driven
operations and business scenarios and their underpinning AI-based systems.
AIOps definition
The term ‘AIOps’ is not new and has been used widely in the industry for many years, albeit in
different contexts and with different nuances (Gartner, IDC, etc.). In the context of this document,
AIOps comprises two elements:
1. AI components that are actively running and delivering business and operational services. In AIOps,
we assume that key systems in Production are deeply infused with AI capabilities, forming a
blend of AI and traditional software. In Figure 1 we indicate these systems as AI-based BSS,
AI-based OSS, AI-based Data Analytics, AI-based ERP, third-party AI platforms and other AI applications.
2. Service Management processes, frameworks and tools that have been properly reengineered
and adapted in order to support the operations management of AI components in Production.
We call it AIOps Service Management.
Focus of AIOps
Transforming and reengineering CSPs’ operations to prepare them for AI is a broad, complex and
challenging task. At TM Forum, we are identifying the gaps between traditional operations and AI
operations and creating a new operations process framework that allows CSPs to re-engineer their
processes so that they can implement and manage AI safely and securely.
From an operations management perspective, according to our analysis and experience, we have
identified the following main differences between AI software and traditional software:
• The software lifecycle of traditional systems is mainly driven from left to right, i.e. from
Development to Operations (from Dev to Ops). In AIOps, a new and key aspect to manage is that
self-driven software updates in Production generate a new flow from right to left, i.e. from
Operations to Development (from Ops to Dev), which does not exist for traditional software.
Current continuous improvement practices are based on human feedback and interventions, not
on software-driven updates. The lifecycle of AI components, on the other hand, is bidirectional:
it also flows from Operations to Development, because AI models may autonomously change their
state and configuration in Production (online learning, self-driven updates) without human
intervention, which then requires a prompt and comprehensive retrospective evaluation (Figure 11).
• All software evolves. Continuous Improvement, Lean and Kaizen principles have been extensively
adopted in software engineering and service operations management. Indeed, the retrospective
approach is part of Agile and DevOps methodologies.
However,
• Data has a key role and is one of the key components of the structure of AI models. It is the fuel
driving the evolution of AI systems. New input datasets enable the evolution of AI models, and new
data can bring new and different outcomes. For these reasons, data operations become
even more critical and central in AIOps (AIDataOps).
• ML training of AI algorithms and the re-training of AI models in Production are brand new
processes in software development and operations management, which do not exist in the
traditional software lifecycle.
• AI models are nondeterministic by nature. All software in large and complex operations can be
considered nondeterministic to a certain degree because of the high number of variables involved
and the unpredictable scenarios that it may face. However, traditional software is, or
should be, deterministic by nature, i.e. given the same input it provides the same output. AI
models, on the other hand, may behave differently in the same circumstances because their internal
state and internal logic may permanently change and evolve.
• AI software can be even more fragile than traditional software. As for any software, a small
difference between versions of code, between software configurations or between
environment baselines can create issues, defects or unexpected outcomes. In addition to that,
for AI software even a single new byte in the input data can destabilize the AI model.
• AI models are exposed to the risk of bias. AI software can be biased by inappropriate,
incomplete, corrupted, incorrect or fraudulent input data. This risk adds to all the other risks
and weaknesses of traditional software, which obviously apply to AI software as
well (viruses, malicious agents, sabotage, vulnerabilities etc.).
• AI models are black boxes. It is challenging to determine why AI models make a specific decision,
prediction, or classification. There are hidden dependencies inside ML models, resulting
from the combination and integration of input data, training parameters, configuration
settings etc. While code review and other audit techniques would usually clarify the
overall logic behind the behavior of traditional software, for AI software this would not be
enough. Additional and different approaches and techniques are needed to increase the
transparency and the “explainability” of AI software.
• We have learned from Continuous Delivery and DevOps practices that software should be
considered as in permanent working state or beta state. This principle is even truer for AI
models, which are pieces of software with the capability to learn spontaneously and
continuously when exposed to new data. By definition, AI models are in a permanent
evolutionary and working state (like human brains...).
• The intrinsic characteristics of AI models listed above further amplify the management
responsibility of Operations departments, making them even more central and accountable
for service quality, service performance, and the proper and timely control and maintenance
of the continuously evolving and non-deterministic AI systems in Production.
• With the deployment of AI at scale, Production environments become dynamic by nature.
Deploying only offline AI models certainly creates new challenges, but it would contain the
complexity of the operations. However, if we used only offline AI models, we would give up
the benefits brought by online AI models. In order to leverage the full potential of AI,
we need to learn how to manage both offline and online AI models in Production, supervising
their continuous dynamic evolution and ensuring the full control and governance of our operations.
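The contrast between deterministic traditional software and an AI model that learns online can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the class OnlineLinearModel, the function traditional_rule and all numeric values are hypothetical and are not taken from any TM Forum framework or product.

```python
from dataclasses import dataclass


@dataclass
class OnlineLinearModel:
    """Toy one-feature linear model that keeps learning in Production (hypothetical)."""
    weight: float = 0.0
    bias: float = 0.0
    learning_rate: float = 0.1

    def predict(self, x: float) -> float:
        return self.weight * x + self.bias

    def learn(self, x: float, y: float) -> None:
        # One stochastic gradient descent step on the squared error.
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error


def traditional_rule(x: float) -> float:
    """Deterministic logic: the same input always yields the same output."""
    return 2.0 * x


model = OnlineLinearModel()
before = model.predict(1.0)   # prediction with the released state
model.learn(1.0, 3.0)         # a new observation arrives in Production
after = model.predict(1.0)    # same input, but the internal state has changed

print(traditional_rule(1.0) == traditional_rule(1.0))  # always True
print(before, after)                                   # the two values differ
```

Nobody redeployed the model between the two predictions, yet its answer to the same input changed. This is the kind of self-driven update that the bullets above describe and that operations processes would need to supervise.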
Because of all the differences and gaps between traditional software and AI software listed above,
we need to rethink and redesign the operations management processes to prepare them to manage
and govern AI software and, more generally, to operate safely and effectively a blend of AI and
traditional software running together and simultaneously in CSPs’ operations. As there are
differences between traditional software and AI software, there are consequently gaps between
service management for traditional software and service management for AI. The very nature of AI
means that we need to operate our systems and processes differently.
The transformation journey from “Traditional Service Management” to “AIOps Service Management”
would address the business and operational need to deploy and integrate into existing CSPs’
operations a significant number of AI components with relevant business capabilities.
If your organization has no plan to deploy AI, or plans to deploy just a few isolated, siloed and/or
minor AI models that do not really impact key business processes, it is not necessary to start this
journey, because the existing frameworks already do a good job of managing traditional software in
operations.
However, if you plan to deploy large amounts of AI across your business, we strongly recommend
that you start immediately on the transformation journey towards AIOps Service Management.
• Redesign the Deployment processes to release and commit the AI models to Production.
• Redesign the Production processes to operate AI software.
• Redesign the Operations Governance processes to govern AI software.
• Deal with fast flows of changes coming from Dev to Ops and from Ops to Dev for both offline
and online models.
• Integrate effective AI data operations and ML training practices in the AI software operations
management.
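One way to picture the Ops-to-Dev flow is a check that compares the state of a live model with the state that was released, and queues any difference for retrospective evaluation. The following Python sketch assumes a hypothetical ModelRegistryEntry record, a hypothetical model name ("churn-model") and illustrative parameter values. It is not a prescribed AIOps Service Management process, only a sketch of the idea.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(state: dict) -> str:
    """Stable hash of a model's parameters and configuration."""
    canonical = json.dumps(state, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class ModelRegistryEntry:
    """Hypothetical record of what was released from Dev to Ops."""
    name: str
    released_fingerprint: str
    review_queue: list = field(default_factory=list)

    def check(self, live_state: dict) -> bool:
        """Return True if the live model still matches the released one."""
        live = fingerprint(live_state)
        if live == self.released_fingerprint:
            return True
        # Ops-to-Dev flow: queue the changed state for retrospective evaluation.
        self.review_queue.append({"model": self.name, "fingerprint": live})
        return False


# State committed to Production at release time (illustrative values).
released = {"weights": [0.3, 0.3], "threshold": 0.5}
entry = ModelRegistryEntry("churn-model", fingerprint(released))

# Same state, different key order: still matches the release.
unchanged = entry.check({"threshold": 0.5, "weights": [0.3, 0.3]})
# State after a self-driven update in Production: flagged for review.
drifted = entry.check({"weights": [0.35, 0.28], "threshold": 0.5})

print(unchanged, drifted, len(entry.review_queue))
```

In practice the review queue would feed whatever change management process the organization uses. The sketch only shows that a self-driven change can be detected and routed back to Development without human intervention at the detection step.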
Additionally, the definition of clear roles and responsibilities in new AI operating models, a proper
redesign of the organizations concerned and the selection and inclusion of the appropriate skills are
key success factors of this transformation journey, although they are outside the scope of this activity.
Even if it’s not a strictly necessary condition for AIOps, the setup of a loosely coupled, open, service-
based, well-structured, well-documented system architecture helps to support a more agile and
efficient operations service management in general. The TM Forum Whitepaper ‘AIOps Service
Management Deployment’ identifies the gaps between traditional operations and AIOps in the
deployment phase.
Alongside the work of identifying and redesigning the operational processes for AI, a TM Forum Catalyst
(proof-of-concept) team, ‘AI for IT & Network Operations (AIOps) – Phase III’, has recently completed its
third phase with participation from 12 companies. The Catalyst included seven leading CSPs,
collectively representing over 1.5 billion customers. The team has developed eight use cases
addressing the various business needs presented by the CSP champions (China Telecom, China
Mobile, China Unicom, KDDI Research, PCCW/Hong Kong Telecommunications (HKT), Smart
Communications and Telefonica Deutschland). These cut across customer experience, quality of
service, business performance and efficiency, and include: