Introduction

The chemical process industry (CPI) provides the world with essential products fundamental to modern civilization. Food, medicine, energy, and transport are made possible on a grand scale through CPI products. Concurrently, the industry is confronted with significant challenges pertaining to environmental sustainability and the availability of energy and mineral resources. CPI process improvement has immense potential to address these challenges1,2. However, adopting such improvements and realizing their benefits requires accounting for the human factors involved, particularly the inherent resistance to change within industrial settings3. This resistance and risk aversion are often rooted in concerns about the cost, complexity, and uncertainty associated with implementing new technologies or methodologies. Addressing these legitimate concerns requires demonstrating that advancements in process optimization can be achieved through methods that are not only theoretically sound and economically viable but also straightforward to integrate into existing operations. The objective of this work is, therefore, to take a step forward in developing a process control and optimization policy that embodies these principles.

Steady-state optimization methods, including model-based and model-free real-time optimization (RTO)4,5, are limited in their effectiveness for processes with frequent changes due to the presence of long transient dynamics in most integrated plants6,7. Repeatedly treating each change as a new optimization case is rarely practical, as steady-state RTO requires the process to reach steady-state, which may be significantly delayed8. This leads to infrequent re-optimization and suboptimal performance, with potential stability issues in the closed-loop system. Moreover, the effectiveness of steady-state operation itself has been questioned due to time-varying process economics9.

In contrast, dynamic real-time optimization (D-RTO) and economic model predictive control (EMPC) offer advanced solutions by integrating economic optimization with dynamic process control. D-RTO advances beyond steady-state RTO by utilizing dynamic models to continuously optimize process performance, thereby catering to time-varying conditions without waiting for a steady state10. Meanwhile, EMPC combines the principles of optimization and control while incorporating stability considerations to ensure that operations remain optimal and robust over time11. Both methodologies provide a more responsive and adaptive approach to managing dynamic operational environments, resulting in improved performance and economic outcomes.

Constructing and updating a detailed D-RTO or EMPC dynamic model poses significant challenges for managing economic outcomes and ensuring real-world constraints are met, which negatively impacts the prospect of their widespread adoption in industry12,13. Direct or indirect data-driven dynamic optimization emerges as a promising solution to these challenges. The indirect approach involves formulating an uncertainty model based on data14, or performing sequential system identification for model adaptation15,16, followed by optimization using methods such as EMPC or D-RTO. The indirect approach has seen considerable application; however, the direct approach has yet to gain widespread attention in process control research. The latter aims to derive optimizing policies directly from data, bypassing the conventional model identification step17. An example of the direct approach is online reinforcement learning (RL), which involves learning from interactions with the actual process or its simulations18,19,20. However, applying RL to safety-critical systems is fraught with obstacles, including considerable safety risks during the online learning phase21.

All of the approaches above have limitations and merits. Here, we introduce an alternative approach to CPI real-time dynamic optimization motivated by advances in offline dynamic programming algorithms22,23. We leverage the rich information embedded in operational historical plant data to construct a value function—a measure of the expected return of being in a particular state, or of taking a specific action from that state, under a given policy. The quality of trajectories in the plant data is evaluated according to this value function24. Subsequently, we employ weighted regression to derive a policy that imitates the operational trajectories with the highest potential for expected returns.

The advantages of this data-driven approach are threefold. First, it sidesteps the need for constructing a detailed process model and for online parameter and state estimation, which is especially beneficial for systems that are poorly conditioned or inherently unstable. Second, it eliminates the need for dynamic simulations, on which some data-driven methods rely, thus minimizing the discrepancies that often arise when transferring simulated experiences to real-world operations. Because the policy is trained on actual plant data from regular operations rather than on a simulation, it is inherently tailored to the plant's unique characteristics. Third, computational efficiency is markedly improved for online applications, as the optimizing policy is derived offline, and only policy execution is required in real time.

This learning-based optimization policy has been rigorously applied to the Tennessee Eastman (TE) challenge problem, a widely recognized benchmark simulation for evaluating process control strategies. The complexity and dynamics of the TE process mimic those of a real-world chemical production environment, providing a substantive testbed for our method. Through this application, we demonstrate the feasibility and utility of our approach for complex industrial processes. The training dataset utilized for learning is not the product of specially designed experiments; it reflects the size, complexity, and variability of routine plant operations.

This work makes three main contributions. First, it proposes deriving dynamic optimization policies from historical plant data based on a learning-based approach, where a value function learns to assign weights to trajectories based on their expected returns, and a policy learns to select setpoints that maximize the expected value function through weighted regression. Second, it demonstrates that readily available historical plant data can provide sufficient information for learning improved dynamic optimizers, challenging the assumption that specially designed experiments are necessary. Third, it applies this approach to the complex TE challenge problem, showcasing the scalability and effectiveness of the method in tackling the challenges of interconnected systems and external perturbations prevalent in industrial settings. Our approach significantly reduces production costs relative to the base case, with decreases of 27.48% in Mode 1, 53.04% in Mode 2, and 10.32% in Mode 3 of the TE challenge problem. Additionally, the online computational time of the learning-based dynamic optimizer is only 31 seconds. These contributions underscore the potential of learning-based optimization to drive tangible improvements in process performance.

Methods

Problem formulation

Consider a chemical plant where \({y}_{t}\in {{\mathbb{R}}}^{n}\) represents the outputs (measured variables) of the plant at time t, and \({u}_{t}\in {{\mathbb{R}}}^{m}\) denotes the control action applied to the plant. We assume the existence of a control policy \(\mu :{{\mathbb{R}}}^{n}\times {{\mathbb{R}}}^{q}\to {{\mathbb{R}}}^{m}\), which determines the control action ut based on the current output yt and a set point vt, i.e., ut = μ(yt, vt). We also assume a layered control architecture where a higher-level optimization policy \(\pi :{{\mathbb{R}}}^{n}\to {{\mathbb{R}}}^{q}\) selects the set point vt based on the current state yt, i.e., vt = π(yt). Note that the decision variable for the control policy, μ, is the control input ut, while the decision variable for the optimization policy, π, is the setpoint for the control system vt.

The objective is to learn the optimization policy π that minimizes the expected cumulative cost, reflecting the performance criteria of the plant, as described in Eq. (1).

$${\pi }^{*}=\arg \min_{\pi }\,{\mathbb{E}}\left[\sum_{t=0}^{\infty }\gamma ^{t}\,c(y_{t},v_{t})\right]$$
(1)

where \(c:{{\mathbb{R}}}^{n}\times {{\mathbb{R}}}^{q}\to {\mathbb{R}}\) is the stage cost function that quantifies the performance of the system given the state yt and the set point vt, and γ ∈ (0, 1] is the discount factor that weights immediate cost more heavily than future cost. This optimization problem is subject to the system dynamics, the control policy μ, and the constraints inherent to the system states yt and actions ut.

We aim to solve this problem using a precollected dataset of system interactions \({{\mathcal{D}}}={\{({y}_{i},{v}_{i},{c}_{i},{y}_{i}^{{\prime} })\}}_{i = 1}^{N}\) where each tuple contains the current state yi, set point vi, observed cost ci, and the subsequent output \({y}_{i}^{{\prime} }\). This dataset represents historical operation data under various policies or operational conditions. We note that the optimization policy maps outputs, not states, to actions, eliminating the need for state estimation as required in model-based methods.
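For concreteness, the sketch below (Python, with illustrative array shapes matching the TE case study later in the paper) shows one way such a dataset of transition tuples and the discounted objective of Eq. (1) can be represented; the random numbers merely stand in for real historian data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, q = 1000, 41, 7  # transitions, measured outputs, setpoints (TE sizes)

# Offline dataset D = {(y_i, v_i, c_i, y'_i)}: each row i is one transition.
dataset = {
    "y": rng.normal(size=(N, n)),       # outputs y_i
    "v": rng.normal(size=(N, q)),       # setpoints v_i applied at time i
    "c": rng.normal(size=(N,)),         # observed stage cost c_i
    "y_next": rng.normal(size=(N, n)),  # subsequent outputs y'_i
}

# Discounted cumulative cost of one recorded trajectory, as in Eq. (1).
gamma = 0.99
discounted_cost = np.sum(gamma ** np.arange(N) * dataset["c"])
```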

Key components of the proposed framework

Figure 1 illustrates the D-RTO framework, integrating both offline and online components. The offline segment involves data collection and processing from historical operations to inform the development of a dynamic optimizer through a learning algorithm. This learned optimizer is then applied in real time to the controlled plant operations, adjusting setpoints to minimize the objective function based on real-time data.

Fig. 1: Overview of the proposed data-driven optimization framework.
figure 1

a The offline phase processes historical data to learn a dynamic optimization policy. b The learned optimizer operates online, using real-time data to determine optimal setpoints for supervisory control. c The conventional RTO approach uses online models, parameter/state estimation, and RTO to compute setpoints.

Next, we outline the critical elements underpinning our direct data-driven dynamic optimization framework. This framework integrates a control structure for stability and setpoint tracking, a learning algorithm for adaptive optimization, and a data curation strategy to ensure the relevance and quality of input data. Algorithm 1 provides a summary of the deployment of the proposed data-driven dynamic optimization framework in industrial settings.

Algorithm 1

Design Procedure for the proposed framework

1: Control design: Select regulatory and supervisory control layers suitable for the process dynamics, interactions, and constraints.

2: Setpoint identification: Identify setpoints with the greatest impact on operational costs and product quality using historical data and process expertise. These will become the focus of offline learning.

3: Safety override integration: Define strict safety limits for critical process variables. Implement overrides within the control layer to supersede RL-suggested setpoints that could violate these limits.

4: Offline policy learning:

Data curation: Carefully prepare a dataset according to the "Training data curation and informativity" guidelines.

Value function learning: Choose a value function representation aligned with process complexity and data characteristics.

Policy extraction: Apply a method such as advantage-weighted regression (AWR) to extract a policy from the learned value function, focusing on actions with high expected long-term rewards25.

5: Policy deployment: Deploy the learned policy online to the system.

Control structure design

As presented in Fig. 1, the control structure consists of three layers: regulatory control, supervisory control, and RTO. Regulatory control maintains process variables near setpoints. For complex systems, supervisory control techniques like model predictive control (MPC) or well-designed decentralized control are necessary. MPC uses a dynamic model to optimize input trajectories considering constraints and interactions. Decentralized control partitions the system into sub-units with individual controllers, focusing on key objectives and process characteristics.

The proposed dynamic RTO approach operates atop the control structure, relying on it for accurate execution of optimal setpoints. The supervisory and regulatory layers ensure disturbance rejection and operational stability. The RTO layer optimizes a metric defining operating profit or cost, and sends the computed setpoints to the supervisory control layer. Safety overrides should be integrated to prevent the data-driven optimizer from violating safety limits, acting as a failsafe when process conditions approach unsafe boundaries.
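The override logic can be as simple as clamping the optimizer's proposals and backing off production when a critical variable approaches its limit. The sketch below is illustrative only; the tag names and the 2950 kPa trip point are hypothetical placeholders, not the TE specification.

```python
def apply_safety_override(setpoints: dict, measurements: dict) -> dict:
    """Clamp optimizer-suggested setpoints to hard safety limits (illustrative)."""
    # Hard limits on critical setpoints (hypothetical values).
    limits = {
        "reactor_pressure_sp_kPa": (2300.0, 2900.0),
        "reactor_level_sp_pct": (50.0, 100.0),
    }
    safe = dict(setpoints)
    for tag, (lo, hi) in limits.items():
        if tag in safe:
            safe[tag] = min(max(safe[tag], lo), hi)

    # Condition-based override: cut the production setpoint if the measured
    # pressure approaches an unsafe boundary, mirroring the pressure override
    # in the regulatory/supervisory layers.
    if measurements.get("reactor_pressure_kPa", 0.0) > 2950.0 and "production_rate_sp" in safe:
        safe["production_rate_sp"] *= 0.9
    return safe
```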

Learning algorithm selection

The learning algorithm addresses the optimization problem described in Eq. (1). In this work, our selected approach to solving this problem involves training two separate neural networks using the historical operational data: one for the value function and another for the policy.

The first neural network is trained to approximate a value function, which evaluates the long-term optimality of trajectories in the dataset \({{\mathcal{D}}}\). The value function can take different forms: a state-value function (the expected return of being in a certain state under a specific policy), a Q-function or action-value function (the worth of taking a particular action in a given state), or an advantage function (the relative advantage of an action compared to the average action in that state). The value function network takes the process measurements (observations) and setpoints (actions) as inputs and learns to predict the expected return (i.e., the sum of discounted future rewards) for each state-action pair. By minimizing the learned value function rather than solely the immediate cost c, this approach captures the long-term impact of actions in processes with long transient dynamics due to time delays and recycle loops.

The second neural network represents the optimization policy (i.e., the dynamic optimizer). Once the value function network is trained, it is used to assign weights to the trajectories in the dataset \({{\mathcal{D}}}\) based on their expected returns. Here, a trajectory refers to a series of observations (measured process variables), setpoints (actions), and the resulting production costs over a certain period of time. The policy network takes the process measurements (observations) as inputs and learns to predict the setpoints (actions) that lead to the highest expected returns, as determined by the learned value function. This policy extraction process can be viewed as a weighted regression problem, where the aim is to find a policy π that maximizes the expected value function. Figure 2 provides a conceptual illustration of this process.

Fig. 2: Conceptual illustration of the policy extraction process in the proposed framework.
figure 2

The learning algorithm evaluates the optimality of each trajectory in the historical operational data using a learned value function. Trajectories with lower long-term costs are assigned higher weights. The optimization policy (dynamic optimizer) is then derived using weighted regression, prioritizing trajectories with the highest expected returns. The resulting policy selects plant setpoints that lead to lower long-term production costs.
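To make the two-network procedure concrete, the condensed PyTorch sketch below fits a value network with a one-step bootstrapped target and extracts the policy by exponentiated-advantage-weighted regression. It is a simplified stand-in for the actual algorithm (described further below and in Supplementary Note 2): the network sizes, learning rate, temperature β, and the use of a single Q-network are illustrative choices, not the settings used in this work.

```python
import torch
import torch.nn as nn

n_obs, n_sp, beta, gamma = 41, 7, 1.0, 0.99

# Value network: maps (observation, setpoint) to expected return (negative cost-to-go).
q_net = nn.Sequential(nn.Linear(n_obs + n_sp, 256), nn.ReLU(), nn.Linear(256, 1))
# Policy network: maps observation to setpoint proposal.
pi_net = nn.Sequential(nn.Linear(n_obs, 256), nn.ReLU(), nn.Linear(256, n_sp))
q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(pi_net.parameters(), lr=3e-4)

def train_step(y, v, c, y_next):
    """One offline update on a batch of transitions (y, v, c, y')."""
    # 1) Value learning: one-step bootstrapped target using the policy's own
    #    setpoint proposal at the next state (reward = -cost).
    with torch.no_grad():
        v_next = pi_net(y_next)
        target = -c.unsqueeze(-1) + gamma * q_net(torch.cat([y_next, v_next], dim=-1))
    q_loss = ((q_net(torch.cat([y, v], dim=-1)) - target) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # 2) Policy extraction by weighted regression: imitate recorded setpoints,
    #    weighting each transition by its exponentiated advantage estimate.
    with torch.no_grad():
        adv = q_net(torch.cat([y, v], dim=-1)) - q_net(torch.cat([y, pi_net(y)], dim=-1))
        w = torch.clamp(torch.exp(adv / beta), max=100.0)
    pi_loss = (w * ((pi_net(y) - v) ** 2).sum(dim=-1, keepdim=True)).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
    return q_loss.item(), pi_loss.item()
```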

Training data curation and informativity

The question of what can be learned from data, and how to measure data informativity, is a promising area of research that should eventually allow operational data to be selected systematically for training dynamic optimizers26,27,28. In the meantime, we provide general guidelines for data inclusion that we have empirically found to lead to better optimizers. Rather than treating the available process data as fixed, we find that careful curation is important for effective data-driven optimization.

In the data curation process, we carefully select operational data representing desirable operating conditions. This focus on high-quality data aligns with literature emphasizing its importance29. We also incorporate periods of ramping up and down, transients, and data from previous experimental campaigns into the dataset. Although not explicitly modeled, manipulated variable limits are implicitly considered, as the policy extraction step limits controls to those observed within the dataset. For example, if the desired maximum reactor pressure is 2900 kPa, we exclude data where the reactor pressure exceeds this limit. By curating a dataset that spans the full operational range, the optimization framework gains exposure to the feasible input space.
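A minimal pandas sketch of this kind of filtering is shown below; the column name and file name are hypothetical, and the 2900 kPa limit is the example given above.

```python
import pandas as pd

def curate(df: pd.DataFrame, max_pressure_kPa: float = 2900.0) -> pd.DataFrame:
    """Keep rows within the desired operating envelope (illustrative filter).

    Transients, ramps, and past campaign data are retained on purpose;
    curation removes undesired operating points, not dynamic content.
    """
    mask = df["reactor_pressure_kPa"] <= max_pressure_kPa
    return df.loc[mask].reset_index(drop=True)

# usage (hypothetical historian export):
# curated = curate(pd.read_csv("historian_export.csv"))
```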

Our approach diverges significantly from the RL-RTO method outlined in ref. 5, which involves learning a reward function from simulation-generated data. The method in ref. 5 reduces to deriving a greedy policy, which maximizes the immediate reward rather than the expected return, diverging from the essence of RL approaches30. Then, an evolutionary optimization algorithm is applied to adjust policy weights, aiming to maximize the reward. However, generating synthetic training data through random perturbations of decision variables and disturbances is impractical. Moreover, the described method results in a steady-state RTO that is solved repeatedly when environmental conditions change.

Results

Application to the TE process

The TE problem (TEP), introduced in 1990 by Downs and Vogel31, is based on an actual industrial process and is a widely acknowledged benchmark for evaluating the efficacy of process control techniques. The system is designed around five major units: a reactor, a product condenser, a vapor-liquid separator, a recycle compressor, and a product stripper, as illustrated in Fig. 3. Products, along with unreacted components, exit the reactor in vapor form, while the nonvolatile catalyst remains in the reactor. Following this, the reactor's product stream is directed through a cooler to condense the products, and then it proceeds to a vapor-liquid separator. The noncondensables are recycled back to the reactor via a centrifugal compressor. The condensed components are then transferred to a product stripping column, where the remaining reactants are stripped using the A/B/C feed stream. Products G and H leave from the base of the stripper. The inert and the byproduct are primarily purged from the system as vapor from the vapor-liquid separator.

Fig. 3: Process flow diagram (PFD) of the TE challenge problem, a benchmark simulation for evaluating process control strategies31.
figure 3

The TE process involves five main units: a reactor, a product condenser, a vapor-liquid separator, a recycle compressor, and a product stripper. The process produces two main products (G and H) from four reactants (A, C, D, and E), with an inert (B) and a byproduct (F) present. The complexity and dynamics of the TE process closely resemble those of a real-world chemical production plant.

The process includes 41 measurements and 12 manipulated variables. The TEP mirrors the complexity and challenges of a real chemical plant due to its nonlinear dynamics, multivariable interactions, and dynamic behavior. It incorporates realistic disturbances, noise, and a range of operational modes, including various fault scenarios akin to those encountered in industrial settings. Significantly, the TEP is an integrated process with a material recycle stream. The presence of recycle streams in process systems significantly increases their complexity, which is evidenced by increased sensitivity to disturbances, extended time constants, and potential instability issues. The temporal multi-scale behavior from diverse physical and chemical phenomena leads to stiff differential equations that further complicate dynamic modeling and optimization. The disparity between these complex systems and simpler non-integrated models challenges the transferability of optimization schemes to real-world industrial applications32.

The process produces two products from four reactants. Also present are an inert and a byproduct, making a total of eight components: A, B, C, D, E, F, G, and H. The reactions are:

$$\begin{array}{ll}\text{A}(g)+\text{C}(g)+\text{D}(g)\to \text{G}(liq), & \text{Product 1,}\\ \text{A}(g)+\text{C}(g)+\text{E}(g)\to \text{H}(liq), & \text{Product 2,}\\ \text{A}(g)+\text{E}(g)\to \text{F}(liq), & \text{Byproduct,}\\ 3\,\text{D}(g)\to 2\,\text{F}(liq), & \text{Byproduct.}\end{array}$$

In this study, we examine three different operating modes of the TEP. Each mode is characterized by a target G/H product mass ratio and a target production rate, as detailed in Table 1.

Table 1 Operational modes for the TE process, characterized by the target product mass ratio (G/H) and production rate

Implementation of the proposed framework

The TEP’s control objectives are focused on maintaining process variables at their desired levels, ensuring operations stay within equipment constraints, minimizing the variability of product rate and quality in response to disturbances, reducing the movement of valves that could affect other processes, and achieving a quick recovery from any disturbances, including shifts in production rates or changes in the product composition31.

In this work, we adopt a decentralized control approach for the TEP, which has been shown to effectively satisfy the specifications of the challenge problem even in the presence of significant disturbances in feed composition and reaction kinetics. The decentralized control strategy proposed by Ricker33 demonstrates superior performance in terms of reducing variability in product rate and quality compared to previous investigations. The implemented control structure incorporates two critical overrides: a reactor pressure override that reduces production when necessary, and a reactor level override that reduces the recycle flow.

The optimization decision variables serve as setpoints for the control layer and are carefully selected based on their substantial impact on process economics. These variables, presented in Table 2, are chosen such that their theoretically optimal values, as computed by Ricker34, fall within the specified lower and upper bounds across all three operational modes of the TEP. The decision variables are discretized with a step size of 5 × 10−4. The variables yA and yAC, defined within the decentralized control strategy, regulate the proportions of component A and of the sum of components A and C in the reactor feed, respectively33. This targeted control of feed composition plays an important role in optimizing process performance and product quality.

Table 2 Decision variables for the optimization policy in the proposed framework, along with their lower and upper bounds
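As an illustration of how raw policy outputs are mapped onto the bounded, discretized decision variables, the helper below clips to the Table 2 bounds (passed in as arrays rather than reproduced here) and snaps to the 5 × 10−4 grid.

```python
import numpy as np

def snap_to_grid(raw: np.ndarray, lower: np.ndarray, upper: np.ndarray,
                 step: float = 5e-4) -> np.ndarray:
    """Clip raw setpoint proposals to their bounds and round them onto the
    discretization grid used for the decision variables."""
    clipped = np.clip(raw, lower, upper)
    snapped = lower + np.round((clipped - lower) / step) * step
    return np.clip(snapped, lower, upper)  # guard against rounding past a bound
```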

The main objective and contribution of this work is to directly learn a dynamic optimization policy from historical plant data by using a value function to assign weights to trajectories in historical plant data. There are various types of value functions and ways to derive policies from them. Value functions include standard Q-functions, Conservative Q-functions, and advantage functions. Once a value function has been learned, policies can be derived using methods such as greedy maximization, constrained policy optimization, value-weighted regression, or AWR22.

The approach selected here is the one proposed by Kostrikov et al.35, which learns an advantage function as the value function and employs AWR for policy extraction. The method treats the state-value function as a random variable influenced by the action choice and estimates an upper expectile of this variable to approximate the value of the best actions in each state. This sidesteps the need to evaluate the value function at unseen actions. The policy is then derived through AWR, which likewise avoids out-of-sample actions25. While other approaches exist, this particular combination of value function and policy extraction method has demonstrated state-of-the-art performance on various benchmarks23.

The neural network architecture of the policy is a sequential model that consists of an initial linear transformation of the input observation to a 400-dimensional space, an LSTM layer for sequential processing that outputs a 300-dimensional representation, and a final linear layer that maps this to a specified feature size. ReLU activations are applied after both linear transformations. The architecture is designed to process sequential data, taking into account the temporal dependencies within the input features. The selected dimensions of the neural network were found to yield satisfactory results, and we observed that the performance of the learned optimization policy is not strongly dependent on the exact number of layers and units. Additional details on the learning algorithm and hyperparameters can be found in Supplementary Note 2.
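For reference, a PyTorch rendering of the described architecture is sketched below; the observation and setpoint dimensions correspond to the TE case (41 measurements, 7 decision variables), and the layer arrangement follows the description in the text.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Sequential policy: Linear(obs -> 400) -> ReLU -> LSTM(400 -> 300)
    -> Linear(300 -> n_setpoints) -> ReLU, as described in the text."""

    def __init__(self, n_obs: int = 41, n_setpoints: int = 7):
        super().__init__()
        self.embed = nn.Linear(n_obs, 400)
        self.lstm = nn.LSTM(input_size=400, hidden_size=300, batch_first=True)
        self.head = nn.Linear(300, n_setpoints)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, n_obs) sequence of plant measurements.
        x = torch.relu(self.embed(obs_seq))
        x, _ = self.lstm(x)
        # Setpoint prediction from the last time step of each sequence;
        # ReLU after the output layer follows the textual description.
        return torch.relu(self.head(x[:, -1, :]))

# usage: PolicyNetwork()(torch.randn(8, 50, 41)) -> tensor of shape (8, 7)
```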

The dataset for training the dynamic optimizer was generated from a revised simulation of the TE process, originally introduced by Downs and Vogel31. The revision, made by Bathelt et al.36, addresses key aspects of usability and reproducibility. The training data was collected from an unoptimized plant operating under the decentralized control structure described above.

Considering that the TE process operates at different product compositions, it is reasonable to assume that historical data would include transitions between various operating modes. Therefore, the training data was collected from a scenario where the production setpoint was increased from 17 m3/h to 23 m3/h while the mole percentage of product G was simultaneously decreased from 90% to 10%. This transition was followed by a rapid ramp-down in the production rate to 16.1 m3/h. The training data also incorporates a slow drift in reaction kinetics.

The total duration of the collected data is 4000 h, with a sampling time of 0.01 h. Figure 4 illustrates the production rate and composition ramps, along with the associated production cost. To mimic realistic operational conditions, all process measurements include Gaussian noise, with standard deviations typical of each measurement type, as specified by Downs and Vogel31. Furthermore, the training data incorporates periods of random variation in the feed compositions of components A, B, and C, which is a common occurrence in large-scale chemical processes and is suggested by the authors of the TE problem.

Fig. 4: Training data for the dynamic optimizer, generated from TE process simulations.
figure 4

a Transition in production rate, b change in mole percentage of product G, and c corresponding production costs. Data includes noise and random feed composition variations to simulate realistic conditions. Points are sampled for readability while preserving trends.

The final training dataset consists of 49 columns: 41 for measured variables, 7 for decision variables (optimization setpoints), and one for the stage cost. The stage cost, which captures the operational objectives and is used to label the training data, is given in Eq. (2), with the exact mathematical expression provided in Supplementary Note 1.

$$\begin{aligned}\text{Production cost} ={}& (\text{purge costs})\times (\text{purge rate})\\ &+(\text{product stream costs})\times (\text{product rate})\\ &+(\text{compressor costs})\times (\text{compressor work})\\ &+(\text{steam costs})\times (\text{steam rate})\end{aligned}$$
(2)
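A direct translation of Eq. (2) is sketched below. The cost coefficients are those tabulated by Downs and Vogel31 and given in Supplementary Note 1; they are passed in as arguments here rather than reproduced.

```python
def production_cost(purge_rate, product_rate, compressor_work, steam_rate,
                    purge_cost, product_stream_cost, compressor_cost, steam_cost):
    """Stage cost of Eq. (2): each measured rate is weighted by its unit cost."""
    return (purge_cost * purge_rate
            + product_stream_cost * product_rate
            + compressor_cost * compressor_work
            + steam_cost * steam_rate)
```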

Performance comparison with existing approaches

The performance of the learning-based dynamic optimizer is evaluated based on the average hourly production cost defined in Eq. (2). We benchmark our approach against several existing methods: the base case presented by ref. 31, the theoretical optimum calculated by ref. 34 (which assumes full state observability and noise-free measurements), an RTO strategy reported by ref. 37 (an improvement over the RTO in ref. 38), and the performance of an experienced operator, also reported in ref. 37. These results are examined further in the Discussion.

To account for the variability in training outcomes due to the random initialization of the neural network weights representing the optimization policy, we report the average hourly cost and standard deviation across three policies, each trained with a different random seed. This approach provides insight into the stability and robustness of the learning process and the resulting policies.

As demonstrated in Fig. 5, the learning-based dynamic optimizer significantly outperforms the base case, achieving cost reductions of 27.48% in Mode 1, 53.04% in Mode 2, and 10.32% in Mode 3. Notably, our approach not only matches but also surpasses the performance of the model-based RTO implemented by ref. 37 in certain cases. In particular, the learning-based optimizer exhibits superior performance in Mode 2, bringing it closer to the theoretical optimum. Moreover, the variability in performance across policies trained with different random seeds is minimal, indicating that the improvements are consistently maintained regardless of the initialization.

Fig. 5: Comparison of average hourly production costs for the TE process under different strategies.
figure 5

The proposed learning-based D-RTO approach is benchmarked against the base case31, a model-based RTO strategy37, the theoretical optimum34, and the performance of an experienced operator37. The learning-based D-RTO approach outperforms the base case, model-based RTO, and operator performance in key operating modes. Error bars represent the standard deviation of the learning-based D-RTO performance across three policies trained with different random seeds, indicating the robustness and consistency of the proposed approach.

An interesting observation is the apparent direct proportionality between the hourly cost and the magnitude of improvement: higher costs are associated with larger improvements. This can be attributed to the effect of the cost on the gradient magnitude within the learning algorithm's optimization process. The learning algorithm refines the policy by adjusting its parameters to minimize the expected cost, and the magnitude of the gradient dictates the size of the adjustment made to the policy or value estimate. Since the dataset encompasses all modes of operation and these are learned simultaneously, a higher cost produces larger gradients, resulting in more substantial policy improvements.

Table 3 provides a detailed breakdown of the total hourly production costs, while Table 4 compares the setpoints selected by different strategies. The measured production rates under all compared methods are similar: 22.95 m3/h in the base case, 22.89 m3/h in the theoretically optimal case, 23.0 m3/h in Duvall's RTO, and 22.91 m3/h in our proposed approach. The learning-based optimizer achieves significant reductions in product losses, primarily by operating at higher reactor pressures and lower liquid levels. The lower liquid levels are economically advantageous because of the increased gas residence time, which has an effect similar to increased pressure. This strategy enables sustained production with either a higher inert concentration (lowering the purge rate) or reduced reactor temperatures (enhancing selectivity)34. The decrease in purge losses is achieved by operating at elevated temperatures to reduce the purge rate, although excessively high temperatures could adversely affect the reactor's selectivity for products G and H. Unlike RTO, the learning-based optimizer does not reduce the steam cost significantly, since it does not close the steam valve. Finally, the compressor cost is reduced by decreasing the recycle rate, although this saving is partially offset by the increase in reactor pressure.

Table 3 Breakdown of average hourly production costs for the TE process in Mode 1, comparing the base case, model-based RTO by Duvall and Riggs37, the learning-based optimizer (this work), and the performance of an experienced operator
Table 4 Comparison of setpoints selected by different optimization strategies across various operational modes of the TE process

Performance under setpoint changes

Here we examine the effect of the dynamic optimizer presented in this work on the control of the process. Figure 6 illustrates two important results. First, the learning-based optimizer does not destabilize the process during significant setpoint changes in the production rate and product composition; the process under the dynamic optimizer achieves the specified product composition and production rate setpoints within the allowable 5% variation. Second, the learning-based optimizer dynamically selects setpoints that lead to a significant reduction in operating costs, as shown in the cumulative cost plot (Fig. 6c).

Fig. 6: Setpoint adjustments by the learning-based optimizer during significant changes in TE process conditions.
figure 6

a, b Stable control of product composition and production rate. c Cumulative cost reduction. d Normalized changes in decision variables demonstrate adaptation to optimal setpoints. Here, yA and yAC denote the respective proportions of A and A + C in the reactor feed.

Figure 6 also presents the normalized change in decision variables, with the direction of changes (i.e., increase or decrease) in setpoints being consistent with the findings in the optimality study by Ricker34. The optimizer keeps the pressure near maximum, maintains the reactor level near the lower bound, increases the recycling rate (by closing the compressor recycle valve) except at high %G, reduces the steam valve, increases the reactor temperature, and reduces yAC.

The positive change in the reactor pressure aligns with the fact that a higher reactor pressure is more optimal due to its effect of increasing the inert partial pressure without decreasing the reactants' partial pressures, which reduces the purge rate and therefore the purge cost. Similarly, reducing the reactor level has an effect comparable to increasing the reactor pressure. The reduction in the compressor recycle valve and steam valve openings decreases the compression cost and steam cost, respectively. The optimizer raises the temperature to cut the purge rate and reduces yAC because an excess of A + C forces a lower inert concentration, which in turn increases purge losses.

Evidence of learning

To evaluate the learning-based optimizer’s ability to select more optimal setpoints for states not explicitly encountered in the training dataset, we analyze the reactor pressure setpoints chosen by the optimizer under specific operating conditions. Figure 7 illustrates the pressure training data, revealing that for product compositions between 90% and 80% G, the highest recorded pressure in the training set is 2750 kPa. However, during testing in Mode 3, which corresponds to a 90/10 G/H product composition, the learning-based optimizer selects a reactor pressure setpoint as high as 2879 kPa, as shown in the bottom plot of Fig. 7.

Fig. 7: Evidence of the learning-based optimizer’s ability to generalize and select optimal setpoints beyond the training data.
figure 7

Plot a shows the maximum reactor pressure recorded in the training data for product compositions between 90% and 80% G, with the highest pressure being 2750 kPa. In contrast, plot b demonstrates that during testing in Mode 3 (90/10 G/H composition), the learning-based optimizer selects a reactor pressure setpoint as high as 2879 kPa, which was not encountered in the training data. This behavior indicates that the optimizer has learned to associate higher pressures with improved economic performance and can adaptively choose setpoints that enhance operational efficiency, even in scenarios not explicitly represented in the training dataset.

This behavior demonstrates that the optimizer has learned to associate higher pressures with more optimal performance, prompting it to select a higher pressure for Mode 3, even in the absence of such instances in the training dataset. This finding confirms the optimizer’s ability to generalize from the learned data and adaptively choose setpoints that improve operational efficiency.

Effect of disturbances on optimization

In the subsequent analysis, our objective is to assess the impact of the learning-based optimizer on process stability. This evaluation focuses not only on its performance during setpoint changes but also on its response to a variety of disturbances typical of large-scale industrial operations. Furthermore, disturbances can alter the optimal conditions under which the process operates. Therefore, this analysis seeks to explore the capacity of the learning-based controller to dynamically adjust the setpoints in response to disturbances. The code provided by Downs and Vogel31 enables the activation or deactivation of specified disturbances.

Variations in feed composition are a significant disturbance in the operation of a chemical plant because they directly influence product selectivity and separation efficiency. Figure 8 illustrates the performance of the control system under random variations in the A, B, and C feed compositions, with and without the learning-based optimizer. It can be observed that the product composition and production rate with the learning-based optimizer are comparable to those in the base case without the optimizer. Furthermore, as demonstrated by the cumulative cost figure, the optimizer is capable of defining setpoints that reduce operating costs, even in the presence of feed composition disturbances. The observed fluctuations in the production rate can be attributed to the random variations in the feed composition of A, B, and C. These variations directly impact the production rate, as it is controlled by manipulating the feed rates.

Fig. 8: Impact of the optimizer on process performance under feed composition disturbances (IDV(8)31).
figure 8

a Product composition, b production rate, and c cumulative costs show the optimizer maintains stability and reduces costs despite random feed variations. The shaded area indicates the allowable 5% variation in product composition and production rate setpoints.

Drift in reaction kinetics can occur in real processes due to catalyst degradation and reactor fouling. Figure 9 demonstrates that the learning-based optimizer does not exacerbate variations in product composition and production rate as a result of the shift in kinetics, thereby maintaining process stability. Furthermore, Fig. 9 illustrates that the optimizer is capable of selecting setpoints that minimize cumulative costs.

Fig. 9: Robustness of the optimizer under reaction kinetics drift (IDV(13)31).
figure 9

a Product composition, b production rate, and c cumulative costs indicate stability and cost reductions, showcasing adaptability to disturbances, such as catalyst degradation or fouling. The shaded area indicates the allowable 5% variation in product composition and production rate setpoints.

Other prevalent disturbances in industrial processes include temperature variations in utility streams, such as cooling water, and valve malfunctions. Figure 10 illustrates the impact of random fluctuations in the condenser cooling water inlet temperature on product composition and production rate, compounded by sticking of the condenser cooling water valve. Despite these challenges, the learning-based optimizer successfully maintains process stability while selecting setpoints that reduce cumulative costs.

Fig. 10: Handling of concurrent disturbances (IDV(12,15)31) by the optimizer.
figure 10

a Product composition, b production rate, and c cumulative costs demonstrate stability and economic performance under temperature and valve disturbances. The shaded area indicates the allowable 5% variation in product composition and production rate setpoints.

Online computational efficiency

The proposed method, which relies on evaluating a trained neural network, offers considerable advantages in online computational efficiency compared to methods that solve an optimization problem online. In the TE problem, conventional model-based dynamic optimization can lead to a large-scale optimization problem with 11,400 optimization variables, 10,740 constraints, and 660 degrees of freedom, which is computationally intensive for online execution with a sampling time of 100 s39. In contrast, our online policy, based on a trained neural network, involves straightforward computations such as matrix multiplications and simple activation functions, which are well optimized and easily parallelized. The online computation of the learning-based dynamic optimization policy in this work takes only 31 s on an Intel Core i9, 2.3 GHz CPU with 16 GB RAM.
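The online execution path therefore reduces to a loop of the kind sketched below: read measurements, evaluate the trained policy, and hand the (override-checked) setpoints to the supervisory layer. The `read_measurements` and `write_setpoints` callables and the 50-sample observation window are placeholders for the plant interface, not part of the published implementation.

```python
import time
import torch

SAMPLE_TIME_S = 100  # sampling interval used in the TE study

def run_online(policy: torch.nn.Module, read_measurements, write_setpoints,
               horizon_steps: int = 1000):
    """Online execution of the trained policy. `read_measurements` and
    `write_setpoints` are placeholders for the plant/DCS interface."""
    policy.eval()
    history = []
    with torch.no_grad():
        for _ in range(horizon_steps):
            y = torch.as_tensor(read_measurements(), dtype=torch.float32)
            history.append(y)
            obs_seq = torch.stack(history[-50:]).unsqueeze(0)  # (1, T, n_obs)
            setpoints = policy(obs_seq).squeeze(0).numpy()
            write_setpoints(setpoints)   # supervisory layer tracks these
            time.sleep(SAMPLE_TIME_S)    # only policy evaluation runs online
```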

Sensitivity to training data

We present a comparative analysis of two distinct datasets to examine the sensitivity of policy performance to the composition of the training data. The first dataset includes both an increase in the production setpoint from 17 m3/h to 23 m3/h and a decrease in the mole percentage of product G from 90% to 10%, followed by a rapid reduction in the production rate to 16.1 m3/h, as depicted in Fig. 4. The second dataset replicates these conditions but omits the rapid ramp-down event, which constitutes 7% of the original dataset, as illustrated in Fig. 11. Although the two datasets are fairly similar, it remains important to investigate whether minor variations in the dataset can result in significant performance degradation, which would indicate a lack of robustness in the learning algorithm.

Fig. 11: Alternative training dataset for evaluating the sensitivity of the optimizer’s performance to variations in training data composition.
figure 11

a Transition in production rate from 17 m3/h to 23 m3/h over 4000 hours; b mole percentage of product G decreasing from 90% to 10%; and c associated production cost. This dataset omits the rapid ramp-down event in production rate seen in Fig. 4. Data points were sampled for readability without losing overall trends.

Table 5 compares the performance of the optimization policy when trained on Datasets 1 and 2 in terms of the average hourly production cost across the three operation modes. We observe an increase in production cost for the policy trained with the smaller dataset. However, it still represents a significant improvement over the base case and maintains comparable performance with RTO.

Table 5 Sensitivity analysis of optimizer performance to variations in training datasets

Variability in learned optimization policies

In the Performance comparison with existing approaches section, we analyze the variability of the overall performance of the learned policy as influenced by the initial weights of the neural network. Here, we examine the behavioral differences between optimization policies derived from the same learning algorithm but initialized with different random seeds.

Figure 12 illustrates the normalized differences in setpoints selected by two policies trained under identical conditions, with the sole distinction being the random seed used for initializing the neural network weights. If the two policies were identical, the normalized differences for all setpoints would remain at zero throughout the operation. However, the observed deviations from zero indicate that the policies achieve their optimization objectives through different adjustments to the setpoints.

Fig. 12: Normalized differences in setpoints selected by two policies trained under identical conditions, differing only in the random seed used to initialize neural network weights.
figure 12

Deviations from zero indicate differences in the setpoints selected by the policies. For instance, the second policy chooses lower reactor pressure and higher recycle valve position (suboptimal) but compensates by selecting a lower reactor level (optimal). These variations illustrate the learned policies' behavioral variability and the optimizer’s ability to achieve objectives via alternative strategies. Here, yA and yAC denote the respective proportions of A and A + C in the reactor feed.

Notably, the second policy opts for a relatively lower reactor pressure and a higher recycle valve position compared to the first policy. From a process optimization perspective, these setpoint choices are considered suboptimal. However, the second policy compensates for these suboptimal decisions by selecting a lower reactor level, which aligns with the optimal operating strategy.

Discussion

Our data-driven optimization approach, distinct from conventional RTO, eliminates the need for dynamic modeling and for online parameter and state estimation. This offers a practical advantage by reducing convergence issues through offline optimization, even though model-based RTO may slightly outperform it in specific operational modes. The practical applicability of our method is underscored by the difficulty of maintaining and updating a rigorous nonlinear steady-state RTO model in the face of frequent changes to plant equipment and operating conditions, the challenge of estimating the plant steady state for RTO execution, and the inherent limitations of estimating the true plant state from noisy measurements in the presence of disturbances and transient behavior.

One of the key advantages of our approach is its ability to learn from historical data, which inherently captures the unique characteristics and constraints of the plant. This enables the optimization policy to be tailored to the specific operating conditions and requirements of the process, without relying on complex and often inaccurate process models. By learning directly from the data, the proposed method can handle the nonlinearities, disturbances, and uncertainties that are prevalent in industrial processes, leading to improved performance and robustness.

However, our approach is not without limitations. The dependence on historical data for setpoint selection prevents the optimizer from choosing optimal setpoints beyond the observed range. This limitation can be mitigated by expanding the training dataset to cover a wider range of operating conditions. It is important to note that the learned policy is inherently conservative, as the learning algorithm restricts actions to those observed in the dataset. While less conservative actions could potentially lead to more optimal performance, they also introduce the risk of destabilizing the plant. The effectiveness and safety of the optimizer are directly correlated with the volume and quality of historical data used for training. Therefore, careful curation of the training dataset is essential to ensure the optimizer’s performance and reliability.

The introduction of neural networks as the foundation of our dynamic optimization policy presents both opportunities and challenges. Neural networks enable the capture of complex nonlinear relationships and the learning of optimizing policies directly from data, but their "black-box" nature can hinder interpretability and raise concerns about the specific characteristics of the optimization policy. Although we can guarantee baseline policy improvements, predicting the exact manner in which process improvements manifest remains a challenge.

To address these challenges, the field of data-driven control presents a promising avenue for future research28. Advances in this area can provide vital insights and methodologies to improve the interpretability and effectiveness of neural network-based control systems. Techniques such as physics-informed neural networks, which incorporate prior knowledge of the system’s physical laws into the learning process, could help bridge the gap between data-driven and model-based approaches, enhancing the interpretability and reliability of the learned policies.

As illustrated in the Variability in learned optimization policies section, the analysis of behavioral differences between optimization policies highlights the importance of considering multiple training iterations and evaluating the range of learned strategies. By examining the variability in the learned policy and its impact on process performance, engineers can gain a more comprehensive understanding of the optimization landscape and the potential for alternative operating strategies.

Finally, the integration of domain expertise and process knowledge into the data-driven optimization framework is important for successful implementation. The selection of appropriate control structures, the identification of critical setpoints, and the definition of safety limits rely heavily on the insight and experience of process engineers. The complementarity of engineering expertise and AI-driven optimization highlights the importance of interdisciplinary collaboration in the development and deployment of data-driven solutions in the process industries.

Conclusion

This research has successfully demonstrated the feasibility and effectiveness of a data-driven dynamic optimization strategy within the CPI, presenting a significant departure from traditional RTO methodologies. By focusing on direct policy extraction from historical data, we have shown that it is possible to achieve considerable operational improvements without the intricate requirements of dynamic modeling and online parameter estimation. The application to the TE challenge problem highlights our method's practicality, showing promising results in efficiency and cost savings. However, challenges related to the "black-box" nature of neural networks and the generalization of learned policies highlight the need for ongoing research in data-driven control. Future work should aim to improve the interpretability, reliability, and expansion of training datasets to further enhance the scope and applicability of this approach in real-world industrial settings. As research in this field advances, we anticipate the development of improved data-driven optimization frameworks that will drive the transformation of the process industries toward a more sustainable and efficient future.