Introduction

The chemical process industry (CPI) provides the world with essential products fundamental to modern civilization. Food, medicine, energy, and transport are made possible on a grand scale through CPI products. Concurrently, the industry is confronted with significant challenges pertaining to environmental sustainability and the availability of energy and mineral resources. CPI process improvement has immense potential to address these challenges1,2. However, adopting such improvements and realizing their benefits requires accounting for the human factors involved, particularly the inherent resistance to change within industrial settings3. This resistance and risk aversion are often rooted in concerns about the cost, complexity, and uncertainty associated with implementing new technologies or methodologies. Addressing these legitimate concerns requires demonstrating that advancements in process optimization can be achieved through methods that are not only theoretically sound and economically viable but also straightforward to integrate into existing operations. The objective of this work is, therefore, to take a step forward in developing a process control and optimization policy that embodies these principles.

Steady-state optimization methods, including model-based and model-free real-time optimization (RTO)4,5, are limited in their effectiveness for processes with frequent changes due to the presence of long transient dynamics in most integrated plants6,7. Repeatedly treating each change as a new optimization case is rarely practical, as steady-state RTO requires the process to reach steady-state, which may be significantly delayed8. This leads to infrequent re-optimization and suboptimal performance, with potential stability issues in the closed-loop system. Moreover, the effectiveness of steady-state operation itself has been questioned due to time-varying process economics9.

In contrast, dynamic real-time optimization (D-RTO) and economic model predictive control (EMPC) offer advanced solutions by integrating economic optimization with dynamic process control. D-RTO advances beyond steady-state RTO by utilizing dynamic models to continuously optimize process performance, thereby catering to time-varying conditions without waiting for a steady state10. Meanwhile, EMPC combines the principles of optimization and control while incorporating stability considerations to ensure that operations remain optimal and robust over time11. Both methodologies provide a more responsive and adaptive approach to managing dynamic operational environments, resulting in improved performance and economic outcomes.

Constructing and updating a detailed D-RTO or EMPC dynamic model poses significant challenges for managing economic outcomes and ensuring real-world constraints are met, which negatively impacts the prospect of their widespread adoption in industry12,13. Direct or indirect data-driven dynamic optimization emerges as a promising solution to these challenges. The indirect approach involves formulating an uncertainty model based on data14, or performing sequential system identification for model adaptation15,16, followed by optimization using methods such as EMPC or D-RTO. The indirect approach has seen considerable application; however, the direct approach has yet to gain widespread attention in process control research. The latter aims to derive optimizing policies directly from data, bypassing the conventional model identification step17. An example of the direct approach is online reinforcement learning (RL), which involves learning from interactions with the actual process or its simulations18,19,20. However, applying RL to safety-critical systems is fraught with obstacles, including considerable safety risks during the online learning phase21.

All of the approaches above have limitations and merits. Here, we introduce an alternative approach to CPI real-time dynamic optimization motivated by advances in offline dynamic programming algorithms22,23. We leverage the rich information embedded in operational historical plant data to construct a value function—a measure of the expected return of being in a particular state, or of taking a specific action from that state, under a given policy. The quality of trajectories in the plant data is evaluated according to this value function24. Subsequently, we employ weighted regression to derive a policy that imitates the operational trajectories with the highest potential for expected returns.

The advantages of this data-driven approach are threefold. First, it sidesteps the need for constructing a detailed process model and for online parameter and state estimation, which is especially beneficial for systems that are poorly conditioned or inherently unstable. Second, it eliminates the need for dynamic simulations, on which some data-driven methods rely, thus minimizing the discrepancies that often arise when transferring simulated experiences to real-world operations. Because the policy is trained on actual plant data from regular operations rather than on a simulation, it is inherently tailored to the plant's unique characteristics. Third, computational efficiency is markedly improved for online applications, as the optimizing policy is derived offline, and only policy execution is required in real time.

This learning-based optimization policy has been rigorously applied to the Tennessee Eastman (TE) challenge problem, a widely recognized benchmark simulation for evaluating process control strategies. The complexity and dynamics of the TE process mimic those of a real-world chemical production environment, providing a substantive testbed for our method. Through this application, we demonstrate the feasibility and utility of our approach for complex industrial processes. The training dataset utilized for learning is not the product of specially designed experiments; it reflects the size, complexity, and variability of routine plant operations.

This work makes three main contributions. First, it proposes deriving dynamic optimization policies from historical plant data based on a learning-based approach, where a value function learns to assign weights to trajectories based on their expected returns, and a policy learns to select setpoints that maximize the expected value function through weighted regression. Second, it demonstrates that readily available historical plant data can provide sufficient information for learning improved dynamic optimizers, challenging the assumption that specially designed experiments are necessary. Third, it applies this approach to the complex TE challenge problem, showcasing the scalability and effectiveness of the method in tackling the challenges of interconnected systems and external perturbations prevalent in industrial settings. Our approach significantly reduces production costs relative to the base case, with decreases of 27.48% in Mode 1, 53.04% in Mode 2, and 10.32% in Mode 3 of the TE challenge problem. Additionally, the online computational time of the learning-based dynamic optimizer is only 31 seconds. These contributions underscore the potential of learning-based optimization to drive tangible improvements in process performance.

Methods

Problem formulation

Consider a chemical plant where \({y}_{t}\in {{\mathbb{R}}}^{n}\) represents the outputs (measured variables) of the plant at time t, and \({u}_{t}\in {{\mathbb{R}}}^{m}\) denotes the control action applied to the plant. We assume the existence of a control policy \(\mu :{{\mathbb{R}}}^{n}\times {{\mathbb{R}}}^{q}\to {{\mathbb{R}}}^{m}\), which determines the control action ut based on the current output yt and a set point vt, i.e., ut = μ(yt, vt). We also assume a layered control architecture where a higher-level optimization policy \(\pi :{{\mathbb{R}}}^{n}\to {{\mathbb{R}}}^{q}\) selects the set point vt based on the current state yt, i.e., vt = π(yt). Note that the decision variable for the control policy, μ, is the control input ut, while the decision variable for the optimization policy, π, is the setpoint for the control system vt.

The objective is to learn the optimization policy π that minimizes the expected cumulative cost, reflecting the performance criteria of the plant, as described in Eq. (1).

$${\pi }^{*}=\arg \min_{\pi }\,{\mathbb{E}}\left[\sum_{t=0}^{\infty }\gamma ^{t}\,c(y_{t},v_{t})\right]$$
(1)

where \(c:{{\mathbb{R}}}^{n}\times {{\mathbb{R}}}^{q}\to {\mathbb{R}}\) is the stage cost function that quantifies the performance of the system given the state yt and the set point vt, and γ ∈ (0, 1] is the discount factor that weights immediate cost more heavily than future cost. This optimization problem is subject to the system dynamics, the control policy μ, and the constraints inherent to the system states yt and actions ut.

We aim to solve this problem using a precollected dataset of system interactions \({{\mathcal{D}}}={\{({y}_{i},{v}_{i},{c}_{i},{y}_{i}^{{\prime} })\}}_{i = 1}^{N}\) where each tuple contains the current state yi, set point vi, observed cost ci, and the subsequent output \({y}_{i}^{{\prime} }\). This dataset represents historical operation data under various policies or operational conditions. We note that the optimization policy maps outputs, not states, to actions, eliminating the need for state estimation as required in model-based methods.
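For concreteness, the sketch below (Python, with illustrative array shapes matching the TE case study later in the paper) shows one way such a dataset of transition tuples and the discounted objective of Eq. (1) can be represented; the random numbers merely stand in for real historian data.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, q = 1000, 41, 7  # transitions, measured outputs, setpoints (TE sizes)

# Offline dataset D = {(y_i, v_i, c_i, y'_i)}: each row i is one transition.
dataset = {
    "y": rng.normal(size=(N, n)),       # outputs y_i
    "v": rng.normal(size=(N, q)),       # setpoints v_i applied at time i
    "c": rng.normal(size=(N,)),         # observed stage cost c_i
    "y_next": rng.normal(size=(N, n)),  # subsequent outputs y'_i
}

# Discounted cumulative cost of one recorded trajectory, as in Eq. (1).
gamma = 0.99
discounted_cost = np.sum(gamma ** np.arange(N) * dataset["c"])
```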

Key components of the proposed framework

Figure 1 illustrates the D-RTO framework, integrating both offline and online components. The offline segment involves data collection and processing from historical operations to inform the development of a dynamic optimizer through a learning algorithm. This learned optimizer is then applied in real time to the controlled plant operations, adjusting setpoints to minimize the objective function based on real-time data.

Fig. 1: Overview of the proposed data-driven optimization framework.
figure 1

a The offline phase processes historical data to learn a dynamic optimization policy. b The learned optimizer operates online, using real-time data to determine optimal setpoints for supervisory control. c The conventional RTO approach uses online models, parameter/state estimation, and RTO to compute setpoints.

Next, we outline the critical elements underpinning our direct data-driven dynamic optimization framework. This framework integrates a control structure for stability and setpoint tracking, a learning algorithm for adaptive optimization, and a data curation strategy to ensure the relevance and quality of input data. Algorithm 1 provides a summary of the deployment of the proposed data-driven dynamic optimization framework in industrial settings.

Algorithm 1

Design Procedure for the proposed framework

1: Control design: Select regulatory and supervisory control layers suitable for the process dynamics, interactions, and constraints.

2: Setpoint identification: Identify setpoints with the greatest impact on operational costs and product quality using historical data and process expertise. These will become the focus of offline learning.

3: Safety override integration: Define strict safety limits for critical process variables. Implement overrides within the control layer to supersede RL-suggested setpoints that could violate these limits.

4: Offline policy learning:

Data curation: Carefully prepare a dataset according to the "Training data curation and informativity" guidelines.

Value function learning: Choose a value function representation aligned with process complexity and data characteristics.

Policy extraction: Apply a method such as advantage-weighted regression (AWR) to extract a policy from the learned value function, focusing on actions with high expected long-term rewards25.

5: Policy deployment: Deploy the learned policy online to the system.

Control structure design

As presented in Fig. 1, the control structure consists of three layers: regulatory control, supervisory control, and RTO. Regulatory control maintains process variables near setpoints. For complex systems, supervisory control techniques like model predictive control (MPC) or well-designed decentralized control are necessary. MPC uses a dynamic model to optimize input trajectories considering constraints and interactions. Decentralized control partitions the system into sub-units with individual controllers, focusing on key objectives and process characteristics.

The proposed dynamic RTO approach operates atop the control structure, relying on it for accurate execution of optimal setpoints. The supervisory and regulatory layers ensure disturbance rejection and operational stability. The RTO layer optimizes a metric defining operating profit or cost, and sends the computed setpoints to the supervisory control layer. Safety overrides should be integrated to prevent the data-driven optimizer from violating safety limits, acting as a failsafe when process conditions approach unsafe boundaries.
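The override logic can be as simple as clamping the optimizer's proposals and backing off production when a critical variable approaches its limit. The sketch below is illustrative only; the tag names and the 2950 kPa trip point are hypothetical placeholders, not the TE specification.

```python
def apply_safety_override(setpoints: dict, measurements: dict) -> dict:
    """Clamp optimizer-suggested setpoints to hard safety limits (illustrative)."""
    # Hard limits on critical setpoints (hypothetical values).
    limits = {
        "reactor_pressure_sp_kPa": (2300.0, 2900.0),
        "reactor_level_sp_pct": (50.0, 100.0),
    }
    safe = dict(setpoints)
    for tag, (lo, hi) in limits.items():
        if tag in safe:
            safe[tag] = min(max(safe[tag], lo), hi)

    # Condition-based override: cut the production setpoint if the measured
    # pressure approaches an unsafe boundary, mirroring the pressure override
    # in the regulatory/supervisory layers.
    if measurements.get("reactor_pressure_kPa", 0.0) > 2950.0 and "production_rate_sp" in safe:
        safe["production_rate_sp"] *= 0.9
    return safe
```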

Learning algorithm selection

The learning algorithm addresses the optimization problem described in Eq. (1). In this work, our selected approach to solving this problem involves training two separate neural networks using the historical operational data: one for the value function and another for the policy.

The first neural network is trained to approximate a value function, which evaluates the long-term optimality of trajectories in the dataset \({{\mathcal{D}}}\). The value function can take different forms: a state-value function (the expected return of being in a certain state under a specific policy), a Q-function or action-value function (the worth of taking a particular action in a given state), or an advantage function (the relative advantage of an action compared to the average action in that state). The value function network takes the process measurements (observations) and setpoints (actions) as inputs and learns to predict the expected return (i.e., the sum of discounted future rewards) for each state-action pair. By minimizing the learned value function rather than solely the immediate cost c, this approach captures the long-term impact of actions in processes with long transient dynamics due to time delays and recycle loops.

The second neural network represents the optimization policy (i.e., the dynamic optimizer). Once the value function network is trained, it is used to assign weights to the trajectories in the dataset \({{\mathcal{D}}}\) based on their expected returns. Here, a trajectory refers to a series of observations (measured process variables), setpoints (actions), and the resulting production costs over a certain period of time. The policy network takes the process measurements (observations) as inputs and learns to predict the setpoints (actions) that lead to the highest expected returns, as determined by the learned value function. This policy extraction process can be viewed as a weighted regression problem, where the aim is to find a policy π that maximizes the expected value function. Figure 2 provides a conceptual illustration of this process.

Fig. 2: Conceptual illustration of the policy extraction process in the proposed framework.
figure 2

The learning algorithm evaluates the optimality of each trajectory in the historical operational data using a learned value function. Trajectories with lower long-term costs are assigned higher weights. The optimization policy (dynamic optimizer) is then derived using weighted regression, prioritizing trajectories with the highest expected returns. The resulting policy selects plant setpoints that lead to lower long-term production costs.
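To make the two-network procedure concrete, the condensed PyTorch sketch below fits a value network with a one-step bootstrapped target and extracts the policy by exponentiated-advantage-weighted regression. It is a simplified stand-in for the actual algorithm (described further below and in Supplementary Note 2): the network sizes, learning rate, temperature β, and the use of a single Q-network are illustrative choices, not the settings used in this work.

```python
import torch
import torch.nn as nn

n_obs, n_sp, beta, gamma = 41, 7, 1.0, 0.99

# Value network: maps (observation, setpoint) to expected return (negative cost-to-go).
q_net = nn.Sequential(nn.Linear(n_obs + n_sp, 256), nn.ReLU(), nn.Linear(256, 1))
# Policy network: maps observation to setpoint proposal.
pi_net = nn.Sequential(nn.Linear(n_obs, 256), nn.ReLU(), nn.Linear(256, n_sp))
q_opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
pi_opt = torch.optim.Adam(pi_net.parameters(), lr=3e-4)

def train_step(y, v, c, y_next):
    """One offline update on a batch of transitions (y, v, c, y')."""
    # 1) Value learning: one-step bootstrapped target using the policy's own
    #    setpoint proposal at the next state (reward = -cost).
    with torch.no_grad():
        v_next = pi_net(y_next)
        target = -c.unsqueeze(-1) + gamma * q_net(torch.cat([y_next, v_next], dim=-1))
    q_loss = ((q_net(torch.cat([y, v], dim=-1)) - target) ** 2).mean()
    q_opt.zero_grad()
    q_loss.backward()
    q_opt.step()

    # 2) Policy extraction by weighted regression: imitate recorded setpoints,
    #    weighting each transition by its exponentiated advantage estimate.
    with torch.no_grad():
        adv = q_net(torch.cat([y, v], dim=-1)) - q_net(torch.cat([y, pi_net(y)], dim=-1))
        w = torch.clamp(torch.exp(adv / beta), max=100.0)
    pi_loss = (w * ((pi_net(y) - v) ** 2).sum(dim=-1, keepdim=True)).mean()
    pi_opt.zero_grad()
    pi_loss.backward()
    pi_opt.step()
    return q_loss.item(), pi_loss.item()
```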

Training data curation and informativity

The question of what can be learned from data, and how to measure data informativity, is a promising area of research that should eventually allow operational data to be selected systematically for training dynamic optimizers26,27,28. In the meantime, we provide general guidelines for data inclusion that we have empirically found to lead to better optimizers. Rather than treating the available process data as fixed, we find that careful curation is important for effective data-driven optimization.

In the data curation process, we carefully select operational data representing desirable operating conditions. This focus on high-quality data aligns with literature emphasizing its importance29. We also incorporate periods of ramping up and down, transients, and data from previous experimental campaigns into the dataset. Although not explicitly modeled, manipulated variable limits are implicitly considered, as the policy extraction step limits controls to those observed within the dataset. For example, if the desired maximum reactor pressure is 2900 kPa, we exclude data where the reactor pressure exceeds this limit. By curating a dataset that spans the full operational range, the optimization framework gains exposure to the feasible input space.
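A minimal pandas sketch of this kind of filtering is shown below; the column name and file name are hypothetical, and the 2900 kPa limit is the example given above.

```python
import pandas as pd

def curate(df: pd.DataFrame, max_pressure_kPa: float = 2900.0) -> pd.DataFrame:
    """Keep rows within the desired operating envelope (illustrative filter).

    Transients, ramps, and past campaign data are retained on purpose;
    curation removes undesired operating points, not dynamic content.
    """
    mask = df["reactor_pressure_kPa"] <= max_pressure_kPa
    return df.loc[mask].reset_index(drop=True)

# usage (hypothetical historian export):
# curated = curate(pd.read_csv("historian_export.csv"))
```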

Our approach diverges significantly from the RL-RTO method outlined in ref. 5, which involves learning a reward function from simulation-generated data. The method in ref. 5 reduces to deriving a greedy policy, which maximizes the immediate reward rather than the expected return, diverging from the essence of RL approaches30. Then, an evolutionary optimization algorithm is applied to adjust policy weights, aiming to maximize the reward. However, generating synthetic training data through random perturbations of decision variables and disturbances is impractical. Moreover, the described method results in a steady-state RTO that is solved repeatedly when environmental conditions change.

Results

Application to the TE process

The TE problem (TEP), introduced in 1990 by Downs and Vogel31, is based on an actual industrial process and is a widely acknowledged benchmark for evaluating the efficacy of process control techniques. The system is designed around five major units: a reactor, a product condenser, a vapor-liquid separator, a recycle compressor, and a product stripper, as illustrated in Fig. 3. Products, along with unreacted components, exit the reactor in vapor form, while the nonvolatile catalyst remains in the reactor. Following this, the reactor's product stream is directed through a cooler to condense the products, and then it proceeds to a vapor-liquid separator. The noncondensables are recycled back to the reactor via a centrifugal compressor. The condensed components are then transferred to a product stripping column, where the remaining reactants are stripped using the A/B/C feed stream. Products G and H leave from the base of the stripper. The inert and the byproduct are primarily purged from the system as vapor from the vapor-liquid separator.

Fig. 3: Process flow diagram (PFD) of the TE challenge problem, a benchmark simulation for evaluating process control strategies31.
figure 3

The TE process involves five main units: a reactor, a product condenser, a vapor-liquid separator, a recycle compressor, and a product stripper. The process produces two main products (G and H) from four reactants (A, C, D, and E), with an inert (B) and a byproduct (F) present. The complexity and dynamics of the TE process closely resemble those of a real-world chemical production plant.

The process includes 41 measurements and 12 manipulated variables. The TEP mirrors the complexity and challenges of a real chemical plant due to its nonlinear dynamics, multivariable interactions, and dynamic behavior. It incorporates realistic disturbances, noise, and a range of operational modes, including various fault scenarios akin to those encountered in industrial settings. Significantly, the TEP is an integrated process with a material recycle stream. The presence of recycle streams in process systems significantly increases their complexity, which is evidenced by increased sensitivity to disturbances, extended time constants, and potential instability issues. The temporal multi-scale behavior from diverse physical and chemical phenomena leads to stiff differential equations that further complicate dynamic modeling and optimization. The disparity between these complex systems and simpler non-integrated models challenges the transferability of optimization schemes to real-world industrial applications32.

The process produces two products from four reactants. Also present are an inert and a byproduct, making a total of eight components: A, B, C, D, E, F, G, and H. The reactions are:

$$\begin{array}{ll}\text{A}(g)+\text{C}(g)+\text{D}(g)\to \text{G}(liq), & \text{Product 1,}\\ \text{A}(g)+\text{C}(g)+\text{E}(g)\to \text{H}(liq), & \text{Product 2,}\\ \text{A}(g)+\text{E}(g)\to \text{F}(liq), & \text{Byproduct,}\\ 3\,\text{D}(g)\to 2\,\text{F}(liq), & \text{Byproduct.}\end{array}$$

In this study, we examine three different operating modes of the TEP. Each mode is characterized by a target G/H product mass ratio and a target production rate, as detailed in Table 1.

Table 1 Operational modes for the TE process, characterized by the target product mass ratio (G/H) and production rate

Implementation of the proposed framework

The TEP’s control objectives are focused on maintaining process variables at their desired levels, ensuring operations stay within equipment constraints, minimizing the variability of product rate and quality in response to disturbances, reducing the movement of valves that could affect other processes, and achieving a quick recovery from any disturbances, including shifts in production rates or changes in the product composition31.

In this work, we adopt a decentralized control approach for the TEP, which has been shown to effectively satisfy the specifications of the challenge problem even in the presence of significant disturbances in feed composition and reaction kinetics. The decentralized control strategy proposed by Ricker33 demonstrates superior performance in terms of reducing variability in product rate and quality compared to previous investigations. The implemented control structure incorporates two critical overrides: a reactor pressure override that reduces production when necessary, and a reactor level override that reduces the recycle flow.

The optimization decision variables serve as setpoints for the control layer and are carefully selected based on their substantial impact on process economics. These variables, presented in Table 2, are chosen such that their theoretically optimal values, as computed by Ricker34, fall within the specified lower and upper bounds across all three operational modes of the TEP. The decision variables are discretized with a step size of 5 × 10−4. The variables yA and yAC, defined within the decentralized control strategy, regulate the proportions of component A and of the sum of components A and C in the reactor feed, respectively33. This targeted control of feed composition plays an important role in optimizing process performance and product quality.

Table 2 Decision variables for the optimization policy in the proposed framework, along with their lower and upper bounds
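As an illustration of how raw policy outputs are mapped onto the bounded, discretized decision variables, the helper below clips to the Table 2 bounds (passed in as arrays rather than reproduced here) and snaps to the 5 × 10−4 grid.

```python
import numpy as np

def snap_to_grid(raw: np.ndarray, lower: np.ndarray, upper: np.ndarray,
                 step: float = 5e-4) -> np.ndarray:
    """Clip raw setpoint proposals to their bounds and round them onto the
    discretization grid used for the decision variables."""
    clipped = np.clip(raw, lower, upper)
    snapped = lower + np.round((clipped - lower) / step) * step
    return np.clip(snapped, lower, upper)  # guard against rounding past a bound
```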

The main objective and contribution of this work is to directly learn a dynamic optimization policy from historical plant data by using a value function to assign weights to trajectories in historical plant data. There are various types of value functions and ways to derive policies from them. Value functions include standard Q-functions, Conservative Q-functions, and advantage functions. Once a value function has been learned, policies can be derived using methods such as greedy maximization, constrained policy optimization, value-weighted regression, or AWR22.

The approach selected here is the one proposed by Kostrikov et al.35, which learns an advantage function as the value function and employs AWR for policy extraction. The method treats the state-value function as a random variable influenced by the action choice and estimates an upper expectile of this variable to approximate the value of the best actions in each state. This sidesteps the need to evaluate the value function at unseen actions. The policy is then derived through AWR, which likewise avoids out-of-sample actions25. While other approaches exist, this particular combination of value function and policy extraction method has demonstrated state-of-the-art performance on various benchmarks23.

The neural network architecture of the policy is a sequential model that consists of an initial linear transformation of the input observation to a 400-dimensional space, an LSTM layer for sequential processing that outputs a 300-dimensional representation, and a final linear layer that maps this to a specified feature size. ReLU activations are applied after both linear transformations. The architecture is designed to process sequential data, taking into account the temporal dependencies within the input features. The selected dimensions of the neural network were found to yield satisfactory results, and we observed that the performance of the learned optimization policy is not strongly dependent on the exact number of layers and units. Additional details on the learning algorithm and hyperparameters can be found in Supplementary Note 2.
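For reference, a PyTorch rendering of the described architecture is sketched below; the observation and setpoint dimensions correspond to the TE case (41 measurements, 7 decision variables), and the layer arrangement follows the description in the text.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Sequential policy: Linear(obs -> 400) -> ReLU -> LSTM(400 -> 300)
    -> Linear(300 -> n_setpoints) -> ReLU, as described in the text."""

    def __init__(self, n_obs: int = 41, n_setpoints: int = 7):
        super().__init__()
        self.embed = nn.Linear(n_obs, 400)
        self.lstm = nn.LSTM(input_size=400, hidden_size=300, batch_first=True)
        self.head = nn.Linear(300, n_setpoints)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, time, n_obs) sequence of plant measurements.
        x = torch.relu(self.embed(obs_seq))
        x, _ = self.lstm(x)
        # Setpoint prediction from the last time step of each sequence;
        # ReLU after the output layer follows the textual description.
        return torch.relu(self.head(x[:, -1, :]))

# usage: PolicyNetwork()(torch.randn(8, 50, 41)) -> tensor of shape (8, 7)
```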

The dataset for training the dynamic optimizer was generated from a revised simulation of the TE process, originally introduced by Downs and Vogel31. The revision, made by Bathelt et al.36, addresses key aspects of usability and reproducibility. The training data was collected from an unoptimized plant operating under the decentralized control structure described above.

Considering that the TE process operates at different product compositions, it is reasonable to assume that historical data would include transitions between various operating modes. Therefore, the training data was collected from a scenario where the production setpoint was increased from 17 m3/h to 23 m3/h while the mole percentage of product G was simultaneously decreased from 90% to 10%. This transition was followed by a rapid ramp-down in the production rate to 16.1 m3/h. The training data also incorporates a slow drift in reaction kinetics.

The total duration of the collected data is 4000 h, with a sampling time of 0.01 h. Figure 4 illustrates the production rate and composition ramps, along with the associated production cost. To mimic realistic operational conditions, all process measurements include Gaussian noise, with standard deviations typical of each measurement type, as specified by Downs and Vogel31. Furthermore, the training data incorporates periods of random variation in the feed compositions of components A, B, and C, which is a common occurrence in large-scale chemical processes and is suggested by the authors of the TE problem.

Fig. 4: Training data for the dynamic optimizer, generated from TE process simulations.
figure 4

a Transition in production rate, b change in mole percentage of product G, and c corresponding production costs. Data includes noise and random feed composition variations to simulate realistic conditions. Points are sampled for readability while preserving trends.

The final training dataset consists of 49 columns: 41 for measured variables, 7 for decision variables (optimization setpoints), and one for the stage cost. The stage cost, which captures the operational objectives and is used to label the training data, is given in Eq. (2), with the exact mathematical expression provided in Supplementary Note 1.

$$\begin{aligned}\text{Production cost} ={}& (\text{purge costs})\times (\text{purge rate})\\ &+(\text{product stream costs})\times (\text{product rate})\\ &+(\text{compressor costs})\times (\text{compressor work})\\ &+(\text{steam costs})\times (\text{steam rate})\end{aligned}$$
(2)
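A direct translation of Eq. (2) is sketched below. The cost coefficients are those tabulated by Downs and Vogel31 and given in Supplementary Note 1; they are passed in as arguments here rather than reproduced.

```python
def production_cost(purge_rate, product_rate, compressor_work, steam_rate,
                    purge_cost, product_stream_cost, compressor_cost, steam_cost):
    """Stage cost of Eq. (2): each measured rate is weighted by its unit cost."""
    return (purge_cost * purge_rate
            + product_stream_cost * product_rate
            + compressor_cost * compressor_work
            + steam_cost * steam_rate)
```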

Performance comparison with existing approaches

The performance of the learning-based dynamic optimizer is evaluated based on the average hourly production cost defined in Eq. (2). We benchmark our approach against several existing methods: the base case presented by ref. 31, the theoretical optimum calculated by ref. 34 (which assumes full state observability and noise-free measurements), an RTO strategy reported by ref. 37 (an improvement over the RTO in ref. 38), and the performance of an experienced operator, also reported in ref. 37. These results are examined further in the Discussion.

To account for the variability in training outcomes due to the random initialization of the neural network weights representing the optimization policy, we report the average hourly cost and standard deviation across three policies, each trained with a different random seed. This approach provides insight into the stability and robustness of the learning process and the resulting policies.

As demonstrated in Fig. 5, the learning-based dynamic optimizer significantly outperforms the base case, achieving cost reductions of 27.48% in Mode 1, 53.04% in Mode 2, and 10.32% in Mode 3. Notably, our approach not only matches but also surpasses the performance of the model-based RTO implemented by ref. 37 in certain cases. In particular, the learning-based optimizer exhibits superior performance in Mode 2, bringing it closer to the theoretical optimum. Moreover, the variability in performance across policies trained with different random seeds is minimal, indicating that the improvements are consistently maintained regardless of the initialization.

Fig. 5: Comparison of average hourly production costs for the TE process under different strategies.
figure 5

The proposed learning-based D-RTO approach is benchmarked against the base case31, a model-based RTO strategy37, the theoretical optimum34, and the performance of an experienced operator37. The learning-based D-RTO approach outperforms the base case, model-based RTO, and operator performance in key operating modes. Error bars represent the standard deviation of the learning-based D-RTO performance across three policies trained with different random seeds, indicating the robustness and consistency of the proposed approach.

An interesting observation is the apparent direct proportionality between the hourly cost and the magnitude of improvement: higher costs are associated with larger improvements. This can be attributed to the effect of the cost on the gradient magnitude within the learning algorithm's optimization process. The learning algorithm refines the policy by adjusting its parameters to minimize the expected cost, and the magnitude of the gradient dictates the size of the adjustment made to the policy or value estimate. Since the dataset encompasses all modes of operation and these are learned simultaneously, a higher cost produces larger gradients, resulting in more substantial policy improvements.

Table 3 provides a detailed breakdown of the total hourly production costs, while Table 4 compares the setpoints selected by different strategies. The measured production rates under all compared methods are similar: 22.95 m3/h in the base case, 22.89 m3/h in the theoretically optimal case, 23.0 m3/h in Duvall's RTO, and 22.91 m3/h in our proposed approach. The learning-based optimizer achieves significant reductions in product losses, primarily by operating at higher reactor pressures and lower liquid levels. The lower liquid levels are economically advantageous because of the increased gas residence time, which has an effect similar to increased pressure. This strategy enables sustained production with either a higher inert concentration (lowering the purge rate) or reduced reactor temperatures (enhancing selectivity)34. The decrease in purge losses is achieved by operating at elevated temperatures to reduce the purge rate, although excessively high temperatures could adversely affect the reactor's selectivity for products G and H. Unlike RTO, the learning-based optimizer does not reduce the steam cost significantly, since it does not close the steam valve. Finally, the compressor cost is reduced by decreasing the recycle rate, although this saving is partially offset by the increase in reactor pressure.

Table 3 Breakdown of average hourly production costs for the TE process in Mode 1, comparing the base case, model-based RTO by Duvall and Riggs37, the learning-based optimizer (this work), and the performance of an experienced operator
Table 4 Comparison of setpoints selected by different optimization strategies across various operational modes of the TE process

Performance under setpoint changes

Here we examine the effect of the dynamic optimizer presented in this work on the control of the process. Figure 6 illustrates two important results. First, the learning-based optimizer does not destabilize the process during significant setpoint changes in the production rate and product composition; the process under the dynamic optimizer achieves the specified product composition and production rate setpoints within the allowable 5% variation. Second, the learning-based optimizer dynamically selects setpoints that lead to a significant reduction in operating costs, as shown in the cumulative cost plot (Fig. 6c).

Fig. 6: Setpoint adjustments by the learning-based optimizer during significant changes in TE process conditions.
figure 6

a, b Stable control of product composition and production rate. c Cumulative cost reduction. d Normalized changes in decision variables demonstrate adaptation to optimal setpoints. Here, yA and yAC denote the respective proportions of A and A + C in the reactor feed.

Figure 6 also presents the normalized change in decision variables, with the direction of changes (i.e., increase or decrease) in setpoints being consistent with the findings in the optimality study by Ricker34. The optimizer keeps the pressure near maximum, maintains the reactor level near the lower bound, increases the recycling rate (by closing the compressor recycle valve) except at high %G, reduces the steam valve, increases the reactor temperature, and reduces yAC.

The positive change in the reactor pressure aligns with the fact that a higher reactor pressure is more optimal due to its effect of increasing the inert partial pressure without decreasing the reactants' partial pressures, which reduces the purge rate and therefore the purge cost. Similarly, reducing the reactor level has an effect comparable to increasing the reactor pressure. The reduction in the compressor recycle valve and steam valve openings decreases the compression cost and steam cost, respectively. The optimizer raises the temperature to cut the purge rate and reduces yAC because an excess of A + C forces a lower inert concentration, which in turn increases purge losses.

Evidence of learning

To evaluate the learning-based optimizer’s ability to select more optimal setpoints for states not explicitly encountered in the training dataset, we analyze the reactor pressure setpoints chosen by the optimizer under specific operating conditions. Figure 7 illustrates the pressure training data, revealing that for product compositions between 90% and 80% G, the highest recorded pressure in the training set is 2750 kPa. However, during testing in Mode 3, which corresponds to a 90/10 G/H product composition, the learning-based optimizer selects a reactor pressure setpoint as high as 2879 kPa, as shown in the bottom plot of Fig. 7.

Fig. 7: Evidence of the learning-based optimizer’s ability to generalize and select optimal setpoints beyond the training data.
figure 7

Plot a shows the maximum reactor pressure recorded in the training data for product compositions between 90% and 80% G, with the highest pressure being 2750 kPa. In contrast, plot b demonstrates that during testing in Mode 3 (90/10 G/H composition), the learning-based optimizer selects a reactor pressure setpoint as high as 2879 kPa, which was not encountered in the training data. This behavior indicates that the optimizer has learned to associate higher pressures with improved economic performance and can adaptively choose setpoints that enhance operational efficiency, even in scenarios not explicitly represented in the training dataset.

This behavior demonstrates that the optimizer has learned to associate higher pressures with more optimal performance, prompting it to select a higher pressure for Mode 3, even in the absence of such instances in the training dataset. This finding confirms the optimizer’s ability to generalize from the learned data and adaptively choose setpoints that improve operational efficiency.

Effect of disturbances on optimization

In the subsequent analysis, our objective is to assess the impact of the learning-based optimizer on process stability. This evaluation focuses not only on its performance during setpoint changes but also on its response to a variety of disturbances typical of large-scale industrial operations. Furthermore, disturbances can alter the optimal conditions under which the process operates. Therefore, this analysis seeks to explore the capacity of the learning-based controller to dynamically adjust the setpoints in response to disturbances. The code provided by Downs and Vogel31 enables the activation or deactivation of specified disturbances.

Variations in feed composition are a significant disturbance in the operation of a chemical plant because they directly influence product selectivity and separation efficiency. Figure 8 illustrates the performance of the control system under random variations in the A, B, and C feed compositions, with and without the learning-based optimizer. It can be observed that the product composition and production rate with the learning-based optimizer are comparable to those in the base case without the optimizer. Furthermore, as demonstrated by the cumulative cost figure, the optimizer is capable of defining setpoints that reduce operating costs, even in the presence of feed composition disturbances. The observed fluctuations in the production rate can be attributed to the random variations in the feed composition of A, B, and C. These variations directly impact the production rate, as it is controlled by manipulating the feed rates.

Fig. 8: Impact of the optimizer on process performance under feed composition disturbances (IDV(8)31).
figure 8

a Product composition, b production rate, and c cumulative costs show the optimizer maintains stability and reduces costs despite random feed variations. The shaded area indicates the allowable 5% variation in product composition and production rate setpoints.

Drift in reaction kinetics can occur in real processes due to catalyst degradation and reactor fouling. Figure 9 demonstrates that the learning-based optimizer does not exacerbate variations in product composition and production rate as a result of the shift in kinetics, thereby maintaining process stability. Furthermore, Fig. 9 illustrates that the optimizer is capable of selecting setpoints that minimize cumulative costs.

Fig. 9: Robustness of the optimizer under reaction kinetics drift (IDV(13)31).
figure 9

a Product composition, b production rate, and c cumulative costs indicate stability and cost reductions, showcasing adaptability to disturbances, such as catalyst degradation or fouling. The shaded area indicates the allowable 5% variation in product composition and production rate setpoints.

Other prevalent disturbances in industrial processes include temperature variations in utility streams, such as cooling water, and valve malfunctions. Figure 10 illustrates the impact of random fluctuations in the condenser cooling water inlet temperature on product composition and production rate, compounded by sticking of the condenser cooling water valve. Despite these challenges, the learning-based optimizer successfully maintains process stability while selecting setpoints that reduce cumulative costs.

Fig. 10: Handling of concurrent disturbances (IDV(12,15)31) by the optimizer.
figure 10

a Product composition, b production rate, and c cumulative costs demonstrate stability and economic performance under temperature and valve disturbances. The shaded area indicates the allowable 5% variation in product composition and production rate setpoints.

Online computational efficiency

The proposed method, which relies on evaluating a trained neural network, offers considerable advantages in online computational efficiency compared to methods that solve an optimization problem online. In the TE problem, conventional model-based dynamic optimization can lead to a large-scale optimization problem with 11,400 optimization variables, 10,740 constraints, and 660 degrees of freedom, which is computationally intensive for online execution with a sampling time of 100 s39. In contrast, our online policy, based on a trained neural network, involves straightforward computations such as matrix multiplications and simple activation functions, which are well optimized and easily parallelized. The online computation of the learning-based dynamic optimization policy in this work takes only 31 s on an Intel Core i9, 2.3 GHz CPU with 16 GB RAM.
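The online execution path therefore reduces to a loop of the kind sketched below: read measurements, evaluate the trained policy, and hand the (override-checked) setpoints to the supervisory layer. The `read_measurements` and `write_setpoints` callables and the 50-sample observation window are placeholders for the plant interface, not part of the published implementation.

```python
import time
import torch

SAMPLE_TIME_S = 100  # sampling interval used in the TE study

def run_online(policy: torch.nn.Module, read_measurements, write_setpoints,
               horizon_steps: int = 1000):
    """Online execution of the trained policy. `read_measurements` and
    `write_setpoints` are placeholders for the plant/DCS interface."""
    policy.eval()
    history = []
    with torch.no_grad():
        for _ in range(horizon_steps):
            y = torch.as_tensor(read_measurements(), dtype=torch.float32)
            history.append(y)
            obs_seq = torch.stack(history[-50:]).unsqueeze(0)  # (1, T, n_obs)
            setpoints = policy(obs_seq).squeeze(0).numpy()
            write_setpoints(setpoints)   # supervisory layer tracks these
            time.sleep(SAMPLE_TIME_S)    # only policy evaluation runs online
```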

Sensitivity to training data

We present a comparative analysis of two distinct datasets to examine the sensitivity of policy performance to the composition of the training data. The first dataset includes both an increase in the production setpoint from 17 m3/h to 23 m3/h and a decrease in the mole percentage of product G from 90% to 10%, followed by a rapid reduction in the production rate to 16.1 m3/h, as depicted in Fig. 4. The second dataset replicates these conditions but omits the rapid ramp-down event, which constitutes 7% of the original dataset, as illustrated in Fig. 11. Although the two datasets are fairly similar, it remains important to investigate whether minor variations in the dataset can result in significant performance degradation, which would indicate a lack of robustness in the learning algorithm.

Fig. 11: Alternative training dataset for evaluating the sensitivity of the optimizer’s performance to variations in training data composition.
figure 11

a Transition in production rate from 17 m3/h to 23 m3/h over 4000 hours; b mole percentage of product G decreasing from 90% to 10%; and c associated production cost. This dataset omits the rapid ramp-down event in production rate seen in Fig. 4. Data points were sampled for readability without losing overall trends.

Table 5 compares the performance of the optimization policy when trained on Datasets 1 and 2 in terms of the average hourly production cost across the three operation modes. We observe an increase in production cost for the policy trained with the smaller dataset. However, it still represents a significant improvement over the base case and maintains comparable performance with RTO.

Table 5 Sensitivity analysis of optimizer performance to variations in training datasets

Variability in learned optimization policies

In the Performance comparison with existing approaches section, we analyze the variability of the overall performance of the learned policy as influenced by the initial weights of the neural network. Here, we examine the behavioral differences between optimization policies derived from the same learning algorithm but initialized with different random seeds.

Figure 12 illustrates the normalized differences in setpoints selected by two policies trained under identical conditions, with the sole distinction being the random seed used for initializing the neural network weights. If the two policies were identical, the normalized differences for all setpoints would remain at zero throughout the operation. However, the observed deviations from zero indicate that the policies achieve their optimization objectives through different adjustments to the setpoints.

Fig. 12: Normalized differences in setpoints selected by two policies trained under identical conditions, differing only in the random seed used to initialize neural network weights.
figure 12

Deviations from zero indicate differences in the setpoints selected by the policies. For instance, the second policy chooses lower reactor pressure and higher recycle valve position (suboptimal) but compensates by selecting a lower reactor level (optimal). These variations illustrate the learned policies' behavioral variability and the optimizer’s ability to achieve objectives via alternative strategies. Here, yA and yAC denote the respective proportions of A and A + C in the reactor feed.

Notably, the second policy opts for a relatively lower reactor pressure and a higher recycle valve position compared to the first policy. From a process optimization perspective, these setpoint choices are considered suboptimal. However, the second policy compensates for these suboptimal decisions by selecting a lower reactor level, which aligns with the optimal operating strategy.

Discussion

Our data-driven optimization approach, distinct from conventional RTO, eliminates the need for dynamic modeling and for online parameter and state estimation. This offers a practical advantage by reducing convergence issues through offline optimization, even though model-based RTO may slightly outperform it in specific operational modes. The practical applicability of our method is underscored by the difficulty of maintaining and updating a rigorous nonlinear steady-state RTO model in the face of frequent changes to plant equipment and operating conditions, the challenge of estimating the plant steady state for RTO execution, and the inherent limitations of estimating the true plant state from noisy measurements in the presence of disturbances and transient behavior.

One of the key advantages of our approach is its ability to learn from historical data, which inherently captures the unique characteristics and constraints of the plant. This enables the optimization policy to be tailored to the specific operating conditions and requirements of the process, without relying on complex and often inaccurate process models. By learning directly from the data, the proposed method can handle the nonlinearities, disturbances, and uncertainties that are prevalent in industrial processes, leading to improved performance and robustness.

However, our approach is not without limitations. The dependence on historical data for setpoint selection prevents the optimizer from choosing optimal setpoints beyond the observed range. This limitation can be mitigated by expanding the training dataset to cover a wider range of operating conditions. It is important to note that the learned policy is inherently conservative, as the learning algorithm restricts actions to those observed in the dataset. While less conservative actions could potentially lead to more optimal performance, they also introduce the risk of destabilizing the plant. The effectiveness and safety of the optimizer are directly correlated with the volume and quality of historical data used for training. Therefore, careful curation of the training dataset is essential to ensure the optimizer’s performance and reliability.

The introduction of neural networks as the foundation of our dynamic optimization policy presents both opportunities and challenges. Neural networks enable the capture of complex nonlinear relationships and the learning of optimizing policies directly from data, but their "black-box" nature can hinder interpretability and raise concerns about the specific characteristics of the optimization policy. Although we can guarantee baseline policy improvements, predicting the exact manner in which process improvements manifest remains a challenge.

To address these challenges, the field of data-driven control presents a promising avenue for future research28. Advances in this area can provide vital insights and methodologies to improve the interpretability and effectiveness of neural network-based control systems. Techniques such as physics-informed neural networks, which incorporate prior knowledge of the system’s physical laws into the learning process, could help bridge the gap between data-driven and model-based approaches, enhancing the interpretability and reliability of the learned policies.

As illustrated in the Variability in learned optimization policies section, the analysis of behavioral differences between optimization policies highlights the importance of considering multiple training iterations and evaluating the range of learned strategies. By examining the variability in the learned policy and its impact on process performance, engineers can gain a more comprehensive understanding of the optimization landscape and the potential for alternative operating strategies.

Finally, the integration of domain expertise and process knowledge into the data-driven optimization framework is important for successful implementation. The selection of appropriate control structures, the identification of critical setpoints, and the definition of safety limits rely heavily on the insight and experience of process engineers. The complementarity of engineering expertise and AI-driven optimization highlights the importance of interdisciplinary collaboration in the development and deployment of data-driven solutions in the process industries.

Conclusion

This research has successfully demonstrated the feasibility and effectiveness of a data-driven dynamic optimization strategy within the CPI, presenting a significant departure from traditional RTO methodologies. By focusing on direct policy extraction from historical data, we have shown that it is possible to achieve considerable operational improvements without the intricate requirements of dynamic modeling and online parameter estimation. The application to the TE challenge problem highlights our method's practicality, showing promising results in efficiency and cost savings. However, challenges related to the "black-box" nature of neural networks and the generalization of learned policies highlight the need for ongoing research in data-driven control. Future work should aim to improve the interpretability, reliability, and expansion of training datasets to further enhance the scope and applicability of this approach in real-world industrial settings. As research in this field advances, we anticipate the development of improved data-driven optimization frameworks that will drive the transformation of the process industries toward a more sustainable and efficient future.