Wuppertal. Deep neural networks provably solve Bellman equations for Markov decision processes without the curse of dimensionality

I’m Konrad Kleinberg, a PhD student in the Applied and Computational Mathematics (ACM) working group at the University of Wuppertal, Germany.

Within the broader scope of the DFG-funded priority program 2298, Theoretical Foundations of Deep Learning, one of our primary goals is to rigorously understand why deep neural networks are so effective at solving high-dimensional problems, specifically stochastic control problems.

In autonomous driving, a robot must navigate a cluttered and busy environment by determining a sequence of actions that brings it to a target while avoiding obstacles. Each action influences the robot’s position and all subsequent decisions, and the task requires processing high-dimensional sensory data (camera and LiDAR inputs). In finance, particularly in portfolio optimization, traders make sequential decisions about asset allocations. Here, high dimensionality arises from the large number of assets and complex market data, with each decision affecting future portfolio values and the associated risks. In smart energy systems, a controller must decide how to store and distribute energy over time across multiple sources and multiple consumers. These are three real-world examples of sequential decision-making problems in which agents interact with their environment in order to maximize a specified reward accumulated over the entire time period. Specifically, one is interested in identifying an optimal strategy – a decision rule that prescribes an optimal action in each given state of the system.

Mathematically, such stochastic control problems are typically modeled as Markov decision processes (MDPs). MDPs are stochastic processes in which the state transition probabilities are influenced by the agent’s actions. In real-world applications such as those mentioned above, the state space of an MDP can be high-dimensional. To find an optimal strategy that maximizes the cumulative reward, one must approximate the optimal value function, i.e., the solution of a stochastic fixed-point equation known as the Bellman equation. However, approximating the optimal value function in high-dimensional state spaces poses significant computational challenges: for many algorithms, the computational resources required to achieve a given accuracy grow exponentially with the dimension. This phenomenon is referred to as the curse of dimensionality (cf., e.g., [1], [2]). In fact, it has been shown (cf., e.g., [3], [4]) that all deterministic algorithms suffer from this curse.
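
To fix ideas, in a standard textbook formulation (the precise setting in [5] and [6] differs in some details), for an MDP with state space $S$, action space $A$, one-step reward $r$, random next state $S'$ governed by the transition dynamics, and discount factor $\gamma \in (0,1)$, the optimal value function $V^{*}$ satisfies the Bellman equation

$$
V^{*}(s) \;=\; \sup_{a \in A} \Big( r(s,a) \;+\; \gamma\, \mathbb{E}\big[\, V^{*}(S') \,\big|\, S = s,\; A = a \,\big] \Big), \qquad s \in S.
$$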

In our recent works [5] and [6], we addressed this issue in two steps. First, by combining ideas from the full-history recursive multilevel Picard (MLP) approximation method, a nonlinear Monte Carlo scheme recently introduced to solve certain nonlinear partial differential equations [7], with concepts from Q-learning [8], we proposed in [5] a variant of MLP approximations tailored to Bellman equations. This Monte Carlo method approximates solutions of Bellman equations with a computational effort that grows at most polynomially in both the dimension of the state space and the reciprocal of the desired approximation accuracy. In particular, the algorithm avoids the curse of dimensionality.
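
To convey the flavor of the recursion, here is a minimal, purely illustrative Python sketch of an MLP-type telescoping estimator for the Q-function fixed point $Q(s,a) = r(s,a) + \gamma\,\mathbb{E}[\max_{a'} Q(S',a') \mid S = s, A = a]$. The toy dynamics, reward, action set, and all parameter choices below are invented for illustration and do not reproduce the precise scheme, setting, or analysis of [5].

```python
import numpy as np

# A purely illustrative MLP-type (multilevel Picard) Monte Carlo recursion for the
# Q-function fixed point
#     Q(s, a) = r(s, a) + gamma * E[ max_{a'} Q(S', a') | S = s, A = a ].
# Level l of the telescoping sum is estimated with M**(n - l) samples, so coarse
# corrections get many samples and fine corrections get few. The MDP below
# (random-walk dynamics, quadratic reward, three actions) is a made-up toy example
# and does not reproduce the precise scheme or setting of [5].

rng = np.random.default_rng(0)
gamma = 0.9                              # discount factor (assumed)
actions = np.array([-1.0, 0.0, 1.0])     # small finite action set (assumed)

def reward(s, a):
    # toy one-step reward: stay near the origin, penalize large actions
    return -np.sum(s**2) - 0.1 * a**2

def sample_next_state(s, a):
    # toy dynamics: drift by the chosen action plus Gaussian noise in each coordinate
    return s + a + 0.1 * rng.standard_normal(s.shape)

def mlp_q(s, a, n, M):
    """Level-n MLP-style Monte Carlo estimate of Q(s, a)."""
    if n == 0:
        return 0.0
    # level l = 0 of the telescoping sum reduces to the deterministic one-step reward
    estimate = reward(s, a)
    # levels l = 1, ..., n-1 add Monte Carlo corrections with M**(n - l) samples each
    for l in range(1, n):
        num_samples = M ** (n - l)
        correction = 0.0
        for _ in range(num_samples):
            s_next = sample_next_state(s, a)
            v_l = max(mlp_q(s_next, b, l, M) for b in actions)
            v_prev = max(mlp_q(s_next, b, l - 1, M) for b in actions)
            correction += v_l - v_prev
        estimate += gamma * correction / num_samples
    return estimate

# Example: estimate Q(s0, a) at a 5-dimensional state for action a = 1.0
s0 = np.zeros(5)
print(mlp_q(s0, 1.0, n=3, M=3))
```

The key structural feature is the telescoping over levels: crude approximations are corrected with many cheap samples, while accurate approximations are corrected with only a few expensive ones. This balancing of sample numbers across levels is what underlies the polynomial complexity bounds of MLP-type methods.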

Second, in [6], we proved that deep neural networks can overcome the curse of dimensionality in the approximation of solutions of Bellman equations. To this end (see [9] for similar techniques), we showed that the aforementioned algorithm can itself be represented by deep neural networks. This representation, combined with the curse-of-dimensionality-free error bounds for the Monte Carlo scheme in [5], establishes the existence of deep neural networks that approximate the solution of the Bellman equation with a number of parameters that grows at most polynomially in both the dimension of the state space and the reciprocal of the desired approximation accuracy.
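
Schematically, and suppressing the precise norm, assumptions, and the structure of the constants, the result has the following form, where $V^{*}_{d}$ denotes the solution of the $d$-dimensional Bellman equation, $\mathcal{R}(\Phi)$ the function realized by a deep neural network $\Phi$, and $\#\mathrm{params}(\Phi)$ its number of parameters (the notation here is ours, chosen only for illustration):

$$
\exists\, c > 0 \;\; \forall\, d \in \mathbb{N},\ \varepsilon \in (0,1] \;\; \exists\ \text{DNN } \Phi_{d,\varepsilon}\colon\quad
\big\| \mathcal{R}(\Phi_{d,\varepsilon}) - V^{*}_{d} \big\| \le \varepsilon
\quad\text{and}\quad
\#\mathrm{params}(\Phi_{d,\varepsilon}) \le c\, d^{c}\, \varepsilon^{-c}.
$$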

References
[1] Bellman, R., (1957), Dynamic Programming, Princeton University Press
[2] Novak, E., Woźniakowski, H., (2008), Tractability of multivariate problems. Vol. 1: Linear information, EMS Tracts in Mathematics 6, European Mathematical Society (EMS), Zürich
[3] Chow, C.-S., Tsitsiklis, J. N., (1989), The complexity of dynamic programming, J. Complexity 5, 4, 466-488
[4] Chow, C.-S., Tsitsiklis, J. N., (1991), An optimal one-way multigrid algorithm for discrete-time stochastic control, IEEE Trans. Automat. Control 36, 8, 898-914
[5] Beck, C., Jentzen, A., Kleinberg, K., Kruse, T., (2025), Nonlinear Monte Carlo Methods with Polynomial Runtime for Bellman Equations of Discrete Time High-Dimensional Stochastic Optimal Control Problems, Appl. Math. Optim. 91, 26
[6] Jentzen, A., Kleinberg, K., Kruse, T., (2025), Deep neural networks can provably solve Bellman equations for Markov decision processes without the curse of dimensionality, arXiv:2506.22851
[7] Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T. A., von Wurstemberger, P., (2020), Overcoming the curse of dimensionality in the numerical approximation of semilinear parabolic partial differential equations, Proc. of the Royal Soc. A 476, 2244
[8] Watkins, C., (1989), Learning from delayed rewards, Ph.D. Thesis, University of Cambridge
[9] Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T. A., (2020), A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations, SN Partial Differ. Equ. Appl. 1, 10