![]() |
VOOZH | about |
Value iteration is a fundamental algorithm in the field of Reinforcement learning(RL) used to find the optimal policy for a Markov Decision Process (MDP). It iteratively updates the value of each state by considering all possible actions and states, progressively refining its estimates until it converges to the optimal solution. This allows an agent to find the best strategy.
The goal of an MDP is to find an optimal policy that maximizes the expected cumulative reward for the agent over time.
Start by initializing the value function for all states. Typically, this value is set to zero for all states at the beginning.
Iteratively update the value function using the Bellman equation:
This equation calculates the expected cumulative reward for taking action in state , transitioning to state and then following the optimal policy thereafter.
Continue the iteration until the value function converges i.e the change in the value function between iterations is smaller than a predefined threshold .
Once the value function has converged, the optimal policy can be derived by selecting the action that maximizes the expected cumulative reward for each state:
Letβs implement the Value Iteration algorithm using a simple MDP with three states: and two actions.
Using the value iteration algorithm, we can find the optimal policy and value function for this MDP.
In this step, we will be using Numpy library and we define the states, actions and the transition model and reward function that govern the system.
Here we implement the Value Iteration process that iteratively updates the value of each state until convergence.
Now we call the value_iteration function with the defined parameters and display the results (optimal policy and value function).
Output: