Introduction
In this project, we will help Wheelbot determine an optimal policy that will
enable it to gather the maximum amount of reward (in expectation) in a
stochastic but fully-observable environment.
In Project 2, Wheelbot moved deterministically. Telling Wheelbot to move in a
certain direction resulted, with probability 1, in Wheelbot moving in that
direction. In this project, Wheelbot will move stochastically around its
environment. Additionally, Wheelbot will be given a reward signal upon
transitioning into a new state. Obstacles of known location will still be
present.
Reinforcement Learning
We can formalize this environment as a Markov Decision Process with states S,
actions A, transition probabilities T and rewards R. The reward function
R(s) will give Wheelbot a reward signal for being in state s ∈ S. There will
be a transition probability T(s, a, s') for each possible combination of
current state s ∈ S, action a ∈ A, and successor state s' ∈ S.
The states of the environment are the grid cells. The environment is fully
observable because Wheelbot always knows which grid cell it is in. Wheelbot’s
transition model is as follows: Wheelbot makes the “correct” transition with
probability 0 ≤ pc ≤ 1 for any action a. The remaining probability mass is
divided equally amongst incorrect transitions to neighboring grid cells. For
example, if Wheelbot is told to go Up and pc = 0.85, it will transition Up
with probability 0.85. With probability 0.05 it will go Right. With
probability 0.05 it will go Left. With probability 0.05 it will go Down. Note
that care must be taken when considering the edges of the grid. When Wheelbot
tries to transition into a wall, it will stay where it is. Along the edges of
the environment and in the corners, self-transitions will have non-zero
probability, and the transition function will be different than it is in the
rest of the grid. To simplify calculations, we will not allow diagonal
transitions in this project. Only the actions A = {Up, Down, Left, Right} are
valid.
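The transition model described above can be sketched in C++ roughly as
follows. This is only an illustration, not the simulator's API: the Action
enum, the Cell alias, the step and transitionDistribution functions, and the
coordinate convention (0-based, y increasing upward) are all assumptions made
for this example.

```cpp
#include <array>
#include <cassert>
#include <map>
#include <utility>

// Hypothetical types; the simulator's own grid interface is not shown here.
enum class Action { Up, Down, Left, Right };
using Cell = std::pair<int, int>;  // (x, y), assumed 0-based, y grows upward

// Apply one move. A move that would leave the grid (a wall) is a
// self-transition: the robot stays where it is.
Cell step(Cell s, Action a, int width, int height) {
    int x = s.first, y = s.second;
    switch (a) {
        case Action::Up:    ++y; break;
        case Action::Down:  --y; break;
        case Action::Left:  --x; break;
        case Action::Right: ++x; break;
    }
    if (x < 0 || x >= width || y < 0 || y >= height) return s;
    return {x, y};
}

// Returns T(s, a, .) as a map from successor cell to probability.
// The commanded move gets pc; the other three moves share 1 - pc equally.
std::map<Cell, double> transitionDistribution(Cell s, Action a, double pc,
                                              int width, int height) {
    assert(pc >= 0.0 && pc <= 1.0);
    const std::array<Action, 4> all = {Action::Up, Action::Down,
                                       Action::Left, Action::Right};
    std::map<Cell, double> dist;
    for (Action actual : all) {
        const double p = (actual == a) ? pc : (1.0 - pc) / 3.0;
        dist[step(s, actual, width, height)] += p;  // walls accumulate on s
    }
    return dist;
}
```

For instance, in the corner (0, 0) of a 3 x 3 grid with action Up and
pc = 0.85, the Down and Left moves both hit walls, so the self-transition
receives 0.05 + 0.05 = 0.10, Up receives 0.85, and Right receives 0.05.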
Project Details
Your task in this project is as follows: given a list of known obstacle
locations, a goal location, and pc, solve this MDP using the value iteration
algorithm to generate an optimal policy π(s) → a that maps states to optimal
actions so as to maximize Wheelbot’s expected utility. Wheelbot’s expected
utility will be defined in terms of a reward function R(s) that you define.
This reward function should be generalizable to any set of obstacle locations
and any goal location. You will also need to define the specifics of T given
the definition of Wheelbot’s transition model above.
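As a rough illustration of the task, the sketch below runs value iteration
with the update V(s) <- R(s) + gamma * max_a sum_s' T(s, a, s') V(s') and then
extracts a greedy policy. Everything in it is an assumption made for the
example rather than part of the simulator: the GridMdp struct, the solve
function, the particular reward values (+100 goal, -100 obstacle, -1 per
step), and the choice to treat the goal and obstacles as terminal states.
Designing your own R(s) is part of the assignment.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <set>
#include <utility>
#include <vector>

// All names here are hypothetical; they do not come from the simulator.
enum class Action { Up, Down, Left, Right };
using Cell = std::pair<int, int>;  // (x, y), assumed 0-based
const std::array<Action, 4> kActions = {Action::Up, Action::Down,
                                        Action::Left, Action::Right};
const double kGamma = 0.9;  // discount factor given in the handout

struct GridMdp {
    int width, height;
    Cell goal;
    std::set<Cell> obstacles;
    double pc;  // probability of executing the commanded move

    // One possible R(s); the values are assumptions, not prescribed.
    double reward(Cell s) const {
        if (s == goal) return 100.0;
        if (obstacles.count(s) > 0) return -100.0;
        return -1.0;  // small step cost to favor short paths
    }
    // Assumed: reaching the goal or an obstacle ends the episode.
    bool terminal(Cell s) const { return s == goal || obstacles.count(s) > 0; }

    Cell step(Cell s, Action a) const {
        int x = s.first, y = s.second;
        switch (a) {
            case Action::Up:    ++y; break;
            case Action::Down:  --y; break;
            case Action::Left:  --x; break;
            case Action::Right: ++x; break;
        }
        if (x < 0 || x >= width || y < 0 || y >= height) return s;  // wall
        return {x, y};
    }
};

using Values = std::vector<std::vector<double>>;

// Expected successor value when commanding action a in state s.
double expectedValue(const GridMdp& m, const Values& V, Cell s, Action a) {
    double q = 0.0;
    for (Action actual : kActions) {
        const double p = (actual == a) ? m.pc : (1.0 - m.pc) / 3.0;
        const Cell n = m.step(s, actual);
        q += p * V[n.first][n.second];
    }
    return q;
}

// Iterate until the largest change in V falls below tol, then return the
// greedy policy indexed as policy[x][y].
std::vector<std::vector<Action>> solve(const GridMdp& m, double tol = 1e-6) {
    Values V(m.width, std::vector<double>(m.height, 0.0));
    double delta = 0.0;
    do {
        delta = 0.0;
        Values next = V;
        for (int x = 0; x < m.width; ++x) {
            for (int y = 0; y < m.height; ++y) {
                const Cell s{x, y};
                double best = 0.0;  // terminal states have no future value
                if (!m.terminal(s)) {
                    best = expectedValue(m, V, s, kActions[0]);
                    for (Action a : kActions)
                        best = std::max(best, expectedValue(m, V, s, a));
                }
                next[x][y] = m.reward(s) + kGamma * best;
                delta = std::max(delta, std::fabs(next[x][y] - V[x][y]));
            }
        }
        V = next;
    } while (delta > tol);

    std::vector<std::vector<Action>> policy(
        m.width, std::vector<Action>(m.height, Action::Up));
    for (int x = 0; x < m.width; ++x) {
        for (int y = 0; y < m.height; ++y) {
            const Cell s{x, y};
            Action bestA = kActions[0];
            for (Action a : kActions)
                if (expectedValue(m, V, s, a) > expectedValue(m, V, s, bestA))
                    bestA = a;
            policy[x][y] = bestA;
        }
    }
    return policy;
}
```

With pc = 1 on a 3 x 2 grid that has an obstacle at (1, 0) and the goal at
(2, 0), this sketch sends the robot Up from (0, 0) and around the obstacle
rather than straight into it. The policy entries for terminal cells are
meaningless and can be ignored.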
Once you have generated an optimal policy, Wheelbot should follow that policy
to (hopefully) make it to the goal. Note that, due to the stochastic nature of
the environment, it is possible for Wheelbot to run into an obstacle, even
when following an optimal policy. Therefore, it is important to test your
program with a number of values of pc, including pc = 1, to ensure that this
does not happen when the environment becomes deterministic again. Another
possible method of testing is to set pc to several different (increasingly
large) values and try your program a large number of times for each such
value, keeping track of the number of times Wheelbot runs into an obstacle.
The environment should be kept the same through all runs. In general, the more
deterministic the environment becomes, the less often Wheelbot should run into
an obstacle. The environments that we use to test your program will be
solvable in the sense that there will be at least one possible path to the
goal state from the start state.
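The repeated-trial test described above could be organized along these lines.
This is a standalone sketch under stated assumptions, not the simulator's
interface: the policy is passed in as a plain function from cell to action,
and the names sampleActual and countObstacleHits, the fixed random seed, and
the maxSteps cutoff are all hypothetical. In the real project the robot would
be driven through the simulator instead.

```cpp
#include <functional>
#include <random>
#include <set>
#include <utility>

// Hypothetical types for the example only.
enum class Action { Up, Down, Left, Right };
using Cell = std::pair<int, int>;  // (x, y), assumed 0-based, y grows upward

Cell step(Cell s, Action a, int width, int height) {
    int x = s.first, y = s.second;
    switch (a) {
        case Action::Up:    ++y; break;
        case Action::Down:  --y; break;
        case Action::Left:  --x; break;
        case Action::Right: ++x; break;
    }
    if (x < 0 || x >= width || y < 0 || y >= height) return s;  // wall
    return {x, y};
}

// Sample the move actually executed: the commanded one with probability pc,
// otherwise one of the three other moves chosen uniformly.
Action sampleActual(Action commanded, double pc, std::mt19937& rng) {
    std::bernoulli_distribution correct(pc);
    if (correct(rng)) return commanded;
    std::uniform_int_distribution<int> pick(0, 2);
    int k = pick(rng);
    const Action all[4] = {Action::Up, Action::Down,
                           Action::Left, Action::Right};
    for (Action a : all) {
        if (a == commanded) continue;
        if (k-- == 0) return a;
    }
    return commanded;  // not reached; keeps the compiler satisfied
}

// Run `runs` episodes in the same environment and count how many end with
// the robot entering an obstacle cell.
int countObstacleHits(const std::function<Action(Cell)>& policy, Cell start,
                      Cell goal, const std::set<Cell>& obstacles, int width,
                      int height, double pc, int runs, unsigned seed,
                      int maxSteps = 1000) {
    std::mt19937 rng(seed);
    int hits = 0;
    for (int r = 0; r < runs; ++r) {
        Cell s = start;
        for (int t = 0; t < maxSteps && s != goal; ++t) {
            s = step(s, sampleActual(policy(s), pc, rng), width, height);
            if (obstacles.count(s) > 0) {
                ++hits;
                break;
            }
        }
    }
    return hits;
}
```

Calling countObstacleHits for a sequence of increasing pc values with the same
grid, start, and goal gives the counts to compare. At pc = 1 a policy that
routes around every obstacle should produce a count of zero.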
Wheelbot will have the same sensors as in Project 2, with the exception that
the local obstacle sensor will not be used in this project (as there are no
hidden obstacles).
Implementation Details
This project will require you to modify the files Project3.h and Project3.cpp
in the project source code. You are not to modify any other file that is
part of the simulator. You can, however, add new files to the project to
implement new classes as you see fit. We will test your project by copying
your Project3.h and Project3.cpp files, as well as any files you have added to
the project, into our simulation environment and running it.
Feel free to use C++ standard library (STL) data structures and containers.
Additionally, if you have previously implemented data structures you wish to
reuse, please do. However, you must not use anything that trivializes the
problem. For instance, do not use a downloaded value iteration algorithm
package. You must implement value iteration yourself.
For full credit your R(s) must be valid (capable of solving the problem), and
your program must compute and follow the optimal policy relative to this
reward function. Importantly, if pc = 1, your robot should follow a shortest
path (there may be multiple such paths) to the goal. pc may be adjusted in
main.cpp during your testing. Set the discount factor to 0.9 for all testing.
Submission
Please submit your modified Project3.h and Project3.cpp, as well as any
additional files you added to complete this project, to Blackboard.