sim_pomdp
sim_pomdp(transition, observation, reward, discount,
          state_prior = rep(1, dim(observation)[[1]]) / dim(observation)[[1]],
          x0, a0 = 1, Tmax = 20, policy = NULL, alpha = NULL,
          reps = 1, ...)
Argument | Description |
---|---|
transition | Transition matrix, dimension n_s x n_s x n_a |
observation | Observation matrix, dimension n_s x n_z x n_a |
reward | Reward matrix, dimension n_s x n_a |
discount | The discount factor |
state_prior | Initial belief state; optional, defaults to uniform over states |
x0 | Initial state |
a0 | Initial action (defaults to action 1; this can be arbitrary if the observation process is independent of the action taken) |
Tmax | Duration of the simulation |
policy | Simulate using a pre-computed policy (e.g. an MDP policy) instead of the POMDP solution |
alpha | The matrix of alpha vectors returned by sarsop() |
reps | Number of replicate simulations to compute |
... | Additional arguments to mclapply |
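The dimension conventions above can be illustrated with a toy setup. The sketch below is not taken from the package documentation: the 2-state, 2-observation, 2-action numbers are invented for illustration, it assumes the first index of transition and observation refers to the current state, and the final call mirrors the constant-policy example shown further below.

library(sarsop)

## toy problem: 2 states, 2 observations, 2 actions (illustrative values only)
n_s <- 2; n_z <- 2; n_a <- 2

## transition[s, s', a]: probability of moving from state s to s' under action a
transition <- array(0, dim = c(n_s, n_s, n_a))
transition[, , 1] <- matrix(c(0.9, 0.1,
                              0.2, 0.8), n_s, n_s, byrow = TRUE)
transition[, , 2] <- matrix(c(0.5, 0.5,
                              0.5, 0.5), n_s, n_s, byrow = TRUE)

## observation[s, z, a]: probability of observing z given true state s
## (here identical for both actions)
observation <- array(0, dim = c(n_s, n_z, n_a))
observation[, , 1] <- observation[, , 2] <-
  matrix(c(0.8, 0.2,
           0.3, 0.7), n_s, n_z, byrow = TRUE)

## reward[s, a]: immediate reward for taking action a in state s
reward <- matrix(c(1, 0,
                   0, 1), n_s, n_a, byrow = TRUE)

discount <- 0.95

## simulate under a fixed pre-computed policy (always action 1),
## so no alpha vectors are needed
sim <- sim_pomdp(transition, observation, reward, discount,
                 x0 = 1, Tmax = 10, policy = rep(1, n_s))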
Returns a data frame with columns for time, state, obs, action, and (discounted) value.
The simulation assumes the following order of updating: for a system in state[t] at time t, an observation obs[t] of the system is made, and then action[t] is chosen based on that observation and the given policy, yielding (discounted) reward[t].
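For concreteness, here is a minimal sketch of this update order for the policy-based case. It is not the package source: it assumes the policy vector maps an observation index directly to an action (as when an MDP policy is supplied and the observation and state grids coincide), and that the observation at time t is generated under the previous action (a0 at t = 1).

## schematic re-implementation of the update order described above (assumption:
## policy[z] gives the action for observation z; observations use the previous action)
simulate_once <- function(transition, observation, reward, discount,
                          x0, a0 = 1, Tmax = 20, policy) {
  n_s <- dim(observation)[[1]]
  n_z <- dim(observation)[[2]]
  x <- z <- a <- integer(Tmax)
  value <- numeric(Tmax)
  state <- x0
  prev_a <- a0
  for (t in seq_len(Tmax)) {
    x[t] <- state                                               # system is in state[t]
    z[t] <- sample(n_z, 1, prob = observation[x[t], , prev_a])  # observation of state[t]
    a[t] <- policy[z[t]]                                        # action from observation + policy
    value[t] <- discount^(t - 1) * reward[x[t], a[t]]           # discounted reward[t]
    state <- sample(n_s, 1, prob = transition[x[t], , a[t]])    # transition to state[t+1]
    prev_a <- a[t]
  }
  data.frame(time = seq_len(Tmax), state = x, obs = z, action = a, value = value)
}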
# NOT RUN {
## Takes > 5s
## Use example code to generate matrices for pomdp problem:
source(system.file("examples/fisheries-ex.R", package = "sarsop"))
alpha <- sarsop(transition, observation, reward, discount, precision = 10)
sim <- sim_pomdp(transition, observation, reward, discount,
                 x0 = 5, Tmax = 20, alpha = alpha)

## compare to a simple constant harvest policy, with 4 replicates:
sim <- sim_pomdp(transition, observation, reward, discount,
                 x0 = 5, Tmax = 20,
                 policy = rep(2, length(states)), reps = 4)
# }
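As a usage note, the total discounted value of a run can be obtained by summing the value column; the sketch below assumes the data frame described above is returned directly or carried as sim$df, and for replicated simulations the sum would need to be grouped by replicate.

## total (discounted) value; the sim$df fallback is an assumed location
## for the data frame described in the Value section above
df_sim <- if (is.data.frame(sim)) sim else sim$df
sum(df_sim$value)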