sim_pomdp

sim_pomdp(transition, observation, reward, discount, state_prior = rep(1,
  dim(observation)[[1]])/dim(observation)[[1]], x0, a0 = 1, Tmax = 20,
  policy = NULL, alpha = NULL, reps = 1, ...)

Arguments

transition

Transition matrix, dimension n_s x n_s x n_a

observation

Observation matrix, dimension n_s x n_z x n_a

reward

Reward matrix, dimension n_s x n_a (a sketch of these array dimensions follows the argument list)

discount

the discount factor

state_prior

initial belief state, optional, defaults to uniform over states

x0

initial state

a0

initial action (defaults to action 1; it can be arbitrary if the observation process is independent of the action taken)

Tmax

duration of simulation

policy

Simulate using a pre-computed policy (e.g. an MDP policy) instead of the POMDP alpha vectors

alpha

the matrix of alpha vectors returned by sarsop

reps

number of replicate simulations to compute

...

additional arguments to mclapply
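
A minimal sketch (not part of the package) of arrays with the dimensions described above, for a toy problem with n_s = 2 states, n_z = 2 observations, and n_a = 2 actions. The convention that rows index the current (true) state and sum to one for each action is an assumption of this sketch:

n_s <- 2; n_z <- 2; n_a <- 2

transition <- array(0, dim = c(n_s, n_s, n_a))
transition[, , 1] <- rbind(c(0.9, 0.1),    # action 1: system mostly stays put
                           c(0.2, 0.8))
transition[, , 2] <- rbind(c(0.5, 0.5),    # action 2: more mixing between states
                           c(0.5, 0.5))

observation <- array(0, dim = c(n_s, n_z, n_a))
observation[, , 1] <- rbind(c(0.8, 0.2),   # imperfect detection of the true state
                            c(0.2, 0.8))
observation[, , 2] <- observation[, , 1]   # observation independent of the action here

reward <- cbind(c(1, 0),                   # column 1: reward for action 1, by state
                c(0, 1))                   # column 2: reward for action 2, by state

## These arrays could then be passed to sim_pomdp() with, e.g., a fixed
## policy (one action per state), as in the Examples below:
## sim_pomdp(transition, observation, reward, discount = 0.95,
##           x0 = 1, Tmax = 10, policy = rep(1, n_s))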

Value

a data frame with columns for time, state, obs, action, and (discounted) value.

Details

The simulation assumes the following order of updating: for a system in state[t] at time t, an observation obs[t] of the system is made, and action[t] is then chosen from that observation and the given policy, yielding the (discounted) reward[t].
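
A rough single-step sketch of this update order, for illustration only (this is not the package's internal implementation; the state-indexed policy vector, the discount^(t-1) convention, and the use of the previous action's observation matrix, in the spirit of the a0 argument, are assumptions of this sketch):

simulate_step <- function(state, prev_action, policy, transition,
                          observation, reward, discount, t) {
  ## observe the true state (the observation may depend on the previous action)
  obs <- sample(seq_len(dim(observation)[2]), 1,
                prob = observation[state, , prev_action])
  ## choose the action from the observation via the policy (one action per state)
  action <- policy[obs]
  ## collect the discounted reward for the true state and the chosen action
  value <- discount^(t - 1) * reward[state, action]
  ## finally, draw the next state given the chosen action
  next_state <- sample(seq_len(dim(transition)[2]), 1,
                       prob = transition[state, , action])
  list(time = t, state = state, obs = obs, action = action,
       value = value, next_state = next_state)
}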

Examples

# NOT RUN {
## Takes > 5s
## Use example code to generate matrices for the POMDP problem:
source(system.file("examples/fisheries-ex.R", package = "sarsop"))
alpha <- sarsop(transition, observation, reward, discount, precision = 10)
sim <- sim_pomdp(transition, observation, reward, discount,
                 x0 = 5, Tmax = 20, alpha = alpha)
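
## Summarize, assuming sim is the data frame described under Value:
sum(sim$value)   # total discounted value realized over the simulation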

## Compare to a simple constant harvest policy, with 4 replicates:
sim <- sim_pomdp(transition, observation, reward, discount,
                 x0 = 5, Tmax = 20, policy = rep(2, length(states)),
                 reps = 4)

# }