Evaluation & Testing

Evaluation is handled by the Tester class, with additional functionality for two-player environments implemented in the TwoPlayerTester class.

Evaluation can be run every epoch as part of a training process, or independently via --mode=test; both are configured the same way (a minimal configuration sketch follows the parameter list below). Multi-player environments are tested against any number of baselines, while single-player environments are evaluated with additional self-play episodes whose hyperparameters are separate from those used for training (for example, temperature = 0.0 and little to no added noise).

Evaluation/Testing Parameters

  • algo_config: hyperparameters/configuration for the trained algorithm; see the algorithm wiki page for more details
  • episodes_per_epoch: number of episodes to collect for each test/baseline
  • [Multi-Player envs] baselines: list of baseline evaluator configurations, see below
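
For concreteness, here is a minimal sketch of what a single-player evaluation block might look like. The field names beyond those listed above, and the use of a Python dict rather than the project's actual configuration file format, are assumptions for illustration only.

```python
# Hypothetical single-player evaluation block: extra self-play episodes with
# hyperparameters separate from training (e.g. temperature 0.0, no added noise).
# Field names other than those documented above are assumptions.
evaluation_config = {
    "episodes_per_epoch": 32,       # episodes collected per evaluation run
    "algo_config": {                # hyperparameters for the trained algorithm
        "temperature": 0.0,         # pick moves greedily during evaluation
        "dirichlet_epsilon": 0.0,   # no exploration noise at evaluation time
    },
}
```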

Baselines

Random

The random baseline chooses a random move from the set of legal moves in a given state.

Parameters

None
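
As a rough illustration, the selection rule amounts to the following (the function and its arguments are hypothetical, not names from this codebase):

```python
import random

def random_baseline_move(legal_moves):
    """Choose uniformly at random among the legal moves of the current state."""
    return random.choice(list(legal_moves))
```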

Greedy

Greedy baselines choose a move based on a specified heuristic. Each environment implements its own set of heuristics to choose from; for example, one of the heuristics implemented by Othello is corners, which rates actions higher when they place a tile in a corner. Each environment's wiki page lists the implemented heuristics (a sketch of this selection rule follows the parameter list below).

Parameters

  • heuristic: the environment-specific heuristic used to evaluate the current state
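
A minimal sketch of one reasonable reading of the selection rule, assuming hypothetical environment callables `step(state, move)` (returns the successor state) and `heuristic(state)` (an environment-specific score, such as Othello's corners):

```python
def greedy_baseline_move(state, legal_moves, step, heuristic):
    """Choose the legal move whose successor state scores highest under the heuristic."""
    return max(legal_moves, key=lambda move: heuristic(step(state, move)))
```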

Greedy MCTS

Much like AlphaZero, Greedy MCTS uses Monte Carlo Tree Search to explore game states, but rather than using a neural network to evaluate a state, it applies a heuristic at each leaf node. Greedy MCTS relies solely on backpropagating Q-values, using a uniform distribution as the policy for any given state (a sketch follows the parameter list below).

Parameters

  • heuristic: the environment-specific heuristic used to evaluate a leaf node
  • num_iters: the search budget. In this vectorized MCTS, an iteration moves along a single edge from one node to the next, so a budget of num_iters determines how many edges will be traversed during search. Note that this differs from other implementations of MCTS, where an iteration is defined as the process of traveling from the root node to a leaf node, and iterating for M iterations yields a search tree with M+1 nodes.
  • max_nodes: max size of the search tree
  • dirichlet epsilon: proportion of the root node policy composed of Dirichlet noise. This is more useful for AlphaZero/LazyZero; I would advise setting it to 0.0 (no noise) when using Greedy MCTS.
  • dirichlet alpha: magnitude of the Dirichlet noise
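
The sketch below illustrates the search loop described above: a uniform prior in place of a learned policy, heuristic evaluation at leaf nodes, running-mean Q backups, an edge-traversal budget (num_iters), and a cap on tree size (max_nodes). The environment callables (`legal_actions`, `step`, `heuristic`, `is_terminal`), the PUCT-style selection rule, and the single-player value convention (no sign flipping between players) are assumptions for illustration, not this repository's implementation.

```python
import math

class Node:
    """One search-tree node: running-mean value Q, visit count N, child edges."""
    def __init__(self, state):
        self.state = state
        self.q = 0.0        # running mean of backed-up heuristic values
        self.n = 0          # visit count
        self.children = {}  # action -> Node

def greedy_mcts(root_state, legal_actions, step, heuristic, is_terminal,
                num_iters, max_nodes, c_puct=1.0):
    root = Node(root_state)
    num_nodes = 1

    def expand(node):
        nonlocal num_nodes
        for action in legal_actions(node.state):
            if num_nodes >= max_nodes:   # respect the cap on search-tree size
                break
            node.children[action] = Node(step(node.state, action))
            num_nodes += 1

    expand(root)
    if not root.children:
        raise ValueError("no legal actions at the root state")

    budget = num_iters
    while budget > 0:
        node, path = root, [root]
        # Selection: walk edges until reaching an unexpanded node or the budget runs out.
        while node.children and budget > 0:
            parent = node
            prior = 1.0 / len(parent.children)   # uniform policy in place of a network
            node = max(
                parent.children.values(),
                key=lambda c: c.q + c_puct * prior * math.sqrt(parent.n + 1) / (1 + c.n),
            )
            path.append(node)
            budget -= 1   # each edge traversal spends one unit of num_iters

        # Expansion: grow the tree at the leaf (unless the game is over there).
        if not is_terminal(node.state):
            expand(node)

        # Evaluation + backup: score the leaf with the heuristic and propagate
        # the value up the path as a running-mean Q.
        value = heuristic(node.state)
        for visited in path:
            visited.n += 1
            visited.q += (value - visited.q) / visited.n

    # Final move choice: the most-visited edge out of the root.
    return max(root.children, key=lambda action: root.children[action].n)
```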

Example Configurations

Here are a few example Evaluation configuration files:
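
For example, a multi-player evaluation configuration testing a trained agent against a random baseline and a Greedy MCTS baseline might be structured like this. As above, the exact file format, field names, and nesting are assumptions, shown as a Python dict purely for illustration:

```python
# Hypothetical multi-player evaluation configuration: field names and values
# are illustrative assumptions, not the project's actual schema.
evaluation_config = {
    "episodes_per_epoch": 64,            # episodes collected for each test/baseline
    "algo_config": {
        "temperature": 0.0,
    },
    "baselines": [
        {"type": "random"},              # random legal moves, no parameters
        {
            "type": "greedy_mcts",
            "heuristic": "corners",      # environment-specific (Othello example)
            "num_iters": 200,            # edge-traversal search budget
            "max_nodes": 1000,           # cap on search-tree size
            "dirichlet_epsilon": 0.0,    # no root noise, as advised above
            "dirichlet_alpha": 0.3,
        },
    ],
}
```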
