Evaluation & Testing

Evaluation is handled by the Tester class, with additional functionality for two-player environments implemented in the TwoPlayerTester class.

Evaluation can be run every epoch as part of a training process, or independently via --mode=test; both are configured the same way (a minimal configuration sketch follows the parameter list below). Multi-player environments are tested against any number of baselines, while single-player environments are evaluated with additional self-play episodes whose hyperparameters are separate from those used for training (for example, temperature = 0.0 and little to no added noise).

Evaluation/Testing Parameters

  • algo_config: hyperparameters/configuration for the trained algorithm; see the algorithm wiki page for more details
  • episodes_per_epoch: number of episodes to collect for each test/baseline
  • [Multi-Player envs] baselines: list of baseline evaluator configurations, see below
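
For concreteness, here is a minimal sketch of what a single-player evaluation block might look like. The field names beyond those listed above, and the use of a Python dict rather than the project's actual configuration file format, are assumptions for illustration only.

```python
# Hypothetical single-player evaluation block: extra self-play episodes with
# hyperparameters separate from training (e.g. temperature 0.0, no added noise).
# Field names other than those documented above are assumptions.
evaluation_config = {
    "episodes_per_epoch": 32,       # episodes collected per evaluation run
    "algo_config": {                # hyperparameters for the trained algorithm
        "temperature": 0.0,         # pick moves greedily during evaluation
        "dirichlet_epsilon": 0.0,   # no exploration noise at evaluation time
    },
}
```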

Baselines

Random

The random baseline chooses a random move from the set of legal moves in a given state.

Parameters

None
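
As a rough illustration, the selection rule amounts to the following (the function and its arguments are hypothetical, not names from this codebase):

```python
import random

def random_baseline_move(legal_moves):
    """Choose uniformly at random among the legal moves of the current state."""
    return random.choice(list(legal_moves))
```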

Greedy

Greedy baselines choose a move based on a specified heuristic. Each environment implements its own set of heuristics to choose from; for example, one of the heuristics implemented by Othello is corners, which rates actions higher when they place a tile in a corner. Each environment's wiki page lists the implemented heuristics (a sketch of this selection rule follows the parameter list below).

Parameters

  • heuristic: the environment-specific heuristic used to evaluate the current state
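
A minimal sketch of one reasonable reading of the selection rule, assuming hypothetical environment callables `step(state, move)` (returns the successor state) and `heuristic(state)` (an environment-specific score, such as Othello's corners):

```python
def greedy_baseline_move(state, legal_moves, step, heuristic):
    """Choose the legal move whose successor state scores highest under the heuristic."""
    return max(legal_moves, key=lambda move: heuristic(step(state, move)))
```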

Greedy MCTS

Much like AlphaZero, Greedy MCTS uses Monte Carlo Tree Search to explore game states, but rather than using a neural network to evaluate a state, it applies a heuristic at each leaf node. Greedy MCTS relies solely on backpropagating Q-values, using a uniform distribution as the policy for any given state (a sketch follows the parameter list below).

Parameters

  • heuristic: the environment-specific heuristic used to evaluate a leaf node
  • num_iters: the search budget. In this vectorized MCTS, an iteration moves along a single edge from one node to the next, so a budget of num_iters determines how many edges will be traversed during search. Note that this differs from other implementations of MCTS, where an iteration is defined as the process of traveling from the root node to a leaf node, and iterating for M iterations yields a search tree with M+1 nodes.
  • max_nodes: max size of the search tree
  • dirichlet epsilon: proportion of the root node policy composed of Dirichlet noise. This is more useful for AlphaZero/LazyZero; I would advise setting it to 0.0 (no noise) when using Greedy MCTS.
  • dirichlet alpha: magnitude of the Dirichlet noise
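
The sketch below illustrates the search loop described above: a uniform prior in place of a learned policy, heuristic evaluation at leaf nodes, running-mean Q backups, an edge-traversal budget (num_iters), and a cap on tree size (max_nodes). The environment callables (`legal_actions`, `step`, `heuristic`, `is_terminal`), the PUCT-style selection rule, and the single-player value convention (no sign flipping between players) are assumptions for illustration, not this repository's implementation.

```python
import math

class Node:
    """One search-tree node: running-mean value Q, visit count N, child edges."""
    def __init__(self, state):
        self.state = state
        self.q = 0.0        # running mean of backed-up heuristic values
        self.n = 0          # visit count
        self.children = {}  # action -> Node

def greedy_mcts(root_state, legal_actions, step, heuristic, is_terminal,
                num_iters, max_nodes, c_puct=1.0):
    root = Node(root_state)
    num_nodes = 1

    def expand(node):
        nonlocal num_nodes
        for action in legal_actions(node.state):
            if num_nodes >= max_nodes:   # respect the cap on search-tree size
                break
            node.children[action] = Node(step(node.state, action))
            num_nodes += 1

    expand(root)
    if not root.children:
        raise ValueError("no legal actions at the root state")

    budget = num_iters
    while budget > 0:
        node, path = root, [root]
        # Selection: walk edges until reaching an unexpanded node or the budget runs out.
        while node.children and budget > 0:
            parent = node
            prior = 1.0 / len(parent.children)   # uniform policy in place of a network
            node = max(
                parent.children.values(),
                key=lambda c: c.q + c_puct * prior * math.sqrt(parent.n + 1) / (1 + c.n),
            )
            path.append(node)
            budget -= 1   # each edge traversal spends one unit of num_iters

        # Expansion: grow the tree at the leaf (unless the game is over there).
        if not is_terminal(node.state):
            expand(node)

        # Evaluation + backup: score the leaf with the heuristic and propagate
        # the value up the path as a running-mean Q.
        value = heuristic(node.state)
        for visited in path:
            visited.n += 1
            visited.q += (value - visited.q) / visited.n

    # Final move choice: the most-visited edge out of the root.
    return max(root.children, key=lambda action: root.children[action].n)
```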

Example Configurations

Here are a few example Evaluation configuration files:
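
For example, a multi-player evaluation configuration testing a trained agent against a random baseline and a Greedy MCTS baseline might be structured like this. As above, the exact file format, field names, and nesting are assumptions, shown as a Python dict purely for illustration:

```python
# Hypothetical multi-player evaluation configuration: field names and values
# are illustrative assumptions, not the project's actual schema.
evaluation_config = {
    "episodes_per_epoch": 64,            # episodes collected for each test/baseline
    "algo_config": {
        "temperature": 0.0,
    },
    "baselines": [
        {"type": "random"},              # random legal moves, no parameters
        {
            "type": "greedy_mcts",
            "heuristic": "corners",      # environment-specific (Othello example)
            "num_iters": 200,            # edge-traversal search budget
            "max_nodes": 1000,           # cap on search-tree size
            "dirichlet_epsilon": 0.0,    # no root noise, as advised above
            "dirichlet_alpha": 0.3,
        },
    ],
}
```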
