qgym.envs.routing.routing_rewarders module
This module contains some vanilla Rewarders for the Routing
environment.
- Usage:
The rewarders in this module can be customized by initializing them with different values.
from qgym.envs.routing import BasicRewarder

rewarder = BasicRewarder(
    illegal_action_penalty=-1,
    penalty_per_swap=-2,
    reward_per_surpass=3,
)
After initialization, the rewarders can be given to the
Routing
environment.
Note
When implementing custom rewarders, they should inherit from
Rewarder
. Furthermore, they must implement the
compute_reward()
method, which takes as input the
old state, the new state, and the given action. See the documentation of the
routing
module for more information on the state and
action space.
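As an illustration of this pattern, here is a minimal, self-contained sketch of a custom rewarder. The Rewarder base class below is a stand-in so the example runs on its own (in qgym you would inherit from the real Rewarder class), and ConstantPenaltyRewarder is a hypothetical example, not part of the library:

```python
class Rewarder:
    """Minimal stand-in for qgym's Rewarder base class."""

    def compute_reward(self, *, old_state, action, new_state):
        # Custom rewarders must override this method.
        raise NotImplementedError


class ConstantPenaltyRewarder(Rewarder):
    """Toy custom rewarder: fixed penalty per step, regardless of the action."""

    def __init__(self, step_penalty=-1.0):
        self.step_penalty = step_penalty

    def compute_reward(self, *, old_state, action, new_state):
        # The real method would inspect old_state, action, and new_state;
        # this toy version ignores them and returns a constant.
        return self.step_penalty


rewarder = ConstantPenaltyRewarder(step_penalty=-2.0)
print(rewarder.compute_reward(old_state=None, action=0, new_state=None))  # -2.0
```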
- class qgym.envs.routing.routing_rewarders.BasicRewarder(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10)[source]
Bases:
Rewarder
RL Rewarder, for computing rewards on the
RoutingState
.
- __init__(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10)[source]
Set the rewards and penalties.
- Parameters:
illegal_action_penalty (float) – Penalty for performing an illegal action. An action is illegal when the action means ‘surpass’ even though the next gate cannot be surpassed. This value should be negative (though this is not enforced) and defaults to -50.
penalty_per_swap (float) – Penalty for placing a swap. In general, we want as few swaps as possible. Therefore, this value should be negative and defaults to -10.
reward_per_surpass (float) – Reward given for surpassing a gate. In general, we want to reach the end of the circuit as quickly as possible. Therefore, this value should be positive and defaults to 10.
- compute_reward(*, old_state, action, new_state)[source]
Compute a reward, based on the old state, new state, and the given action.
- Parameters:
old_state (RoutingState) – RoutingState before the current action.
action (int) – Action that has just been taken.
new_state (RoutingState) – RoutingState after the current action.
- Return type:
float
- Returns:
The reward for this action.
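The reward logic described above can be sketched as a small stand-alone function. This is a simplification under stated assumptions: is_legal and is_swap stand in for checks the real implementation performs on the RoutingState, and basic_reward is a hypothetical name, not the qgym API:

```python
def basic_reward(is_legal, is_swap,
                 illegal_action_penalty=-50,
                 penalty_per_swap=-10,
                 reward_per_surpass=10):
    # Illegal actions are penalised regardless of what they would have done.
    if not is_legal:
        return illegal_action_penalty
    # A legal swap costs penalty_per_swap; a legal surpass earns reward_per_surpass.
    return penalty_per_swap if is_swap else reward_per_surpass


print(basic_reward(is_legal=False, is_swap=False))  # -50 (illegal surpass)
print(basic_reward(is_legal=True, is_swap=True))    # -10 (legal swap)
print(basic_reward(is_legal=True, is_swap=False))   # 10  (legal surpass)
```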
- class qgym.envs.routing.routing_rewarders.EpisodeRewarder(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10)[source]
Bases:
BasicRewarder
Rewarder for the
Routing
environment, which only gives a reward at the end of a full episode. The reward is highest for the lowest number of SWAPs. This could be extended to take into account the fidelity of edges and which edges the circuit is executed on.
- compute_reward(*, old_state, action, new_state)[source]
Compute a reward, based on the new state, and the given action.
- Parameters:
old_state (RoutingState) – RoutingState before the current action.
action (int) – Action that has just been taken.
new_state (RoutingState) – RoutingState after the current action.
- Return type:
float
- Returns:
If the action is illegal, the illegal_action_penalty is returned. If the episode is finished, the reward calculated over the episode is returned; otherwise 0 is returned.
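A minimal sketch of this episode-end behaviour, under the assumption that the episode reward is the accumulated swap penalty (episode_reward, episode_done, and n_swaps_placed are hypothetical names, not the qgym API):

```python
def episode_reward(is_legal, episode_done, n_swaps_placed,
                   illegal_action_penalty=-50,
                   penalty_per_swap=-10):
    # Illegal actions are penalised immediately.
    if not is_legal:
        return illegal_action_penalty
    # At the end of the episode: fewer swaps -> higher (less negative) reward.
    if episode_done:
        return penalty_per_swap * n_swaps_placed
    # During the episode, no reward is given.
    return 0


print(episode_reward(is_legal=True, episode_done=False, n_swaps_placed=3))  # 0
print(episode_reward(is_legal=True, episode_done=True, n_swaps_placed=3))   # -30
```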
- class qgym.envs.routing.routing_rewarders.SwapQualityRewarder(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10, good_swap_reward=5)[source]
Bases:
BasicRewarder
Rewarder for the
Routing
environment which takes swap qualities into account. The
SwapQualityRewarder
has an adjusted reward w.r.t. the
BasicRewarder
in the sense that good SWAPs give lower penalties and bad SWAPs give higher penalties.
- __init__(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10, good_swap_reward=5)[source]
Set the rewards and penalties and a flag.
- Parameters:
illegal_action_penalty (float) – Penalty for performing an illegal action. An action is illegal when the action means ‘surpass’ even though the next gate cannot be surpassed. This value should be negative (though this is not enforced) and defaults to -50.
penalty_per_swap (float) – Penalty for placing a swap. In general, we want as few swaps as possible. Therefore, this value should be negative and defaults to -10.
reward_per_surpass (float) – Reward given for surpassing a gate. In general, we want to reach the end of the circuit as quickly as possible. Therefore, this value should be positive and defaults to 10.
good_swap_reward (float) – Reward given for placing a good swap. In general, we want to place as few swaps as possible. However, when a swap is good, the penalty for its placement should be suppressed, which is what this reward does. The value should therefore be positive and smaller than the absolute value of penalty_per_swap, so that swaps never yield a positive reward. Defaults to 5.
- compute_reward(*, old_state, action, new_state)[source]
Compute a reward, based on the old state, the given action and the new state.
Specifically, the change in observation reach is used.
- Parameters:
old_state (RoutingState) – RoutingState before the current action.
action (int) – Action that has just been taken.
new_state (RoutingState) – RoutingState after the current action.
- Return type:
float
- Returns:
The reward for this action. If the action is illegal, the reward is the illegal_action_penalty. If the action is a legal surpass, the reward is just reward_per_surpass. For a legal swap, however, the reward is adjusted with respect to the BasicRewarder: the penalty of a swap is reduced if it increases the observation_reach, and increased if the observation_reach decreases.
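This adjustment can be sketched as follows. Here reach_change is a hypothetical normalised change in observation reach in [-1, 1] (positive for a good swap, negative for a bad one), and swap_quality_reward is a hypothetical name; the real implementation computes this from the old and new RoutingState:

```python
def swap_quality_reward(is_legal, is_swap, reach_change,
                        illegal_action_penalty=-50,
                        penalty_per_swap=-10,
                        reward_per_surpass=10,
                        good_swap_reward=5):
    # Illegal actions and legal surpasses behave exactly as in BasicRewarder.
    if not is_legal:
        return illegal_action_penalty
    if not is_swap:
        return reward_per_surpass
    # For a legal swap, shift the base penalty by the swap quality:
    # a good swap (reach_change > 0) reduces the penalty, a bad swap
    # (reach_change < 0) increases it. Since good_swap_reward is smaller
    # than |penalty_per_swap|, the result never becomes positive.
    return penalty_per_swap + good_swap_reward * reach_change


print(swap_quality_reward(is_legal=True, is_swap=True, reach_change=1.0))   # -5.0
print(swap_quality_reward(is_legal=True, is_swap=True, reach_change=-1.0))  # -15.0
```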