qgym.envs.routing.routing_rewarders module
This module contains some vanilla Rewarders for the Routing
environment.
- Usage:
The rewarders in this module can be customized by initializing them with different values.
from qgym.envs.routing import BasicRewarder

rewarder = BasicRewarder(
    illegal_action_penalty=-1,
    penalty_per_swap=-2,
    reward_per_surpass=3,
)
After initialization, the rewarders can be given to the
Routing
environment.
Note
When implementing custom rewarders, they should inherit from
Rewarder
. Furthermore, they must implement the
compute_reward()
method, which takes as input the
old state, the new state, and the given action. See the documentation of the
routing
module for more information on the state and
action space.
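As an illustration of this pattern, here is a minimal, self-contained sketch of a custom rewarder. The Rewarder base class below is a stand-in so the example runs on its own (in qgym you would inherit from the real Rewarder class), and ConstantPenaltyRewarder is a hypothetical example, not part of the library:

```python
class Rewarder:
    """Minimal stand-in for qgym's Rewarder base class."""

    def compute_reward(self, *, old_state, action, new_state):
        # Custom rewarders must override this method.
        raise NotImplementedError


class ConstantPenaltyRewarder(Rewarder):
    """Toy custom rewarder: fixed penalty per step, regardless of the action."""

    def __init__(self, step_penalty=-1.0):
        self.step_penalty = step_penalty

    def compute_reward(self, *, old_state, action, new_state):
        # The real method would inspect old_state, action, and new_state;
        # this toy version ignores them and returns a constant.
        return self.step_penalty


rewarder = ConstantPenaltyRewarder(step_penalty=-2.0)
print(rewarder.compute_reward(old_state=None, action=0, new_state=None))  # -2.0
```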
- class qgym.envs.routing.routing_rewarders.BasicRewarder(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10)[source]
Bases:
Rewarder
RL Rewarder, for computing rewards on the
RoutingState
.
- __init__(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10)[source]
Set the rewards and penalties.
- Parameters:
illegal_action_penalty (float) – Penalty for performing an illegal action. An action is illegal when the action means ‘surpass’ even though the next gate cannot be surpassed. This value should be negative (though this is not enforced) and defaults to -50.
penalty_per_swap (float) – Penalty for placing a swap. In general, we want as few swaps as possible. Therefore, this value should be negative and defaults to -10.
reward_per_surpass (float) – Reward given for surpassing a gate. In general, we want to reach the end of the circuit as quickly as possible. Therefore, this value should be positive and defaults to 10.
- compute_reward(*, old_state, action, new_state)[source]
Compute a reward, based on the old state, new state, and the given action.
- Parameters:
old_state (RoutingState) – RoutingState before the current action.
action (int) – Action that has just been taken.
new_state (RoutingState) – RoutingState after the current action.
- Return type:
float
- Returns:
The reward for this action.
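The reward logic described above can be sketched as a small stand-alone function. This is a simplification under stated assumptions: is_legal and is_swap stand in for checks the real implementation performs on the RoutingState, and basic_reward is a hypothetical name, not the qgym API:

```python
def basic_reward(is_legal, is_swap,
                 illegal_action_penalty=-50,
                 penalty_per_swap=-10,
                 reward_per_surpass=10):
    # Illegal actions are penalised regardless of what they would have done.
    if not is_legal:
        return illegal_action_penalty
    # A legal swap costs penalty_per_swap; a legal surpass earns reward_per_surpass.
    return penalty_per_swap if is_swap else reward_per_surpass


print(basic_reward(is_legal=False, is_swap=False))  # -50 (illegal surpass)
print(basic_reward(is_legal=True, is_swap=True))    # -10 (legal swap)
print(basic_reward(is_legal=True, is_swap=False))   # 10  (legal surpass)
```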
- class qgym.envs.routing.routing_rewarders.EpisodeRewarder(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10)[source]
Bases:
BasicRewarder
Rewarder for the
Routing
environment, which only gives a reward at the end of a full episode. The reward is highest for the lowest number of SWAPs. This could be extended to take into account the fidelity of edges and which edges the circuit is executed on.
- compute_reward(*, old_state, action, new_state)[source]
Compute a reward, based on the new state, and the given action.
- Parameters:
old_state (RoutingState) – RoutingState before the current action.
action (int) – Action that has just been taken.
new_state (RoutingState) – RoutingState after the current action.
- Return type:
float
- Returns:
If the action is illegal, the illegal_action_penalty is returned. If the episode is finished, the reward calculated over the episode is returned; otherwise 0 is returned.
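A minimal sketch of this episode-end behaviour, under the assumption that the episode reward is the accumulated swap penalty (episode_reward, episode_done, and n_swaps_placed are hypothetical names, not the qgym API):

```python
def episode_reward(is_legal, episode_done, n_swaps_placed,
                   illegal_action_penalty=-50,
                   penalty_per_swap=-10):
    # Illegal actions are penalised immediately.
    if not is_legal:
        return illegal_action_penalty
    # At the end of the episode: fewer swaps -> higher (less negative) reward.
    if episode_done:
        return penalty_per_swap * n_swaps_placed
    # During the episode, no reward is given.
    return 0


print(episode_reward(is_legal=True, episode_done=False, n_swaps_placed=3))  # 0
print(episode_reward(is_legal=True, episode_done=True, n_swaps_placed=3))   # -30
```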
- class qgym.envs.routing.routing_rewarders.SwapQualityRewarder(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10, good_swap_reward=5)[source]
Bases:
BasicRewarder
Rewarder for the
Routing
environment which takes swap qualities into account. The
SwapQualityRewarder
has an adjusted reward w.r.t. the
BasicRewarder
in the sense that good SWAPs give lower penalties and bad SWAPs give higher penalties.
- __init__(illegal_action_penalty=-50, penalty_per_swap=-10, reward_per_surpass=10, good_swap_reward=5)[source]
Set the rewards and penalties and a flag.
- Parameters:
illegal_action_penalty (float) – Penalty for performing an illegal action. An action is illegal when the action means ‘surpass’ even though the next gate cannot be surpassed. This value should be negative (though this is not enforced) and defaults to -50.
penalty_per_swap (float) – Penalty for placing a swap. In general, we want as few swaps as possible. Therefore, this value should be negative and defaults to -10.
reward_per_surpass (float) – Reward given for surpassing a gate. In general, we want to reach the end of the circuit as quickly as possible. Therefore, this value should be positive and defaults to 10.
good_swap_reward (float) – Reward given for placing a good swap. In general, we want to place as few swaps as possible. However, when a swap is good, the penalty for its placement should be suppressed, which is what this reward does. The value should therefore be positive and smaller than the absolute value of penalty_per_swap, so that swaps never yield a positive reward. Defaults to 5.
- compute_reward(*, old_state, action, new_state)[source]
Compute a reward, based on the old state, the given action and the new state.
Specifically, the change in observation reach is used.
- Parameters:
old_state (RoutingState) – RoutingState before the current action.
action (int) – Action that has just been taken.
new_state (RoutingState) – RoutingState after the current action.
- Return type:
float
- Returns:
The reward for this action. If the action is illegal, the reward is the illegal_action_penalty. If the action is a legal surpass, the reward is just reward_per_surpass. For a legal swap, however, the reward is adjusted with respect to the BasicRewarder: the penalty of a swap is reduced if it increases the observation_reach, and increased if the observation_reach decreases.
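This adjustment can be sketched as follows. Here reach_change is a hypothetical normalised change in observation reach in [-1, 1] (positive for a good swap, negative for a bad one), and swap_quality_reward is a hypothetical name; the real implementation computes this from the old and new RoutingState:

```python
def swap_quality_reward(is_legal, is_swap, reach_change,
                        illegal_action_penalty=-50,
                        penalty_per_swap=-10,
                        reward_per_surpass=10,
                        good_swap_reward=5):
    # Illegal actions and legal surpasses behave exactly as in BasicRewarder.
    if not is_legal:
        return illegal_action_penalty
    if not is_swap:
        return reward_per_surpass
    # For a legal swap, shift the base penalty by the swap quality:
    # a good swap (reach_change > 0) reduces the penalty, a bad swap
    # (reach_change < 0) increases it. Since good_swap_reward is smaller
    # than |penalty_per_swap|, the result never becomes positive.
    return penalty_per_swap + good_swap_reward * reach_change


print(swap_quality_reward(is_legal=True, is_swap=True, reach_change=1.0))   # -5.0
print(swap_quality_reward(is_legal=True, is_swap=True, reach_change=-1.0))  # -15.0
```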