State of repo for ISOLA paper

Julian Schönberger
2024-10-25 17:24:11 +02:00
parent 95749d8238
commit e37b23c20c
120 changed files with 1487 additions and 6439 deletions


@@ -1,23 +0,0 @@
stages: # List of stages for jobs, and their order of execution
- build
build-job: # This job runs in the build stage, which runs first.
stage: build
rules:
- if: $CI_COMMIT_REF_NAME == "pypi" # a commit pushed to this branch triggers this job
variables:
TWINE_USERNAME: $USER_NAME
TWINE_PASSWORD: $API_KEY
TWINE_REPOSITORY: rl-factory-grid
image: python:slim
script:
- echo "Compiling the code..."
- pip install -U twine
- python setup.py sdist bdist_wheel
- twine check dist/*
# try uploading to the test platform before the official one
- twine upload --repository-url https://upload.pypi.org/legacy/ dist/*
- echo "Upload complete."


@@ -1,19 +0,0 @@
# Required
version: 2
# Set the OS, Python version and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.12"
# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
install:
- requirements: docs/requirements.txt
# Build documentation in the "docs/" directory with Sphinx
sphinx:
configuration: docs/source/conf.py

LICENSE (new file, 21 lines)

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2024 TRAIL lab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@@ -1,5 +1,7 @@
# About EDYS
by Steffen Illium, Joel Friedrich, Julian Schönberger, Robert Müller, Fabian Ritz
## Tackling emergent dysfunctions (EDYs) in cooperation with Fraunhofer-IKS.
Collaborating with Fraunhofer-IKS, this project is dedicated to investigating Emergent Dysfunctions (EDYs) within
@@ -46,42 +48,32 @@ systems.
- This allows for processes such as retraining on an already initialized policy and fine-tuning to enhance the
agent's performance based on the enriched information.
## Setup
Install this environment using `pip install marl-factory-grid`. For more information refer
to ['installation'](docs/source/installation.rst).
Refer to [quickstart](_quickstart) for specific scenarios.
## Usage
The majority of environment objects, including entities, rules, and assets, can be loaded automatically.
Simply specify the requirements of your environment in a [
*yaml*-config file](marl_factory_grid/configs/default_config.yaml).
If you only plan on using the environment without making any modifications, use ``quickstart_use``.
This creates a default config-file and another one that lists all possible options of the environment.
Also, it generates an initial script where an agent is executed in the specified environment.
For further details on utilizing the environment, refer to ['usage'](docs/source/usage.rst).
Two example scripts that show how to execute different agents in varying configurations of the environment can be
found in ```env_examples```.
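As a rough orientation, a run script could look like the following sketch. The `Factory` class name, its import path, and the gym-style `reset`/`step` signature are assumptions made here for illustration; the scripts in ```env_examples``` and ['usage'](docs/source/usage.rst) show the actual API.

```python
from pathlib import Path

# Assumed entry point: an environment class that is configured entirely through a yaml file.
from marl_factory_grid.environment.factory import Factory

if __name__ == '__main__':
    config_path = Path('marl_factory_grid/configs/default_config.yaml')
    factory = Factory(config_path)

    n_agents = 1  # assumption: must match the number of agents declared in the config

    for episode in range(3):
        _ = factory.reset()
        done = False
        while not done:
            # Dummy no-op actions; replace them with your trained or scripted agents.
            actions = [0] * n_agents
            _, reward, done, info = factory.step(actions)  # assumed gym-style return values
        print(f'Episode {episode} finished')
```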
Existing modules include a variety of functionalities within the environment:
- [Agents](marl_factory_grid/algorithms) implement either static strategies or learning algorithms based on the specific
configuration.
- Their action set includes opening [door entities](marl_factory_grid/modules/doors/entitites.py), collecting [coins](marl_factory_grid/modules/coins/coin cleaning
- Their action set includes opening [door entities](marl_factory_grid/modules/doors/entitites.py), collecting [coins](marl_factory_grid/modules/coins/entitites.py), cleaning
[dirt](marl_factory_grid/modules/clean_up/entitites.py), picking
up [items](marl_factory_grid/modules/items/entitites.py) and
delivering them to designated drop-off locations.
- Agents are equipped with a [battery](marl_factory_grid/modules/batteries/entitites.py) that gradually depletes over
- Agents can be equipped with a [battery](marl_factory_grid/modules/batteries/entitites.py) that gradually depletes over
time if not charged at a chargepod.
- The [maintainer](marl_factory_grid/modules/maintenance/entities.py) aims to
repair [machines](marl_factory_grid/modules/machines/entitites.py) that lose health over time.
## Customization
If you plan on modifying the environment by for example adding entities or rules, use ``quickstart_modify``.
This creates a template module and a script that runs an agent, incorporating the generated module.
More information on how to modify the levels, entities, groups, rules and assets can be found
in [modifications](docs/source/modifications.rst).
You can modify the environment in various ways, for example by adding levels, entities or rules.
### Levels
@@ -96,7 +88,7 @@ General:
... or create your own, maybe with the help of [asciiflow.com](https://asciiflow.com/#/).
Make sure to use `#` as [Walls](marl_factory_grid/environment/entity/wall.py), `-` as free (walkable) floor, `D`
for [Doors](./modules/doors/entities.py).
for [Doors](marl_factory_grid/modules/doors/entitites.py).
Other Entities (define your own) may bring their own `Symbols`.
### Entities
@@ -104,19 +96,26 @@ Other Entites (define you own) may bring their own `Symbols`
Entities are [Objects](marl_factory_grid/environment/entity/object.py) that can additionally be assigned a position.
Abstract Entities are provided.
If you wish to introduce new entities to the environment, just create a new module that implements the entity class.
If necessary, provide additional classes such as custom actions or rewards and load the entity into the environment
using the config file.
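A minimal sketch of such a module is shown below; the module path is hypothetical, and the base-class name and constructor signature are assumptions, so compare with the template module and the existing modules before using it.

```python
# Hypothetical module, e.g. marl_factory_grid/modules/plants/entities.py
from marl_factory_grid.environment.entity.entity import Entity  # assumed base-class name


class Plant(Entity):
    """Illustrative entity that occupies a position and slowly wilts over time."""

    def __init__(self, *args, initial_health: int = 10, **kwargs):
        super().__init__(*args, **kwargs)  # position handling is left to the base class
        self.health = initial_health

    def wilt(self) -> bool:
        """Reduce health by one; returns True while the plant is still alive."""
        self.health -= 1
        return self.health > 0
```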
### Groups
[Groups](marl_factory_grid/environment/groups/objects.py) are entity Sets that provide administrative access to all
group members.
All [Entites](marl_factory_grid/environment/entity/global_entities.py) are available at runtime as EnvState property.
All [Entities](marl_factory_grid/environment/entity/entity.py) are available at runtime as EnvState property.
### Rules
[Rules](marl_factory_grid/environment/entity/object.py) define how the environment behaves on microscale.
[Rules](marl_factory_grid/environment/rules.py) define how the environment behaves on microscale.
Each of the hooks (`on_init`, `pre_step`, `on_step`, `post_step`, `on_done`)
provides env-access to implement custom logic, calculate rewards, or gather information.
![Hooks](../../images/Hooks_FIKS.png)
If you wish to introduce new rules to the environment, make sure they implement the Rule class and override its hooks
to implement your own rule logic.
![Hooks](images/Hooks_FIKS.png)
[Results](marl_factory_grid/environment/entity/object.py) provide a way to return `rule` evaluations such as rewards and
state reports back to the environment.
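The sketch below illustrates the idea; the hook names follow the list above, while the `state` argument, the `TickResult` class, and its fields are assumptions about the exact API.

```python
# Hypothetical rule module; activate it by listing the rule in the yaml config.
from marl_factory_grid.environment.rules import Rule
from marl_factory_grid.utils.results import TickResult  # assumed location and name of the result class


class SmallStepPenalty(Rule):
    """Illustrative rule: charge every agent a small penalty on each step."""

    def __init__(self, penalty: float = -0.01):
        super().__init__()
        self.penalty = penalty

    def on_step(self, state):
        # `state` is assumed to expose all entity groups; one result is returned per agent.
        return [TickResult(identifier=self.name, validity=True, reward=self.penalty, entity=agent)
                for agent in state['Agent']]
```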

README-EMAS.md (new file, 65 lines)

@@ -0,0 +1,65 @@
# Emergence in Multi-Agent Systems: A Safety Perspective
by Philipp Altmann, Julian Schönberger, Steffen Illium, Maximilian Zorn, Fabian Ritz, Tom Haider, Simon Burton, Thomas Gabor
## About
This is the code for the experiments of our paper. The experiments are built on top of the ```EDYS environment```,
which we developed specifically for studying emergent behaviour in multi-agent systems. This environment is versatile
and can be configured in various ways with different degrees of complexity. We refer to [README-EDYS.md](README-EDYS.md) for a
detailed overview of the functionalities of the environment and an explanation of the project context.
## Setup
1. Set up a virtualenv with python 3.10 or higher. You can use pyvenv or conda for this.
2. Run ```pip install -r requirements.txt``` to install the requirements.
3. In case there is no ```study_out/``` folder in the root directory, create one.
## Rerunning the Experiments
The experiments from our paper can be rerun via [main.py](main.py).
Just select the method representing the part of our experiments you want to rerun and
execute it via the ```__main__``` function.
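For illustration, a stripped-down ```__main__``` block could look like the sketch below; the import path follows the remarks further down, while the exact call signatures of the evaluation methods are an assumption, so compare with the actual code in [main.py](main.py).

```python
# Sketch only: the concrete method names and signatures in main.py / RL_runner.py may differ.
from marl_factory_grid.algorithms.marl.RL_runner import (
    coin_quadrant_multi_agent_rl_eval,
    two_rooms_multi_agent_rl_eval,
)

if __name__ == '__main__':
    # Pick exactly the part of the experiments you want to reproduce and call it here.
    coin_quadrant_multi_agent_rl_eval()
    # two_rooms_multi_agent_rl_eval()
```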
## Further Remarks
1. We use config files located in the [configs](marl_factory_grid/configs) and the
[multi_agent_configs](marl_factory_grid/algorithms/marl/multi_agent_configs),
[single_agent_configs](marl_factory_grid/algorithms/marl/single_agent_configs) folders to configure the environments and the RL
algorithm for our experiments, respectively. You don't need to change anything to rerun the
experiments, but we provided some additional comments in the configs for an overall better
understanding of the functionalities.
2. The results of the experiment runs are stored in [study_out](study_out).
3. We reuse the ```coin-quadrant``` implementation of the RL agent for the ```two_rooms``` environment. The coin assets
are masked with flags in the visualization. This masking does not affect the RL agents in any way.
4. The code for the cost contortion for preventing the emergent behavior of the TSP agents can
be found in [contortions.py](marl_factory_grid/algorithms/static/contortions.py).
5. The functionality that drives the emergence prevention mechanisms for the RL agents is mainly
located in the utility methods ```get_ordered_coin_piles (line 94)``` (for solving the emergence in the
coin-quadrant environment) and ```distribute_indices (line 171)``` (mechanism for two_rooms), which are part of
[utils.py](marl_factory_grid/algorithms/marl/utils.py).
6. [agent_models](marl_factory_grid/algorithms/agent_models) contains the parameters of the trained models for the RL
agents. You can repeat the training by executing the training procedures in [main.py](main.py). Alternatively, you can
use your own trained agents, obtained by modifying the training configurations in [single_agent_configs](marl_factory_grid/algorithms/marl/single_agent_configs),
for the evaluation experiments by inserting the names of the run folders, e.g. “run9” and “run12”, into the list in
the methods ```coin_quadrant_multi_agent_rl_eval``` and ```two_rooms_multi_agent_rl_eval``` in [RL_runner.py](marl_factory_grid/algorithms/marl/RL_runner.py).
## Requirements
Python 3.10
```
numpy==1.26.4
pygame>=2.0
numba>=0.56
gymnasium>=0.26
seaborn
pandas
PyYAML
networkx
torch
tqdm
packaging
pillow
scipy
```


@@ -1,75 +0,0 @@
# About EDYS
## Tackling emergent dysfunctions (EDYs) in cooperation with Fraunhofer-IKS.
Collaborating with Fraunhofer-IKS, this project is dedicated to investigating Emergent Dysfunctions (EDYs) within
multi-agent environments. In multi-agent reinforcement learning (MARL), a population of agents learns by interacting
with each other in a shared environment and adapt their behavior based on the feedback they receive from the environment
and the actions of other agents.
In this context, emergent behavior describes spontaneous behaviors resulting from interactions among agents and
environmental stimuli, rather than explicit programming. This promotes natural, adaptable behavior, increases system
unpredictability for dynamic learning, enables diverse strategies, and encourages collective intelligence for complex
problem-solving. However, the complex dynamics of the environment also give rise to emerging dysfunctions—unexpected
issues from agent interactions. This research aims to enhance our understanding of EDYs and their impact on multi-agent
systems.
### Project Objectives:
- Create an environment that provokes emerging dysfunctions.
- This is achieved by creating a high level of background noise in the domain, where various entities perform
diverse tasks, resulting in a deliberately chaotic dynamic.
- The goal is to observe and analyze naturally occurring emergent dysfunctions within the complexity generated in
this dynamic environment.
- Observational Framework:
- The project introduces an environment that is designed to capture dysfunctions as they naturally occur.
- The environment allows for continuous monitoring of agent behaviors, actions, and interactions.
- Tracking emergent dysfunctions in real-time provides valuable data for analysis and understanding.
- Compatibility
- The Framework allows learning entities from different manufacturers and projects with varying representations
of actions and observations to interact seamlessly within the environment.
## Setup
Install this environment using `pip install marl-factory-grid`. For more information refer
to ['installation'](docs/source/installation.rst).
## Usage
The environment is configured to automatically load necessary objects, including entities, rules, and assets, based on your requirements.
You can utilize existing configurations to replicate the experiments from [this paper](PAPER).
- Preconfigured Studies:
The studies folder contains predefined studies that can be used to replicate the experiments.
These studies provide a structured way to validate and analyze the outcomes observed in different scenarios.
- Creating your own scenarios:
If you want to use the environment with custom entities, rules or levels refer to the [complete repository]().
Existing modules include a variety of functionalities within the environment:
- [Agents](marl_factory_grid/algorithms) implement either static strategies or learning algorithms based on the specific
configuration.
- Their action set includes opening [door entities](marl_factory_grid/modules/doors/entitites.py), collecting [coins](marl_factory_grid/modules/coins/entitites.py) cleaning
[dirt](marl_factory_grid/modules/clean_up/entitites.py), picking
up [items](marl_factory_grid/modules/items/entitites.py) and
delivering them to designated drop-off locations.
- Agents are equipped with a [battery](marl_factory_grid/modules/batteries/entitites.py) that gradually depletes over
time if not charged at a chargepod.
## Limitations
The provided code and documentation are tailored for replicating and validating experiments as described in the paper.
Modifications to the environment, such as adding new entities, creating additional rules, or customizing behavior beyond the provided scope are not supported in this release.
If you are interested in accessing the complete project, including features not covered in this release, refer to the [full repository](LINK FULL REPO).
For further details on running the experiments, please consult the relevant documentation provided in the studies' folder.


@@ -1,112 +0,0 @@
---
General:
level_name: large
env_seed: 69
verbose: !!bool False
pomdp_r: 3
individual_rewards: !!bool True
Entities:
Defaults: {}
DirtPiles:
initial_dirt_ratio: 0.01 # On init, at most this share of tiles spawns dirt.
dirt_spawn_r_var: 0.5 # How much does the dirt spawn amount vary?
initial_amount: 1
max_local_amount: 3 # Max dirt amount per tile.
max_global_amount: 30 # Max dirt amount in the whole environment.
Doors:
closed_on_init: True
auto_close_interval: 10
indicate_area: False
Batteries: {}
ChargePods: {}
Destinations: {}
ReachedDestinations: {}
Items: {}
Inventories: {}
DropOffLocations: {}
Agents:
Wolfgang:
Actions:
- Noop
- Noop
- Noop
- CleanUp
Observations:
- Self
- Placeholder
- Walls
- DirtPiles
- Placeholder
- Doors
- Doors
Bjoern:
Actions:
# Move4, Noop
- Move8
- DoorUse
- ItemAction
Observations:
- Defaults
- Combined:
- Other
- Walls
- Items
- Inventory
Karl-Heinz:
Actions:
- Move8
- DoorUse
Observations:
# Wall, Only Other Agents
- Defaults
- Combined:
- Other
- Self
- Walls
- Doors
- Destinations
Manfred:
Actions:
- Move8
- ItemAction
- DoorUse
- CleanUp
- DestAction
- BtryCharge
Observations:
- Defaults
- Battery
- Destinations
- DirtPiles
- Doors
- Items
- Inventory
- DropOffLocations
Rules:
Defaults: {}
Collision:
done_at_collisions: !!bool False
DirtRespawnRule:
spawn_freq: 15
DirtSmearOnMove:
smear_amount: 0.12
DoorAutoClose: {}
DirtAllCleanDone: {}
Btry: {}
BtryDoneAtDischarge: {}
DestinationReach: {}
DestinationSpawn: {}
DestinationDone: {}
ItemRules: {}
Assets:
- Defaults
- Dirt
- Door
- Machine
- Item
- Destination
- DropOffLocation
- Chargepod


@@ -1,189 +0,0 @@
import sys
from pathlib import Path
##############################################
# keep this for stand alone script execution #
##############################################
from environments.factory.base.base_factory import BaseFactory
from environments.logging.recorder import EnvRecorder
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
##############################################
##############################################
##############################################
import simplejson
from environments import helpers as h
from environments.factory.additional.combined_factories import DestBatteryFactory
from environments.factory.additional.dest.factory_dest import DestFactory
from environments.factory.additional.dirt.factory_dirt import DirtFactory
from environments.factory.additional.item.factory_item import ItemFactory
from environments.helpers import ObservationTranslator, ActionTranslator
from environments.logging.envmonitor import EnvMonitor
from environments.utility_classes import ObservationProperties, AgentRenderOptions, MovementProperties
def policy_model_kwargs():
return dict(ent_coef=0.01)
def dqn_model_kwargs():
return dict(buffer_size=50000,
learning_starts=64,
batch_size=64,
target_update_interval=5000,
exploration_fraction=0.25,
exploration_final_eps=0.025
)
def encapsule_env_factory(env_fctry, env_kwrgs):
def _init():
with env_fctry(**env_kwrgs) as init_env:
return init_env
return _init
if __name__ == '__main__':
render = False
# Define Global Env Parameters
# Define properties object parameters
factory_kwargs = dict(
max_steps=400, parse_doors=True,
level_name='rooms',
doors_have_area=True, verbose=False,
mv_prop=MovementProperties(allow_diagonal_movement=True,
allow_square_movement=True,
allow_no_op=False),
obs_prop=ObservationProperties(
frames_to_stack=3,
cast_shadows=True,
omit_agent_self=True,
render_agents=AgentRenderOptions.LEVEL,
additional_agent_placeholder=None,
)
)
# Bundle both environments with global kwargs and parameters
# Todo: find a better solution, like auto module loading
env_map = {'DirtFactory': DirtFactory,
'ItemFactory': ItemFactory,
'DestFactory': DestFactory,
'DestBatteryFactory': DestBatteryFactory
}
env_names = list(env_map.keys())
# Put all your multi-seed agents in a single folder; we do not need specific names etc.
available_models = dict()
available_envs = dict()
available_runs_kwargs = dict()
available_runs_agents = dict()
max_seed = 0
# Define this folder
combinations_path = Path('combinations')
# Those are all differently trained combinations of models, environments and parameters
for combination in (x for x in combinations_path.iterdir() if x.is_dir()):
# These are all the models for this specific combination
for model_run in (x for x in combination.iterdir() if x.is_dir()):
model_name, env_name = model_run.name.split('_')[:2]
if model_name not in available_models:
available_models[model_name] = h.MODEL_MAP[model_name]
if env_name not in available_envs:
available_envs[env_name] = env_map[env_name]
# Those are all available seeds
for seed_run in (x for x in model_run.iterdir() if x.is_dir()):
max_seed = max(int(seed_run.name.split('_')[0]), max_seed)
# Read the environment configuration from disk
with next(seed_run.glob('env_params.json')).open('r') as f:
env_kwargs = simplejson.load(f)
available_runs_kwargs[seed_run.name] = env_kwargs
# Read the trained model_path from disk
model_path = next(seed_run.glob('model.zip'))
available_runs_agents[seed_run.name] = model_path
# We start by combining all SAME MODEL CLASSES per available Seed, across ALL available ENVIRONMENTS.
for model_name, model_cls in available_models.items():
for seed in range(max_seed):
combined_env_kwargs = dict()
model_paths = list()
comparable_runs = {key: val for key, val in available_runs_kwargs.items() if (
key.startswith(str(seed)) and model_name in key and key != 'key')
}
for name, run_kwargs in comparable_runs.items():
# Select trained agent as a candidate:
model_paths.append(available_runs_agents[name])
# Sort Env Kwargs:
for key, val in run_kwargs.items():
if key not in combined_env_kwargs:
combined_env_kwargs[key] = val
else:
assert combined_env_kwargs[key] == val, "Check the combinations you try to make!"
# Update and combine all kwargs to account for multiple agent etc.
# We cannot capture all configuration cases!
for key, val in factory_kwargs.items():
if key not in combined_env_kwargs:
combined_env_kwargs[key] = val
else:
assert combined_env_kwargs[key] == val
combined_env_kwargs.update(n_agents=len(comparable_runs))
with type("CombinedEnv", tuple(available_envs.values()), {})(**combined_env_kwargs) as combEnv:
# EnvMonitor Init
comb = f'comb_{model_name}_{seed}'
comb_monitor_path = combinations_path / comb / f'{comb}_monitor.pick'
comb_recorder_path = combinations_path / comb / f'{comb}_recorder.json'
comb_monitor_path.parent.mkdir(parents=True, exist_ok=True)
monitoredCombEnv = EnvMonitor(combEnv, filepath=comb_monitor_path)
monitoredCombEnv = EnvRecorder(monitoredCombEnv, filepath=comb_recorder_path, freq=1)
# Evaluation starts here #####################################################
# Load all models
loaded_models = [available_models[model_name].load(model_path) for model_path in model_paths]
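# The translators map each loaded agent's named observation/action space onto the combined
# environment's spaces, so models trained separately (and possibly in different environments)
# can act together in this single combined run.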
obs_translators = ObservationTranslator(
monitoredCombEnv.named_observation_space,
*[agent.named_observation_space for agent in loaded_models],
placeholder_fill_value='n')
act_translators = ActionTranslator(
monitoredCombEnv.named_action_space,
*(agent.named_action_space for agent in loaded_models)
)
for episode in range(1):
obs = monitoredCombEnv.reset()
if render: monitoredCombEnv.render()
rew, done_bool = 0, False
while not done_bool:
actions = []
for i, model in enumerate(loaded_models):
pred = model.predict(obs_translators.translate_observation(i, obs[i]))[0]
actions.append(act_translators.translate_action(i, pred))
obs, step_r, done_bool, info_obj = monitoredCombEnv.step(actions)
rew += step_r
if render: monitoredCombEnv.render()
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
# Eval monitor outputs are automatically stored by the monitor object
# TODO: Plotting
monitoredCombEnv.save_records()
monitoredCombEnv.save_run()
pass


@@ -1,203 +0,0 @@
import sys
import time
from pathlib import Path
import simplejson
import stable_baselines3 as sb3
# This is needed, when you put this file in a subfolder.
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
from environments import helpers as h
from environments.factory.additional.dest.dest_util import DestModeOptions, DestProperties
from environments.factory.additional.btry.btry_util import BatteryProperties
from environments.logging.envmonitor import EnvMonitor
from environments.logging.recorder import EnvRecorder
from environments.factory.additional.combined_factories import DestBatteryFactory
from environments.utility_classes import MovementProperties, ObservationProperties, AgentRenderOptions
from plotting.compare_runs import compare_seed_runs
"""
Welcome to this quick start file. Here we will see how to:
0. Setup I/O Paths
1. Setup parameters for the environments (dirt-factory).
2. Setup parameters for the agent training (SB3: PPO) and save metrics.
Run the training.
3. Save environment and agent for later analysis.
4. Load the agent from drive
5. Rendering the environment with a run of the trained agent.
6. Plot metrics
"""
if __name__ == '__main__':
#########################################################
# 0. Setup I/O Paths
# Define some general parameters
train_steps = 1e6
n_seeds = 3
model_class = sb3.PPO
env_class = DestBatteryFactory
env_params_json = 'env_params.json'
# Define a global study save path
start_time = int(time.time())
study_root_path = Path(__file__).parent.parent / 'study_out' / f'{Path(__file__).stem}_{start_time}'
# Create an _identifier, which is unique for every combination and easy to read in filesystem
identifier = f'{model_class.__name__}_{env_class.__name__}_{start_time}'
exp_path = study_root_path / identifier
#########################################################
# 1. Setup parameters for the environments (dirt-factory).
# Define property object parameters.
# 'ObservationProperties' are for specifying how the agent sees the environment.
obs_props = ObservationProperties(render_agents=AgentRenderOptions.NOT, # Agents won`t be shown in the obs at all
omit_agent_self=True, # This is default
additional_agent_placeholder=None, # We will not take care of future agent
frames_to_stack=3, # To give the agent a notion of time
pomdp_r=2 # the agent view-radius
)
# 'MovementProperties' are for specifying how the agent is allowed to move in the environment.
move_props = MovementProperties(allow_diagonal_movement=True, # Euclidean style (vertices)
allow_square_movement=True, # Manhattan (edges)
allow_no_op=False) # Pause movement (do nothing)
# 'DirtProperties' control if and how dirt is spawned
# TODO: Comments
dest_props = DestProperties(
n_dests = 2, # How many destinations are there
dwell_time = 0, # How long does the agent need to "wait" on a destination
spawn_frequency = 0,
spawn_in_other_zone = True, #
spawn_mode = DestModeOptions.DONE,
)
btry_props = BatteryProperties(
initial_charge = 0.9, #
charge_rate = 0.4, #
charge_locations = 3, #
per_action_costs = 0.01,
done_when_discharged = True,
multi_charge = False,
)
# These are the EnvKwargs for initializing the environment class, holding all former parameter-classes
# TODO: Comments
factory_kwargs = dict(n_agents=1,
max_steps=400,
parse_doors=True,
level_name='rooms',
doors_have_area=True, #
verbose=False,
mv_prop=move_props, # See Above
obs_prop=obs_props, # See Above
done_at_collision=True,
dest_prop=dest_props,
btry_prop=btry_props
)
#########################################################
# 2. Setup parameters for the agent training (SB3: PPO) and save metrics.
agent_kwargs = dict()
#########################################################
# Run the Training
for seed in range(n_seeds):
# Make a copy if you want to alter things in the training loop; like the seed.
env_kwargs = factory_kwargs.copy()
env_kwargs.update(env_seed=seed)
# Output folder
seed_path = exp_path / f'{str(seed)}_{identifier}'
seed_path.mkdir(parents=True, exist_ok=True)
# Parameter Storage
param_path = seed_path / env_params_json
# Observation (measures) Storage
monitor_path = seed_path / 'monitor.pick'
recorder_path = seed_path / 'recorder.json'
# Model save Path for the trained model
model_save_path = seed_path / f'model.zip'
# Env Init & Model kwargs definition
with env_class(**env_kwargs) as env_factory:
# EnvMonitor Init
env_monitor_callback = EnvMonitor(env_factory)
# EnvRecorder Init
env_recorder_callback = EnvRecorder(env_factory, freq=int(train_steps / 400 / 10))
# Model Init
model = model_class("MlpPolicy", env_factory, verbose=1, seed=seed, device='cpu')
# Model train
model.learn(total_timesteps=int(train_steps), callback=[env_monitor_callback, env_recorder_callback])
#########################################################
# 3. Save environment and agent for later analysis.
# Save the trained Model, the monitor (environment measures) and the environment parameters
model.named_observation_space = env_factory.named_observation_space
model.named_action_space = env_factory.named_action_space
model.save(model_save_path)
env_factory.save_params(param_path)
env_monitor_callback.save_run(monitor_path)
env_recorder_callback.save_records(recorder_path, save_occupation_map=False)
# Compare performance runs, for each seed within a model
try:
compare_seed_runs(exp_path, use_tex=False)
except ValueError:
pass
# Train ends here ############################################################
# Evaluation starts here #####################################################
# First Iterate over every model and monitor "as trained"
print('Start Measurement Tracking')
# For trained policy in study_root_path / _identifier
for policy_path in [x for x in exp_path.iterdir() if x.is_dir()]:
# retrieve model class
model_cls = next(val for key, val in h.MODEL_MAP.items() if key in policy_path.parent.name)
# Load the agent
model = model_cls.load(policy_path / 'model.zip', device='cpu')
# Load old environment kwargs
with next(policy_path.glob(env_params_json)).open('r') as f:
env_kwargs = simplejson.load(f)
# Make the environment stop at collisions
# (you only want to have a single collision per episode for the statistics)
env_kwargs.update(done_at_collision=True)
# Init Env
with env_class(**env_kwargs) as env_factory:
monitored_env_factory = EnvMonitor(env_factory)
# Evaluation Loop for i in range(n Episodes)
for episode in range(100):
# noinspection PyRedeclaration
env_state = monitored_env_factory.reset()
rew, done_bool = 0, False
while not done_bool:
action = model.predict(env_state, deterministic=True)[0]
env_state, step_r, done_bool, info_obj = monitored_env_factory.step(action)
rew += step_r
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
monitored_env_factory.save_run(filepath=policy_path / 'eval_run_monitor.pick')
print('Measurements Done')


@@ -1,193 +0,0 @@
import sys
import time
from pathlib import Path
import simplejson
import stable_baselines3 as sb3
# This is needed, when you put this file in a subfolder.
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
from environments import helpers as h
from environments.factory.additional.dest.dest_util import DestModeOptions, DestProperties
from environments.logging.envmonitor import EnvMonitor
from environments.logging.recorder import EnvRecorder
from environments.factory.additional.dest.factory_dest import DestFactory
from environments.utility_classes import MovementProperties, ObservationProperties, AgentRenderOptions
from plotting.compare_runs import compare_seed_runs
"""
Welcome to this quick start file. Here we will see how to:
0. Setup I/O Paths
1. Setup parameters for the environments (dest-factory).
2. Setup parameters for the agent training (SB3: PPO) and save metrics.
Run the training.
3. Save environment and agent for later analysis.
4. Load the agent from drive
5. Rendering the environment with a run of the trained agent.
6. Plot metrics
"""
if __name__ == '__main__':
#########################################################
# 0. Setup I/O Paths
# Define some general parameters
train_steps = 1e6
n_seeds = 3
model_class = sb3.PPO
env_class = DestFactory
env_params_json = 'env_params.json'
# Define a global study save path
start_time = int(time.time())
study_root_path = Path(__file__).parent.parent / 'study_out' / f'{Path(__file__).stem}_{start_time}'
# Create an _identifier, which is unique for every combination and easy to read in filesystem
identifier = f'{model_class.__name__}_{env_class.__name__}_{start_time}'
exp_path = study_root_path / identifier
#########################################################
# 1. Setup parameters for the environments (dest-factory).
# Define property object parameters.
# 'ObservationProperties' are for specifying how the agent sees the environment.
obs_props = ObservationProperties(render_agents=AgentRenderOptions.NOT, # Agents won`t be shown in the obs at all
omit_agent_self=True, # This is default
additional_agent_placeholder=None, # We will not take care of future agent
frames_to_stack=3, # To give the agent a notion of time
pomdp_r=2 # the agent view-radius
)
# 'MovementProperties' are for specifying how the agent is allowed to move in the environment.
move_props = MovementProperties(allow_diagonal_movement=True, # Euclidean style (vertices)
allow_square_movement=True, # Manhattan (edges)
allow_no_op=False) # Pause movement (do nothing)
# 'DestProperties' control if and how dest is spawned
# TODO: Comments
dest_props = DestProperties(
n_dests = 2, # How many destinations are there
dwell_time = 0, # How long does the agent need to "wait" on a destination
spawn_frequency = 0,
spawn_in_other_zone = True, #
spawn_mode = DestModeOptions.DONE,
)
# These are the EnvKwargs for initializing the environment class, holding all former parameter-classes
# TODO: Comments
factory_kwargs = dict(n_agents=1,
max_steps=400,
parse_doors=True,
level_name='rooms',
doors_have_area=True, #
verbose=False,
mv_prop=move_props, # See Above
obs_prop=obs_props, # See Above
done_at_collision=True,
dest_prop=dest_props
)
#########################################################
# 2. Setup parameters for the agent training (SB3: PPO) and save metrics.
agent_kwargs = dict()
#########################################################
# Run the Training
for seed in range(n_seeds):
# Make a copy if you want to alter things in the training loop; like the seed.
env_kwargs = factory_kwargs.copy()
env_kwargs.update(env_seed=seed)
# Output folder
seed_path = exp_path / f'{str(seed)}_{identifier}'
seed_path.mkdir(parents=True, exist_ok=True)
# Parameter Storage
param_path = seed_path / env_params_json
# Observation (measures) Storage
monitor_path = seed_path / 'monitor.pick'
recorder_path = seed_path / 'recorder.json'
# Model save Path for the trained model
model_save_path = seed_path / f'model.zip'
# Env Init & Model kwargs definition
with env_class(**env_kwargs) as env_factory:
# EnvMonitor Init
env_monitor_callback = EnvMonitor(env_factory)
# EnvRecorder Init
env_recorder_callback = EnvRecorder(env_factory, freq=int(train_steps / 400 / 10))
# Model Init
model = model_class("MlpPolicy", env_factory,verbose=1, seed=seed, device='cpu')
# Model train
model.learn(total_timesteps=int(train_steps), callback=[env_monitor_callback, env_recorder_callback])
#########################################################
# 3. Save environment and agent for later analysis.
# Save the trained Model, the monitor (environment measures) and the environment parameters
model.named_observation_space = env_factory.named_observation_space
model.named_action_space = env_factory.named_action_space
model.save(model_save_path)
env_factory.save_params(param_path)
env_monitor_callback.save_run(monitor_path)
env_recorder_callback.save_records(recorder_path, save_occupation_map=False)
# Compare performance runs, for each seed within a model
try:
compare_seed_runs(exp_path, use_tex=False)
except ValueError:
pass
# Train ends here ############################################################
# Evaluation starts here #####################################################
# First Iterate over every model and monitor "as trained"
print('Start Measurement Tracking')
# For trained policy in study_root_path / _identifier
for policy_path in [x for x in exp_path.iterdir() if x.is_dir()]:
# retrieve model class
model_cls = next(val for key, val in h.MODEL_MAP.items() if key in policy_path.parent.name)
# Load the agent
model = model_cls.load(policy_path / 'model.zip', device='cpu')
# Load old environment kwargs
with next(policy_path.glob(env_params_json)).open('r') as f:
env_kwargs = simplejson.load(f)
# Make the environment stop at collisions
# (you only want to have a single collision per episode for the statistics)
env_kwargs.update(done_at_collision=True)
# Init Env
with env_class(**env_kwargs) as env_factory:
monitored_env_factory = EnvMonitor(env_factory)
# Evaluation Loop for i in range(n Episodes)
for episode in range(100):
# noinspection PyRedeclaration
env_state = monitored_env_factory.reset()
rew, done_bool = 0, False
while not done_bool:
action = model.predict(env_state, deterministic=True)[0]
env_state, step_r, done_bool, info_obj = monitored_env_factory.step(action)
rew += step_r
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
monitored_env_factory.save_run(filepath=policy_path / 'eval_run_monitor.pick')
print('Measurements Done')


@@ -1,195 +0,0 @@
import sys
import time
from pathlib import Path
import simplejson
import stable_baselines3 as sb3
# This is needed, when you put this file in a subfolder.
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
from environments import helpers as h
from environments.logging.envmonitor import EnvMonitor
from environments.logging.recorder import EnvRecorder
from environments.factory.additional.dirt.dirt_util import DirtProperties
from environments.factory.additional.dirt.factory_dirt import DirtFactory
from environments.utility_classes import MovementProperties, ObservationProperties, AgentRenderOptions
from plotting.compare_runs import compare_seed_runs
"""
Welcome to this quick start file. Here we will see how to:
0. Setup I/O Paths
1. Setup parameters for the environments (dirt-factory).
2. Setup parameters for the agent training (SB3: PPO) and save metrics.
Run the training.
3. Save environment and agent for later analysis.
4. Load the agent from drive
5. Rendering the environment with a run of the trained agent.
6. Plot metrics
"""
if __name__ == '__main__':
#########################################################
# 0. Setup I/O Paths
# Define some general parameters
train_steps = 1e6
n_seeds = 3
model_class = sb3.PPO
env_class = DirtFactory
env_params_json = 'env_params.json'
# Define a global study save path
start_time = int(time.time())
study_root_path = Path(__file__).parent.parent / 'study_out' / f'{Path(__file__).stem}_{start_time}'
# Create an _identifier, which is unique for every combination and easy to read in filesystem
identifier = f'{model_class.__name__}_{env_class.__name__}_{start_time}'
exp_path = study_root_path / identifier
#########################################################
# 1. Setup parameters for the environments (dirt-factory).
# Define property object parameters.
# 'ObservationProperties' are for specifying how the agent sees the environment.
obs_props = ObservationProperties(render_agents=AgentRenderOptions.NOT, # Agents won`t be shown in the obs at all
omit_agent_self=True, # This is default
additional_agent_placeholder=None, # We will not take care of future agent
frames_to_stack=3, # To give the agent a notion of time
pomdp_r=2 # the agent's view-radius
)
# 'MovementProperties' are for specifying how the agent is allowed to move in the environment.
move_props = MovementProperties(allow_diagonal_movement=True, # Euclidean style (vertices)
allow_square_movement=True, # Manhattan (edges)
allow_no_op=False) # Pause movement (do nothing)
# 'DirtProperties' control if and how dirt is spawned
# TODO: Comments
dirt_props = DirtProperties(initial_dirt_ratio=0.35,
initial_dirt_spawn_r_var=0.1,
clean_amount=0.34,
max_spawn_amount=0.1,
max_global_amount=20,
max_local_amount=1,
spawn_frequency=0,
max_spawn_ratio=0.05,
dirt_smear_amount=0.0)
# These are the EnvKwargs for initializing the environment class, holding all former parameter-classes
# TODO: Comments
factory_kwargs = dict(n_agents=1,
max_steps=400,
parse_doors=True,
level_name='rooms',
doors_have_area=True, #
verbose=False,
mv_prop=move_props, # See Above
obs_prop=obs_props, # See Above
done_at_collision=True,
dirt_prop=dirt_props
)
#########################################################
# 2. Setup parameters for the agent training (SB3: PPO) and save metrics.
agent_kwargs = dict()
#########################################################
# Run the Training
for seed in range(n_seeds):
# Make a copy if you want to alter things in the training loop; like the seed.
env_kwargs = factory_kwargs.copy()
env_kwargs.update(env_seed=seed)
# Output folder
seed_path = exp_path / f'{str(seed)}_{identifier}'
seed_path.mkdir(parents=True, exist_ok=True)
# Parameter Storage
param_path = seed_path / env_params_json
# Observation (measures) Storage
monitor_path = seed_path / 'monitor.pick'
recorder_path = seed_path / 'recorder.json'
# Model save Path for the trained model
model_save_path = seed_path / f'model.zip'
# Env Init & Model kwargs definition
with env_class(**env_kwargs) as env_factory:
# EnvMonitor Init
env_monitor_callback = EnvMonitor(env_factory)
# EnvRecorder Init
env_recorder_callback = EnvRecorder(env_factory, freq=int(train_steps / 400 / 10))
# Model Init
model = model_class("MlpPolicy", env_factory, verbose=1, seed=seed, device='cpu')
# Model train
model.learn(total_timesteps=int(train_steps), callback=[env_monitor_callback, env_recorder_callback])
#########################################################
# 3. Save environment and agent for later analysis.
# Save the trained Model, the monitor (environment measures) and the environment parameters
model.named_observation_space = env_factory.named_observation_space
model.named_action_space = env_factory.named_action_space
model.save(model_save_path)
env_factory.save_params(param_path)
env_monitor_callback.save_run(monitor_path)
env_recorder_callback.save_records(recorder_path, save_occupation_map=False)
# Compare performance runs, for each seed within a model
try:
compare_seed_runs(exp_path, use_tex=False)
except ValueError:
pass
# Train ends here ############################################################
# Evaluation starts here #####################################################
# First Iterate over every model and monitor "as trained"
print('Start Measurement Tracking')
# For trained policy in study_root_path / _identifier
for policy_path in [x for x in exp_path.iterdir() if x.is_dir()]:
# retrieve model class
model_cls = next(val for key, val in h.MODEL_MAP.items() if key in policy_path.parent.name)
# Load the agent
model = model_cls.load(policy_path / 'model.zip', device='cpu')
# Load old environment kwargs
with next(policy_path.glob(env_params_json)).open('r') as f:
env_kwargs = simplejson.load(f)
# Make the environment stop at collisions
# (you only want to have a single collision per episode for the statistics)
env_kwargs.update(done_at_collision=True)
# Init Env
with env_class(**env_kwargs) as env_factory:
monitored_env_factory = EnvMonitor(env_factory)
# Evaluation Loop for i in range(n Episodes)
for episode in range(100):
# noinspection PyRedeclaration
env_state = monitored_env_factory.reset()
rew, done_bool = 0, False
while not done_bool:
action = model.predict(env_state, deterministic=True)[0]
env_state, step_r, done_bool, info_obj = monitored_env_factory.step(action)
rew += step_r
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
monitored_env_factory.save_run(filepath=policy_path / 'eval_run_monitor.pick')
print('Measurements Done')


@@ -1,191 +0,0 @@
import sys
import time
from pathlib import Path
import simplejson
import stable_baselines3 as sb3
# This is needed, when you put this file in a subfolder.
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
from environments import helpers as h
from environments.factory.additional.item.factory_item import ItemFactory
from environments.factory.additional.item.item_util import ItemProperties
from environments.logging.envmonitor import EnvMonitor
from environments.logging.recorder import EnvRecorder
from environments.utility_classes import MovementProperties, ObservationProperties, AgentRenderOptions
from plotting.compare_runs import compare_seed_runs
"""
Welcome to this quick start file. Here we will see how to:
0. Setup I/O Paths
1. Setup parameters for the environments (item-factory).
2. Setup parameters for the agent training (SB3: PPO) and save metrics.
Run the training.
3. Save environment and agent for later analysis.
4. Load the agent from drive
5. Rendering the environment with a run of the trained agent.
6. Plot metrics
"""
if __name__ == '__main__':
#########################################################
# 0. Setup I/O Paths
# Define some general parameters
train_steps = 1e6
n_seeds = 3
model_class = sb3.PPO
env_class = ItemFactory
env_params_json = 'env_params.json'
# Define a global study save path
start_time = int(time.time())
study_root_path = Path(__file__).parent.parent / 'study_out' / f'{Path(__file__).stem}_{start_time}'
# Create an _identifier, which is unique for every combination and easy to read in filesystem
identifier = f'{model_class.__name__}_{env_class.__name__}_{start_time}'
exp_path = study_root_path / identifier
#########################################################
# 1. Setup parameters for the environments (item-factory).
#
# Define property object parameters.
# 'ObservationProperties' are for specifying how the agent sees the environment.
obs_props = ObservationProperties(render_agents=AgentRenderOptions.NOT, # Agents won`t be shown in the obs at all
omit_agent_self=True, # This is default
additional_agent_placeholder=None, # We will not take care of future agent
frames_to_stack=3, # To give the agent a notion of time
pomdp_r=2 # the agent view-radius
)
# 'MovementProperties' are for specifying how the agent is allowed to move in the environment.
move_props = MovementProperties(allow_diagonal_movement=True, # Euclidean style (vertices)
allow_square_movement=True, # Manhattan (edges)
allow_no_op=False) # Pause movement (do nothing)
# 'ItemProperties' control if and how item is spawned
# TODO: Comments
item_props = ItemProperties(
n_items = 7, # How many items are there at the same time
spawn_frequency = 50, # Spawn Frequency in Steps
n_drop_off_locations = 10, # How many DropOff locations are there at the same time
max_dropoff_storage_size = 0, # How many items are needed until the dropoff is full
max_agent_inventory_capacity = 5, # How many items are needed until the agent inventory is full
)
# These are the EnvKwargs for initializing the environment class, holding all former parameter-classes
# TODO: Comments
factory_kwargs = dict(n_agents=1,
max_steps=400,
parse_doors=True,
level_name='rooms',
doors_have_area=True, #
verbose=False,
mv_prop=move_props, # See Above
obs_prop=obs_props, # See Above
done_at_collision=True,
item_prop=item_props
)
#########################################################
# 2. Setup parameters for the agent training (SB3: PPO) and save metrics.
agent_kwargs = dict()
#########################################################
# Run the Training
for seed in range(n_seeds):
# Make a copy if you want to alter things in the training loop; like the seed.
env_kwargs = factory_kwargs.copy()
env_kwargs.update(env_seed=seed)
# Output folder
seed_path = exp_path / f'{str(seed)}_{identifier}'
seed_path.mkdir(parents=True, exist_ok=True)
# Parameter Storage
param_path = seed_path / env_params_json
# Observation (measures) Storage
monitor_path = seed_path / 'monitor.pick'
recorder_path = seed_path / 'recorder.json'
# Model save Path for the trained model
model_save_path = seed_path / f'model.zip'
# Env Init & Model kwargs definition
with ItemFactory(**env_kwargs) as env_factory:
# EnvMonitor Init
env_monitor_callback = EnvMonitor(env_factory)
# EnvRecorder Init
env_recorder_callback = EnvRecorder(env_factory, freq=int(train_steps / 400 / 10))
# Model Init
model = model_class("MlpPolicy", env_factory,verbose=1, seed=seed, device='cpu')
# Model train
model.learn(total_timesteps=int(train_steps), callback=[env_monitor_callback, env_recorder_callback])
#########################################################
# 3. Save environment and agent for later analysis.
# Save the trained Model, the monitor (environment measures) and the environment parameters
model.named_observation_space = env_factory.named_observation_space
model.named_action_space = env_factory.named_action_space
model.save(model_save_path)
env_factory.save_params(param_path)
env_monitor_callback.save_run(monitor_path)
env_recorder_callback.save_records(recorder_path, save_occupation_map=False)
# Compare performance runs, for each seed within a model
try:
compare_seed_runs(exp_path, use_tex=False)
except ValueError:
pass
# Train ends here ############################################################
# Evaluation starts here #####################################################
# First Iterate over every model and monitor "as trained"
print('Start Measurement Tracking')
# For trained policy in study_root_path / _identifier
for policy_path in [x for x in exp_path.iterdir() if x.is_dir()]:
# retrieve model class
model_cls = next(val for key, val in h.MODEL_MAP.items() if key in policy_path.parent.name)
# Load the agent
model = model_cls.load(policy_path / 'model.zip', device='cpu')
# Load old environment kwargs
with next(policy_path.glob(env_params_json)).open('r') as f:
env_kwargs = simplejson.load(f)
# Make the environment stop at collisions
# (you only want to have a single collision per episode for the statistics)
env_kwargs.update(done_at_collision=True)
# Init Env
with ItemFactory(**env_kwargs) as env_factory:
monitored_env_factory = EnvMonitor(env_factory)
# Evaluation Loop for i in range(n Episodes)
for episode in range(100):
# noinspection PyRedeclaration
env_state = monitored_env_factory.reset()
rew, done_bool = 0, False
while not done_bool:
action = model.predict(env_state, deterministic=True)[0]
env_state, step_r, done_bool, info_obj = monitored_env_factory.step(action)
rew += step_r
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
monitored_env_factory.save_run(filepath=policy_path / 'eval_run_monitor.pick')
print('Measurements Done')


@@ -1,25 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
buildapi:
sphinx-apidoc.exe -fEM -T -t _templates -o source/source ../marl_factory_grid "../**/marl", "../**/proto"
@echo "Auto-generation of 'SOURCEAPI' documentation finished. " \
"The generated files were placed in 'source/'"
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)


@@ -1,35 +0,0 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd


@@ -1,4 +0,0 @@
myst_parser
sphinx-pdj-theme
sphinx-mdinclude
sphinx-book-theme


@@ -1,72 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = 'rl-factory-grid'
copyright = '2023, Steffen Illium, Robert Mueller, Joel Friedrich'
author = 'Steffen Illium, Robert Mueller, Joel Friedrich'
release = '2.5.0'
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [#'myst_parser',
'sphinx.ext.todo',
'sphinx.ext.autodoc',
'sphinx.ext.intersphinx',
# 'sphinx.ext.autosummary',
'sphinx.ext.linkcode',
'sphinx_mdinclude',
]
templates_path = ['_templates']
exclude_patterns = ['marl_factory_grid.utils.proto', 'marl_factory_grid.utils.proto.fiksProto_pb2*']
autoclass_content = 'both'
autodoc_class_signature = 'separated'
autodoc_typehints = 'description'
autodoc_inherit_docstrings = True
autodoc_typehints_format = 'short'
autodoc_default_options = {
'members': True,
# 'member-order': 'bysource',
'special-members': '__init__',
'undoc-members': True,
# 'exclude-members': '__weakref__',
'show-inheritance': True,
}
autosummary_generate = True
add_module_names = False
toc_object_entries = False
modindex_common_prefix = ['marl_factory_grid.']
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here.
from pathlib import Path
import sys
sys.path.insert(0, (Path(__file__).parents[2]).resolve().as_posix())
sys.path.insert(0, (Path(__file__).parents[2] / 'marl_factory_grid').resolve().as_posix())
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "sphinx_book_theme" # 'alabaster'
# html_static_path = ['_static']
# In your configuration, you need to specify a linkcode_resolve function that returns an URL based on the object.
# https://www.sphinx-doc.org/en/master/usage/extensions/linkcode.html
def linkcode_resolve(domain, info):
if domain != 'py':
return None
if not info['module']:
return None
filename = info['module'].replace('.', '/')
return "https://github.com/illiumst/marl-factory-grid/%s.py" % filename
print(sys.executable)


@@ -1,99 +0,0 @@
Creating a New Scenario
=======================
Creating a new scenario in the `marl-factory-grid` environment allows you to customize the environment to fit your specific requirements. This guide provides step-by-step instructions on how to create a new scenario, including defining a configuration file, designing a level, and potentially adding new entities, rules, and assets. See the "modifications.rst" file for more information on how to modify existing entities, levels, rules, groups and assets.
Step 1: Define Configuration File
---------------------------------
1. **Create a Configuration File:** Start by creating a new configuration file (`.yaml`) for your scenario. This file will contain settings such as the number of agents, environment dimensions, and other parameters. You can use existing configuration files as templates.
2. **Specify Custom Parameters:** Modify the configuration file to include any custom parameters specific to your scenario. For example, you can set the respawn rate of entities or define specific rewards.
Step 2: Design the Level
------------------------
1. **Create a Level File:** Design the layout of your environment by creating a new level file (`.txt`). Use symbols such as `#` for walls, `-` for walkable floors, and introduce new symbols for custom entities.
2. **Define Entity Locations:** Specify the initial locations of entities, including agents and any new entities introduced in your scenario. These spawn locations are typically provided in the conf file.
Step 3: Introduce New Entities
------------------------------
1. **Create New Entity Modules:** If your scenario involves introducing new entities, create new entity modules in the `marl_factory_grid/environment/entity` directory. Define their behavior, properties, and any custom actions they can perform. Check out the template module.
2. **Update Configuration:** Update the configuration file to include settings related to your new entities, such as spawn rates, initial quantities, or any specific behaviors.
Step 4: Implement Custom Rules
--------------------------------
1. **Create Rule Modules:** If your scenario requires custom rules, create new rule modules in the `marl_factory_grid/environment/rules` directory. Implement the necessary logic to govern the behavior of entities in your scenario and use the provided environment hooks.
2. **Update Configuration:** If your custom rules have configurable parameters, update the configuration file to include these settings and activate the rule by adding it to the conf file.
Step 5: Add Custom Assets (Optional)
--------------------------------------
1. **Include Custom Asset Files:** If your scenario introduces new assets (e.g., images for entities), include the necessary asset files in the appropriate directories, such as `marl_factory_grid/environment/assets`.
Step 6: Test and Experiment
-----------------------------
1. **Run Your Scenario:** Use the provided scripts or write your own script to run the scenario with your customized configuration. Observe the behavior of agents and entities in the environment.
2. **Iterate and Experiment:** Adjust configuration parameters, level design, or introduce new elements based on your observations. Iterate through this process until your scenario meets your desired specifications.
Congratulations! You have successfully created a new scenario in the `marl-factory-grid` environment. Experiment with different configurations, levels, entities, and rules to design unique and engaging environments for your simulations. Below you will find an example of how to create a new scenario.
New Example Scenario: Apple Resource Dilemma
----------------------------------------------
To provide you with an example, we'll guide you through creating the "Apple Resource Dilemma" scenario using the steps outlined in the tutorial.
In this example scenario, agents face a dilemma of collecting apples. The apples only spawn if there are already enough in the environment. If agents collect them at the beginning, they won't respawn as quickly as if they wait for more to spawn before collecting.
**Step 1: Define Configuration File**
1. **Create a Configuration File:** Start by creating a new configuration file, e.g., `apple_dilemma_config.yaml`. Use the default config file as a good starting point.
2. **Specify Custom Parameters:** Add custom parameters to control the behavior of your scenario. Also delete unused entities, actions and observations (such as dirt piles) from the default config file.
**Step 2: Design the Level**
1. Create a Level File: Design the layout of your environment by creating a new level file, e.g., apple_dilemma_level.txt.
Of course you can also just use or modify an existing level.
2. Define Entity Locations: Specify the initial locations of entities, including doors (D). Since the apples will likely be spawning randomly, it would not make sense to encode their spawn in the level file.
**Step 3: Introduce New Entities**
1. Create New Entity Modules: Create a new entity module for the apple in the `marl_factory_grid/environment/entity` directory. Use the module template or existing modules as inspiration. Instead of creating a new agent, the item agent can be used, as it is already configured to collect all items and drop them off at designated locations.
2. Update Configuration: Update the configuration file to include settings related to your new entities. Agents need to be able to interact with and observe them.
**Step 4: Implement Custom Rules**
1. Create Rule Modules: You might want to create new rule modules. For example, apple_respawn_rule.py could be inspired by the dirt respawn rule:
>>> from marl_factory_grid.environment.rules.rule import Rule
    class AppleRespawnRule(Rule):
        def __init__(self, apple_spawn_rate=0.1):
            super().__init__()
            self.apple_spawn_rate = apple_spawn_rate

        def tick_post_step(self, state):
            # Logic to respawn apples based on spawn rate
            pass
2. Update Configuration: Update the configuration file to include the new rule.
**Step 5: Add Custom Assets (Optional)**
1. Include Custom Asset Files: If your scenario introduces new assets (e.g., images for entities), include the necessary files in the appropriate directories, such as `marl_factory_grid/environment/assets`.
**Step 6: Test and Experiment**
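Analogous to the general Step 6 above, run the scenario and iterate on it. A minimal sketch of such a run script, adapted from the basic-usage guide, could look like the following; `apple_dilemma_config.yaml` is the hypothetical config file from Step 1, and the random actions merely stand in for your own agents:
>>> from pathlib import Path
    from random import randint
    from tqdm import trange
    from marl_factory_grid.environment.factory import Factory

    factory = Factory(Path('marl_factory_grid/configs/apple_dilemma_config.yaml'))
    for episode in trange(10):
        _ = factory.reset()
        done = False
        action_spaces = factory.action_space
        while not done:
            # Replace the random policy with your own agents.
            a = [randint(0, x.n - 1) for x in action_spaces]
            obs_type, _, reward, done, info = factory.step(a)
Observe how quickly the apples get collected, adjust the spawn rate or rewards in the config file, and iterate until the dilemma behaves as intended.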

View File

@ -1,23 +0,0 @@
.. toctree::
:maxdepth: 1
:caption: Table of Contents
:titlesonly:
installation
usage
modifications
creating a new scenario
testing
source
.. note::
This project is under active development.
.. mdinclude:: ../../README.md
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,22 +0,0 @@
Installation
============
How to install the environment
------------------------------
To use `marl-factory-grid`, first install it using pip:
.. code-block:: console

    (.venv) $ pip install marl-factory-grid
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,92 +0,0 @@
Custom Modifications
====================
This section covers the main aspects of working with the environment.
Modifying levels
----------------
Varying levels are created by defining Walls, Floor or Doors in *.txt*-files (see `levels`_ for examples).
Define which *level* to use in your *config file* as:
.. _levels: marl_factory_grid/levels
>>> General:
        level_name: rooms  # 'simple', 'narrow_corridor', 'eight_puzzle', ...
... or create your own, maybe with the help of `asciiflow.com <https://asciiflow.com/#/>`_.
Make sure to use `#` as `Walls`_, `-` as free (walkable) floor, and `D` for `Doors`_.
Other Entities (define your own) may bring their own `Symbols`.
.. _Walls: marl_factory_grid/environment/entity/wall.py
.. _Doors: modules/doors/entities.py
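To make the symbol conventions concrete, a minimal, purely illustrative level (two small rooms connected by a door; dimensions and layout are up to you) could look like this:

.. code-block:: text

    #########
    #---#---#
    #---D---#
    #---#---#
    #########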
Modifying Entities
--------------------
Entities are `Objects`_ that can additionally be assigned a position.
Abstract Entities are provided.
If you wish to introduce new entities to the environment, just create a new module that implements the entity class. If
necessary, provide additional classes such as custom actions or rewards, and load the entity into the environment using
the config file.
.. _Objects: marl_factory_grid/environment/entity/object.py
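As a rough sketch (this is an illustration, not code from the repository; check `marl_factory_grid/environment/entity/entity.py` for the actual base-class name and constructor signature), such a module might start out as simply as:
>>> from marl_factory_grid.environment.entity.entity import Entity
    class Apple(Entity):
        """Hypothetical collectible entity; position handling is assumed to come from the abstract base class."""
        pass
Listing the new entity in the config file (plus, if needed, a matching collection and asset, see the following sections) then makes it available to the environment.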
Modifying Groups
----------------
`Groups`_ are entity Sets that provide administrative access to all group members.
All `Entity Collections`_ are available at runtime as a property of the env state.
If you add an entity, you probably also want a collection of that entity.
.. _Groups: marl_factory_grid/environment/groups/objects.py
.. _Entity Collections: marl_factory_grid/environment/entity/global_entities.py
Modifying Rules
---------------
`Rules <https://marl-factory-grid.readthedocs.io/en/latest/code/marl_factory_grid.environment.rules.html>`_ define how
the environment behaves on a micro scale. Each of the hooks (`on_init`, `pre_step`, `on_step`, `post_step`, `on_done`)
provides env-access to implement custom logic, calculate rewards, or gather information.
If you wish to introduce new rules to the environment, make sure they implement the Rule class and override its hooks
to implement your own rule logic.
.. image:: ../../images/Hooks_FIKS.png
:alt: Hooks Image
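As a minimal sketch (reusing the import path and the `tick_post_step` hook shown in the scenario guide; check the rules module for the exact hook names and signatures), a custom rule could simply count environment steps:
>>> from marl_factory_grid.environment.rules.rule import Rule
    class StepCounterRule(Rule):
        """Hypothetical rule that counts environment steps via the post-step hook."""
        def __init__(self):
            super().__init__()
            self.steps = 0

        def tick_post_step(self, state):
            # Called once per environment step with the current env state.
            self.steps += 1
Like any other rule, such a rule is activated by adding it to your config file.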
Modifying Constants and Rewards
-------------------------------
Customizing rewards and constants allows you to tailor the environment to specific requirements.
You can set custom rewards in the configuration file. If no specific rewards are defined, the environment
will utilize default rewards, which are provided in the constants file of each module.
In addition to rewards, you can also customize other constants used in the environment's rules or actions. Each module has
its dedicated constants file, while global constants are centrally located in the environment's constants file.
Be careful when making changes to constants, as they can radically impact the behavior of the environment. Only modify
constants if you have a solid understanding of their implications and are confident in the adjustments you're making.
Modifying Results
-----------------
`Results <https://marl-factory-grid.readthedocs.io/en/latest/code/marl_factory_grid.utils.results.html>`_
provide a way to return `rule` evaluations such as rewards and state reports back to the environment.
Modifying Assets
----------------
Make sure to bring your own assets for each Entity living in the Gridworld, as the `Renderer` relies on them.
In general, PNG files (transparent background) with a square aspect ratio should do the job.
.. image:: ../../marl_factory_grid/environment/assets/wall.png
:alt: Wall Image
.. image:: ../../marl_factory_grid/environment/assets/agent/agent.png
:alt: Agent Image
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,17 +0,0 @@
Source
======
.. toctree::
:maxdepth: 2
:glob:
:caption: Table of Contents
:titlesonly:
source/*
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.environment.entity package
==============================================
.. automodule:: marl_factory_grid.environment.entity
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.environment.entity.agent
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.entity.entity
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.entity.object
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.entity.util
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.entity.wall
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,52 +0,0 @@
marl\_factory\_grid.environment.groups package
==============================================
.. automodule:: marl_factory_grid.environment.groups
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.environment.groups.agents
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.collection
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.global_entities
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.mixins
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.objects
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.utils
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.walls
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,49 +0,0 @@
marl\_factory\_grid.environment package
=======================================
.. automodule:: marl_factory_grid.environment
:members:
:undoc-members:
:show-inheritance:
Subpackages
-----------
.. toctree::
:maxdepth: 4
marl_factory_grid.environment.entity
marl_factory_grid.environment.groups
Submodules
----------
.. automodule:: marl_factory_grid.environment.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.factory
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.rewards
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,7 +0,0 @@
marl\_factory\_grid.levels package
==================================
.. automodule:: marl_factory_grid.levels
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.batteries package
=============================================
.. automodule:: marl_factory_grid.modules.batteries
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.batteries.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.batteries.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.batteries.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.batteries.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.batteries.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.clean\_up package
=============================================
.. automodule:: marl_factory_grid.modules.clean_up
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.clean_up.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.clean_up.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.clean_up.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.clean_up.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.clean_up.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.destinations package
================================================
.. automodule:: marl_factory_grid.modules.destinations
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.destinations.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.destinations.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.destinations.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.destinations.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.destinations.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.doors package
=========================================
.. automodule:: marl_factory_grid.modules.doors
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.doors.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.doors.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.doors.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.doors.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.doors.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.items package
=========================================
.. automodule:: marl_factory_grid.modules.items
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.items.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.items.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.items.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.items.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.items.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.machines package
============================================
.. automodule:: marl_factory_grid.modules.machines
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.machines.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.machines.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.machines.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.machines.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.machines.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,34 +0,0 @@
marl\_factory\_grid.modules.maintenance package
===============================================
.. automodule:: marl_factory_grid.modules.maintenance
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.maintenance.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.maintenance.entities
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.maintenance.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.maintenance.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,22 +0,0 @@
marl\_factory\_grid.modules package
===================================
.. automodule:: marl_factory_grid.modules
:members:
:undoc-members:
:show-inheritance:
Subpackages
-----------
.. toctree::
:maxdepth: 4
marl_factory_grid.modules.batteries
marl_factory_grid.modules.clean_up
marl_factory_grid.modules.destinations
marl_factory_grid.modules.doors
marl_factory_grid.modules.items
marl_factory_grid.modules.machines
marl_factory_grid.modules.maintenance
marl_factory_grid.modules.zones

View File

@ -1,34 +0,0 @@
marl\_factory\_grid.modules.zones package
=========================================
.. automodule:: marl_factory_grid.modules.zones
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.zones.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.zones.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.zones.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.zones.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,28 +0,0 @@
marl\_factory\_grid package
===========================
.. automodule:: marl_factory_grid
:members:
:undoc-members:
:show-inheritance:
Subpackages
-----------
.. toctree::
:maxdepth: 4
marl_factory_grid.algorithms
marl_factory_grid.environment
marl_factory_grid.levels
marl_factory_grid.modules
marl_factory_grid.utils
Submodules
----------
.. automodule:: marl_factory_grid.quickstart
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,22 +0,0 @@
marl\_factory\_grid.utils.logging package
=========================================
.. automodule:: marl_factory_grid.utils.logging
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.utils.logging.envmonitor
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.logging.recorder
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,28 +0,0 @@
marl\_factory\_grid.utils.plotting package
==========================================
.. automodule:: marl_factory_grid.utils.plotting
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.utils.plotting.plot_compare_runs
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.plotting.plot_single_runs
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.plotting.plotting_utils
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,79 +0,0 @@
marl\_factory\_grid.utils package
=================================
.. automodule:: marl_factory_grid.utils
:members:
:undoc-members:
:show-inheritance:
Subpackages
-----------
.. toctree::
:maxdepth: 4
marl_factory_grid.utils.logging
marl_factory_grid.utils.plotting
Submodules
----------
.. automodule:: marl_factory_grid.utils.config_parser
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.helpers
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.level_parser
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.observation_builder
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.ray_caster
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.renderer
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.results
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.states
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.tools
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.utility_classes
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,15 +0,0 @@
Testing
=======
In EDYS, tests are seamlessly integrated through environment hooks, mirroring the organization of rules, as explained in the README.md file.
Running tests
-------------
To include specific tests in your run, simply append them to the "tests" section within the configuration file.
If the test requires a specific entity in the environment (e.g., the clean-up test requires a TSPDirtAgent that can observe
and clean dirt in its environment), make sure to include it in the config file.
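For instance, mirroring the demo script shipped with the repository, a TSPDirtAgent can drive the environment while the configured tests run through their hooks (a sketch; it assumes a config that enables dirt piles and the desired tests):
>>> from pathlib import Path
    from marl_factory_grid.algorithms.static.TSP_dirt_agent import TSPDirtAgent
    from marl_factory_grid.environment.factory import Factory

    factory = Factory(Path('marl_factory_grid/configs/test_config.yaml'))
    _ = factory.reset()
    agents = [TSPDirtAgent(factory, 0)]
    done = False
    while not done:
        a = [agent.predict() for agent in agents]
        obs_type, _, reward, done, info = factory.step(a)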
Writing tests
---------------
If you intend to create additional tests, refer to the tests.py file for examples.
Ensure that any new tests implement the corresponding test class and make use of its hooks.
There are no additional steps required, except for the inclusion of your custom tests in the config file.

View File

@ -1,75 +0,0 @@
Basic Usage
===========
Environment objects, including agents, entities and rules, that are specified in a *yaml*-configfile will be loaded automatically.
Using ``quickstart_use`` creates a default config-file and another one that lists all possible options of the environment.
Also, it generates an initial script where an agent is executed in the environment specified by the config-file.
After initializing the environment using the specified configuration file, the script enters a reinforcement learning loop.
The loop consists of episodes, where each episode involves resetting the environment, executing actions, and receiving feedback.
Here's a breakdown of the key components in the provided script. Feel free to customize it based on your specific requirements:
1. **Initialization:**
>>> path = Path('marl_factory_grid/configs/default_config.yaml')
factory = Factory(path)
factory = EnvMonitor(factory)
factory = EnvRecorder(factory)
- The `path` variable points to the location of your configuration file. Ensure it corresponds to the correct path.
- `Factory` initializes the environment based on the provided configuration.
- `EnvMonitor` and `EnvRecorder` are optional components. They add monitoring and recording functionalities to the environment, respectively.
2. **Reinforcement Learning Loop:**
>>> for episode in trange(10):
        _ = factory.reset()
        done = False
        if render:
            factory.render()
        action_spaces = factory.action_space
        agents = []
- The loop iterates over a specified number of episodes (in this case, 10).
- `factory.reset()` resets the environment for a new episode.
- `factory.render()` is used for visualization if rendering is enabled.
- `action_spaces` stores the action spaces available for the agents.
- `agents` will store agent-specific information during the episode.
3. **Taking Actions:**
>>> while not done:
        a = [randint(0, x.n - 1) for x in action_spaces]
        obs_type, _, reward, done, info = factory.step(a)
        if render:
            factory.render()
- Within each episode, the loop continues until the environment signals completion (`done`).
- `a` represents a list of random actions for each agent based on their action space.
- `factory.step(a)` executes the actions, returning observation types, rewards, completion status, and additional information.
4. **Handling Episode Completion:**
>>> if done:
        print(f'Episode {episode} done...')
- After each episode, a message is printed indicating its completion.
Evaluating the run
------------------
If monitoring and recording are enabled, the environment states will be traced and recorded automatically.
The EnvMonitor class acts as a wrapper for Gym environments, monitoring and logging key information during interactions,
while the EnvRecorder class records state summaries during interactions in the environment.
At the end of each run, a plot displaying the step reward is generated. The step reward represents the cumulative sum of rewards obtained by all agents throughout the episode.
Furthermore, a comparative plot that shows the achieved score (step reward) over several runs with different seeds or parameter settings can be generated using the methods provided in plotting/plot_compare_runs.py.
For a more comprehensive evaluation, we recommend using the `Weights and Biases (W&B) <https://wandb.ai/site>`_ framework together with the dataframes generated by the monitor and recorder. These can be found in the run path specified in your script. W&B provides a powerful API for logging and visualizing model training metrics, enabling analysis with predefined as well as custom metrics.
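As a rough sketch of how such an analysis could start, the logged dataframe can be pushed to W&B as follows; note that the file name and format below are assumptions and have to be adapted to whatever the monitor and recorder actually write into your run path:
>>> import pandas as pd
    import wandb

    run = wandb.init(project='marl-factory-grid', name='eval-run-0')
    # Placeholder path and format: point this at the monitor/recorder output of your run.
    df = pd.read_pickle('study_out/run0/monitor.pkl')
    for _, row in df.iterrows():
        wandb.log(row.to_dict())
    run.finish()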
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,37 +1,29 @@
from pathlib import Path
from pprint import pprint
from tqdm import trange
from marl_factory_grid.algorithms.static.TSP_coin_agent import TSPCoinAgent
from marl_factory_grid.algorithms.static.TSP_dirt_agent import TSPDirtAgent
from marl_factory_grid.algorithms.static.TSP_item_agent import TSPItemAgent
from marl_factory_grid.algorithms.static.TSP_target_agent import TSPTargetAgent
from marl_factory_grid.environment.factory import Factory
from marl_factory_grid.utils.plotting.plot_single_runs import plot_routes, plot_action_maps
if __name__ == '__main__':
run_path = Path('study_out')
run_path = Path('../study_out')
render = True
monitor = True
record = True
# Path to config File
path = Path('marl_factory_grid/configs/test_config.yaml')
path = Path('../marl_factory_grid/configs/default_config.yaml')
# Env Init
factory = Factory(path)
for episode in trange(1):
for episode in trange(10):
_ = factory.reset()
done = False
if render:
factory.render()
action_spaces = factory.action_space
# agents = [TSPDirtAgent(factory, 0), TSPItemAgent(factory, 1), TSPTargetAgent(factory, 2)]
agents = [TSPCoinAgent(factory, 0)]
agents = [TSPDirtAgent(factory, 0), TSPCoinAgent(factory, 1)]
while not done:
a = [x.predict() for x in agents]
obs_type, _, _, done, info = factory.step(a)

View File

@ -1,41 +1,33 @@
from pathlib import Path
from random import randint
from tqdm import trange
from marl_factory_grid.algorithms.static.TSP_item_agent import TSPItemAgent
from marl_factory_grid.environment.factory import Factory
from marl_factory_grid.utils.logging.envmonitor import EnvMonitor
from marl_factory_grid.utils.logging.recorder import EnvRecorder
from marl_factory_grid.utils.plotting.plot_single_runs import plot_single_run
from marl_factory_grid.utils.tools import ConfigExplainer
if __name__ == '__main__':
# Render at each step?
render = True
run_path = Path('study_out')
run_path = Path('../study_out')
render = True
monitor = True
record = True
# Path to config File
path = Path('marl_factory_grid/configs/_obs_test.yaml')
path = Path('../marl_factory_grid/configs/test_config.yaml')
# Env Init
factory = Factory(path)
# RL learn Loop
for episode in trange(10):
_ = factory.reset()
done = False
if render:
factory.render()
action_spaces = factory.action_space
agents = [TSPItemAgent(factory, 0)]
while not done:
a = [randint(0, x.n - 1) for x in action_spaces]
a = [x.predict() for x in agents]
obs_type, _, _, done, info = factory.step(a)
if render:
factory.render()
if done:
print(f'Episode {episode} done...')
break
print('Done!!! Goodbye....')

82
main.py Normal file
View File

@ -0,0 +1,82 @@
from marl_factory_grid.algorithms.marl.RL_runner import rerun_coin_quadrant_agent1_training, \
rerun_two_rooms_agent1_training, rerun_two_rooms_agent2_training, coin_quadrant_multi_agent_rl_eval, \
two_rooms_multi_agent_rl_eval
from marl_factory_grid.algorithms.static.TSP_runner import coin_quadrant_multi_agent_tsp_eval, \
two_rooms_multi_agent_tsp_eval
###### Coin-quadrant environment ######
def coin_quadrant_single_agent_training():
""" Rerun training of RL-agent in coins_quadrant environment.
The trained model and additional training metrics are saved in the study_out folder. """
rerun_coin_quadrant_agent1_training()
def coin_quadrant_RL_multi_agent_eval_emergent():
""" Rerun multi-agent evaluation of RL-agents in coins_quadrant environment,
with occurring emergent phenomenon. Evaluation takes trained models from study_out/run0 for both agents."""
coin_quadrant_multi_agent_rl_eval(emergent_phenomenon=True)
def coin_quadrant_RL_multi_agent_eval_prevented():
""" Rerun multi-agent evaluation of RL-agents in coins_quadrant environment,
with emergence prevention mechanism. Evaluation takes trained models from study_out/run0 for both agents."""
coin_quadrant_multi_agent_rl_eval(emergent_phenomenon=False)
def coin_quadrant_TSP_multi_agent_eval_emergent():
""" Rerun multi-agent evaluation of TSP-agents in coins_quadrant environment,
with occurring emergent phenomenon. """
coin_quadrant_multi_agent_tsp_eval(emergent_phenomenon=True)
def coin_quadrant_TSP_multi_agent_eval_prevented():
""" Rerun multi-agent evaluation of TSP-agents in coins_quadrant environment,
with emergence prevention mechanism. """
coin_quadrant_multi_agent_tsp_eval(emergent_phenomenon=False)
###### Two-rooms environment ######
def two_rooms_agent1_training():
""" Rerun training of left RL-agent in two_rooms environment.
The trained model and additional training metrics are saved in the study_out folder. """
rerun_two_rooms_agent1_training()
def two_rooms_agent2_training():
""" Rerun training of right RL-agent in two_rooms environment.
The trained model and additional training metrics are saved in the study_out folder. """
rerun_two_rooms_agent2_training()
def two_rooms_RL_multi_agent_eval_emergent():
""" Rerun multi-agent evaluation of RL-agents in two_rooms environment, with
occurring emergent phenomenon. Evaluation takes trained models
from study_out/run1 for agent1 and study_out/run2 for agent2. """
two_rooms_multi_agent_rl_eval(emergent_phenomenon=True)
def two_rooms_RL_multi_agent_eval_prevented():
""" Rerun multi-agent evaluation of RL-agents in two_rooms environment, with
emergence prevention mechanism. Evaluation takes trained models
from study_out/run1 for agent1 and study_out/run2 for agent2. """
two_rooms_multi_agent_rl_eval(emergent_phenomenon=False)
def two_rooms_TSP_multi_agent_eval_emergent():
""" Rerun multi-agent evaluation of TSP-agents in two_rooms environment, with
occurring emergent phenomenon. """
two_rooms_multi_agent_tsp_eval(emergent_phenomenon=True)
def two_rooms_TSP_multi_agent_eval_prevented():
""" Rerun multi-agent evaluation of TSP-agents in two_rooms environment, with
emergence prevention mechanism. """
two_rooms_multi_agent_tsp_eval(emergent_phenomenon=False)
if __name__ == '__main__':
# Select any of the above functions to rerun the respective part
# from our evaluation section of the paper
coin_quadrant_RL_multi_agent_eval_prevented()

View File

@ -1,4 +1,3 @@
from .quickstart import init
from marl_factory_grid.environment.factory import Factory
"""
Main module of the 'rl-factory-grid'-environment.

View File

@ -0,0 +1,80 @@
from pathlib import Path
from marl_factory_grid.algorithms.marl.a2c_coin import A2C
from marl_factory_grid.algorithms.marl.utils import get_algorithms_marl_path
from marl_factory_grid.algorithms.utils import load_yaml_file
####### Training routines ######
def rerun_coin_quadrant_agent1_training():
train_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/coin_quadrant_train_config.yaml')
eval_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/coin_quadrant_eval_config.yaml')
train_cfg = load_yaml_file(train_cfg_path)
eval_cfg = load_yaml_file(eval_cfg_path)
print("Training phase")
agent = A2C(train_cfg=train_cfg, eval_cfg=eval_cfg, mode="train")
agent.train_loop()
print("Evaluation phase")
agent.eval_loop("coin_quadrant", n_episodes=1)
def two_rooms_training(max_steps, agent_name):
train_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/two_rooms_train_config.yaml')
eval_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/two_rooms_eval_config.yaml')
train_cfg = load_yaml_file(train_cfg_path)
eval_cfg = load_yaml_file(eval_cfg_path)
# train_cfg["algorithm"]["max_steps"] = max_steps
train_cfg["env"]["env_name"] = f"marl/single_agent_configs/two_rooms_{agent_name}_train_config"
eval_cfg["env"]["env_name"] = f"marl/single_agent_configs/two_rooms_{agent_name}_eval_config"
print("Training phase")
agent = A2C(train_cfg=train_cfg, eval_cfg=eval_cfg, mode="train")
agent.train_loop()
print("Evaluation phase")
agent.eval_loop("two_rooms", n_episodes=1)
def rerun_two_rooms_agent1_training():
two_rooms_training(max_steps=190000, agent_name="agent1")
def rerun_two_rooms_agent2_training():
two_rooms_training(max_steps=260000, agent_name="agent2")
####### Eval routines ########
def single_agent_eval(config_name, run_folder_name):
eval_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/{config_name}_eval_config.yaml')
eval_cfg = load_yaml_file(eval_cfg_path)
# No train_cfg is passed here; in eval mode the training environment is not used
agent = A2C(eval_cfg=eval_cfg, mode="eval")
print("Evaluation phase")
agent.load_agents(config_name, [run_folder_name])
agent.eval_loop(config_name, 1)
def multi_agent_eval(config_name, runs, emergent_phenomenon=False):
eval_cfg_path = Path(f'{get_algorithms_marl_path()}/multi_agent_configs/{config_name}' +
f'_eval_config{"_emergent" if emergent_phenomenon else ""}.yaml')
eval_cfg = load_yaml_file(eval_cfg_path)
# No train_cfg is passed here; in eval mode the training environment is not used
agent = A2C(eval_cfg=eval_cfg, mode="eval")
print("Evaluation phase")
agent.load_agents(config_name, runs)
agent.eval_loop(config_name, 1)
def coin_quadrant_multi_agent_rl_eval(emergent_phenomenon):
# Using an empty list for runs indicates that the default agents in algorithms/agent_models should be used.
# If you want to use different agents that were obtained by running the training with a different seed, you can
# load these agents by inserting the names of the runs in study_out/ into the runs list, e.g. ["run1", "run2"]
multi_agent_eval("coin_quadrant", [], emergent_phenomenon)
def two_rooms_multi_agent_rl_eval(emergent_phenomenon):
# Using an empty list for runs indicates that the default agents in algorithms/agent_models should be used.
# If you want to use different agents that were obtained by running the training with a different seed, you can
# load these agents by inserting the names of the runs in study_out/ into the runs list, e.g. ["run1", "run2"]
multi_agent_eval("two_rooms", [], emergent_phenomenon)

View File

@ -0,0 +1 @@

View File

@ -1,53 +1,66 @@
import os
import pickle
import torch
from typing import Union, List
import numpy as np
from tqdm import tqdm
from marl_factory_grid.algorithms.rl.base_a2c import PolicyGradient
from marl_factory_grid.algorithms.rl.constants import Names
from marl_factory_grid.algorithms.rl.utils import transform_observations, _as_torch, is_door_close, \
from marl_factory_grid.algorithms.marl.base_a2c import PolicyGradient, cumulate_discount
from marl_factory_grid.algorithms.marl.constants import Names
from marl_factory_grid.algorithms.marl.utils import transform_observations, _as_torch, is_door_close, \
get_coin_piles_positions, update_target_pile, update_ordered_coin_piles, get_all_collected_coin_piles, \
distribute_indices, set_agents_spawnpoints, get_ordered_coin_piles, handle_finished_episode, save_configs, \
save_agent_models, get_all_observations, get_agents_positions
from marl_factory_grid.algorithms.utils import add_env_props
from marl_factory_grid.utils.plotting.plot_single_runs import plot_action_maps, plot_reward_development, \
create_info_maps
save_agent_models, get_all_observations, get_agents_positions, has_low_change_phase_started, significant_deviation, \
get_agent_models_path
from marl_factory_grid.algorithms.utils import add_env_props, get_study_out_path
from marl_factory_grid.utils.plotting.plot_single_runs import plot_action_maps, plot_return_development, \
create_info_maps, plot_return_development_change
nms = Names
ListOrTensor = Union[List, torch.Tensor]
class A2C:
def __init__(self, train_cfg, eval_cfg):
self.results_path = None
self.agents = None
self.act_dim = None
self.obs_dim = None
self.factory = add_env_props(train_cfg)
def __init__(self, train_cfg=None, eval_cfg=None, mode="train"):
self.mode = mode
if mode == nms.TRAIN:
self.train_factory = add_env_props(train_cfg)
self.train_cfg = train_cfg
self.n_agents = train_cfg[nms.ENV][nms.N_AGENTS]
else:
self.n_agents = eval_cfg[nms.ENV][nms.N_AGENTS]
self.eval_factory = add_env_props(eval_cfg)
self.__training = True
self.train_cfg = train_cfg
self.eval_cfg = eval_cfg
self.cfg = train_cfg
self.n_agents = train_cfg[nms.ENV][nms.N_AGENTS]
self.setup()
self.reward_development = []
self.action_probabilities = {agent_idx: [] for agent_idx in range(self.n_agents)}
def setup(self):
""" Initialize agents and create entry for run results according to configuration """
if self.mode == "train":
self.cfg = self.train_cfg
self.factory = self.train_factory
self.gamma = self.cfg[nms.ALGORITHM][nms.GAMMA]
else:
self.cfg = self.eval_cfg
self.factory = self.eval_factory
self.gamma = 0.99
seed = self.cfg[nms.ALGORITHM][nms.SEED]
print("Algorithm Seed: ", seed)
if seed == -1:
seed = np.random.choice(range(1000))
print("Algorithm seed is -1. Pick random seed: ", seed)
self.obs_dim = 2 + 2 * len(get_coin_piles_positions(self.factory)) if self.cfg[nms.ALGORITHM][
nms.PILE_OBSERVABILITY] == nms.ALL else 4
self.act_dim = 4 # The 4 movement directions
self.agents = [PolicyGradient(self.factory, agent_id=i, obs_dim=self.obs_dim, act_dim=self.act_dim) for i in
self.agents = [PolicyGradient(self.factory, seed=seed, gamma=self.gamma, agent_id=i, obs_dim=self.obs_dim, act_dim=self.act_dim) for i in
range(self.n_agents)]
if self.cfg[nms.ENV][nms.SAVE_AND_LOG]:
# Define study_out_path and check if it exists
base_dir = os.path.dirname(os.path.abspath(__file__)) # Directory of the script
study_out_path = os.path.join(base_dir, '../../../study_out')
study_out_path = os.path.abspath(study_out_path)
study_out_path = get_study_out_path()
if not os.path.exists(study_out_path):
raise FileNotFoundError(f"The directory {study_out_path} does not exist.")
@ -62,56 +75,86 @@ class A2C:
# Save settings in results folder
save_configs(self.results_path, self.cfg, self.factory.conf, self.eval_factory.conf)
def set_cfg(self, eval=False):
if eval:
self.cfg = self.eval_cfg
else:
self.cfg = self.train_cfg
def load_agents(self, runs_list):
def load_agents(self, config_name, runs_list):
""" Initialize networks with parameters of already trained agents """
for idx, run in enumerate(runs_list):
run_path = f"./study_out/{run}"
self.agents[idx].pi.load_model_parameters(f"{run_path}/PolicyNet_model_parameters.pth")
self.agents[idx].vf.load_model_parameters(f"{run_path}/ValueNet_model_parameters.pth")
if len(runs_list) == 0 or runs_list is None:
if config_name == "coin_quadrant":
for idx in range(self.n_agents):
self.agents[idx].pi.load_model_parameters(f"{get_agent_models_path()}/PolicyNet_model_parameters_coin_quadrant.pth")
self.agents[idx].vf.load_model_parameters(f"{get_agent_models_path()}/ValueNet_model_parameters_coin_quadrant.pth")
elif config_name == "two_rooms":
for idx in range(self.n_agents):
self.agents[idx].pi.load_model_parameters(f"{get_agent_models_path()}/PolicyNet_model_parameters_two_rooms_agent{idx+1}.pth")
self.agents[idx].vf.load_model_parameters(f"{get_agent_models_path()}/ValueNet_model_parameters_two_rooms_agent{idx+1}.pth")
else:
print("No such config does exist! Abort...")
else:
for idx, run in enumerate(runs_list):
run_path = f"./study_out/{run}"
self.agents[idx].pi.load_model_parameters(f"{run_path}/PolicyNet_model_parameters.pth")
self.agents[idx].vf.load_model_parameters(f"{run_path}/ValueNet_model_parameters.pth")
@torch.no_grad()
def train_loop(self):
""" Function for training agents """
env = self.factory
n_steps, max_steps = [self.cfg[nms.ALGORITHM][k] for k in [nms.N_STEPS, nms.MAX_STEPS]]
n_steps, max_steps = [self.train_cfg[nms.ALGORITHM][k] for k in [nms.N_STEPS, nms.MAX_STEPS]]
global_steps, episode = 0, 0
indices = distribute_indices(env, self.cfg, self.n_agents)
indices = distribute_indices(env, self.train_cfg, self.n_agents)
coin_piles_positions = get_coin_piles_positions(env)
target_pile = [partition[0] for partition in
indices] # list of pointers that point to the current target pile for each agent
collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
low_change_phase_start_episode = -1
episode_rewards_development = []
return_change_development = []
pbar = tqdm(total=max_steps)
while global_steps < max_steps:
loop_condition = True if self.train_cfg[nms.ALGORITHM][nms.EARLY_STOPPING] else global_steps < max_steps
while loop_condition:
_ = env.reset()
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
if self.train_cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
set_agents_spawnpoints(env, self.n_agents)
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.cfg, self.n_agents)
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.train_cfg, self.n_agents)
# Reset current target pile at episode begin if all piles have to be collected in one episode
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.ALL:
if self.train_cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.ALL:
target_pile = [partition[0] for partition in indices]
collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
episode_rewards_development.append([])
# Supply each agent with its local observation
obs = transform_observations(env, ordered_coin_piles, target_pile, self.cfg, self.n_agents)
done, rew_log = [False] * self.n_agents, 0
obs = transform_observations(env, ordered_coin_piles, target_pile, self.train_cfg, self.n_agents)
done, ep_return = [False] * self.n_agents, 0
if self.train_cfg[nms.ALGORITHM][nms.EARLY_STOPPING]:
if len(return_change_development) > self.train_cfg[nms.ALGORITHM][
nms.LAST_N_EPISODES] and low_change_phase_start_episode == -1 and has_low_change_phase_started(
return_change_development, self.train_cfg[nms.ALGORITHM][nms.LAST_N_EPISODES],
self.train_cfg[nms.ALGORITHM][nms.MEAN_TARGET_CHANGE]):
low_change_phase_start_episode = len(return_change_development)
print(low_change_phase_start_episode)
# Check if requirements for early stopping are met
if low_change_phase_start_episode != -1 and significant_deviation(return_change_development, low_change_phase_start_episode):
print(f"Early Stopping in Episode: {global_steps} because of significant deviation.")
break
if low_change_phase_start_episode != -1 and (len(return_change_development) - low_change_phase_start_episode) >= 1000:
print(f"Early Stopping in Episode: {global_steps} because of episode time limit")
break
if low_change_phase_start_episode != -1 and global_steps >= max_steps:
print(f"Early Stopping in Episode: {global_steps} because of global steps time limit")
break
while not all(done):
action = self.use_door_or_move(env, obs, collected_coin_piles) \
if nms.DOORS in env.state.entities.keys() else self.get_actions(obs)
_, next_obs, reward, done, info = env.step(action)
next_obs = transform_observations(env, ordered_coin_piles, target_pile, self.cfg, self.n_agents)
next_obs = transform_observations(env, ordered_coin_piles, target_pile, self.train_cfg, self.n_agents)
# Handle case where agent is on field with coin
reward, done = self.handle_coin(env, collected_coin_piles, ordered_coin_piles, target_pile, indices,
reward, done)
reward, done, self.train_cfg)
if n_steps != 0 and (global_steps + 1) % n_steps == 0: done = True
@ -122,50 +165,67 @@ class A2C:
agent._episode[-1] = (next_obs[ag_i], action[ag_i], reward[ag_i], agent._episode[-1][-1])
# Visualize state update
if self.cfg[nms.ENV][nms.TRAIN_RENDER]: env.render()
if self.train_cfg[nms.ENV][nms.TRAIN_RENDER]: env.render()
obs = next_obs
if all(done): handle_finished_episode(obs, self.agents, self.cfg)
global_steps += 1
rew_log += sum(reward)
episode_rewards_development[-1].extend(reward)
if global_steps >= max_steps: break
if all(done):
handle_finished_episode(obs, self.agents, self.train_cfg)
break
self.reward_development.append(rew_log)
if global_steps >= max_steps: break
return_change_development.append(
sum(episode_rewards_development[-1]) - sum(episode_rewards_development[-2])
if len(episode_rewards_development) > 1 else 0.0)
episode += 1
pbar.update(global_steps - pbar.n)
pbar.close()
if self.cfg[nms.ENV][nms.SAVE_AND_LOG]:
plot_reward_development(self.reward_development, self.results_path)
create_info_maps(env, get_all_observations(env, self.cfg, self.n_agents),
if self.train_cfg[nms.ENV][nms.SAVE_AND_LOG]:
return_development = [np.sum(rewards) for rewards in episode_rewards_development]
discounted_return_development = [np.sum([reward * pow(self.gamma, i) for i, reward in enumerate(ep_rewards)]) for ep_rewards in episode_rewards_development]
plot_return_development(return_development, self.results_path)
plot_return_development(discounted_return_development, self.results_path, discounted=True)
plot_return_development_change(return_change_development, self.results_path)
create_info_maps(env, get_all_observations(env, self.train_cfg, self.n_agents),
get_coin_piles_positions(env), self.results_path, self.agents, self.act_dim, self)
metrics_data = {"episode_rewards_development": episode_rewards_development,
"return_development": return_development,
"discounted_return_development": discounted_return_development,
"return_change_development": return_change_development}
with open(f"{self.results_path}/metrics", "wb") as pickle_file:
pickle.dump(metrics_data, pickle_file)
save_agent_models(self.results_path, self.agents)
plot_action_maps(env, [self], self.results_path)
@torch.inference_mode(True)
def eval_loop(self, n_episodes):
def eval_loop(self, config_name, n_episodes):
""" Function for performing inference """
env = self.eval_factory
self.set_cfg(eval=True)
episode, results = 0, []
coin_piles_positions = get_coin_piles_positions(env)
indices = distribute_indices(env, self.cfg, self.n_agents)
if config_name == "coin_quadrant": print("Coin Piles positions", coin_piles_positions)
indices = distribute_indices(env, self.eval_cfg, self.n_agents)
target_pile = [partition[0] for partition in
indices] # list of pointers that point to the current target pile for each agent
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.DISTRIBUTED:
if self.eval_cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.DISTRIBUTED:
collected_coin_piles = [{coin_piles_positions[idx]: False for idx in indices[i]} for i in
range(self.n_agents)]
else: collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
else:
collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
collected_coin_piles_per_step = []
while episode < n_episodes:
_ = env.reset()
set_agents_spawnpoints(env, self.n_agents)
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
if self.eval_cfg[nms.ENV][nms.EVAL_RENDER]:
# Don't render auxiliary piles
if self.cfg[nms.ALGORITHM][nms.AUXILIARY_PILES]:
if self.eval_cfg[nms.ALGORITHM][nms.AUXILIARY_PILES]:
auxiliary_piles = [pile for idx, pile in enumerate(env.state.entities[nms.COIN_PILES]) if
idx % 2 == 0]
for pile in auxiliary_piles:
@ -174,19 +234,23 @@ class A2C:
env._renderer.fps = 5 # Slow down agent movement
# Reset current target pile at episode begin if all piles have to be collected in one episode
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] in [nms.ALL, nms.DISTRIBUTED, nms.SHARED]:
if self.eval_cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] in [nms.ALL, nms.DISTRIBUTED, nms.SHARED]:
target_pile = [partition[0] for partition in indices]
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.DISTRIBUTED:
if self.eval_cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.DISTRIBUTED:
collected_coin_piles = [{coin_piles_positions[idx]: False for idx in indices[i]} for i in
range(self.n_agents)]
else: collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
else:
collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.cfg, self.n_agents)
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.eval_cfg, self.n_agents)
# Supply each agent with its local observation
obs = transform_observations(env, ordered_coin_piles, target_pile, self.cfg, self.n_agents)
obs = transform_observations(env, ordered_coin_piles, target_pile, self.eval_cfg, self.n_agents)
done, rew_log, eps_rew = [False] * self.n_agents, 0, torch.zeros(self.n_agents)
collected_coin_piles_per_step.append([])
ep_steps = 0
while not all(done):
action = self.use_door_or_move(env, obs, collected_coin_piles, det=True) \
if nms.DOORS in env.state.entities.keys() else self.execute_policy(obs, env,
@ -195,20 +259,44 @@ class A2C:
# Handle case where agent is on field with coin
reward, done = self.handle_coin(env, collected_coin_piles, ordered_coin_piles, target_pile, indices,
reward, done)
reward, done, self.eval_cfg)
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.eval_cfg, self.n_agents)
# Get transformed next_obs that might have been updated because of handle_coin
next_obs = transform_observations(env, ordered_coin_piles, target_pile, self.cfg, self.n_agents)
next_obs = transform_observations(env, ordered_coin_piles, target_pile, self.eval_cfg, self.n_agents)
done = [done] * self.n_agents if isinstance(done, bool) else done
if self.cfg[nms.ENV][nms.EVAL_RENDER]: env.render()
if self.eval_cfg[nms.ENV][nms.EVAL_RENDER]: env.render()
obs = next_obs
episode += 1
# Count the overall number of cleaned coin piles in each step
collected_piles = 0
for dict in collected_coin_piles:
for value in dict.values():
if value:
collected_piles += 1
collected_coin_piles_per_step[-1].append(collected_piles)
# -------------------------------------- HELPER FUNCTIONS ------------------------------------------------- #
ep_steps += 1
episode += 1
print("Number of environment steps:", ep_steps)
if config_name == "coin_quadrant":
print("Collected coins per step:", collected_coin_piles_per_step)
else:
# For the RL agent, we encode the flags internally as coins as well.
# Also, we have to subtract the auxiliary pile in the emergence prevention mechanism case
print("Reached flags per step:", [[max(0, coin_pile - 1) for coin_pile in ele] for ele in collected_coin_piles_per_step])
if self.eval_cfg[nms.ENV][nms.SAVE_AND_LOG]:
metrics_data = {"collected_coin_piles_per_step": collected_coin_piles_per_step}
with open(f"{self.results_path}/metrics", "wb") as pickle_file:
pickle.dump(metrics_data, pickle_file)
########## Helper functions ########
def get_actions(self, observations) -> ListOrTensor:
""" Given local observations, get actions for both agents """
@ -247,14 +335,18 @@ class A2C:
a.name == nms.USE_DOOR))
# Don't include action in agent experience
else:
if det: action.append(int(agent.pi(agent_obs, det=True)[0]))
else: action.append(int(agent.step(agent_obs)))
if det:
action.append(int(agent.pi(agent_obs, det=True)[0]))
else:
action.append(int(agent.step(agent_obs)))
else:
if det: action.append(int(agent.pi(agent_obs, det=True)[0]))
else: action.append(int(agent.step(agent_obs)))
if det:
action.append(int(agent.pi(agent_obs, det=True)[0]))
else:
action.append(int(agent.step(agent_obs)))
return action
def handle_coin(self, env, collected_coin_piles, ordered_coin_piles, target_pile, indices, reward, done):
def handle_coin(self, env, collected_coin_piles, ordered_coin_piles, target_pile, indices, reward, done, cfg):
""" Check if agent moved on field with coin. If that is the case collect coin automatically """
agents_positions = get_agents_positions(env, self.n_agents)
coin_piles_positions = get_coin_piles_positions(env)
@ -269,10 +361,10 @@ class A2C:
reward[idx] += 50
collected_coin_piles[idx][pos] = True
# Set pointer to next coin pile
update_target_pile(env, idx, target_pile, indices, self.cfg)
update_target_pile(env, idx, target_pile, indices, cfg)
update_ordered_coin_piles(idx, collected_coin_piles, ordered_coin_piles, env,
self.cfg, self.n_agents)
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.SINGLE:
cfg, self.n_agents)
if cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.SINGLE:
done = True
if all(collected_coin_piles[idx].values()):
# Reset collected_coin_piles indicator
@ -285,11 +377,15 @@ class A2C:
# Indicate that renderer can hide coin pile
coin_at_position = env.state[nms.COIN_PILES].by_pos(pos)
coin_at_position[0].set_new_amount(0)
"""
coin_at_position = env.state[nms.COIN_PILES].by_pos(pos)[0]
env.state[nms.COIN_PILES].delete_env_object(coin_at_position)
"""
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] in [nms.ALL, nms.DISTRIBUTED]:
if cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] in [nms.ALL, nms.DISTRIBUTED]:
if all([all(collected_coin_piles[i].values()) for i in range(self.n_agents)]):
done = True
elif self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.SHARED:
elif cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.SHARED:
# End episode if both agents together have collected all coin piles
if all(get_all_collected_coin_piles(coin_piles_positions, collected_coin_piles, self.n_agents).values()):
done = True

View File

@ -1,755 +0,0 @@
import copy
import os
import random
import imageio # requires ffmpeg install on operating system and imageio-ffmpeg package for python
from scipy import signal
import matplotlib.pyplot as plt
import torch
from typing import Union, List, Dict
import numpy as np
from torch.distributions import Categorical
from marl_factory_grid.algorithms.marl.base_a2c import PolicyGradient, cumulate_discount
from marl_factory_grid.algorithms.marl.memory import MARLActorCriticMemory
from marl_factory_grid.algorithms.utils import add_env_props, instantiate_class
from pathlib import Path
from collections import deque
from marl_factory_grid.environment.actions import Noop
from marl_factory_grid.modules import Clean, DoorUse
from marl_factory_grid.utils.plotting.plot_single_runs import plot_action_maps
class Names:
REWARD = 'reward'
DONE = 'done'
ACTION = 'action'
OBSERVATION = 'observation'
LOGITS = 'logits'
HIDDEN_ACTOR = 'hidden_actor'
HIDDEN_CRITIC = 'hidden_critic'
AGENT = 'agent'
ENV = 'env'
ENV_NAME = 'env_name'
N_AGENTS = 'n_agents'
ALGORITHM = 'algorithm'
MAX_STEPS = 'max_steps'
N_STEPS = 'n_steps'
BUFFER_SIZE = 'buffer_size'
CRITIC = 'critic'
BATCH_SIZE = 'bnatch_size'
N_ACTIONS = 'n_actions'
TRAIN_RENDER = 'train_render'
EVAL_RENDER = 'eval_render'
nms = Names
ListOrTensor = Union[List, torch.Tensor]
class A2C:
def __init__(self, train_cfg, eval_cfg):
self.factory = add_env_props(train_cfg)
self.eval_factory = add_env_props(eval_cfg)
self.__training = True
self.train_cfg = train_cfg
self.eval_cfg = eval_cfg
self.cfg = train_cfg
self.n_agents = train_cfg[nms.AGENT][nms.N_AGENTS]
self.setup()
self.reward_development = []
self.action_probabilities = {agent_idx:[] for agent_idx in range(self.n_agents)}
def setup(self):
dirt_piles_positions = [self.factory.state.entities['DirtPiles'][pile_idx].pos for pile_idx in
range(len(self.factory.state.entities['DirtPiles']))]
if self.cfg[nms.ALGORITHM]["pile-observability"] == "all":
obs_dim = 2 + 2*len(dirt_piles_positions)
else:
obs_dim = 4
self.obs_dim = obs_dim
self.act_dim = 4
# act_dim=4, because we want the agent to only learn a routing problem
self.agents = [PolicyGradient(self.factory, agent_id=i, obs_dim=obs_dim, act_dim=self.act_dim) for i in range(self.n_agents)]
if self.cfg[nms.ENV]["save_and_log"]:
# Create results folder
runs = os.listdir("../study_out/")
run_numbers = [int(run[3:]) for run in runs if run[:3] == "run"]
next_run_number = max(run_numbers)+1 if run_numbers else 0
self.results_path = f"../study_out/run{next_run_number}"
os.mkdir(self.results_path)
# Save settings in results folder
self.save_configs()
if self.cfg[nms.ENV]["record"]:
self.recorder = imageio.get_writer(f'{self.results_path}/pygame_recording.mp4', fps=5)
def set_cfg(self, eval=False):
if eval:
self.cfg = self.eval_cfg
else:
self.cfg = self.train_cfg
@classmethod
def _as_torch(cls, x):
if isinstance(x, np.ndarray):
return torch.from_numpy(x)
elif isinstance(x, List):
return torch.tensor(x)
elif isinstance(x, (int, float)):
return torch.tensor([x])
return x
def get_actions(self, observations) -> ListOrTensor:
# Given an observation, get actions for both agents
actions = [agent.step(self._as_torch(observations[ag_i]).view(-1).to(torch.float32)) for ag_i, agent in enumerate(self.agents)]
return actions
def execute_policy(self, observations, env, cleaned_dirt_piles) -> ListOrTensor:
# Use deterministic policy for inference
actions = [agent.policy(self._as_torch(observations[ag_i]).view(-1).to(torch.float32)) for ag_i, agent in enumerate(self.agents)]
for agent_idx in range(self.n_agents):
if all(cleaned_dirt_piles[agent_idx].values()):
actions[agent_idx] = np.array(next(action_i for action_i, a in enumerate(env.state["Agent"][agent_idx].actions) if a.name == "Noop"))
return actions
def transform_observations(self, env, ordered_dirt_piles, target_pile):
""" Assumes that agent has observations -DirtPiles and -Self """
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
if self.cfg[nms.ALGORITHM]["pile-observability"] == "all":
trans_obs = [torch.zeros(2+2*len(ordered_dirt_piles[0])) for _ in range(len(agent_positions))]
else:
# Only show current target pile
trans_obs = [torch.zeros(4) for _ in range(len(agent_positions))]
for i, pos in enumerate(agent_positions):
agent_x, agent_y = pos[0], pos[1]
trans_obs[i][0] = agent_x
trans_obs[i][1] = agent_y
idx = 2
if self.cfg[nms.ALGORITHM]["pile-observability"] == "all":
for pile_pos in ordered_dirt_piles[i]:
trans_obs[i][idx] = pile_pos[0]
trans_obs[i][idx + 1] = pile_pos[1]
idx += 2
else:
trans_obs[i][2] = ordered_dirt_piles[i][target_pile[i]][0]
trans_obs[i][3] = ordered_dirt_piles[i][target_pile[i]][1]
return trans_obs
def get_all_observations(self, env):
dirt_piles_positions = [env.state.entities['DirtPiles'][pile_idx].pos for pile_idx in
range(len(env.state.entities['DirtPiles']))]
if self.cfg[nms.ALGORITHM]["pile-observability"] == "all":
obs = [torch.zeros(2 + 2 * len(dirt_piles_positions))]
observations = [[]]
# Fill in pile positions
idx = 2
for pile_pos in dirt_piles_positions:
obs[0][idx] = pile_pos[0]
obs[0][idx + 1] = pile_pos[1]
idx += 2
else:
            # Create one observation layer of the map for each dirt pile
obs = [torch.zeros(4) for _ in range(self.n_agents) for _ in dirt_piles_positions]
observations = [[] for _ in dirt_piles_positions]
for idx, pile_pos in enumerate(dirt_piles_positions):
obs[idx][2] = pile_pos[0]
obs[idx][3] = pile_pos[1]
valid_agent_positions = env.state.entities.floorlist
#observations_shape = (max(t[0] for t in valid_agent_positions) + 2, max(t[1] for t in valid_agent_positions) + 2)
for idx, pos in enumerate(valid_agent_positions):
for obs_layer in range(len(obs)):
observation = copy.deepcopy(obs[obs_layer])
observation[0] = pos[0]
observation[1] = pos[1]
observations[obs_layer].append(observation)
return observations
def get_dirt_piles_positions(self, env):
return [env.state.entities['DirtPiles'][pile_idx].pos for pile_idx in range(len(env.state.entities['DirtPiles']))]
def get_ordered_dirt_piles(self, env, cleaned_dirt_piles, target_pile):
""" Each agent can have it's individual pile order """
ordered_dirt_piles = [[] for _ in range(self.n_agents)]
dirt_pile_positions = self.get_dirt_piles_positions(env)
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
for agent_idx in range(self.n_agents):
if self.cfg[nms.ALGORITHM]["pile-order"] in ["fixed", "agents"]:
ordered_dirt_piles[agent_idx] = dirt_pile_positions
elif self.cfg[nms.ALGORITHM]["pile-order"] == "random":
                ordered_dirt_piles[agent_idx] = list(dirt_pile_positions)
                random.shuffle(ordered_dirt_piles[agent_idx])
elif self.cfg[nms.ALGORITHM]["pile-order"] == "none":
ordered_dirt_piles[agent_idx] = None
elif self.cfg[nms.ALGORITHM]["pile-order"] in ["smart", "dynamic"]:
# Calculate distances for remaining unvisited dirt piles
remaining_target_piles = [pos for pos, value in cleaned_dirt_piles[agent_idx].items() if not value]
pile_distances = {pos:0 for pos in remaining_target_piles}
agent_pos = agent_positions[agent_idx]
for pos in remaining_target_piles:
pile_distances[pos] = np.abs(agent_pos[0] - pos[0]) + np.abs(agent_pos[1] - pos[1])
if self.cfg[nms.ALGORITHM]["pile-order"] == "smart":
# Check if there is an agent in line with any of the remaining dirt piles
for pile_pos in remaining_target_piles:
for other_pos in agent_positions:
if other_pos != agent_pos:
if agent_pos[0] == other_pos[0] == pile_pos[0] or agent_pos[1] == other_pos[1] == pile_pos[1]:
# Get the line between the agent and the goal
path = self.bresenham(agent_pos[0], agent_pos[1], pile_pos[0], pile_pos[1])
# Check if the entity lies on the path between the agent and the goal
if other_pos in path:
pile_distances[pile_pos] += np.abs(agent_pos[0] - other_pos[0]) + np.abs(agent_pos[1] - other_pos[1])
sorted_pile_distances = dict(sorted(pile_distances.items(), key=lambda item: item[1]))
# Insert already visited dirt piles
ordered_dirt_piles[agent_idx] = [pos for pos in dirt_pile_positions if pos not in remaining_target_piles]
# Fill up with sorted positions
for pos in sorted_pile_distances.keys():
ordered_dirt_piles[agent_idx].append(pos)
else:
print("Not a valid pile order option.")
exit()
return ordered_dirt_piles
def bresenham(self, x0, y0, x1, y1):
"""Bresenham's line algorithm to get the coordinates of a line between two points."""
dx = np.abs(x1 - x0)
dy = np.abs(y1 - y0)
sx = 1 if x0 < x1 else -1
sy = 1 if y0 < y1 else -1
err = dx - dy
coordinates = []
while True:
coordinates.append((x0, y0))
if x0 == x1 and y0 == y1:
break
e2 = 2 * err
if e2 > -dy:
err -= dy
x0 += sx
if e2 < dx:
err += dx
y0 += sy
return coordinates
def update_ordered_dirt_piles(self, agent_idx, cleaned_dirt_piles, ordered_dirt_piles, env, target_pile):
# Only update ordered_dirt_pile for agent that reached its target pile
updated_ordered_dirt_piles = self.get_ordered_dirt_piles(env, cleaned_dirt_piles, target_pile)
for i in range(len(ordered_dirt_piles[agent_idx])):
ordered_dirt_piles[agent_idx][i] = updated_ordered_dirt_piles[agent_idx][i]
def distribute_indices(self, env):
indices = []
n_dirt_piles = len(self.get_dirt_piles_positions(env))
if n_dirt_piles == 1 or self.cfg[nms.ALGORITHM]["pile-order"] in ["fixed", "random", "none", "dynamic", "smart"]:
indices = [[0] for _ in range(self.n_agents)]
else:
base_count = n_dirt_piles // self.n_agents
remainder = n_dirt_piles % self.n_agents
start_index = 0
for i in range(self.n_agents):
# Add an extra index to the first 'remainder' objects
end_index = start_index + base_count + (1 if i < remainder else 0)
indices.append(list(range(start_index, end_index)))
start_index = end_index
# Static form: auxiliary pile, primary pile, auxiliary pile, ...
# -> Starting with index 0 even piles are auxiliary piles, odd piles are primary piles
if self.cfg[nms.ALGORITHM]["auxiliary_piles"] and "Doors" in env.state.entities.keys():
door_positions = [door.pos for door in env.state.entities["Doors"]]
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
distances = {door_pos:[] for door_pos in door_positions}
# Calculate distance of every agent to every door
for door_pos in door_positions:
for agent_pos in agent_positions:
distances[door_pos].append(np.abs(door_pos[0] - agent_pos[0]) + np.abs(door_pos[1] - agent_pos[1]))
def duplicate_indices(lst, item):
return [i for i, x in enumerate(lst) if x == item]
# Get agent indices of agents with same distance to door
affected_agents = {door_pos:{} for door_pos in door_positions}
for door_pos in distances.keys():
dist = distances[door_pos]
dist_set = set(dist)
for d in dist_set:
affected_agents[door_pos][str(d)] = duplicate_indices(dist, d)
# TODO: Make generic for multiple doors
updated_indices = []
if len(affected_agents[door_positions[0]]) == 0:
# Remove auxiliary piles for all agents
updated_indices = [[ele for ele in lst if ele % 2 != 0] for lst in indices]
else:
for distance, agent_indices in affected_agents[door_positions[0]].items():
# Pick random agent to keep auxiliary pile and remove it for all others
#selected_agent = np.random.choice(agent_indices)
selected_agent = 0
for agent_idx in agent_indices:
if agent_idx == selected_agent:
updated_indices.append(indices[agent_idx])
else:
updated_indices.append([ele for ele in indices[agent_idx] if ele % 2 != 0])
indices = updated_indices
return indices
def update_target_pile(self, env, agent_idx, target_pile, indices):
if self.cfg[nms.ALGORITHM]["pile-order"] in ["fixed", "random", "none", "dynamic", "smart"]:
if target_pile[agent_idx] + 1 < len(self.get_dirt_piles_positions(env)):
target_pile[agent_idx] += 1
else:
target_pile[agent_idx] = 0
else:
if target_pile[agent_idx] + 1 in indices[agent_idx]:
target_pile[agent_idx] += 1
def door_is_close(self, env, agent_idx):
neighbourhood = [y for x in env.state.entities.neighboring_positions(env.state["Agent"][agent_idx].pos)
for y in env.state.entities.pos_dict[x] if "Door" in y.name]
if neighbourhood:
return neighbourhood[0]
def use_door_or_move(self, env, obs, cleaned_dirt_piles, target_pile, det=False):
action = []
for agent_idx, agent in enumerate(self.agents):
agent_obs = self._as_torch((obs)[agent_idx]).view(-1).to(torch.float32)
# If agent already reached its target
if all(cleaned_dirt_piles[agent_idx].values()):
action.append(next(action_i for action_i, a in enumerate(env.state["Agent"][agent_idx].actions) if a.name == "Noop"))
if not det:
# Include agent experience entry manually
agent._episode.append((None, None, None, agent.vf(agent_obs)))
else:
if door := self.door_is_close(env, agent_idx):
if door.is_closed:
action.append(next(action_i for action_i, a in enumerate(env.state["Agent"][agent_idx].actions) if a.name == "use_door"))
# Don't include action in agent experience
else:
if det:
action.append(int(agent.pi(agent_obs, det=True)[0]))
else:
action.append(int(agent.step(agent_obs)))
else:
if det:
action.append(int(agent.pi(agent_obs, det=True)[0]))
else:
action.append(int(agent.step(agent_obs)))
return action
def reward_distance(self, env, obs, target_pile, reward):
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
# Give a negative reward for every step that keeps agent from getting closer to currently selected target pile/ closest pile
for idx, pos in enumerate(agent_positions):
last_pos = (int(obs[idx][0]), int(obs[idx][1].item()))
target_pile_pos = self.get_dirt_piles_positions(env)[target_pile[idx]]
last_distance = np.abs(target_pile_pos[0] - last_pos[0]) + np.abs(target_pile_pos[1] - last_pos[1])
new_distance = np.abs(target_pile_pos[0] - pos[0]) + np.abs(target_pile_pos[1] - pos[1])
if new_distance >= last_distance:
reward[idx] -= 0.05 # 0.05
return reward
def punish_entering_same_field(self, next_obs, passed_fields, reward):
# Give a high negative reward if agent enters same field twice
for idx in range(self.n_agents):
if (next_obs[idx][0], next_obs[idx][1]) in passed_fields[idx]:
reward[idx] += -0.1
else:
passed_fields[idx].append((next_obs[idx][0], next_obs[idx][1]))
def handle_dirt_quadrant_observation_bugs(self, obs, env):
try:
# Check that dirt position and amount are still correct
            assert np.where(obs[0][0] == 0.5)[0][0] == 1 and np.where(obs[0][0] == 0.5)[1][0] == 1
except:
print("Missing dirt pile")
# Manually place dirt on defined position
obs[0][0][1][1] = 0.5
try:
# Check that self still returns a valid agent position on the map
assert np.where(obs[0][1] == 1)[0][0] and np.where(obs[0][1] == 1)[1][0]
except:
# Place agent manually in obs object on last known position
x, y = env.state.moving_entites[0].pos[0], env.state.moving_entites[0].pos[1]
obs[0][1][x][y] = 1
print("Missing agent position")
def get_all_cleaned_dirt_piles(self, dirt_piles_positions, cleaned_dirt_piles):
meta_cleaned_dirt_piles = {pos: False for pos in dirt_piles_positions}
for agent_idx in range(self.n_agents):
for (pos, cleaned) in cleaned_dirt_piles[agent_idx].items():
if cleaned:
meta_cleaned_dirt_piles[pos] = True
return meta_cleaned_dirt_piles
def handle_dirt(self, env, cleaned_dirt_piles, ordered_dirt_piles, target_pile, indices, reward, done):
# Check if agent moved on field with dirt. If that is the case collect dirt automatically
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
dirt_piles_positions = self.get_dirt_piles_positions(env)
if any([True for pos in agent_positions if pos in dirt_piles_positions]):
# Do Noop for agent that does not collect dirt
"""action = [np.array(5), np.array(5)]
# Execute real step in environment
for idx, pos in enumerate(agent_positions):
if pos in cleaned_dirt_piles[idx].keys() and not cleaned_dirt_piles[idx][pos]:
action[idx] = np.array(4)
# Collect dirt
_, next_obs, reward, done, info = env.step(action)
cleaned_dirt_piles[idx][pos] = True
break"""
# Only simulate collecting the dirt
for idx, pos in enumerate(agent_positions):
if pos in cleaned_dirt_piles[idx].keys() and not cleaned_dirt_piles[idx][pos]:
# print(env.state.entities["Agent"][idx], pos, idx, target_pile, ordered_dirt_piles)
# If dirt piles should be cleaned in a specific order
if ordered_dirt_piles[idx]:
if pos == ordered_dirt_piles[idx][target_pile[idx]]:
reward[idx] += 50 # 1
cleaned_dirt_piles[idx][pos] = True
# Set pointer to next dirt pile
self.update_target_pile(env, idx, target_pile, indices)
self.update_ordered_dirt_piles(idx, cleaned_dirt_piles, ordered_dirt_piles, env, target_pile)
if self.cfg[nms.ALGORITHM]["pile_all_done"] == "single":
done = True
if all(cleaned_dirt_piles[idx].values()):
# Reset cleaned_dirt_piles indicator
for pos in dirt_piles_positions:
cleaned_dirt_piles[idx][pos] = False
else:
reward[idx] += 50 # 1
cleaned_dirt_piles[idx][pos] = True
if self.cfg[nms.ALGORITHM]["pile_all_done"] in ["all", "distributed"]:
if all([all(cleaned_dirt_piles[i].values()) for i in range(self.n_agents)]):
done = True
elif self.cfg[nms.ALGORITHM]["pile_all_done"] == "shared":
# End episode if both agents together have cleaned all dirt piles
if all(self.get_all_cleaned_dirt_piles(dirt_piles_positions, cleaned_dirt_piles).values()):
done = True
return reward, done
def handle_finished_episode(self, obs):
with torch.inference_mode(False):
for ag_i, agent in enumerate(self.agents):
# Get states, actions, rewards and values from rollout buffer
data = agent.finish_episode()
# Chunk episode data, such that there will be no memory failure for very long episodes
chunks = self.split_into_chunks(data)
for (s, a, R, V) in chunks:
# Calculate discounted return and advantage
G = cumulate_discount(R, self.cfg[nms.ALGORITHM]["gamma"])
if self.cfg[nms.ALGORITHM]["advantage"] == "Reinforce":
A = G
elif self.cfg[nms.ALGORITHM]["advantage"] == "Advantage-AC":
A = G - V # Actor-Critic Advantages
elif self.cfg[nms.ALGORITHM]["advantage"] == "TD-Advantage-AC":
with torch.no_grad():
A = R + self.cfg[nms.ALGORITHM]["gamma"] * np.append(V[1:], agent.vf(
self._as_torch(obs[ag_i]).view(-1).to(
torch.float32)).numpy()) - V # TD Actor-Critic Advantages
else:
print("Not a valid advantage option.")
exit()
rollout = (torch.tensor(x.copy()).to(torch.float32) for x in (s, a, G, A))
# Update policy and value net of agent with experience from rollout buffer
agent.train(*rollout)
def split_into_chunks(self, data_tuple):
result = [data_tuple]
chunk_size = self.cfg[nms.ALGORITHM]["chunk-episode"]
if chunk_size > 0:
# Get the maximum length of the lists in the tuple to handle different lengths
max_length = max(len(lst) for lst in data_tuple)
# Prepare a list to store the result
result = []
# Split each list into chunks and add them to the result
for i in range(0, max_length, chunk_size):
# Create a sublist containing the ith chunk from each list
sublist = [lst[i:i + chunk_size] for lst in data_tuple if i < len(lst)]
result.append(sublist)
return result
def set_agent_spawnpoint(self, env):
for agent_idx in range(self.n_agents):
agent_name = list(env.state.agents_conf.keys())[agent_idx]
current_pos_pointer = env.state.agents_conf[agent_name]["pos_pointer"]
            # Making the reset dependent on the number of spawnpoints and not the number of dirt piles allows
# for having multiple subsequent spawnpoints with the same target pile
if current_pos_pointer == len(env.state.agents_conf[agent_name]['positions']) - 1:
env.state.agents_conf[agent_name]["pos_pointer"] = 0
else:
env.state.agents_conf[agent_name]["pos_pointer"] += 1
@torch.no_grad()
def train_loop(self):
env = self.factory
n_steps, max_steps = [self.cfg[nms.ALGORITHM][k] for k in [nms.N_STEPS, nms.MAX_STEPS]]
global_steps, episode = 0, 0
indices = self.distribute_indices(env)
dirt_piles_positions = self.get_dirt_piles_positions(env)
used_actions = {i:0 for i in range(len(env.state.entities["Agent"][0]._actions))} # Assume both agents have the same actions
        target_pile = [partition[0] for partition in indices]  # Pointer to the current target pile of each agent (agents may point to the same pile or to different piles)
cleaned_dirt_piles = [{pos: False for pos in dirt_piles_positions} for _ in range(self.n_agents)] # Have own dictionary for each agent
while global_steps < max_steps:
print(global_steps)
            obs = env.reset()  # Note: commenting out this reset can work better, but only if a fixed spawnpoint is given
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
self.set_agent_spawnpoint(env)
ordered_dirt_piles = self.get_ordered_dirt_piles(env, cleaned_dirt_piles, target_pile)
# Reset current target pile at episode begin if all piles have to be cleaned in one episode
if self.cfg[nms.ALGORITHM]["pile_all_done"] == "all":
target_pile = [partition[0] for partition in indices]
cleaned_dirt_piles = [{pos: False for pos in dirt_piles_positions} for _ in range(self.n_agents)]
"""passed_fields = [[] for _ in range(self.n_agents)]"""
"""obs = list(obs.values())"""
obs = self.transform_observations(env, ordered_dirt_piles, target_pile)
done, rew_log = [False] * self.n_agents, 0
print("Agents spawnpoints:", [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)])
print("Agents target piles:", target_pile)
print("Agents initial observation:", obs)
print("Agents cleaned dirt piles:", cleaned_dirt_piles)
            # Add Clean and Noop actions to agent actions so that they can be executed when the agent steps onto a dirt pile
"""for i in range(self.n_agents):
self.factory.state['Agent'][i].actions.extend([Clean(), Noop()])"""
while not all(done):
# 0="North", 1="East", 2="South", 3="West", 4="Clean", 5="Noop"
action = self.use_door_or_move(env, obs, cleaned_dirt_piles, target_pile) \
if "Doors" in env.state.entities.keys() else self.get_actions(obs)
used_actions[int(action[0])] += 1
_, next_obs, reward, done, info = env.step(action)
if done:
print("DoneAtMaxStepsReached:", len(self.agents[0]._episode))
next_obs = self.transform_observations(env, ordered_dirt_piles, target_pile)
# Add small negative reward if agent has moved away from the target_pile
# reward = self.reward_distance(env, obs, target_pile, reward)
# Check and handle if agent is on field with dirt. This method can change the observation for the next step.
# If pile_all_done is "single", the episode ends if agents reached its target pile and the new episode begins
# with the updated observation. The observation that is saved to the rollout buffer, which resulted in reaching
# the target pile should not be updated before saving. Thus, the self.transform_observations call must happen
# before this method is called.
reward, done = self.handle_dirt(env, cleaned_dirt_piles, ordered_dirt_piles, target_pile, indices, reward, done)
if n_steps != 0 and (global_steps + 1) % n_steps == 0:
print("max_steps reached")
done = True
done = [done] * self.n_agents if isinstance(done, bool) else done
for ag_i, agent in enumerate(self.agents):
# For forced actions like door opening, we have to call the step function with this action, but
# since we are not allowed to exceed the dimensions range, we can't log the corresponding step info.
if action[ag_i] in range(self.act_dim):
# Add agent results into respective rollout buffers
agent._episode[-1] = (next_obs[ag_i], action[ag_i], reward[ag_i], agent._episode[-1][-1])
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
obs = next_obs
if all(done): self.handle_finished_episode(obs)
global_steps += 1
rew_log += sum(reward)
if global_steps >= max_steps:
break
print(f'reward at episode: {episode} = {rew_log}')
self.reward_development.append(rew_log)
episode += 1
self.plot_reward_development()
if self.cfg[nms.ENV]["save_and_log"]:
self.create_info_maps(env, used_actions)
self.save_agent_models()
plot_action_maps(env, [self], self.results_path)
@torch.inference_mode(True)
def eval_loop(self, n_episodes, render=False):
env = self.eval_factory
self.set_cfg(eval=True)
episode, results = 0, []
dirt_piles_positions = self.get_dirt_piles_positions(env)
indices = self.distribute_indices(env)
        target_pile = [partition[0] for partition in indices]  # Pointer to the current target pile of each agent (agents may point to the same pile or to different piles)
if self.cfg[nms.ALGORITHM]["pile_all_done"] == "distributed":
cleaned_dirt_piles = [{dirt_piles_positions[idx]: False for idx in indices[i]} for i in range(self.n_agents)]
else:
cleaned_dirt_piles = [{pos: False for pos in dirt_piles_positions} for _ in range(self.n_agents)]
while episode < n_episodes:
obs = env.reset()
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
if self.cfg[nms.ENV]["save_and_log"] and self.cfg[nms.ENV]["record"]:
env.set_recorder(self.recorder)
env.render()
env._renderer.fps = 5
self.set_agent_spawnpoint(env)
"""obs = list(obs.values())"""
# Reset current target pile at episode begin if all piles have to be cleaned in one episode
if self.cfg[nms.ALGORITHM]["pile_all_done"] in ["all", "distributed", "shared"]:
target_pile = [partition[0] for partition in indices]
if self.cfg[nms.ALGORITHM]["pile_all_done"] == "distributed":
cleaned_dirt_piles = [{dirt_piles_positions[idx]: False for idx in indices[i]} for i in range(self.n_agents)]
else:
cleaned_dirt_piles = [{pos: False for pos in dirt_piles_positions} for _ in range(self.n_agents)]
ordered_dirt_piles = self.get_ordered_dirt_piles(env, cleaned_dirt_piles, target_pile)
obs = self.transform_observations(env, ordered_dirt_piles, target_pile)
done, rew_log, eps_rew = [False] * self.n_agents, 0, torch.zeros(self.n_agents)
            # Add Clean and Noop actions to agent actions so that they can be executed when the agent steps onto a dirt pile
"""for i in range(self.n_agents):
self.factory.state['Agent'][i].actions.extend([Clean(), Noop()])"""
while not all(done):
action = self.use_door_or_move(env, obs, cleaned_dirt_piles, target_pile, det=True) \
if "Doors" in env.state.entities.keys() else self.execute_policy(obs, env, cleaned_dirt_piles) # zero exploration
_, next_obs, reward, done, info = env.step(action) # Note that this call seems to flip the lists in indices
if done:
print("DoneAtMaxStepsReached:", len(self.agents[0]._episode))
# Add small negative reward if agent has moved away from the target_pile
# reward = self.reward_distance(env, obs, target_pile, reward)
# Check and handle if agent is on field with dirt
reward, done = self.handle_dirt(env, cleaned_dirt_piles, ordered_dirt_piles, target_pile, indices, reward, done)
# Get transformed next_obs that might have been updated because of self.handle_dirt.
# For eval, where pile_all_done is "all", it's mandatory that the potential change of the target pile
# in the observation, caused by self.handle_dirt, is already considered when the next action is calculated.
next_obs = self.transform_observations(env, ordered_dirt_piles, target_pile)
done = [done] * self.n_agents if isinstance(done, bool) else done
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
env.render()
obs = next_obs
episode += 1
# Properly finalize the video file
if self.cfg[nms.ENV]["save_and_log"] and self.cfg[nms.ENV]["record"]:
self.recorder.close()
def plot_reward_development(self):
smoothed_data = np.convolve(self.reward_development, np.ones(10) / 10, mode='valid')
plt.plot(smoothed_data)
plt.ylim([-10, max(smoothed_data) + 20])
plt.title('Smoothed Reward Development')
plt.xlabel('Episode')
plt.ylabel('Reward')
if self.cfg[nms.ENV]["save_and_log"]:
plt.savefig(f"{self.results_path}/smoothed_reward_development.png")
plt.show()
def save_configs(self):
with open(f"{self.results_path}/MARL_config.txt", "w") as txt_file:
txt_file.write(str(self.cfg))
with open(f"{self.results_path}/train_env_config.txt", "w") as txt_file:
txt_file.write(str(self.factory.conf))
with open(f"{self.results_path}/eval_env_config.txt", "w") as txt_file:
txt_file.write(str(self.eval_factory.conf))
def save_agent_models(self):
for idx, agent in enumerate(self.agents):
agent_name = list(self.factory.state.agents_conf.keys())[idx]
agent.pi.save_model_parameters(self.results_path, agent_name)
agent.vf.save_model_parameters(self.results_path, agent_name)
def load_agents(self, runs_list):
for idx, run in enumerate(runs_list):
run_path = f"../study_out/{run}"
agent_name = list(self.eval_factory.state.agents_conf.keys())[idx]
self.agents[idx].pi.load_model_parameters(f"{run_path}/{agent_name}_PolicyNet_model_parameters.pth")
self.agents[idx].vf.load_model_parameters(f"{run_path}/{agent_name}_ValueNet_model_parameters.pth")
def create_info_maps(self, env, used_actions):
# Create value map
all_valid_observations = self.get_all_observations(env)
dirt_piles_positions = self.get_dirt_piles_positions(env)
with open(f"{self.results_path}/info_maps.txt", "w") as txt_file:
for obs_layer, pos in enumerate(dirt_piles_positions):
observations_shape = (
max(t[0] for t in env.state.entities.floorlist) + 2, max(t[1] for t in env.state.entities.floorlist) + 2)
value_maps = [np.zeros(observations_shape) for _ in self.agents]
likeliest_action = [np.full(observations_shape, np.NaN) for _ in self.agents]
action_probabilities = [np.zeros((observations_shape[0], observations_shape[1], self.act_dim)) for
_ in self.agents]
for obs in all_valid_observations[obs_layer]:
"""obs = self._as_torch(obs).view(-1).to(torch.float32)"""
for idx, agent in enumerate(self.agents):
"""indices = np.where(obs[1] == 1) # Get agent position on grid (1 indicates the position)
x, y = indices[0][0], indices[1][0]"""
x, y = int(obs[0]), int(obs[1])
try:
value_maps[idx][x][y] = agent.vf(obs)
probs = agent.pi.distribution(obs).probs
likeliest_action[idx][x][y] = torch.argmax(probs) # get the likeliest action at the current agent position
action_probabilities[idx][x][y] = probs
except:
pass
txt_file.write("=======Value Maps=======\n")
print("=======Value Maps=======")
for agent_idx, vmap in enumerate(value_maps):
txt_file.write(f"Value map of agent {agent_idx} for target pile {pos}:\n")
print(f"Value map of agent {agent_idx} for target pile {pos}:")
vmap = self._as_torch(vmap).round(decimals=4)
max_digits = max(len(str(vmap.max().item())), len(str(vmap.min().item())))
for idx, row in enumerate(vmap):
txt_file.write(' '.join(f" {elem:>{max_digits + 1}}" for elem in row.tolist()))
txt_file.write("\n")
print(' '.join(f" {elem:>{max_digits + 1}}" for elem in row.tolist()))
txt_file.write("\n")
txt_file.write("=======Likeliest Action=======\n")
print("=======Likeliest Action=======")
for agent_idx, amap in enumerate(likeliest_action):
txt_file.write(f"Likeliest action map of agent {agent_idx} for target pile {pos}:\n")
print(f"Likeliest action map of agent {agent_idx} for target pile {pos}:")
txt_file.write(np.array2string(amap))
print(amap)
txt_file.write("\n")
txt_file.write("=======Action Probabilities=======\n")
print("=======Action Probabilities=======")
for agent_idx, pmap in enumerate(action_probabilities):
self.action_probabilities[agent_idx].append(pmap)
txt_file.write(f"Action probability map of agent {agent_idx} for target pile {pos}:\n")
print(f"Action probability map of agent {agent_idx} for target pile {pos}:")
for d in range(pmap.shape[0]):
row = '['
for r in range(pmap.shape[1]):
row += "[" + ', '.join(f"{x:7.4f}" for x in pmap[d, r]) + "]"
txt_file.write(row + "]")
txt_file.write("\n")
print(row + "]")
txt_file.write(f"Used actions: {used_actions}\n")
print("Used actions:", used_actions)

View File

@ -2,8 +2,6 @@ import numpy as np; import torch as th; import scipy as sp;
from collections import deque
from torch import nn
# RLLab Magic for calculating the discounted return G(t) = R(t) + gamma * R(t-1)
# cf. https://github.com/rll/rllab/blob/ba78e4c16dc492982e648f117875b22af3965579/rllab/misc/special.py#L107
cumulate_discount = lambda x, gamma: sp.signal.lfilter([1], [1, - gamma], x[::-1], axis=0)[::-1]
class Net(th.nn.Module):
@ -21,11 +19,11 @@ class Net(th.nn.Module):
if module.bias is not None:
nn.init.uniform_(module.bias, a=-0.1, b=0.1)
def save_model(self, path, agent_name):
th.save(self.net, f"{path}/{agent_name}_{self.__class__.__name__}_model.pth")
def save_model(self, path):
th.save(self.net, f"{path}/{self.__class__.__name__}_model.pth")
def save_model_parameters(self, path, agent_name):
th.save(self.net.state_dict(), f"{path}/{agent_name}_{self.__class__.__name__}_model_parameters.pth")
def save_model_parameters(self, path):
th.save(self.net.state_dict(), f"{path}/{self.__class__.__name__}_model_parameters.pth")
def load_model_parameters(self, path):
self.net.load_state_dict(th.load(path))
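A self-contained sketch of the state-dict round trip behind `save_model_parameters` and `load_model_parameters`; the file path and layer sizes are made up:

```python
import torch as th
from torch import nn

# Hypothetical tiny net, only to illustrate the save/load round trip used above.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4))
th.save(net.state_dict(), "/tmp/PolicyNet_model_parameters.pth")
net.load_state_dict(th.load("/tmp/PolicyNet_model_parameters.pth"))
net.eval()  # switch to inference mode after loading
```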

View File

@ -1,3 +1,4 @@
class Names:
ENV = 'env'
ENV_NAME = 'env_name'
@ -35,3 +36,8 @@ class Names:
SINGLE = 'single'
DISTRIBUTED = 'distributed'
SHARED = 'shared'
EARLY_STOPPING = 'early_stopping'
TRAIN = 'train'
SEED = 'seed'
LAST_N_EPISODES = 'last_n_episodes'
MEAN_TARGET_CHANGE = 'mean_target_change'
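A short sketch of how the new constants are used to index a parsed config; the import mirrors the one used elsewhere in this commit, and the `cfg` dict is a made-up stand-in for a training config:

```python
from marl_factory_grid.algorithms.marl.constants import Names as nms

cfg = {  # made-up stand-in for a parsed training config
    "algorithm": {"early_stopping": True, "last_n_episodes": 100, "mean_target_change": 2.0},
}
if cfg[nms.ALGORITHM][nms.EARLY_STOPPING]:
    window = cfg[nms.ALGORITHM][nms.LAST_N_EPISODES]
    tolerance = cfg[nms.ALGORITHM][nms.MEAN_TARGET_CHANGE]
```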

View File

@ -0,0 +1,12 @@
env:
classname: marl_factory_grid.configs.marl.multi_agent_configs
env_name: "marl/multi_agent_configs/coin_quadrant_eval_config"
n_agents: 2 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
pile-order: "smart" # Triggers implementation of our emergence prevention mechanism. Agents consider distance to other agent
pile-observability: "single" # Agents can only perceive one coin pile at any given time step
pile_all_done: "shared" # Indicates that agents don't have to collect the same coin piles
auxiliary_piles: False # Coin quadrant does not use this option
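One way such an eval config might be read, sketched with PyYAML; the local file name is an assumption:

```python
import yaml

# Hypothetical local copy of the eval config shown above.
with open("coin_quadrant_eval_config.yaml") as f:
    eval_cfg = yaml.safe_load(f)

assert eval_cfg["algorithm"]["pile-order"] == "smart"
assert eval_cfg["env"]["n_agents"] == 2
```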

View File

@ -0,0 +1,13 @@
# Configuration that shows emergent behavior in our coin-quadrant environment
env:
classname: marl_factory_grid.configs.marl.multi_agent_configs
env_name: "marl/multi_agent_configs/coin_quadrant_eval_config"
n_agents: 2 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
pile-order: "dynamic" # Agents only decide on next target pile based on the distance to the respective piles
pile-observability: "single" # Agents can only perceive one coin pile at any given time step
pile_all_done: "shared" # Indicates that agents don't have to collect the same coin piles
auxiliary_piles: False # Coin quadrant does not use this option

View File

@ -0,0 +1,16 @@
env:
classname: marl_factory_grid.configs.marl.multi_agent_configs
env_name: "marl/multi_agent_configs/two_rooms_eval_config"
n_agents: 2 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
# Piles (=encoded flags) are evenly distributed among the two agents and have to be collected in the order defined
# by the environment config (cf. coords_or_quantity)
pile-order: "agents"
pile-observability: "single" # Agents can only perceive one dirt pile at any given time step
pile_all_done: "distributed" # Indicates that agents must clean their specifically assigned dirt piles
auxiliary_piles: True # Allows agents to go to an auxiliary pile
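A tiny sketch of the even/odd convention behind `auxiliary_piles` (auxiliary piles sit at even indices, primary piles at odd indices, matching the filtering in `distribute_indices` further below); the index list is made up:

```python
# Static layout: auxiliary, primary, auxiliary, primary, ...
assigned = [0, 1, 2, 3]                              # made-up pile indices for one agent
primary_only = [i for i in assigned if i % 2 != 0]   # drop auxiliary piles -> [1, 3]
```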

View File

@ -0,0 +1,17 @@
# Configuration that shows emergent behavior in our two-rooms environment
env:
  classname: marl_factory_grid.configs.marl.multi_agent_configs
env_name: "marl/multi_agent_configs/two_rooms_eval_config_emergent"
n_agents: 2 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
# Piles (=encoded flags) are evenly distributed among the two agents and have to be collected in the order defined
# by the environment config (cf. coords_or_quantity)
pile-order: "agents"
pile-observability: "single" # Agents can only perceive one dirt pile at any given time step
pile_all_done: "distributed" # Indicates that agents must clean their specifically assigned dirt piles
auxiliary_piles: False # Shows emergent behavior

View File

@ -0,0 +1,13 @@
env:
classname: marl_factory_grid.configs.marl.single_agent_configs
env_name: "marl/single_agent_configs/coin_quadrant_agent1_eval_config"
n_agents: 1 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
pile-order: "fixed" # Clean coin piles in a fixed order specified by the environment config (cf. coords_or_quantity)
pile-observability: "single" # Agent can only perceive one coin pile at any given time step
pile_all_done: "all" # During inference the episode ends only when all coin piles are cleaned
auxiliary_piles: False # Coin quadrant does not use this option

View File

@ -0,0 +1,21 @@
env:
classname: marl_factory_grid.configs.marl.single_agent_configs
env_name: "marl/single_agent_configs/coin_quadrant_agent1_train_config"
n_agents: 1 # Number of agents in the environment
train_render: False # If training should be graphically visualized
save_and_log: True # If configurations and potential logging files should be saved
algorithm:
seed: 9 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
gamma: 0.99 # The gamma value that is used as discounting factor
n_steps: 0 # How much experience should be sampled at most until the next value- and policy-net updates are performed. (0 = Monte Carlo)
chunk-episode: 20000 # For update, splits very large episodes in batches of approximately equal size. (0 = update networks with full episode at once)
max_steps: 400000 # Number of training steps used for agent1 (=agent2)
early_stopping: True # If the early stopping functionality should be used
  last_n_episodes: 100 # To determine whether a low-change phase has begun, the mean change over the last n episodes is compared against mean_target_change
  mean_target_change: 2.0 # Accepted fluctuation of the return for declaring that a low-change phase has begun
advantage: "Advantage-AC" # Defines the used actor critic model
pile-order: "fixed" # Clean coin piles in a fixed order specified by the environment config (cf. coords_or_quantity)
pile-observability: "single" # Agent can only perceive one coin pile at any given time step
pile_all_done: "single" # Episode ends when the current target pile is cleaned
auxiliary_piles: False # Coin quadrant does not use this option
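A small sketch of what `chunk-episode` does during the network update (cf. `split_into_chunks` in the training code): an episode longer than the chunk size is split into slices of at most that length; the numbers are illustrative:

```python
chunk_size = 3                      # stands in for chunk-episode (20000 above)
rewards = list(range(8))            # stands in for one very long episode
chunks = [rewards[i:i + chunk_size] for i in range(0, len(rewards), chunk_size)]
assert chunks == [[0, 1, 2], [3, 4, 5], [6, 7]]
```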

View File

@ -0,0 +1,14 @@
env:
classname: marl_factory_grid.configs.marl.single_agent_configs
env_name: "marl/single_agent_configs/two_rooms_agent2_eval_config"
n_agents: 1 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
pile-order: "fixed" # Clean coin piles (=encoded flags) in a fixed order specified by the environment config (cf. coords_or_quantity)
pile-observability: "single" # Agent can only perceive one coin pile at any given time step
pile_all_done: "all" # During inference the episode ends only when all coin piles are cleaned
auxiliary_piles: False # Auxiliary piles are only differentiated from regular target piles during marl eval

View File

@ -0,0 +1,22 @@
env:
classname: marl_factory_grid.configs.marl.single_agent_configs
env_name: "marl/single_agent_configs/two_rooms_agent2_train_config"
n_agents: 1 # Number of agents in the environment
train_render: False # If training should be graphically visualized
save_and_log: True # If configurations and potential logging files should be saved
algorithm:
seed: 9 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
gamma: 0.99 # The gamma value that is used as discounting factor
n_steps: 0 # How much experience should be sampled at most until the next value- and policy-net updates are performed. (0 = Monte Carlo)
chunk-episode: 20000 # For update, splits very large episodes in batches of approximately equal size. (0 = update networks with full episode at once)
max_steps: 300000 # Number of training steps used to train the agent. Here, only a placeholder value
early_stopping: True # If the early stopping functionality should be used
  last_n_episodes: 100 # To determine whether a low-change phase has begun, the mean change over the last n episodes is compared against mean_target_change
  mean_target_change: 2.0 # Accepted fluctuation of the return for declaring that a low-change phase has begun
advantage: "Advantage-AC" # Defines the used actor critic model
pile-order: "fixed" # Clean coin piles (=encoded flags) in a fixed order specified by the environment config (cf. coords_or_quantity)
pile-observability: "single" # Agent can only perceive one coin pile at any given time step
pile_all_done: "single" # Episode ends when the current target pile is cleaned
auxiliary_piles: False # Auxiliary piles are only differentiated from regular target piles during marl eval

View File

@ -1,11 +1,14 @@
import copy
import os
from pathlib import Path
from typing import List
import numpy as np
import pandas as pd
import torch
from marl_factory_grid.algorithms.rl.constants import Names as nms
from marl_factory_grid.algorithms.marl.constants import Names as nms
from marl_factory_grid.algorithms.rl.base_a2c import cumulate_discount
from marl_factory_grid.algorithms.marl.base_a2c import cumulate_discount
def _as_torch(x):
@ -187,7 +190,7 @@ def distribute_indices(env, cfg, n_agents):
# -> Starting with index 0 even piles are auxiliary piles, odd piles are primary piles
if cfg[nms.ALGORITHM][nms.AUXILIARY_PILES] and nms.DOORS in env.state.entities.keys():
door_positions = [door.pos for door in env.state.entities[nms.DOORS]]
distances = {door_pos: [] for door_pos in door_positions}
distances = {door_pos:[] for door_pos in door_positions}
# Calculate distance of every agent to every door
for door_pos in door_positions:
@ -198,7 +201,7 @@ def distribute_indices(env, cfg, n_agents):
return [i for i, x in enumerate(lst) if x == item]
# Get agent indices of agents with same distance to door
affected_agents = {door_pos: {} for door_pos in door_positions}
affected_agents = {door_pos:{} for door_pos in door_positions}
for door_pos in distances.keys():
dist = distances[door_pos]
dist_set = set(dist)
@ -206,22 +209,20 @@ def distribute_indices(env, cfg, n_agents):
affected_agents[door_pos][str(d)] = duplicate_indices(dist, d)
updated_indices = []
for door_pos, agent_distances in affected_agents.items():
if len(agent_distances) == 0:
# Remove auxiliary piles for all agents
# (In config, we defined every pile with an even numbered index to be an auxiliary pile)
updated_indices = [[ele for ele in lst if ele % 2 != 0] for lst in indices]
else:
for distance, agent_indices in agent_distances.items():
# For each distance group, pick one random agent to keep the auxiliary pile
# selected_agent = np.random.choice(agent_indices)
selected_agent = 0
for agent_idx in agent_indices:
if agent_idx == selected_agent:
updated_indices.append(indices[agent_idx])
else:
updated_indices.append([ele for ele in indices[agent_idx] if ele % 2 != 0])
if len(affected_agents[door_positions[0]]) == 0:
# Remove auxiliary piles for all agents
# (In config, we defined every pile with an even numbered index to be an auxiliary pile)
updated_indices = [[ele for ele in lst if ele % 2 != 0] for lst in indices]
else:
for distance, agent_indices in affected_agents[door_positions[0]].items():
# Pick random agent to keep auxiliary pile and remove it for all others
#selected_agent = np.random.choice(agent_indices)
selected_agent = 0
for agent_idx in agent_indices:
if agent_idx == selected_agent:
updated_indices.append(indices[agent_idx])
else:
updated_indices.append([ele for ele in indices[agent_idx] if ele % 2 != 0])
indices = updated_indices
@ -335,3 +336,42 @@ def save_agent_models(results_path, agents):
for idx, agent in enumerate(agents):
agent.pi.save_model_parameters(results_path)
agent.vf.save_model_parameters(results_path)
def has_low_change_phase_started(return_change_development, last_n_episodes, mean_target_change):
""" Checks if training has reached a phase with only marginal average change """
if np.mean(np.abs(return_change_development[-last_n_episodes:])) < mean_target_change:
print("Low change phase started.")
return True
return False
def significant_deviation(return_change_development, low_change_phase_start_episode):
""" Determines if a significant return deviation has occurred in the last episode """
return_change_development = return_change_development[low_change_phase_start_episode:]
df = pd.DataFrame({'Episode': range(len(return_change_development)), 'DeltaReturn': return_change_development})
df['Difference'] = df['DeltaReturn'].diff().abs()
# Only the most extreme changes (those that are greater than 99.99% of all changes) will be considered significant
threshold = df['Difference'].quantile(0.9999)
# Identify significant changes
significant_changes = df[df['Difference'] > threshold]
print("Threshold: ", threshold, "Significant changes: ", significant_changes)
if len(significant_changes["Episode"]) > 0:
return True
return False
def get_algorithms_marl_path():
return Path(Path(__file__).parent)
def get_configs_marl_path():
return Path(os.path.join(Path(__file__).parent.parent.parent, "configs"))
def get_agent_models_path():
return Path(os.path.join(Path(__file__).parent.parent, "agent_models"))

View File

@ -1 +0,0 @@
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory

View File

@ -1,112 +0,0 @@
import numpy as np
import torch as th
import scipy as sp
from collections import deque
from torch import nn
cumulate_discount = lambda x, gamma: sp.signal.lfilter([1], [1, - gamma], x[::-1], axis=0)[::-1]
class Net(th.nn.Module):
def __init__(self, shape, activation, lr):
super().__init__()
self.net = th.nn.Sequential(*[layer
for io, a in zip(zip(shape[:-1], shape[1:]),
[activation] * (len(shape) - 2) + [th.nn.Identity])
for layer in [th.nn.Linear(*io), a()]])
self.optimizer = th.optim.Adam(self.net.parameters(), lr=lr)
# Initialize weights uniformly, so that for the policy net all actions have approximately the same
# probability in the beginning
for module in self.modules():
if isinstance(module, nn.Linear):
nn.init.uniform_(module.weight, a=-0.1, b=0.1)
if module.bias is not None:
nn.init.uniform_(module.bias, a=-0.1, b=0.1)
def save_model(self, path):
th.save(self.net, f"{path}/{self.__class__.__name__}_model.pth")
def save_model_parameters(self, path):
th.save(self.net.state_dict(), f"{path}/{self.__class__.__name__}_model_parameters.pth")
def load_model_parameters(self, path):
self.net.load_state_dict(th.load(path))
self.net.eval()
class ValueNet(Net):
def __init__(self, obs_dim, hidden_sizes=[64, 64], activation=th.nn.ReLU, lr=1e-3):
super().__init__([obs_dim] + hidden_sizes + [1], activation, lr)
def forward(self, obs): return self.net(obs)
def loss(self, states, returns): return ((returns - self(states)) ** 2).mean()
class PolicyNet(Net):
def __init__(self, obs_dim, act_dim, hidden_sizes=[64, 64], activation=th.nn.Tanh, lr=3e-4):
super().__init__([obs_dim] + hidden_sizes + [act_dim], activation, lr)
self.distribution = lambda obs: th.distributions.Categorical(logits=self.net(obs))
def forward(self, obs, act=None, det=False):
"""Given an observation: Returns policy distribution and probablilty for a given action
or Returns a sampled action and its corresponding probablilty"""
pi = self.distribution(obs)
if act is not None: return pi, pi.log_prob(act)
act = self.net(obs).argmax() if det else pi.sample() # sample from the learned distribution
return act, pi.log_prob(act)
def loss(self, states, actions, advantages):
_, logp = self.forward(states, actions)
loss = -(logp * advantages).mean()
return loss
class PolicyGradient:
""" Autonomous agent using vanilla policy gradient. """
def __init__(self, env, seed=42, gamma=0.99, agent_id=0, act_dim=None, obs_dim=None):
self.env = env
self.gamma = gamma # Setup env and discount
th.manual_seed(seed)
np.random.seed(seed) # Seed Torch, numpy and gym
        # Keep track of previous rewards and performed steps to calculate the mean Return metric
self._episode, self.ep_returns, self.num_steps = [], deque(maxlen=100), 0
# Get observation and action shapes
if not obs_dim:
obs_size = env.observation_space.shape if len(env.state.entities.by_name("Agents")) == 1 \
else env.observation_space[agent_id].shape # Single agent case vs. multi-agent case
obs_dim = np.prod(obs_size)
if not act_dim:
act_dim = env.action_space[agent_id].n
self.vf = ValueNet(obs_dim) # Setup Value Network (Critic)
self.pi = PolicyNet(obs_dim, act_dim) # Setup Policy Network (Actor)
def step(self, obs):
""" Given an observation, get action and probs from policy and values from critic"""
with th.no_grad():
(a, _), v = self.pi(obs), self.vf(obs)
self._episode.append((None, None, None, v))
return a.numpy()
def policy(self, obs, det=True):
return self.pi(obs, det=det)[0].numpy()
def finish_episode(self):
"""Process self._episode & reset self.env, Returns (s,a,G,V)-Tuple and new inital state"""
s, a, r, v = (np.array(e) for e in zip(*self._episode)) # Get trajectories from rollout
self.ep_returns.append(sum(r))
self._episode = [] # Add episode return to buffer & reset
return s, a, r, v # state, action, Return, Value Tensors
def train(self, states, actions, returns, advantages): # Update policy weights
self.pi.optimizer.zero_grad()
self.vf.optimizer.zero_grad() # Reset optimizer
states = states.flatten(1, -1) # Reduce dimensionality to rollout_dim x input_dim
policy_loss = self.pi.loss(states, actions, advantages) # Calculate Policy loss
policy_loss.backward()
self.pi.optimizer.step() # Apply Policy loss
value_loss = self.vf.loss(states, returns) # Calculate Value loss
value_loss.backward()
self.vf.optimizer.step() # Apply Value loss
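A minimal interaction sketch for the `PolicyGradient` agent above; the import path mirrors the one used elsewhere in this commit and is an assumption, the observation is made up, and passing the dimensions explicitly avoids touching the environment:

```python
import torch as th
from marl_factory_grid.algorithms.marl.base_a2c import PolicyGradient

agent = PolicyGradient(env=None, obs_dim=4, act_dim=4)   # dims given explicitly, so env is never queried
obs = th.zeros(4)                                        # made-up observation
sampled = agent.step(obs)      # stochastic action during training (also records V(obs))
greedy = agent.policy(obs)     # deterministic action during evaluation
```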

View File

@ -1,242 +0,0 @@
import torch
from typing import Union, List, Dict
import numpy as np
from torch.distributions import Categorical
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory
from marl_factory_grid.algorithms.utils import add_env_props, instantiate_class
from pathlib import Path
import pandas as pd
from collections import deque
class Names:
REWARD = 'reward'
DONE = 'done'
ACTION = 'action'
OBSERVATION = 'observation'
LOGITS = 'logits'
HIDDEN_ACTOR = 'hidden_actor'
HIDDEN_CRITIC = 'hidden_critic'
AGENT = 'agent'
ENV = 'env'
ENV_NAME = 'env_name'
N_AGENTS = 'n_agents'
ALGORITHM = 'algorithm'
MAX_STEPS = 'max_steps'
N_STEPS = 'n_steps'
BUFFER_SIZE = 'buffer_size'
CRITIC = 'critic'
    BATCH_SIZE = 'batch_size'
N_ACTIONS = 'n_actions'
TRAIN_RENDER = 'train_render'
EVAL_RENDER = 'eval_render'
nms = Names
ListOrTensor = Union[List, torch.Tensor]
class BaseActorCritic:
def __init__(self, cfg):
self.factory = add_env_props(cfg)
self.__training = True
self.cfg = cfg
self.n_agents = cfg[nms.AGENT][nms.N_AGENTS]
self.reset_memory_after_epoch = True
self.setup()
def setup(self):
self.net = instantiate_class(self.cfg[nms.AGENT])
self.optimizer = torch.optim.RMSprop(self.net.parameters(), lr=3e-4, eps=1e-5)
@classmethod
def _as_torch(cls, x):
if isinstance(x, np.ndarray):
return torch.from_numpy(x)
elif isinstance(x, List):
return torch.tensor(x)
elif isinstance(x, (int, float)):
return torch.tensor([x])
return x
def train(self):
self.__training = False
networks = [self.net] if not isinstance(self.net, List) else self.net
for net in networks:
net.train()
def eval(self):
self.__training = False
networks = [self.net] if not isinstance(self.net, List) else self.net
for net in networks:
net.eval()
def load_state_dict(self, path: Path):
pass
def get_actions(self, out) -> ListOrTensor:
actions = [Categorical(logits=logits).sample().item() for logits in out[nms.LOGITS]]
return actions
def init_hidden(self) -> Dict[str, ListOrTensor]:
pass
def forward(self,
observations: ListOrTensor,
actions: ListOrTensor,
hidden_actor: ListOrTensor,
hidden_critic: ListOrTensor
) -> Dict[str, ListOrTensor]:
pass
@torch.no_grad()
def train_loop(self, checkpointer=None):
env = self.factory
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
n_steps, max_steps = [self.cfg[nms.ALGORITHM][k] for k in [nms.N_STEPS, nms.MAX_STEPS]]
tm = MARLActorCriticMemory(self.n_agents, self.cfg[nms.ALGORITHM].get(nms.BUFFER_SIZE, n_steps))
global_steps, episode, df_results = 0, 0, []
reward_queue = deque(maxlen=2000)
while global_steps < max_steps:
obs = env.reset()
obs = list(obs.values())
last_hiddens = self.init_hidden()
last_action, reward = [-1] * self.n_agents, [0.] * self.n_agents
done, rew_log = [False] * self.n_agents, 0
if self.reset_memory_after_epoch:
tm.reset()
tm.add(observation=obs, action=last_action,
logits=torch.zeros(self.n_agents, 1, self.cfg[nms.AGENT][nms.N_ACTIONS]),
values=torch.zeros(self.n_agents, 1), reward=reward, done=done, **last_hiddens)
while not all(done):
out = self.forward(obs, last_action, **last_hiddens)
action = self.get_actions(out)
_, next_obs, reward, done, info = env.step(action)
done = [done] * self.n_agents if isinstance(done, bool) else done
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
last_hiddens = dict(hidden_actor=out[nms.HIDDEN_ACTOR],
hidden_critic=out[nms.HIDDEN_CRITIC])
logits = torch.stack([tensor.squeeze(0) for tensor in out.get(nms.LOGITS, None)], dim=0)
values = torch.stack([tensor.squeeze(0) for tensor in out.get(nms.CRITIC, None)], dim=0)
tm.add(observation=obs, action=action, reward=reward, done=done,
logits=logits, values=values,
**last_hiddens)
obs = next_obs
last_action = action
if (global_steps+1) % n_steps == 0 or all(done):
with torch.inference_mode(False):
self.learn(tm)
global_steps += 1
rew_log += sum(reward)
reward_queue.extend(reward)
if checkpointer is not None:
checkpointer.step([
(f'agent#{i}', agent)
for i, agent in enumerate([self.net] if not isinstance(self.net, List) else self.net)
])
if global_steps >= max_steps:
break
if global_steps%100 == 0:
print(f'reward at episode: {episode} = {rew_log}')
episode += 1
df_results.append([episode, rew_log, *reward])
df_results = pd.DataFrame(df_results,
columns=['steps', 'reward', *[f'agent#{i}' for i in range(self.n_agents)]]
)
if checkpointer is not None:
df_results.to_csv(checkpointer.path / 'results.csv', index=False)
return df_results
@torch.inference_mode(True)
def eval_loop(self, n_episodes, render=False):
env = self.factory
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
env.render()
episode, results = 0, []
while episode < n_episodes:
obs = env.reset()
obs = list(obs.values())
last_hiddens = self.init_hidden()
last_action, reward = [-1] * self.n_agents, [0.] * self.n_agents
done, rew_log, eps_rew = [False] * self.n_agents, 0, torch.zeros(self.n_agents)
while not all(done):
out = self.forward(obs, last_action, **last_hiddens)
action = self.get_actions(out)
_, next_obs, reward, done, info = env.step(action)
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
env.render()
if isinstance(done, bool):
done = [done] * obs[0].shape[0]
obs = next_obs
last_action = action
last_hiddens = dict(hidden_actor=out.get(nms.HIDDEN_ACTOR, None),
hidden_critic=out.get(nms.HIDDEN_CRITIC, None)
)
eps_rew += torch.tensor(reward)
results.append(eps_rew.tolist() + [sum(eps_rew).item()] + [episode])
episode += 1
agent_columns = [f'agent#{i}' for i in range(self.cfg[nms.ENV][nms.N_AGENTS])]
results = pd.DataFrame(results, columns=agent_columns + ['sum', 'episode'])
results = pd.melt(results, id_vars=['episode'], value_vars=agent_columns + ['sum'],
value_name='reward', var_name='agent')
return results
@staticmethod
def compute_advantages(critic, reward, done, gamma, gae_coef=0.0):
tds = (reward + gamma * (1.0 - done) * critic[:, 1:].detach()) - critic[:, :-1]
if gae_coef <= 0:
return tds
gae = torch.zeros_like(tds[:, -1])
gaes = []
for t in range(tds.shape[1]-1, -1, -1):
gae = tds[:, t] + gamma * gae_coef * (1.0 - done[:, t]) * gae
gaes.insert(0, gae)
gaes = torch.stack(gaes, dim=1)
return gaes
def actor_critic(self, tm, network, gamma, entropy_coef, vf_coef, gae_coef=0.0, **kwargs):
obs, actions, done, reward = tm.observation, tm.action, tm.done[:, 1:], tm.reward[:, 1:]
out = network(obs, actions, tm.hidden_actor[:, 0].squeeze(0), tm.hidden_critic[:, 0].squeeze(0))
logits = out[nms.LOGITS][:, :-1] # last one only needed for v_{t+1}
critic = out[nms.CRITIC]
entropy_loss = Categorical(logits=logits).entropy().mean(-1)
advantages = self.compute_advantages(critic, reward, done, gamma, gae_coef)
value_loss = advantages.pow(2).mean(-1) # n_agent
# policy loss
log_ap = torch.log_softmax(logits, -1)
log_ap = torch.gather(log_ap, dim=-1, index=actions[:, 1:].unsqueeze(-1)).squeeze()
a2c_loss = -(advantages.detach() * log_ap).mean(-1)
# weighted loss
loss = a2c_loss + vf_coef*value_loss - entropy_coef * entropy_loss
return loss.mean()
def learn(self, tm: MARLActorCriticMemory, **kwargs):
loss = self.actor_critic(tm, self.net, **self.cfg[nms.ALGORITHM], **kwargs)
# remove next_obs, will be added in next iter
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.net.parameters(), 0.5)
self.optimizer.step()
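A small numeric sketch of the `compute_advantages` logic above with GAE enabled; shapes follow the (n_agents, t) layout used there and the numbers are made up:

```python
import torch

critic = torch.tensor([[1.0, 2.0, 3.0]])   # V(s_0), V(s_1), V(s_2) for a single agent
reward = torch.tensor([[0.0, 1.0]])        # rewards of the two transitions
done   = torch.tensor([[0.0, 1.0]])        # the episode terminates after the second step
gamma, gae_coef = 0.99, 0.95

tds = (reward + gamma * (1.0 - done) * critic[:, 1:]) - critic[:, :-1]  # one-step TD errors
gae, gaes = torch.zeros_like(tds[:, -1]), []
for t in range(tds.shape[1] - 1, -1, -1):
    gae = tds[:, t] + gamma * gae_coef * (1.0 - done[:, t]) * gae
    gaes.insert(0, gae)
advantages = torch.stack(gaes, dim=1)      # same shape and values compute_advantages would return
```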

View File

@ -1,34 +0,0 @@
agent:
classname: marl_factory_grid.algorithms.rl.networks.RecurrentAC
n_agents: 2
obs_emb_size: 96
action_emb_size: 16
hidden_size_actor: 64
hidden_size_critic: 64
use_agent_embedding: False
env:
classname: marl_factory_grid.configs.custom
env_name: "custom/MultiAgentConfigs/dirt_quadrant_train_config"
n_agents: 2
max_steps: 250
pomdp_r: 2
stack_n_frames: 0
individual_rewards: True
train_render: False
eval_render: True
save_and_log: True
record: False
method: marl_factory_grid.algorithms.rl.LoopSEAC
algorithm:
gamma: 0.99
entropy_coef: 0.01
vf_coef: 0.05
n_steps: 0 # How much experience should be sampled at most (n-TD) until the next value and policy update is performed. Default 0: MC
max_steps: 200000
advantage: "Advantage-AC" # Options: "Advantage-AC", "TD-Advantage-AC", "Reinforce"
pile-order: "dynamic" # Use "dynamic" to see emergent phenomenon and "smart" to prevent it
pile-observability: "single" # Options: "single", "all"
pile_all_done: "shared" # Options: "single", "all" ("single" for training, "all" for eval), "shared"
auxiliary_piles: False # Option that is only considered when pile-order = "agents"
chunk-episode: 20000 # Chunk size. (0 = update networks with full episode at once)

View File

@ -1,35 +0,0 @@
agent:
classname: marl_factory_grid.algorithms.rl.networks.RecurrentAC
n_agents: 2
obs_emb_size: 96
action_emb_size: 16
hidden_size_actor: 64
hidden_size_critic: 64
use_agent_embedding: False
env:
classname: marl_factory_grid.configs.custom
env_name: "custom/two_rooms_one_door_modified_train_config"
n_agents: 2
max_steps: 250
pomdp_r: 2
stack_n_frames: 0
individual_rewards: True
train_render: False
eval_render: True
save_and_log: True
record: False
method: marl_factory_grid.algorithms.rl.LoopSEAC
algorithm:
gamma: 0.99
entropy_coef: 0.01
vf_coef: 0.05
n_steps: 0 # How much experience should be sampled at most (n-TD) until the next value and policy update is performed. Default 0: MC
max_steps: 260000
advantage: "Advantage-AC" # Options: "Advantage-AC", "TD-Advantage-AC", "Reinforce"
pile-order: "agents" # Options: "fixed", "random", "none", "agents", "dynamic", "smart" (Use "fixed", "random" and "none" for single agent training and the other for multi agent inference)
pile-observability: "single" # Options: "single", "all"
pile_all_done: "distributed" # Options: "single", "all" ("single" for training, "all" and "distributed" for eval)
auxiliary_piles: True # Use True to see emergent phenomenon and False to prevent it
chunk-episode: 20000 # Chunk size. (0 = update networks with full episode at once)

View File

@ -1,34 +0,0 @@
agent:
classname: marl_factory_grid.algorithms.rl.networks.RecurrentAC
n_agents: 1
obs_emb_size: 96
action_emb_size: 16
hidden_size_actor: 64
hidden_size_critic: 64
use_agent_embedding: False
env:
classname: marl_factory_grid.configs.custom
env_name: "custom/dirt_quadrant_train_config"
n_agents: 1
max_steps: 250
pomdp_r: 2
stack_n_frames: 0
individual_rewards: True
train_render: False
eval_render: True
save_and_log: True
record: False
method: marl_factory_grid.algorithms.rl.LoopSEAC
algorithm:
gamma: 0.99
entropy_coef: 0.01
vf_coef: 0.05
n_steps: 0 # How much experience should be sampled at most (n-TD) until the next value and policy update is performed. Default 0: MC
max_steps: 240000
advantage: "Advantage-AC" # Options: "Advantage-AC", "TD-Advantage-AC", "Reinforce"
pile-order: "fixed" # Options: "fixed", "random", "none", "agents", "dynamic", "smart" (Use "fixed", "random" and "none" for single agent training and the other for multi agent inference)
pile-observability: "single" # Options: "single", "all"
pile_all_done: "single" # Options: "single", "all" ("single" for training, "all" for eval)
auxiliary_piles: False # Option that is only considered when pile-order = "agents"
chunk-episode: 20000 # Chunk size. (0 = update networks with full episode at once)

View File

@ -1,8 +0,0 @@
marl_factory_grid>environment>rules.py#SpawnEntity.on_reset()
marl_factory_grid>environment>rewards.py
marl_factory_grid>modules>clean_up>groups.py#DirtPiles.trigger_spawn()
marl_factory_grid>environment>rules.py#AgentSpawnRule
marl_factory_grid>utils>states.py#GameState.__init__()
marl_factory_grid>environment>factory.py>Factory#render
marl_factory_grid>environment>factory.py>Factory#set_recorder
marl_factory_grid>utils>renderer.py>Renderer#render

View File

@ -1,35 +0,0 @@
agent:
classname: marl_factory_grid.algorithms.rl.networks.RecurrentAC
n_agents: 1
obs_emb_size: 96
action_emb_size: 16
hidden_size_actor: 64
hidden_size_critic: 64
use_agent_embedding: False
env:
classname: marl_factory_grid.configs.custom
env_name: "custom/two_rooms_one_door_modified_train_config"
n_agents: 1
max_steps: 250
pomdp_r: 2
stack_n_frames: 0
individual_rewards: True
train_render: False
eval_render: True
save_and_log: False
record: False
method: marl_factory_grid.algorithms.rl.LoopSEAC
algorithm:
gamma: 0.99
entropy_coef: 0.01
vf_coef: 0.05
n_steps: 0 # How much experience should be sampled at most (n-TD) until the next value and policy update is performed. Default 0: MC
max_steps: 260000
advantage: "Advantage-AC" # Options: "Advantage-AC", "TD-Advantage-AC", "Reinforce"
pile-order: "fixed" # Options: "fixed", "random", "none", "agents", "dynamic", "smart" (Use "fixed", "random" and "none" for single agent training and the other for multi agent inference)
pile-observability: "single" # Options: "single", "all"
pile_all_done: "single" # Options: "single", "all" ("single" for training, "all" for eval)
auxiliary_piles: False # Option that is only considered when pile-order = "agents"
chunk-episode: 20000 # Chunk size. (0 = update networks with full episode at once)

View File

@ -1,57 +0,0 @@
import torch
from marl_factory_grid.algorithms.rl.base_ac import BaseActorCritic, nms
from marl_factory_grid.algorithms.utils import instantiate_class
from pathlib import Path
from natsort import natsorted
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory
class LoopIAC(BaseActorCritic):
def __init__(self, cfg):
super(LoopIAC, self).__init__(cfg)
def setup(self):
self.net = [
instantiate_class(self.cfg[nms.AGENT]) for _ in range(self.n_agents)
]
self.optimizer = [
torch.optim.RMSprop(self.net[ag_i].parameters(), lr=3e-4, eps=1e-5) for ag_i in range(self.n_agents)
]
def load_state_dict(self, path: Path):
paths = natsorted(list(path.glob('*.pt')))
for path, net in zip(paths, self.net):
net.load_state_dict(torch.load(path))
@staticmethod
def merge_dicts(ds): # todo could be recursive for more than 1 hierarchy
d = {}
for k in ds[0].keys():
d[k] = [d[k] for d in ds]
return d
def init_hidden(self):
ha = [net.init_hidden_actor() for net in self.net]
hc = [net.init_hidden_critic() for net in self.net]
return dict(hidden_actor=ha, hidden_critic=hc)
def forward(self, observations, actions, hidden_actor, hidden_critic):
outputs = [
net(
self._as_torch(observations[ag_i]).unsqueeze(0).unsqueeze(0), # agent x time
self._as_torch(actions[ag_i]).unsqueeze(0),
hidden_actor[ag_i],
hidden_critic[ag_i]
) for ag_i, net in enumerate(self.net)
]
return self.merge_dicts(outputs)
def learn(self, tms: MARLActorCriticMemory, **kwargs):
for ag_i in range(self.n_agents):
tm, net = tms(ag_i), self.net[ag_i]
loss = self.actor_critic(tm, net, **self.cfg[nms.ALGORITHM], **kwargs)
self.optimizer[ag_i].zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(net.parameters(), 0.5)
self.optimizer[ag_i].step()
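The `merge_dicts` helper above only merges a single level, as its TODO notes. A hedged sketch of a recursive variant, assuming every per-agent dict has the same keys and nests only dicts and tensor/array leaves:

```python
def merge_dicts_recursive(ds):
    # ds: list of per-agent dicts with identical structure
    merged = {}
    for key in ds[0].keys():
        values = [d[key] for d in ds]
        if isinstance(values[0], dict):
            merged[key] = merge_dicts_recursive(values)  # recurse into nested dicts
        else:
            merged[key] = values  # collect per-agent leaves into a list
    return merged
```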

View File

@ -1,66 +0,0 @@
from marl_factory_grid.algorithms.rl.base_ac import Names as nms
from marl_factory_grid.algorithms.rl.snac import LoopSNAC
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory
import torch
from torch.distributions import Categorical
from marl_factory_grid.algorithms.utils import instantiate_class
class LoopMAPPO(LoopSNAC):
def __init__(self, *args, **kwargs):
super(LoopMAPPO, self).__init__(*args, **kwargs)
self.reset_memory_after_epoch = False
def setup(self):
self.net = instantiate_class(self.cfg[nms.AGENT])
self.optimizer = torch.optim.Adam(self.net.parameters(), lr=3e-4, eps=1e-5)
def learn(self, tm: MARLActorCriticMemory, **kwargs):
if len(tm) >= self.cfg['algorithm']['buffer_size']:
# only learn when buffer is full
for batch_i in range(self.cfg['algorithm']['n_updates']):
batch = tm.chunk_dataloader(chunk_len=self.cfg['algorithm']['n_steps'],
k=self.cfg['algorithm']['batch_size'])
loss = self.mappo(batch, self.net, **self.cfg[nms.ALGORITHM], **kwargs)
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.net.parameters(), 0.5)
self.optimizer.step()
def monte_carlo_returns(self, rewards, done, gamma):
rewards_ = []
discounted_reward = torch.zeros_like(rewards[:, -1])
for t in range(rewards.shape[1]-1, -1, -1):
discounted_reward = rewards[:, t] + (gamma * (1.0 - done[:, t]) * discounted_reward)
rewards_.insert(0, discounted_reward)
rewards_ = torch.stack(rewards_, dim=1)
return rewards_
def mappo(self, batch, network, gamma, entropy_coef, vf_coef, clip_range, **__):
out = network(batch[nms.OBSERVATION], batch[nms.ACTION], batch[nms.HIDDEN_ACTOR], batch[nms.HIDDEN_CRITIC])
logits = out[nms.LOGITS][:, :-1] # last one only needed for v_{t+1}
old_log_probs = torch.log_softmax(batch[nms.LOGITS], -1)
old_log_probs = torch.gather(old_log_probs, index=batch[nms.ACTION][:, 1:].unsqueeze(-1), dim=-1).squeeze()
# monte carlo returns
mc_returns = self.monte_carlo_returns(batch[nms.REWARD], batch[nms.DONE], gamma)
mc_returns = (mc_returns - mc_returns.mean()) / (mc_returns.std() + 1e-8) # todo: norm across agent ok?
advantages = mc_returns - out[nms.CRITIC][:, :-1]
# policy loss
log_ap = torch.log_softmax(logits, -1)
log_ap = torch.gather(log_ap, dim=-1, index=batch[nms.ACTION][:, 1:].unsqueeze(-1)).squeeze()
ratio = (log_ap - old_log_probs).exp()
surr1 = ratio * advantages.detach()
surr2 = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages.detach()
policy_loss = -torch.min(surr1, surr2).mean(-1)
# entropy & value loss
entropy_loss = Categorical(logits=logits).entropy().mean(-1)
value_loss = advantages.pow(2).mean(-1) # n_agent
# weighted loss
loss = policy_loss + vf_coef*value_loss - entropy_coef * entropy_loss
return loss.mean()
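As a quick, illustrative sanity check, `monte_carlo_returns` implements the familiar discounted-return recursion R_t = r_t + gamma * (1 - done_t) * R_{t+1}; the toy values below are made up:

```python
import torch

def monte_carlo_returns(rewards, done, gamma):
    # Same recursion as in LoopMAPPO above, reproduced for a standalone check.
    returns = []
    discounted = torch.zeros_like(rewards[:, -1])
    for t in range(rewards.shape[1] - 1, -1, -1):
        discounted = rewards[:, t] + gamma * (1.0 - done[:, t]) * discounted
        returns.insert(0, discounted)
    return torch.stack(returns, dim=1)

rewards = torch.tensor([[1.0, 0.0, 2.0]])
done = torch.tensor([[0.0, 0.0, 1.0]])
# Expected: [1 + 0.9 * (0 + 0.9 * 2), 0 + 0.9 * 2, 2] = [2.62, 1.80, 2.00]
print(monte_carlo_returns(rewards, done, gamma=0.9))
```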

View File

@ -1,221 +0,0 @@
import numpy as np
from collections import deque
import torch
from typing import Union
from torch import Tensor
from torch.utils.data import Dataset, ConcatDataset
import random
class ActorCriticMemory(object):
def __init__(self, capacity=10):
self.capacity = capacity
self.reset()
def reset(self):
self.__actions = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__hidden_actor = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__hidden_critic = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__states = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__rewards = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__dones = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__logits = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__values = LazyTensorFiFoQueue(maxlen=self.capacity+1)
def __len__(self):
return len(self.__rewards) - 1
@property
def observation(self, sls=slice(0, None)): # add time dimension through stacking
return self.__states[sls].unsqueeze(0) # 1 x time x hidden dim
@property
def hidden_actor(self, sls=slice(0, None)): # 1 x n_layers x dim
return self.__hidden_actor[sls].unsqueeze(0) # 1 x time x n_layers x dim
@property
def hidden_critic(self, sls=slice(0, None)): # 1 x n_layers x dim
return self.__hidden_critic[sls].unsqueeze(0) # 1 x time x n_layers x dim
@property
def reward(self, sls=slice(0, None)):
return self.__rewards[sls].squeeze().unsqueeze(0) # 1 x time
@property
def action(self, sls=slice(0, None)):
return self.__actions[sls].long().squeeze().unsqueeze(0) # 1 x time
@property
def done(self, sls=slice(0, None)):
return self.__dones[sls].float().squeeze().unsqueeze(0) # 1 x time
@property
def logits(self, sls=slice(0, None)): # assumes a trailing 1 for time dimension - common when using output from NN
return self.__logits[sls].squeeze().unsqueeze(0) # 1 x time x actions
@property
def values(self, sls=slice(0, None)):
return self.__values[sls].squeeze().unsqueeze(0) # 1 x time x actions
def add_observation(self, state: Union[Tensor, np.ndarray]):
self.__states.append(state if isinstance(state, Tensor) else torch.from_numpy(state))
def add_hidden_actor(self, hidden: Tensor):
# layers x hidden dim
self.__hidden_actor.append(hidden)
def add_hidden_critic(self, hidden: Tensor):
# layers x hidden dim
self.__hidden_critic.append(hidden)
def add_action(self, action: Union[int, Tensor]):
if not isinstance(action, Tensor):
action = torch.tensor(action)
self.__actions.append(action)
def add_reward(self, reward: Union[float, Tensor]):
if not isinstance(reward, Tensor):
reward = torch.tensor(reward)
self.__rewards.append(reward)
def add_done(self, done: bool):
if not isinstance(done, Tensor):
done = torch.tensor(done)
self.__dones.append(done)
def add_logits(self, logits: Tensor):
self.__logits.append(logits)
def add_values(self, values: Tensor):
self.__values.append(values)
def add(self, **kwargs):
for k, v in kwargs.items():
func = getattr(ActorCriticMemory, f'add_{k}')
func(self, v)
class MARLActorCriticMemory(object):
def __init__(self, n_agents, capacity):
self.n_agents = n_agents
self.memories = [
ActorCriticMemory(capacity) for _ in range(n_agents)
]
def __call__(self, agent_i):
return self.memories[agent_i]
def __len__(self):
return len(self.memories[0]) # todo add assertion check!
def reset(self):
for mem in self.memories:
mem.reset()
def add(self, **kwargs):
for agent_i in range(self.n_agents):
for k, v in kwargs.items():
func = getattr(ActorCriticMemory, f'add_{k}')
func(self.memories[agent_i], v[agent_i])
def __getattr__(self, attr):
all_attrs = [getattr(mem, attr) for mem in self.memories]
return torch.cat(all_attrs, 0) # agent x time ...
def chunk_dataloader(self, chunk_len, k):
datasets = [ExperienceChunks(mem, chunk_len, k) for mem in self.memories]
dataset = ConcatDataset(datasets)
data = [dataset[i] for i in range(len(dataset))]
data = custom_collate_fn(data)
return data
def custom_collate_fn(batch):
elem = batch[0]
return {key: torch.cat([d[key] for d in batch], dim=0) for key in elem}
class ExperienceChunks(Dataset):
def __init__(self, memory, chunk_len, k):
assert chunk_len <= len(memory), 'chunk_len cannot be longer than the size of the memory'
self.memory = memory
self.chunk_len = chunk_len
self.k = k
@property
def whitelist(self):
whitelist = torch.ones(len(self.memory) - self.chunk_len)
for d in self.memory.done.squeeze().nonzero().flatten():
whitelist[max((0, d-self.chunk_len-1)):d+2] = 0
whitelist[0] = 0
return whitelist.tolist()
def sample(self, start=1):
cl = self.chunk_len
sample = dict(observation=self.memory.observation[:, start:start+cl+1],
action=self.memory.action[:, start-1:start+cl],
hidden_actor=self.memory.hidden_actor[:, start-1],
hidden_critic=self.memory.hidden_critic[:, start-1],
reward=self.memory.reward[:, start:start + cl],
done=self.memory.done[:, start:start + cl],
logits=self.memory.logits[:, start:start + cl],
values=self.memory.values[:, start:start + cl])
return sample
def __len__(self):
return self.k
def __getitem__(self, i):
idx = random.choices(range(0, len(self.memory) - self.chunk_len), weights=self.whitelist, k=1)
return self.sample(idx[0])
class LazyTensorFiFoQueue:
def __init__(self, maxlen):
self.maxlen = maxlen
self.reset()
def reset(self):
self.__lazy_queue = deque(maxlen=self.maxlen)
self.shape = None
self.queue = None
def shape_init(self, tensor: Tensor):
self.shape = torch.Size([self.maxlen, *tensor.shape])
def build_tensor_queue(self):
if len(self.__lazy_queue) > 0:
block = torch.stack(list(self.__lazy_queue), dim=0)
l = block.shape[0]
if self.queue is None:
self.queue = block
elif self.true_len() <= self.maxlen:
self.queue = torch.cat((self.queue, block), dim=0)
else:
self.queue = torch.cat((self.queue[l:], block), dim=0)
self.__lazy_queue.clear()
def append(self, data):
if self.shape is None:
self.shape_init(data)
self.__lazy_queue.append(data)
if len(self.__lazy_queue) >= self.maxlen:
self.build_tensor_queue()
def true_len(self):
return len(self.__lazy_queue) + (0 if self.queue is None else self.queue.shape[0])
def __len__(self):
return min((self.true_len(), self.maxlen))
def __str__(self):
return f'LazyTensorFiFoQueue\tmaxlen: {self.maxlen}, shape: {self.shape}, ' \
f'len: {len(self)}, true_len: {self.true_len()}, elements in lazy queue: {len(self.__lazy_queue)}'
def __getitem__(self, item_or_slice):
self.build_tensor_queue()
return self.queue[item_or_slice]
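A minimal usage sketch of the memory classes above; the tensor shapes are illustrative, and the import path mirrors the file shown here, which may have been relocated in this commit:

```python
import torch

# Import path is an assumption based on the file above; adjust if the module has moved.
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory

n_agents, capacity = 2, 10
tm = MARLActorCriticMemory(n_agents, capacity)

for _ in range(5):
    tm.add(observation=torch.rand(n_agents, 3, 5, 5),       # per-agent observation
           action=torch.zeros(n_agents, dtype=torch.long),  # per-agent action index
           reward=torch.zeros(n_agents),
           done=torch.zeros(n_agents),
           logits=torch.zeros(n_agents, 4),
           values=torch.zeros(n_agents))

# Attribute access concatenates the per-agent memories along dim 0 (agent x time x ...).
print(tm.observation.shape, tm.reward.shape)
print(len(tm))  # number of stored transitions (= steps added - 1)
```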

View File

@ -1,103 +0,0 @@
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
class RecurrentAC(nn.Module):
def __init__(self, observation_size, n_actions, obs_emb_size,
action_emb_size, hidden_size_actor, hidden_size_critic,
n_agents, use_agent_embedding=True):
super(RecurrentAC, self).__init__()
observation_size = np.prod(observation_size)
self.n_layers = 1
self.n_actions = n_actions
self.use_agent_embedding = use_agent_embedding
self.hidden_size_actor = hidden_size_actor
self.hidden_size_critic = hidden_size_critic
self.action_emb_size = action_emb_size
self.obs_proj = nn.Linear(observation_size, obs_emb_size)
self.action_emb = nn.Embedding(n_actions+1, action_emb_size, padding_idx=0)
self.agent_emb = nn.Embedding(n_agents, action_emb_size)
mix_in_size = obs_emb_size+action_emb_size if not use_agent_embedding else obs_emb_size+n_agents*action_emb_size
self.mix = nn.Sequential(nn.Tanh(),
nn.Linear(mix_in_size, obs_emb_size),
nn.Tanh(),
nn.Linear(obs_emb_size, obs_emb_size)
)
self.gru_actor = nn.GRU(obs_emb_size, hidden_size_actor, batch_first=True, num_layers=self.n_layers)
self.gru_critic = nn.GRU(obs_emb_size, hidden_size_critic, batch_first=True, num_layers=self.n_layers)
self.action_head = nn.Sequential(
nn.Linear(hidden_size_actor, hidden_size_actor),
nn.Tanh(),
nn.Linear(hidden_size_actor, n_actions)
)
# spectral_norm(nn.Linear(hidden_size_actor, hidden_size_actor)),
self.critic_head = nn.Sequential(
nn.Linear(hidden_size_critic, hidden_size_critic),
nn.Tanh(),
nn.Linear(hidden_size_critic, 1)
)
#self.action_head[-1].weight.data.uniform_(-3e-3, 3e-3)
#self.action_head[-1].bias.data.uniform_(-3e-3, 3e-3)
def init_hidden_actor(self):
return torch.zeros(1, self.n_layers, self.hidden_size_actor)
def init_hidden_critic(self):
return torch.zeros(1, self.n_layers, self.hidden_size_critic)
def forward(self, observations, actions, hidden_actor=None, hidden_critic=None):
n_agents, t, *_ = observations.shape
obs_emb = self.obs_proj(observations.view(n_agents, t, -1).float())
action_emb = self.action_emb(actions+1) # shift by one due to padding idx
if not self.use_agent_embedding:
x_t = torch.cat((obs_emb, action_emb), -1)
else:
agent_emb = self.agent_emb(
torch.cat([torch.arange(0, n_agents, 1).view(-1, 1)] * t, 1)
)
x_t = torch.cat((obs_emb, agent_emb, action_emb), -1)
mixed_x_t = self.mix(x_t)
output_p, _ = self.gru_actor(input=mixed_x_t, hx=hidden_actor.swapaxes(1, 0))
output_c, _ = self.gru_critic(input=mixed_x_t, hx=hidden_critic.swapaxes(1, 0))
logits = self.action_head(output_p)
critic = self.critic_head(output_c).squeeze(-1)
return dict(logits=logits, critic=critic, hidden_actor=output_p, hidden_critic=output_c)
class RecurrentACL2(RecurrentAC):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.action_head = nn.Sequential(
nn.Linear(self.hidden_size_actor, self.hidden_size_actor),
nn.Tanh(),
NormalizedLinear(self.hidden_size_actor, self.n_actions, trainable_magnitude=True)
)
class NormalizedLinear(nn.Linear):
def __init__(self, in_features: int, out_features: int,
device=None, dtype=None, trainable_magnitude=False):
super(NormalizedLinear, self).__init__(in_features, out_features, False, device, dtype)
self.d_sqrt = in_features**0.5
self.trainable_magnitude = trainable_magnitude
self.scale = nn.Parameter(torch.tensor([1.]), requires_grad=trainable_magnitude)
def forward(self, in_array):
normalized_input = F.normalize(in_array, dim=-1, p=2, eps=1e-5)
normalized_weight = F.normalize(self.weight, dim=-1, p=2, eps=1e-5)
return F.linear(normalized_input, normalized_weight) * self.d_sqrt * self.scale
class L2Norm(nn.Module):
def __init__(self, in_features, trainable_magnitude=False):
super(L2Norm, self).__init__()
self.d_sqrt = in_features**0.5
self.scale = nn.Parameter(torch.tensor([1.]), requires_grad=trainable_magnitude)
def forward(self, x):
return F.normalize(x, dim=-1, p=2, eps=1e-5) * self.d_sqrt * self.scale
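A hedged shape check for the `RecurrentAC` network above, using dummy sizes; the hidden states follow the `1 x n_layers x dim` layout returned by `init_hidden_*` and are stacked per agent, as the training loops do:

```python
import torch

# Assumes the RecurrentAC class defined above is in scope; all sizes are illustrative.
n_agents, t, obs_shape, n_actions = 2, 5, (3, 5, 5), 4
net = RecurrentAC(observation_size=obs_shape, n_actions=n_actions,
                  obs_emb_size=96, action_emb_size=16,
                  hidden_size_actor=64, hidden_size_critic=64,
                  n_agents=n_agents, use_agent_embedding=False)

obs = torch.rand(n_agents, t, *obs_shape)
actions = torch.zeros(n_agents, t, dtype=torch.long)
hidden_actor = torch.cat([net.init_hidden_actor()] * n_agents, 0)    # n_agents x n_layers x 64
hidden_critic = torch.cat([net.init_hidden_critic()] * n_agents, 0)

out = net(obs, actions, hidden_actor, hidden_critic)
print(out['logits'].shape)  # n_agents x t x n_actions
print(out['critic'].shape)  # n_agents x t
```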

View File

@ -1,55 +0,0 @@
import torch
from torch.distributions import Categorical
from marl_factory_grid.algorithms.rl.iac import LoopIAC
from marl_factory_grid.algorithms.rl.base_ac import nms
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory
class LoopSEAC(LoopIAC):
def __init__(self, cfg):
super(LoopSEAC, self).__init__(cfg)
def actor_critic(self, tm, networks, gamma, entropy_coef, vf_coef, gae_coef=0.0, **kwargs):
obs, actions, done, reward = tm.observation, tm.action, tm.done[:, 1:], tm.reward[:, 1:]
outputs = [net(obs, actions, tm.hidden_actor[:, 0], tm.hidden_critic[:, 0]) for net in networks]
with torch.inference_mode(True):
true_action_logp = torch.stack([
torch.log_softmax(out[nms.LOGITS][ag_i, :-1], -1)
.gather(index=actions[ag_i, 1:, None], dim=-1)
for ag_i, out in enumerate(outputs)
], 0).squeeze()
losses = []
for ag_i, out in enumerate(outputs):
logits = out[nms.LOGITS][:, :-1] # last one only needed for v_{t+1}
critic = out[nms.CRITIC]
entropy_loss = Categorical(logits=logits[ag_i]).entropy().mean()
advantages = self.compute_advantages(critic, reward, done, gamma, gae_coef)
# policy loss
log_ap = torch.log_softmax(logits, -1)
log_ap = torch.gather(log_ap, dim=-1, index=actions[:, 1:].unsqueeze(-1)).squeeze()
# importance weights
iw = (log_ap - true_action_logp).exp().detach() # importance_weights
a2c_loss = (-iw*log_ap * advantages.detach()).mean(-1)
value_loss = (iw*advantages.pow(2)).mean(-1) # n_agent
# weighted loss
loss = (a2c_loss + vf_coef*value_loss - entropy_coef * entropy_loss).mean()
losses.append(loss)
return losses
def learn(self, tms: MARLActorCriticMemory, **kwargs):
losses = self.actor_critic(tms, self.net, **self.cfg[nms.ALGORITHM], **kwargs)
for ag_i, loss in enumerate(losses):
self.optimizer[ag_i].zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.net[ag_i].parameters(), 0.5)
self.optimizer[ag_i].step()
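The importance weight `iw` above is the SEAC-style off-policy correction: each agent also learns from the other agents' transitions, re-weighted by the ratio between its own action probability and that of the policy that generated the data. A toy illustration with made-up probabilities:

```python
import torch

log_ap = torch.log(torch.tensor([0.6, 0.2]))             # learner's log-prob of the taken actions
true_action_logp = torch.log(torch.tensor([0.3, 0.4]))   # behaviour policy's log-prob

iw = (log_ap - true_action_logp).exp()  # = p_learner / p_behaviour
print(iw)  # tensor([2.0000, 0.5000])
```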

View File

@ -1,33 +0,0 @@
from marl_factory_grid.algorithms.rl.base_ac import BaseActorCritic
from marl_factory_grid.algorithms.rl.base_ac import nms
import torch
from torch.distributions import Categorical
from pathlib import Path
class LoopSNAC(BaseActorCritic):
def __init__(self, cfg):
super().__init__(cfg)
def load_state_dict(self, path: Path):
path2weights = list(path.glob('*.pt'))
assert len(path2weights) == 1, f'Expected a single set of weights but got {len(path2weights)}'
self.net.load_state_dict(torch.load(path2weights[0]))
def init_hidden(self):
hidden_actor = self.net.init_hidden_actor()
hidden_critic = self.net.init_hidden_critic()
return dict(hidden_actor=torch.cat([hidden_actor] * self.n_agents, 0),
hidden_critic=torch.cat([hidden_critic] * self.n_agents, 0)
)
def get_actions(self, out):
actions = Categorical(logits=out[nms.LOGITS]).sample().squeeze()
return actions
def forward(self, observations, actions, hidden_actor, hidden_critic):
out = self.net(self._as_torch(observations).unsqueeze(1),
self._as_torch(actions).unsqueeze(1),
hidden_actor, hidden_critic
)
return out

View File

@ -33,6 +33,7 @@ class TSPBaseAgent(ABC):
self.local_optimization = True
self._env = state
self.state = self._env.state[c.AGENT][agent_i]
self.spawn_position = np.array(self.state.pos)
self._position_graph = self.generate_pos_graph()
self._static_route = None
self.cached_route = None
@ -79,7 +80,7 @@ class TSPBaseAgent(ABC):
start_time = time.time()
if self.cached_route is not None:
print(f" Used cached route: {self.cached_route}")
#print(f" Used cached route: {self.cached_route}")
return copy.deepcopy(self.cached_route)
else:
@ -89,7 +90,7 @@ class TSPBaseAgent(ABC):
[self.state.pos] + \
[x for x in positions if max(abs(np.subtract(x, self.state.pos))) < 3]
try:
while len(nodes) < 7:
while len(nodes) < 13:
nodes += [next(x for x in positions if x not in nodes)]
except StopIteration:
nodes = [self.state.pos] + positions
@ -100,11 +101,11 @@ class TSPBaseAgent(ABC):
route = tsp.traveling_salesman_problem(self._position_graph,
nodes=nodes, cycle=True, method=tsp.greedy_tsp)
self.cached_route = copy.deepcopy(route)
print(f"Cached route: {self.cached_route}")
#print(f"Cached route: {self.cached_route}")
end_time = time.time()
duration = end_time - start_time
print("TSP calculation took {:.2f} seconds to execute".format(duration))
#print("TSP calculation took {:.2f} seconds to execute".format(duration))
return route
def _door_is_close(self, state):
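For context, the cached route above comes from networkx's approximate TSP solver. A standalone sketch of the same call on a small grid graph (graph size and node selection are illustrative stand-ins for the level's actual position graph):

```python
import networkx as nx
from networkx.algorithms.approximation import traveling_salesman as tsp

# 4-connected grid as a stand-in for the level's position graph; unweighted edges count as 1.
graph = nx.grid_2d_graph(5, 5)
nodes = [(0, 0), (0, 3), (2, 2), (4, 1), (4, 4)]

route = tsp.traveling_salesman_problem(graph, nodes=nodes, cycle=True, method=tsp.greedy_tsp)
print(route)  # closed tour visiting all requested nodes
```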

View File

@ -0,0 +1,96 @@
import os
import pickle
from pathlib import Path
from tqdm import trange
from marl_factory_grid import Factory
from marl_factory_grid.algorithms.static.contortions import get_coin_quadrant_tsp_agents, get_two_rooms_tsp_agents
def coin_quadrant_multi_agent_tsp_eval(emergent_phenomenon):
run_tsp_setting("coin_quadrant", emergent_phenomenon, log=False)
def two_rooms_multi_agent_tsp_eval(emergent_phenomenon):
run_tsp_setting("two_rooms", emergent_phenomenon, log=False)
def run_tsp_setting(config_name, emergent_phenomenon, n_episodes=1, log=False):
# Render at each step?
render = True
# Path to config File
path = Path(f'./marl_factory_grid/configs/tsp/{config_name}.yaml')
# Create results folder
runs = os.listdir("./study_out/")
run_numbers = [int(run[7:]) for run in runs if run[:7] == "tsp_run"]
next_run_number = max(run_numbers) + 1 if run_numbers else 0
results_path = f"./study_out/tsp_run{next_run_number}"
os.mkdir(results_path)
# Env Init
factory = Factory(path)
with open(f"{results_path}/env_config.txt", "w") as txt_file:
txt_file.write(str(factory.conf))
still_existing_coin_piles = []
reached_flags = []
for episode in trange(n_episodes):
_ = factory.reset()
still_existing_coin_piles.append([])
reached_flags.append([])
done = False
if render:
factory.render()
factory._renderer.fps = 5
if config_name == "coin_quadrant":
agents = get_coin_quadrant_tsp_agents(emergent_phenomenon, factory)
elif config_name == "two_rooms":
agents = get_two_rooms_tsp_agents(emergent_phenomenon, factory)
else:
print("Config name does not exist. Abort...")
break
ep_steps = 0
while not done:
a = [x.predict() for x in agents]
# Terminate as soon as all coin piles have been collected. This keeps the termination criterion
# of the TSP agents consistent with that of the RL agents.
if 'CoinPiles' in list(factory.state.entities.keys()) and factory.state.entities['CoinPiles'].global_amount == 0.0:
break
obs_type, _, _, done, info = factory.step(a)
if 'CoinPiles' in list(factory.state.entities.keys()):
still_existing_coin_piles[-1].append(len(factory.state.entities['CoinPiles']))
if 'Destinations' in list(factory.state.entities.keys()):
reached_flags[-1].append(sum([1 for ele in [x.was_reached() for x in factory.state['Destinations']] if ele]))
ep_steps += 1
if render:
factory.render()
if done:
break
collected_coin_piles_per_step = []
if 'CoinPiles' in list(factory.state.entities.keys()):
for ep in still_existing_coin_piles:
collected_coin_piles_per_step.append([max(ep)-ep[idx] for idx, value in enumerate(ep)])
# Drop the first entry and append a final entry in which all coin piles have been collected
del collected_coin_piles_per_step[-1][0]
collected_coin_piles_per_step[-1].append(max(still_existing_coin_piles[-1]))
# Add last entry to reached_flags
print("Number of environment steps:", ep_steps)
if 'CoinPiles' in list(factory.state.entities.keys()):
print("Collected coins per step:", collected_coin_piles_per_step)
if 'Destinations' in list(factory.state.entities.keys()):
print("Reached flags per step:", reached_flags)
if log:
if 'CoinPiles' in list(factory.state.entities.keys()):
metrics_data = {"collected_coin_piles_per_step": collected_coin_piles_per_step}
if 'Destinations' in list(factory.state.entities.keys()):
metrics_data = {"reached_flags": reached_flags}
with open(f"{results_path}/metrics", "wb") as pickle_file:
pickle.dump(metrics_data, pickle_file)
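Assuming the module above is importable (the path below is a guess based on the package layout, not confirmed by the diff), the two entry points can be called directly to either provoke or mitigate the emergent behaviour:

```python
# Hypothetical import path; adjust to wherever this evaluation module lives in the package.
from marl_factory_grid.algorithms.static.eval_tsp import (coin_quadrant_multi_agent_tsp_eval,
                                                          two_rooms_multi_agent_tsp_eval)

coin_quadrant_multi_agent_tsp_eval(emergent_phenomenon=True)   # provoke the emergent phenomenon
two_rooms_multi_agent_tsp_eval(emergent_phenomenon=False)      # run the mitigated variant
```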

View File

@ -0,0 +1,55 @@
import numpy as np
from marl_factory_grid.algorithms.static.TSP_coin_agent import TSPCoinAgent
from marl_factory_grid.algorithms.static.TSP_target_agent import TSPTargetAgent
def get_coin_quadrant_tsp_agents(emergent_phenomenon, factory):
agents = [TSPCoinAgent(factory, 0), TSPCoinAgent(factory, 1)]
if not emergent_phenomenon:
edge_costs = {}
# Add costs for horizontal edges
for i in range(1, 10):
for j in range(1, 9):
# Add costs for both traversal directions
edge_costs[f"{(i, j)}-{i, j + 1}"] = 0.55 + (i - 1) * 0.05
edge_costs[f"{i, j + 1}-{(i, j)}"] = 0.55 + (i - 1) * 0.05
# Add costs for vertical edges
for i in range(1, 9):
for j in range(1, 10):
# Add costs for both traversal directions
edge_costs[f"{(i, j)}-{i + 1, j}"] = 0.55 + (i) * 0.05
edge_costs[f"{i + 1, j}-{(i, j)}"] = 0.55 + (i - 1) * 0.05
for agent in agents:
for u, v, weight in agent._position_graph.edges(data='weight'):
agent._position_graph[u][v]['weight'] = edge_costs[f"{u}-{v}"]
return agents
def get_two_rooms_tsp_agents(emergent_phenomenon, factory):
agents = [TSPTargetAgent(factory, 0), TSPTargetAgent(factory, 1)]
if not emergent_phenomenon:
edge_costs = {}
# Add costs for horizontal edges
for i in range(1, 6):
for j in range(1, 13):
# Add costs for both traversal directions
edge_costs[f"{(i, j)}-{i, j + 1}"] = np.abs(5/i*np.cbrt(((j+1)/4 - 1)) - 1)
edge_costs[f"{i, j + 1}-{(i, j)}"] = np.abs(5/i*np.cbrt((j/4 - 1)) - 1)
# Add costs for vertical edges
for i in range(1, 5):
for j in range(1, 14):
# Add costs for both traversal directions
edge_costs[f"{(i, j)}-{i + 1, j}"] = np.abs(5/(i+1)*np.cbrt((j/4 - 1)) - 1)
edge_costs[f"{i + 1, j}-{(i, j)}"] = np.abs(5/i*np.cbrt((j/4 - 1)) - 1)
for agent in agents:
for u, v, weight in agent._position_graph.edges(data='weight'):
agent._position_graph[u][v]['weight'] = edge_costs[f"{u}-{v}"]
return agents

View File

@ -1,9 +1,11 @@
import os
from pathlib import Path
import numpy as np
import yaml
from marl_factory_grid import Factory
from marl_factory_grid.algorithms.marl.utils import get_configs_marl_path
def load_class(classname):
@ -43,6 +45,10 @@ def get_class(arguments):
return c
def get_study_out_path():
return Path(os.path.join(Path(__file__).parent.parent.parent, "study_out"))
def get_arguments(arguments):
d = dict(arguments)
if "classname" in d:
@ -58,19 +64,13 @@ def load_yaml_file(path: Path):
def add_env_props(cfg):
# Path to config File
env_path = Path(f'../marl_factory_grid/configs/{cfg["env"]["env_name"]}.yaml')
env_path = Path(f'{get_configs_marl_path()}/{cfg["env"]["env_name"]}.yaml')
print(cfg)
# Env Init
factory = Factory(env_path)
_ = factory.reset()
# Agent Init
if len(factory.state.moving_entites) == 1: # Single agent setting
observation_size = list(factory.observation_space.shape)
else: # Multi-agent setting
observation_size = list(factory.observation_space[0].shape)
cfg['agent'].update(dict(observation_size=observation_size, n_actions=factory.action_space[0].n))
return factory

View File

@ -1,78 +0,0 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# Radius of Partially observable Markov decision process
pomdp_r: 0 # default 3
# Print all messages and events
verbose: false
# Run tests
tests: false
# In the "clean and bring" Scenario one agent aims to pick up all items and drop them at drop-off locations while all
# other agents aim to collect coin piles.
Agents:
# The collect coin agents
#Sigmund:
#Actions:
#- Move4
#- Noop
#Observations:
#- CoinPiles
#- Self
#Positions:
#- (9,1)
#- (1,1)
#- (2,4)
#- (4,7)
#- (7,9)
#- (2,4)
#- (4,7)
#- (7,9)
#- (9,9)
#- (9,1)
Wolfgang:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions:
- (9,5)
#- (1,1)
#- (2,4)
#- (4,7)
#- (7,9)
#- (2,4)
#- (4,7)
#- (7,9)
#- (9,9)
#- (9,5)
Entities:
CoinPiles:
coords_or_quantity: (1, 1), (2,4), (4,7), (7,9), (9,9) #(9,9), (7,9), (4,7), (2,4), (1, 1) #(1, 1), (2,4), (4,7), (7,9), (9,9) # (4,7), (2,4), (1, 1) # (1, 1), (2,4), (4,7), (7,9), (9,9) # (1, 1), (1,2), (1,3), (2,4), (2,5), (3,6), (4,7), (5,8), (6,8), (7,9), (8,9), (9,9)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
# Can be omitted/ignored if you do not want to take care of collisions at all.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or a fail conditions.
# The environment stops when all coins are collected
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached:
#max_steps: 200

View File

@ -1,62 +0,0 @@
General:
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
Agents:
#Sigmund:
#Actions:
#- Move4
#- DoorUse
#Observations:
#- CoinPiles
#- Self
#Positions:
#- (3,1)
#- (2,1)
Wolfgang:
Actions:
- Move4
- DoorUse
Observations:
- CoinPiles
- Self
Positions:
- (3,13)
- (2,13)
Entities:
CoinPiles:
coords_or_quantity: (2,13), (3,2) # (2,1), (3,12)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
Doors: { }
Rules:
# Environment Dynamics
#DoorAutoClose:
#close_frequency: 10
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
#DoneOnAllDirtCleaned:
DoneAtMaxStepsReached:
max_steps: 50

View File

@ -1,75 +0,0 @@
General:
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
Agents:
#Sigmund:
#Actions:
#- Move4
#Observations:
#- CoinPiles
#- Self
#Positions:
#- (3,1)
#- (1,1)
#- (3,1)
#- (5,1)
#- (3,1)
#- (1,8)
#- (3,1)
#- (5,8)
Wolfgang:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions:
- (3,13)
- (2,13)
- (1,13)
- (3,13)
- (1,8)
- (2,6)
- (3,10)
- (4,6)
Entities:
CoinPiles:
coords_or_quantity: (2,13), (3,2) # (2,1), (3,12)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
#Doors: { }
Rules:
# Environment Dynamics
#DoorAutoClose:
#close_frequency: 10
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached:
#max_steps: 100
AgentSpawnRule:
spawn_rule: "order"

View File

@ -26,6 +26,28 @@ Agents:
- Noop
- Charge
- Clean
- DestAction
- DoorUse
- ItemAction
- Move8
Observations:
- Combined:
- Other
- Walls
- GlobalPosition
- Battery
- ChargePods
- DirtPiles
- Destinations
- Doors
- Items
- Inventory
- DropOffLocations
- Maintainers
Herbert:
Actions:
- Noop
- Charge
- Collect
- DestAction
- DoorUse
@ -39,7 +61,6 @@ Agents:
- Battery
- ChargePods
- CoinPiles
- DirtPiles
- Destinations
- Doors
- Items
@ -62,10 +83,10 @@ Entities:
# CoinPiles: Entities that can be collected by an agent.
CoinPiles:
coords_or_quantity: 10
initial_amount: 2
initial_amount: 1
collect_amount: 1
coin_spawn_r_var: 0.1
max_global_amount: 20
max_global_amount: 10
max_local_amount: 5
# Destinations: Entities representing target locations for agents.

View File

@ -5,60 +5,47 @@ General:
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# Radius of Partially observable Markov decision process
pomdp_r: 0 # default 3
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In the "collect and bring" Scenario one agent aims to pick up all items and drop them at drop-off locations while all
# other agents aim to collect coin piles.
# Define Agents, their actions, observations and spawnpoints
Agents:
# The collect coin agents
Sigmund:
# The clean agents
Agent1:
Actions:
- Move4
#- Collect
#- Noop
- Noop
Observations:
- CoinPiles
- Self
Positions:
- (9,1)
- (4,5)
- (1,1)
- (4,5)
- (9,1)
- (9,9)
Wolfgang:
Agent2:
Actions:
- Move4
#- Collect
#- Noop
- Noop
Observations:
- CoinPiles
- Self
Positions:
- (9,5)
- (4,5)
- (1,1)
- (4,5)
- (9,5)
- (9,9)
Entities:
CoinPiles:
coords_or_quantity: (9,9), (1,1), (4,5) # (4,7), (2,4), (1, 1) #(1, 1), (2,4), (4,7), (7,9), (9,9) # (1, 1), (1,2), (1,3), (2,4), (2,5), (3,6), (4,7), (5,8), (6,8), (7,9), (8,9), (9,9)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coords_or_quantity: (9,9), (7,9), (4,7), (2,4), (1, 1)
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
randomize: False # If coins should spawn at random positions instead of the positions defined above
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
# Can be omitted/ignored if you do not want to take care of collisions at all.
@ -67,7 +54,5 @@ Rules:
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins are collected
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached: # An episode should last for at most max_steps steps
#max_steps: 100

View File

@ -1,20 +1,20 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
# Define Agents, their actions, observations and spawnpoints
Agents:
Sigmund:
Agent1:
Actions:
- Move4
- DoorUse
@ -24,7 +24,7 @@ Agents:
- Self
Positions:
- (3,1)
Wolfgang:
Agent2:
Actions:
- Move4
- DoorUse
@ -36,10 +36,11 @@ Agents:
- (3,13)
Entities:
# For RL-agent we model the flags as coin piles to be more flexible
CoinPiles:
coords_or_quantity: (2,1), (3,12), (2,13), (3,2) # Static form: auxiliary pile, primary pile, auxiliary pile, ...
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
@ -47,16 +48,13 @@ Entities:
Doors: { }
Rules:
# Environment Dynamics
#DoorAutoClose:
#close_frequency: 10
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
#DoneOnAllDirtCleaned:
# Define the conditions for the environment to stop. Either success or failure conditions.
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 50
max_steps: 30

View File

@ -1,20 +1,20 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
# Define Agents, their actions, observations and spawnpoints
Agents:
Sigmund:
Agent1:
Actions:
- Move4
- DoorUse
@ -24,7 +24,7 @@ Agents:
- Self
Positions:
- (3,1)
Wolfgang:
Agent2:
Actions:
- Move4
- DoorUse
@ -36,10 +36,11 @@ Agents:
- (3,13)
Entities:
# For RL-agent we model the flags as coin piles to be more flexible
CoinPiles:
coords_or_quantity: (3,12), (3,2) # Static form: auxiliary pile, primary pile, auxiliary pile, ...
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coords_or_quantity: (3,12), (3,2) # Locations of flags
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
@ -47,16 +48,13 @@ Entities:
Doors: { }
Rules:
# Environment Dynamics
#DoorAutoClose:
#close_frequency: 10
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
#DoneOnAllDirtCleaned:
# Define the conditions for the environment to stop. Either success or failure conditions.
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 30

View File

@ -0,0 +1,48 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
# The clean agents
Agent1:
Actions:
- Move4
- Noop
Observations:
- CoinPiles
- Self
Positions:
- (9,1)
Entities:
CoinPiles:
coords_or_quantity: (1, 1), (2,4), (4,7), (7,9), (9,9) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
# Can be omitted/ignored if you do not want to take care of collisions at all.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:

View File

@ -5,69 +5,45 @@ General:
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# Radius of Partially observable Markov decision process
pomdp_r: 0 # default 3
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In the "collect and bring" Scenario one agent aims to pick up all items and drop them at drop-off locations while all
# other agents aim to collect coin piles.
# Define Agents, their actions, observations and spawnpoints
Agents:
# The clean agents
#Sigmund:
#Actions:
#- Move4
#Observations:
#- CoinPiles
#- Self
#Positions:
#- (9,1)
#- (1,1)
#- (2,4)
#- (4,7)
#- (6,8)
#- (7,9)
#- (2,4)
#- (4,7)
#- (6,8)
#- (7,9)
#- (9,9)
#- (9,1)
Wolfgang:
Agent1:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions:
- (9,5)
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (9,1)
- (1,1)
- (2,4)
- (4,7)
- (6,8)
- (7,9)
- (2,4)
- (4,7)
- (6,8)
- (7,9)
- (9,9)
- (9,5)
- (9,1)
Entities:
CoinPiles:
coords_or_quantity: (1, 1), (2,4), (4,7), (6,8), (7,9), (9,9) # (4,7), (2,4), (1, 1) #(1, 1), (2,4), (4,7), (7,9), (9,9) # (1, 1), (1,2), (1,3), (2,4), (2,5), (3,6), (4,7), (5,8), (6,8), (7,9), (8,9), (9,9)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coords_or_quantity: (1, 1), (2,4), (4,7), (7,9), (9,9) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
# Can be omitted/ignored if you do not want to take care of collisions at all.
@ -76,10 +52,8 @@ Rules:
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins are collected
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached: # An episode should last for at most max_steps steps
#max_steps: 1000
# Define how agents spawn.
# Options: "random" (Spawn agent at a random position from the list of defined positions)

View File

@ -0,0 +1,50 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
Agent1:
Actions:
- Move4
- DoorUse
Observations:
- CoinPiles
- Self
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (3,1)
- (2,1) # spawnpoint only required if agent1 should go to its auxiliary pile
Entities:
CoinPiles:
coords_or_quantity: (2,1), (3,12) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
Doors: { }
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 30

View File

@ -0,0 +1,55 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
Agent1:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (5,1)
- (2,1)
- (1,1)
Entities:
CoinPiles:
coords_or_quantity: (3,12) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
#Doors: { } # We leave out the door during training
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
# Define how agents spawn.
# Options: "random" (Spawn agent at a random position from the list of defined positions)
# "first" (Always spawn agent at first position regardless of the other provided positions)
# "order" (Loop through agent positions)
AgentSpawnRule:
spawn_rule: "order"

View File

@ -0,0 +1,49 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
Agent2:
Actions:
- Move4
- DoorUse
Observations:
- CoinPiles
- Self
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (3,13)
Entities:
CoinPiles:
coords_or_quantity: (3,2) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
Doors: { }
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 30

View File

@ -0,0 +1,54 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
Agent2:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (3,13)
Entities:
CoinPiles:
coords_or_quantity: (3,2) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
#Doors: { } # We leave out the door during training
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
# Defines how agents spawn.
# Options: "random" (Spawn agent at a random position from the list of defined positions)
# "first" (Always spawn agent at first position regardless of the other provided positions)
# "order" (Loop through agent positions)
AgentSpawnRule:
spawn_rule: "order"

View File

@ -18,28 +18,28 @@ Agents:
# - Doors
# - Maintainers
# Clones: 0
# Item test agent:
# Actions:
# - Noop
# - Charge
# - DestAction
# - DoorUse
# - ItemAction
# - Move8
# Observations:
# - Combined:
# - Other
# - Walls
# - GlobalPosition
# - Battery
# - ChargePods
# - Destinations
# - Doors
# - Items
# - Inventory
# - DropOffLocations
# - Maintainers
# Clones: 0
Item test agent:
Actions:
- Noop
- Charge
- DestAction
- DoorUse
- ItemAction
- Move8
Observations:
- Combined:
- Other
- Walls
- GlobalPosition
- Battery
- ChargePods
- Destinations
- Doors
- Items
- Inventory
- DropOffLocations
- Maintainers
Clones: 0
# Target test agent:
# Actions:
# - Noop
@ -56,25 +56,25 @@ Agents:
# - Doors
# - Maintainers
# Clones: 1
Coin test agent:
Actions:
- Noop
- Charge
- Collect
- DoorUse
- Move8
Observations:
- Combined:
- Other
- Walls
- GlobalPosition
- Battery
- ChargePods
- CoinPiles
- Destinations
- Doors
- Maintainers
Clones: 1
# Coin test agent:
# Actions:
# - Noop
# - Charge
# - Collect
# - DoorUse
# - Move8
# Observations:
# - Combined:
# - Other
# - Walls
# - GlobalPosition
# - Battery
# - ChargePods
# - CoinPiles
# - Destinations
# - Doors
# - Maintainers
# Clones: 1
Entities:
@ -93,7 +93,7 @@ Entities:
# dirt_spawn_r_var: 0.1
# max_global_amount: 20
# max_local_amount: 5
CoinPiles:
DirtPiles:
coords_or_quantity: 10
initial_amount: 2
collect_amount: 1
@ -134,7 +134,7 @@ Rules:
# respawn_freq: 15
RespawnItems:
respawn_freq: 15
RespawnCoins:
RespawnDirt:
respawn_freq: 15
# Utilities

View File

@ -5,31 +5,34 @@ General:
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# Radius of Partially observable Markov decision process
pomdp_r: 0 # default 3
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In the "clean and bring" Scenario one agent aims to pick up all items and drop them at drop-off locations while all
# other agents aim to clean dirt piles.
# Define Agents, their actions, observations and spawnpoints
Agents:
# The coin collect agents
Sigmund:
# The clean agents
Agent1:
Actions:
- Move4
- Collect
- Noop
Observations:
- Walls
- CoinPiles
- Self
Positions:
- (9,1)
Wolfgang:
Agent2:
Actions:
- Move4
- Collect
- Noop
Observations:
- Walls
- CoinPiles
- Self
Positions:
@ -37,12 +40,13 @@ Agents:
Entities:
CoinPiles:
coords_or_quantity: (9,9), (7,9), (4,7), (2,4), (1, 1) # (4,7), (2,4), (1, 1) # (1, 1), (2,4), (4,7), (7,9), (9,9) # (1, 1), (1,2), (1,3), (2,4), (2,5), (3,6), (4,7), (5,8), (6,8), (7,9), (8,9), (9,9)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coords_or_quantity: (1, 1), (2,4), (4,7), (7,9), (9,9)
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
randomize: False # If coins should spawn at random positions instead of the positions defined above
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
@ -55,7 +59,5 @@ Rules:
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins are collected
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached:
#max_steps: 200

View File

@ -1,40 +1,38 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
# Define Agents, their actions, observations and spawnpoints
Agents:
Wolfgang:
Agent1:
Actions:
- Move4
- Noop
- DestAction
- DestAction # Action that is performed when the destination is reached
- DoorUse
Observations:
- Walls
- Other
- Doors
- Destination
Positions:
- (3,1) # Agent spawnpoint
Sigmund:
- (3,1)
Agent2:
Actions:
- Move4
- Noop
- DestAction
- DoorUse
Observations:
- Other
- Walls
- Destination
- Doors
@ -45,10 +43,11 @@ Entities:
Destinations:
spawnrule:
SpawnDestinationsPerAgent:
# Target coordinates
coords_or_quantity:
Wolfgang:
- (3,12) # Target coordinates
Sigmund:
Agent1:
- (3,12)
Agent2:
- (3,2)
Doors: { }
@ -68,10 +67,12 @@ Rules:
AssignGlobalPositions: { }
DoneAtDestinationReach:
reward_at_done: 1
reward_at_done: 50
# We want to give rewards only, when all targets have been reached.
condition: "all"
# Done Conditions
# Define the conditions for the environment to stop. Either success or a fail conditions
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 50
max_steps: 30

View File

@ -1,3 +1,4 @@
import copy
import shutil
from collections import defaultdict
@ -100,7 +101,7 @@ class Factory(gym.Env):
parsed_entities = self.conf.load_entities()
self.map = LevelParser(self.level_filepath, parsed_entities, self.conf.pomdp_r)
self.levels_that_require_masking = ['two_rooms']
self.levels_that_require_masking = ['two_rooms_small']
# Init for later usage:
# noinspection PyTypeChecker
@ -274,10 +275,15 @@ class Factory(gym.Env):
global Renderer
self._renderer = Renderer(self.map.level_shape, view_radius=self.conf.pomdp_r, fps=10)
render_entities = self.state.entities.render()
# Remove potential Nones from entities
render_entities_full = self.state.entities.render()
# Hide entities where certain conditions are met (e.g., CoinPiles whose amount has dropped to 0)
render_entities = self.filter_entities(render_entities)
maintain_indices = self.filter_entities(self.state.entities)
if maintain_indices:
render_entities = [render_entity for idx, render_entity in enumerate(render_entities_full) if idx in maintain_indices]
else:
render_entities = render_entities_full
# Mask entities based on dynamic conditions instead of hardcoding level-specific logic
if self.conf['General']['level_name'] in self.levels_that_require_masking:
@ -291,18 +297,18 @@ class Factory(gym.Env):
def filter_entities(self, entities):
""" Generalized method to filter out entities that shouldn't be rendered. """
if 'DirtPiles' in self.state.entities.keys():
entities = [entity for entity in entities if not (entity.name == 'DirtPiles' and entity.amount <= 0)]
return entities
if 'CoinPiles' in self.state.entities.keys():
all_entities = [item for sublist in [[e for e in entity] for entity in entities] for item in sublist]
return [idx for idx, entity in enumerate(all_entities) if not ('CoinPile' in entity.name and entity.amount <= 0)]
def mask_entities(self, entities):
""" Generalized method to mask entities based on dynamic conditions. """
for entity in entities:
if entity.name == 'CoinPiles':
# entity.name = 'Destinations'
# entity.value = 1
entity.mask = 'Destinations'
entity.mask_value = 1
entity.name = 'Destinations'
entity.value = 1
#entity.mask = 'Destinations'
#entity.mask_value = 1
return entities
def set_recorder(self, recorder):

Some files were not shown because too many files have changed in this diff.