State of repo for ISOLA paper

Julian Schönberger
2024-10-25 17:24:11 +02:00
parent 95749d8238
commit e37b23c20c
120 changed files with 1487 additions and 6439 deletions


@@ -1,23 +0,0 @@
stages: # List of stages for jobs, and their order of execution
- build
build-job: # This job runs in the build stage, which runs first.
stage: build
rules:
- if: $CI_COMMIT_REF_NAME == "pypi" # a commit pushed to this branch triggers this job
variables:
TWINE_USERNAME: $USER_NAME
TWINE_PASSWORD: $API_KEY
TWINE_REPOSITORY: rl-factory-grid
image: python:slim
script:
- echo "Compiling the code..."
- pip install -U twine
- python setup.py sdist bdist_wheel
- twine check dist/*
# try uploading to the test platform before the official one
- twine upload --repository-url https://upload.pypi.org/legacy/ dist/*
- echo "Upload complete."


@@ -1,19 +0,0 @@
# Required
version: 2
# Set the OS, Python version and other tools you might need
build:
os: ubuntu-22.04
tools:
python: "3.12"
# Optional but recommended, declare the Python requirements required
# to build your documentation
# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
python:
install:
- requirements: docs/requirements.txt
# Build documentation in the "docs/" directory with Sphinx
sphinx:
configuration: docs/source/conf.py

LICENSE (new file, 21 lines)

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2024 TRAIL lab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


@@ -1,5 +1,7 @@
# About EDYS
by Steffen Illium, Joel Friedrich, Julian Schönberger, Robert Müller, Fabian Ritz
## Tackling emergent dysfunctions (EDYs) in cooperation with Fraunhofer-IKS.
Collaborating with Fraunhofer-IKS, this project is dedicated to investigating Emergent Dysfunctions (EDYs) within
@@ -46,42 +48,32 @@ systems.
- This allows for processes such as retraining on an already initialized policy and fine-tuning to enhance the
agent's performance based on the enriched information.
## Setup
Install this environment using `pip install marl-factory-grid`. For more information refer
to ['installation'](docs/source/installation.rst).
Refer to [quickstart](_quickstart) for specific scenarios.
## Usage
The majority of environment objects, including entities, rules, and assets, can be loaded automatically.
Simply specify the requirements of your environment in a [
*yaml*-config file](marl_factory_grid/configs/default_config.yaml).
If you only plan on using the environment without making any modifications, use ``quickstart_use``.
This creates a default config-file and another one that lists all possible options of the environment.
Also, it generates an initial script where an agent is executed in the specified environment.
For further details on utilizing the environment, refer to ['usage'](docs/source/usage.rst).
Two example scripts that show how to execute different agents in varying configurations of the environment can be
found in ```env_examples```.
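As a rough orientation, a run script could look like the following sketch. The `Factory` class name, its import path, and the gym-style `reset`/`step` signature are assumptions made here for illustration; the scripts in ```env_examples``` and ['usage'](docs/source/usage.rst) show the actual API.

```python
from pathlib import Path

# Assumed entry point: an environment class that is configured entirely through a yaml file.
from marl_factory_grid.environment.factory import Factory

if __name__ == '__main__':
    config_path = Path('marl_factory_grid/configs/default_config.yaml')
    factory = Factory(config_path)

    n_agents = 1  # assumption: must match the number of agents declared in the config

    for episode in range(3):
        _ = factory.reset()
        done = False
        while not done:
            # Dummy no-op actions; replace them with your trained or scripted agents.
            actions = [0] * n_agents
            _, reward, done, info = factory.step(actions)  # assumed gym-style return values
        print(f'Episode {episode} finished')
```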
Existing modules include a variety of functionalities within the environment:
- [Agents](marl_factory_grid/algorithms) implement either static strategies or learning algorithms based on the specific
configuration.
- Their action set includes opening [door entities](marl_factory_grid/modules/doors/entitites.py), collecting [coins](marl_factory_grid/modules/coins/coin cleaning
- Their action set includes opening [door entities](marl_factory_grid/modules/doors/entitites.py), collecting [coins](marl_factory_grid/modules/coins/entitites.py), cleaning
[dirt](marl_factory_grid/modules/clean_up/entitites.py), picking
up [items](marl_factory_grid/modules/items/entitites.py) and
delivering them to designated drop-off locations.
- Agents are equipped with a [battery](marl_factory_grid/modules/batteries/entitites.py) that gradually depletes over
- Agents can be equipped with a [battery](marl_factory_grid/modules/batteries/entitites.py) that gradually depletes over
time if not charged at a chargepod.
- The [maintainer](marl_factory_grid/modules/maintenance/entities.py) aims to
repair [machines](marl_factory_grid/modules/machines/entitites.py) that lose health over time.
## Customization
If you plan on modifying the environment by for example adding entities or rules, use ``quickstart_modify``.
This creates a template module and a script that runs an agent, incorporating the generated module.
More information on how to modify the levels, entities, groups, rules and assets can be found
in [modifications](docs/source/modifications.rst).
You can modify the environment in various ways, for example by adding levels, entities or rules.
### Levels
@@ -96,7 +88,7 @@ General:
... or create your own, maybe with the help of [asciiflow.com](https://asciiflow.com/#/).
Make sure to use `#` as [Walls](marl_factory_grid/environment/entity/wall.py), `-` as free (walkable) floor, `D`
for [Doors](./modules/doors/entities.py).
for [Doors](marl_factory_grid/modules/doors/entitites.py).
Other Entities (define your own) may bring their own `Symbols`.
### Entities
@@ -104,19 +96,26 @@ Other Entites (define you own) may bring their own `Symbols`
Entities are [Objects](marl_factory_grid/environment/entity/object.py) that can additionally be assigned a position.
Abstract Entities are provided.
If you wish to introduce new entities to the environment, just create a new module that implements the entity class.
If necessary, provide additional classes such as custom actions or rewards and load the entity into the environment
using the config file.
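A minimal sketch of such a module is shown below; the module path is hypothetical, and the base-class name and constructor signature are assumptions, so compare with the template module and the existing modules before using it.

```python
# Hypothetical module, e.g. marl_factory_grid/modules/plants/entities.py
from marl_factory_grid.environment.entity.entity import Entity  # assumed base-class name


class Plant(Entity):
    """Illustrative entity that occupies a position and slowly wilts over time."""

    def __init__(self, *args, initial_health: int = 10, **kwargs):
        super().__init__(*args, **kwargs)  # position handling is left to the base class
        self.health = initial_health

    def wilt(self) -> bool:
        """Reduce health by one; returns True while the plant is still alive."""
        self.health -= 1
        return self.health > 0
```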
### Groups
[Groups](marl_factory_grid/environment/groups/objects.py) are entity Sets that provide administrative access to all
group members.
All [Entites](marl_factory_grid/environment/entity/global_entities.py) are available at runtime as EnvState property.
All [Entities](marl_factory_grid/environment/entity/entity.py) are available at runtime as EnvState property.
### Rules
[Rules](marl_factory_grid/environment/entity/object.py) define how the environment behaves on microscale.
[Rules](marl_factory_grid/environment/rules.py) define how the environment behaves on microscale.
Each of the hooks (`on_init`, `pre_step`, `on_step`, `post_step`, `on_done`)
provides env-access to implement custom logic, calculate rewards, or gather information.
![Hooks](../../images/Hooks_FIKS.png)
If you wish to introduce new rules to the environment, make sure they implement the Rule class and override its hooks
to implement your own rule logic.
![Hooks](images/Hooks_FIKS.png)
[Results](marl_factory_grid/environment/entity/object.py) provide a way to return `rule` evaluations such as rewards and
state reports back to the environment.
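The sketch below illustrates the idea; the hook names follow the list above, while the `state` argument, the `TickResult` class, and its fields are assumptions about the exact API.

```python
# Hypothetical rule module; activate it by listing the rule in the yaml config.
from marl_factory_grid.environment.rules import Rule
from marl_factory_grid.utils.results import TickResult  # assumed location and name of the result class


class SmallStepPenalty(Rule):
    """Illustrative rule: charge every agent a small penalty on each step."""

    def __init__(self, penalty: float = -0.01):
        super().__init__()
        self.penalty = penalty

    def on_step(self, state):
        # `state` is assumed to expose all entity groups; one result is returned per agent.
        return [TickResult(identifier=self.name, validity=True, reward=self.penalty, entity=agent)
                for agent in state['Agent']]
```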

README-EMAS.md (new file, 65 lines)

@@ -0,0 +1,65 @@
# Emergence in Multi-Agent Systems: A Safety Perspective
by Philipp Altmann, Julian Schönberger, Steffen Illium, Maximilian Zorn, Fabian Ritz, Tom Haider, Simon Burton, Thomas Gabor
## About
This is the code for the experiments of our paper. The experiments are built on top of the ```EDYS environment```,
which we developed specifically for studying emergent behaviour in multi-agent systems. This environment is versatile
and can be configured in various ways with different degrees of complexity. We refer to [README-EDYS.md](README-EDYS.md) for a
detailed overview of the functionalities of the environment and an explanation of the project context.
## Setup
1. Set up a virtualenv with python 3.10 or higher. You can use pyvenv or conda for this.
2. Run ```pip install -r requirements.txt``` to install the requirements.
3. In case there is no ```study_out/``` folder in the root directory, create one.
## Rerunning the Experiments
The experiments from our paper can be rerun via [main.py](main.py).
Just select the method representing the part of our experiments you want to rerun and
execute it via the ```__main__``` function.
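For illustration, a stripped-down ```__main__``` block could look like the sketch below; the import path follows the remarks further down, while the exact call signatures of the evaluation methods are an assumption, so compare with the actual code in [main.py](main.py).

```python
# Sketch only: the concrete method names and signatures in main.py / RL_runner.py may differ.
from marl_factory_grid.algorithms.marl.RL_runner import (
    coin_quadrant_multi_agent_rl_eval,
    two_rooms_multi_agent_rl_eval,
)

if __name__ == '__main__':
    # Pick exactly the part of the experiments you want to reproduce and call it here.
    coin_quadrant_multi_agent_rl_eval()
    # two_rooms_multi_agent_rl_eval()
```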
## Further Remarks
1. We use config files located in the [configs](marl_factory_grid/configs) and the
[multi_agent_configs](marl_factory_grid/algorithms/marl/multi_agent_configs),
[single_agent_configs](marl_factory_grid/algorithms/marl/single_agent_configs) folders to configure the environments and the RL
algorithm for our experiments, respectively. You don't need to change anything to rerun the
experiments, but we provided some additional comments in the configs for an overall better
understanding of the functionalities.
2. The results of the experiment runs are stored in [study_out](study_out).
3. We reuse the ```coin-quadrant``` implementation of the RL agent for the ```two_rooms``` environment. The coin assets
are masked with flags in the visualization. This masking does not affect the RL agents in any way.
4. The code for the cost contortion for preventing the emergent behavior of the TSP agents can
be found in [contortions.py](marl_factory_grid/algorithms/static/contortions.py).
5. The functionality that drives the emergence prevention mechanisms for the RL agents is mainly
located in the utility methods ```get_ordered_coin_piles (line 94)``` (for solving the emergence in the
coin-quadrant environment) and ```distribute_indices (line 171)``` (mechanism for two_rooms), which are part of
[utils.py](marl_factory_grid/algorithms/marl/utils.py).
6. [agent_models](marl_factory_grid/algorithms/agent_models) contains the parameters of the trained models for the RL
agents. You can repeat the training by executing the training procedures in [main.py](main.py). Alternatively, you can
use your own trained agents, obtained by modifying the training configurations in [single_agent_configs](marl_factory_grid/algorithms/marl/single_agent_configs),
for the evaluation experiments by inserting the names of the run folders, e.g. “run9” and “run12”, into the list in
the methods ```coin_quadrant_multi_agent_rl_eval``` and ```two_rooms_multi_agent_rl_eval``` in [RL_runner.py](marl_factory_grid/algorithms/marl/RL_runner.py).
## Requirements
Python 3.10
```
numpy==1.26.4
pygame>=2.0
numba>=0.56
gymnasium>=0.26
seaborn
pandas
PyYAML
networkx
torch
tqdm
packaging
pillow
scipy
```


@@ -1,75 +0,0 @@
# About EDYS
## Tackling emergent dysfunctions (EDYs) in cooperation with Fraunhofer-IKS.
Collaborating with Fraunhofer-IKS, this project is dedicated to investigating Emergent Dysfunctions (EDYs) within
multi-agent environments. In multi-agent reinforcement learning (MARL), a population of agents learns by interacting
with each other in a shared environment and adapt their behavior based on the feedback they receive from the environment
and the actions of other agents.
In this context, emergent behavior describes spontaneous behaviors resulting from interactions among agents and
environmental stimuli, rather than explicit programming. This promotes natural, adaptable behavior, increases system
unpredictability for dynamic learning, enables diverse strategies, and encourages collective intelligence for complex
problem-solving. However, the complex dynamics of the environment also give rise to emerging dysfunctions—unexpected
issues from agent interactions. This research aims to enhance our understanding of EDYs and their impact on multi-agent
systems.
### Project Objectives:
- Create an environment that provokes emerging dysfunctions.
- This is achieved by creating a high level of background noise in the domain, where various entities perform
diverse tasks, resulting in a deliberately chaotic dynamic.
- The goal is to observe and analyze naturally occurring emergent dysfunctions within the complexity generated in
this dynamic environment.
- Observational Framework:
- The project introduces an environment that is designed to capture dysfunctions as they naturally occur.
- The environment allows for continuous monitoring of agent behaviors, actions, and interactions.
- Tracking emergent dysfunctions in real-time provides valuable data for analysis and understanding.
- Compatibility
- The Framework allows learning entities from different manufacturers and projects with varying representations
of actions and observations to interact seamlessly within the environment.
## Setup
Install this environment using `pip install marl-factory-grid`. For more information refer
to ['installation'](docs/source/installation.rst).
## Usage
The environment is configured to automatically load necessary objects, including entities, rules, and assets, based on your requirements.
You can utilize existing configurations to replicate the experiments from [this paper](PAPER).
- Preconfigured Studies:
The studies folder contains predefined studies that can be used to replicate the experiments.
These studies provide a structured way to validate and analyze the outcomes observed in different scenarios.
- Creating your own scenarios:
If you want to use the environment with custom entities, rules or levels refer to the [complete repository]().
Existing modules include a variety of functionalities within the environment:
- [Agents](marl_factory_grid/algorithms) implement either static strategies or learning algorithms based on the specific
configuration.
- Their action set includes opening [door entities](marl_factory_grid/modules/doors/entitites.py), collecting [coins](marl_factory_grid/modules/coins/entitites.py) cleaning
[dirt](marl_factory_grid/modules/clean_up/entitites.py), picking
up [items](marl_factory_grid/modules/items/entitites.py) and
delivering them to designated drop-off locations.
- Agents are equipped with a [battery](marl_factory_grid/modules/batteries/entitites.py) that gradually depletes over
time if not charged at a chargepod.
## Limitations
The provided code and documentation are tailored for replicating and validating experiments as described in the paper.
Modifications to the environment, such as adding new entities, creating additional rules, or customizing behavior beyond the provided scope are not supported in this release.
If you are interested in accessing the complete project, including features not covered in this release, refer to the [full repository](LINK FULL REPO).
For further details on running the experiments, please consult the relevant documentation provided in the studies' folder.


@@ -1,112 +0,0 @@
---
General:
level_name: large
env_seed: 69
verbose: !!bool False
pomdp_r: 3
individual_rewards: !!bool True
Entities:
Defaults: {}
DirtPiles:
initial_dirt_ratio: 0.01 # On init, at most this share of tiles spawns dirt.
dirt_spawn_r_var: 0.5 # How much does the dirt spawn amount vary?
initial_amount: 1
max_local_amount: 3 # Max dirt amount per tile.
max_global_amount: 30 # Max dirt amount in the whole environment.
Doors:
closed_on_init: True
auto_close_interval: 10
indicate_area: False
Batteries: {}
ChargePods: {}
Destinations: {}
ReachedDestinations: {}
Items: {}
Inventories: {}
DropOffLocations: {}
Agents:
Wolfgang:
Actions:
- Noop
- Noop
- Noop
- CleanUp
Observations:
- Self
- Placeholder
- Walls
- DirtPiles
- Placeholder
- Doors
- Doors
Bjoern:
Actions:
# Move4, Noop
- Move8
- DoorUse
- ItemAction
Observations:
- Defaults
- Combined:
- Other
- Walls
- Items
- Inventory
Karl-Heinz:
Actions:
- Move8
- DoorUse
Observations:
# Wall, Only Other Agents
- Defaults
- Combined:
- Other
- Self
- Walls
- Doors
- Destinations
Manfred:
Actions:
- Move8
- ItemAction
- DoorUse
- CleanUp
- DestAction
- BtryCharge
Observations:
- Defaults
- Battery
- Destinations
- DirtPiles
- Doors
- Items
- Inventory
- DropOffLocations
Rules:
Defaults: {}
Collision:
done_at_collisions: !!bool False
DirtRespawnRule:
spawn_freq: 15
DirtSmearOnMove:
smear_amount: 0.12
DoorAutoClose: {}
DirtAllCleanDone: {}
Btry: {}
BtryDoneAtDischarge: {}
DestinationReach: {}
DestinationSpawn: {}
DestinationDone: {}
ItemRules: {}
Assets:
- Defaults
- Dirt
- Door
- Machine
- Item
- Destination
- DropOffLocation
- Chargepod


@@ -1,189 +0,0 @@
import sys
from pathlib import Path
##############################################
# keep this for stand alone script execution #
##############################################
from environments.factory.base.base_factory import BaseFactory
from environments.logging.recorder import EnvRecorder
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
##############################################
##############################################
##############################################
import simplejson
from environments import helpers as h
from environments.factory.additional.combined_factories import DestBatteryFactory
from environments.factory.additional.dest.factory_dest import DestFactory
from environments.factory.additional.dirt.factory_dirt import DirtFactory
from environments.factory.additional.item.factory_item import ItemFactory
from environments.helpers import ObservationTranslator, ActionTranslator
from environments.logging.envmonitor import EnvMonitor
from environments.utility_classes import ObservationProperties, AgentRenderOptions, MovementProperties
def policy_model_kwargs():
return dict(ent_coef=0.01)
def dqn_model_kwargs():
return dict(buffer_size=50000,
learning_starts=64,
batch_size=64,
target_update_interval=5000,
exploration_fraction=0.25,
exploration_final_eps=0.025
)
def encapsule_env_factory(env_fctry, env_kwrgs):
def _init():
with env_fctry(**env_kwrgs) as init_env:
return init_env
return _init
if __name__ == '__main__':
render = False
# Define Global Env Parameters
# Define properties object parameters
factory_kwargs = dict(
max_steps=400, parse_doors=True,
level_name='rooms',
doors_have_area=True, verbose=False,
mv_prop=MovementProperties(allow_diagonal_movement=True,
allow_square_movement=True,
allow_no_op=False),
obs_prop=ObservationProperties(
frames_to_stack=3,
cast_shadows=True,
omit_agent_self=True,
render_agents=AgentRenderOptions.LEVEL,
additional_agent_placeholder=None,
)
)
# Bundle both environments with global kwargs and parameters
# Todo: find a better solution, like auto module loading
env_map = {'DirtFactory': DirtFactory,
'ItemFactory': ItemFactory,
'DestFactory': DestFactory,
'DestBatteryFactory': DestBatteryFactory
}
env_names = list(env_map.keys())
# Put all your multi-seed agents in a single folder; we do not need specific names etc.
available_models = dict()
available_envs = dict()
available_runs_kwargs = dict()
available_runs_agents = dict()
max_seed = 0
# Define this folder
combinations_path = Path('combinations')
# Those are all differently trained combinations of models, environments and parameters
for combination in (x for x in combinations_path.iterdir() if x.is_dir()):
# These are all the models for this specific combination
for model_run in (x for x in combination.iterdir() if x.is_dir()):
model_name, env_name = model_run.name.split('_')[:2]
if model_name not in available_models:
available_models[model_name] = h.MODEL_MAP[model_name]
if env_name not in available_envs:
available_envs[env_name] = env_map[env_name]
# Those are all available seeds
for seed_run in (x for x in model_run.iterdir() if x.is_dir()):
max_seed = max(int(seed_run.name.split('_')[0]), max_seed)
# Read the environment configuration from disk
with next(seed_run.glob('env_params.json')).open('r') as f:
env_kwargs = simplejson.load(f)
available_runs_kwargs[seed_run.name] = env_kwargs
# Read the trained model_path from disk
model_path = next(seed_run.glob('model.zip'))
available_runs_agents[seed_run.name] = model_path
# We start by combining all SAME MODEL CLASSES per available Seed, across ALL available ENVIRONMENTS.
for model_name, model_cls in available_models.items():
for seed in range(max_seed):
combined_env_kwargs = dict()
model_paths = list()
comparable_runs = {key: val for key, val in available_runs_kwargs.items() if (
key.startswith(str(seed)) and model_name in key and key != 'key')
}
for name, run_kwargs in comparable_runs.items():
# Select trained agent as a candidate:
model_paths.append(available_runs_agents[name])
# Sort Env Kwargs:
for key, val in run_kwargs.items():
if key not in combined_env_kwargs:
combined_env_kwargs[key] = val
else:
assert combined_env_kwargs[key] == val, "Check the combinations you try to make!"
# Update and combine all kwargs to account for multiple agent etc.
# We cannot capture all configuration cases!
for key, val in factory_kwargs.items():
if key not in combined_env_kwargs:
combined_env_kwargs[key] = val
else:
assert combined_env_kwargs[key] == val
combined_env_kwargs.update(n_agents=len(comparable_runs))
with type("CombinedEnv", tuple(available_envs.values()), {})(**combined_env_kwargs) as combEnv:
# EnvMonitor Init
comb = f'comb_{model_name}_{seed}'
comb_monitor_path = combinations_path / comb / f'{comb}_monitor.pick'
comb_recorder_path = combinations_path / comb / f'{comb}_recorder.json'
comb_monitor_path.parent.mkdir(parents=True, exist_ok=True)
monitoredCombEnv = EnvMonitor(combEnv, filepath=comb_monitor_path)
monitoredCombEnv = EnvRecorder(monitoredCombEnv, filepath=comb_recorder_path, freq=1)
# Evaluation starts here #####################################################
# Load all models
loaded_models = [available_models[model_name].load(model_path) for model_path in model_paths]
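# The translators map each loaded agent's named observation/action space onto the combined
# environment's spaces, so models trained separately (and possibly in different environments)
# can act together in this single combined run.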
obs_translators = ObservationTranslator(
monitoredCombEnv.named_observation_space,
*[agent.named_observation_space for agent in loaded_models],
placeholder_fill_value='n')
act_translators = ActionTranslator(
monitoredCombEnv.named_action_space,
*(agent.named_action_space for agent in loaded_models)
)
for episode in range(1):
obs = monitoredCombEnv.reset()
if render: monitoredCombEnv.render()
rew, done_bool = 0, False
while not done_bool:
actions = []
for i, model in enumerate(loaded_models):
pred = model.predict(obs_translators.translate_observation(i, obs[i]))[0]
actions.append(act_translators.translate_action(i, pred))
obs, step_r, done_bool, info_obj = monitoredCombEnv.step(actions)
rew += step_r
if render: monitoredCombEnv.render()
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
# Eval monitor outputs are automatically stored by the monitor object
# TODO: Plotting
monitoredCombEnv.save_records()
monitoredCombEnv.save_run()
pass


@@ -1,203 +0,0 @@
import sys
import time
from pathlib import Path
import simplejson
import stable_baselines3 as sb3
# This is needed, when you put this file in a subfolder.
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
from environments import helpers as h
from environments.factory.additional.dest.dest_util import DestModeOptions, DestProperties
from environments.factory.additional.btry.btry_util import BatteryProperties
from environments.logging.envmonitor import EnvMonitor
from environments.logging.recorder import EnvRecorder
from environments.factory.additional.combined_factories import DestBatteryFactory
from environments.utility_classes import MovementProperties, ObservationProperties, AgentRenderOptions
from plotting.compare_runs import compare_seed_runs
"""
Welcome to this quick start file. Here we will see how to:
0. Setup I/O Paths
1. Setup parameters for the environments (dirt-factory).
2. Setup parameters for the agent training (SB3: PPO) and save metrics.
Run the training.
3. Save environment and agent for later analysis.
4. Load the agent from drive
5. Rendering the environment with a run of the trained agent.
6. Plot metrics
"""
if __name__ == '__main__':
#########################################################
# 0. Setup I/O Paths
# Define some general parameters
train_steps = 1e6
n_seeds = 3
model_class = sb3.PPO
env_class = DestBatteryFactory
env_params_json = 'env_params.json'
# Define a global study save path
start_time = int(time.time())
study_root_path = Path(__file__).parent.parent / 'study_out' / f'{Path(__file__).stem}_{start_time}'
# Create an _identifier, which is unique for every combination and easy to read in filesystem
identifier = f'{model_class.__name__}_{env_class.__name__}_{start_time}'
exp_path = study_root_path / identifier
#########################################################
# 1. Setup parameters for the environments (dirt-factory).
# Define property object parameters.
# 'ObservationProperties' are for specifying how the agent sees the environment.
obs_props = ObservationProperties(render_agents=AgentRenderOptions.NOT, # Agents won`t be shown in the obs at all
omit_agent_self=True, # This is default
additional_agent_placeholder=None, # We will not take care of future agent
frames_to_stack=3, # To give the agent a notion of time
pomdp_r=2 # the agent view-radius
)
# 'MovementProperties' are for specifying how the agent is allowed to move in the environment.
move_props = MovementProperties(allow_diagonal_movement=True, # Euclidean style (vertices)
allow_square_movement=True, # Manhattan (edges)
allow_no_op=False) # Pause movement (do nothing)
# 'DirtProperties' control if and how dirt is spawned
# TODO: Comments
dest_props = DestProperties(
n_dests = 2, # How many destinations are there
dwell_time = 0, # How long does the agent need to "wait" on a destination
spawn_frequency = 0,
spawn_in_other_zone = True, #
spawn_mode = DestModeOptions.DONE,
)
btry_props = BatteryProperties(
initial_charge = 0.9, #
charge_rate = 0.4, #
charge_locations = 3, #
per_action_costs = 0.01,
done_when_discharged = True,
multi_charge = False,
)
# These are the EnvKwargs for initializing the environment class, holding all former parameter-classes
# TODO: Comments
factory_kwargs = dict(n_agents=1,
max_steps=400,
parse_doors=True,
level_name='rooms',
doors_have_area=True, #
verbose=False,
mv_prop=move_props, # See Above
obs_prop=obs_props, # See Above
done_at_collision=True,
dest_prop=dest_props,
btry_prop=btry_props
)
#########################################################
# 2. Setup parameters for the agent training (SB3: PPO) and save metrics.
agent_kwargs = dict()
#########################################################
# Run the Training
for seed in range(n_seeds):
# Make a copy if you want to alter things in the training loop; like the seed.
env_kwargs = factory_kwargs.copy()
env_kwargs.update(env_seed=seed)
# Output folder
seed_path = exp_path / f'{str(seed)}_{identifier}'
seed_path.mkdir(parents=True, exist_ok=True)
# Parameter Storage
param_path = seed_path / env_params_json
# Observation (measures) Storage
monitor_path = seed_path / 'monitor.pick'
recorder_path = seed_path / 'recorder.json'
# Model save Path for the trained model
model_save_path = seed_path / f'model.zip'
# Env Init & Model kwargs definition
with env_class(**env_kwargs) as env_factory:
# EnvMonitor Init
env_monitor_callback = EnvMonitor(env_factory)
# EnvRecorder Init
env_recorder_callback = EnvRecorder(env_factory, freq=int(train_steps / 400 / 10))
# Model Init
model = model_class("MlpPolicy", env_factory, verbose=1, seed=seed, device='cpu')
# Model train
model.learn(total_timesteps=int(train_steps), callback=[env_monitor_callback, env_recorder_callback])
#########################################################
# 3. Save environment and agent for later analysis.
# Save the trained Model, the monitor (environment measures) and the environment parameters
model.named_observation_space = env_factory.named_observation_space
model.named_action_space = env_factory.named_action_space
model.save(model_save_path)
env_factory.save_params(param_path)
env_monitor_callback.save_run(monitor_path)
env_recorder_callback.save_records(recorder_path, save_occupation_map=False)
# Compare performance runs, for each seed within a model
try:
compare_seed_runs(exp_path, use_tex=False)
except ValueError:
pass
# Train ends here ############################################################
# Evaluation starts here #####################################################
# First Iterate over every model and monitor "as trained"
print('Start Measurement Tracking')
# For trained policy in study_root_path / _identifier
for policy_path in [x for x in exp_path.iterdir() if x.is_dir()]:
# retrieve model class
model_cls = next(val for key, val in h.MODEL_MAP.items() if key in policy_path.parent.name)
# Load the agent
model = model_cls.load(policy_path / 'model.zip', device='cpu')
# Load old environment kwargs
with next(policy_path.glob(env_params_json)).open('r') as f:
env_kwargs = simplejson.load(f)
# Make the environment stop at collisions
# (you only want to have a single collision per episode for the statistics)
env_kwargs.update(done_at_collision=True)
# Init Env
with env_class(**env_kwargs) as env_factory:
monitored_env_factory = EnvMonitor(env_factory)
# Evaluation Loop for i in range(n Episodes)
for episode in range(100):
# noinspection PyRedeclaration
env_state = monitored_env_factory.reset()
rew, done_bool = 0, False
while not done_bool:
action = model.predict(env_state, deterministic=True)[0]
env_state, step_r, done_bool, info_obj = monitored_env_factory.step(action)
rew += step_r
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
monitored_env_factory.save_run(filepath=policy_path / 'eval_run_monitor.pick')
print('Measurements Done')


@@ -1,193 +0,0 @@
import sys
import time
from pathlib import Path
import simplejson
import stable_baselines3 as sb3
# This is needed, when you put this file in a subfolder.
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
from environments import helpers as h
from environments.factory.additional.dest.dest_util import DestModeOptions, DestProperties
from environments.logging.envmonitor import EnvMonitor
from environments.logging.recorder import EnvRecorder
from environments.factory.additional.dest.factory_dest import DestFactory
from environments.utility_classes import MovementProperties, ObservationProperties, AgentRenderOptions
from plotting.compare_runs import compare_seed_runs
"""
Welcome to this quick start file. Here we will see how to:
0. Setup I/O Paths
1. Setup parameters for the environments (dest-factory).
2. Setup parameters for the agent training (SB3: PPO) and save metrics.
Run the training.
3. Save environment and agent for later analysis.
4. Load the agent from drive
5. Rendering the environment with a run of the trained agent.
6. Plot metrics
"""
if __name__ == '__main__':
#########################################################
# 0. Setup I/O Paths
# Define some general parameters
train_steps = 1e6
n_seeds = 3
model_class = sb3.PPO
env_class = DestFactory
env_params_json = 'env_params.json'
# Define a global study save path
start_time = int(time.time())
study_root_path = Path(__file__).parent.parent / 'study_out' / f'{Path(__file__).stem}_{start_time}'
# Create an _identifier, which is unique for every combination and easy to read in filesystem
identifier = f'{model_class.__name__}_{env_class.__name__}_{start_time}'
exp_path = study_root_path / identifier
#########################################################
# 1. Setup parameters for the environments (dest-factory).
# Define property object parameters.
# 'ObservationProperties' are for specifying how the agent sees the environment.
obs_props = ObservationProperties(render_agents=AgentRenderOptions.NOT, # Agents won`t be shown in the obs at all
omit_agent_self=True, # This is default
additional_agent_placeholder=None, # We will not take care of future agent
frames_to_stack=3, # To give the agent a notion of time
pomdp_r=2 # the agent view-radius
)
# 'MovementProperties' are for specifying how the agent is allowed to move in the environment.
move_props = MovementProperties(allow_diagonal_movement=True, # Euclidean style (vertices)
allow_square_movement=True, # Manhattan (edges)
allow_no_op=False) # Pause movement (do nothing)
# 'DestProperties' control if and how dest is spawned
# TODO: Comments
dest_props = DestProperties(
n_dests = 2, # How many destinations are there
dwell_time = 0, # How long does the agent need to "wait" on a destination
spawn_frequency = 0,
spawn_in_other_zone = True, #
spawn_mode = DestModeOptions.DONE,
)
# These are the EnvKwargs for initializing the environment class, holding all former parameter-classes
# TODO: Comments
factory_kwargs = dict(n_agents=1,
max_steps=400,
parse_doors=True,
level_name='rooms',
doors_have_area=True, #
verbose=False,
mv_prop=move_props, # See Above
obs_prop=obs_props, # See Above
done_at_collision=True,
dest_prop=dest_props
)
#########################################################
# 2. Setup parameters for the agent training (SB3: PPO) and save metrics.
agent_kwargs = dict()
#########################################################
# Run the Training
for seed in range(n_seeds):
# Make a copy if you want to alter things in the training loop; like the seed.
env_kwargs = factory_kwargs.copy()
env_kwargs.update(env_seed=seed)
# Output folder
seed_path = exp_path / f'{str(seed)}_{identifier}'
seed_path.mkdir(parents=True, exist_ok=True)
# Parameter Storage
param_path = seed_path / env_params_json
# Observation (measures) Storage
monitor_path = seed_path / 'monitor.pick'
recorder_path = seed_path / 'recorder.json'
# Model save Path for the trained model
model_save_path = seed_path / f'model.zip'
# Env Init & Model kwargs definition
with env_class(**env_kwargs) as env_factory:
# EnvMonitor Init
env_monitor_callback = EnvMonitor(env_factory)
# EnvRecorder Init
env_recorder_callback = EnvRecorder(env_factory, freq=int(train_steps / 400 / 10))
# Model Init
model = model_class("MlpPolicy", env_factory,verbose=1, seed=seed, device='cpu')
# Model train
model.learn(total_timesteps=int(train_steps), callback=[env_monitor_callback, env_recorder_callback])
#########################################################
# 3. Save environment and agent for later analysis.
# Save the trained Model, the monitor (environment measures) and the environment parameters
model.named_observation_space = env_factory.named_observation_space
model.named_action_space = env_factory.named_action_space
model.save(model_save_path)
env_factory.save_params(param_path)
env_monitor_callback.save_run(monitor_path)
env_recorder_callback.save_records(recorder_path, save_occupation_map=False)
# Compare performance runs, for each seed within a model
try:
compare_seed_runs(exp_path, use_tex=False)
except ValueError:
pass
# Train ends here ############################################################
# Evaluation starts here #####################################################
# First Iterate over every model and monitor "as trained"
print('Start Measurement Tracking')
# For trained policy in study_root_path / _identifier
for policy_path in [x for x in exp_path.iterdir() if x.is_dir()]:
# retrieve model class
model_cls = next(val for key, val in h.MODEL_MAP.items() if key in policy_path.parent.name)
# Load the agent
model = model_cls.load(policy_path / 'model.zip', device='cpu')
# Load old environment kwargs
with next(policy_path.glob(env_params_json)).open('r') as f:
env_kwargs = simplejson.load(f)
# Make the environment stop at collisions
# (you only want to have a single collision per episode for the statistics)
env_kwargs.update(done_at_collision=True)
# Init Env
with env_class(**env_kwargs) as env_factory:
monitored_env_factory = EnvMonitor(env_factory)
# Evaluation Loop for i in range(n Episodes)
for episode in range(100):
# noinspection PyRedeclaration
env_state = monitored_env_factory.reset()
rew, done_bool = 0, False
while not done_bool:
action = model.predict(env_state, deterministic=True)[0]
env_state, step_r, done_bool, info_obj = monitored_env_factory.step(action)
rew += step_r
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
monitored_env_factory.save_run(filepath=policy_path / 'eval_run_monitor.pick')
print('Measurements Done')


@@ -1,195 +0,0 @@
import sys
import time
from pathlib import Path
import simplejson
import stable_baselines3 as sb3
# This is needed, when you put this file in a subfolder.
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
from environments import helpers as h
from environments.logging.envmonitor import EnvMonitor
from environments.logging.recorder import EnvRecorder
from environments.factory.additional.dirt.dirt_util import DirtProperties
from environments.factory.additional.dirt.factory_dirt import DirtFactory
from environments.utility_classes import MovementProperties, ObservationProperties, AgentRenderOptions
from plotting.compare_runs import compare_seed_runs
"""
Welcome to this quick start file. Here we will see how to:
0. Setup I/O Paths
1. Setup parameters for the environments (dirt-factory).
2. Setup parameters for the agent training (SB3: PPO) and save metrics.
Run the training.
3. Save environment and agent for later analysis.
4. Load the agent from drive
5. Rendering the environment with a run of the trained agent.
6. Plot metrics
"""
if __name__ == '__main__':
#########################################################
# 0. Setup I/O Paths
# Define some general parameters
train_steps = 1e6
n_seeds = 3
model_class = sb3.PPO
env_class = DirtFactory
env_params_json = 'env_params.json'
# Define a global study save path
start_time = int(time.time())
study_root_path = Path(__file__).parent.parent / 'study_out' / f'{Path(__file__).stem}_{start_time}'
# Create an _identifier, which is unique for every combination and easy to read in filesystem
identifier = f'{model_class.__name__}_{env_class.__name__}_{start_time}'
exp_path = study_root_path / identifier
#########################################################
# 1. Setup parameters for the environments (dirt-factory).
# Define property object parameters.
# 'ObservationProperties' are for specifying how the agent sees the environment.
obs_props = ObservationProperties(render_agents=AgentRenderOptions.NOT, # Agents won`t be shown in the obs at all
omit_agent_self=True, # This is default
additional_agent_placeholder=None, # We will not take care of future agent
frames_to_stack=3, # To give the agent a notion of time
pomdp_r=2 # the agent's view-radius
)
# 'MovementProperties' are for specifying how the agent is allowed to move in the environment.
move_props = MovementProperties(allow_diagonal_movement=True, # Euclidean style (vertices)
allow_square_movement=True, # Manhattan (edges)
allow_no_op=False) # Pause movement (do nothing)
# 'DirtProperties' control if and how dirt is spawned
# TODO: Comments
dirt_props = DirtProperties(initial_dirt_ratio=0.35,
initial_dirt_spawn_r_var=0.1,
clean_amount=0.34,
max_spawn_amount=0.1,
max_global_amount=20,
max_local_amount=1,
spawn_frequency=0,
max_spawn_ratio=0.05,
dirt_smear_amount=0.0)
# These are the EnvKwargs for initializing the environment class, holding all former parameter-classes
# TODO: Comments
factory_kwargs = dict(n_agents=1,
max_steps=400,
parse_doors=True,
level_name='rooms',
doors_have_area=True, #
verbose=False,
mv_prop=move_props, # See Above
obs_prop=obs_props, # See Above
done_at_collision=True,
dirt_prop=dirt_props
)
#########################################################
# 2. Setup parameters for the agent training (SB3: PPO) and save metrics.
agent_kwargs = dict()
#########################################################
# Run the Training
for seed in range(n_seeds):
# Make a copy if you want to alter things in the training loop; like the seed.
env_kwargs = factory_kwargs.copy()
env_kwargs.update(env_seed=seed)
# Output folder
seed_path = exp_path / f'{str(seed)}_{identifier}'
seed_path.mkdir(parents=True, exist_ok=True)
# Parameter Storage
param_path = seed_path / env_params_json
# Observation (measures) Storage
monitor_path = seed_path / 'monitor.pick'
recorder_path = seed_path / 'recorder.json'
# Model save Path for the trained model
model_save_path = seed_path / f'model.zip'
# Env Init & Model kwargs definition
with env_class(**env_kwargs) as env_factory:
# EnvMonitor Init
env_monitor_callback = EnvMonitor(env_factory)
# EnvRecorder Init
env_recorder_callback = EnvRecorder(env_factory, freq=int(train_steps / 400 / 10))
# Model Init
model = model_class("MlpPolicy", env_factory, verbose=1, seed=seed, device='cpu')
# Model train
model.learn(total_timesteps=int(train_steps), callback=[env_monitor_callback, env_recorder_callback])
#########################################################
# 3. Save environment and agent for later analysis.
# Save the trained Model, the monitor (environment measures) and the environment parameters
model.named_observation_space = env_factory.named_observation_space
model.named_action_space = env_factory.named_action_space
model.save(model_save_path)
env_factory.save_params(param_path)
env_monitor_callback.save_run(monitor_path)
env_recorder_callback.save_records(recorder_path, save_occupation_map=False)
# Compare performance runs, for each seed within a model
try:
compare_seed_runs(exp_path, use_tex=False)
except ValueError:
pass
# Train ends here ############################################################
# Evaluation starts here #####################################################
# First Iterate over every model and monitor "as trained"
print('Start Measurement Tracking')
# For trained policy in study_root_path / _identifier
for policy_path in [x for x in exp_path.iterdir() if x.is_dir()]:
# retrieve model class
model_cls = next(val for key, val in h.MODEL_MAP.items() if key in policy_path.parent.name)
# Load the agent
model = model_cls.load(policy_path / 'model.zip', device='cpu')
# Load old environment kwargs
with next(policy_path.glob(env_params_json)).open('r') as f:
env_kwargs = simplejson.load(f)
# Make the environment stop at collisions
# (you only want to have a single collision per episode for the statistics)
env_kwargs.update(done_at_collision=True)
# Init Env
with env_class(**env_kwargs) as env_factory:
monitored_env_factory = EnvMonitor(env_factory)
# Evaluation Loop for i in range(n Episodes)
for episode in range(100):
# noinspection PyRedeclaration
env_state = monitored_env_factory.reset()
rew, done_bool = 0, False
while not done_bool:
action = model.predict(env_state, deterministic=True)[0]
env_state, step_r, done_bool, info_obj = monitored_env_factory.step(action)
rew += step_r
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
monitored_env_factory.save_run(filepath=policy_path / 'eval_run_monitor.pick')
print('Measurements Done')


@@ -1,191 +0,0 @@
import sys
import time
from pathlib import Path
import simplejson
import stable_baselines3 as sb3
# This is needed, when you put this file in a subfolder.
try:
# noinspection PyUnboundLocalVariable
if __package__ is None:
DIR = Path(__file__).resolve().parent
sys.path.insert(0, str(DIR.parent))
__package__ = DIR.name
else:
DIR = None
except NameError:
DIR = None
pass
from environments import helpers as h
from environments.factory.additional.item.factory_item import ItemFactory
from environments.factory.additional.item.item_util import ItemProperties
from environments.logging.envmonitor import EnvMonitor
from environments.logging.recorder import EnvRecorder
from environments.utility_classes import MovementProperties, ObservationProperties, AgentRenderOptions
from plotting.compare_runs import compare_seed_runs
"""
Welcome to this quick start file. Here we will see how to:
0. Setup I/O Paths
1. Setup parameters for the environments (item-factory).
2. Setup parameters for the agent training (SB3: PPO) and save metrics.
Run the training.
3. Save environment and agent for later analysis.
4. Load the agent from drive
5. Rendering the environment with a run of the trained agent.
6. Plot metrics
"""
if __name__ == '__main__':
#########################################################
# 0. Setup I/O Paths
# Define some general parameters
train_steps = 1e6
n_seeds = 3
model_class = sb3.PPO
env_class = ItemFactory
env_params_json = 'env_params.json'
# Define a global study save path
start_time = int(time.time())
study_root_path = Path(__file__).parent.parent / 'study_out' / f'{Path(__file__).stem}_{start_time}'
# Create an _identifier, which is unique for every combination and easy to read in filesystem
identifier = f'{model_class.__name__}_{env_class.__name__}_{start_time}'
exp_path = study_root_path / identifier
#########################################################
# 1. Setup parameters for the environments (item-factory).
#
# Define property object parameters.
# 'ObservationProperties' are for specifying how the agent sees the environment.
obs_props = ObservationProperties(render_agents=AgentRenderOptions.NOT, # Agents won`t be shown in the obs at all
omit_agent_self=True, # This is default
additional_agent_placeholder=None, # We will not take care of future agent
frames_to_stack=3, # To give the agent a notion of time
pomdp_r=2 # the agent view-radius
)
# 'MovementProperties' are for specifying how the agent is allowed to move in the environment.
move_props = MovementProperties(allow_diagonal_movement=True, # Euclidean style (vertices)
allow_square_movement=True, # Manhattan (edges)
allow_no_op=False) # Pause movement (do nothing)
# 'ItemProperties' control if and how item is spawned
# TODO: Comments
item_props = ItemProperties(
n_items = 7, # How many items are there at the same time
spawn_frequency = 50, # Spawn Frequency in Steps
n_drop_off_locations = 10, # How many DropOff locations are there at the same time
max_dropoff_storage_size = 0, # How many items are needed until the dropoff is full
max_agent_inventory_capacity = 5, # How many items are needed until the agent inventory is full
)
# These are the EnvKwargs for initializing the environment class, holding all former parameter-classes
# TODO: Comments
factory_kwargs = dict(n_agents=1,
max_steps=400,
parse_doors=True,
level_name='rooms',
doors_have_area=True, #
verbose=False,
mv_prop=move_props, # See Above
obs_prop=obs_props, # See Above
done_at_collision=True,
item_prop=item_props
)
#########################################################
# 2. Setup parameters for the agent training (SB3: PPO) and save metrics.
agent_kwargs = dict()
#########################################################
# Run the Training
for seed in range(n_seeds):
# Make a copy if you want to alter things in the training loop; like the seed.
env_kwargs = factory_kwargs.copy()
env_kwargs.update(env_seed=seed)
# Output folder
seed_path = exp_path / f'{str(seed)}_{identifier}'
seed_path.mkdir(parents=True, exist_ok=True)
# Parameter Storage
param_path = seed_path / env_params_json
# Observation (measures) Storage
monitor_path = seed_path / 'monitor.pick'
recorder_path = seed_path / 'recorder.json'
# Model save Path for the trained model
model_save_path = seed_path / f'model.zip'
# Env Init & Model kwargs definition
with ItemFactory(**env_kwargs) as env_factory:
# EnvMonitor Init
env_monitor_callback = EnvMonitor(env_factory)
# EnvRecorder Init
env_recorder_callback = EnvRecorder(env_factory, freq=int(train_steps / 400 / 10))
# Model Init
model = model_class("MlpPolicy", env_factory,verbose=1, seed=seed, device='cpu')
# Model train
model.learn(total_timesteps=int(train_steps), callback=[env_monitor_callback, env_recorder_callback])
#########################################################
# 3. Save environment and agent for later analysis.
# Save the trained Model, the monitor (environment measures) and the environment parameters
model.named_observation_space = env_factory.named_observation_space
model.named_action_space = env_factory.named_action_space
model.save(model_save_path)
env_factory.save_params(param_path)
env_monitor_callback.save_run(monitor_path)
env_recorder_callback.save_records(recorder_path, save_occupation_map=False)
# Compare performance runs, for each seed within a model
try:
compare_seed_runs(exp_path, use_tex=False)
except ValueError:
pass
# Train ends here ############################################################
# Evaluation starts here #####################################################
# First Iterate over every model and monitor "as trained"
print('Start Measurement Tracking')
# For trained policy in study_root_path / _identifier
for policy_path in [x for x in exp_path.iterdir() if x.is_dir()]:
# retrieve model class
model_cls = next(val for key, val in h.MODEL_MAP.items() if key in policy_path.parent.name)
# Load the agent
model = model_cls.load(policy_path / 'model.zip', device='cpu')
# Load old environment kwargs
with next(policy_path.glob(env_params_json)).open('r') as f:
env_kwargs = simplejson.load(f)
# Make the environment stop at collisions
# (you only want to have a single collision per episode for the statistics)
env_kwargs.update(done_at_collision=True)
# Init Env
with ItemFactory(**env_kwargs) as env_factory:
monitored_env_factory = EnvMonitor(env_factory)
# Evaluation Loop for i in range(n Episodes)
for episode in range(100):
# noinspection PyRedeclaration
env_state = monitored_env_factory.reset()
rew, done_bool = 0, False
while not done_bool:
action = model.predict(env_state, deterministic=True)[0]
env_state, step_r, done_bool, info_obj = monitored_env_factory.step(action)
rew += step_r
if done_bool:
break
print(f'Factory run {episode} done, reward is:\n {rew}')
monitored_env_factory.save_run(filepath=policy_path / 'eval_run_monitor.pick')
print('Measurements Done')


@@ -1,25 +0,0 @@
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
buildapi:
sphinx-apidoc.exe -fEM -T -t _templates -o source/source ../marl_factory_grid "../**/marl", "../**/proto"
@echo "Auto-generation of 'SOURCEAPI' documentation finished. " \
"The generated files were placed in 'source/'"
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)


@@ -1,35 +0,0 @@
@ECHO OFF
pushd %~dp0
REM Command file for Sphinx documentation
if "%SPHINXBUILD%" == "" (
set SPHINXBUILD=sphinx-build
)
set SOURCEDIR=source
set BUILDDIR=build
%SPHINXBUILD% >NUL 2>NUL
if errorlevel 9009 (
echo.
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
echo.installed, then set the SPHINXBUILD environment variable to point
echo.to the full path of the 'sphinx-build' executable. Alternatively you
echo.may add the Sphinx directory to PATH.
echo.
echo.If you don't have Sphinx installed, grab it from
echo.https://www.sphinx-doc.org/
exit /b 1
)
if "%1" == "" goto help
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
goto end
:help
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
:end
popd


@@ -1,4 +0,0 @@
myst_parser
sphinx-pdj-theme
sphinx-mdinclude
sphinx-book-theme


@@ -1,72 +0,0 @@
# Configuration file for the Sphinx documentation builder.
#
# For the full list of built-in configuration values, see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
project = 'rl-factory-grid'
copyright = '2023, Steffen Illium, Robert Mueller, Joel Friedrich'
author = 'Steffen Illium, Robert Mueller, Joel Friedrich'
release = '2.5.0'
# -- General configuration ---------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
extensions = [#'myst_parser',
'sphinx.ext.todo',
'sphinx.ext.autodoc',
'sphinx.ext.intersphinx',
# 'sphinx.ext.autosummary',
'sphinx.ext.linkcode',
'sphinx_mdinclude',
]
templates_path = ['_templates']
exclude_patterns = ['marl_factory_grid.utils.proto', 'marl_factory_grid.utils.proto.fiksProto_pb2*']
autoclass_content = 'both'
autodoc_class_signature = 'separated'
autodoc_typehints = 'description'
autodoc_inherit_docstrings = True
autodoc_typehints_format = 'short'
autodoc_default_options = {
'members': True,
# 'member-order': 'bysource',
'special-members': '__init__',
'undoc-members': True,
# 'exclude-members': '__weakref__',
'show-inheritance': True,
}
autosummary_generate = True
add_module_names = False
toc_object_entries = False
modindex_common_prefix = ['marl_factory_grid.']
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here.
from pathlib import Path
import sys
sys.path.insert(0, (Path(__file__).parents[2]).resolve().as_posix())
sys.path.insert(0, (Path(__file__).parents[2] / 'marl_factory_grid').resolve().as_posix())
# -- Options for HTML output -------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
html_theme = "sphinx_book_theme" # 'alabaster'
# html_static_path = ['_static']
# In your configuration, you need to specify a linkcode_resolve function that returns an URL based on the object.
# https://www.sphinx-doc.org/en/master/usage/extensions/linkcode.html
def linkcode_resolve(domain, info):
if domain != 'py':
return None
if not info['module']:
return None
filename = info['module'].replace('.', '/')
return "https://github.com/illiumst/marl-factory-grid/%s.py" % filename
print(sys.executable)


@@ -1,99 +0,0 @@
Creating a New Scenario
=======================
Creating a new scenario in the `marl-factory-grid` environment allows you to customize the environment to fit your specific requirements. This guide provides step-by-step instructions on how to create a new scenario, including defining a configuration file, designing a level, and potentially adding new entities, rules, and assets. See the "modifications.rst" file for more information on how to modify existing entities, levels, rules, groups and assets.
Step 1: Define Configuration File
---------------------------------
1. **Create a Configuration File:** Start by creating a new configuration file (`.yaml`) for your scenario. This file will contain settings such as the number of agents, environment dimensions, and other parameters. You can use existing configuration files as templates.
2. **Specify Custom Parameters:** Modify the configuration file to include any custom parameters specific to your scenario. For example, you can set the respawn rate of entities or define specific rewards.
Step 2: Design the Level
------------------------
1. **Create a Level File:** Design the layout of your environment by creating a new level file (`.txt`). Use symbols such as `#` for walls, `-` for walkable floors, and introduce new symbols for custom entities.
2. **Define Entity Locations:** Specify the initial locations of entities, including agents and any new entities introduced in your scenario. These spawn locations are typically provided in the conf file.
Step 3: Introduce New Entities
------------------------------
1. **Create New Entity Modules:** If your scenario involves introducing new entities, create new entity modules in the `marl_factory_grid/environment/entity` directory. Define their behavior, properties, and any custom actions they can perform. Check out the template module.
2. **Update Configuration:** Update the configuration file to include settings related to your new entities, such as spawn rates, initial quantities, or any specific behaviors.
Step 4: Implement Custom Rules
--------------------------------
1. **Create Rule Modules:** If your scenario requires custom rules, create new rule modules in the `marl_factory_grid/environment/rules` directory. Implement the necessary logic to govern the behavior of entities in your scenario and use the provided environment hooks.
2. **Update Configuration:** If your custom rules have configurable parameters, update the configuration file to include these settings and activate the rule by adding it to the conf file.
Step 5: Add Custom Assets (Optional)
--------------------------------------
1. **Include Custom Asset Files:** If your scenario introduces new assets (e.g., images for entities), include the necessary asset files in the appropriate directories, such as `marl_factory_grid/environment/assets`.
Step 6: Test and Experiment
-----------------------------
1. **Run Your Scenario:** Use the provided scripts or write your own script to run the scenario with your customized configuration. Observe the behavior of agents and entities in the environment.
2. **Iterate and Experiment:** Adjust configuration parameters, level design, or introduce new elements based on your observations. Iterate through this process until your scenario meets your desired specifications.
Congratulations! You have successfully created a new scenario in the `marl-factory-grid` environment. Experiment with different configurations, levels, entities, and rules to design unique and engaging environments for your simulations. Below you will find an example of how to create a new scenario.
New Example Scenario: Apple Resource Dilemma
----------------------------------------------
To provide you with an example, we'll guide you through creating the "Apple Resource Dilemma" scenario using the steps outlined in the tutorial.
In this example scenario, agents face a dilemma of collecting apples. The apples only spawn if there are already enough in the environment. If agents collect them at the beginning, they won't respawn as quickly as if they wait for more to spawn before collecting.
**Step 1: Define Configuration File**
1. **Create a Configuration File:** Start by creating a new configuration file, e.g., `apple_dilemma_config.yaml`. Use the default config file as a good starting point.
2. **Specify Custom Parameters:** Add custom parameters to control the behavior of your scenario. Also delete unused entities, actions and observations (such as dirt piles) from the default config file.
**Step 2: Design the Level**
1. Create a Level File: Design the layout of your environment by creating a new level file, e.g., apple_dilemma_level.txt.
Of course you can also just use or modify an existing level.
2. Define Entity Locations: Specify the initial locations of entities, including doors (D). Since the apples will likely be spawning randomly, it would not make sense to encode their spawn in the level file.
**Step 3: Introduce New Entities**
1. Create New Entity Modules: Create a new entity module for the apple in the `marl_factory_grid/environment/entity` directory. Use the module template or existing modules as inspiration. Instead of creating a new agent, the item agent can be used, as it is already configured to collect all items and drop them off at designated locations.
2. Update Configuration: Update the configuration file to include settings related to your new entities. Agents need to be able to interact with and observe them.
**Step 4: Implement Custom Rules**
1. Create Rule Modules: You might want to create new rule modules. For example, apple_respawn_rule.py could be inspired by the dirt respawn rule:
>>> from marl_factory_grid.environment.rules.rule import Rule
    class AppleRespawnRule(Rule):
        def __init__(self, apple_spawn_rate=0.1):
            super().__init__()
            self.apple_spawn_rate = apple_spawn_rate

        def tick_post_step(self, state):
            # Logic to respawn apples based on spawn rate
            pass
2. Update Configuration: Update the configuration file to include the new rule.
**Step 5: Add Custom Assets (Optional)**
1. Include Custom Asset Files: If your scenario introduces new assets (e.g., images for entities), include the necessary files in the appropriate directories, such as `marl_factory_grid/environment/assets`.
**Step 6: Test and Experiment**
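Analogous to the general Step 6 above, run the scenario and iterate on it. A minimal sketch of such a run script, adapted from the basic-usage guide, could look like the following; `apple_dilemma_config.yaml` is the hypothetical config file from Step 1, and the random actions merely stand in for your own agents:
>>> from pathlib import Path
    from random import randint
    from tqdm import trange
    from marl_factory_grid.environment.factory import Factory

    factory = Factory(Path('marl_factory_grid/configs/apple_dilemma_config.yaml'))
    for episode in trange(10):
        _ = factory.reset()
        done = False
        action_spaces = factory.action_space
        while not done:
            # Replace the random policy with your own agents.
            a = [randint(0, x.n - 1) for x in action_spaces]
            obs_type, _, reward, done, info = factory.step(a)
Observe how quickly the apples get collected, adjust the spawn rate or rewards in the config file, and iterate until the dilemma behaves as intended.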

View File

@ -1,23 +0,0 @@
.. toctree::
:maxdepth: 1
:caption: Table of Contents
:titlesonly:
installation
usage
modifications
creating a new scenario
testing
source
.. note::
This project is under active development.
.. mdinclude:: ../../README.md
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,22 +0,0 @@
Installation
============
How to install the environment
------------------------------
To use `marl-factory-grid`, first install it using pip:
.. code-block:: console

    (.venv) $ pip install marl-factory-grid
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,92 +0,0 @@
Custom Modifications
====================
This section covers the main aspects of working with the environment.
Modifying levels
----------------
Varying levels are created by defining Walls, Floor or Doors in *.txt*-files (see `levels`_ for examples).
Define which *level* to use in your *config file* as:
.. _levels: marl_factory_grid/levels
>>> General:
        level_name: rooms  # 'simple', 'narrow_corridor', 'eight_puzzle', ...
... or create your own, maybe with the help of `asciiflow.com <https://asciiflow.com/#/>`_.
Make sure to use `#` as `Walls`_, `-` as free (walkable) floor, and `D` for `Doors`_.
Other Entities (define your own) may bring their own `Symbols`.
.. _Walls: marl_factory_grid/environment/entity/wall.py
.. _Doors: modules/doors/entities.py
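To make the symbol conventions concrete, a minimal, purely illustrative level (two small rooms connected by a door; dimensions and layout are up to you) could look like this:

.. code-block:: text

    #########
    #---#---#
    #---D---#
    #---#---#
    #########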
Modifying Entities
--------------------
Entities are `Objects`_ that can additionally be assigned a position.
Abstract Entities are provided.
If you wish to introduce new entities to the environment, just create a new module that implements the entity class. If
necessary, provide additional classes such as custom actions or rewards, and load the entity into the environment using
the config file.
.. _Objects: marl_factory_grid/environment/entity/object.py
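As a rough sketch (this is an illustration, not code from the repository; check `marl_factory_grid/environment/entity/entity.py` for the actual base-class name and constructor signature), such a module might start out as simply as:
>>> from marl_factory_grid.environment.entity.entity import Entity
    class Apple(Entity):
        """Hypothetical collectible entity; position handling is assumed to come from the abstract base class."""
        pass
Listing the new entity in the config file (plus, if needed, a matching collection and asset, see the following sections) then makes it available to the environment.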
Modifying Groups
----------------
`Groups`_ are entity Sets that provide administrative access to all group members.
All `Entity Collections`_ are available at runtime as a property of the env state.
If you add an entity, you probably also want a collection of that entity.
.. _Groups: marl_factory_grid/environment/groups/objects.py
.. _Entity Collections: marl_factory_grid/environment/entity/global_entities.py
Modifying Rules
---------------
`Rules <https://marl-factory-grid.readthedocs.io/en/latest/code/marl_factory_grid.environment.rules.html>`_ define how
the environment behaves on a micro scale. Each of the hooks (`on_init`, `pre_step`, `on_step`, `post_step`, `on_done`)
provides env-access to implement custom logic, calculate rewards, or gather information.
If you wish to introduce new rules to the environment, make sure they implement the Rule class and override its hooks
to implement your own rule logic.
.. image:: ../../images/Hooks_FIKS.png
:alt: Hooks Image
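As a minimal sketch (reusing the import path and the `tick_post_step` hook shown in the scenario guide; check the rules module for the exact hook names and signatures), a custom rule could simply count environment steps:
>>> from marl_factory_grid.environment.rules.rule import Rule
    class StepCounterRule(Rule):
        """Hypothetical rule that counts environment steps via the post-step hook."""
        def __init__(self):
            super().__init__()
            self.steps = 0

        def tick_post_step(self, state):
            # Called once per environment step with the current env state.
            self.steps += 1
Like any other rule, such a rule is activated by adding it to your config file.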
Modifying Constants and Rewards
-------------------------------
Customizing rewards and constants allows you to tailor the environment to specific requirements.
You can set custom rewards in the configuration file. If no specific rewards are defined, the environment
will utilize default rewards, which are provided in the constants file of each module.
In addition to rewards, you can also customize other constants used in the environment's rules or actions. Each module has
its dedicated constants file, while global constants are centrally located in the environment's constants file.
Be careful when making changes to constants, as they can radically impact the behavior of the environment. Only modify
constants if you have a solid understanding of their implications and are confident in the adjustments you're making.
Modifying Results
-----------------
`Results <https://marl-factory-grid.readthedocs.io/en/latest/code/marl_factory_grid.utils.results.html>`_
provide a way to return `rule` evaluations such as rewards and state reports back to the environment.
Modifying Assets
----------------
Make sure to bring your own assets for each Entity living in the Gridworld, as the `Renderer` relies on them.
In general, PNG files (transparent background) with a square aspect ratio should do the job.
.. image:: ../../marl_factory_grid/environment/assets/wall.png
:alt: Wall Image
.. image:: ../../marl_factory_grid/environment/assets/agent/agent.png
:alt: Agent Image
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,17 +0,0 @@
Source
======
.. toctree::
:maxdepth: 2
:glob:
:caption: Table of Contents
:titlesonly:
source/*
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.environment.entity package
==============================================
.. automodule:: marl_factory_grid.environment.entity
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.environment.entity.agent
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.entity.entity
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.entity.object
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.entity.util
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.entity.wall
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,52 +0,0 @@
marl\_factory\_grid.environment.groups package
==============================================
.. automodule:: marl_factory_grid.environment.groups
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.environment.groups.agents
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.collection
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.global_entities
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.mixins
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.objects
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.utils
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.groups.walls
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,49 +0,0 @@
marl\_factory\_grid.environment package
=======================================
.. automodule:: marl_factory_grid.environment
:members:
:undoc-members:
:show-inheritance:
Subpackages
-----------
.. toctree::
:maxdepth: 4
marl_factory_grid.environment.entity
marl_factory_grid.environment.groups
Submodules
----------
.. automodule:: marl_factory_grid.environment.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.factory
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.rewards
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.environment.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,7 +0,0 @@
marl\_factory\_grid.levels package
==================================
.. automodule:: marl_factory_grid.levels
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.batteries package
=============================================
.. automodule:: marl_factory_grid.modules.batteries
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.batteries.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.batteries.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.batteries.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.batteries.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.batteries.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.clean\_up package
=============================================
.. automodule:: marl_factory_grid.modules.clean_up
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.clean_up.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.clean_up.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.clean_up.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.clean_up.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.clean_up.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.destinations package
================================================
.. automodule:: marl_factory_grid.modules.destinations
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.destinations.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.destinations.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.destinations.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.destinations.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.destinations.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.doors package
=========================================
.. automodule:: marl_factory_grid.modules.doors
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.doors.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.doors.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.doors.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.doors.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.doors.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.items package
=========================================
.. automodule:: marl_factory_grid.modules.items
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.items.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.items.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.items.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.items.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.items.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,40 +0,0 @@
marl\_factory\_grid.modules.machines package
============================================
.. automodule:: marl_factory_grid.modules.machines
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.machines.actions
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.machines.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.machines.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.machines.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.machines.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,34 +0,0 @@
marl\_factory\_grid.modules.maintenance package
===============================================
.. automodule:: marl_factory_grid.modules.maintenance
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.maintenance.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.maintenance.entities
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.maintenance.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.maintenance.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,22 +0,0 @@
marl\_factory\_grid.modules package
===================================
.. automodule:: marl_factory_grid.modules
:members:
:undoc-members:
:show-inheritance:
Subpackages
-----------
.. toctree::
:maxdepth: 4
marl_factory_grid.modules.batteries
marl_factory_grid.modules.clean_up
marl_factory_grid.modules.destinations
marl_factory_grid.modules.doors
marl_factory_grid.modules.items
marl_factory_grid.modules.machines
marl_factory_grid.modules.maintenance
marl_factory_grid.modules.zones

View File

@ -1,34 +0,0 @@
marl\_factory\_grid.modules.zones package
=========================================
.. automodule:: marl_factory_grid.modules.zones
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.modules.zones.constants
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.zones.entitites
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.zones.groups
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.modules.zones.rules
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,28 +0,0 @@
marl\_factory\_grid package
===========================
.. automodule:: marl_factory_grid
:members:
:undoc-members:
:show-inheritance:
Subpackages
-----------
.. toctree::
:maxdepth: 4
marl_factory_grid.algorithms
marl_factory_grid.environment
marl_factory_grid.levels
marl_factory_grid.modules
marl_factory_grid.utils
Submodules
----------
.. automodule:: marl_factory_grid.quickstart
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,22 +0,0 @@
marl\_factory\_grid.utils.logging package
=========================================
.. automodule:: marl_factory_grid.utils.logging
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.utils.logging.envmonitor
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.logging.recorder
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,28 +0,0 @@
marl\_factory\_grid.utils.plotting package
==========================================
.. automodule:: marl_factory_grid.utils.plotting
:members:
:undoc-members:
:show-inheritance:
Submodules
----------
.. automodule:: marl_factory_grid.utils.plotting.plot_compare_runs
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.plotting.plot_single_runs
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.plotting.plotting_utils
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,79 +0,0 @@
marl\_factory\_grid.utils package
=================================
.. automodule:: marl_factory_grid.utils
:members:
:undoc-members:
:show-inheritance:
Subpackages
-----------
.. toctree::
:maxdepth: 4
marl_factory_grid.utils.logging
marl_factory_grid.utils.plotting
Submodules
----------
.. automodule:: marl_factory_grid.utils.config_parser
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.helpers
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.level_parser
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.observation_builder
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.ray_caster
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.renderer
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.results
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.states
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.tools
:members:
:undoc-members:
:show-inheritance:
.. automodule:: marl_factory_grid.utils.utility_classes
:members:
:undoc-members:
:show-inheritance:

View File

@ -1,15 +0,0 @@
Testing
=======
In EDYS, tests are seamlessly integrated through environment hooks, mirroring the organization of rules, as explained in the README.md file.
Running tests
-------------
To include specific tests in your run, simply append them to the "tests" section within the configuration file.
If the test requires a specific entity in the environment (e.g., the clean-up test requires a TSPDirtAgent that can observe
and clean dirt in its environment), make sure to include it in the config file.
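For instance, mirroring the demo script shipped with the repository, a TSPDirtAgent can drive the environment while the configured tests run through their hooks (a sketch; it assumes a config that enables dirt piles and the desired tests):
>>> from pathlib import Path
    from marl_factory_grid.algorithms.static.TSP_dirt_agent import TSPDirtAgent
    from marl_factory_grid.environment.factory import Factory

    factory = Factory(Path('marl_factory_grid/configs/test_config.yaml'))
    _ = factory.reset()
    agents = [TSPDirtAgent(factory, 0)]
    done = False
    while not done:
        a = [agent.predict() for agent in agents]
        obs_type, _, reward, done, info = factory.step(a)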
Writing tests
---------------
If you intend to create additional tests, refer to the tests.py file for examples.
Ensure that any new tests implement the corresponding test class and make use of its hooks.
There are no additional steps required, except for the inclusion of your custom tests in the config file.

View File

@ -1,75 +0,0 @@
Basic Usage
===========
Environment objects, including agents, entities and rules, that are specified in a *yaml*-configfile will be loaded automatically.
Using ``quickstart_use`` creates a default config-file and another one that lists all possible options of the environment.
Also, it generates an initial script where an agent is executed in the environment specified by the config-file.
After initializing the environment using the specified configuration file, the script enters a reinforcement learning loop.
The loop consists of episodes, where each episode involves resetting the environment, executing actions, and receiving feedback.
Here's a breakdown of the key components in the provided script. Feel free to customize it based on your specific requirements:
1. **Initialization:**
>>> path = Path('marl_factory_grid/configs/default_config.yaml')
factory = Factory(path)
factory = EnvMonitor(factory)
factory = EnvRecorder(factory)
- The `path` variable points to the location of your configuration file. Ensure it corresponds to the correct path.
- `Factory` initializes the environment based on the provided configuration.
- `EnvMonitor` and `EnvRecorder` are optional components. They add monitoring and recording functionalities to the environment, respectively.
2. **Reinforcement Learning Loop:**
>>> for episode in trange(10):
        _ = factory.reset()
        done = False
        if render:
            factory.render()
        action_spaces = factory.action_space
        agents = []
- The loop iterates over a specified number of episodes (in this case, 10).
- `factory.reset()` resets the environment for a new episode.
- `factory.render()` is used for visualization if rendering is enabled.
- `action_spaces` stores the action spaces available for the agents.
- `agents` will store agent-specific information during the episode.
3. **Taking Actions:**
>>> while not done:
        a = [randint(0, x.n - 1) for x in action_spaces]
        obs_type, _, reward, done, info = factory.step(a)
        if render:
            factory.render()
- Within each episode, the loop continues until the environment signals completion (`done`).
- `a` represents a list of random actions for each agent based on their action space.
- `factory.step(a)` executes the actions, returning observation types, rewards, completion status, and additional information.
4. **Handling Episode Completion:**
>>> if done:
        print(f'Episode {episode} done...')
- After each episode, a message is printed indicating its completion.
Evaluating the run
------------------
If monitoring and recording are enabled, the environment states will be traced and recorded automatically.
The EnvMonitor class acts as a wrapper for Gym environments, monitoring and logging key information during interactions,
while the EnvRecorder class records state summaries during interactions in the environment.
At the end of each run, a plot displaying the step reward is generated. The step reward represents the cumulative sum of rewards obtained by all agents throughout the episode.
Furthermore, a comparative plot that shows the achieved score (step reward) over several runs with different seeds or parameter settings can be generated using the methods provided in plotting/plot_compare_runs.py.
For a more comprehensive evaluation, we recommend using the `Weights and Biases (W&B) <https://wandb.ai/site>`_ framework together with the dataframes generated by the monitor and recorder. These can be found in the run path specified in your script. W&B provides a powerful API for logging and visualizing model training metrics, enabling analysis with predefined as well as custom metrics.
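As a rough sketch of how such an analysis could start, the logged dataframe can be pushed to W&B as follows; note that the file name and format below are assumptions and have to be adapted to whatever the monitor and recorder actually write into your run path:
>>> import pandas as pd
    import wandb

    run = wandb.init(project='marl-factory-grid', name='eval-run-0')
    # Placeholder path and format: point this at the monitor/recorder output of your run.
    df = pd.read_pickle('study_out/run0/monitor.pkl')
    for _, row in df.iterrows():
        wandb.log(row.to_dict())
    run.finish()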
Indices and tables
------------------
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

View File

@ -1,37 +1,29 @@
from pathlib import Path
from pprint import pprint
from tqdm import trange
from marl_factory_grid.algorithms.static.TSP_coin_agent import TSPCoinAgent
from marl_factory_grid.algorithms.static.TSP_dirt_agent import TSPDirtAgent
from marl_factory_grid.algorithms.static.TSP_item_agent import TSPItemAgent
from marl_factory_grid.algorithms.static.TSP_target_agent import TSPTargetAgent
from marl_factory_grid.environment.factory import Factory
from marl_factory_grid.utils.plotting.plot_single_runs import plot_routes, plot_action_maps
if __name__ == '__main__':
run_path = Path('study_out')
run_path = Path('../study_out')
render = True
monitor = True
record = True
# Path to config File
path = Path('marl_factory_grid/configs/test_config.yaml')
path = Path('../marl_factory_grid/configs/default_config.yaml')
# Env Init
factory = Factory(path)
for episode in trange(1):
for episode in trange(10):
_ = factory.reset()
done = False
if render:
factory.render()
action_spaces = factory.action_space
# agents = [TSPDirtAgent(factory, 0), TSPItemAgent(factory, 1), TSPTargetAgent(factory, 2)]
agents = [TSPCoinAgent(factory, 0)]
agents = [TSPDirtAgent(factory, 0), TSPCoinAgent(factory, 1)]
while not done:
a = [x.predict() for x in agents]
obs_type, _, _, done, info = factory.step(a)

View File

@ -1,41 +1,33 @@
from pathlib import Path
from random import randint
from tqdm import trange
from marl_factory_grid.algorithms.static.TSP_item_agent import TSPItemAgent
from marl_factory_grid.environment.factory import Factory
from marl_factory_grid.utils.logging.envmonitor import EnvMonitor
from marl_factory_grid.utils.logging.recorder import EnvRecorder
from marl_factory_grid.utils.plotting.plot_single_runs import plot_single_run
from marl_factory_grid.utils.tools import ConfigExplainer
if __name__ == '__main__':
# Render at each step?
render = True
run_path = Path('study_out')
run_path = Path('../study_out')
render = True
monitor = True
record = True
# Path to config File
path = Path('marl_factory_grid/configs/_obs_test.yaml')
path = Path('../marl_factory_grid/configs/test_config.yaml')
# Env Init
factory = Factory(path)
# RL learn Loop
for episode in trange(10):
_ = factory.reset()
done = False
if render:
factory.render()
action_spaces = factory.action_space
agents = [TSPItemAgent(factory, 0)]
while not done:
a = [randint(0, x.n - 1) for x in action_spaces]
a = [x.predict() for x in agents]
obs_type, _, _, done, info = factory.step(a)
if render:
factory.render()
if done:
print(f'Episode {episode} done...')
break
print('Done!!! Goodbye....')

82
main.py Normal file
View File

@ -0,0 +1,82 @@
from marl_factory_grid.algorithms.marl.RL_runner import rerun_coin_quadrant_agent1_training, \
rerun_two_rooms_agent1_training, rerun_two_rooms_agent2_training, coin_quadrant_multi_agent_rl_eval, \
two_rooms_multi_agent_rl_eval
from marl_factory_grid.algorithms.static.TSP_runner import coin_quadrant_multi_agent_tsp_eval, \
two_rooms_multi_agent_tsp_eval
###### Coin-quadrant environment ######
def coin_quadrant_single_agent_training():
""" Rerun training of RL-agent in coins_quadrant environment.
The trained model and additional training metrics are saved in the study_out folder. """
rerun_coin_quadrant_agent1_training()
def coin_quadrant_RL_multi_agent_eval_emergent():
""" Rerun multi-agent evaluation of RL-agents in coins_quadrant environment,
with occurring emergent phenomenon. Evaluation takes trained models from study_out/run0 for both agents."""
coin_quadrant_multi_agent_rl_eval(emergent_phenomenon=True)
def coin_quadrant_RL_multi_agent_eval_prevented():
""" Rerun multi-agent evaluation of RL-agents in coins_quadrant environment,
with emergence prevention mechanism. Evaluation takes trained models from study_out/run0 for both agents."""
coin_quadrant_multi_agent_rl_eval(emergent_phenomenon=False)
def coin_quadrant_TSP_multi_agent_eval_emergent():
""" Rerun multi-agent evaluation of TSP-agents in coins_quadrant environment,
with occurring emergent phenomenon. """
coin_quadrant_multi_agent_tsp_eval(emergent_phenomenon=True)
def coin_quadrant_TSP_multi_agent_eval_prevented():
""" Rerun multi-agent evaluation of TSP-agents in coins_quadrant environment,
with emergence prevention mechanism. """
coin_quadrant_multi_agent_tsp_eval(emergent_phenomenon=False)
###### Two-rooms environment ######
def two_rooms_agent1_training():
""" Rerun training of left RL-agent in two_rooms environment.
The trained model and additional training metrics are saved in the study_out folder. """
rerun_two_rooms_agent1_training()
def two_rooms_agent2_training():
""" Rerun training of right RL-agent in two_rooms environment.
The trained model and additional training metrics are saved in the study_out folder. """
rerun_two_rooms_agent2_training()
def two_rooms_RL_multi_agent_eval_emergent():
""" Rerun multi-agent evaluation of RL-agents in two_rooms environment, with
occurring emergent phenomenon. Evaluation takes trained models
from study_out/run1 for agent1 and study_out/run2 for agent2. """
two_rooms_multi_agent_rl_eval(emergent_phenomenon=True)
def two_rooms_RL_multi_agent_eval_prevented():
""" Rerun multi-agent evaluation of RL-agents in two_rooms environment, with
emergence prevention mechanism. Evaluation takes trained models
from study_out/run1 for agent1 and study_out/run2 for agent2. """
two_rooms_multi_agent_rl_eval(emergent_phenomenon=False)
def two_rooms_TSP_multi_agent_eval_emergent():
""" Rerun multi-agent evaluation of TSP-agents in two_rooms environment, with
occurring emergent phenomenon. """
two_rooms_multi_agent_tsp_eval(emergent_phenomenon=True)
def two_rooms_TSP_multi_agent_eval_prevented():
""" Rerun multi-agent evaluation of TSP-agents in two_rooms environment, with
emergence prevention mechanism. """
two_rooms_multi_agent_tsp_eval(emergent_phenomenon=False)
if __name__ == '__main__':
# Select any of the above functions to rerun the respective part
# from our evaluation section of the paper
coin_quadrant_RL_multi_agent_eval_prevented()

View File

@ -1,4 +1,3 @@
from .quickstart import init
from marl_factory_grid.environment.factory import Factory
"""
Main module of the 'rl-factory-grid'-environment.

View File

@ -0,0 +1,80 @@
from pathlib import Path
from marl_factory_grid.algorithms.marl.a2c_coin import A2C
from marl_factory_grid.algorithms.marl.utils import get_algorithms_marl_path
from marl_factory_grid.algorithms.utils import load_yaml_file
####### Training routines ######
def rerun_coin_quadrant_agent1_training():
train_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/coin_quadrant_train_config.yaml')
eval_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/coin_quadrant_eval_config.yaml')
train_cfg = load_yaml_file(train_cfg_path)
eval_cfg = load_yaml_file(eval_cfg_path)
print("Training phase")
agent = A2C(train_cfg=train_cfg, eval_cfg=eval_cfg, mode="train")
agent.train_loop()
print("Evaluation phase")
agent.eval_loop("coin_quadrant", n_episodes=1)
def two_rooms_training(max_steps, agent_name):
train_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/two_rooms_train_config.yaml')
eval_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/two_rooms_eval_config.yaml')
train_cfg = load_yaml_file(train_cfg_path)
eval_cfg = load_yaml_file(eval_cfg_path)
# train_cfg["algorithm"]["max_steps"] = max_steps
train_cfg["env"]["env_name"] = f"marl/single_agent_configs/two_rooms_{agent_name}_train_config"
eval_cfg["env"]["env_name"] = f"marl/single_agent_configs/two_rooms_{agent_name}_eval_config"
print("Training phase")
agent = A2C(train_cfg=train_cfg, eval_cfg=eval_cfg, mode="train")
agent.train_loop()
print("Evaluation phase")
agent.eval_loop("two_rooms", n_episodes=1)
def rerun_two_rooms_agent1_training():
two_rooms_training(max_steps=190000, agent_name="agent1")
def rerun_two_rooms_agent2_training():
two_rooms_training(max_steps=260000, agent_name="agent2")
####### Eval routines ########
def single_agent_eval(config_name, run_folder_name):
eval_cfg_path = Path(f'./marl_factory_grid/algorithms/marl/single_agent_configs/{config_name}_eval_config.yaml')
eval_cfg = load_yaml_file(eval_cfg_path)
# No train_cfg is passed here; in eval mode the training environment is not used
agent = A2C(eval_cfg=eval_cfg, mode="eval")
print("Evaluation phase")
agent.load_agents(config_name, [run_folder_name])
agent.eval_loop(config_name, 1)
def multi_agent_eval(config_name, runs, emergent_phenomenon=False):
eval_cfg_path = Path(f'{get_algorithms_marl_path()}/multi_agent_configs/{config_name}' +
f'_eval_config{"_emergent" if emergent_phenomenon else ""}.yaml')
eval_cfg = load_yaml_file(eval_cfg_path)
# No train_cfg is passed here; in eval mode the training environment is not used
agent = A2C(eval_cfg=eval_cfg, mode="eval")
print("Evaluation phase")
agent.load_agents(config_name, runs)
agent.eval_loop(config_name, 1)
def coin_quadrant_multi_agent_rl_eval(emergent_phenomenon):
# Using an empty list for runs indicates that the default agents in algorithms/agent_models should be used.
# If you want to use different agents that were obtained by running the training with a different seed, you can
# load these agents by inserting the names of the runs in study_out/ into the runs list, e.g. ["run1", "run2"]
multi_agent_eval("coin_quadrant", [], emergent_phenomenon)
def two_rooms_multi_agent_rl_eval(emergent_phenomenon):
# Using an empty list for runs indicates that the default agents in algorithms/agent_models should be used.
# If you want to use different agents that were obtained by running the training with a different seed, you can
# load these agents by inserting the names of the runs in study_out/ into the runs list, e.g. ["run1", "run2"]
multi_agent_eval("two_rooms", [], emergent_phenomenon)

View File

@ -0,0 +1 @@

View File

@ -1,53 +1,66 @@
import os
import pickle
import torch
from typing import Union, List
import numpy as np
from tqdm import tqdm
from marl_factory_grid.algorithms.rl.base_a2c import PolicyGradient
from marl_factory_grid.algorithms.rl.constants import Names
from marl_factory_grid.algorithms.rl.utils import transform_observations, _as_torch, is_door_close, \
from marl_factory_grid.algorithms.marl.base_a2c import PolicyGradient, cumulate_discount
from marl_factory_grid.algorithms.marl.constants import Names
from marl_factory_grid.algorithms.marl.utils import transform_observations, _as_torch, is_door_close, \
get_coin_piles_positions, update_target_pile, update_ordered_coin_piles, get_all_collected_coin_piles, \
distribute_indices, set_agents_spawnpoints, get_ordered_coin_piles, handle_finished_episode, save_configs, \
save_agent_models, get_all_observations, get_agents_positions
from marl_factory_grid.algorithms.utils import add_env_props
from marl_factory_grid.utils.plotting.plot_single_runs import plot_action_maps, plot_reward_development, \
create_info_maps
save_agent_models, get_all_observations, get_agents_positions, has_low_change_phase_started, significant_deviation, \
get_agent_models_path
from marl_factory_grid.algorithms.utils import add_env_props, get_study_out_path
from marl_factory_grid.utils.plotting.plot_single_runs import plot_action_maps, plot_return_development, \
create_info_maps, plot_return_development_change
nms = Names
ListOrTensor = Union[List, torch.Tensor]
class A2C:
def __init__(self, train_cfg, eval_cfg):
self.results_path = None
self.agents = None
self.act_dim = None
self.obs_dim = None
self.factory = add_env_props(train_cfg)
def __init__(self, train_cfg=None, eval_cfg=None, mode="train"):
self.mode = mode
if mode == nms.TRAIN:
self.train_factory = add_env_props(train_cfg)
self.train_cfg = train_cfg
self.n_agents = train_cfg[nms.ENV][nms.N_AGENTS]
else:
self.n_agents = eval_cfg[nms.ENV][nms.N_AGENTS]
self.eval_factory = add_env_props(eval_cfg)
self.__training = True
self.train_cfg = train_cfg
self.eval_cfg = eval_cfg
self.cfg = train_cfg
self.n_agents = train_cfg[nms.ENV][nms.N_AGENTS]
self.setup()
self.reward_development = []
self.action_probabilities = {agent_idx: [] for agent_idx in range(self.n_agents)}
def setup(self):
""" Initialize agents and create entry for run results according to configuration """
if self.mode == "train":
self.cfg = self.train_cfg
self.factory = self.train_factory
self.gamma = self.cfg[nms.ALGORITHM][nms.GAMMA]
else:
self.cfg = self.eval_cfg
self.factory = self.eval_factory
self.gamma = 0.99
seed = self.cfg[nms.ALGORITHM][nms.SEED]
print("Algorithm Seed: ", seed)
if seed == -1:
seed = np.random.choice(range(1000))
print("Algorithm seed is -1. Pick random seed: ", seed)
self.obs_dim = 2 + 2 * len(get_coin_piles_positions(self.factory)) if self.cfg[nms.ALGORITHM][
nms.PILE_OBSERVABILITY] == nms.ALL else 4
self.act_dim = 4 # The 4 movement directions
self.agents = [PolicyGradient(self.factory, agent_id=i, obs_dim=self.obs_dim, act_dim=self.act_dim) for i in
self.agents = [PolicyGradient(self.factory, seed=seed, gamma=self.gamma, agent_id=i, obs_dim=self.obs_dim, act_dim=self.act_dim) for i in
range(self.n_agents)]
if self.cfg[nms.ENV][nms.SAVE_AND_LOG]:
# Define study_out_path and check if it exists
base_dir = os.path.dirname(os.path.abspath(__file__)) # Directory of the script
study_out_path = os.path.join(base_dir, '../../../study_out')
study_out_path = os.path.abspath(study_out_path)
study_out_path = get_study_out_path()
if not os.path.exists(study_out_path):
raise FileNotFoundError(f"The directory {study_out_path} does not exist.")
@ -62,56 +75,86 @@ class A2C:
# Save settings in results folder
save_configs(self.results_path, self.cfg, self.factory.conf, self.eval_factory.conf)
def set_cfg(self, eval=False):
if eval:
self.cfg = self.eval_cfg
else:
self.cfg = self.train_cfg
def load_agents(self, runs_list):
def load_agents(self, config_name, runs_list):
""" Initialize networks with parameters of already trained agents """
for idx, run in enumerate(runs_list):
run_path = f"./study_out/{run}"
self.agents[idx].pi.load_model_parameters(f"{run_path}/PolicyNet_model_parameters.pth")
self.agents[idx].vf.load_model_parameters(f"{run_path}/ValueNet_model_parameters.pth")
if len(runs_list) == 0 or runs_list is None:
if config_name == "coin_quadrant":
for idx in range(self.n_agents):
self.agents[idx].pi.load_model_parameters(f"{get_agent_models_path()}/PolicyNet_model_parameters_coin_quadrant.pth")
self.agents[idx].vf.load_model_parameters(f"{get_agent_models_path()}/ValueNet_model_parameters_coin_quadrant.pth")
elif config_name == "two_rooms":
for idx in range(self.n_agents):
self.agents[idx].pi.load_model_parameters(f"{get_agent_models_path()}/PolicyNet_model_parameters_two_rooms_agent{idx+1}.pth")
self.agents[idx].vf.load_model_parameters(f"{get_agent_models_path()}/ValueNet_model_parameters_two_rooms_agent{idx+1}.pth")
else:
print("No such config does exist! Abort...")
else:
for idx, run in enumerate(runs_list):
run_path = f"./study_out/{run}"
self.agents[idx].pi.load_model_parameters(f"{run_path}/PolicyNet_model_parameters.pth")
self.agents[idx].vf.load_model_parameters(f"{run_path}/ValueNet_model_parameters.pth")
@torch.no_grad()
def train_loop(self):
""" Function for training agents """
env = self.factory
n_steps, max_steps = [self.cfg[nms.ALGORITHM][k] for k in [nms.N_STEPS, nms.MAX_STEPS]]
n_steps, max_steps = [self.train_cfg[nms.ALGORITHM][k] for k in [nms.N_STEPS, nms.MAX_STEPS]]
global_steps, episode = 0, 0
indices = distribute_indices(env, self.cfg, self.n_agents)
indices = distribute_indices(env, self.train_cfg, self.n_agents)
coin_piles_positions = get_coin_piles_positions(env)
target_pile = [partition[0] for partition in
indices] # list of pointers that point to the current target pile for each agent
collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
low_change_phase_start_episode = -1
episode_rewards_development = []
return_change_development = []
pbar = tqdm(total=max_steps)
while global_steps < max_steps:
loop_condition = True if self.train_cfg[nms.ALGORITHM][nms.EARLY_STOPPING] else global_steps < max_steps
while loop_condition:
_ = env.reset()
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
if self.train_cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
set_agents_spawnpoints(env, self.n_agents)
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.cfg, self.n_agents)
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.train_cfg, self.n_agents)
# Reset current target pile at episode begin if all piles have to be collected in one episode
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.ALL:
if self.train_cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.ALL:
target_pile = [partition[0] for partition in indices]
collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
episode_rewards_development.append([])
# Supply each agent with its local observation
obs = transform_observations(env, ordered_coin_piles, target_pile, self.cfg, self.n_agents)
done, rew_log = [False] * self.n_agents, 0
obs = transform_observations(env, ordered_coin_piles, target_pile, self.train_cfg, self.n_agents)
done, ep_return = [False] * self.n_agents, 0
if self.train_cfg[nms.ALGORITHM][nms.EARLY_STOPPING]:
if len(return_change_development) > self.train_cfg[nms.ALGORITHM][
nms.LAST_N_EPISODES] and low_change_phase_start_episode == -1 and has_low_change_phase_started(
return_change_development, self.train_cfg[nms.ALGORITHM][nms.LAST_N_EPISODES],
self.train_cfg[nms.ALGORITHM][nms.MEAN_TARGET_CHANGE]):
low_change_phase_start_episode = len(return_change_development)
print(low_change_phase_start_episode)
# Check if requirements for early stopping are met
if low_change_phase_start_episode != -1 and significant_deviation(return_change_development, low_change_phase_start_episode):
print(f"Early Stopping in Episode: {global_steps} because of significant deviation.")
break
if low_change_phase_start_episode != -1 and (len(return_change_development) - low_change_phase_start_episode) >= 1000:
print(f"Early Stopping in Episode: {global_steps} because of episode time limit")
break
if low_change_phase_start_episode != -1 and global_steps >= max_steps:
print(f"Early Stopping in Episode: {global_steps} because of global steps time limit")
break
while not all(done):
action = self.use_door_or_move(env, obs, collected_coin_piles) \
if nms.DOORS in env.state.entities.keys() else self.get_actions(obs)
_, next_obs, reward, done, info = env.step(action)
next_obs = transform_observations(env, ordered_coin_piles, target_pile, self.cfg, self.n_agents)
next_obs = transform_observations(env, ordered_coin_piles, target_pile, self.train_cfg, self.n_agents)
# Handle case where agent is on field with coin
reward, done = self.handle_coin(env, collected_coin_piles, ordered_coin_piles, target_pile, indices,
reward, done)
reward, done, self.train_cfg)
if n_steps != 0 and (global_steps + 1) % n_steps == 0: done = True
@ -122,50 +165,67 @@ class A2C:
agent._episode[-1] = (next_obs[ag_i], action[ag_i], reward[ag_i], agent._episode[-1][-1])
# Visualize state update
if self.cfg[nms.ENV][nms.TRAIN_RENDER]: env.render()
if self.train_cfg[nms.ENV][nms.TRAIN_RENDER]: env.render()
obs = next_obs
if all(done): handle_finished_episode(obs, self.agents, self.cfg)
global_steps += 1
rew_log += sum(reward)
episode_rewards_development[-1].extend(reward)
if global_steps >= max_steps: break
if all(done):
handle_finished_episode(obs, self.agents, self.train_cfg)
break
self.reward_development.append(rew_log)
if global_steps >= max_steps: break
return_change_development.append(
sum(episode_rewards_development[-1]) - sum(episode_rewards_development[-2])
if len(episode_rewards_development) > 1 else 0.0)
episode += 1
pbar.update(global_steps - pbar.n)
pbar.close()
if self.cfg[nms.ENV][nms.SAVE_AND_LOG]:
plot_reward_development(self.reward_development, self.results_path)
create_info_maps(env, get_all_observations(env, self.cfg, self.n_agents),
if self.train_cfg[nms.ENV][nms.SAVE_AND_LOG]:
return_development = [np.sum(rewards) for rewards in episode_rewards_development]
discounted_return_development = [np.sum([reward * pow(self.gamma, i) for i, reward in enumerate(ep_rewards)]) for ep_rewards in episode_rewards_development]
plot_return_development(return_development, self.results_path)
plot_return_development(discounted_return_development, self.results_path, discounted=True)
plot_return_development_change(return_change_development, self.results_path)
create_info_maps(env, get_all_observations(env, self.train_cfg, self.n_agents),
get_coin_piles_positions(env), self.results_path, self.agents, self.act_dim, self)
metrics_data = {"episode_rewards_development": episode_rewards_development,
"return_development": return_development,
"discounted_return_development": discounted_return_development,
"return_change_development": return_change_development}
with open(f"{self.results_path}/metrics", "wb") as pickle_file:
pickle.dump(metrics_data, pickle_file)
save_agent_models(self.results_path, self.agents)
plot_action_maps(env, [self], self.results_path)
@torch.inference_mode(True)
def eval_loop(self, n_episodes):
def eval_loop(self, config_name, n_episodes):
""" Function for performing inference """
env = self.eval_factory
self.set_cfg(eval=True)
episode, results = 0, []
coin_piles_positions = get_coin_piles_positions(env)
indices = distribute_indices(env, self.cfg, self.n_agents)
if config_name == "coin_quadrant": print("Coin Piles positions", coin_piles_positions)
indices = distribute_indices(env, self.eval_cfg, self.n_agents)
target_pile = [partition[0] for partition in
indices] # list of pointers that point to the current target pile for each agent
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.DISTRIBUTED:
if self.eval_cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.DISTRIBUTED:
collected_coin_piles = [{coin_piles_positions[idx]: False for idx in indices[i]} for i in
range(self.n_agents)]
else: collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
else:
collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
collected_coin_piles_per_step = []
while episode < n_episodes:
_ = env.reset()
set_agents_spawnpoints(env, self.n_agents)
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
if self.eval_cfg[nms.ENV][nms.EVAL_RENDER]:
# Don't render auxiliary piles
if self.cfg[nms.ALGORITHM][nms.AUXILIARY_PILES]:
if self.eval_cfg[nms.ALGORITHM][nms.AUXILIARY_PILES]:
auxiliary_piles = [pile for idx, pile in enumerate(env.state.entities[nms.COIN_PILES]) if
idx % 2 == 0]
for pile in auxiliary_piles:
@ -174,19 +234,23 @@ class A2C:
env._renderer.fps = 5 # Slow down agent movement
# Reset current target pile at episode begin if all piles have to be collected in one episode
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] in [nms.ALL, nms.DISTRIBUTED, nms.SHARED]:
if self.eval_cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] in [nms.ALL, nms.DISTRIBUTED, nms.SHARED]:
target_pile = [partition[0] for partition in indices]
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.DISTRIBUTED:
if self.eval_cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.DISTRIBUTED:
collected_coin_piles = [{coin_piles_positions[idx]: False for idx in indices[i]} for i in
range(self.n_agents)]
else: collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
else:
collected_coin_piles = [{pos: False for pos in coin_piles_positions} for _ in range(self.n_agents)]
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.cfg, self.n_agents)
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.eval_cfg, self.n_agents)
# Supply each agent with its local observation
obs = transform_observations(env, ordered_coin_piles, target_pile, self.cfg, self.n_agents)
obs = transform_observations(env, ordered_coin_piles, target_pile, self.eval_cfg, self.n_agents)
done, rew_log, eps_rew = [False] * self.n_agents, 0, torch.zeros(self.n_agents)
collected_coin_piles_per_step.append([])
ep_steps = 0
while not all(done):
action = self.use_door_or_move(env, obs, collected_coin_piles, det=True) \
if nms.DOORS in env.state.entities.keys() else self.execute_policy(obs, env,
@ -195,20 +259,44 @@ class A2C:
# Handle case where agent is on field with coin
reward, done = self.handle_coin(env, collected_coin_piles, ordered_coin_piles, target_pile, indices,
reward, done)
reward, done, self.eval_cfg)
ordered_coin_piles = get_ordered_coin_piles(env, collected_coin_piles, self.eval_cfg, self.n_agents)
# Get transformed next_obs that might have been updated because of handle_coin
next_obs = transform_observations(env, ordered_coin_piles, target_pile, self.cfg, self.n_agents)
next_obs = transform_observations(env, ordered_coin_piles, target_pile, self.eval_cfg, self.n_agents)
done = [done] * self.n_agents if isinstance(done, bool) else done
if self.cfg[nms.ENV][nms.EVAL_RENDER]: env.render()
if self.eval_cfg[nms.ENV][nms.EVAL_RENDER]: env.render()
obs = next_obs
episode += 1
# Count the overall number of cleaned coin piles in each step
collected_piles = 0
for dict in collected_coin_piles:
for value in dict.values():
if value:
collected_piles += 1
collected_coin_piles_per_step[-1].append(collected_piles)
# -------------------------------------- HELPER FUNCTIONS ------------------------------------------------- #
ep_steps += 1
episode += 1
print("Number of environment steps:", ep_steps)
if config_name == "coin_quadrant":
print("Collected coins per step:", collected_coin_piles_per_step)
else:
# For the RL agent, we encode the flags internally as coins as well.
# Also, we have to subtract the auxiliary pile in the emergence prevention mechanism case
print("Reached flags per step:", [[max(0, coin_pile - 1) for coin_pile in ele] for ele in collected_coin_piles_per_step])
if self.eval_cfg[nms.ENV][nms.SAVE_AND_LOG]:
metrics_data = {"collected_coin_piles_per_step": collected_coin_piles_per_step}
with open(f"{self.results_path}/metrics", "wb") as pickle_file:
pickle.dump(metrics_data, pickle_file)
########## Helper functions ########
def get_actions(self, observations) -> ListOrTensor:
""" Given local observations, get actions for both agents """
@ -247,14 +335,18 @@ class A2C:
a.name == nms.USE_DOOR))
# Don't include action in agent experience
else:
if det: action.append(int(agent.pi(agent_obs, det=True)[0]))
else: action.append(int(agent.step(agent_obs)))
if det:
action.append(int(agent.pi(agent_obs, det=True)[0]))
else:
action.append(int(agent.step(agent_obs)))
else:
if det: action.append(int(agent.pi(agent_obs, det=True)[0]))
else: action.append(int(agent.step(agent_obs)))
if det:
action.append(int(agent.pi(agent_obs, det=True)[0]))
else:
action.append(int(agent.step(agent_obs)))
return action
def handle_coin(self, env, collected_coin_piles, ordered_coin_piles, target_pile, indices, reward, done):
def handle_coin(self, env, collected_coin_piles, ordered_coin_piles, target_pile, indices, reward, done, cfg):
""" Check if agent moved on field with coin. If that is the case collect coin automatically """
agents_positions = get_agents_positions(env, self.n_agents)
coin_piles_positions = get_coin_piles_positions(env)
@ -269,10 +361,10 @@ class A2C:
reward[idx] += 50
collected_coin_piles[idx][pos] = True
# Set pointer to next coin pile
update_target_pile(env, idx, target_pile, indices, self.cfg)
update_target_pile(env, idx, target_pile, indices, cfg)
update_ordered_coin_piles(idx, collected_coin_piles, ordered_coin_piles, env,
self.cfg, self.n_agents)
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.SINGLE:
cfg, self.n_agents)
if cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.SINGLE:
done = True
if all(collected_coin_piles[idx].values()):
# Reset collected_coin_piles indicator
@ -285,11 +377,15 @@ class A2C:
# Indicate that renderer can hide coin pile
coin_at_position = env.state[nms.COIN_PILES].by_pos(pos)
coin_at_position[0].set_new_amount(0)
"""
coin_at_position = env.state[nms.COIN_PILES].by_pos(pos)[0]
env.state[nms.COIN_PILES].delete_env_object(coin_at_position)
"""
if self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] in [nms.ALL, nms.DISTRIBUTED]:
if cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] in [nms.ALL, nms.DISTRIBUTED]:
if all([all(collected_coin_piles[i].values()) for i in range(self.n_agents)]):
done = True
elif self.cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.SHARED:
elif cfg[nms.ALGORITHM][nms.PILE_ALL_DONE] == nms.SHARED:
# End episode if both agents together have collected all coin piles
if all(get_all_collected_coin_piles(coin_piles_positions, collected_coin_piles, self.n_agents).values()):
done = True

View File

@ -1,755 +0,0 @@
import copy
import os
import random
import imageio # requires ffmpeg install on operating system and imageio-ffmpeg package for python
from scipy import signal
import matplotlib.pyplot as plt
import torch
from typing import Union, List, Dict
import numpy as np
from torch.distributions import Categorical
from marl_factory_grid.algorithms.marl.base_a2c import PolicyGradient, cumulate_discount
from marl_factory_grid.algorithms.marl.memory import MARLActorCriticMemory
from marl_factory_grid.algorithms.utils import add_env_props, instantiate_class
from pathlib import Path
from collections import deque
from marl_factory_grid.environment.actions import Noop
from marl_factory_grid.modules import Clean, DoorUse
from marl_factory_grid.utils.plotting.plot_single_runs import plot_action_maps
class Names:
REWARD = 'reward'
DONE = 'done'
ACTION = 'action'
OBSERVATION = 'observation'
LOGITS = 'logits'
HIDDEN_ACTOR = 'hidden_actor'
HIDDEN_CRITIC = 'hidden_critic'
AGENT = 'agent'
ENV = 'env'
ENV_NAME = 'env_name'
N_AGENTS = 'n_agents'
ALGORITHM = 'algorithm'
MAX_STEPS = 'max_steps'
N_STEPS = 'n_steps'
BUFFER_SIZE = 'buffer_size'
CRITIC = 'critic'
BATCH_SIZE = 'bnatch_size'
N_ACTIONS = 'n_actions'
TRAIN_RENDER = 'train_render'
EVAL_RENDER = 'eval_render'
nms = Names
ListOrTensor = Union[List, torch.Tensor]
class A2C:
def __init__(self, train_cfg, eval_cfg):
self.factory = add_env_props(train_cfg)
self.eval_factory = add_env_props(eval_cfg)
self.__training = True
self.train_cfg = train_cfg
self.eval_cfg = eval_cfg
self.cfg = train_cfg
self.n_agents = train_cfg[nms.AGENT][nms.N_AGENTS]
self.setup()
self.reward_development = []
self.action_probabilities = {agent_idx:[] for agent_idx in range(self.n_agents)}
def setup(self):
dirt_piles_positions = [self.factory.state.entities['DirtPiles'][pile_idx].pos for pile_idx in
range(len(self.factory.state.entities['DirtPiles']))]
if self.cfg[nms.ALGORITHM]["pile-observability"] == "all":
obs_dim = 2 + 2*len(dirt_piles_positions)
else:
obs_dim = 4
self.obs_dim = obs_dim
self.act_dim = 4
# act_dim=4, because we want the agent to only learn a routing problem
self.agents = [PolicyGradient(self.factory, agent_id=i, obs_dim=obs_dim, act_dim=self.act_dim) for i in range(self.n_agents)]
if self.cfg[nms.ENV]["save_and_log"]:
# Create results folder
runs = os.listdir("../study_out/")
run_numbers = [int(run[3:]) for run in runs if run[:3] == "run"]
next_run_number = max(run_numbers)+1 if run_numbers else 0
self.results_path = f"../study_out/run{next_run_number}"
os.mkdir(self.results_path)
# Save settings in results folder
self.save_configs()
if self.cfg[nms.ENV]["record"]:
self.recorder = imageio.get_writer(f'{self.results_path}/pygame_recording.mp4', fps=5)
def set_cfg(self, eval=False):
if eval:
self.cfg = self.eval_cfg
else:
self.cfg = self.train_cfg
@classmethod
def _as_torch(cls, x):
if isinstance(x, np.ndarray):
return torch.from_numpy(x)
elif isinstance(x, List):
return torch.tensor(x)
elif isinstance(x, (int, float)):
return torch.tensor([x])
return x
def get_actions(self, observations) -> ListOrTensor:
# Given an observation, get actions for both agents
actions = [agent.step(self._as_torch(observations[ag_i]).view(-1).to(torch.float32)) for ag_i, agent in enumerate(self.agents)]
return actions
def execute_policy(self, observations, env, cleaned_dirt_piles) -> ListOrTensor:
# Use deterministic policy for inference
actions = [agent.policy(self._as_torch(observations[ag_i]).view(-1).to(torch.float32)) for ag_i, agent in enumerate(self.agents)]
for agent_idx in range(self.n_agents):
if all(cleaned_dirt_piles[agent_idx].values()):
actions[agent_idx] = np.array(next(action_i for action_i, a in enumerate(env.state["Agent"][agent_idx].actions) if a.name == "Noop"))
return actions
def transform_observations(self, env, ordered_dirt_piles, target_pile):
""" Assumes that agent has observations -DirtPiles and -Self """
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
if self.cfg[nms.ALGORITHM]["pile-observability"] == "all":
trans_obs = [torch.zeros(2+2*len(ordered_dirt_piles[0])) for _ in range(len(agent_positions))]
else:
# Only show current target pile
trans_obs = [torch.zeros(4) for _ in range(len(agent_positions))]
for i, pos in enumerate(agent_positions):
agent_x, agent_y = pos[0], pos[1]
trans_obs[i][0] = agent_x
trans_obs[i][1] = agent_y
idx = 2
if self.cfg[nms.ALGORITHM]["pile-observability"] == "all":
for pile_pos in ordered_dirt_piles[i]:
trans_obs[i][idx] = pile_pos[0]
trans_obs[i][idx + 1] = pile_pos[1]
idx += 2
else:
trans_obs[i][2] = ordered_dirt_piles[i][target_pile[i]][0]
trans_obs[i][3] = ordered_dirt_piles[i][target_pile[i]][1]
return trans_obs
def get_all_observations(self, env):
dirt_piles_positions = [env.state.entities['DirtPiles'][pile_idx].pos for pile_idx in
range(len(env.state.entities['DirtPiles']))]
if self.cfg[nms.ALGORITHM]["pile-observability"] == "all":
obs = [torch.zeros(2 + 2 * len(dirt_piles_positions))]
observations = [[]]
# Fill in pile positions
idx = 2
for pile_pos in dirt_piles_positions:
obs[0][idx] = pile_pos[0]
obs[0][idx + 1] = pile_pos[1]
idx += 2
else:
            # Create one observation layer of the map for each dirt pile
obs = [torch.zeros(4) for _ in range(self.n_agents) for _ in dirt_piles_positions]
observations = [[] for _ in dirt_piles_positions]
for idx, pile_pos in enumerate(dirt_piles_positions):
obs[idx][2] = pile_pos[0]
obs[idx][3] = pile_pos[1]
valid_agent_positions = env.state.entities.floorlist
#observations_shape = (max(t[0] for t in valid_agent_positions) + 2, max(t[1] for t in valid_agent_positions) + 2)
for idx, pos in enumerate(valid_agent_positions):
for obs_layer in range(len(obs)):
observation = copy.deepcopy(obs[obs_layer])
observation[0] = pos[0]
observation[1] = pos[1]
observations[obs_layer].append(observation)
return observations
def get_dirt_piles_positions(self, env):
return [env.state.entities['DirtPiles'][pile_idx].pos for pile_idx in range(len(env.state.entities['DirtPiles']))]
def get_ordered_dirt_piles(self, env, cleaned_dirt_piles, target_pile):
""" Each agent can have it's individual pile order """
ordered_dirt_piles = [[] for _ in range(self.n_agents)]
dirt_pile_positions = self.get_dirt_piles_positions(env)
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
for agent_idx in range(self.n_agents):
if self.cfg[nms.ALGORITHM]["pile-order"] in ["fixed", "agents"]:
ordered_dirt_piles[agent_idx] = dirt_pile_positions
elif self.cfg[nms.ALGORITHM]["pile-order"] == "random":
                ordered_dirt_piles[agent_idx] = list(dirt_pile_positions)
                random.shuffle(ordered_dirt_piles[agent_idx])
elif self.cfg[nms.ALGORITHM]["pile-order"] == "none":
ordered_dirt_piles[agent_idx] = None
elif self.cfg[nms.ALGORITHM]["pile-order"] in ["smart", "dynamic"]:
# Calculate distances for remaining unvisited dirt piles
remaining_target_piles = [pos for pos, value in cleaned_dirt_piles[agent_idx].items() if not value]
pile_distances = {pos:0 for pos in remaining_target_piles}
agent_pos = agent_positions[agent_idx]
for pos in remaining_target_piles:
pile_distances[pos] = np.abs(agent_pos[0] - pos[0]) + np.abs(agent_pos[1] - pos[1])
if self.cfg[nms.ALGORITHM]["pile-order"] == "smart":
# Check if there is an agent in line with any of the remaining dirt piles
for pile_pos in remaining_target_piles:
for other_pos in agent_positions:
if other_pos != agent_pos:
if agent_pos[0] == other_pos[0] == pile_pos[0] or agent_pos[1] == other_pos[1] == pile_pos[1]:
# Get the line between the agent and the goal
path = self.bresenham(agent_pos[0], agent_pos[1], pile_pos[0], pile_pos[1])
# Check if the entity lies on the path between the agent and the goal
if other_pos in path:
pile_distances[pile_pos] += np.abs(agent_pos[0] - other_pos[0]) + np.abs(agent_pos[1] - other_pos[1])
sorted_pile_distances = dict(sorted(pile_distances.items(), key=lambda item: item[1]))
# Insert already visited dirt piles
ordered_dirt_piles[agent_idx] = [pos for pos in dirt_pile_positions if pos not in remaining_target_piles]
# Fill up with sorted positions
for pos in sorted_pile_distances.keys():
ordered_dirt_piles[agent_idx].append(pos)
else:
print("Not a valid pile order option.")
exit()
return ordered_dirt_piles
def bresenham(self, x0, y0, x1, y1):
"""Bresenham's line algorithm to get the coordinates of a line between two points."""
dx = np.abs(x1 - x0)
dy = np.abs(y1 - y0)
sx = 1 if x0 < x1 else -1
sy = 1 if y0 < y1 else -1
err = dx - dy
coordinates = []
while True:
coordinates.append((x0, y0))
if x0 == x1 and y0 == y1:
break
e2 = 2 * err
if e2 > -dy:
err -= dy
x0 += sx
if e2 < dx:
err += dx
y0 += sy
return coordinates
def update_ordered_dirt_piles(self, agent_idx, cleaned_dirt_piles, ordered_dirt_piles, env, target_pile):
# Only update ordered_dirt_pile for agent that reached its target pile
updated_ordered_dirt_piles = self.get_ordered_dirt_piles(env, cleaned_dirt_piles, target_pile)
for i in range(len(ordered_dirt_piles[agent_idx])):
ordered_dirt_piles[agent_idx][i] = updated_ordered_dirt_piles[agent_idx][i]
def distribute_indices(self, env):
indices = []
n_dirt_piles = len(self.get_dirt_piles_positions(env))
if n_dirt_piles == 1 or self.cfg[nms.ALGORITHM]["pile-order"] in ["fixed", "random", "none", "dynamic", "smart"]:
indices = [[0] for _ in range(self.n_agents)]
else:
base_count = n_dirt_piles // self.n_agents
remainder = n_dirt_piles % self.n_agents
start_index = 0
for i in range(self.n_agents):
# Add an extra index to the first 'remainder' objects
end_index = start_index + base_count + (1 if i < remainder else 0)
indices.append(list(range(start_index, end_index)))
start_index = end_index
# Static form: auxiliary pile, primary pile, auxiliary pile, ...
# -> Starting with index 0 even piles are auxiliary piles, odd piles are primary piles
if self.cfg[nms.ALGORITHM]["auxiliary_piles"] and "Doors" in env.state.entities.keys():
door_positions = [door.pos for door in env.state.entities["Doors"]]
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
distances = {door_pos:[] for door_pos in door_positions}
# Calculate distance of every agent to every door
for door_pos in door_positions:
for agent_pos in agent_positions:
distances[door_pos].append(np.abs(door_pos[0] - agent_pos[0]) + np.abs(door_pos[1] - agent_pos[1]))
def duplicate_indices(lst, item):
return [i for i, x in enumerate(lst) if x == item]
# Get agent indices of agents with same distance to door
affected_agents = {door_pos:{} for door_pos in door_positions}
for door_pos in distances.keys():
dist = distances[door_pos]
dist_set = set(dist)
for d in dist_set:
affected_agents[door_pos][str(d)] = duplicate_indices(dist, d)
# TODO: Make generic for multiple doors
updated_indices = []
if len(affected_agents[door_positions[0]]) == 0:
# Remove auxiliary piles for all agents
updated_indices = [[ele for ele in lst if ele % 2 != 0] for lst in indices]
else:
for distance, agent_indices in affected_agents[door_positions[0]].items():
# Pick random agent to keep auxiliary pile and remove it for all others
#selected_agent = np.random.choice(agent_indices)
selected_agent = 0
for agent_idx in agent_indices:
if agent_idx == selected_agent:
updated_indices.append(indices[agent_idx])
else:
updated_indices.append([ele for ele in indices[agent_idx] if ele % 2 != 0])
indices = updated_indices
return indices
def update_target_pile(self, env, agent_idx, target_pile, indices):
if self.cfg[nms.ALGORITHM]["pile-order"] in ["fixed", "random", "none", "dynamic", "smart"]:
if target_pile[agent_idx] + 1 < len(self.get_dirt_piles_positions(env)):
target_pile[agent_idx] += 1
else:
target_pile[agent_idx] = 0
else:
if target_pile[agent_idx] + 1 in indices[agent_idx]:
target_pile[agent_idx] += 1
def door_is_close(self, env, agent_idx):
neighbourhood = [y for x in env.state.entities.neighboring_positions(env.state["Agent"][agent_idx].pos)
for y in env.state.entities.pos_dict[x] if "Door" in y.name]
if neighbourhood:
return neighbourhood[0]
def use_door_or_move(self, env, obs, cleaned_dirt_piles, target_pile, det=False):
action = []
for agent_idx, agent in enumerate(self.agents):
agent_obs = self._as_torch((obs)[agent_idx]).view(-1).to(torch.float32)
# If agent already reached its target
if all(cleaned_dirt_piles[agent_idx].values()):
action.append(next(action_i for action_i, a in enumerate(env.state["Agent"][agent_idx].actions) if a.name == "Noop"))
if not det:
# Include agent experience entry manually
agent._episode.append((None, None, None, agent.vf(agent_obs)))
else:
if door := self.door_is_close(env, agent_idx):
if door.is_closed:
action.append(next(action_i for action_i, a in enumerate(env.state["Agent"][agent_idx].actions) if a.name == "use_door"))
# Don't include action in agent experience
else:
if det:
action.append(int(agent.pi(agent_obs, det=True)[0]))
else:
action.append(int(agent.step(agent_obs)))
else:
if det:
action.append(int(agent.pi(agent_obs, det=True)[0]))
else:
action.append(int(agent.step(agent_obs)))
return action
def reward_distance(self, env, obs, target_pile, reward):
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
# Give a negative reward for every step that keeps agent from getting closer to currently selected target pile/ closest pile
for idx, pos in enumerate(agent_positions):
last_pos = (int(obs[idx][0]), int(obs[idx][1].item()))
target_pile_pos = self.get_dirt_piles_positions(env)[target_pile[idx]]
last_distance = np.abs(target_pile_pos[0] - last_pos[0]) + np.abs(target_pile_pos[1] - last_pos[1])
new_distance = np.abs(target_pile_pos[0] - pos[0]) + np.abs(target_pile_pos[1] - pos[1])
if new_distance >= last_distance:
reward[idx] -= 0.05 # 0.05
return reward
def punish_entering_same_field(self, next_obs, passed_fields, reward):
# Give a high negative reward if agent enters same field twice
for idx in range(self.n_agents):
if (next_obs[idx][0], next_obs[idx][1]) in passed_fields[idx]:
reward[idx] += -0.1
else:
passed_fields[idx].append((next_obs[idx][0], next_obs[idx][1]))
def handle_dirt_quadrant_observation_bugs(self, obs, env):
try:
# Check that dirt position and amount are still correct
            assert np.where(obs[0][0] == 0.5)[0][0] == 1 and np.where(obs[0][0] == 0.5)[1][0] == 1
except:
print("Missing dirt pile")
# Manually place dirt on defined position
obs[0][0][1][1] = 0.5
try:
# Check that self still returns a valid agent position on the map
assert np.where(obs[0][1] == 1)[0][0] and np.where(obs[0][1] == 1)[1][0]
except:
# Place agent manually in obs object on last known position
x, y = env.state.moving_entites[0].pos[0], env.state.moving_entites[0].pos[1]
obs[0][1][x][y] = 1
print("Missing agent position")
def get_all_cleaned_dirt_piles(self, dirt_piles_positions, cleaned_dirt_piles):
meta_cleaned_dirt_piles = {pos: False for pos in dirt_piles_positions}
for agent_idx in range(self.n_agents):
for (pos, cleaned) in cleaned_dirt_piles[agent_idx].items():
if cleaned:
meta_cleaned_dirt_piles[pos] = True
return meta_cleaned_dirt_piles
def handle_dirt(self, env, cleaned_dirt_piles, ordered_dirt_piles, target_pile, indices, reward, done):
# Check if agent moved on field with dirt. If that is the case collect dirt automatically
agent_positions = [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)]
dirt_piles_positions = self.get_dirt_piles_positions(env)
if any([True for pos in agent_positions if pos in dirt_piles_positions]):
# Do Noop for agent that does not collect dirt
"""action = [np.array(5), np.array(5)]
# Execute real step in environment
for idx, pos in enumerate(agent_positions):
if pos in cleaned_dirt_piles[idx].keys() and not cleaned_dirt_piles[idx][pos]:
action[idx] = np.array(4)
# Collect dirt
_, next_obs, reward, done, info = env.step(action)
cleaned_dirt_piles[idx][pos] = True
break"""
# Only simulate collecting the dirt
for idx, pos in enumerate(agent_positions):
if pos in cleaned_dirt_piles[idx].keys() and not cleaned_dirt_piles[idx][pos]:
# print(env.state.entities["Agent"][idx], pos, idx, target_pile, ordered_dirt_piles)
# If dirt piles should be cleaned in a specific order
if ordered_dirt_piles[idx]:
if pos == ordered_dirt_piles[idx][target_pile[idx]]:
reward[idx] += 50 # 1
cleaned_dirt_piles[idx][pos] = True
# Set pointer to next dirt pile
self.update_target_pile(env, idx, target_pile, indices)
self.update_ordered_dirt_piles(idx, cleaned_dirt_piles, ordered_dirt_piles, env, target_pile)
if self.cfg[nms.ALGORITHM]["pile_all_done"] == "single":
done = True
if all(cleaned_dirt_piles[idx].values()):
# Reset cleaned_dirt_piles indicator
for pos in dirt_piles_positions:
cleaned_dirt_piles[idx][pos] = False
else:
reward[idx] += 50 # 1
cleaned_dirt_piles[idx][pos] = True
if self.cfg[nms.ALGORITHM]["pile_all_done"] in ["all", "distributed"]:
if all([all(cleaned_dirt_piles[i].values()) for i in range(self.n_agents)]):
done = True
elif self.cfg[nms.ALGORITHM]["pile_all_done"] == "shared":
# End episode if both agents together have cleaned all dirt piles
if all(self.get_all_cleaned_dirt_piles(dirt_piles_positions, cleaned_dirt_piles).values()):
done = True
return reward, done
def handle_finished_episode(self, obs):
with torch.inference_mode(False):
for ag_i, agent in enumerate(self.agents):
# Get states, actions, rewards and values from rollout buffer
data = agent.finish_episode()
# Chunk episode data, such that there will be no memory failure for very long episodes
chunks = self.split_into_chunks(data)
for (s, a, R, V) in chunks:
# Calculate discounted return and advantage
G = cumulate_discount(R, self.cfg[nms.ALGORITHM]["gamma"])
if self.cfg[nms.ALGORITHM]["advantage"] == "Reinforce":
A = G
elif self.cfg[nms.ALGORITHM]["advantage"] == "Advantage-AC":
A = G - V # Actor-Critic Advantages
elif self.cfg[nms.ALGORITHM]["advantage"] == "TD-Advantage-AC":
with torch.no_grad():
A = R + self.cfg[nms.ALGORITHM]["gamma"] * np.append(V[1:], agent.vf(
self._as_torch(obs[ag_i]).view(-1).to(
torch.float32)).numpy()) - V # TD Actor-Critic Advantages
else:
print("Not a valid advantage option.")
exit()
rollout = (torch.tensor(x.copy()).to(torch.float32) for x in (s, a, G, A))
# Update policy and value net of agent with experience from rollout buffer
agent.train(*rollout)
def split_into_chunks(self, data_tuple):
result = [data_tuple]
chunk_size = self.cfg[nms.ALGORITHM]["chunk-episode"]
if chunk_size > 0:
# Get the maximum length of the lists in the tuple to handle different lengths
max_length = max(len(lst) for lst in data_tuple)
# Prepare a list to store the result
result = []
# Split each list into chunks and add them to the result
for i in range(0, max_length, chunk_size):
# Create a sublist containing the ith chunk from each list
sublist = [lst[i:i + chunk_size] for lst in data_tuple if i < len(lst)]
result.append(sublist)
return result
def set_agent_spawnpoint(self, env):
for agent_idx in range(self.n_agents):
agent_name = list(env.state.agents_conf.keys())[agent_idx]
current_pos_pointer = env.state.agents_conf[agent_name]["pos_pointer"]
            # Making the reset dependent on the number of spawnpoints and not the number of dirt piles allows
# for having multiple subsequent spawnpoints with the same target pile
if current_pos_pointer == len(env.state.agents_conf[agent_name]['positions']) - 1:
env.state.agents_conf[agent_name]["pos_pointer"] = 0
else:
env.state.agents_conf[agent_name]["pos_pointer"] += 1
@torch.no_grad()
def train_loop(self):
env = self.factory
n_steps, max_steps = [self.cfg[nms.ALGORITHM][k] for k in [nms.N_STEPS, nms.MAX_STEPS]]
global_steps, episode = 0, 0
indices = self.distribute_indices(env)
dirt_piles_positions = self.get_dirt_piles_positions(env)
used_actions = {i:0 for i in range(len(env.state.entities["Agent"][0]._actions))} # Assume both agents have the same actions
        target_pile = [partition[0] for partition in indices]  # Pointer to the current target pile of each agent (agents may point to the same pile or to different piles)
cleaned_dirt_piles = [{pos: False for pos in dirt_piles_positions} for _ in range(self.n_agents)] # Have own dictionary for each agent
while global_steps < max_steps:
print(global_steps)
            obs = env.reset()  # Note: commenting out this reset can work better, but only if a fixed spawnpoint is given
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
self.set_agent_spawnpoint(env)
ordered_dirt_piles = self.get_ordered_dirt_piles(env, cleaned_dirt_piles, target_pile)
# Reset current target pile at episode begin if all piles have to be cleaned in one episode
if self.cfg[nms.ALGORITHM]["pile_all_done"] == "all":
target_pile = [partition[0] for partition in indices]
cleaned_dirt_piles = [{pos: False for pos in dirt_piles_positions} for _ in range(self.n_agents)]
"""passed_fields = [[] for _ in range(self.n_agents)]"""
"""obs = list(obs.values())"""
obs = self.transform_observations(env, ordered_dirt_piles, target_pile)
done, rew_log = [False] * self.n_agents, 0
print("Agents spawnpoints:", [env.state.moving_entites[agent_idx].pos for agent_idx in range(self.n_agents)])
print("Agents target piles:", target_pile)
print("Agents initial observation:", obs)
print("Agents cleaned dirt piles:", cleaned_dirt_piles)
            # Add Clean and Noop actions to agent actions so that they can be executed when the agent steps onto a dirt pile
"""for i in range(self.n_agents):
self.factory.state['Agent'][i].actions.extend([Clean(), Noop()])"""
while not all(done):
# 0="North", 1="East", 2="South", 3="West", 4="Clean", 5="Noop"
action = self.use_door_or_move(env, obs, cleaned_dirt_piles, target_pile) \
if "Doors" in env.state.entities.keys() else self.get_actions(obs)
used_actions[int(action[0])] += 1
_, next_obs, reward, done, info = env.step(action)
if done:
print("DoneAtMaxStepsReached:", len(self.agents[0]._episode))
next_obs = self.transform_observations(env, ordered_dirt_piles, target_pile)
# Add small negative reward if agent has moved away from the target_pile
# reward = self.reward_distance(env, obs, target_pile, reward)
# Check and handle if agent is on field with dirt. This method can change the observation for the next step.
# If pile_all_done is "single", the episode ends if agents reached its target pile and the new episode begins
# with the updated observation. The observation that is saved to the rollout buffer, which resulted in reaching
# the target pile should not be updated before saving. Thus, the self.transform_observations call must happen
# before this method is called.
reward, done = self.handle_dirt(env, cleaned_dirt_piles, ordered_dirt_piles, target_pile, indices, reward, done)
if n_steps != 0 and (global_steps + 1) % n_steps == 0:
print("max_steps reached")
done = True
done = [done] * self.n_agents if isinstance(done, bool) else done
for ag_i, agent in enumerate(self.agents):
# For forced actions like door opening, we have to call the step function with this action, but
# since we are not allowed to exceed the dimensions range, we can't log the corresponding step info.
if action[ag_i] in range(self.act_dim):
# Add agent results into respective rollout buffers
agent._episode[-1] = (next_obs[ag_i], action[ag_i], reward[ag_i], agent._episode[-1][-1])
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
obs = next_obs
if all(done): self.handle_finished_episode(obs)
global_steps += 1
rew_log += sum(reward)
if global_steps >= max_steps:
break
print(f'reward at episode: {episode} = {rew_log}')
self.reward_development.append(rew_log)
episode += 1
self.plot_reward_development()
if self.cfg[nms.ENV]["save_and_log"]:
self.create_info_maps(env, used_actions)
self.save_agent_models()
plot_action_maps(env, [self], self.results_path)
@torch.inference_mode(True)
def eval_loop(self, n_episodes, render=False):
env = self.eval_factory
self.set_cfg(eval=True)
episode, results = 0, []
dirt_piles_positions = self.get_dirt_piles_positions(env)
indices = self.distribute_indices(env)
        target_pile = [partition[0] for partition in indices]  # Pointer to the current target pile of each agent (agents may point to the same pile or to different piles)
if self.cfg[nms.ALGORITHM]["pile_all_done"] == "distributed":
cleaned_dirt_piles = [{dirt_piles_positions[idx]: False for idx in indices[i]} for i in range(self.n_agents)]
else:
cleaned_dirt_piles = [{pos: False for pos in dirt_piles_positions} for _ in range(self.n_agents)]
while episode < n_episodes:
obs = env.reset()
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
if self.cfg[nms.ENV]["save_and_log"] and self.cfg[nms.ENV]["record"]:
env.set_recorder(self.recorder)
env.render()
env._renderer.fps = 5
self.set_agent_spawnpoint(env)
"""obs = list(obs.values())"""
# Reset current target pile at episode begin if all piles have to be cleaned in one episode
if self.cfg[nms.ALGORITHM]["pile_all_done"] in ["all", "distributed", "shared"]:
target_pile = [partition[0] for partition in indices]
if self.cfg[nms.ALGORITHM]["pile_all_done"] == "distributed":
cleaned_dirt_piles = [{dirt_piles_positions[idx]: False for idx in indices[i]} for i in range(self.n_agents)]
else:
cleaned_dirt_piles = [{pos: False for pos in dirt_piles_positions} for _ in range(self.n_agents)]
ordered_dirt_piles = self.get_ordered_dirt_piles(env, cleaned_dirt_piles, target_pile)
obs = self.transform_observations(env, ordered_dirt_piles, target_pile)
done, rew_log, eps_rew = [False] * self.n_agents, 0, torch.zeros(self.n_agents)
            # Add Clean and Noop actions to agent actions so that they can be executed when the agent steps onto a dirt pile
"""for i in range(self.n_agents):
self.factory.state['Agent'][i].actions.extend([Clean(), Noop()])"""
while not all(done):
action = self.use_door_or_move(env, obs, cleaned_dirt_piles, target_pile, det=True) \
if "Doors" in env.state.entities.keys() else self.execute_policy(obs, env, cleaned_dirt_piles) # zero exploration
_, next_obs, reward, done, info = env.step(action) # Note that this call seems to flip the lists in indices
if done:
print("DoneAtMaxStepsReached:", len(self.agents[0]._episode))
# Add small negative reward if agent has moved away from the target_pile
# reward = self.reward_distance(env, obs, target_pile, reward)
# Check and handle if agent is on field with dirt
reward, done = self.handle_dirt(env, cleaned_dirt_piles, ordered_dirt_piles, target_pile, indices, reward, done)
# Get transformed next_obs that might have been updated because of self.handle_dirt.
# For eval, where pile_all_done is "all", it's mandatory that the potential change of the target pile
# in the observation, caused by self.handle_dirt, is already considered when the next action is calculated.
next_obs = self.transform_observations(env, ordered_dirt_piles, target_pile)
done = [done] * self.n_agents if isinstance(done, bool) else done
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
env.render()
obs = next_obs
episode += 1
# Properly finalize the video file
if self.cfg[nms.ENV]["save_and_log"] and self.cfg[nms.ENV]["record"]:
self.recorder.close()
def plot_reward_development(self):
smoothed_data = np.convolve(self.reward_development, np.ones(10) / 10, mode='valid')
plt.plot(smoothed_data)
plt.ylim([-10, max(smoothed_data) + 20])
plt.title('Smoothed Reward Development')
plt.xlabel('Episode')
plt.ylabel('Reward')
if self.cfg[nms.ENV]["save_and_log"]:
plt.savefig(f"{self.results_path}/smoothed_reward_development.png")
plt.show()
def save_configs(self):
with open(f"{self.results_path}/MARL_config.txt", "w") as txt_file:
txt_file.write(str(self.cfg))
with open(f"{self.results_path}/train_env_config.txt", "w") as txt_file:
txt_file.write(str(self.factory.conf))
with open(f"{self.results_path}/eval_env_config.txt", "w") as txt_file:
txt_file.write(str(self.eval_factory.conf))
def save_agent_models(self):
for idx, agent in enumerate(self.agents):
agent_name = list(self.factory.state.agents_conf.keys())[idx]
agent.pi.save_model_parameters(self.results_path, agent_name)
agent.vf.save_model_parameters(self.results_path, agent_name)
def load_agents(self, runs_list):
for idx, run in enumerate(runs_list):
run_path = f"../study_out/{run}"
agent_name = list(self.eval_factory.state.agents_conf.keys())[idx]
self.agents[idx].pi.load_model_parameters(f"{run_path}/{agent_name}_PolicyNet_model_parameters.pth")
self.agents[idx].vf.load_model_parameters(f"{run_path}/{agent_name}_ValueNet_model_parameters.pth")
def create_info_maps(self, env, used_actions):
# Create value map
all_valid_observations = self.get_all_observations(env)
dirt_piles_positions = self.get_dirt_piles_positions(env)
with open(f"{self.results_path}/info_maps.txt", "w") as txt_file:
for obs_layer, pos in enumerate(dirt_piles_positions):
observations_shape = (
max(t[0] for t in env.state.entities.floorlist) + 2, max(t[1] for t in env.state.entities.floorlist) + 2)
value_maps = [np.zeros(observations_shape) for _ in self.agents]
likeliest_action = [np.full(observations_shape, np.NaN) for _ in self.agents]
action_probabilities = [np.zeros((observations_shape[0], observations_shape[1], self.act_dim)) for
_ in self.agents]
for obs in all_valid_observations[obs_layer]:
"""obs = self._as_torch(obs).view(-1).to(torch.float32)"""
for idx, agent in enumerate(self.agents):
"""indices = np.where(obs[1] == 1) # Get agent position on grid (1 indicates the position)
x, y = indices[0][0], indices[1][0]"""
x, y = int(obs[0]), int(obs[1])
try:
value_maps[idx][x][y] = agent.vf(obs)
probs = agent.pi.distribution(obs).probs
likeliest_action[idx][x][y] = torch.argmax(probs) # get the likeliest action at the current agent position
action_probabilities[idx][x][y] = probs
except:
pass
txt_file.write("=======Value Maps=======\n")
print("=======Value Maps=======")
for agent_idx, vmap in enumerate(value_maps):
txt_file.write(f"Value map of agent {agent_idx} for target pile {pos}:\n")
print(f"Value map of agent {agent_idx} for target pile {pos}:")
vmap = self._as_torch(vmap).round(decimals=4)
max_digits = max(len(str(vmap.max().item())), len(str(vmap.min().item())))
for idx, row in enumerate(vmap):
txt_file.write(' '.join(f" {elem:>{max_digits + 1}}" for elem in row.tolist()))
txt_file.write("\n")
print(' '.join(f" {elem:>{max_digits + 1}}" for elem in row.tolist()))
txt_file.write("\n")
txt_file.write("=======Likeliest Action=======\n")
print("=======Likeliest Action=======")
for agent_idx, amap in enumerate(likeliest_action):
txt_file.write(f"Likeliest action map of agent {agent_idx} for target pile {pos}:\n")
print(f"Likeliest action map of agent {agent_idx} for target pile {pos}:")
txt_file.write(np.array2string(amap))
print(amap)
txt_file.write("\n")
txt_file.write("=======Action Probabilities=======\n")
print("=======Action Probabilities=======")
for agent_idx, pmap in enumerate(action_probabilities):
self.action_probabilities[agent_idx].append(pmap)
txt_file.write(f"Action probability map of agent {agent_idx} for target pile {pos}:\n")
print(f"Action probability map of agent {agent_idx} for target pile {pos}:")
for d in range(pmap.shape[0]):
row = '['
for r in range(pmap.shape[1]):
row += "[" + ', '.join(f"{x:7.4f}" for x in pmap[d, r]) + "]"
txt_file.write(row + "]")
txt_file.write("\n")
print(row + "]")
txt_file.write(f"Used actions: {used_actions}\n")
print("Used actions:", used_actions)

View File

@ -2,8 +2,6 @@ import numpy as np; import torch as th; import scipy as sp;
from collections import deque
from torch import nn
# RLLab Magic for calculating the discounted return G(t) = R(t) + gamma * R(t-1)
# cf. https://github.com/rll/rllab/blob/ba78e4c16dc492982e648f117875b22af3965579/rllab/misc/special.py#L107
cumulate_discount = lambda x, gamma: sp.signal.lfilter([1], [1, - gamma], x[::-1], axis=0)[::-1]
class Net(th.nn.Module):
@ -21,11 +19,11 @@ class Net(th.nn.Module):
if module.bias is not None:
nn.init.uniform_(module.bias, a=-0.1, b=0.1)
def save_model(self, path, agent_name):
th.save(self.net, f"{path}/{agent_name}_{self.__class__.__name__}_model.pth")
def save_model(self, path):
th.save(self.net, f"{path}/{self.__class__.__name__}_model.pth")
def save_model_parameters(self, path, agent_name):
th.save(self.net.state_dict(), f"{path}/{agent_name}_{self.__class__.__name__}_model_parameters.pth")
def save_model_parameters(self, path):
th.save(self.net.state_dict(), f"{path}/{self.__class__.__name__}_model_parameters.pth")
def load_model_parameters(self, path):
self.net.load_state_dict(th.load(path))
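A self-contained sketch of the state-dict round trip behind `save_model_parameters` and `load_model_parameters`; the file path and layer sizes are made up:

```python
import torch as th
from torch import nn

# Hypothetical tiny net, only to illustrate the save/load round trip used above.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 4))
th.save(net.state_dict(), "/tmp/PolicyNet_model_parameters.pth")
net.load_state_dict(th.load("/tmp/PolicyNet_model_parameters.pth"))
net.eval()  # switch to inference mode after loading
```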

View File

@ -1,3 +1,4 @@
class Names:
ENV = 'env'
ENV_NAME = 'env_name'
@ -35,3 +36,8 @@ class Names:
SINGLE = 'single'
DISTRIBUTED = 'distributed'
SHARED = 'shared'
EARLY_STOPPING = 'early_stopping'
TRAIN = 'train'
SEED = 'seed'
LAST_N_EPISODES = 'last_n_episodes'
MEAN_TARGET_CHANGE = 'mean_target_change'
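A short sketch of how the new constants are used to index a parsed config; the import mirrors the one used elsewhere in this commit, and the `cfg` dict is a made-up stand-in for a training config:

```python
from marl_factory_grid.algorithms.marl.constants import Names as nms

cfg = {  # made-up stand-in for a parsed training config
    "algorithm": {"early_stopping": True, "last_n_episodes": 100, "mean_target_change": 2.0},
}
if cfg[nms.ALGORITHM][nms.EARLY_STOPPING]:
    window = cfg[nms.ALGORITHM][nms.LAST_N_EPISODES]
    tolerance = cfg[nms.ALGORITHM][nms.MEAN_TARGET_CHANGE]
```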

View File

@ -0,0 +1,12 @@
env:
classname: marl_factory_grid.configs.marl.multi_agent_configs
env_name: "marl/multi_agent_configs/coin_quadrant_eval_config"
n_agents: 2 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
pile-order: "smart" # Triggers implementation of our emergence prevention mechanism. Agents consider distance to other agent
pile-observability: "single" # Agents can only perceive one coin pile at any given time step
pile_all_done: "shared" # Indicates that agents don't have to collect the same coin piles
auxiliary_piles: False # Coin quadrant does not use this option
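One way such an eval config might be read, sketched with PyYAML; the local file name is an assumption:

```python
import yaml

# Hypothetical local copy of the eval config shown above.
with open("coin_quadrant_eval_config.yaml") as f:
    eval_cfg = yaml.safe_load(f)

assert eval_cfg["algorithm"]["pile-order"] == "smart"
assert eval_cfg["env"]["n_agents"] == 2
```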

View File

@ -0,0 +1,13 @@
# Configuration that shows emergent behavior in our coin-quadrant environment
env:
classname: marl_factory_grid.configs.marl.multi_agent_configs
env_name: "marl/multi_agent_configs/coin_quadrant_eval_config"
n_agents: 2 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
pile-order: "dynamic" # Agents only decide on next target pile based on the distance to the respective piles
pile-observability: "single" # Agents can only perceive one coin pile at any given time step
pile_all_done: "shared" # Indicates that agents don't have to collect the same coin piles
auxiliary_piles: False # Coin quadrant does not use this option

View File

@ -0,0 +1,16 @@
env:
classname: marl_factory_grid.configs.marl.multi_agent_configs
env_name: "marl/multi_agent_configs/two_rooms_eval_config"
n_agents: 2 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
# Piles (=encoded flags) are evenly distributed among the two agents and have to be collected in the order defined
# by the environment config (cf. coords_or_quantity)
pile-order: "agents"
pile-observability: "single" # Agents can only perceive one dirt pile at any given time step
pile_all_done: "distributed" # Indicates that agents must clean their specifically assigned dirt piles
auxiliary_piles: True # Allows agents to go to an auxiliary pile
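A tiny sketch of the even/odd convention behind `auxiliary_piles` (auxiliary piles sit at even indices, primary piles at odd indices, matching the filtering in `distribute_indices` further below); the index list is made up:

```python
# Static layout: auxiliary, primary, auxiliary, primary, ...
assigned = [0, 1, 2, 3]                              # made-up pile indices for one agent
primary_only = [i for i in assigned if i % 2 != 0]   # drop auxiliary piles -> [1, 3]
```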

View File

@ -0,0 +1,17 @@
# Configuration that shows emergent behavior in our two-rooms environment
env:
  classname: marl_factory_grid.configs.marl.multi_agent_configs
env_name: "marl/multi_agent_configs/two_rooms_eval_config_emergent"
n_agents: 2 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
# Piles (=encoded flags) are evenly distributed among the two agents and have to be collected in the order defined
# by the environment config (cf. coords_or_quantity)
pile-order: "agents"
pile-observability: "single" # Agents can only perceive one dirt pile at any given time step
pile_all_done: "distributed" # Indicates that agents must clean their specifically assigned dirt piles
auxiliary_piles: False # Shows emergent behavior

View File

@ -0,0 +1,13 @@
env:
classname: marl_factory_grid.configs.marl.single_agent_configs
env_name: "marl/single_agent_configs/coin_quadrant_agent1_eval_config"
n_agents: 1 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
pile-order: "fixed" # Clean coin piles in a fixed order specified by the environment config (cf. coords_or_quantity)
pile-observability: "single" # Agent can only perceive one coin pile at any given time step
pile_all_done: "all" # During inference the episode ends only when all coin piles are cleaned
auxiliary_piles: False # Coin quadrant does not use this option

View File

@ -0,0 +1,21 @@
env:
classname: marl_factory_grid.configs.marl.single_agent_configs
env_name: "marl/single_agent_configs/coin_quadrant_agent1_train_config"
n_agents: 1 # Number of agents in the environment
train_render: False # If training should be graphically visualized
save_and_log: True # If configurations and potential logging files should be saved
algorithm:
seed: 9 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
gamma: 0.99 # The gamma value that is used as discounting factor
n_steps: 0 # How much experience should be sampled at most until the next value- and policy-net updates are performed. (0 = Monte Carlo)
chunk-episode: 20000 # For update, splits very large episodes in batches of approximately equal size. (0 = update networks with full episode at once)
max_steps: 400000 # Number of training steps used for agent1 (=agent2)
early_stopping: True # If the early stopping functionality should be used
  last_n_episodes: 100 # To determine whether a low-change phase has begun, the mean change over the last n episodes is compared against mean_target_change
  mean_target_change: 2.0 # Accepted fluctuation of the return for declaring that a low-change phase has begun
advantage: "Advantage-AC" # Defines the used actor critic model
pile-order: "fixed" # Clean coin piles in a fixed order specified by the environment config (cf. coords_or_quantity)
pile-observability: "single" # Agent can only perceive one coin pile at any given time step
pile_all_done: "single" # Episode ends when the current target pile is cleaned
auxiliary_piles: False # Coin quadrant does not use this option
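A small sketch of what `chunk-episode` does during the network update (cf. `split_into_chunks` in the training code): an episode longer than the chunk size is split into slices of at most that length; the numbers are illustrative:

```python
chunk_size = 3                      # stands in for chunk-episode (20000 above)
rewards = list(range(8))            # stands in for one very long episode
chunks = [rewards[i:i + chunk_size] for i in range(0, len(rewards), chunk_size)]
assert chunks == [[0, 1, 2], [3, 4, 5], [6, 7]]
```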

View File

@ -0,0 +1,14 @@
env:
classname: marl_factory_grid.configs.marl.single_agent_configs
env_name: "marl/single_agent_configs/two_rooms_agent2_eval_config"
n_agents: 1 # Number of agents in the environment
eval_render: True # If inference should be graphically visualized
save_and_log: False # If configurations and potential logging files should be saved
algorithm:
seed: 42 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
pile-order: "fixed" # Clean coin piles (=encoded flags) in a fixed order specified by the environment config (cf. coords_or_quantity)
pile-observability: "single" # Agent can only perceive one coin pile at any given time step
pile_all_done: "all" # During inference the episode ends only when all coin piles are cleaned
auxiliary_piles: False # Auxiliary piles are only differentiated from regular target piles during marl eval

View File

@ -0,0 +1,22 @@
env:
classname: marl_factory_grid.configs.marl.single_agent_configs
env_name: "marl/single_agent_configs/two_rooms_agent2_train_config"
n_agents: 1 # Number of agents in the environment
train_render: False # If training should be graphically visualized
save_and_log: True # If configurations and potential logging files should be saved
algorithm:
seed: 9 # Picks seed to make random parts of algorithm reproducible. -1 for random seed
gamma: 0.99 # The gamma value that is used as discounting factor
n_steps: 0 # How much experience should be sampled at most until the next value- and policy-net updates are performed. (0 = Monte Carlo)
chunk-episode: 20000 # For update, splits very large episodes in batches of approximately equal size. (0 = update networks with full episode at once)
max_steps: 300000 # Number of training steps used to train the agent. Here, only a placeholder value
early_stopping: True # If the early stopping functionality should be used
  last_n_episodes: 100 # To determine whether a low-change phase has begun, the mean change over the last n episodes is compared against mean_target_change
  mean_target_change: 2.0 # Accepted fluctuation of the return for declaring that a low-change phase has begun
advantage: "Advantage-AC" # Defines the used actor critic model
pile-order: "fixed" # Clean coin piles (=encoded flags) in a fixed order specified by the environment config (cf. coords_or_quantity)
pile-observability: "single" # Agent can only perceive one coin pile at any given time step
pile_all_done: "single" # Episode ends when the current target pile is cleaned
auxiliary_piles: False # Auxiliary piles are only differentiated from regular target piles during marl eval

View File

@ -1,11 +1,14 @@
import copy
import os
from pathlib import Path
from typing import List
import numpy as np
import pandas as pd
import torch
from marl_factory_grid.algorithms.rl.constants import Names as nms
from marl_factory_grid.algorithms.marl.constants import Names as nms
from marl_factory_grid.algorithms.rl.base_a2c import cumulate_discount
from marl_factory_grid.algorithms.marl.base_a2c import cumulate_discount
def _as_torch(x):
@ -187,7 +190,7 @@ def distribute_indices(env, cfg, n_agents):
# -> Starting with index 0 even piles are auxiliary piles, odd piles are primary piles
if cfg[nms.ALGORITHM][nms.AUXILIARY_PILES] and nms.DOORS in env.state.entities.keys():
door_positions = [door.pos for door in env.state.entities[nms.DOORS]]
distances = {door_pos: [] for door_pos in door_positions}
distances = {door_pos:[] for door_pos in door_positions}
# Calculate distance of every agent to every door
for door_pos in door_positions:
@ -198,7 +201,7 @@ def distribute_indices(env, cfg, n_agents):
return [i for i, x in enumerate(lst) if x == item]
# Get agent indices of agents with same distance to door
affected_agents = {door_pos: {} for door_pos in door_positions}
affected_agents = {door_pos:{} for door_pos in door_positions}
for door_pos in distances.keys():
dist = distances[door_pos]
dist_set = set(dist)
@ -206,22 +209,20 @@ def distribute_indices(env, cfg, n_agents):
affected_agents[door_pos][str(d)] = duplicate_indices(dist, d)
updated_indices = []
for door_pos, agent_distances in affected_agents.items():
if len(agent_distances) == 0:
# Remove auxiliary piles for all agents
# (In config, we defined every pile with an even numbered index to be an auxiliary pile)
updated_indices = [[ele for ele in lst if ele % 2 != 0] for lst in indices]
else:
for distance, agent_indices in agent_distances.items():
# For each distance group, pick one random agent to keep the auxiliary pile
# selected_agent = np.random.choice(agent_indices)
selected_agent = 0
for agent_idx in agent_indices:
if agent_idx == selected_agent:
updated_indices.append(indices[agent_idx])
else:
updated_indices.append([ele for ele in indices[agent_idx] if ele % 2 != 0])
if len(affected_agents[door_positions[0]]) == 0:
# Remove auxiliary piles for all agents
# (In config, we defined every pile with an even numbered index to be an auxiliary pile)
updated_indices = [[ele for ele in lst if ele % 2 != 0] for lst in indices]
else:
for distance, agent_indices in affected_agents[door_positions[0]].items():
# Pick random agent to keep auxiliary pile and remove it for all others
#selected_agent = np.random.choice(agent_indices)
selected_agent = 0
for agent_idx in agent_indices:
if agent_idx == selected_agent:
updated_indices.append(indices[agent_idx])
else:
updated_indices.append([ele for ele in indices[agent_idx] if ele % 2 != 0])
indices = updated_indices
@ -335,3 +336,42 @@ def save_agent_models(results_path, agents):
for idx, agent in enumerate(agents):
agent.pi.save_model_parameters(results_path)
agent.vf.save_model_parameters(results_path)
def has_low_change_phase_started(return_change_development, last_n_episodes, mean_target_change):
""" Checks if training has reached a phase with only marginal average change """
if np.mean(np.abs(return_change_development[-last_n_episodes:])) < mean_target_change:
print("Low change phase started.")
return True
return False
def significant_deviation(return_change_development, low_change_phase_start_episode):
""" Determines if a significant return deviation has occurred in the last episode """
return_change_development = return_change_development[low_change_phase_start_episode:]
df = pd.DataFrame({'Episode': range(len(return_change_development)), 'DeltaReturn': return_change_development})
df['Difference'] = df['DeltaReturn'].diff().abs()
# Only the most extreme changes (those that are greater than 99.99% of all changes) will be considered significant
threshold = df['Difference'].quantile(0.9999)
# Identify significant changes
significant_changes = df[df['Difference'] > threshold]
print("Threshold: ", threshold, "Significant changes: ", significant_changes)
if len(significant_changes["Episode"]) > 0:
return True
return False
def get_algorithms_marl_path():
return Path(Path(__file__).parent)
def get_configs_marl_path():
return Path(os.path.join(Path(__file__).parent.parent.parent, "configs"))
def get_agent_models_path():
return Path(os.path.join(Path(__file__).parent.parent, "agent_models"))

View File

@ -1 +0,0 @@
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory

View File

@ -1,112 +0,0 @@
import numpy as np
import torch as th
import scipy as sp
from collections import deque
from torch import nn
cumulate_discount = lambda x, gamma: sp.signal.lfilter([1], [1, - gamma], x[::-1], axis=0)[::-1]
class Net(th.nn.Module):
def __init__(self, shape, activation, lr):
super().__init__()
self.net = th.nn.Sequential(*[layer
for io, a in zip(zip(shape[:-1], shape[1:]),
[activation] * (len(shape) - 2) + [th.nn.Identity])
for layer in [th.nn.Linear(*io), a()]])
self.optimizer = th.optim.Adam(self.net.parameters(), lr=lr)
# Initialize weights uniformly, so that for the policy net all actions have approximately the same
# probability in the beginning
for module in self.modules():
if isinstance(module, nn.Linear):
nn.init.uniform_(module.weight, a=-0.1, b=0.1)
if module.bias is not None:
nn.init.uniform_(module.bias, a=-0.1, b=0.1)
def save_model(self, path):
th.save(self.net, f"{path}/{self.__class__.__name__}_model.pth")
def save_model_parameters(self, path):
th.save(self.net.state_dict(), f"{path}/{self.__class__.__name__}_model_parameters.pth")
def load_model_parameters(self, path):
self.net.load_state_dict(th.load(path))
self.net.eval()
class ValueNet(Net):
def __init__(self, obs_dim, hidden_sizes=[64, 64], activation=th.nn.ReLU, lr=1e-3):
super().__init__([obs_dim] + hidden_sizes + [1], activation, lr)
def forward(self, obs): return self.net(obs)
def loss(self, states, returns): return ((returns - self(states)) ** 2).mean()
class PolicyNet(Net):
def __init__(self, obs_dim, act_dim, hidden_sizes=[64, 64], activation=th.nn.Tanh, lr=3e-4):
super().__init__([obs_dim] + hidden_sizes + [act_dim], activation, lr)
self.distribution = lambda obs: th.distributions.Categorical(logits=self.net(obs))
def forward(self, obs, act=None, det=False):
"""Given an observation: Returns policy distribution and probablilty for a given action
or Returns a sampled action and its corresponding probablilty"""
pi = self.distribution(obs)
if act is not None: return pi, pi.log_prob(act)
act = self.net(obs).argmax() if det else pi.sample() # sample from the learned distribution
return act, pi.log_prob(act)
def loss(self, states, actions, advantages):
_, logp = self.forward(states, actions)
loss = -(logp * advantages).mean()
return loss
class PolicyGradient:
""" Autonomous agent using vanilla policy gradient. """
def __init__(self, env, seed=42, gamma=0.99, agent_id=0, act_dim=None, obs_dim=None):
self.env = env
self.gamma = gamma # Setup env and discount
th.manual_seed(seed)
np.random.seed(seed) # Seed Torch, numpy and gym
        # Keep track of previous rewards and performed steps to calculate the mean Return metric
self._episode, self.ep_returns, self.num_steps = [], deque(maxlen=100), 0
# Get observation and action shapes
if not obs_dim:
obs_size = env.observation_space.shape if len(env.state.entities.by_name("Agents")) == 1 \
else env.observation_space[agent_id].shape # Single agent case vs. multi-agent case
obs_dim = np.prod(obs_size)
if not act_dim:
act_dim = env.action_space[agent_id].n
self.vf = ValueNet(obs_dim) # Setup Value Network (Critic)
self.pi = PolicyNet(obs_dim, act_dim) # Setup Policy Network (Actor)
def step(self, obs):
""" Given an observation, get action and probs from policy and values from critic"""
with th.no_grad():
(a, _), v = self.pi(obs), self.vf(obs)
self._episode.append((None, None, None, v))
return a.numpy()
def policy(self, obs, det=True):
return self.pi(obs, det=det)[0].numpy()
def finish_episode(self):
"""Process self._episode & reset self.env, Returns (s,a,G,V)-Tuple and new inital state"""
s, a, r, v = (np.array(e) for e in zip(*self._episode)) # Get trajectories from rollout
self.ep_returns.append(sum(r))
self._episode = [] # Add episode return to buffer & reset
return s, a, r, v # state, action, Return, Value Tensors
def train(self, states, actions, returns, advantages): # Update policy weights
self.pi.optimizer.zero_grad()
self.vf.optimizer.zero_grad() # Reset optimizer
states = states.flatten(1, -1) # Reduce dimensionality to rollout_dim x input_dim
policy_loss = self.pi.loss(states, actions, advantages) # Calculate Policy loss
policy_loss.backward()
self.pi.optimizer.step() # Apply Policy loss
value_loss = self.vf.loss(states, returns) # Calculate Value loss
value_loss.backward()
self.vf.optimizer.step() # Apply Value loss
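A minimal interaction sketch for the `PolicyGradient` agent above; the import path mirrors the one used elsewhere in this commit and is an assumption, the observation is made up, and passing the dimensions explicitly avoids touching the environment:

```python
import torch as th
from marl_factory_grid.algorithms.marl.base_a2c import PolicyGradient

agent = PolicyGradient(env=None, obs_dim=4, act_dim=4)   # dims given explicitly, so env is never queried
obs = th.zeros(4)                                        # made-up observation
sampled = agent.step(obs)      # stochastic action during training (also records V(obs))
greedy = agent.policy(obs)     # deterministic action during evaluation
```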

View File

@ -1,242 +0,0 @@
import torch
from typing import Union, List, Dict
import numpy as np
from torch.distributions import Categorical
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory
from marl_factory_grid.algorithms.utils import add_env_props, instantiate_class
from pathlib import Path
import pandas as pd
from collections import deque
class Names:
REWARD = 'reward'
DONE = 'done'
ACTION = 'action'
OBSERVATION = 'observation'
LOGITS = 'logits'
HIDDEN_ACTOR = 'hidden_actor'
HIDDEN_CRITIC = 'hidden_critic'
AGENT = 'agent'
ENV = 'env'
ENV_NAME = 'env_name'
N_AGENTS = 'n_agents'
ALGORITHM = 'algorithm'
MAX_STEPS = 'max_steps'
N_STEPS = 'n_steps'
BUFFER_SIZE = 'buffer_size'
CRITIC = 'critic'
    BATCH_SIZE = 'batch_size'
N_ACTIONS = 'n_actions'
TRAIN_RENDER = 'train_render'
EVAL_RENDER = 'eval_render'
nms = Names
ListOrTensor = Union[List, torch.Tensor]
class BaseActorCritic:
def __init__(self, cfg):
self.factory = add_env_props(cfg)
self.__training = True
self.cfg = cfg
self.n_agents = cfg[nms.AGENT][nms.N_AGENTS]
self.reset_memory_after_epoch = True
self.setup()
def setup(self):
self.net = instantiate_class(self.cfg[nms.AGENT])
self.optimizer = torch.optim.RMSprop(self.net.parameters(), lr=3e-4, eps=1e-5)
@classmethod
def _as_torch(cls, x):
if isinstance(x, np.ndarray):
return torch.from_numpy(x)
elif isinstance(x, List):
return torch.tensor(x)
elif isinstance(x, (int, float)):
return torch.tensor([x])
return x
def train(self):
self.__training = False
networks = [self.net] if not isinstance(self.net, List) else self.net
for net in networks:
net.train()
def eval(self):
self.__training = False
networks = [self.net] if not isinstance(self.net, List) else self.net
for net in networks:
net.eval()
def load_state_dict(self, path: Path):
pass
def get_actions(self, out) -> ListOrTensor:
actions = [Categorical(logits=logits).sample().item() for logits in out[nms.LOGITS]]
return actions
def init_hidden(self) -> Dict[str, ListOrTensor]:
pass
def forward(self,
observations: ListOrTensor,
actions: ListOrTensor,
hidden_actor: ListOrTensor,
hidden_critic: ListOrTensor
) -> Dict[str, ListOrTensor]:
pass
@torch.no_grad()
def train_loop(self, checkpointer=None):
env = self.factory
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
n_steps, max_steps = [self.cfg[nms.ALGORITHM][k] for k in [nms.N_STEPS, nms.MAX_STEPS]]
tm = MARLActorCriticMemory(self.n_agents, self.cfg[nms.ALGORITHM].get(nms.BUFFER_SIZE, n_steps))
global_steps, episode, df_results = 0, 0, []
reward_queue = deque(maxlen=2000)
while global_steps < max_steps:
obs = env.reset()
obs = list(obs.values())
last_hiddens = self.init_hidden()
last_action, reward = [-1] * self.n_agents, [0.] * self.n_agents
done, rew_log = [False] * self.n_agents, 0
if self.reset_memory_after_epoch:
tm.reset()
tm.add(observation=obs, action=last_action,
logits=torch.zeros(self.n_agents, 1, self.cfg[nms.AGENT][nms.N_ACTIONS]),
values=torch.zeros(self.n_agents, 1), reward=reward, done=done, **last_hiddens)
while not all(done):
out = self.forward(obs, last_action, **last_hiddens)
action = self.get_actions(out)
_, next_obs, reward, done, info = env.step(action)
done = [done] * self.n_agents if isinstance(done, bool) else done
if self.cfg[nms.ENV][nms.TRAIN_RENDER]:
env.render()
last_hiddens = dict(hidden_actor=out[nms.HIDDEN_ACTOR],
hidden_critic=out[nms.HIDDEN_CRITIC])
logits = torch.stack([tensor.squeeze(0) for tensor in out.get(nms.LOGITS, None)], dim=0)
values = torch.stack([tensor.squeeze(0) for tensor in out.get(nms.CRITIC, None)], dim=0)
tm.add(observation=obs, action=action, reward=reward, done=done,
logits=logits, values=values,
**last_hiddens)
obs = next_obs
last_action = action
if (global_steps+1) % n_steps == 0 or all(done):
with torch.inference_mode(False):
self.learn(tm)
global_steps += 1
rew_log += sum(reward)
reward_queue.extend(reward)
if checkpointer is not None:
checkpointer.step([
(f'agent#{i}', agent)
for i, agent in enumerate([self.net] if not isinstance(self.net, List) else self.net)
])
if global_steps >= max_steps:
break
if global_steps%100 == 0:
print(f'reward at episode: {episode} = {rew_log}')
episode += 1
df_results.append([episode, rew_log, *reward])
df_results = pd.DataFrame(df_results,
columns=['steps', 'reward', *[f'agent#{i}' for i in range(self.n_agents)]]
)
if checkpointer is not None:
df_results.to_csv(checkpointer.path / 'results.csv', index=False)
return df_results
@torch.inference_mode(True)
def eval_loop(self, n_episodes, render=False):
env = self.factory
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
env.render()
episode, results = 0, []
while episode < n_episodes:
obs = env.reset()
obs = list(obs.values())
last_hiddens = self.init_hidden()
last_action, reward = [-1] * self.n_agents, [0.] * self.n_agents
done, rew_log, eps_rew = [False] * self.n_agents, 0, torch.zeros(self.n_agents)
while not all(done):
out = self.forward(obs, last_action, **last_hiddens)
action = self.get_actions(out)
_, next_obs, reward, done, info = env.step(action)
if self.cfg[nms.ENV][nms.EVAL_RENDER]:
env.render()
if isinstance(done, bool):
done = [done] * obs[0].shape[0]
obs = next_obs
last_action = action
last_hiddens = dict(hidden_actor=out.get(nms.HIDDEN_ACTOR, None),
hidden_critic=out.get(nms.HIDDEN_CRITIC, None)
)
eps_rew += torch.tensor(reward)
results.append(eps_rew.tolist() + [sum(eps_rew).item()] + [episode])
episode += 1
agent_columns = [f'agent#{i}' for i in range(self.cfg[nms.ENV][nms.N_AGENTS])]
results = pd.DataFrame(results, columns=agent_columns + ['sum', 'episode'])
results = pd.melt(results, id_vars=['episode'], value_vars=agent_columns + ['sum'],
value_name='reward', var_name='agent')
return results
@staticmethod
def compute_advantages(critic, reward, done, gamma, gae_coef=0.0):
tds = (reward + gamma * (1.0 - done) * critic[:, 1:].detach()) - critic[:, :-1]
if gae_coef <= 0:
return tds
gae = torch.zeros_like(tds[:, -1])
gaes = []
for t in range(tds.shape[1]-1, -1, -1):
gae = tds[:, t] + gamma * gae_coef * (1.0 - done[:, t]) * gae
gaes.insert(0, gae)
gaes = torch.stack(gaes, dim=1)
return gaes
def actor_critic(self, tm, network, gamma, entropy_coef, vf_coef, gae_coef=0.0, **kwargs):
obs, actions, done, reward = tm.observation, tm.action, tm.done[:, 1:], tm.reward[:, 1:]
out = network(obs, actions, tm.hidden_actor[:, 0].squeeze(0), tm.hidden_critic[:, 0].squeeze(0))
logits = out[nms.LOGITS][:, :-1] # last one only needed for v_{t+1}
critic = out[nms.CRITIC]
entropy_loss = Categorical(logits=logits).entropy().mean(-1)
advantages = self.compute_advantages(critic, reward, done, gamma, gae_coef)
value_loss = advantages.pow(2).mean(-1) # n_agent
# policy loss
log_ap = torch.log_softmax(logits, -1)
log_ap = torch.gather(log_ap, dim=-1, index=actions[:, 1:].unsqueeze(-1)).squeeze()
a2c_loss = -(advantages.detach() * log_ap).mean(-1)
# weighted loss
loss = a2c_loss + vf_coef*value_loss - entropy_coef * entropy_loss
return loss.mean()
def learn(self, tm: MARLActorCriticMemory, **kwargs):
loss = self.actor_critic(tm, self.net, **self.cfg[nms.ALGORITHM], **kwargs)
# remove next_obs, will be added in next iter
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.net.parameters(), 0.5)
self.optimizer.step()
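A small numeric sketch of the `compute_advantages` logic above with GAE enabled; shapes follow the (n_agents, t) layout used there and the numbers are made up:

```python
import torch

critic = torch.tensor([[1.0, 2.0, 3.0]])   # V(s_0), V(s_1), V(s_2) for a single agent
reward = torch.tensor([[0.0, 1.0]])        # rewards of the two transitions
done   = torch.tensor([[0.0, 1.0]])        # the episode terminates after the second step
gamma, gae_coef = 0.99, 0.95

tds = (reward + gamma * (1.0 - done) * critic[:, 1:]) - critic[:, :-1]  # one-step TD errors
gae, gaes = torch.zeros_like(tds[:, -1]), []
for t in range(tds.shape[1] - 1, -1, -1):
    gae = tds[:, t] + gamma * gae_coef * (1.0 - done[:, t]) * gae
    gaes.insert(0, gae)
advantages = torch.stack(gaes, dim=1)      # same shape and values compute_advantages would return
```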

View File

@ -1,34 +0,0 @@
agent:
classname: marl_factory_grid.algorithms.rl.networks.RecurrentAC
n_agents: 2
obs_emb_size: 96
action_emb_size: 16
hidden_size_actor: 64
hidden_size_critic: 64
use_agent_embedding: False
env:
classname: marl_factory_grid.configs.custom
env_name: "custom/MultiAgentConfigs/dirt_quadrant_train_config"
n_agents: 2
max_steps: 250
pomdp_r: 2
stack_n_frames: 0
individual_rewards: True
train_render: False
eval_render: True
save_and_log: True
record: False
method: marl_factory_grid.algorithms.rl.LoopSEAC
algorithm:
gamma: 0.99
entropy_coef: 0.01
vf_coef: 0.05
n_steps: 0 # How much experience should be sampled at most (n-TD) until the next value and policy update is performed. Default 0: MC
max_steps: 200000
advantage: "Advantage-AC" # Options: "Advantage-AC", "TD-Advantage-AC", "Reinforce"
pile-order: "dynamic" # Use "dynamic" to see emergent phenomenon and "smart" to prevent it
pile-observability: "single" # Options: "single", "all"
pile_all_done: "shared" # Options: "single", "all" ("single" for training, "all" for eval), "shared"
auxiliary_piles: False # Option that is only considered when pile-order = "agents"
chunk-episode: 20000 # Chunk size. (0 = update networks with full episode at once)

View File

@ -1,35 +0,0 @@
agent:
classname: marl_factory_grid.algorithms.rl.networks.RecurrentAC
n_agents: 2
obs_emb_size: 96
action_emb_size: 16
hidden_size_actor: 64
hidden_size_critic: 64
use_agent_embedding: False
env:
classname: marl_factory_grid.configs.custom
env_name: "custom/two_rooms_one_door_modified_train_config"
n_agents: 2
max_steps: 250
pomdp_r: 2
stack_n_frames: 0
individual_rewards: True
train_render: False
eval_render: True
save_and_log: True
record: False
method: marl_factory_grid.algorithms.rl.LoopSEAC
algorithm:
gamma: 0.99
entropy_coef: 0.01
vf_coef: 0.05
n_steps: 0 # How much experience should be sampled at most (n-TD) until the next value and policy update is performed. Default 0: MC
max_steps: 260000
advantage: "Advantage-AC" # Options: "Advantage-AC", "TD-Advantage-AC", "Reinforce"
pile-order: "agents" # Options: "fixed", "random", "none", "agents", "dynamic", "smart" (Use "fixed", "random" and "none" for single agent training and the other for multi agent inference)
pile-observability: "single" # Options: "single", "all"
pile_all_done: "distributed" # Options: "single", "all" ("single" for training, "all" and "distributed" for eval)
auxiliary_piles: True # Use True to see emergent phenomenon and False to prevent it
chunk-episode: 20000 # Chunk size. (0 = update networks with full episode at once)

View File

@ -1,34 +0,0 @@
agent:
classname: marl_factory_grid.algorithms.rl.networks.RecurrentAC
n_agents: 1
obs_emb_size: 96
action_emb_size: 16
hidden_size_actor: 64
hidden_size_critic: 64
use_agent_embedding: False
env:
classname: marl_factory_grid.configs.custom
env_name: "custom/dirt_quadrant_train_config"
n_agents: 1
max_steps: 250
pomdp_r: 2
stack_n_frames: 0
individual_rewards: True
train_render: False
eval_render: True
save_and_log: True
record: False
method: marl_factory_grid.algorithms.rl.LoopSEAC
algorithm:
gamma: 0.99
entropy_coef: 0.01
vf_coef: 0.05
n_steps: 0 # How much experience should be sampled at most (n-TD) until the next value and policy update is performed. Default 0: MC
max_steps: 240000
advantage: "Advantage-AC" # Options: "Advantage-AC", "TD-Advantage-AC", "Reinforce"
pile-order: "fixed" # Options: "fixed", "random", "none", "agents", "dynamic", "smart" (Use "fixed", "random" and "none" for single agent training and the other for multi agent inference)
pile-observability: "single" # Options: "single", "all"
pile_all_done: "single" # Options: "single", "all" ("single" for training, "all" for eval)
auxiliary_piles: False # Option that is only considered when pile-order = "agents"
chunk-episode: 20000 # Chunk size. (0 = update networks with full episode at once)

View File

@ -1,8 +0,0 @@
marl_factory_grid>environment>rules.py#SpawnEntity.on_reset()
marl_factory_grid>environment>rewards.py
marl_factory_grid>modules>clean_up>groups.py#DirtPiles.trigger_spawn()
marl_factory_grid>environment>rules.py#AgentSpawnRule
marl_factory_grid>utils>states.py#GameState.__init__()
marl_factory_grid>environment>factory.py>Factory#render
marl_factory_grid>environment>factory.py>Factory#set_recorder
marl_factory_grid>utils>renderer.py>Renderer#render

View File

@ -1,35 +0,0 @@
agent:
classname: marl_factory_grid.algorithms.rl.networks.RecurrentAC
n_agents: 1
obs_emb_size: 96
action_emb_size: 16
hidden_size_actor: 64
hidden_size_critic: 64
use_agent_embedding: False
env:
classname: marl_factory_grid.configs.custom
env_name: "custom/two_rooms_one_door_modified_train_config"
n_agents: 1
max_steps: 250
pomdp_r: 2
stack_n_frames: 0
individual_rewards: True
train_render: False
eval_render: True
save_and_log: False
record: False
method: marl_factory_grid.algorithms.rl.LoopSEAC
algorithm:
gamma: 0.99
entropy_coef: 0.01
vf_coef: 0.05
n_steps: 0 # How much experience should be sampled at most (n-TD) until the next value and policy update is performed. Default 0: MC
max_steps: 260000
advantage: "Advantage-AC" # Options: "Advantage-AC", "TD-Advantage-AC", "Reinforce"
pile-order: "fixed" # Options: "fixed", "random", "none", "agents", "dynamic", "smart" (Use "fixed", "random" and "none" for single agent training and the other for multi agent inference)
pile-observability: "single" # Options: "single", "all"
pile_all_done: "single" # Options: "single", "all" ("single" for training, "all" for eval)
auxiliary_piles: False # Option that is only considered when pile-order = "agents"
chunk-episode: 20000 # Chunk size. (0 = update networks with full episode at once)

View File

@ -1,57 +0,0 @@
import torch
from marl_factory_grid.algorithms.rl.base_ac import BaseActorCritic, nms
from marl_factory_grid.algorithms.utils import instantiate_class
from pathlib import Path
from natsort import natsorted
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory
class LoopIAC(BaseActorCritic):
def __init__(self, cfg):
super(LoopIAC, self).__init__(cfg)
def setup(self):
self.net = [
instantiate_class(self.cfg[nms.AGENT]) for _ in range(self.n_agents)
]
self.optimizer = [
torch.optim.RMSprop(self.net[ag_i].parameters(), lr=3e-4, eps=1e-5) for ag_i in range(self.n_agents)
]
def load_state_dict(self, path: Path):
paths = natsorted(list(path.glob('*.pt')))
for path, net in zip(paths, self.net):
net.load_state_dict(torch.load(path))
@staticmethod
def merge_dicts(ds): # todo could be recursive for more than 1 hierarchy
d = {}
for k in ds[0].keys():
d[k] = [d[k] for d in ds]
return d
def init_hidden(self):
ha = [net.init_hidden_actor() for net in self.net]
hc = [net.init_hidden_critic() for net in self.net]
return dict(hidden_actor=ha, hidden_critic=hc)
def forward(self, observations, actions, hidden_actor, hidden_critic):
outputs = [
net(
self._as_torch(observations[ag_i]).unsqueeze(0).unsqueeze(0), # agent x time
self._as_torch(actions[ag_i]).unsqueeze(0),
hidden_actor[ag_i],
hidden_critic[ag_i]
) for ag_i, net in enumerate(self.net)
]
return self.merge_dicts(outputs)
def learn(self, tms: MARLActorCriticMemory, **kwargs):
for ag_i in range(self.n_agents):
tm, net = tms(ag_i), self.net[ag_i]
loss = self.actor_critic(tm, net, **self.cfg[nms.ALGORITHM], **kwargs)
self.optimizer[ag_i].zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(net.parameters(), 0.5)
self.optimizer[ag_i].step()
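The `merge_dicts` helper above only merges a single level, as its TODO notes. A hedged sketch of a recursive variant, assuming every per-agent dict has the same keys and nests only dicts and tensor/array leaves:

```python
def merge_dicts_recursive(ds):
    # ds: list of per-agent dicts with identical structure
    merged = {}
    for key in ds[0].keys():
        values = [d[key] for d in ds]
        if isinstance(values[0], dict):
            merged[key] = merge_dicts_recursive(values)  # recurse into nested dicts
        else:
            merged[key] = values  # collect per-agent leaves into a list
    return merged
```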

View File

@ -1,66 +0,0 @@
from marl_factory_grid.algorithms.rl.base_ac import Names as nms
from marl_factory_grid.algorithms.rl.snac import LoopSNAC
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory
import torch
from torch.distributions import Categorical
from marl_factory_grid.algorithms.utils import instantiate_class
class LoopMAPPO(LoopSNAC):
def __init__(self, *args, **kwargs):
super(LoopMAPPO, self).__init__(*args, **kwargs)
self.reset_memory_after_epoch = False
def setup(self):
self.net = instantiate_class(self.cfg[nms.AGENT])
self.optimizer = torch.optim.Adam(self.net.parameters(), lr=3e-4, eps=1e-5)
def learn(self, tm: MARLActorCriticMemory, **kwargs):
if len(tm) >= self.cfg['algorithm']['buffer_size']:
# only learn when buffer is full
for batch_i in range(self.cfg['algorithm']['n_updates']):
batch = tm.chunk_dataloader(chunk_len=self.cfg['algorithm']['n_steps'],
k=self.cfg['algorithm']['batch_size'])
loss = self.mappo(batch, self.net, **self.cfg[nms.ALGORITHM], **kwargs)
self.optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.net.parameters(), 0.5)
self.optimizer.step()
def monte_carlo_returns(self, rewards, done, gamma):
rewards_ = []
discounted_reward = torch.zeros_like(rewards[:, -1])
for t in range(rewards.shape[1]-1, -1, -1):
discounted_reward = rewards[:, t] + (gamma * (1.0 - done[:, t]) * discounted_reward)
rewards_.insert(0, discounted_reward)
rewards_ = torch.stack(rewards_, dim=1)
return rewards_
def mappo(self, batch, network, gamma, entropy_coef, vf_coef, clip_range, **__):
out = network(batch[nms.OBSERVATION], batch[nms.ACTION], batch[nms.HIDDEN_ACTOR], batch[nms.HIDDEN_CRITIC])
logits = out[nms.LOGITS][:, :-1] # last one only needed for v_{t+1}
old_log_probs = torch.log_softmax(batch[nms.LOGITS], -1)
old_log_probs = torch.gather(old_log_probs, index=batch[nms.ACTION][:, 1:].unsqueeze(-1), dim=-1).squeeze()
# monte carlo returns
mc_returns = self.monte_carlo_returns(batch[nms.REWARD], batch[nms.DONE], gamma)
mc_returns = (mc_returns - mc_returns.mean()) / (mc_returns.std() + 1e-8) # todo: norm across agent ok?
advantages = mc_returns - out[nms.CRITIC][:, :-1]
# policy loss
log_ap = torch.log_softmax(logits, -1)
log_ap = torch.gather(log_ap, dim=-1, index=batch[nms.ACTION][:, 1:].unsqueeze(-1)).squeeze()
ratio = (log_ap - old_log_probs).exp()
surr1 = ratio * advantages.detach()
surr2 = torch.clamp(ratio, 1 - clip_range, 1 + clip_range) * advantages.detach()
policy_loss = -torch.min(surr1, surr2).mean(-1)
# entropy & value loss
entropy_loss = Categorical(logits=logits).entropy().mean(-1)
value_loss = advantages.pow(2).mean(-1) # n_agent
# weighted loss
loss = policy_loss + vf_coef*value_loss - entropy_coef * entropy_loss
return loss.mean()
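As a quick, illustrative sanity check, `monte_carlo_returns` implements the familiar discounted-return recursion R_t = r_t + gamma * (1 - done_t) * R_{t+1}; the toy values below are made up:

```python
import torch

def monte_carlo_returns(rewards, done, gamma):
    # Same recursion as in LoopMAPPO above, reproduced for a standalone check.
    returns = []
    discounted = torch.zeros_like(rewards[:, -1])
    for t in range(rewards.shape[1] - 1, -1, -1):
        discounted = rewards[:, t] + gamma * (1.0 - done[:, t]) * discounted
        returns.insert(0, discounted)
    return torch.stack(returns, dim=1)

rewards = torch.tensor([[1.0, 0.0, 2.0]])
done = torch.tensor([[0.0, 0.0, 1.0]])
# Expected: [1 + 0.9 * (0 + 0.9 * 2), 0 + 0.9 * 2, 2] = [2.62, 1.80, 2.00]
print(monte_carlo_returns(rewards, done, gamma=0.9))
```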

View File

@ -1,221 +0,0 @@
import numpy as np
from collections import deque
import torch
from typing import Union
from torch import Tensor
from torch.utils.data import Dataset, ConcatDataset
import random
class ActorCriticMemory(object):
def __init__(self, capacity=10):
self.capacity = capacity
self.reset()
def reset(self):
self.__actions = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__hidden_actor = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__hidden_critic = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__states = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__rewards = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__dones = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__logits = LazyTensorFiFoQueue(maxlen=self.capacity+1)
self.__values = LazyTensorFiFoQueue(maxlen=self.capacity+1)
def __len__(self):
return len(self.__rewards) - 1
@property
def observation(self, sls=slice(0, None)): # add time dimension through stacking
return self.__states[sls].unsqueeze(0) # 1 x time x hidden dim
@property
def hidden_actor(self, sls=slice(0, None)): # 1 x n_layers x dim
return self.__hidden_actor[sls].unsqueeze(0) # 1 x time x n_layers x dim
@property
def hidden_critic(self, sls=slice(0, None)): # 1 x n_layers x dim
return self.__hidden_critic[sls].unsqueeze(0) # 1 x time x n_layers x dim
@property
def reward(self, sls=slice(0, None)):
return self.__rewards[sls].squeeze().unsqueeze(0) # 1 x time
@property
def action(self, sls=slice(0, None)):
return self.__actions[sls].long().squeeze().unsqueeze(0) # 1 x time
@property
def done(self, sls=slice(0, None)):
return self.__dones[sls].float().squeeze().unsqueeze(0) # 1 x time
@property
def logits(self, sls=slice(0, None)): # assumes a trailing 1 for time dimension - common when using output from NN
return self.__logits[sls].squeeze().unsqueeze(0) # 1 x time x actions
@property
def values(self, sls=slice(0, None)):
return self.__values[sls].squeeze().unsqueeze(0) # 1 x time x actions
def add_observation(self, state: Union[Tensor, np.ndarray]):
self.__states.append(state if isinstance(state, Tensor) else torch.from_numpy(state))
def add_hidden_actor(self, hidden: Tensor):
# layers x hidden dim
self.__hidden_actor.append(hidden)
def add_hidden_critic(self, hidden: Tensor):
# layers x hidden dim
self.__hidden_critic.append(hidden)
def add_action(self, action: Union[int, Tensor]):
if not isinstance(action, Tensor):
action = torch.tensor(action)
self.__actions.append(action)
def add_reward(self, reward: Union[float, Tensor]):
if not isinstance(reward, Tensor):
reward = torch.tensor(reward)
self.__rewards.append(reward)
def add_done(self, done: bool):
if not isinstance(done, Tensor):
done = torch.tensor(done)
self.__dones.append(done)
def add_logits(self, logits: Tensor):
self.__logits.append(logits)
def add_values(self, values: Tensor):
self.__values.append(values)
def add(self, **kwargs):
for k, v in kwargs.items():
func = getattr(ActorCriticMemory, f'add_{k}')
func(self, v)
class MARLActorCriticMemory(object):
def __init__(self, n_agents, capacity):
self.n_agents = n_agents
self.memories = [
ActorCriticMemory(capacity) for _ in range(n_agents)
]
def __call__(self, agent_i):
return self.memories[agent_i]
def __len__(self):
return len(self.memories[0]) # todo add assertion check!
def reset(self):
for mem in self.memories:
mem.reset()
def add(self, **kwargs):
for agent_i in range(self.n_agents):
for k, v in kwargs.items():
func = getattr(ActorCriticMemory, f'add_{k}')
func(self.memories[agent_i], v[agent_i])
def __getattr__(self, attr):
all_attrs = [getattr(mem, attr) for mem in self.memories]
return torch.cat(all_attrs, 0) # agent x time ...
def chunk_dataloader(self, chunk_len, k):
datasets = [ExperienceChunks(mem, chunk_len, k) for mem in self.memories]
dataset = ConcatDataset(datasets)
data = [dataset[i] for i in range(len(dataset))]
data = custom_collate_fn(data)
return data
def custom_collate_fn(batch):
elem = batch[0]
return {key: torch.cat([d[key] for d in batch], dim=0) for key in elem}
class ExperienceChunks(Dataset):
def __init__(self, memory, chunk_len, k):
assert chunk_len <= len(memory), 'chunk_len cannot be longer than the size of the memory'
self.memory = memory
self.chunk_len = chunk_len
self.k = k
@property
def whitelist(self):
whitelist = torch.ones(len(self.memory) - self.chunk_len)
for d in self.memory.done.squeeze().nonzero().flatten():
whitelist[max((0, d-self.chunk_len-1)):d+2] = 0
whitelist[0] = 0
return whitelist.tolist()
def sample(self, start=1):
cl = self.chunk_len
sample = dict(observation=self.memory.observation[:, start:start+cl+1],
action=self.memory.action[:, start-1:start+cl],
hidden_actor=self.memory.hidden_actor[:, start-1],
hidden_critic=self.memory.hidden_critic[:, start-1],
reward=self.memory.reward[:, start:start + cl],
done=self.memory.done[:, start:start + cl],
logits=self.memory.logits[:, start:start + cl],
values=self.memory.values[:, start:start + cl])
return sample
def __len__(self):
return self.k
def __getitem__(self, i):
idx = random.choices(range(0, len(self.memory) - self.chunk_len), weights=self.whitelist, k=1)
return self.sample(idx[0])
class LazyTensorFiFoQueue:
def __init__(self, maxlen):
self.maxlen = maxlen
self.reset()
def reset(self):
self.__lazy_queue = deque(maxlen=self.maxlen)
self.shape = None
self.queue = None
def shape_init(self, tensor: Tensor):
self.shape = torch.Size([self.maxlen, *tensor.shape])
def build_tensor_queue(self):
if len(self.__lazy_queue) > 0:
block = torch.stack(list(self.__lazy_queue), dim=0)
l = block.shape[0]
if self.queue is None:
self.queue = block
elif self.true_len() <= self.maxlen:
self.queue = torch.cat((self.queue, block), dim=0)
else:
self.queue = torch.cat((self.queue[l:], block), dim=0)
self.__lazy_queue.clear()
def append(self, data):
if self.shape is None:
self.shape_init(data)
self.__lazy_queue.append(data)
if len(self.__lazy_queue) >= self.maxlen:
self.build_tensor_queue()
def true_len(self):
return len(self.__lazy_queue) + (0 if self.queue is None else self.queue.shape[0])
def __len__(self):
return min((self.true_len(), self.maxlen))
def __str__(self):
return f'LazyTensorFiFoQueue\tmaxlen: {self.maxlen}, shape: {self.shape}, ' \
f'len: {len(self)}, true_len: {self.true_len()}, elements in lazy queue: {len(self.__lazy_queue)}'
def __getitem__(self, item_or_slice):
self.build_tensor_queue()
return self.queue[item_or_slice]
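A minimal usage sketch of the memory classes above; the tensor shapes are illustrative, and the import path mirrors the file shown here, which may have been relocated in this commit:

```python
import torch

# Import path is an assumption based on the file above; adjust if the module has moved.
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory

n_agents, capacity = 2, 10
tm = MARLActorCriticMemory(n_agents, capacity)

for _ in range(5):
    tm.add(observation=torch.rand(n_agents, 3, 5, 5),       # per-agent observation
           action=torch.zeros(n_agents, dtype=torch.long),  # per-agent action index
           reward=torch.zeros(n_agents),
           done=torch.zeros(n_agents),
           logits=torch.zeros(n_agents, 4),
           values=torch.zeros(n_agents))

# Attribute access concatenates the per-agent memories along dim 0 (agent x time x ...).
print(tm.observation.shape, tm.reward.shape)
print(len(tm))  # number of stored transitions (= steps added - 1)
```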

View File

@ -1,103 +0,0 @@
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
class RecurrentAC(nn.Module):
def __init__(self, observation_size, n_actions, obs_emb_size,
action_emb_size, hidden_size_actor, hidden_size_critic,
n_agents, use_agent_embedding=True):
super(RecurrentAC, self).__init__()
observation_size = np.prod(observation_size)
self.n_layers = 1
self.n_actions = n_actions
self.use_agent_embedding = use_agent_embedding
self.hidden_size_actor = hidden_size_actor
self.hidden_size_critic = hidden_size_critic
self.action_emb_size = action_emb_size
self.obs_proj = nn.Linear(observation_size, obs_emb_size)
self.action_emb = nn.Embedding(n_actions+1, action_emb_size, padding_idx=0)
self.agent_emb = nn.Embedding(n_agents, action_emb_size)
mix_in_size = obs_emb_size+action_emb_size if not use_agent_embedding else obs_emb_size+n_agents*action_emb_size
self.mix = nn.Sequential(nn.Tanh(),
nn.Linear(mix_in_size, obs_emb_size),
nn.Tanh(),
nn.Linear(obs_emb_size, obs_emb_size)
)
self.gru_actor = nn.GRU(obs_emb_size, hidden_size_actor, batch_first=True, num_layers=self.n_layers)
self.gru_critic = nn.GRU(obs_emb_size, hidden_size_critic, batch_first=True, num_layers=self.n_layers)
self.action_head = nn.Sequential(
nn.Linear(hidden_size_actor, hidden_size_actor),
nn.Tanh(),
nn.Linear(hidden_size_actor, n_actions)
)
# spectral_norm(nn.Linear(hidden_size_actor, hidden_size_actor)),
self.critic_head = nn.Sequential(
nn.Linear(hidden_size_critic, hidden_size_critic),
nn.Tanh(),
nn.Linear(hidden_size_critic, 1)
)
#self.action_head[-1].weight.data.uniform_(-3e-3, 3e-3)
#self.action_head[-1].bias.data.uniform_(-3e-3, 3e-3)
def init_hidden_actor(self):
return torch.zeros(1, self.n_layers, self.hidden_size_actor)
def init_hidden_critic(self):
return torch.zeros(1, self.n_layers, self.hidden_size_critic)
def forward(self, observations, actions, hidden_actor=None, hidden_critic=None):
n_agents, t, *_ = observations.shape
obs_emb = self.obs_proj(observations.view(n_agents, t, -1).float())
action_emb = self.action_emb(actions+1) # shift by one due to padding idx
if not self.use_agent_embedding:
x_t = torch.cat((obs_emb, action_emb), -1)
else:
agent_emb = self.agent_emb(
torch.cat([torch.arange(0, n_agents, 1).view(-1, 1)] * t, 1)
)
x_t = torch.cat((obs_emb, agent_emb, action_emb), -1)
mixed_x_t = self.mix(x_t)
output_p, _ = self.gru_actor(input=mixed_x_t, hx=hidden_actor.swapaxes(1, 0))
output_c, _ = self.gru_critic(input=mixed_x_t, hx=hidden_critic.swapaxes(1, 0))
logits = self.action_head(output_p)
critic = self.critic_head(output_c).squeeze(-1)
return dict(logits=logits, critic=critic, hidden_actor=output_p, hidden_critic=output_c)
class RecurrentACL2(RecurrentAC):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.action_head = nn.Sequential(
nn.Linear(self.hidden_size_actor, self.hidden_size_actor),
nn.Tanh(),
NormalizedLinear(self.hidden_size_actor, self.n_actions, trainable_magnitude=True)
)
class NormalizedLinear(nn.Linear):
def __init__(self, in_features: int, out_features: int,
device=None, dtype=None, trainable_magnitude=False):
super(NormalizedLinear, self).__init__(in_features, out_features, False, device, dtype)
self.d_sqrt = in_features**0.5
self.trainable_magnitude = trainable_magnitude
self.scale = nn.Parameter(torch.tensor([1.]), requires_grad=trainable_magnitude)
def forward(self, in_array):
normalized_input = F.normalize(in_array, dim=-1, p=2, eps=1e-5)
normalized_weight = F.normalize(self.weight, dim=-1, p=2, eps=1e-5)
return F.linear(normalized_input, normalized_weight) * self.d_sqrt * self.scale
class L2Norm(nn.Module):
def __init__(self, in_features, trainable_magnitude=False):
super(L2Norm, self).__init__()
self.d_sqrt = in_features**0.5
self.scale = nn.Parameter(torch.tensor([1.]), requires_grad=trainable_magnitude)
def forward(self, x):
return F.normalize(x, dim=-1, p=2, eps=1e-5) * self.d_sqrt * self.scale
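A hedged shape check for the `RecurrentAC` network above, using dummy sizes; the hidden states follow the `1 x n_layers x dim` layout returned by `init_hidden_*` and are stacked per agent, as the training loops do:

```python
import torch

# Assumes the RecurrentAC class defined above is in scope; all sizes are illustrative.
n_agents, t, obs_shape, n_actions = 2, 5, (3, 5, 5), 4
net = RecurrentAC(observation_size=obs_shape, n_actions=n_actions,
                  obs_emb_size=96, action_emb_size=16,
                  hidden_size_actor=64, hidden_size_critic=64,
                  n_agents=n_agents, use_agent_embedding=False)

obs = torch.rand(n_agents, t, *obs_shape)
actions = torch.zeros(n_agents, t, dtype=torch.long)
hidden_actor = torch.cat([net.init_hidden_actor()] * n_agents, 0)    # n_agents x n_layers x 64
hidden_critic = torch.cat([net.init_hidden_critic()] * n_agents, 0)

out = net(obs, actions, hidden_actor, hidden_critic)
print(out['logits'].shape)  # n_agents x t x n_actions
print(out['critic'].shape)  # n_agents x t
```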

View File

@ -1,55 +0,0 @@
import torch
from torch.distributions import Categorical
from marl_factory_grid.algorithms.rl.iac import LoopIAC
from marl_factory_grid.algorithms.rl.base_ac import nms
from marl_factory_grid.algorithms.rl.memory import MARLActorCriticMemory
class LoopSEAC(LoopIAC):
def __init__(self, cfg):
super(LoopSEAC, self).__init__(cfg)
def actor_critic(self, tm, networks, gamma, entropy_coef, vf_coef, gae_coef=0.0, **kwargs):
obs, actions, done, reward = tm.observation, tm.action, tm.done[:, 1:], tm.reward[:, 1:]
outputs = [net(obs, actions, tm.hidden_actor[:, 0], tm.hidden_critic[:, 0]) for net in networks]
with torch.inference_mode(True):
true_action_logp = torch.stack([
torch.log_softmax(out[nms.LOGITS][ag_i, :-1], -1)
.gather(index=actions[ag_i, 1:, None], dim=-1)
for ag_i, out in enumerate(outputs)
], 0).squeeze()
losses = []
for ag_i, out in enumerate(outputs):
logits = out[nms.LOGITS][:, :-1] # last one only needed for v_{t+1}
critic = out[nms.CRITIC]
entropy_loss = Categorical(logits=logits[ag_i]).entropy().mean()
advantages = self.compute_advantages(critic, reward, done, gamma, gae_coef)
# policy loss
log_ap = torch.log_softmax(logits, -1)
log_ap = torch.gather(log_ap, dim=-1, index=actions[:, 1:].unsqueeze(-1)).squeeze()
# importance weights
iw = (log_ap - true_action_logp).exp().detach() # importance_weights
a2c_loss = (-iw*log_ap * advantages.detach()).mean(-1)
value_loss = (iw*advantages.pow(2)).mean(-1) # n_agent
# weighted loss
loss = (a2c_loss + vf_coef*value_loss - entropy_coef * entropy_loss).mean()
losses.append(loss)
return losses
def learn(self, tms: MARLActorCriticMemory, **kwargs):
losses = self.actor_critic(tms, self.net, **self.cfg[nms.ALGORITHM], **kwargs)
for ag_i, loss in enumerate(losses):
self.optimizer[ag_i].zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(self.net[ag_i].parameters(), 0.5)
self.optimizer[ag_i].step()
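The importance weight `iw` above is the SEAC-style off-policy correction: each agent also learns from the other agents' transitions, re-weighted by the ratio between its own action probability and that of the policy that generated the data. A toy illustration with made-up probabilities:

```python
import torch

log_ap = torch.log(torch.tensor([0.6, 0.2]))             # learner's log-prob of the taken actions
true_action_logp = torch.log(torch.tensor([0.3, 0.4]))   # behaviour policy's log-prob

iw = (log_ap - true_action_logp).exp()  # = p_learner / p_behaviour
print(iw)  # tensor([2.0000, 0.5000])
```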

View File

@ -1,33 +0,0 @@
from marl_factory_grid.algorithms.rl.base_ac import BaseActorCritic
from marl_factory_grid.algorithms.rl.base_ac import nms
import torch
from torch.distributions import Categorical
from pathlib import Path
class LoopSNAC(BaseActorCritic):
def __init__(self, cfg):
super().__init__(cfg)
def load_state_dict(self, path: Path):
path2weights = list(path.glob('*.pt'))
assert len(path2weights) == 1, f'Expected a single set of weights but got {len(path2weights)}'
self.net.load_state_dict(torch.load(path2weights[0]))
def init_hidden(self):
hidden_actor = self.net.init_hidden_actor()
hidden_critic = self.net.init_hidden_critic()
return dict(hidden_actor=torch.cat([hidden_actor] * self.n_agents, 0),
hidden_critic=torch.cat([hidden_critic] * self.n_agents, 0)
)
def get_actions(self, out):
actions = Categorical(logits=out[nms.LOGITS]).sample().squeeze()
return actions
def forward(self, observations, actions, hidden_actor, hidden_critic):
out = self.net(self._as_torch(observations).unsqueeze(1),
self._as_torch(actions).unsqueeze(1),
hidden_actor, hidden_critic
)
return out

View File

@ -33,6 +33,7 @@ class TSPBaseAgent(ABC):
self.local_optimization = True
self._env = state
self.state = self._env.state[c.AGENT][agent_i]
self.spawn_position = np.array(self.state.pos)
self._position_graph = self.generate_pos_graph()
self._static_route = None
self.cached_route = None
@ -79,7 +80,7 @@ class TSPBaseAgent(ABC):
start_time = time.time()
if self.cached_route is not None:
print(f" Used cached route: {self.cached_route}")
#print(f" Used cached route: {self.cached_route}")
return copy.deepcopy(self.cached_route)
else:
@ -89,7 +90,7 @@ class TSPBaseAgent(ABC):
[self.state.pos] + \
[x for x in positions if max(abs(np.subtract(x, self.state.pos))) < 3]
try:
while len(nodes) < 7:
while len(nodes) < 13:
nodes += [next(x for x in positions if x not in nodes)]
except StopIteration:
nodes = [self.state.pos] + positions
@ -100,11 +101,11 @@ class TSPBaseAgent(ABC):
route = tsp.traveling_salesman_problem(self._position_graph,
nodes=nodes, cycle=True, method=tsp.greedy_tsp)
self.cached_route = copy.deepcopy(route)
print(f"Cached route: {self.cached_route}")
#print(f"Cached route: {self.cached_route}")
end_time = time.time()
duration = end_time - start_time
print("TSP calculation took {:.2f} seconds to execute".format(duration))
#print("TSP calculation took {:.2f} seconds to execute".format(duration))
return route
def _door_is_close(self, state):
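For context, the cached route above comes from networkx's approximate TSP solver. A standalone sketch of the same call on a small grid graph (graph size and node selection are illustrative stand-ins for the level's actual position graph):

```python
import networkx as nx
from networkx.algorithms.approximation import traveling_salesman as tsp

# 4-connected grid as a stand-in for the level's position graph; unweighted edges count as 1.
graph = nx.grid_2d_graph(5, 5)
nodes = [(0, 0), (0, 3), (2, 2), (4, 1), (4, 4)]

route = tsp.traveling_salesman_problem(graph, nodes=nodes, cycle=True, method=tsp.greedy_tsp)
print(route)  # closed tour visiting all requested nodes
```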

View File

@ -0,0 +1,96 @@
import os
import pickle
from pathlib import Path
from tqdm import trange
from marl_factory_grid import Factory
from marl_factory_grid.algorithms.static.contortions import get_coin_quadrant_tsp_agents, get_two_rooms_tsp_agents
def coin_quadrant_multi_agent_tsp_eval(emergent_phenomenon):
run_tsp_setting("coin_quadrant", emergent_phenomenon, log=False)
def two_rooms_multi_agent_tsp_eval(emergent_phenomenon):
run_tsp_setting("two_rooms", emergent_phenomenon, log=False)
def run_tsp_setting(config_name, emergent_phenomenon, n_episodes=1, log=False):
# Render at each step?
render = True
# Path to config File
path = Path(f'./marl_factory_grid/configs/tsp/{config_name}.yaml')
# Create results folder
runs = os.listdir("./study_out/")
run_numbers = [int(run[7:]) for run in runs if run[:7] == "tsp_run"]
next_run_number = max(run_numbers) + 1 if run_numbers else 0
results_path = f"./study_out/tsp_run{next_run_number}"
os.mkdir(results_path)
# Env Init
factory = Factory(path)
with open(f"{results_path}/env_config.txt", "w") as txt_file:
txt_file.write(str(factory.conf))
still_existing_coin_piles = []
reached_flags = []
for episode in trange(n_episodes):
_ = factory.reset()
still_existing_coin_piles.append([])
reached_flags.append([])
done = False
if render:
factory.render()
factory._renderer.fps = 5
if config_name == "coin_quadrant":
agents = get_coin_quadrant_tsp_agents(emergent_phenomenon, factory)
elif config_name == "two_rooms":
agents = get_two_rooms_tsp_agents(emergent_phenomenon, factory)
else:
print("Config name does not exist. Abort...")
break
ep_steps = 0
while not done:
a = [x.predict() for x in agents]
# Terminate as soon as all coin piles have been collected. This keeps the termination criterion
# of the TSP agents consistent with that of the RL agents.
if 'CoinPiles' in list(factory.state.entities.keys()) and factory.state.entities['CoinPiles'].global_amount == 0.0:
break
obs_type, _, _, done, info = factory.step(a)
if 'CoinPiles' in list(factory.state.entities.keys()):
still_existing_coin_piles[-1].append(len(factory.state.entities['CoinPiles']))
if 'Destinations' in list(factory.state.entities.keys()):
reached_flags[-1].append(sum([1 for ele in [x.was_reached() for x in factory.state['Destinations']] if ele]))
ep_steps += 1
if render:
factory.render()
if done:
break
collected_coin_piles_per_step = []
if 'CoinPiles' in list(factory.state.entities.keys()):
for ep in still_existing_coin_piles:
collected_coin_piles_per_step.append([max(ep)-ep[idx] for idx, value in enumerate(ep)])
# Drop the first entry and append a final entry in which all coin piles have been collected
del collected_coin_piles_per_step[-1][0]
collected_coin_piles_per_step[-1].append(max(still_existing_coin_piles[-1]))
# Add last entry to reached_flags
print("Number of environment steps:", ep_steps)
if 'CoinPiles' in list(factory.state.entities.keys()):
print("Collected coins per step:", collected_coin_piles_per_step)
if 'Destinations' in list(factory.state.entities.keys()):
print("Reached flags per step:", reached_flags)
if log:
if 'CoinPiles' in list(factory.state.entities.keys()):
metrics_data = {"collected_coin_piles_per_step": collected_coin_piles_per_step}
if 'Destinations' in list(factory.state.entities.keys()):
metrics_data = {"reached_flags": reached_flags}
with open(f"{results_path}/metrics", "wb") as pickle_file:
pickle.dump(metrics_data, pickle_file)
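Assuming the module above is importable (the path below is a guess based on the package layout, not confirmed by the diff), the two entry points can be called directly to either provoke or mitigate the emergent behaviour:

```python
# Hypothetical import path; adjust to wherever this evaluation module lives in the package.
from marl_factory_grid.algorithms.static.eval_tsp import (coin_quadrant_multi_agent_tsp_eval,
                                                          two_rooms_multi_agent_tsp_eval)

coin_quadrant_multi_agent_tsp_eval(emergent_phenomenon=True)   # provoke the emergent phenomenon
two_rooms_multi_agent_tsp_eval(emergent_phenomenon=False)      # run the mitigated variant
```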

View File

@ -0,0 +1,55 @@
import numpy as np
from marl_factory_grid.algorithms.static.TSP_coin_agent import TSPCoinAgent
from marl_factory_grid.algorithms.static.TSP_target_agent import TSPTargetAgent
def get_coin_quadrant_tsp_agents(emergent_phenomenon, factory):
agents = [TSPCoinAgent(factory, 0), TSPCoinAgent(factory, 1)]
if not emergent_phenomenon:
edge_costs = {}
# Add costs for horizontal edges
for i in range(1, 10):
for j in range(1, 9):
# Add costs for both traversal directions
edge_costs[f"{(i, j)}-{i, j + 1}"] = 0.55 + (i - 1) * 0.05
edge_costs[f"{i, j + 1}-{(i, j)}"] = 0.55 + (i - 1) * 0.05
# Add costs for vertical edges
for i in range(1, 9):
for j in range(1, 10):
# Add costs for both traversal directions
edge_costs[f"{(i, j)}-{i + 1, j}"] = 0.55 + (i) * 0.05
edge_costs[f"{i + 1, j}-{(i, j)}"] = 0.55 + (i - 1) * 0.05
for agent in agents:
for u, v, weight in agent._position_graph.edges(data='weight'):
agent._position_graph[u][v]['weight'] = edge_costs[f"{u}-{v}"]
return agents
def get_two_rooms_tsp_agents(emergent_phenomenon, factory):
agents = [TSPTargetAgent(factory, 0), TSPTargetAgent(factory, 1)]
if not emergent_phenomenon:
edge_costs = {}
# Add costs for horizontal edges
for i in range(1, 6):
for j in range(1, 13):
# Add costs for both traversal directions
edge_costs[f"{(i, j)}-{i, j + 1}"] = np.abs(5/i*np.cbrt(((j+1)/4 - 1)) - 1)
edge_costs[f"{i, j + 1}-{(i, j)}"] = np.abs(5/i*np.cbrt((j/4 - 1)) - 1)
# Add costs for vertical edges
for i in range(1, 5):
for j in range(1, 14):
# Add costs for both traversal directions
edge_costs[f"{(i, j)}-{i + 1, j}"] = np.abs(5/(i+1)*np.cbrt((j/4 - 1)) - 1)
edge_costs[f"{i + 1, j}-{(i, j)}"] = np.abs(5/i*np.cbrt((j/4 - 1)) - 1)
for agent in agents:
for u, v, weight in agent._position_graph.edges(data='weight'):
agent._position_graph[u][v]['weight'] = edge_costs[f"{u}-{v}"]
return agents

View File

@ -1,9 +1,11 @@
import os
from pathlib import Path
import numpy as np
import yaml
from marl_factory_grid import Factory
from marl_factory_grid.algorithms.marl.utils import get_configs_marl_path
def load_class(classname):
@ -43,6 +45,10 @@ def get_class(arguments):
return c
def get_study_out_path():
return Path(os.path.join(Path(__file__).parent.parent.parent, "study_out"))
def get_arguments(arguments):
d = dict(arguments)
if "classname" in d:
@ -58,19 +64,13 @@ def load_yaml_file(path: Path):
def add_env_props(cfg):
# Path to config File
env_path = Path(f'../marl_factory_grid/configs/{cfg["env"]["env_name"]}.yaml')
env_path = Path(f'{get_configs_marl_path()}/{cfg["env"]["env_name"]}.yaml')
print(cfg)
# Env Init
factory = Factory(env_path)
_ = factory.reset()
# Agent Init
if len(factory.state.moving_entites) == 1: # Single agent setting
observation_size = list(factory.observation_space.shape)
else: # Multi-agent setting
observation_size = list(factory.observation_space[0].shape)
cfg['agent'].update(dict(observation_size=observation_size, n_actions=factory.action_space[0].n))
return factory

View File

@ -1,78 +0,0 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# Radius of Partially observable Markov decision process
pomdp_r: 0 # default 3
# Print all messages and events
verbose: false
# Run tests
tests: false
# In the "clean and bring" Scenario one agent aims to pick up all items and drop them at drop-off locations while all
# other agents aim to collect coin piles.
Agents:
# The collect coin agents
#Sigmund:
#Actions:
#- Move4
#- Noop
#Observations:
#- CoinPiles
#- Self
#Positions:
#- (9,1)
#- (1,1)
#- (2,4)
#- (4,7)
#- (7,9)
#- (2,4)
#- (4,7)
#- (7,9)
#- (9,9)
#- (9,1)
Wolfgang:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions:
- (9,5)
#- (1,1)
#- (2,4)
#- (4,7)
#- (7,9)
#- (2,4)
#- (4,7)
#- (7,9)
#- (9,9)
#- (9,5)
Entities:
CoinPiles:
coords_or_quantity: (1, 1), (2,4), (4,7), (7,9), (9,9) #(9,9), (7,9), (4,7), (2,4), (1, 1) #(1, 1), (2,4), (4,7), (7,9), (9,9) # (4,7), (2,4), (1, 1) # (1, 1), (2,4), (4,7), (7,9), (9,9) # (1, 1), (1,2), (1,3), (2,4), (2,5), (3,6), (4,7), (5,8), (6,8), (7,9), (8,9), (9,9)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
# Can be omitted/ignored if you do not want to take care of collisions at all.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or a fail conditions.
# The environment stops when all coins are collected
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached:
#max_steps: 200

View File

@ -1,62 +0,0 @@
General:
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
Agents:
#Sigmund:
#Actions:
#- Move4
#- DoorUse
#Observations:
#- CoinPiles
#- Self
#Positions:
#- (3,1)
#- (2,1)
Wolfgang:
Actions:
- Move4
- DoorUse
Observations:
- CoinPiles
- Self
Positions:
- (3,13)
- (2,13)
Entities:
CoinPiles:
coords_or_quantity: (2,13), (3,2) # (2,1), (3,12)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
Doors: { }
Rules:
# Environment Dynamics
#DoorAutoClose:
#close_frequency: 10
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
#DoneOnAllDirtCleaned:
DoneAtMaxStepsReached:
max_steps: 50

View File

@ -1,75 +0,0 @@
General:
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
Agents:
#Sigmund:
#Actions:
#- Move4
#Observations:
#- CoinPiles
#- Self
#Positions:
#- (3,1)
#- (1,1)
#- (3,1)
#- (5,1)
#- (3,1)
#- (1,8)
#- (3,1)
#- (5,8)
Wolfgang:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions:
- (3,13)
- (2,13)
- (1,13)
- (3,13)
- (1,8)
- (2,6)
- (3,10)
- (4,6)
Entities:
CoinPiles:
coords_or_quantity: (2,13), (3,2) # (2,1), (3,12)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
#Doors: { }
Rules:
# Environment Dynamics
#DoorAutoClose:
#close_frequency: 10
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached:
#max_steps: 100
AgentSpawnRule:
spawn_rule: "order"

View File

@ -26,6 +26,28 @@ Agents:
- Noop
- Charge
- Clean
- DestAction
- DoorUse
- ItemAction
- Move8
Observations:
- Combined:
- Other
- Walls
- GlobalPosition
- Battery
- ChargePods
- DirtPiles
- Destinations
- Doors
- Items
- Inventory
- DropOffLocations
- Maintainers
Herbert:
Actions:
- Noop
- Charge
- Collect
- DestAction
- DoorUse
@ -39,7 +61,6 @@ Agents:
- Battery
- ChargePods
- CoinPiles
- DirtPiles
- Destinations
- Doors
- Items
@ -62,10 +83,10 @@ Entities:
# CoinPiles: Entities that can be collected by an agent.
CoinPiles:
coords_or_quantity: 10
initial_amount: 2
initial_amount: 1
collect_amount: 1
coin_spawn_r_var: 0.1
max_global_amount: 20
max_global_amount: 10
max_local_amount: 5
# Destinations: Entities representing target locations for agents.

View File

@ -5,60 +5,47 @@ General:
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# Radius of Partially observable Markov decision process
pomdp_r: 0 # default 3
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In the "collect and bring" Scenario one agent aims to pick up all items and drop them at drop-off locations while all
# other agents aim to collect coin piles.
# Define Agents, their actions, observations and spawnpoints
Agents:
# The collect coin agents
Sigmund:
# The clean agents
Agent1:
Actions:
- Move4
#- Collect
#- Noop
- Noop
Observations:
- CoinPiles
- Self
Positions:
- (9,1)
- (4,5)
- (1,1)
- (4,5)
- (9,1)
- (9,9)
Wolfgang:
Agent2:
Actions:
- Move4
#- Collect
#- Noop
- Noop
Observations:
- CoinPiles
- Self
Positions:
- (9,5)
- (4,5)
- (1,1)
- (4,5)
- (9,5)
- (9,9)
Entities:
CoinPiles:
coords_or_quantity: (9,9), (1,1), (4,5) # (4,7), (2,4), (1, 1) #(1, 1), (2,4), (4,7), (7,9), (9,9) # (1, 1), (1,2), (1,3), (2,4), (2,5), (3,6), (4,7), (5,8), (6,8), (7,9), (8,9), (9,9)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coords_or_quantity: (9,9), (7,9), (4,7), (2,4), (1, 1)
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
randomize: False # If coins should spawn at random positions instead of the positions defined above
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
# Can be omitted/ignored if you do not want to take care of collisions at all.
@ -67,7 +54,5 @@ Rules:
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins are collected
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached: # An episode should last for at most max_steps steps
#max_steps: 100

View File

@ -1,20 +1,20 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
# Define Agents, their actions, observations and spawnpoints
Agents:
Sigmund:
Agent1:
Actions:
- Move4
- DoorUse
@ -24,7 +24,7 @@ Agents:
- Self
Positions:
- (3,1)
Wolfgang:
Agent2:
Actions:
- Move4
- DoorUse
@ -36,10 +36,11 @@ Agents:
- (3,13)
Entities:
# For RL-agent we model the flags as coin piles to be more flexible
CoinPiles:
coords_or_quantity: (2,1), (3,12), (2,13), (3,2) # Static form: auxiliary pile, primary pile, auxiliary pile, ...
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
@ -47,16 +48,13 @@ Entities:
Doors: { }
Rules:
# Environment Dynamics
#DoorAutoClose:
#close_frequency: 10
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
#DoneOnAllDirtCleaned:
# Define the conditions for the environment to stop. Either success or failure conditions.
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 50
max_steps: 30

View File

@ -1,20 +1,20 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
# Define Agents, their actions, observations and spawnpoints
Agents:
Sigmund:
Agent1:
Actions:
- Move4
- DoorUse
@ -24,7 +24,7 @@ Agents:
- Self
Positions:
- (3,1)
Wolfgang:
Agent2:
Actions:
- Move4
- DoorUse
@ -36,10 +36,11 @@ Agents:
- (3,13)
Entities:
# For RL-agent we model the flags as coin piles to be more flexible
CoinPiles:
coords_or_quantity: (3,12), (3,2) # Static form: auxiliary pile, primary pile, auxiliary pile, ...
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coords_or_quantity: (3,12), (3,2) # Locations of flags
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
@ -47,16 +48,13 @@ Entities:
Doors: { }
Rules:
# Environment Dynamics
#DoorAutoClose:
#close_frequency: 10
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
#DoneOnAllDirtCleaned:
# Define the conditions for the environment to stop. Either success or failure conditions.
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 30

View File

@ -0,0 +1,48 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
# The clean agents
Agent1:
Actions:
- Move4
- Noop
Observations:
- CoinPiles
- Self
Positions:
- (9,1)
Entities:
CoinPiles:
coords_or_quantity: (1, 1), (2,4), (4,7), (7,9), (9,9) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
# Can be omitted/ignored if you do not want to take care of collisions at all.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:

View File

@ -5,69 +5,45 @@ General:
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# Radius of Partially observable Markov decision process
pomdp_r: 0 # default 3
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In the "collect and bring" Scenario one agent aims to pick up all items and drop them at drop-off locations while all
# other agents aim to collect coin piles.
# Define Agents, their actions, observations and spawnpoints
Agents:
# The clean agents
#Sigmund:
#Actions:
#- Move4
#Observations:
#- CoinPiles
#- Self
#Positions:
#- (9,1)
#- (1,1)
#- (2,4)
#- (4,7)
#- (6,8)
#- (7,9)
#- (2,4)
#- (4,7)
#- (6,8)
#- (7,9)
#- (9,9)
#- (9,1)
Wolfgang:
Agent1:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions:
- (9,5)
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (9,1)
- (1,1)
- (2,4)
- (4,7)
- (6,8)
- (7,9)
- (2,4)
- (4,7)
- (6,8)
- (7,9)
- (9,9)
- (9,5)
- (9,1)
Entities:
CoinPiles:
coords_or_quantity: (1, 1), (2,4), (4,7), (6,8), (7,9), (9,9) # (4,7), (2,4), (1, 1) #(1, 1), (2,4), (4,7), (7,9), (9,9) # (1, 1), (1,2), (1,3), (2,4), (2,5), (3,6), (4,7), (5,8), (6,8), (7,9), (8,9), (9,9)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coords_or_quantity: (1, 1), (2,4), (4,7), (7,9), (9,9) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
# Can be omitted/ignored if you do not want to take care of collisions at all.
@ -76,10 +52,8 @@ Rules:
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins are collected
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached: # An episode should last for at most max_steps steps
#max_steps: 1000
# Define how agents spawn.
# Options: "random" (Spawn agent at a random position from the list of defined positions)

View File

@ -0,0 +1,50 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
Agent1:
Actions:
- Move4
- DoorUse
Observations:
- CoinPiles
- Self
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (3,1)
- (2,1) # spawnpoint only required if agent1 should go to its auxiliary pile
Entities:
CoinPiles:
coords_or_quantity: (2,1), (3,12) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
Doors: { }
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 30

View File

@ -0,0 +1,55 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
Agent1:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (5,1)
- (2,1)
- (1,1)
Entities:
CoinPiles:
coords_or_quantity: (3,12) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
#Doors: { } # We leave out the door during training
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
# Define how agents spawn.
# Options: "random" (Spawn agent at a random position from the list of defined positions)
# "first" (Always spawn agent at first position regardless of the other provided positions)
# "order" (Loop through agent positions)
AgentSpawnRule:
spawn_rule: "order"

View File

@ -0,0 +1,49 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
Agent2:
Actions:
- Move4
- DoorUse
Observations:
- CoinPiles
- Self
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (3,13)
Entities:
CoinPiles:
coords_or_quantity: (3,2) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
Doors: { }
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 30

View File

@ -0,0 +1,54 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# Define Agents, their actions, observations and spawnpoints
Agents:
Agent2:
Actions:
- Move4
Observations:
- CoinPiles
- Self
Positions: # Each spawnpoint is mapped to one coin pile looping over coords_or_quantity (see below)
- (3,13)
Entities:
CoinPiles:
coords_or_quantity: (3,2) # Locations of coin piles
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
#Doors: { } # We leave out the door during training
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
# Utilities
# This rule defines the collision mechanic, introduces a related DoneCondition and lets you specify rewards.
WatchCollisions:
done_at_collisions: false
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
# Defines how agents spawn.
# Options: "random" (Spawn agent at a random position from the list of defined positions)
# "first" (Always spawn agent at first position regardless of the other provided positions)
# "order" (Loop through agent positions)
AgentSpawnRule:
spawn_rule: "order"

View File

@ -18,28 +18,28 @@ Agents:
# - Doors
# - Maintainers
# Clones: 0
# Item test agent:
# Actions:
# - Noop
# - Charge
# - DestAction
# - DoorUse
# - ItemAction
# - Move8
# Observations:
# - Combined:
# - Other
# - Walls
# - GlobalPosition
# - Battery
# - ChargePods
# - Destinations
# - Doors
# - Items
# - Inventory
# - DropOffLocations
# - Maintainers
# Clones: 0
Item test agent:
Actions:
- Noop
- Charge
- DestAction
- DoorUse
- ItemAction
- Move8
Observations:
- Combined:
- Other
- Walls
- GlobalPosition
- Battery
- ChargePods
- Destinations
- Doors
- Items
- Inventory
- DropOffLocations
- Maintainers
Clones: 0
# Target test agent:
# Actions:
# - Noop
@ -56,25 +56,25 @@ Agents:
# - Doors
# - Maintainers
# Clones: 1
Coin test agent:
Actions:
- Noop
- Charge
- Collect
- DoorUse
- Move8
Observations:
- Combined:
- Other
- Walls
- GlobalPosition
- Battery
- ChargePods
- CoinPiles
- Destinations
- Doors
- Maintainers
Clones: 1
# Coin test agent:
# Actions:
# - Noop
# - Charge
# - Collect
# - DoorUse
# - Move8
# Observations:
# - Combined:
# - Other
# - Walls
# - GlobalPosition
# - Battery
# - ChargePods
# - CoinPiles
# - Destinations
# - Doors
# - Maintainers
# Clones: 1
Entities:
@ -93,7 +93,7 @@ Entities:
# dirt_spawn_r_var: 0.1
# max_global_amount: 20
# max_local_amount: 5
CoinPiles:
DirtPiles:
coords_or_quantity: 10
initial_amount: 2
collect_amount: 1
@ -134,7 +134,7 @@ Rules:
# respawn_freq: 15
RespawnItems:
respawn_freq: 15
RespawnCoins:
RespawnDirt:
respawn_freq: 15
# Utilities

View File

@ -5,31 +5,34 @@ General:
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: quadrant
# Radius of Partially observable Markov decision process
pomdp_r: 0 # default 3
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In the "clean and bring" Scenario one agent aims to pick up all items and drop them at drop-off locations while all
# other agents aim to clean dirt piles.
# Define Agents, their actions, observations and spawnpoints
Agents:
# The coin collect agents
Sigmund:
# The clean agents
Agent1:
Actions:
- Move4
- Collect
- Noop
Observations:
- Walls
- CoinPiles
- Self
Positions:
- (9,1)
Wolfgang:
Agent2:
Actions:
- Move4
- Collect
- Noop
Observations:
- Walls
- CoinPiles
- Self
Positions:
@ -37,12 +40,13 @@ Agents:
Entities:
CoinPiles:
coords_or_quantity: (9,9), (7,9), (4,7), (2,4), (1, 1) # (4,7), (2,4), (1, 1) # (1, 1), (2,4), (4,7), (7,9), (9,9) # (1, 1), (1,2), (1,3), (2,4), (2,5), (3,6), (4,7), (5,8), (6,8), (7,9), (8,9), (9,9)
initial_amount: 0.5 # <1 to ensure that the robot which first attempts to collect this field, can collect the coin in one action
collect_amount: 1
coords_or_quantity: (1, 1), (2,4), (4,7), (7,9), (9,9)
initial_amount: 0.5
clean_amount: 1
coin_spawn_r_var: 0
max_global_amount: 12
max_local_amount: 1
randomize: False # If coins should spawn at random positions instead of the positions defined above
# Rules section specifies the rules governing the dynamics of the environment.
Rules:
@ -55,7 +59,5 @@ Rules:
# Done Conditions
# Define the conditions for the environment to stop. Either success or failure conditions.
# The environment stops when all coins are collected
# The environment stops when all coins have been collected
DoneOnAllCoinsCollected:
#DoneAtMaxStepsReached:
#max_steps: 200

View File

@ -1,40 +1,38 @@
General:
# RNG-seed to sample the same "random" numbers every time, to make the different runs comparable.
env_seed: 69
# Individual vs global rewards
individual_rewards: true
# The level.txt file to load from marl_factory_grid/levels
level_name: two_rooms_modified
# View Radius; 0 = full observability
pomdp_r: 0
level_name: two_rooms_small
# View Radius
pomdp_r: 0 # Use custom partial observability setting
# Print all messages and events
verbose: false
# Run tests
tests: false
# In "two rooms one door" scenario 2 agents spawn in 2 different rooms that are connected by a single door. Their aim
# is to reach the destination in the room they didn't spawn in leading to a conflict at the door.
# Define Agents, their actions, observations and spawnpoints
Agents:
Wolfgang:
Agent1:
Actions:
- Move4
- Noop
- DestAction
- DestAction # Action that is performed when the destination is reached
- DoorUse
Observations:
- Walls
- Other
- Doors
- Destination
Positions:
- (3,1) # Agent spawnpoint
Sigmund:
- (3,1)
Agent2:
Actions:
- Move4
- Noop
- DestAction
- DoorUse
Observations:
- Other
- Walls
- Destination
- Doors
@ -45,10 +43,11 @@ Entities:
Destinations:
spawnrule:
SpawnDestinationsPerAgent:
# Target coordinates
coords_or_quantity:
Wolfgang:
- (3,12) # Target coordinates
Sigmund:
Agent1:
- (3,12)
Agent2:
- (3,2)
Doors: { }
@ -68,10 +67,12 @@ Rules:
AssignGlobalPositions: { }
DoneAtDestinationReach:
reward_at_done: 1
reward_at_done: 50
# We want to give rewards only, when all targets have been reached.
condition: "all"
# Done Conditions
# Define the conditions for the environment to stop. Either success or a fail conditions
# Environment execution stops after 30 steps
DoneAtMaxStepsReached:
max_steps: 50
max_steps: 30

View File

@ -1,3 +1,4 @@
import copy
import shutil
from collections import defaultdict
@ -100,7 +101,7 @@ class Factory(gym.Env):
parsed_entities = self.conf.load_entities()
self.map = LevelParser(self.level_filepath, parsed_entities, self.conf.pomdp_r)
self.levels_that_require_masking = ['two_rooms']
self.levels_that_require_masking = ['two_rooms_small']
# Init for later usage:
# noinspection PyTypeChecker
@ -274,10 +275,15 @@ class Factory(gym.Env):
global Renderer
self._renderer = Renderer(self.map.level_shape, view_radius=self.conf.pomdp_r, fps=10)
render_entities = self.state.entities.render()
# Remove potential Nones from entities
render_entities_full = self.state.entities.render()
# Hide entities where certain conditions are met (e.g., CoinPiles whose amount has dropped to 0)
render_entities = self.filter_entities(render_entities)
maintain_indices = self.filter_entities(self.state.entities)
if maintain_indices:
render_entities = [render_entity for idx, render_entity in enumerate(render_entities_full) if idx in maintain_indices]
else:
render_entities = render_entities_full
# Mask entities based on dynamic conditions instead of hardcoding level-specific logic
if self.conf['General']['level_name'] in self.levels_that_require_masking:
@ -291,18 +297,18 @@ class Factory(gym.Env):
def filter_entities(self, entities):
""" Generalized method to filter out entities that shouldn't be rendered. """
if 'DirtPiles' in self.state.entities.keys():
entities = [entity for entity in entities if not (entity.name == 'DirtPiles' and entity.amount <= 0)]
return entities
if 'CoinPiles' in self.state.entities.keys():
all_entities = [item for sublist in [[e for e in entity] for entity in entities] for item in sublist]
return [idx for idx, entity in enumerate(all_entities) if not ('CoinPile' in entity.name and entity.amount <= 0)]
def mask_entities(self, entities):
""" Generalized method to mask entities based on dynamic conditions. """
for entity in entities:
if entity.name == 'CoinPiles':
# entity.name = 'Destinations'
# entity.value = 1
entity.mask = 'Destinations'
entity.mask_value = 1
entity.name = 'Destinations'
entity.value = 1
#entity.mask = 'Destinations'
#entity.mask_value = 1
return entities
def set_recorder(self, recorder):

Some files were not shown because too many files have changed in this diff.