
Building Your First Experiment with Hydra


Overview

In this blog post, we’ll guide you through creating your first custom experiment with FLEET and Hydra, a powerful configuration management framework. FLEET uses Hydra to organize experimental parameters into Python dataclasses and YAML configuration files, which keeps a clean separation between default values (in the code) and user-defined customizations (in YAML).

By the end of this post, you'll know:

  1. How configurations are structured in FLEET using Python dataclasses.
  2. How to override parameters using YAML files and the command line.
  3. How to run an experiment with a different dataset and topology.

Hydra's Role in FLEET

Hydra is a Python library for managing configurations. In FLEET, configurations are defined as Python dataclasses and grouped into logical sections such as dataset, fl_client, fl_server, and net. These dataclasses act as the single source of truth for all default values.

Key Features of Hydra in FLEET:

  • Python Dataclasses as Defaults: All default configurations are defined in Python code.
  • Hierarchical Configuration: YAML files are used to override specific fields in the dataclasses.
  • CLI Overrides: Any parameter can be overridden directly via the command line.
  • Experiment Logging: Automatically saves resolved configurations to disk for reproducibility.
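
To make the composition concrete, here is a minimal sketch of what the top-level main.yaml might look like. The group names mirror the dataclasses above and the static/config paths used later in this post, but treat the exact layout as an assumption about your checkout rather than a guarantee:

# Sketch of main.yaml (assumed layout): Hydra composes one YAML file per config group.
defaults:
  - dataset: default      # static/config/dataset/default.yaml
  - fl_client: default    # assumed: static/config/fl_client/default.yaml
  - fl_server: default    # static/config/fl_server/default.yaml
  - net: default          # assumed: static/config/net/default.yaml
  - _self_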

Step 1: Exploring Configuration Dataclasses

Each configuration is defined in Python as a dataclass. Below are the main configuration groups and their default values:

Dataset Configuration (DatasetConfig)

Defined in common/dataset_utils.py, this configuration controls dataset-related parameters.

from dataclasses import dataclass, field

@dataclass
class DatasetConfig:
    path: str = "static/data"                  # Directory to store datasets
    name: str = "cifar10"                     # Dataset name (e.g., "cifar10", "imdb")
    partitioner_cls_name: str = "IidPartitioner"  # Partitioner class name
    partitioner_kwargs: dict = field(default_factory=dict)  # Partitioner arguments
    force_create: bool = False                # Recreate dataset if already exists
    test_size: float = 0.2                    # Fraction of test data
    server_eval: bool = True                  # Enable server-side evaluation
    train_split_key: str = "train"            # Key for the training split
    test_split_key: str = "test"              # Key for the test split

FL Client Configuration (ClientConfig)

Defined in flcode_pytorch/utils/configs.py, this configuration controls federated learning (FL) client settings.

from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ClientConfig:
    log_to_stream: bool = True                # Enable console logging
    logging_level: str = "INFO"               # Logging level: DEBUG, INFO, etc.
    train_batch_size: int = 32                # Training batch size
    val_batch_size: int = 128                 # Validation batch size
    local_epochs: int = 1                     # Number of local training epochs
    learning_rate: float = 0.001              # Learning rate for optimization
    log_interval: int = 100                   # Steps between logging metrics
    collect_metrics: bool = False            # Enable client-side metric collection
    collect_metrics_interval: int = 5         # Metric collection interval (seconds)
    server_address: str = "tcp://localhost:5555"  # FL server address
    zmq: Dict[str, Any] = field(default_factory=lambda: {
        "enable": False,                      # Enable ZMQ communication
        "host": "localhost",
        "port": 5555
    })
    extra: Dict[str, Any] = field(default_factory=dict)  # Additional client parameters
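
As with the dataset group, a client-side YAML file overrides only the fields that should differ from the dataclass defaults. A sketch, assuming the group lives under static/config/fl_client/:

# static/config/fl_client/default.yaml (illustrative sketch)
train_batch_size: 64
local_epochs: 3
learning_rate: 0.01
collect_metrics: true
collect_metrics_interval: 10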

FL Server Configuration (ServerConfig)

Defined in flcode_pytorch/utils/configs.py, this configuration governs the FL server's behavior.

@dataclass
class ServerConfig:
    log_to_stream: bool = True                # Enable console logging
    logging_level: str = "INFO"               # Logging level: DEBUG, INFO, etc.
    strategy: str = "FedAvg"                  # FL strategy (default: FedAvg)
    min_fit_clients: int = 1                  # Minimum clients for training
    min_evaluate_clients: int = 1             # Minimum clients for evaluation
    min_available_clients: int = 1           # Minimum available clients
    num_rounds: int = 1                       # Number of FL rounds
    fraction_fit: float = 1.0                 # Fraction of clients to fit
    fraction_evaluate: float = 1.0            # Fraction of clients to evaluate
    server_eval: bool = False                 # Enable server-side evaluation
    val_batch_size: int = 128                 # Validation batch size
    server_param_init: bool = True            # Initialize server parameters
    stop_by_accuracy: bool = False            # Stop training by accuracy
    accuracy_level: float = 0.8               # Accuracy level to stop training
    collect_metrics: bool = False            # Enable server-side metric collection
    collect_metrics_interval: int = 60        # Metric collection interval (seconds)
    zmq: Dict[str, Any] = field(default_factory=lambda: {
        "enable": False,                      # Enable ZMQ communication
        "host": "localhost",
        "port": 5555
    })
    extra: Dict[str, Any] = field(default_factory=dict)  # Additional server parameters
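
The same pattern applies on the server side. For example, a run with more rounds that stops early once a target accuracy is reached could be sketched as follows, using only fields defined above:

# static/config/fl_server/default.yaml (illustrative sketch)
num_rounds: 30
min_fit_clients: 5
min_available_clients: 5
stop_by_accuracy: true
accuracy_level: 0.75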

Network Configuration (NetConfig)

Defined in containernet_code/config.py, this configuration handles the network topology and background traffic.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class NetConfig:
    topology: TopologyConfig = field(default_factory=TopologyConfig)  # Topology settings
    fl: FLClientConfig = field(default_factory=FLClientConfig)        # FL client settings
    bg: BGConfig = field(default_factory=BGConfig)                   # Background traffic settings
    sdn: SDNConfig = field(default_factory=SDNConfig)                # SDN settings

Topology Configuration (TopologyConfig)

@dataclass
class TopologyConfig:
    source: str = "topohub"                    # Topology source: topohub/custom
    topohub_id: Optional[str] = None           # Topohub ID for predefined topologies
    custom_topology: Dict = field(default_factory=lambda: {
        "path": "", "class_name": ""
    })                                        # Path and class name of a custom topology
    link_util_key: str = "deg"                 # Key for link utilization (degree)
    link_config: Dict = field(default_factory=dict)  # Link-specific configurations
    switch_config: Dict = field(default_factory=lambda: {
        "failMode": "standalone", "stp": True
    })                                        # Switch-specific configurations
    extra: Dict = field(default_factory=dict)  # Additional topology parameters
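
If none of the predefined Topohub topologies fit, the custom_topology entry points FLEET at your own topology class. The path and class name below are placeholders for illustration only:

# Topology section of the net configuration (illustrative sketch)
topology:
  source: "custom"
  custom_topology:
    path: "static/topologies/my_topology.py"   # hypothetical module path
    class_name: "MyTopology"                   # hypothetical class name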

Background Traffic Configuration (BGConfig)

@dataclass
class BGConfig:
    enabled: bool = False                      # Enable background traffic
    image: str = "bg-traffic:latest"           # Docker image for background traffic
    network: str = "10.1.0.0/16"               # IP range for background traffic
    clients_limits: Dict = field(default_factory=lambda: {
        "cpu": 0.5, "mem": 256
    })                                        # CPU/memory limits for clients
    generator_config: Dict = field(default_factory=lambda: {
        "name": "iperf"                        # Traffic generator (e.g., iperf)
    })
    pattern_config: Dict = field(default_factory=lambda: {
        "name": "poisson",                     # Traffic pattern (e.g., Poisson)
        "parallel_streams": 1,
        "max_rate": 100.0, "min_rate": 1.0
    })
    extra: Dict = field(default_factory=dict)  # Additional background traffic settings
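
Putting the nested groups together, a complete net configuration might look like the sketch below. The clients_number field under fl is inferred from the net.fl.clients_number override used later in this post; treat the exact layout as an assumption:

# static/config/net/default.yaml (illustrative sketch)
topology:
  source: "topohub"
  topohub_id: "ibm/10/0"      # predefined Topohub topology
fl:
  clients_number: 10          # assumed field of FLClientConfig
bg:
  enabled: true
  pattern_config:
    name: "poisson"
    parallel_streams: 2
    max_rate: 50.0
    min_rate: 5.0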

Step 2: Running the Default Experiment

Before customizing, let’s run the default experiment provided in main.yaml. This uses:

  • The CIFAR-10 dataset.
  • 10 clients on a Topohub topology.
  • A simple FL strategy (FedAvg) with 30 rounds.

Run the following command:

sudo .venv/bin/python main.py

What Happens?

  1. FLEET loads the default Python dataclasses and merges in the YAML configurations.
  2. It initializes the network topology with the specified number of clients.
  3. It partitions the CIFAR-10 dataset across the FL clients.
  4. It starts the interactive Containernet CLI.
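
Because Hydra saves the fully resolved configuration for every run (the "Experiment Logging" feature mentioned earlier), you can always check afterwards which values were actually used. An excerpt of such a resolved config might look like this; the exact output location depends on your Hydra run directory settings:

# Excerpt of a resolved configuration saved to the run's output directory (sketch)
dataset:
  name: cifar10
  test_size: 0.2
fl_server:
  strategy: FedAvg
  num_rounds: 30
net:
  topology:
    source: topohub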

Step 3: Overriding Parameters in YAML

Hydra allows you to override any parameter in the Python dataclasses by modifying the YAML files. For example:

Changing the Dataset

To use the IMDB sentiment classification dataset, update static/config/dataset/default.yaml:

name: "imdb"

Modifying FL Parameters

To increase the number of training rounds from 30 to 50, edit static/config/fl_server/default.yaml:

num_rounds: 50

Step 4: Overriding Parameters from the CLI

Hydra makes it easy to override any parameter directly from the command line without editing YAML files. For example:

Changing the Dataset

To use the IMDB sentiment classification dataset instead of CIFAR-10:

sudo .venv/bin/python main.py dataset.name=imdb

Modifying FL Parameters

To increase the number of training rounds from 30 to 50 and add more clients:

sudo .venv/bin/python main.py fl_server.num_rounds=50 net.fl.clients_number=20

Changing the Network Topology

To use a different Topohub topology:

sudo .venv/bin/python main.py net.topology.topohub_id=ibm/10/0

Key Takeaways

  • Hydra allows you to manage configurations using Python dataclasses and YAML files.
  • You can override any parameter using either YAML files or CLI commands.
  • Adding custom YAML files makes it easy to define new configurations.
  • FLEET is highly extensible, allowing you to integrate custom models and strategies.

Next Steps

In the next post, we’ll dive deeper into dataset management, including how to partition data for IID and non-IID experiments. Stay tuned!