Welcome to the third post in our FLEET tutorial series! 🚀 Today, we’ll focus on managing datasets and partitioning data for federated learning experiments. Datasets are a critical component of FL experiments, and FLEET's integration with Flower Datasets and HuggingFace Datasets makes this process seamless.
We'll explore:
- How to choose a dataset from the HuggingFace Hub
- How FLEET prepares and partitions datasets
- Partitioning strategies (IID and non-IID)
- Server evaluation and test splits
- The dataset configuration file and its fields
Federated learning experiments often require datasets to be partitioned across multiple clients. FLEET simplifies this process through Hydra-based configurations, while leveraging partitioners from Flower. You can also adapt the dataset to support server evaluation (if applicable) and modify partitioning strategies to simulate realistic FL scenarios.
To choose a dataset, go to the Hugging Face Datasets Hub and search for it by name. You will pass that name to the dataset parameter of FederatedDataset. Note that the name is case-sensitive.
Once a dataset is available on the HuggingFace Hub, it can be used in Flower Datasets immediately (no approval from the Flower team is needed, and no custom code is required).
Here is how it looks for the CIFAR-10 dataset.
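A minimal sketch, assuming 10 clients and the flwr-datasets package (pip install flwr-datasets):
from flwr_datasets import FederatedDataset

# Download CIFAR-10 from the HuggingFace Hub and split its "train"
# split into 10 IID partitions (one per client).
fds = FederatedDataset(dataset="cifar10", partitioners={"train": 10})

# Fetch the partition that belongs to client 0.
partition = fds.load_partition(0)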
In FLEET, the dataset preparation process is automatically handled in the main.py script. This centralizes the setup and ensures that the dataset is ready before the experiment begins.
The preparation process involves:
- Downloading the dataset (e.g., from the HuggingFace Hub) into the configured data directory
- Partitioning the data across clients according to the configured strategy
- Creating a test split for server evaluation, if one is needed
Here’s how the preparation process is triggered in the script:
from common.dataset_utils import prepare_datasets

# cfg is the Hydra configuration object; cfg.dataset holds the dataset settings.
prepare_datasets(cfg.dataset)
This function reads the dataset configuration (cfg.dataset) and applies all relevant parameters (like partitioning strategy and test split size). There's no need to create the dataset configuration object manually; Hydra takes care of it.
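For context, here is a hypothetical sketch of how main.py could wire this together with Hydra (the config_path and config_name arguments are assumptions, not FLEET's actual values):
import hydra
from omegaconf import DictConfig

from common.dataset_utils import prepare_datasets

@hydra.main(config_path="static/config", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Hydra composes cfg from the YAML files under static/config.
    prepare_datasets(cfg.dataset)
    # ... the rest of the experiment (clients, server, training) starts here ...

if __name__ == "__main__":
    main()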
FLEET integrates multiple partitioning strategies provided by Flower. By default, the partitioning strategy is IID, but you can also configure non-IID partitioning, such as Dirichlet or shard-based strategies. Each partitioner has its own configurable parameters, which must be specified in the partitioner_kwargs field.
Here are some examples of partitioners supported by Flower:
- IidPartitioner: Distributes data equally and randomly across clients.
- DirichletPartitioner: Distributes data based on a Dirichlet distribution, controlled by an alpha parameter.
- ShardPartitioner: Divides data into shards and assigns a specific number of shards to each client.
Note: Each partitioner has its own set of parameters. To understand these parameters and how to use them, and for the full set of available partitioners, refer to the Flower Partitioners Documentation.
To use the DirichletPartitioner, you can update the dataset configuration in static/config/dataset/default.yaml:
partitioner_cls_name: "DirichletPartitioner"
partitioner_kwargs:
  partition_by: "label" # Column holding the class labels (required by Flower's DirichletPartitioner)
  alpha: 0.5 # Controls the degree of non-IID-ness (lower values = higher non-IID-ness)
This will distribute data unevenly across clients, simulating real-world heterogeneity in federated learning.
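Under the hood, this configuration corresponds to instantiating Flower's partitioner class with the given keyword arguments. A rough sketch of that resolution (not FLEET's actual code; the num_partitions value is an assumption) could look like this:
import flwr_datasets.partitioner as fdp

# Resolve the class named in partitioner_cls_name...
partitioner_cls = getattr(fdp, "DirichletPartitioner")

# ...and construct it from partitioner_kwargs plus the number of clients.
partitioner = partitioner_cls(num_partitions=10, partition_by="label", alpha=0.5)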
Server evaluation allows the FL server to evaluate the global model on a test dataset. This feature is controlled by the server_eval parameter in the server configuration (static/config/fl_server/default.yaml):
server_eval: true
When server_eval is enabled, the dataset's test split is loaded on the server and used to evaluate the global model during the experiment (this behavior is controlled entirely by server_eval).
Some datasets (e.g., custom datasets or specific HuggingFace datasets) may not have a predefined test split. In such cases, if server_eval is enabled, FLEET will log a warning and skip server evaluation.
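In Flower, server-side evaluation is typically implemented by passing an evaluate_fn to the strategy. The sketch below illustrates that general pattern, not FLEET's actual code; build_model, set_weights, and evaluate_model are hypothetical helpers:
from typing import Dict, Optional, Tuple

import flwr as fl

def get_evaluate_fn(test_set):
    """Return a function the server calls to evaluate the global model."""
    def evaluate(server_round: int, parameters, config) -> Optional[Tuple[float, Dict]]:
        model = build_model()                             # hypothetical helper
        set_weights(model, parameters)                    # hypothetical helper
        loss, accuracy = evaluate_model(model, test_set)  # hypothetical helper
        return loss, {"accuracy": accuracy}
    return evaluate

# Plug it into a strategy (test_set is the prepared test split):
# strategy = fl.server.strategy.FedAvg(evaluate_fn=get_evaluate_fn(test_set))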
The dataset configuration (static/config/dataset/default.yaml) includes the following fields:
path: "static/data" # Directory where datasets are stored
name: "cifar10" # Dataset name (e.g., cifar10, imdb)
partitioner_cls_name: "IidPartitioner" # Type of partitioning
partitioner_kwargs: {} # Parameters for the partitioner
force_create: false # Recreate dataset if true
test_size: 0.2 # Proportion of test data
server_eval: true # Enable server evaluation
train_split_key: "train" # Key for training data split
test_split_key: "test" # Key for test data split
- path: Directory where datasets are saved.
- name: Name of the dataset to load. Supported datasets include Flower datasets and HuggingFace datasets.
- partitioner_cls_name: The partitioning strategy (e.g., IidPartitioner, DirichletPartitioner).
- partitioner_kwargs: Additional parameters for the partitioner.
- force_create: If true, the dataset will be recreated even if it already exists.
- test_size: Proportion of data to reserve for the test split (only applies if no test split is provided).
- train_split_key / test_split_key: Keys used to access the training and test data splits.
FLEET supports datasets from HuggingFace, but it's important to ensure that you use the correct keys for the train and test splits. These keys vary between datasets and can be found on the HuggingFace dataset page.
For example, the IMDB dataset has the following splits:
- Train: "train"
- Test: "test"
Make sure to update the dataset configuration accordingly:
train_split_key: "train"
test_split_key: "test"
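If a dataset uses different split names (for example, many HuggingFace datasets ship a "validation" split instead of a dedicated "test" split), point the keys at the right splits; this snippet is illustrative:
train_split_key: "train"
test_split_key: "validation"
Since FLEET's configuration is Hydra-based, you can also override these fields from the command line instead of editing the YAML, e.g. python main.py dataset.test_split_key=validation (assuming main.py is the entry point, as sketched earlier).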
To recap, in this post we saw that:
- Dataset preparation is handled automatically by the main.py script.
- Partitioning strategies are configured through the dataset configuration file.
- Server evaluation is controlled by the server_eval parameter in the server configuration.
In the next post, we'll dive into network emulation and topology: how FLEET simulates realistic network conditions for FL experiments.
Stay tuned! 🎉
Previous: Building Your First Experiment with Hydra