flex.data package

Submodules

flex.data.dataset module

Copyright (C) 2024 Instituto Andaluz Interuniversitario en Ciencia de Datos e Inteligencia Computacional (DaSCI).

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

class flex.data.dataset.Dataset(X_data: LazyIndexable, y_data: LazyIndexable | None = None)[source]

Bases: object

Class used to represent the dataset from a node in a Federated Learning environment.

X_data

A numpy.array containing the data for the node.

Type:

LazyIndexable

y_data

A numpy.array containing the labels for the training data. Can be None if working on an unsupervised learning task. Default None.

Type:

LazyIndexable

X_data: LazyIndexable
classmethod from_array(X_array: list | ndarray, y_array: list | ndarray | None = None)[source]

Function that creates a Dataset from array-like objects, such as lists and numpy arrays.

Args:

X_array (Union[list, np.ndarray]): Array-like containing X_data.

y_array (Optional[Union[list, np.ndarray]]): Array-like containing the y_data. Default None.

Returns:

Dataset: a Dataset which encapsulates X_array and/or y_array.

classmethod from_huggingface_dataset(hf_dataset, X_columns: list | None = None, label_columns: list | None = None)[source]

Function to convert an arrow dataset from the Datasets package (HuggingFace datasets library) to a FlexDataObject.

Args:

hf_dataset (Union[datasets.arrow_dataset.Dataset, str]): a dataset from the datasets library. If a string is received, the dataset is loaded from the HuggingFace repository. In that case, the split must be specified in the string as follows: 'dataset;split'. For datasets that have multiple subsets for different tasks, the subset may also be given as follows: 'dataset;subset;split', so that the desired subset and split are downloaded.

X_columns (list): List containing the names of the features used for training the model.

label_columns (list): List containing the name or names of the label column.

Returns:

Dataset: a FlexDataObject which encapsulates the dataset.
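
The 'dataset;split' and 'dataset;subset;split' string forms described for hf_dataset can be illustrated with a small parser. This is only a sketch: HFSpec and parse_hf_spec are hypothetical names that are not part of flex, and the library's own parsing may differ.

```python
from typing import NamedTuple, Optional


class HFSpec(NamedTuple):
    """Hypothetical container for the parts of a dataset string."""

    name: str
    subset: Optional[str]
    split: str


def parse_hf_spec(spec: str) -> HFSpec:
    """Split 'dataset;split' or 'dataset;subset;split' into its parts."""
    parts = spec.split(";")
    if len(parts) == 2:
        name, split = parts
        return HFSpec(name, None, split)
    if len(parts) == 3:
        name, subset, split = parts
        return HFSpec(name, subset, split)
    raise ValueError(
        f"expected 'dataset;split' or 'dataset;subset;split', got {spec!r}"
    )


print(parse_hf_spec("imdb;train"))
print(parse_hf_spec("glue;cola;train"))
```

A string without any split, such as 'imdb', is rejected in this sketch because the description above requires the split to be present.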

classmethod from_tfds_image_dataset(tfds_dataset)[source]

Function to convert a dataset from tensorflow_datasets to a FlexDataObject.

Args:

tfds_dataset (tf.data.Dataset): a tf dataset.

Returns:

Dataset: a FlexDataObject which encapsulates the dataset.

classmethod from_tfds_text_dataset(tfds_dataset, X_columns: list | None = None, label_columns: list | None = None)[source]

Function to convert a dataset from tensorflow_datasets to a FlexDataObject.

Args:

tfds_dataset (tf.data.Dataset): a loaded tf dataset.

X_columns (list): List containing the features (input) of the model.

label_columns (list): List containing the targets of the model.

Returns:

Dataset: a FlexDataObject which encapsulates the dataset.

classmethod from_torchtext_dataset(pytorch_text_dataset)[source]

Function to convert an object from torchtext.datasets.* to a FlexDataObject.

It is mandatory that the dataset contains at least the following transform: torchtext.transforms.ToTensor()

Args:

pytorch_text_dataset (torchtext.datasets.*): a torchtext dataset

Returns:

Dataset: a FlexDataObject which encapsulates the dataset.

classmethod from_torchvision_dataset(pytorch_dataset)[source]

Function to convert an object from torchvision.datasets.* to a FlexDataObject.

Args:

pytorch_dataset (torchvision.datasets.*): a torchvision dataset.

Returns:

Dataset: a FlexDataObject which encapsulates the dataset.

to_list()[source]

Function to return the FlexDataObject as a list.

to_numpy(x_dtype=None, y_dtype=None)[source]

Function to return the FlexDataObject as numpy arrays.

to_tf_dataset()[source]

This function is a utility to transform a Dataset object into a tensorflow.data.Dataset object.

Returns:

tensorflow.data.Dataset: tf dataset object instantiated using the contents of a Dataset.

to_torchvision_dataset(**kwargs)[source]

This function transforms a Dataset into a Torchvision dataset object.

Returns:

torchvision.datasets.VisionDataset: a torchvision dataset with the contents of the Dataset. Note that transforms should be passed as arguments.

validate()[source]

Function that checks whether the object is correct or not.

y_data: LazyIndexable | None = None

flex.data.fed_data_distribution module

class flex.data.fed_data_distribution.FedDataDistribution(create_key: object | None = None)[source]

Bases: object

classmethod from_clustering_func(centralized_data: Dataset, clustering_func: Callable)[source]

This function federates data into nodes by means of a clustering function that outputs the node (cluster) to which each data point belongs.

Args:

centralized_data (Dataset): Centralized dataset represented as a FlexDataObject.

clustering_func (Callable): function that receives as arguments a pair of x and y elements from centralized_data and returns the name of the node (cluster) that should own it. The returned type must be Hashable. Note that we only support one node (cluster) per data point.

Returns:

federated_dataset (FedDataset): The federated dataset.
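
The contract of clustering_func (one (x, y) pair in, one hashable node name out) can be sketched without the library. The helper federate_by_clustering below is a hypothetical stand-in that works on plain lists and returns a plain dict; it is not how flex builds a FedDataset.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple


def federate_by_clustering(
    xs: List, ys: List, clustering_func: Callable[[object, object], Hashable]
) -> Dict[Hashable, Tuple[List, List]]:
    """Group (x, y) pairs by the node name returned by clustering_func."""
    nodes: Dict[Hashable, Tuple[List, List]] = defaultdict(lambda: ([], []))
    for x, y in zip(xs, ys):
        node_id = clustering_func(x, y)  # exactly one node per data point
        node_x, node_y = nodes[node_id]
        node_x.append(x)
        node_y.append(y)
    return dict(nodes)


def by_label_parity(x, y):
    """Example clustering function: one node per label parity."""
    return "even" if y % 2 == 0 else "odd"


nodes = federate_by_clustering([10, 11, 12, 13], [0, 1, 2, 3], by_label_parity)
print(nodes)
```

Because the returned name is used as a dictionary key, it has to be hashable, which matches the requirement stated above.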

classmethod from_config(centralized_data: Dataset, config: FedDatasetConfig)[source]

This function prepares the data from a centralized data structure to a federated one. It will run different modifications to federate the data.

Args:

centralized_data (Dataset): Centralized dataset represented as a FlexDataObject.

config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.

Returns:

federated_dataset (FedDataset): The federated dataset.

classmethod from_config_with_huggingface_dataset(data, config: FedDatasetConfig, X_columns: list, label_columns: list | None = None)[source]

This function federates a centralized HuggingFace dataset given a FlexDatasetConfig. This function will transform a dataset from the HuggingFace Hub datasets into a Dataset and then it will federate it.

Args:

data (Union[datasets.arrow_dataset.Dataset, str]): The HuggingFace dataset to federate.

config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.

X_columns (List[str]): List with the names of the columns to load.

label_columns (list): List with the names of the label columns.

classmethod from_config_with_tfds_image_dataset(data, config: FedDatasetConfig)[source]

This function federates a centralized tensorflow dataset given a FlexDatasetConfig. This function will transform a dataset from the tensorflow_datasets module into a Dataset and then it will federate it.

Args:

data (Dataset): The tensorflow dataset.

config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.

classmethod from_config_with_tfds_text_dataset(data, config: FedDatasetConfig, X_columns: list, label_columns: list)[source]

This function federates a centralized tensorflow dataset given a FlexDatasetConfig. This function will transform a dataset from the tensorflow_datasets module into a Dataset and then it will federate it.

Args:

data (Dataset): The tensorflow dataset.

config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.

X_columns (List): List that contains the column names for the input features.

label_columns (List): List that contains the column names for the output features.

classmethod from_config_with_torchtext_dataset(data, config: FedDatasetConfig)[source]

This function federates a centralized torchtext dataset given a FlexDatasetConfig. This function will transform the torchtext dataset into a Dataset and then it will federate it.

Args:

data (Dataset): The torchtext dataset.

config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.

classmethod from_config_with_torchvision_dataset(data, config: FedDatasetConfig)[source]

This function federates a centralized torchvision dataset given a FlexDatasetConfig. This function will transform a dataset from the torchvision module into a Dataset and then it will federate it.

Args:

data (Dataset): The torchvision dataset.

config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.

classmethod iid_distribution(centralized_data: Dataset, n_nodes: int = 2)[source]

Function to create a FedDataset for an IID experiment. We consider the simplest situation in which the data is distributed by giving the same amount of data to each node.

Args:

centralized_data (Dataset): Centralized dataset represented as a FlexDataObject.

n_nodes (int): Number of nodes in the Federated Learning experiment. Default 2.

Returns:

federated_dataset (FedDataset): The federated dataset.
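
The idea of giving the same amount of data to each node can be sketched on sample indexes alone. The function iid_split is hypothetical, and two details are assumptions made only for this illustration: the indexes are shuffled first, and any leftover samples are dropped.

```python
import random
from typing import Dict, List


def iid_split(n_samples: int, n_nodes: int = 2, seed: int = 0) -> Dict[int, List[int]]:
    """Shuffle sample indexes and deal the same number to every node.

    Assumption: leftover samples (n_samples % n_nodes) are simply dropped.
    """
    indexes = list(range(n_samples))
    random.Random(seed).shuffle(indexes)
    per_node = n_samples // n_nodes
    return {
        node: indexes[node * per_node:(node + 1) * per_node]
        for node in range(n_nodes)
    }


split = iid_split(10, n_nodes=3)
print({node: len(idx) for node, idx in split.items()})
```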

flex.data.fed_dataset module

class flex.data.fed_dataset.FedDataset(dict=None, /, **kwargs)[source]

Bases: UserDict

Class that represents a federated dataset for the Flex library. The dataset contains the ids of the nodes and the dataset associated with each node.

data

The structure is a dictionary with the node ids as keys and the dataset as value.

Type:

collections.UserDict

apply(func: Callable, node_ids: List[Hashable] | None = None, num_proc: int = 1, **kwargs)[source]

This function applies a custom function to the FlexDataset in parallel.

The **kwargs provided to this function are all the kwargs of the custom function provided by the node.

Args:

func (Callable, optional): Function to apply to preprocess the data.

node_ids (List[Hashable], optional): List containing the node ids where func will be applied. Each element of the list must be hashable and part of the FlexDataset. Defaults to None.

num_proc (int, optional): Number of processes to parallelize; negative values are ignored. Defaults to 1.

Returns:

FedDataset: The modified FlexDataset.

Raises:

ValueError: All node ids given must be in the FlexDataset.
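
The behaviour described for apply (run a function on selected nodes, forward **kwargs, and reject unknown node ids) can be sketched on a plain dictionary. The helper apply_to_nodes is hypothetical, runs sequentially rather than in parallel, and assumes that node_ids=None means every node.

```python
from typing import Callable, Dict, Hashable, List, Optional


def apply_to_nodes(
    fed_data: Dict[Hashable, list],
    func: Callable,
    node_ids: Optional[List[Hashable]] = None,
    **kwargs,
) -> Dict[Hashable, list]:
    """Return a copy of fed_data with func applied to the selected nodes."""
    if node_ids is None:
        node_ids = list(fed_data.keys())  # assumption: None selects every node
    missing = [node_id for node_id in node_ids if node_id not in fed_data]
    if missing:
        raise ValueError(f"node ids not in the dataset: {missing}")
    result = dict(fed_data)
    for node_id in node_ids:
        # The extra keyword arguments are forwarded to the custom function.
        result[node_id] = func(fed_data[node_id], **kwargs)
    return result


def scale(values, factor=1):
    """Example custom function: multiply every value by factor."""
    return [value * factor for value in values]


data = {"a": [1, 2], "b": [3, 4]}
print(apply_to_nodes(data, scale, node_ids=["a"], factor=10))
```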

get(k[, d])[source]

D[k] if k in D, else d. d defaults to None.

normalize(node_ids: List[Hashable] | None = None, num_proc: int = 0, *args, **kwargs)[source]

Function that normalizes the data over the nodes.

Args:

fld (FedDataset): FlexDataset containing all the data from the nodes.

node_ids (List[Hashable], optional): List containing the ids of the nodes whose data will be normalized. Each element of the list must be hashable. Defaults to None.

num_proc (int, optional): Number of processes to parallelize. Defaults to 0.

Returns:

FedDataset: The FlexDataset normalized.

one_hot_encoding(node_ids: List[Hashable] | None = None, num_proc: int = 0, *args, **kwargs)[source]

Function that applies one hot encoding to the node labels.

Args:

fld (FedDataset): FlexDataset containing all the data from the nodes.

node_ids (List[Hashable], optional): List containing the ids of the nodes whose labels will be one hot encoded. Each element of the list must be hashable. Defaults to None.

num_proc (int, optional): Number of processes to parallelize. Defaults to 0.

Returns:

FedDataset: The FlexDataset with one hot encoded labels.

flex.data.fed_dataset_config module

class flex.data.fed_dataset_config.FedDatasetConfig(seed: int | None = None, n_nodes: int = 2, shuffle: bool = False, node_ids: List[Hashable] | None = None, weights: ndarray[Any, dtype[_ScalarType_co]] | None = None, weights_per_label: ndarray[Any, dtype[_ScalarType_co]] | None = None, replacement: bool = False, labels_per_node: int | ndarray[Any, dtype[_ScalarType_co]] | Tuple[int] | None = None, features_per_node: int | ndarray[Any, dtype[_ScalarType_co]] | Tuple[int] | None = None, indexes_per_node: ndarray[Any, dtype[_ScalarType_co]] | None = None, group_by_label_index: int | None = None, keep_labels: List[bool] | None = None)[source]

Bases: object

Class used to represent a configuration to federate a centralized dataset. The following table shows the compatibility of each option:

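Options compatibility

Y: the two options are compatible. N: they are not. Each pair is listed once, in the row of the option that comes first; the remaining cells are marked with "-".

option                n_nodes  node_ids  weights  weights_per_label  replacement  labels_per_node  features_per_node  indexes_per_node  group_by_label_index  keep_labels  shuffle
n_nodes               -        Y         Y        Y                  Y            Y                Y                  N                 N                     Y            Y
node_ids              -        -         Y        Y                  Y            Y                Y                  Y                 N                     Y            Y
weights               -        -         -        N                  Y            Y                Y                  N                 N                     Y            Y
weights_per_label     -        -         -        -                  Y            N                N                  N                 N                     Y            Y
replacement           -        -         -        -                  -            Y                N                  N                 N                     Y            Y
labels_per_node       -        -         -        -                  -            -                N                  N                 N                     Y            Y
features_per_node     -        -         -        -                  -            -                -                  N                 N                     Y            Y
indexes_per_node      -        -         -        -                  -            -                -                  -                 N                     Y            Y
group_by_label_index  -        -         -        -                  -            -                -                  -                 -                     N            Y
keep_labels           -        -         -        -                  -            -                -                  -                 -                     -            Y
shuffle               -        -         -        -                  -            -                -                  -                 -                     -            -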
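
The compatibility table can be turned into a small lookup to check a planned combination of options before building a config. This is a sketch only: the pairs below are the cells marked N in the table, transcribed by hand, find_conflicts is a hypothetical helper, and whether validate() enforces exactly these rules is not stated in this reference.

```python
from itertools import combinations

# Pairs marked "N" in the options compatibility table.
INCOMPATIBLE = {
    frozenset(pair)
    for pair in [
        ("n_nodes", "indexes_per_node"),
        ("n_nodes", "group_by_label_index"),
        ("node_ids", "group_by_label_index"),
        ("weights", "weights_per_label"),
        ("weights", "indexes_per_node"),
        ("weights", "group_by_label_index"),
        ("weights_per_label", "labels_per_node"),
        ("weights_per_label", "features_per_node"),
        ("weights_per_label", "indexes_per_node"),
        ("weights_per_label", "group_by_label_index"),
        ("replacement", "features_per_node"),
        ("replacement", "indexes_per_node"),
        ("replacement", "group_by_label_index"),
        ("labels_per_node", "features_per_node"),
        ("labels_per_node", "indexes_per_node"),
        ("labels_per_node", "group_by_label_index"),
        ("features_per_node", "indexes_per_node"),
        ("features_per_node", "group_by_label_index"),
        ("indexes_per_node", "group_by_label_index"),
        ("group_by_label_index", "keep_labels"),
    ]
}


def find_conflicts(options_in_use):
    """Return the incompatible pairs among the option names in use."""
    return sorted(
        tuple(sorted(pair))
        for pair in map(frozenset, combinations(sorted(options_in_use), 2))
        if pair in INCOMPATIBLE
    )


print(find_conflicts({"n_nodes", "weights", "shuffle"}))
print(find_conflicts({"weights", "weights_per_label", "indexes_per_node"}))
```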

seed

Seed used to make the federated dataset generated reproducible with this configuration. Default None.

Type:

Optional[int]

n_nodes

Number of nodes among which to split a centralized dataset. Default 2.

Type:

int

shuffle

If True data is shuffled before being sampled. Default False.

Type:

bool

node_ids

Ids to identify each node. If not provided, nodes will be indexed using integers. If n_nodes is also given, we consider up to n_nodes elements. Default None.

Type:

Optional[List[Hashable]]

weights

A numpy.array which provides the proportion of data to give to each node. Default None.

Type:

Optional[npt.NDArray]

weights_per_label

A numpy.array which provides the proportion of data to give to each node and class of the dataset to federate. We expect a bidimensional array of shape (n, m) where “n” is the number of nodes and “m” is the number of labels of the dataset to federate. Default None.

Type:

Optional[npt.NDArray]
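
The (n, m) shape expected for weights_per_label can be made concrete with a small example. The helper samples_per_node_and_label is hypothetical, and how flex rounds or samples from these proportions is an assumption here (the sketch multiplies each proportion by the per-label sample count and rounds down).

```python
from typing import List


def samples_per_node_and_label(
    weights_per_label: List[List[float]], label_counts: List[int]
) -> List[List[int]]:
    """Turn an (n nodes, m labels) proportion matrix into sample counts."""
    n_labels = len(label_counts)
    for row in weights_per_label:
        if len(row) != n_labels:
            raise ValueError("every row needs one weight per label")
    # Assumption: proportions are applied per label and rounded down.
    return [
        [int(weight * count) for weight, count in zip(row, label_counts)]
        for row in weights_per_label
    ]


# 2 nodes (rows) and 3 labels (columns).
weights = [
    [0.75, 0.5, 0.25],
    [0.25, 0.5, 0.75],
]
print(samples_per_node_and_label(weights, label_counts=[100, 40, 60]))
```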

replacement

Whether the sampling procedure used to split a centralized dataset is with replacement or not. Default False.

Type:

bool

labels_per_node

Labels to assign to each node. If provided as an int, it is the number of labels per node. If provided as a tuple of ints, it establishes a minimum and a maximum number of labels per node; a random number sampled from that interval decides the number of labels of each node. If provided as a list of lists, it establishes the labels assigned to each node. Default None.

Type:

Optional[Union[int, npt.NDArray, Tuple[int]]]
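
The three accepted forms of labels_per_node can be sketched as follows. The function labels_for_nodes is hypothetical; choosing the labels at random and treating the (minimum, maximum) interval as inclusive are assumptions made for illustration, not documented behaviour.

```python
import random
from typing import List, Sequence, Tuple, Union


def labels_for_nodes(
    labels_per_node: Union[int, Tuple[int, int], Sequence[Sequence[int]]],
    all_labels: List[int],
    n_nodes: int,
    seed: int = 0,
) -> List[List[int]]:
    """Resolve labels_per_node into an explicit list of labels per node."""
    rng = random.Random(seed)
    if isinstance(labels_per_node, int):
        # An int: the same number of labels for every node.
        return [
            sorted(rng.sample(all_labels, labels_per_node))
            for _ in range(n_nodes)
        ]
    if isinstance(labels_per_node, tuple):
        # A tuple (minimum, maximum): draw the number of labels per node.
        low, high = labels_per_node
        return [
            sorted(rng.sample(all_labels, rng.randint(low, high)))
            for _ in range(n_nodes)
        ]
    # A list of lists: the labels of each node are given explicitly.
    return [list(node_labels) for node_labels in labels_per_node]


labels = [0, 1, 2, 3]
print(labels_for_nodes(2, labels, n_nodes=3))
print(labels_for_nodes((1, 3), labels, n_nodes=3))
print(labels_for_nodes([[0, 1], [2], [3]], labels, n_nodes=3))
```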

features_per_node

Features to assign to each node. It shares the same interface as labels_per_node. Default None.

Type:

Optional[Union[int, npt.NDArray, Tuple[int]]]

indexes_per_node

Data indexes to assign to each node. Default None.

Type:

Optional[npt.NDArray]

group_by_label_index

Index of the feature whose unique values will be used to generate federated nodes. Default None.

Type:

Optional[int]

keep_labels

Whether or not each node keeps its labels (y_data).

Type:

Optional[list[bool]]

features_per_node: int | ndarray[Any, dtype[_ScalarType_co]] | Tuple[int] | None = None
group_by_label_index: int | None = None
indexes_per_node: ndarray[Any, dtype[_ScalarType_co]] | None = None
keep_labels: List[bool] | None = None
labels_per_node: int | ndarray[Any, dtype[_ScalarType_co]] | Tuple[int] | None = None
n_nodes: int = 2
node_ids: List[Hashable] | None = None
replacement: bool = False
seed: int | None = None
shuffle: bool = False
validate()[source]

This function checks whether the configuration to federate a dataset is correct.

weights: ndarray[Any, dtype[_ScalarType_co]] | None = None
weights_per_label: ndarray[Any, dtype[_ScalarType_co]] | None = None
exception flex.data.fed_dataset_config.InvalidConfig[source]

Bases: ValueError

Raised when the input config is wrong

flex.data.pluggable_datasets module

class flex.data.pluggable_datasets.PluggableDataset(cls, bases, classdict, **kwds)[source]

Bases: EnumMeta

class flex.data.pluggable_datasets.PluggableDatasetString(cls, bases, classdict, **kwds)[source]

Bases: EnumMeta

class flex.data.pluggable_datasets.PluggableHuggingFace(value)[source]

Bases: Enum

Class containing some datasets that can be loaded into FLEXible. Other datasets, e.g. glue-cola, can be plugged in, but they require a special configuration. This is a matter of passing the correct arguments to the load_dataset function from HuggingFace datasets rather than a limitation of our platform, so other datasets can also be used easily.

We show some example datasets that can be loaded using the function FedDataDistribution.from_config_with_huggingface_dataset by just giving a config and the string associated with each dataset in the Enum defined here.

We selected these datasets because the process of loading them can be automated, but our framework supports almost all datasets, as they can be loaded as numpy arrays. We only list these datasets because they can be loaded as follows: dataset = load_dataset(name, split='train').

Some datasets need extra parameters, such as the version of the dataset, or do not have any split. In those cases the user must load the dataset beforehand and then pass it to FLEXible, which is quick and easy, as the user just needs to select X_train and y_train as np.arrays.

Args:

Enum (enum): Tuple containing name, X_columns and y_columns to use in the load_dataset function.

AG_NEWS_HF = ('ag_news', 'text', 'label')
AMAZON_POLARITY_HF = ('amazon_polarity', ['title', 'content'], 'label')
APPREVIEWS_HF = ('app_reviews', 'review', 'star')
FINANCIAL_PHRASEBANK_HF = ('financial_phrasebank', 'sentence', 'label')
GLUE_COLA_HF = ('glue', 'sentence', 'label')
IMDB_HF = ('imdb', 'text', 'label')
ROTTEN_TOMATOES_HF = ('rotten_tomatoes', 'text', 'label')
SQUAD_HF = ('squad', ['context', 'question'], 'answers')
TWEET_EVAL_EMOJI_HF = ('tweet_eval', 'text', 'label')
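
Each member above stores a (name, X_columns, y_columns) tuple, where a column entry is either a single string or a list of strings. The sketch below copies two members into a hypothetical ExampleHF enum so that it runs without flex, and the unpack helper is likewise an illustration rather than library code.

```python
from enum import Enum


class ExampleHF(Enum):
    """Hypothetical stand-in that copies two of the members listed above."""

    IMDB_HF = ("imdb", "text", "label")
    AMAZON_POLARITY_HF = ("amazon_polarity", ["title", "content"], "label")


def unpack(member: Enum):
    """Return (name, X_columns, label_columns), always with list columns."""

    def as_list(columns):
        return columns if isinstance(columns, list) else [columns]

    name, x_columns, label_columns = member.value
    return name, as_list(x_columns), as_list(label_columns)


print(unpack(ExampleHF.IMDB_HF))
print(unpack(ExampleHF.AMAZON_POLARITY_HF))
```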
class flex.data.pluggable_datasets.PluggableTorchtext(value)[source]

Bases: Enum

Class containing all the datasets that can be plugged into a Dataset without any preprocessing.

Any other dataset from the TorchText library will need further preprocessing.

Args:

Enum (enum): torchtext class for each dataset that can be accepted on our platform.

members()[source]
class flex.data.pluggable_datasets.PluggableTorchvision(value)[source]

Bases: Enum

Class containing all the datasets that can be plugged into a Dataset without any preprocessing.

Any other dataset from the Torchvision library will need further preprocessing.

Args:

Enum (enum): torchvision class for each dataset that can be accepted on our platform.

members()[source]

flex.data.preprocessing_utils module

flex.data.preprocessing_utils.normalize(node_dataset, *args, **kwargs)[source]

Function that normalizes federated data.

Args:

node_dataset (Dataset): node_dataset whose data will be normalized.

Returns:

Dataset: Returns the node_dataset with the X_data property normalized.

flex.data.preprocessing_utils.one_hot_encoding(node_dataset, *args, **kwargs)[source]

Function that applies one hot encoding to the labels of a node_dataset.

Args:

node_dataset (Dataset): node_dataset whose labels will be one hot encoded.

Raises:

ValueError: Raises value error if n_labels is not given in the kwargs argument.

Returns:

Dataset: Returns the node_dataset with the y_data property updated.
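
The requirement that n_labels be passed through the keyword arguments can be illustrated with a pure-Python one hot encoder. The function one_hot is a hypothetical sketch that works on a plain list of integer labels; it does not reproduce the library's implementation.

```python
from typing import List, Sequence


def one_hot(labels: Sequence[int], **kwargs) -> List[List[int]]:
    """One hot encode integer labels; n_labels must be passed as a keyword."""
    if "n_labels" not in kwargs:
        raise ValueError("n_labels is required to one hot encode the labels")
    n_labels = kwargs["n_labels"]
    encoded = []
    for label in labels:
        row = [0] * n_labels
        row[label] = 1  # mark the position of this label
        encoded.append(row)
    return encoded


print(one_hot([0, 2, 1], n_labels=3))
```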

Module contents
