flex.data package
Submodules
flex.data.dataset module
Copyright (C) 2024 Instituto Andaluz Interuniversitario en Ciencia de Datos e Inteligencia Computacional (DaSCI).
This program is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.
- class flex.data.dataset.Dataset(X_data: LazyIndexable, y_data: LazyIndexable | None = None)[source]
Bases:
object
Class used to represent the dataset from a node in a Federated Learning environment.
- X_data
A numpy.array containing the data for the node.
- Type:
LazyIndexable
- y_data
A numpy.array containing the labels for the training data. Can be None if working on an unsupervised learning task. Default None.
- Type:
LazyIndexable
- X_data: LazyIndexable
- classmethod from_array(X_array: list | ndarray, y_array: list | ndarray | None = None)[source]
Function that creates a Dataset from array-like objects (lists or numpy arrays).
Args:
X_array (Union[list, np.ndarray]): Array-like containing X_data.
y_array (Optional[Union[list, np.ndarray]]): Array-like containing the y_data. Default None.
Returns:
Dataset: a Dataset which encapsulates X_array and/or y_array.
- classmethod from_huggingface_dataset(hf_dataset, X_columns: list | None = None, label_columns: list | None = None)[source]
Function to convert an arrow dataset from the Datasets package (HuggingFace datasets library) to a FlexDataObject.
Args:
hf_dataset (Union[datasets.arrow_dataset.Dataset, str]): a dataset from the datasets library. If a string is received, the dataset is loaded from the HuggingFace repository. In that case the split has to be specified in the string as follows: 'dataset;split'. For datasets that have multiple subsets for different tasks, the subset may also be given as follows: 'dataset;subset;split', so that the desired subset and split are downloaded.
X_columns (list): List containing the names of the features used for training the model.
label_columns (list): List containing the name or names of the label column.
Returns:
Dataset: a FlexDataObject which encapsulates the dataset.
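The string convention described above can be illustrated with a small parser. This is only a sketch of the documented 'dataset;split' and 'dataset;subset;split' formats; the names HFDatasetSpec and parse_hf_dataset_string are hypothetical and are not part of FLEXible, and the library's own parsing may differ.

```python
from typing import NamedTuple, Optional


class HFDatasetSpec(NamedTuple):
    """Hypothetical container for the parts of the dataset string."""
    name: str
    subset: Optional[str]
    split: str


def parse_hf_dataset_string(spec: str) -> HFDatasetSpec:
    """Split 'dataset;split' or 'dataset;subset;split' into its parts."""
    parts = [part.strip() for part in spec.split(";")]
    if len(parts) == 2:
        name, split = parts
        return HFDatasetSpec(name, None, split)
    if len(parts) == 3:
        name, subset, split = parts
        return HFDatasetSpec(name, subset, split)
    raise ValueError(
        f"Expected 'dataset;split' or 'dataset;subset;split', got {spec!r}"
    )


print(parse_hf_dataset_string("imdb;train"))
print(parse_hf_dataset_string("glue;cola;train"))
```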
- classmethod from_tfds_image_dataset(tfds_dataset)[source]
Function to convert a dataset from tensorflow_datasets to a FlexDataObject.
Args:
tfds_dataset (tf.data.Dataset): a tf dataset
Returns:
Dataset: a FlexDataObject which encapsulates the dataset.
- classmethod from_tfds_text_dataset(tfds_dataset, X_columns: list | None = None, label_columns: list | None = None)[source]
Function to convert a dataset from tensorflow_datasets to a FlexDataObject.
Args:
tfds_dataset (tf.data.Dataset): a loaded tf dataset.
X_columns (list): List containing the features (input) of the model.
label_columns (list): List containing the targets of the model.
Returns:
Dataset: a FlexDataObject which encapsulates the dataset.
- classmethod from_torchtext_dataset(pytorch_text_dataset)[source]
Function to convert an object from torchtext.datasets.* to a FlexDataObject.
It is mandatory that the dataset contains at least the following transform: torchtext.transforms.ToTensor()
Args:
pytorch_text_dataset (torchtext.datasets.*): a torchtext dataset
Returns:
Dataset: a FlexDataObject which encapsulates the dataset.
- classmethod from_torchvision_dataset(pytorch_dataset)[source]
Function to convert an object from torchvision.datasets.* to a FlexDataObject.
Args:
pytorch_dataset (torchvision.datasets.*): a torchvision dataset.
Returns:
Dataset: a FlexDataObject which encapsulates the dataset.
- to_list()[source]
Function to return the FlexDataObject as list.
- to_numpy(x_dtype=None, y_dtype=None)[source]
Function to return the FlexDataObject as numpy arrays.
- to_tf_dataset()[source]
This function is a utility to transform a Dataset object into a tensorflow.data.Dataset object.
Returns:
tensorflow.data.Dataset: tf dataset object instantiated using the contents of a Dataset
- to_torchvision_dataset(**kwargs)[source]
This function transforms a Dataset into a Torchvision dataset object.
Returns:
torchvision.datasets.VisionDataset: a torchvision dataset with the contents of the Dataset. Note that transforms should be passed as arguments.
- validate()[source]
Function that checks whether the object is correct or not.
- y_data: LazyIndexable | None = None
flex.data.fed_data_distribution module
- class flex.data.fed_data_distribution.FedDataDistribution(create_key: object | None = None)[source]
Bases:
object
- classmethod from_clustering_func(centralized_data: Dataset, clustering_func: Callable)[source]
This function federates data into nodes by means of a clustering function that outputs the node (cluster) to which each data point belongs.
Args:
centralized_data (Dataset): Centralized dataset represented as a FlexDataObject.
clustering_func (Callable): function that receives as arguments a pair of x and y elements from centralized_data and returns the name of the node (cluster) that should own it. The returned type must be Hashable. Note that we only support one node (cluster) per data point.
Returns:
federated_dataset (FedDataset): The federated dataset.
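The idea of federating by a clustering function can be sketched with plain Python containers. This is an illustration of the documented contract (one hashable node id per (x, y) pair), not FLEXible's implementation; federate_by_clustering is a hypothetical helper and the per-node value here is a simple pair of lists rather than a Dataset.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, Hashable, List, Sequence, Tuple


def federate_by_clustering(
    X: Sequence[Any],
    y: Sequence[Any],
    clustering_func: Callable[[Any, Any], Hashable],
) -> Dict[Hashable, Tuple[List[Any], List[Any]]]:
    """Group (x, y) pairs under the node id returned by clustering_func."""
    nodes: Dict[Hashable, Tuple[List[Any], List[Any]]] = defaultdict(
        lambda: ([], [])
    )
    for x_i, y_i in zip(X, y):
        node_id = clustering_func(x_i, y_i)  # exactly one node per data point
        nodes[node_id][0].append(x_i)
        nodes[node_id][1].append(y_i)
    return dict(nodes)


X = [[0.1], [0.9], [0.2], [0.8]]
y = [0, 1, 0, 1]
# Assign each sample to a node named after its label.
federated = federate_by_clustering(X, y, lambda x_i, y_i: f"node_{y_i}")
print(federated)
```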
- classmethod from_config(centralized_data: Dataset, config: FedDatasetConfig)[source]
This function converts a centralized data structure into a federated one. It applies the modifications specified in the configuration to federate the data.
Args:
centralized_data (Dataset): Centralized dataset represented as a FlexDataObject.
config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.
Returns:
federated_dataset (FedDataset): The federated dataset.
- classmethod from_config_with_huggingface_dataset(data, config: FedDatasetConfig, X_columns: list, label_columns: list | None = None)[source]
This function federates a centralized HuggingFace dataset given a FlexDatasetConfig. It transforms a dataset from the HuggingFace Hub datasets into a Dataset and then federates it.
Args:
data (Union[datasets.arrow_dataset.Dataset, str]): The HuggingFace dataset to federate.
config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.
X_columns (List[str]): List with the names of the columns to load.
label_columns (list): List with the names of the label columns.
- classmethod from_config_with_tfds_image_dataset(data, config: FedDatasetConfig)[source]
This function federates a centralized tensorflow dataset given a FlexDatasetConfig. This function will transform a dataset from the tensorflow_datasets module into a Dataset and then it will federate it.
Args:
data (Dataset): The tensorflow dataset.
config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.
- classmethod from_config_with_tfds_text_dataset(data, config: FedDatasetConfig, X_columns: list, label_columns: list)[source]
This function federates a centralized tensorflow dataset given a FlexDatasetConfig. This function will transform a dataset from the tensorflow_datasets module into a Dataset and then it will federate it.
Args:
data (Dataset): The tensorflow dataset.
config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.
X_columns (List): List that contains the column names for the input features.
label_columns (List): List that contains the column names for the output features.
- classmethod from_config_with_torchtext_dataset(data, config: FedDatasetConfig)[source]
This function federates a centralized torchtext dataset given a FlexDatasetConfig. This function will transform the torchtext dataset into a Dataset and then it will federate it.
Args:
data (Dataset): The torchtext dataset.
config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.
- classmethod from_config_with_torchvision_dataset(data, config: FedDatasetConfig)[source]
This function federates a centralized torchvision dataset given a FlexDatasetConfig. This function will transform a dataset from the torchvision module into a Dataset and then it will federate it.
Args:
data (Dataset): The torchvision dataset.
config (FedDatasetConfig): FlexDatasetConfig with the configuration to federate the centralized dataset.
- classmethod iid_distribution(centralized_data: Dataset, n_nodes: int = 2)[source]
Function to create a FedDataset for an IID experiment. We consider the simplest situation in which the data is distributed by giving the same amount of data to each node.
Args:
centralized_data (Dataset): Centralized dataset represented as a FlexDataObject.
n_nodes (int): Number of nodes in the Federated Learning experiment. Default 2.
Returns:
federated_dataset (FedDataset): The federated dataset.
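A minimal sketch of an equal-sized IID split follows. It works on sample indexes only and makes two assumptions that the description above does not confirm: samples are shuffled before being dealt out, and leftover samples that do not divide evenly are dropped. The helper iid_split_indexes is hypothetical.

```python
import random
from typing import Dict, List, Optional


def iid_split_indexes(
    n_samples: int, n_nodes: int = 2, seed: Optional[int] = None
) -> Dict[int, List[int]]:
    """Shuffle sample indexes and give the same number of them to each node."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    indexes = list(range(n_samples))
    random.Random(seed).shuffle(indexes)
    per_node = n_samples // n_nodes  # assumption: leftover samples are dropped
    return {
        node: indexes[node * per_node:(node + 1) * per_node]
        for node in range(n_nodes)
    }


split = iid_split_indexes(10, n_nodes=3, seed=0)
print({node: len(idx) for node, idx in split.items()})  # {0: 3, 1: 3, 2: 3}
```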
flex.data.fed_dataset module
- class flex.data.fed_dataset.FedDataset(dict=None, /, **kwargs)[source]
Bases:
UserDict
Class that represents a federated dataset for the Flex library. The dataset contains the ids of the nodes and the dataset associated with each node.
- data
The structure is a dictionary with the node ids as keys and the dataset as value.
- Type:
collections.UserDict
- apply(func: Callable, node_ids: List[Hashable] | None = None, num_proc: int = 1, **kwargs)[source]
This function applies a custom function to the FlexDataset in parallel.
The **kwargs passed to this function are forwarded as the keyword arguments of the custom function.
Args:
func (Callable, optional): Function to apply to preprocess the data.
node_ids (List[Hashable], optional): List containing the ids of the nodes where func will be applied. Each element of the list must be hashable and part of the FlexDataset. Defaults to None.
num_proc (int, optional): Number of processes used to parallelize; negative values are ignored. Defaults to 1.
Returns:
FedDataset: The modified FlexDataset.
Raises:
ValueError: All node ids given must be in the FlexDataset.
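The behaviour described above can be sketched sequentially with a small UserDict subclass. ToyFedDataset is a hypothetical stand-in, not the FedDataset class: it ignores num_proc, assumes that node_ids=None selects every node, and returns a new object rather than modifying the dataset in place. None of these choices is confirmed by the reference text.

```python
from collections import UserDict
from typing import Callable, Hashable, List, Optional


class ToyFedDataset(UserDict):
    """Hypothetical stand-in mapping node ids to plain lists of numbers."""

    def apply(
        self,
        func: Callable,
        node_ids: Optional[List[Hashable]] = None,
        **kwargs,
    ) -> "ToyFedDataset":
        # Assumption: None means "apply to every node".
        selected = list(self.keys()) if node_ids is None else node_ids
        missing = [n for n in selected if n not in self]
        if missing:
            raise ValueError(f"Node ids not in the dataset: {missing}")
        result = ToyFedDataset(self.data)  # shallow copy, other nodes are kept
        for node_id in selected:
            # Extra keyword arguments are forwarded to the custom function.
            result[node_id] = func(self[node_id], **kwargs)
        return result


def scale(values, factor=1.0):
    return [v * factor for v in values]


fed = ToyFedDataset({"a": [1, 2], "b": [3, 4]})
scaled = fed.apply(scale, node_ids=["a"], factor=10)
print(dict(scaled))  # {'a': [10, 20], 'b': [3, 4]}
```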
- normalize(node_ids: List[Hashable] | None = None, num_proc: int = 0, *args, **kwargs)[source]
Function that normalizes the data over the nodes.
Args:
fld (FedDataset): FlexDataset containing all the data from the nodes.
node_ids (List[Hashable], optional): List containing the ids of the nodes whose data will be normalized. Each element of the list must be hashable. Defaults to None.
num_proc (int, optional): Number of processes used to parallelize. Defaults to 0.
Returns:
FedDataset: The normalized FlexDataset.
- one_hot_encoding(node_ids: List[Hashable] | None = None, num_proc: int = 0, *args, **kwargs)[source]
Function that applies one-hot encoding to the node labels.
Args:
fld (FedDataset): FlexDataset containing all the data from the nodes.
node_ids (List[Hashable], optional): List containing the ids of the nodes whose labels will be one-hot encoded. Each element of the list must be hashable. Defaults to None.
num_proc (int, optional): Number of processes used to parallelize. Defaults to 0.
Returns:
FedDataset: The FlexDataset with one-hot encoded labels.
flex.data.fed_dataset_config module
- class flex.data.fed_dataset_config.FedDatasetConfig(seed: int | None = None, n_nodes: int = 2, shuffle: bool = False, node_ids: List[Hashable] | None = None, weights: ndarray[Any, dtype[_ScalarType_co]] | None = None, weights_per_label: ndarray[Any, dtype[_ScalarType_co]] | None = None, replacement: bool = False, labels_per_node: int | ndarray[Any, dtype[_ScalarType_co]] | Tuple[int] | None = None, features_per_node: int | ndarray[Any, dtype[_ScalarType_co]] | Tuple[int] | None = None, indexes_per_node: ndarray[Any, dtype[_ScalarType_co]] | None = None, group_by_label_index: int | None = None, keep_labels: List[bool] | None = None)[source]
Bases:
object
Class used to represent a configuration to federate a centralized dataset. The following table shows the compatibility of each option:
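Options compatibility
| Option | n_nodes | node_ids | weights | weights_per_label | replacement | labels_per_node | features_per_node | indexes_per_node | group_by_label_index | keep_labels | shuffle |
|---|---|---|---|---|---|---|---|---|---|---|---|
| n_nodes | | Y | Y | Y | Y | Y | Y | N | N | Y | Y |
| node_ids | | | Y | Y | Y | Y | Y | Y | N | Y | Y |
| weights | | | | N | Y | Y | Y | N | N | Y | Y |
| weights_per_label | | | | | Y | N | N | N | N | Y | Y |
| replacement | | | | | | Y | N | N | N | Y | Y |
| labels_per_node | | | | | | | N | N | N | Y | Y |
| features_per_node | | | | | | | | N | N | Y | Y |
| indexes_per_node | | | | | | | | | N | Y | Y |
| group_by_label_index | | | | | | | | | | N | Y |
| keep_labels | | | | | | | | | | | Y |
| shuffle | | | | | | | | | | | |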
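The compatibility table can be encoded as a set of incompatible pairs and checked programmatically. The sketch below lists the pairs marked N in the table; INCOMPATIBLE and find_conflicts are hypothetical names, and this is not how FedDatasetConfig.validate() is implemented.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# Pairs marked "N" in the compatibility table.
INCOMPATIBLE: Set[FrozenSet[str]] = {
    frozenset(pair)
    for pair in [
        ("n_nodes", "indexes_per_node"),
        ("n_nodes", "group_by_label_index"),
        ("node_ids", "group_by_label_index"),
        ("weights", "weights_per_label"),
        ("weights", "indexes_per_node"),
        ("weights", "group_by_label_index"),
        ("weights_per_label", "labels_per_node"),
        ("weights_per_label", "features_per_node"),
        ("weights_per_label", "indexes_per_node"),
        ("weights_per_label", "group_by_label_index"),
        ("replacement", "features_per_node"),
        ("replacement", "indexes_per_node"),
        ("replacement", "group_by_label_index"),
        ("labels_per_node", "features_per_node"),
        ("labels_per_node", "indexes_per_node"),
        ("labels_per_node", "group_by_label_index"),
        ("features_per_node", "indexes_per_node"),
        ("features_per_node", "group_by_label_index"),
        ("indexes_per_node", "group_by_label_index"),
        ("group_by_label_index", "keep_labels"),
    ]
}


def find_conflicts(options_set: Iterable[str]) -> List[Tuple[str, str]]:
    """Return every pair of the given options that the table marks as N."""
    chosen = sorted(set(options_set))
    return [
        (a, b)
        for i, a in enumerate(chosen)
        for b in chosen[i + 1:]
        if frozenset((a, b)) in INCOMPATIBLE
    ]


print(find_conflicts(["weights", "weights_per_label", "shuffle"]))
# [('weights', 'weights_per_label')]
```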
- seed
Seed used to make the federated dataset generated reproducible with this configuration. Default None.
- Type:
Optional[int]
- n_nodes
Number of nodes among which to split a centralized dataset. Default 2.
- Type:
int
- shuffle
If True data is shuffled before being sampled. Default False.
- Type:
bool
- node_ids
Ids to identify each node. If not provided, nodes will be indexed using integers. If n_nodes is also given, we consider up to n_nodes elements. Default None.
- Type:
Optional[List[Hashable]]
- weights
A numpy.array which provides the proportion of data to give to each node. Default None.
- Type:
Optional[npt.NDArray]
- weights_per_label
A numpy.array which provides the proportion of data to give to each node and class of the dataset to federate. We expect a bidimensional array of shape (n, m) where “n” is the number of nodes and “m” is the number of labels of the dataset to federate. Default None.
- Type:
Optional[npt.NDArray]
- replacement
Whether the sampling procedure used to split a centralized dataset is with replacement or not. Default False.
- Type:
bool
- labels_per_node
Labels to assign to each node. If provided as an int, it is the number of labels per node. If provided as a tuple of ints, it establishes a minimum and a maximum number of labels per node, and a random number sampled in that interval decides the number of labels of each node. If provided as a list of lists, it establishes the labels assigned to each node. Default None.
- Type:
Optional[Union[int, npt.NDArray, Tuple[int]]]
- features_per_node
Features to assign to each node. It shares the same interface as labels_per_node. Default None.
- Type:
Optional[Union[int, npt.NDArray, Tuple[int]]]
- indexes_per_node
Data indexes to assign to each node. Default None.
- Type:
Optional[npt.NDArray]
- group_by_label_index
Index which indicates which feature unique values will be used to generate federated nodes. Default None.
- Type:
Optional[int]
- keep_labels
Whether or not each node keeps its labels (y_data).
- Type:
Optional[list[bool]]
- features_per_node: int | ndarray[Any, dtype[_ScalarType_co]] | Tuple[int] | None = None
- group_by_label_index: int | None = None
- indexes_per_node: ndarray[Any, dtype[_ScalarType_co]] | None = None
- keep_labels: List[bool] | None = None
- labels_per_node: int | ndarray[Any, dtype[_ScalarType_co]] | Tuple[int] | None = None
- n_nodes: int = 2
- node_ids: List[Hashable] | None = None
- replacement: bool = False
- seed: int | None = None
- shuffle: bool = False
- validate()[source]
This function checks whether the configuration to federate a dataset is correct.
- weights: ndarray[Any, dtype[_ScalarType_co]] | None = None
- weights_per_label: ndarray[Any, dtype[_ScalarType_co]] | None = None
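The three accepted forms of labels_per_node described above (an int, a (min, max) tuple, or a list of lists) can be sketched as follows. The helper resolve_labels_per_node is hypothetical. It assumes that the interval bounds are inclusive and that labels are drawn at random without repetition within a node; the reference text does not specify either detail.

```python
import random
from typing import List, Optional, Sequence, Tuple, Union

LabelsPerNode = Union[int, Tuple[int, int], Sequence[Sequence[int]]]


def resolve_labels_per_node(
    labels_per_node: LabelsPerNode,
    n_nodes: int,
    all_labels: Sequence[int],
    seed: Optional[int] = None,
) -> List[List[int]]:
    """Return the list of labels assigned to each node."""
    rng = random.Random(seed)
    if isinstance(labels_per_node, int):
        # An int fixes the number of labels for every node.
        counts = [labels_per_node] * n_nodes
    elif isinstance(labels_per_node, tuple):
        # A tuple gives a (min, max) interval; assumption: bounds are inclusive.
        low, high = labels_per_node
        counts = [rng.randint(low, high) for _ in range(n_nodes)]
    else:
        # A list of lists states the labels of each node explicitly.
        return [list(labels) for labels in labels_per_node]
    # Assumption: labels are sampled at random, without repetition per node.
    return [sorted(rng.sample(list(all_labels), k)) for k in counts]


labels = list(range(5))
print(resolve_labels_per_node(2, n_nodes=3, all_labels=labels, seed=1))
print(resolve_labels_per_node((1, 3), n_nodes=3, all_labels=labels, seed=1))
print(resolve_labels_per_node([[0, 1], [2]], n_nodes=2, all_labels=labels))
```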
- exception flex.data.fed_dataset_config.InvalidConfig[source]
Bases:
ValueError
Raised when the input config is wrong.
flex.data.pluggable_datasets module
- class flex.data.pluggable_datasets.PluggableDataset(cls, bases, classdict, **kwds)[source]
Bases:
EnumMeta
- class flex.data.pluggable_datasets.PluggableDatasetString(cls, bases, classdict, **kwds)[source]
Bases:
EnumMeta
- class flex.data.pluggable_datasets.PluggableHuggingFace(value)[source]
Bases:
Enum
Class containing some datasets that can be loaded into FLEXible. Other datasets, such as glue-cola, can be plugged in, but they require a special configuration. This depends on passing the correct arguments to the load_dataset function from HuggingFace datasets rather than on a limitation of our platform, so other datasets remain easy to use.
We show some example datasets that can be loaded with the function FedDataDistribution.from_config_with_huggingface_dataset by giving just a config and the string associated with each dataset in the Enum.
We selected these datasets because their loading can be automated, but the framework supports almost any dataset, since data can be loaded as numpy arrays. We only list the datasets that can be loaded as follows: dataset = load_dataset(name, split='train').
Some datasets need extra parameters, such as the version of the dataset, or do not have any split. The user must handle these before loading the dataset into FLEXible, which is quick and easy, since the user only needs to select X_train and y_train as numpy arrays.
Args:
Enum (enum): Tuple containing the name, X_columns and y_columns to use in the load_dataset function.
- AG_NEWS_HF = ('ag_news', 'text', 'label')
- AMAZON_POLARITY_HF = ('amazon_polarity', ['title', 'content'], 'label')
- APPREVIEWS_HF = ('app_reviews', 'review', 'star')
- FINANCIAL_PHRASEBANK_HF = ('financial_phrasebank', 'sentence', 'label')
- GLUE_COLA_HF = ('glue', 'sentence', 'label')
- IMDB_HF = ('imdb', 'text', 'label')
- ROTTEN_TOMATOES_HF = ('rotten_tomatoes', 'text', 'label')
- SQUAD_HF = ('squad', ['context', 'question'], 'answers')
- TWEET_EVAL_EMOJI_HF = ('tweet_eval', 'text', 'label')
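Each member listed above is a (name, X_columns, y_columns) tuple in which a column entry may be a single string or a list of strings. The sketch below mirrors two of the listed members in a hypothetical ToyPluggableHF enum and shows one way to normalize the entries into lists; as_column_lists is an illustrative helper, not part of FLEXible.

```python
from enum import Enum
from typing import List, Tuple, Union


class ToyPluggableHF(Enum):
    """Hypothetical mirror of two members: (name, X_columns, y_columns)."""
    AG_NEWS_HF = ("ag_news", "text", "label")
    AMAZON_POLARITY_HF = ("amazon_polarity", ["title", "content"], "label")


def _to_list(columns: Union[str, List[str]]) -> List[str]:
    # A single column name becomes a one-element list.
    return [columns] if isinstance(columns, str) else list(columns)


def as_column_lists(member: ToyPluggableHF) -> Tuple[str, List[str], List[str]]:
    """Unpack a member into (name, X_columns, label_columns) with list columns."""
    name, x_columns, y_columns = member.value
    return name, _to_list(x_columns), _to_list(y_columns)


for member in ToyPluggableHF:
    print(as_column_lists(member))
```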
- class flex.data.pluggable_datasets.PluggableTorchtext(value)[source]
Bases:
Enum
Class containing all the datasets that can be plugged into a Dataset without any preprocessing.
Any other dataset from the TorchText library will need further preprocessing.
Args:
Enum (enum): torchtext class for each dataset that can be accepted on our platform.
- class flex.data.pluggable_datasets.PluggableTorchvision(value)[source]
Bases:
Enum
Class containing all the datasets that can be plugged into a Dataset without any preprocessing.
Any other dataset from the Torchvision library will need further preprocessing.
Args:
Enum (enum): torchvision class for each dataset that can be accepted on our platform.
flex.data.preprocessing_utils module
- flex.data.preprocessing_utils.normalize(node_dataset, *args, **kwargs)[source]
Function that normalizes federated data.
Args:
node_dataset (Dataset): node_dataset to normalize the data.
Returns:
Dataset: Returns the node_dataset with the X_data property normalized.
- flex.data.preprocessing_utils.one_hot_encoding(node_dataset, *args, **kwargs)[source]
Function that applies one-hot encoding to the labels of a node_dataset.
Args:
node_dataset (Dataset): node_dataset whose labels will be one-hot encoded.
Raises:
ValueError: Raises value error if n_labels is not given in the kwargs argument.
Returns:
Dataset: Returns the node_dataset with the y_data property updated.
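The contract described above (n_labels must be supplied through the keyword arguments, otherwise a ValueError is raised) can be sketched on a plain list of integer labels. The function one_hot_encode below is a hypothetical illustration that works on lists, not on a Dataset, and it is not the library's implementation.

```python
from typing import List, Sequence


def one_hot_encode(labels: Sequence[int], **kwargs) -> List[List[int]]:
    """One-hot encode integer labels; n_labels must be given as a keyword."""
    if "n_labels" not in kwargs:
        raise ValueError("n_labels must be passed as a keyword argument")
    n_labels = kwargs["n_labels"]
    encoded = []
    for label in labels:
        row = [0] * n_labels
        row[label] = 1  # mark the position of the class
        encoded.append(row)
    return encoded


print(one_hot_encode([0, 2, 1], n_labels=3))
# [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```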
Module contents