bead.src.utils package

Submodules

bead.src.utils.conversion module

Data conversion utilities for CSV to HDF5/NumPy.

This module contains functions to convert CSV files to HDF5 and .npy files in parallel. It handles efficient reading and processing of large CSV files with event, jet, and constituent-level data, and saves them in a structured format for later analysis.

Functions:

  • calculate_jet_properties_numba: Numba-accelerated calculation of jet properties.

  • process_event: Process a single event row from CSV.

  • csv_chunk_generator: Generator yielding CSV file chunks to process.

  • append_to_hdf5: Append data chunk to HDF5 dataset.

  • process_chunk: Process a chunk of CSV rows into homogeneous arrays.

  • convert_csv_to_hdf5_npy_parallel: Main function to convert CSV to HDF5/NumPy.

bead.src.utils.conversion.append_to_hdf5(h5file, dataset_name, data_chunk)[source]

Append a data chunk to an existing resizable HDF5 dataset.

bead.src.utils.conversion.calculate_jet_properties_numba(pt_arr, eta_arr, phi_arr)[source]

Numba-accelerated calculation of jet properties using vectorized NumPy operations. Processes all constituents in a jet simultaneously.

bead.src.utils.conversion.convert_csv_to_hdf5_npy_parallel(csv_file, output_prefix, out_path, file_type='h5', chunk_size=10000, n_workers=4, verbose=False)[source]

Convert CSV to HDF5 (or .npy) using homogeneous 2D arrays. Each dataset is built as a 2D array with a fixed number of columns, so that columns can later be accessed by index (e.g. jets[:,4] for jet_pt).
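
A minimal usage sketch; the file names, output location, and the dataset key read back below are assumptions for illustration:

import h5py

from bead.src.utils.conversion import convert_csv_to_hdf5_npy_parallel

# Hypothetical paths; adjust to your workspace layout.
convert_csv_to_hdf5_npy_parallel(
    "monojet_bkg.csv",
    output_prefix="bkg",
    out_path="data/",
    file_type="h5",
    chunk_size=10000,
    n_workers=4,
    verbose=True,
)

# Columns are accessed by index on the homogeneous 2D arrays,
# e.g. jets[:, 4] for jet_pt ("jets" is an assumed dataset name).
with h5py.File("data/bkg.h5", "r") as f:
    jets = f["jets"][:]
    jet_pt = jets[:, 4]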

bead.src.utils.conversion.csv_chunk_generator(csv_file, chunk_size=10000)[source]

Generator yielding CSV file chunks (lists of rows) to avoid loading the entire file. Skips empty rows and rows with ‘evtwt’ in the first column.

bead.src.utils.conversion.process_chunk(chunk, start_evt_id)[source]

Process a chunk of CSV rows into homogeneous 2D arrays. Instead of creating structured arrays (with named fields), we build plain lists of lists and convert them to a homogeneous NumPy array.
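
A hedged sketch of how the generator, chunk processor, and HDF5 appender compose; the unpacking of process_chunk's return value and the dataset names are assumptions, and the resizable datasets are assumed to exist already:

import h5py

from bead.src.utils.conversion import (
    append_to_hdf5,
    csv_chunk_generator,
    process_chunk,
)

evt_id = 0
with h5py.File("out.h5", "a") as h5:
    for chunk in csv_chunk_generator("events.csv", chunk_size=10000):
        # Assumed return structure: one homogeneous array per data level.
        events, jets, constituents = process_chunk(chunk, start_evt_id=evt_id)
        append_to_hdf5(h5, "events", events)
        append_to_hdf5(h5, "jets", jets)
        append_to_hdf5(h5, "constituents", constituents)
        evt_id += len(events)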

bead.src.utils.conversion.process_event(evt_id, row)[source]
Process a single event row:
  • Convert string data to floats.

  • Extract event-level data.

  • Loop over jets and their constituents.

  • Use vectorized (Numba) calculation for jet properties.

bead.src.utils.data_processing module

Data processing utilities for HDF5/NumPy arrays.

This module provides functions for loading, processing, and preparing data for training and inference. It includes utilities for selecting the top jets and constituents, normalizing data, and converting between different data formats.

Functions:

  • load_data: Load data from HDF5 files.

  • select_top_jets_and_constituents: Select top N jets and M constituents.

  • process_and_save_tensors: Process input file and save as PyTorch tensors.

  • preproc_inputs: Preprocess inputs for training or inference.

bead.src.utils.data_processing.load_data(file_path, file_type='h5', verbose: bool = False)[source]

Load data from either an HDF5 file or .npy files.

bead.src.utils.data_processing.preproc_inputs(paths, config, keyword, verbose: bool = False)[source]
bead.src.utils.data_processing.process_and_save_tensors(in_path, out_path, output_prefix, config, verbose: bool = False)[source]

Process the input file, parallelize selections, and save the results as PyTorch tensors.

bead.src.utils.data_processing.select_top_jets_and_constituents(jets, constituents, n_jets=3, n_constits=15, verbose=False)[source]

Select top n_jets per event and, for each selected jet, top n_constits constituents.

Returns:

jets_out: (num_events, n_jets, jets.shape[1])

constits_out: (num_events, n_jets * n_constits, constituents.shape[1])

Return type:

tuple
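
A hedged usage sketch (assumes the arrays come from the conversion step, i.e. 2D tables with evt_id in column 0; load_data's return structure is an assumption here):

from bead.src.utils.data_processing import load_data, select_top_jets_and_constituents

# Hypothetical file produced by an earlier conversion run.
events, jets, constituents = load_data("data/bkg.h5", file_type="h5")  # assumed unpacking

jets_out, constits_out = select_top_jets_and_constituents(
    jets, constituents, n_jets=3, n_constits=15
)
# jets_out:      (num_events, 3, jets.shape[1])
# constits_out:  (num_events, 45, constituents.shape[1])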

bead.src.utils.diagnostics module

Diagnostic utilities for model analysis and profiling.

This module provides functions for extracting and visualizing neural network activations, profiling model performance, and generating diagnostic plots for understanding model behavior.

Functions:

  • get_mean_node_activations: Calculate mean activations for each node.

  • dict_to_square_matrix: Convert activation dictionary to square matrix.

  • plot: Generate neural activation pattern plot.

  • nap_diagnose: Neural activation pattern diagnosis.

  • pytorch_profile: Profile PyTorch code execution.

  • c_profile: Profile Python code execution with cProfile.

bead.src.utils.diagnostics.c_profile(func, *args, **kwargs)[source]

Profile the function func with cProfile.

Parameters:

func (callable) – The function to be profiled.

Returns:

The result of the function func execution.

Return type:

result

bead.src.utils.diagnostics.dict_to_square_matrix(input_dict: dict) array[source]

Converts an input dictionary into a square np.array, padding with NaNs when a dict entry is shorter than the final matrix dimension.

Parameters:

input_dict (dict)

Returns:

square_matrix (np.array)
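
A minimal NumPy sketch of the padding idea (illustrative only; the function's exact output layout may differ):

import numpy as np

activations = {"layer1": np.array([0.1, 0.2, 0.3]), "layer2": np.array([0.4, 0.5])}
width = max(len(v) for v in activations.values())
matrix = np.full((len(activations), width), np.nan)
for i, values in enumerate(activations.values()):
    matrix[i, : len(values)] = values
# Rows shorter than the widest entry are padded with NaN.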

bead.src.utils.diagnostics.get_mean_node_activations(input_dict: dict) dict[source]
bead.src.utils.diagnostics.nap_diagnose(input_path: str, output_path: str, verbose: bool = False) None[source]
bead.src.utils.diagnostics.plot(data: array, output_path: str) None[source]
bead.src.utils.diagnostics.pytorch_profile(f, *args, **kwargs)[source]

Performs PyTorch profiling of the CPU/GPU time and memory consumed by executing the function f.

Parameters:

f (callable) – The function to be profiled.

Returns:

The result of the function f execution.

Return type:

result
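
A short usage sketch for both profiling helpers, using a toy function:

from bead.src.utils.diagnostics import c_profile, pytorch_profile

def slow_fn(n):
    return sum(i * i for i in range(n))

result = c_profile(slow_fn, 1_000_000)        # cProfile stats for slow_fn
result = pytorch_profile(slow_fn, 1_000_000)  # CPU/GPU time and memory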

bead.src.utils.ggl module

Control center for BEAD command-line interface.

This module serves as the main entry point for the BEAD CLI. It handles command-line arguments, project creation, and orchestrates the execution of various modes like data conversion, training, inference, and visualization. Think of this file as your Google Assistant: it is a collection of simple helper functions and the control center that accesses all other src files.

Functions:

  • get_arguments: Parse command-line arguments.

  • create_default_config: Create default configuration file.

  • create_new_project: Create directory structure for new project.

  • convert_csv: Convert CSV files to HDF5 or NumPy format.

  • prepare_inputs: Process input data and create tensors.

  • run_training: Execute model training pipeline.

  • run_inference: Execute model inference pipeline.

  • run_plots: Generate plots from results.

  • run_diagnostics: Run model diagnostics.

  • run_full_chain: Execute a sequence of operations.

Classes:

Config: Dataclass for storing configuration settings.

class bead.src.utils.ggl.Config(workspace_name: str, project_name: str, file_type: str, parallel_workers: int, chunk_size: int, num_jets: int, num_constits: int, latent_space_size: int, normalizations: str, invert_normalizations: bool, train_size: float, model_name: str, input_level: str, input_features: str, model_init: str, loss_function: str, optimizer: str, epochs: int, lr: float, batch_size: int, early_stopping: bool, early_stoppin_patience: int, lr_scheduler: bool, lr_scheduler_patience: int, latent_space_plot_style: str, subsample_plot: bool, use_ddp: bool, use_amp: bool, min_delta: int, reg_param: float, intermittent_model_saving: bool, intermittent_saving_patience: int, activation_extraction: bool, deterministic_algorithm: bool, separate_model_saving: bool, subsample_size: int, contrastive_temperature: float, contrastive_weight: float, overlay_roc: bool, overlay_roc_projects: list, overlay_roc_save_location: str, overlay_roc_filename: str)[source]

Bases: object

Defines a configuration dataclass

activation_extraction: bool
batch_size: int
chunk_size: int
contrastive_temperature: float
contrastive_weight: float
deterministic_algorithm: bool
early_stoppin_patience: int
early_stopping: bool
epochs: int
file_type: str
input_features: str
input_level: str
intermittent_model_saving: bool
intermittent_saving_patience: int
invert_normalizations: bool
latent_space_plot_style: str
latent_space_size: int
loss_function: str
lr: float
lr_scheduler: bool
lr_scheduler_patience: int
min_delta: int
model_init: str
model_name: str
normalizations: str
num_constits: int
num_jets: int
optimizer: str
overlay_roc: bool
overlay_roc_filename: str
overlay_roc_projects: list
overlay_roc_save_location: str
parallel_workers: int
project_name: str
reg_param: float
separate_model_saving: bool
subsample_plot: bool
subsample_size: int
train_size: float
use_amp: bool
use_ddp: bool
workspace_name: str
bead.src.utils.ggl.convert_csv(paths, config, verbose: bool = False)[source]

Convert the input ‘.csv’ files into the file_type selected in the config file (‘.h5’ by default).

Separate event-level, jet-level and constituent-level data into separate datasets/files.

Parameters:
  • data_path (path) – Path to the input csv files

  • output_path (path) – Base path used to determine the output location

  • config (dataClass) – Base class selecting user inputs

  • verbose (bool) – If True, prints out more information

Outputs:

A ProjectName_OutputPrefix.h5 file which includes:
  • Event-level dataset

  • Jet-level dataset

  • Constituent-level dataset

or

ProjectName_OutputPrefix_{data-level}.npy files which contain the same information as above, split into three separate files.

bead.src.utils.ggl.create_default_config(workspace_name: str, project_name: str) str[source]

Creates a default config file for a project.

Parameters:
  • workspace_name (str) – Name of the workspace.

  • project_name (str) – Name of the project.

Returns:

Default config file.

Return type:

str

bead.src.utils.ggl.create_new_project(workspace_name: str, project_name: str, verbose: bool = False, base_path: str = 'bead/workspaces') None[source]

Creates a new project directory, output subdirectories, and config files within a workspace.

Parameters:
  • workspace_name (str) – Creates a workspace (dir) for storing data and projects with this name.

  • project_name (str) – Creates a project (dir) for storing configs and outputs with this name.

  • verbose (bool, optional) – Whether to print out the progress. Defaults to False.

bead.src.utils.ggl.get_arguments()[source]

Determines command-line arguments specified by the BEAD user. Use --help to see what options are available.

Returns:

(.py file, string, folder) – the .py file containing the config options, a string determining which mode to run, and the projects directory where outputs go.

bead.src.utils.ggl.prepare_inputs(paths, config, verbose: bool = False)[source]

Read the input data and generate torch tensors ready to train on.

Select number of leading jets per event and number of leading constituents per jet to be used for training.

Parameters:
  • paths – Dictionary of common paths used in the pipeline

  • config (dataClass) – Base class selecting user inputs

  • verbose (bool) – If True, prints out more information

Outputs:

Tensor files which include:
  • Event-level dataset – [evt_id, evt_weight, met, met_phi, num_jets]

  • Jet-level dataset – [evt_id, jet_id, num_constituents, jet_btag, jet_pt, jet_eta, jet_phi]

  • Constituent-level dataset – [evt_id, jet_id, constituent_id, jet_btag, constituent_pt, constituent_eta, constituent_phi]

bead.src.utils.ggl.run_diagnostics(project_path, verbose: bool)[source]

Calls diagnostics.nap_diagnose()

Parameters:
  • input_path (str) – path to the np.array containing the activation values

  • output_path (str) – path to store the diagnostics pdf

bead.src.utils.ggl.run_full_chain(workspace_name: str, project_name: str, paths: dict, config: dict, options: str, verbose: bool = False) None[source]

Execute a sequence of operations based on the provided options string.

Parameters:
  • workspace_name – Name of the workspace for new projects

  • project_name – Name of the project for new projects

  • paths – Dictionary of file paths and directories

  • config – Configuration dictionary for operations

  • options – Underscore-separated string specifying the workflow sequence

  • verbose – Whether to show verbose output

Example

run_full_chain(
    "my_workspace", "my_project", paths, config,
    "newproject_convertcsv_prepareinputs_train_detect", verbose=True,
)

bead.src.utils.ggl.run_inference(paths, config, verbose: bool = False)[source]

Main function calling the inference functions, run when --mode=detect is selected.

Parameters:
  • paths (dictionary) – Dictionary of common paths used in the pipeline

  • config (dataClass) – Base class selecting user inputs

  • verbose (bool) – If True, prints out more information

bead.src.utils.ggl.run_plots(paths, config, verbose: bool = False)[source]

Main function calling the plotting functions, run when --mode=plot is selected. The main functions this calls are: plotting.plot_losses, plotting.plot_latent_variables, plotting.plot_mu_logvar and plotting.plot_roc_curve.

Parameters:
  • paths (dictionary) – Dictionary of common paths used in the pipeline

  • config (dataClass) – Base class selecting user inputs

  • verbose (bool) – If True, prints out more information

bead.src.utils.ggl.run_training(paths, config, verbose: bool = False)[source]

Main function calling the training functions, run when --mode=train is selected. The functions called are: data_processing.preproc_inputs and training.train.

Parameters:
  • paths (dictionary) – Dictionary of common paths used in the pipeline

  • config (dataClass) – Base class selecting user inputs

  • verbose (bool) – If True, prints out more information

bead.src.utils.helper module

class bead.src.utils.helper.ChainedScaler(scalers)[source]

Bases: BaseEstimator, TransformerMixin

Chains a list of scaler transformations. The transformation is applied sequentially (in the order provided) and the inverse transformation is applied in reverse order.

fit(X, y=None)[source]
inverse_transform(X)[source]
transform(X)[source]
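
A minimal sketch chaining a log transform with standard scaling, assuming the scalers expose the fit/transform/inverse_transform interface documented here:

import numpy as np
from sklearn.preprocessing import StandardScaler

from bead.src.utils.helper import ChainedScaler, Log1pScaler

X = np.abs(np.random.randn(200, 3)) * 100.0  # positive-skewed toy features
scaler = ChainedScaler([Log1pScaler(), StandardScaler()])
scaler.fit(X)
X_norm = scaler.transform(X)               # log1p first, then standardize
X_back = scaler.inverse_transform(X_norm)  # inverted in reverse order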
class bead.src.utils.helper.CustomDataset(data_tensor, label_tensor)[source]

Bases: Dataset

A custom PyTorch Dataset for handling paired data and label tensors.

This dataset provides a simple interface for accessing data points and their corresponding labels, which is compatible with PyTorch’s DataLoader.

data

The data tensor containing features.

Type:

torch.Tensor

labels

The labels tensor associated with the data.

Type:

torch.Tensor
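
A minimal sketch pairing the dataset with a DataLoader:

import torch
from torch.utils.data import DataLoader

from bead.src.utils.helper import CustomDataset

data = torch.randn(100, 8)  # toy features
labels = torch.zeros(100)   # toy labels
dataset = CustomDataset(data, labels)
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch_data, batch_labels in loader:
    pass  # training step goes here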

class bead.src.utils.helper.EarlyStopping(patience: int, min_delta: float)[source]

Bases: object

Class to perform early stopping during model training.

Parameters:
  • patience (int) – The number of epochs to wait before stopping the training process if the validation loss doesn’t improve.

  • min_delta (float) – The minimum difference between the new loss and the previous best loss for the new loss to be considered an improvement.

counter

Counts the number of times the validation loss hasn’t improved.

Type:

int

best_loss

The best validation loss observed so far.

Type:

float

early_stop

Flag that indicates whether early stopping criteria have been met.

Type:

bool
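
A hypothetical training-loop sketch; the call syntax mirrors the LRScheduler example below and is an assumption, as the exact invocation is not shown on this page:

early_stopping = EarlyStopping(patience=5, min_delta=1e-4)
for epoch in range(num_epochs):
    val_loss = validate(model, val_data_loader)  # hypothetical helper
    early_stopping(val_loss)  # assumed to update counter/best_loss
    if early_stopping.early_stop:
        break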

class bead.src.utils.helper.L2Normalizer[source]

Bases: BaseEstimator, TransformerMixin

L2 normalization per feature of data

fit(X, y=None)[source]
inverse_transform(X)[source]
transform(X)[source]
class bead.src.utils.helper.LRScheduler(optimizer, patience, min_lr=1e-06, factor=0.5)[source]

Bases: object

A learning rate scheduler that adjusts the learning rate of an optimizer based on the training loss.

Parameters:
  • optimizer (torch.optim.Optimizer) – The optimizer whose learning rate will be adjusted.

  • patience (int) – The number of epochs with no improvement in training loss after which the learning rate will be reduced.

  • min_lr (float, optional) – The minimum learning rate that can be reached (default: 1e-6).

  • factor (float, optional) – The factor by which the learning rate will be reduced (default: 0.5).

lr_scheduler

The PyTorch learning rate scheduler that actually performs the adjustments.

Type:

torch.optim.lr_scheduler.ReduceLROnPlateau

Example usage:

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
lr_scheduler = LRScheduler(optimizer, patience=3, min_lr=1e-6, factor=0.5)
for epoch in range(num_epochs):
    train_loss = train(model, train_data_loader)
    lr_scheduler(train_loss)
    # …

class bead.src.utils.helper.Log1pScaler[source]

Bases: BaseEstimator, TransformerMixin

Log(1+x) transformer for positive-skewed HEP features

fit(X, y=None)[source]
inverse_transform(X)[source]
transform(X)[source]
class bead.src.utils.helper.SinCosTransformer[source]

Bases: BaseEstimator, TransformerMixin

Transforms an angle (in radians) into two features: [sin(angle), cos(angle)]. Inverse transformation uses arctan2.

fit(X, y=None)[source]
inverse_transform(X)[source]
transform(X)[source]
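
A minimal round-trip sketch (the (N, 1) input and (N, 2) output shapes are assumptions):

import numpy as np

from bead.src.utils.helper import SinCosTransformer

phi = np.array([[0.0], [np.pi / 2], [-np.pi / 4]])
t = SinCosTransformer()
t.fit(phi)
sincos = t.transform(phi)               # columns: [sin(phi), cos(phi)]
phi_back = t.inverse_transform(sincos)  # recovered via arctan2(sin, cos)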
bead.src.utils.helper.add_sig_bkg_label(tensors: tuple, label: str) tuple[source]

Adds a new feature to the last dimension of each tensor in the tuple. The new feature is filled with 0 for “bkg” and 1 for “sig”.

Parameters:
  • tensors – A tuple of three tensors (events, jets, constituents).

  • label – A string, either “bkg” or “sig”, to determine the value of the new feature.

Returns:

A tuple of the three tensors with the new feature added to the last dimension.

bead.src.utils.helper.calculate_in_shape(data, config, test_mode=False)[source]

Calculates the input shapes for the models based on the data.

Parameters:
  • data (ndarray) – The data you wish to calculate the input shapes for.

  • config (dataClass) – Base class selecting user inputs.

  • test_mode (bool) – A flag to indicate if the function is being called in test mode.

Returns:

A tuple containing the input shapes for the models.

Return type:

tuple

bead.src.utils.helper.call_forward(model, inputs)[source]

Calls the forward method of the given object. If the return value is not a tuple, packs it into a tuple.

Parameters:
  • model – An object that has a forward method.

  • inputs – The input data to pass to the model.

Returns:

A tuple containing the result(s) of the forward method.

bead.src.utils.helper.convert_to_tensor(data)[source]

Converts ndarray to torch.Tensors.

Parameters:

data (ndarray) – The data you wish to convert from ndarray to torch.Tensor.

Returns:

Your data as a tensor

Return type:

torch.Tensor

bead.src.utils.helper.create_datasets(events_train, jets_train, constituents_train, events_val, jets_val, constituents_val, events_train_label, jets_train_label, constituents_train_label, events_val_label, jets_val_label, constituents_val_label)[source]

Creates CustomDataset objects for training and validation data.

This function pairs data tensors with their corresponding label tensors to create dataset objects for events, jets, and constituents data.

Parameters:
  • events_train (torch.Tensor) – Training events data.

  • jets_train (torch.Tensor) – Training jets data.

  • constituents_train (torch.Tensor) – Training constituents data.

  • events_val (torch.Tensor) – Validation events data.

  • jets_val (torch.Tensor) – Validation jets data.

  • constituents_val (torch.Tensor) – Validation constituents data.

  • events_train_label (torch.Tensor) – Labels for training events.

  • jets_train_label (torch.Tensor) – Labels for training jets.

  • constituents_train_label (torch.Tensor) – Labels for training constituents.

  • events_val_label (torch.Tensor) – Labels for validation events.

  • jets_val_label (torch.Tensor) – Labels for validation jets.

  • constituents_val_label (torch.Tensor) – Labels for validation constituents.

Returns:

A dictionary containing CustomDataset objects for all data types.

Return type:

dict

bead.src.utils.helper.data_label_split(data)[source]

Splits the data into features and labels.

Parameters:

data (ndarray) – The data you wish to split into features and labels.

Returns:

A tuple containing two ndarrays:
  • data: The features of the data.

  • labels: The labels of the data.

Return type:

tuple

bead.src.utils.helper.detach_device(tensor)[source]

Detaches a given tensor and converts it to an ndarray.

Parameters:

tensor (torch.Tensor) – The PyTorch tensor one wants to convert to a ndarray

Returns:

Converted torch.Tensor to ndarray

Return type:

ndarray

bead.src.utils.helper.get_device(config=None)[source]

Returns the appropriate processing device. If DDP is active, uses the local_rank. Otherwise, uses cuda:0 if available, else cpu.

Parameters:

config (dataClass) – Base class selecting user inputs.

Returns:

The device to be used for processing.

Return type:

torch.device

bead.src.utils.helper.get_loss(loss_function: str)[source]

Returns the loss_object based on the string provided.

Parameters:

loss_function (str) – The loss function you wish to use. Options include:
  • ‘mse’: Mean Squared Error

  • ‘bce’: Binary Cross Entropy

  • ‘mae’: Mean Absolute Error

  • ‘huber’: Huber Loss

  • ‘l1’: L1 Loss

  • ‘l2’: L2 Loss

  • ‘smoothl1’: Smooth L1 Loss

Returns:

The loss function object

Return type:

class

bead.src.utils.helper.get_optimizer(optimizer_name, parameters, lr)[source]

Returns a PyTorch optimizer configured with optimal arguments for training a large VAE.

Parameters:
  • optimizer_name (str) – One of “adam”, “adamw”, “rmsprop”, “sgd”, “radam”, “adagrad”.

  • parameters (iterable) – The parameters (or parameter groups) of your model.

  • lr (float) – The learning rate for the optimizer.

Returns:

An instantiated optimizer with specified hyperparameters.

Return type:

torch.optim.Optimizer

Raises:

ValueError – If an unsupported optimizer name is provided.
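
A short sketch wiring both helpers into a toy model:

import torch

from bead.src.utils.helper import get_loss, get_optimizer

model = torch.nn.Linear(8, 8)
loss_fn = get_loss("mse")
optimizer = get_optimizer("adamw", model.parameters(), lr=1e-3)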

bead.src.utils.helper.invert_normalize_data(normalized_data, scaler)[source]

Inverts a chained normalization transformation.

This function accepts normalized data (for example, the output of a VAE’s preprocessed input) and the scaler (or ChainedScaler) that was used to perform the forward transformation. It then returns the original data by calling the scaler’s inverse_transform method.

Parameters:
  • normalized_data (np.ndarray) – The transformed data array.

  • scaler – The scaler object (or a ChainedScaler instance) used for the forward transformation, which must implement an inverse_transform method.

Returns:

The data mapped back to its original scale.

Return type:

np.ndarray

bead.src.utils.helper.load_augment_tensors(folder_path, keyword)[source]

Searches through the specified folder for all ‘.pt’ files whose names contain the specified keyword (e.g., ‘bkg_train’, ‘bkg_test’, or ‘sig_test’). Files are then categorized by whether their filename contains one of the three substrings: ‘jets’, ‘events’, or ‘constituents’.

For ‘bkg_train’, each file must contain one of the generator names: ‘herwig’, ‘pythia’, or ‘sherpa’. For each file, the tensor is loaded and a new feature is appended along the last dimension:
  • 0 for files containing ‘herwig’

  • 1 for files containing ‘pythia’

  • 2 for files containing ‘sherpa’

For ‘bkg_test’ and ‘sig_test’, the appended new feature is filled with -1, since generator info is not available at test time.

Finally, for each category the resulting tensors are concatenated along axis=0.

Parameters:
  • folder_path (str) – The path to the folder to search.

  • keyword (str) – The keyword to filter files (e.g., ‘bkg_train’, ‘bkg_test’, or ‘sig_test’).

Returns:

A tuple of three PyTorch tensors: (jets_tensor, events_tensor, constituents_tensor)

corresponding to the concatenated tensors for each category.

Return type:

tuple

Raises:

ValueError – If any category does not have at least one file for each generator type. The error message is: “required files not found. please run the --mode convert_csv and prepare inputs before retrying”

bead.src.utils.helper.load_model(model_path: str, in_shape, config)[source]

Loads the state dictionary of the trained model into a model variable. This variable is then used for passing data through the encoding and decoding functions.

Parameters:
  • model_path (str) – Path to model

  • in_shape (tuple) – Input shape

  • config (Config) – Configuration object

Returns:

nn.Module – A model object with the attributes of the model class, with the selected state dictionary loaded into it.

bead.src.utils.helper.load_tensors(folder_path, keyword='sig_test')[source]

Searches through the specified folder for all ‘.pt’ files containing the given keyword in their names. Categorizes these files based on the presence of ‘jets’, ‘events’, or ‘constituents’ in their filenames, loads them into PyTorch tensors, concatenates them along axis=0, and returns the resulting tensors.

Parameters:
  • folder_path (str) – The path to the folder to search.

  • keyword (str) – The keyword to filter files (‘bkg_train’, ‘bkg_test’, or ‘sig_test’).

Returns:

A tuple containing three PyTorch tensors: (jets_tensor, events_tensor, constituents_tensor).

Return type:

tuple

Raises:

ValueError – If any specific category (‘jets’, ‘events’, ‘constituents’) has no matching files. The error message is: “Required files not found. Please run the --mode convert_csv and prepare inputs before retrying.”

bead.src.utils.helper.model_init(in_shape, config)[source]

Initializes the model’s attributes into a model object variable.

Parameters:
  • model_name (str) – The name of the model you wish to initialize. This should correspond to your model class name.

  • init (str) – The initialization method you wish to use (Xavier support currently). Default is None.

  • config (dataClass) – Base class selecting user inputs.

Returns:

Object with the models class attributes

Return type:

class

bead.src.utils.helper.normalize_data(data, normalization_type)[source]

Normalizes jet data for VAE-based anomaly detection.

Parameters:
  • data – 2D numpy array (n_jets, n_features)

  • normalization_type – A string indicating the normalization method(s). It can be a single method or a chain of methods separated by ‘+’. Valid options include:
      - ‘minmax’: MinMaxScaler (scales features to [0,1])

      - ‘standard’: StandardScaler (zero mean, unit variance)

      - ‘robust’: RobustScaler (less sensitive to outliers)

      - ‘log’: Log1pScaler (applies log1p transformation)

      - ‘l2’: L2Normalizer (scales each feature by its L2 norm)

      - ‘power’: PowerTransformer (using Yeo-Johnson)

      - ‘quantile’: QuantileTransformer (transforms features to follow a normal or uniform distribution)

      - ‘maxabs’: MaxAbsScaler (scales each feature by its maximum absolute value)

      - ‘sincos’: SinCosTransformer (converts angles to sin/cos features)

    Example: ‘log+standard’ applies a log transformation followed by standard scaling.

Returns:

normalized_data: Transformed data array.

scaler: Fitted scaler object (or chained scaler) for inverse transformations.
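
A minimal sketch of a chained normalization and its inversion:

import numpy as np

from bead.src.utils.helper import invert_normalize_data, normalize_data

jets = np.abs(np.random.randn(1000, 4)) * 50.0  # toy positive-skewed features
normed, scaler = normalize_data(jets, "log+standard")
restored = invert_normalize_data(normed, scaler)  # ~jets, up to float error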

bead.src.utils.helper.numpy_to_tensor(data)[source]

Converts ndarray to torch.Tensors.

Parameters:

data (ndarray) – The data you wish to convert from ndarray to torch.Tensor.

Returns:

Your data as a tensor

Return type:

torch.Tensor

bead.src.utils.helper.save_loss_components(loss_data, component_names, suffix, save_dir='loss_outputs')[source]

This function unpacks loss_data into separate components, converts each into a NumPy array, and saves each array as a .npy file with a filename of the form: <component_name>_<suffix>.npy

Parameters:
  • loss_data (list) – a list of tuples, where each tuple contains loss components

  • component_names (list) – a list of strings naming each component in the tuple

  • suffix (str) – a string keyword appended (separated by ‘_’) to each filename

  • save_dir (str) – directory to save .npy files (default “loss_outputs”)

bead.src.utils.helper.save_model(model, model_path: str, config=None) None[source]

Saves the model’s state dictionary as a .pt file to the given path. Handles DDP model saving.

Parameters:
  • model (nn.Module) – The PyTorch model to save.

  • model_path (str) – String defining the model’s save path.

  • config (dataClass) – Base class selecting user inputs. Used to check if DDP is active.

Returns:

Saved model state dictionary as .pt file.

Return type:

None

bead.src.utils.helper.select_features(jets_tensor, constituents_tensor, input_features)[source]

Process the jets_tensor and constituents_tensor based on the input_features flag.

Parameters:
  • jets_tensor (torch.Tensor) – Tensor with features [evt_id, jet_id, num_constituents, b_tagged, jet_pt, jet_eta, jet_phi_sin, jet_phi_cos, generator_id]

  • constituents_tensor (torch.Tensor) – Tensor with features [evt_id, jet_id, constit_id, b_tagged, constit_pt, constit_eta, constit_phi_sin, constit_phi_cos, generator_id]

  • input_features (str) – The flag to determine which features to select. Options:
      - ‘all’: return tensors as is.

      - ‘4momentum’: select [pt, eta, phi_sin, phi_cos, generator_id] for both.

      - ‘4momentum_btag’: select [b_tagged, pt, eta, phi_sin, phi_cos, generator_id] for both.

      - ‘pj_custom’: select everything except [evt_id, jet_id] for jets and except [evt_id, jet_id, constit_id] for constituents.

Returns:

Processed jets_tensor and constituents_tensor.

Return type:

tuple

bead.src.utils.helper.train_val_split(tensor, train_ratio)[source]

Splits a tensor into training and validation sets based on the specified train_ratio. The split is done by sampling indices randomly, ensuring that the data is shuffled.

Parameters:
  • tensor (torch.Tensor) – The input tensor to be split.

  • train_ratio (float) – Proportion of data to be used for training (e.g., 0.8 for 80% training data).

Returns:

A tuple containing two tensors:
  • train_tensor: Tensor containing the training data.

  • val_tensor: Tensor containing the validation data.

Return type:

tuple

Raises:

ValueError – If train_ratio is not between 0 and 1.
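
A minimal usage sketch:

import torch

from bead.src.utils.helper import train_val_split

data = torch.randn(100, 8)
train_tensor, val_tensor = train_val_split(data, train_ratio=0.8)
# train_tensor holds ~80 randomly sampled rows, val_tensor the remaining ~20.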

bead.src.utils.loss module

Loss functions for training autoencoder and VAE models.

This module provides various loss functions for training autoencoders and variational autoencoders, including basic reconstruction losses, KL divergence, regularization terms, and combined losses for specialized models like those with normalizing flows.

Classes:

  • BaseLoss: Base class for all loss functions.

  • ReconstructionLoss: Standard reconstruction loss (MSE or L1).

  • KLDivergenceLoss: Kullback-Leibler divergence for VAE training.

  • WassersteinLoss: Earth Mover’s Distance approximation.

  • L1Regularization: L1 weight regularization.

  • L2Regularization: L2 weight regularization.

  • BinaryCrossEntropyLoss: Binary cross-entropy loss.

  • VAELoss: Combined loss for VAE (reconstruction + KL).

  • VAEFlowLoss: Loss for VAE with normalizing flows.

  • ContrastiveLoss: Contrastive loss for clustering latent vectors.

  • VAELossEMD: VAE loss with Earth Mover’s Distance term.

  • VAELossL1: VAE loss with L1 regularization.

  • VAELossL2: VAE loss with L2 regularization.

  • VAEFlowLossEMD: VAE flow loss with EMD term.

  • VAEFlowLossL1: VAE flow loss with L1 regularization.

  • VAEFlowLossL2: VAE flow loss with L2 regularization.

  • DVAELoss: Combined loss for DirichletConvVAE (reconstruction + KL(dirichlet)); inherits from VAELoss.

  • DVAEFlowLoss: Combined loss for DirichletConvVAE with flows (reconstruction + KL(dirichlet)); inherits from VAEFlowLoss.

class bead.src.utils.loss.BaseLoss(config)[source]

Bases: object

Base class for all loss functions. Each subclass must implement the calculate() method.

calculate(*args, **kwargs)[source]
class bead.src.utils.loss.BinaryCrossEntropyLoss(config)[source]

Bases: BaseLoss

Binary Cross Entropy Loss for binary classification tasks.

Config parameters:
  • use_logits: Boolean indicating if the predictions are raw logits (default: True).

  • reduction: Reduction method for the loss (‘mean’, ‘sum’, etc., default: ‘mean’).

Note: Not supported for full_chain mode yet

calculate(predictions, targets, mu, logvar, parameters, log_det_jacobian=0)[source]

Calculate the binary cross entropy loss.

Parameters:
  • predictions (Tensor) – Predicted outputs (logits or probabilities).

  • targets (Tensor) – Ground truth binary labels.

Returns:

The computed binary cross entropy loss.

Return type:

Tensor

class bead.src.utils.loss.DVAEFlowLoss(config)[source]

Bases: VAEFlowLoss

DVAEFlowLoss: Combines reconstruction loss and Dirichlet KL divergence loss. Inherits from VAEFlowLoss and overrides the KL loss function to use Dirichlet prior.

class bead.src.utils.loss.DVAELoss(config)[source]

Bases: VAELoss

DVAELoss: Combines reconstruction loss and Dirichlet KL divergence loss. Inherits from VAELoss and overrides the KL loss function to use Dirichlet prior.

class bead.src.utils.loss.KLDivergenceLoss(config, prior: str = 'gaussian')[source]

Bases: BaseLoss

KL Divergence loss for VAE latent space regularization.

Supports:
  • Gaussian prior

  • Dirichlet prior via Laplace approximation

calculate(recon, target, mu, logvar, parameters=None, log_det_jacobian=0)[source]
compute_alpha_laplace(mu, logvar)[source]

Compute Dirichlet concentration parameters α from Gaussian parameters (μ, logvar) via Laplace bridge approximation.
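
For reference, a sketch of the standard Laplace-bridge mapping from the literature; whether the implementation matches this exactly is an assumption:

import torch

def alpha_from_gaussian(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # Laplace bridge: map N(mu, diag(exp(logvar))) to Dir(alpha), with
    # alpha_k = (1/var_k) * (1 - 2/K + exp(mu_k)/K^2 * sum_l exp(-mu_l))
    K = mu.shape[-1]
    var = logvar.exp()
    sum_exp_neg = torch.exp(-mu).sum(dim=-1, keepdim=True)
    return (1.0 / var) * (1.0 - 2.0 / K + torch.exp(mu) / K**2 * sum_exp_neg)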

class bead.src.utils.loss.L1Regularization(config)[source]

Bases: BaseLoss

Computes L1 regularization over model parameters.

Config parameters:
  • weight: scaling factor for the L1 regularization (default: 1e-4)

calculate(parameters)[source]
class bead.src.utils.loss.L2Regularization(config)[source]

Bases: BaseLoss

Computes L2 regularization over model parameters.

Config parameters:
  • weight: scaling factor for the L2 regularization (default: 1e-4)

calculate(parameters)[source]
class bead.src.utils.loss.ReconstructionLoss(config)[source]

Bases: BaseLoss

Reconstruction loss for AE/VAE models. Supports both MSE and L1 losses based on configuration.

Config parameters:
  • loss_type: ‘mse’ (default) or ‘l1’

  • reduction: reduction method, ‘mean’ (default) or ‘sum’

calculate(recon, target, mu, logvar, parameters, log_det_jacobian=0)[source]
class bead.src.utils.loss.SupervisedContrastiveLoss(config)[source]

Bases: BaseLoss

Supervised Contrastive Learning loss function. Based on: https://arxiv.org/abs/2004.11362

calculate(features, labels)[source]
Parameters:
  • features (torch.Tensor) – Latent vectors (e.g., zk), shape [batch_size, feature_dim]. Assumed to be L2-normalized.

  • labels (torch.Tensor) – Ground truth labels (generator_ids), shape [batch_size].

Returns:

Supervised contrastive loss.

Return type:

torch.Tensor

class bead.src.utils.loss.VAEFlowLoss(config)[source]

Bases: BaseLoss

Loss for VAE models augmented with a normalizing flow. Includes the log_det_jacobian term from the flow transformation.

Config parameters:
  • reconstruction: dict for ReconstructionLoss config.

  • kl: dict for KLDivergenceLoss config.

  • kl_weight: weight for the KL divergence term.

  • flow_weight: weight for the log_det_jacobian term.

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]
class bead.src.utils.loss.VAEFlowLossEMD(config)[source]

Bases: VAEFlowLoss

VAE loss augmented with an Earth Mover’s Distance (EMD) term.

Config parameters:
  • emd_weight: weight for the EMD term.

  • emd: dict for WassersteinLoss config.

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]
In addition to the standard VAE inputs, this loss requires:
  • emd_p: first distribution tensor (e.g. a predicted histogram)

  • emd_q: second distribution tensor (e.g. a target histogram)

class bead.src.utils.loss.VAEFlowLossL1(config)[source]

Bases: VAEFlowLoss

VAE loss augmented with an L1 regularization term.

Config parameters:
  • l1_weight: weight for the L1 regularization term.

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]

‘parameters’ should be a list of model parameters to regularize.

class bead.src.utils.loss.VAEFlowLossL2(config)[source]

Bases: VAEFlowLoss

VAE loss augmented with an L2 regularization term.

Config parameters:
  • l2_weight: weight for the L2 regularization term.

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]

‘parameters’ should be a list of model parameters to regularize.

class bead.src.utils.loss.VAEFlowSupConLoss(config)[source]

Bases: BaseLoss

Combined loss for VAE with Normalizing Flows and Supervised Contrastive Learning.

Config parameters:
  • vaeflow: dict for VAEFlowLoss config.

  • supcon: dict for SupervisedContrastiveLoss config.

  • contrastive_weight: weight for the contrastive loss term.

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]
class bead.src.utils.loss.VAELoss(config)[source]

Bases: BaseLoss

Total loss for VAE training. Combines reconstruction loss and KL divergence loss.

Config parameters:
  • reconstruction: dict for ReconstructionLoss config.

  • kl: dict for KLDivergenceLoss config.

  • kl_weight: scaling factor for KL loss (default: 1.0)

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]
class bead.src.utils.loss.VAELossEMD(config)[source]

Bases: VAELoss

VAE loss augmented with an Earth Mover’s Distance (EMD) term.

Config parameters:
  • emd_weight: weight for the EMD term.

  • emd: dict for WassersteinLoss config.

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]
In addition to the standard VAE inputs, this loss requires:
  • emd_p: first distribution tensor (e.g. a predicted histogram)

  • emd_q: second distribution tensor (e.g. a target histogram)

class bead.src.utils.loss.VAELossL1(config)[source]

Bases: VAELoss

VAE loss augmented with an L1 regularization term.

Config parameters:
  • l1_weight: weight for the L1 regularization term.

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]

‘parameters’ should be a list of model parameters to regularize.

class bead.src.utils.loss.VAELossL2(config)[source]

Bases: VAELoss

VAE loss augmented with an L2 regularization term.

Config parameters:
  • l2_weight: weight for the L2 regularization term.

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]

‘parameters’ should be a list of model parameters to regularize.

class bead.src.utils.loss.VAESupConLoss(config)[source]

Bases: BaseLoss

Combined loss for VAE with Supervised Contrastive Learning.

Config parameters:
  • vae: dict for VAELoss config.

  • supcon: dict for SupervisedContrastiveLoss config.

  • contrastive_weight: weight for the contrastive loss term.

calculate(recon, target, mu, logvar, zk, parameters, log_det_jacobian=0, generator_labels=None)[source]
class bead.src.utils.loss.WassersteinLoss(config)[source]

Bases: BaseLoss

Computes an approximation of the Earth Mover’s Distance (Wasserstein Loss) between two 1D probability distributions.

Assumes inputs are tensors of shape (batch_size, n) representing histograms or distributions.

Config parameters:
  • dim: dimension along which to compute the cumulative sum (default: 1)

calculate(p, q)[source]
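
A plain-PyTorch sketch of the cumulative-sum approximation this class describes (not the class itself):

import torch

p = torch.tensor([[0.2, 0.3, 0.5]])  # predicted histogram
q = torch.tensor([[0.3, 0.3, 0.4]])  # target histogram
# 1D EMD approximation: mean absolute difference of the two CDFs.
emd = torch.mean(torch.abs(torch.cumsum(p, dim=1) - torch.cumsum(q, dim=1)))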

bead.src.utils.normalization module

Custom normalization functions for HEP data.

This module provides specialized normalization functions for high-energy physics data, particularly for jet and constituent features. These functions handle the specific requirements of features like PT, eta, and phi in particle physics analyses.

Functions:

  • normalize_jet_pj_custom: Custom normalization for jet data.

  • normalize_constit_pj_custom: Custom normalization for constituent data.

  • invert_normalize_jet_pj_custom: Invert custom jet normalization.

  • invert_normalize_constit_pj_custom: Invert custom constituent normalization.

bead.src.utils.normalization.invert_normalize_constit_pj_custom(normalized_data, scalers)[source]

Inverts the normalization applied by normalize_constit_pj_custom.

The input normalized_data is assumed to be a NumPy array of shape (N, 8) with columns:

0: event_id (unchanged)
1: jet_id (unchanged)
2: constit_id (unchanged)
3: b_tagged (unchanged)
4: constit_pt_norm (normalized via “log+standard”)
5: constit_eta_norm (normalized via “standard”)
6-7: constit_phi_sin, constit_phi_cos (normalized via “sin_cos”)

Returns:

NumPy array of shape (N, 7) with columns:

[event_id, jet_id, constit_id, b_tagged, constit_pt, constit_eta, constit_phi]

Return type:

original_data

Note

  • The scaler for constit_pt (chain “log+standard”) is expected to invert first the StandardScaler and then the Log1pScaler, so that the original constit_pt is recovered.

  • The scaler for constit_phi (chain “sin_cos”) converts the two columns back to the original angle using arctan2.

bead.src.utils.normalization.invert_normalize_jet_pj_custom(normalized_data, scalers)[source]

Inverts the normalization applied by normalize_jet_pj_custom.

The input normalized_data is assumed to be a NumPy array of shape (N, 8) with columns:

0: event_id (unchanged)
1: jet_id (unchanged)
2: num_constituents_norm (normalized via “robust”)
3: b_tagged (unchanged)
4: jet_pt_norm (normalized via “log+standard”)
5: jet_eta_norm (normalized via “standard”)
6-7: jet_phi_sin, jet_phi_cos (normalized via “sin_cos”)

Returns:

NumPy array of shape (N, 7) with columns:

[event_id, jet_id, num_constituents, b_tagged, jet_pt, jet_eta, jet_phi]

Return type:

original_data

Note

  • The scaler for jet_pt (chain “log+standard”) is expected to invert first the StandardScaler then the Log1pScaler, so that the original jet_pt is recovered.

  • The scaler for jet_phi (chain “sin_cos”) converts the 2 columns back to the original angle using arctan2.

bead.src.utils.normalization.normalize_constit_pj_custom(data)[source]

Normalizes constituent data for HEP analysis using a chained normalization approach.

Input data is expected as a NumPy array of shape (N, 7) with columns in the order:

0: event_id (unchanged)
1: jet_id (unchanged)
2: constit_id (unchanged)
3: b_tagged (unchanged)
4: constit_pt (to be normalized via “log+standard”)
5: constit_eta (to be normalized via “standard”)
6: constit_phi (to be normalized via “sin_cos” transformation)

The output array will have 8 columns:

[event_id, jet_id, constit_id, b_tagged, constit_pt_norm, constit_eta_norm, constit_phi_sin, constit_phi_cos]

Parameters:

data (np.ndarray) – Input array of shape (N, 7).

Returns:

normalized_data (np.ndarray): Output array of shape (N, 8).

scalers (dict): Dictionary containing the fitted scalers for each feature.

bead.src.utils.normalization.normalize_jet_pj_custom(data)[source]

Normalizes jet data for HEP analysis using a chained normalization approach.

Input data is expected as a NumPy array of shape (N, 7) with columns in the order:

0: event_id (unchanged)
1: jet_id (unchanged)
2: num_constituents (to be normalized via “robust”)
3: b_tagged (already integer; left unchanged)
4: jet_pt (to be normalized via “log+standard”)
5: jet_eta (to be normalized via “standard”)
6: jet_phi (to be normalized via “sin_cos” transformation)

The output array will have 8 columns: [event_id, jet_id, num_constituents_norm, b_tagged, jet_pt_norm, jet_eta_norm, jet_phi_sin, jet_phi_cos]

Parameters:

data (np.ndarray) – Input array of shape (N, 7).

Returns:

normalized_data (np.ndarray): Output array of shape (N, 8).

scalers (dict): Dictionary containing the fitted scalers for each feature.

bead.src.utils.plotting module

Visualization utilities for model results.

This module provides functions for creating visualizations of model training results, latent space embeddings, and performance metrics. These plots are essential for understanding model behavior and evaluating anomaly detection performance.

Functions:

  • plot_losses: Generate plots for training and validation losses.

  • reduce_dim_subsampled: Reduce dimensionality with optional subsampling.

  • plot_latent_variables: Visualize latent space embeddings.

  • plot_mu_logvar: Plot latent space mean and variance.

  • plot_roc_curve: Generate ROC curves from model results.

bead.src.utils.plotting.plot_latent_variables(config, paths, verbose=False)[source]

Visualize latent space embeddings from the model.

This function creates 2D projections of the latent space using dimensionality reduction techniques (PCA, t-SNE, UMAP, or TriMap) for both initial (z0) and final (zk) latent variables, color-coded by class.

Parameters:
  • config (object) – Configuration object containing parameters like latent_space_size, latent_space_plot_style, input_level, etc.

  • paths (dict) – Dictionary of paths including output_path and data_path

  • verbose (bool, optional) – Whether to print progress and debugging information, default is False

Notes

The function handles both training and test data, with different color schemes for each. For test data, signal samples are shown in red, while background samples are colored according to their generator type.

bead.src.utils.plotting.plot_losses(output_dir, save_dir, config, verbose: bool = False)[source]

Generate plots for training and validation losses over epochs.

This function creates two types of visualizations:
  1. Training and validation total loss curves per epoch

  2. Component-wise loss curves (reconstruction, KL divergence, etc.) for train/val/test sets

Parameters:
  • output_dir (str) – Directory containing the saved model output files (.npy)

  • save_dir (str) – Directory where the generated plots will be saved

  • config (object) – Configuration object containing model parameters like project_name and epochs

  • verbose (bool, optional) – Whether to print progress messages, default is False

Raises:

FileNotFoundError – If required loss data files are not found in the output directory

bead.src.utils.plotting.plot_mu_logvar(config, paths, verbose=False)[source]

Visualize the mean (mu) and log-variance (logvar) of the latent space distribution.

This function creates two types of visualizations:
  1. 2D projection of the mean vectors in latent space, color-coded by class

  2. Histogram of uncertainties derived from the log-variance

Parameters:
  • config (object) – Configuration object containing parameters like latent_space_size, latent_space_plot_style, input_level, etc.

  • paths (dict) – Dictionary of paths including output_path and data_path

  • verbose (bool, optional) – Whether to print progress and debugging information, default is False

Notes

The uncertainty is calculated as the mean of the standard deviation (sigma) across all dimensions of the latent space, where sigma = exp(0.5 * logvar).

bead.src.utils.plotting.plot_roc_curve(config, paths, verbose: bool = False)[source]

Generate and save ROC curves for available loss component files.

This function computes and plots Receiver Operating Characteristic (ROC) curves for different loss components, evaluating their effectiveness as anomaly scores.

If config.overlay_roc is True, it also generates an overlay plot comparing ROC curves across different projects specified in config.overlay_roc_projects.

Parameters:
  • config (object) – Configuration object containing parameters like input_level and project_name

  • paths (dict) – Dictionary containing paths, particularly output_path

  • verbose (bool, optional) – Whether to print additional debug information, default is False

Raises:
  • FileNotFoundError – If the required label file is not found

  • ValueError – If ground truth labels are not a 1D array or if there’s a length mismatch between loss scores and ground truth labels

Notes

The function generates a single plot containing ROC curves for all available loss components (total loss, reconstruction loss, KL divergence, etc.), with the area under the curve (AUC) displayed in the legend.

If overlay_roc is enabled, it also creates a combined plot showing ROC curves from multiple projects for comparison.

bead.src.utils.plotting.reduce_dim_subsampled(data, method='trimap', n_components=2, n_samples=None, verbose=False)[source]

Reduce dimensionality of data with optional subsampling for large datasets.

This function applies dimensionality reduction techniques (PCA, t-SNE, TriMap, or UMAP) to high-dimensional data, with options for subsampling large datasets to improve computational efficiency. It will use GPU-accelerated methods when available.

Parameters:
  • data (numpy.ndarray) – Input data of shape (n_samples, n_features)

  • method (str, optional) – Dimensionality reduction method to use: “pca”, “tsne”, “trimap”, or “umap”; default is “trimap”

  • n_components (int, optional) – Number of dimensions to reduce to, default is 2

  • n_samples (int, optional) – Number of samples to use (subsampling), if None uses all data, default is None

  • verbose (bool, optional) – Whether to print progress messages, default is False

Returns:

  • numpy.ndarray – Reduced data with shape (n_samples, n_components)

  • str – The name of the dimensionality reduction method used

  • numpy.ndarray – Indices of the samples used for subsampling

Raises:

ValueError – If an invalid dimensionality reduction method is specified

Module contents