Module connectome.preprocessing.data_preparation

Helper function that transforms the processed data into the format required for modelling.

Functions

def prepare_data(data: pandas.core.frame.DataFrame, classification: bool = True, columns_drop: list = ['ConnID', 'Repseudonym', 'siteidvisdat', 'MEM_Score', 'Apoe', 'IDs'], target: str = 'prmdiag', y_0: list = [0], y_1: list = [2, 3], train_size: float = 0.8, seed: int = 123, split=True)

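Prepares the data for modelling: drops the columns listed in columns_drop, extracts the target variable (for classification, the target values in y_0 and y_1 are mapped to 0 and 1) and optionally performs a train/test split.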

Examples:

>>> # prepare data for a classification task (e.g. classification of diagnosis)
>>> classification = True
>>> columns_drop = ["ConnID", "Repseudonym", "siteid", "visdat",
...                 "MEM_score", "Apoe", "IDs"]
>>> target = "prmdiag"
>>> y_0 = [0]
>>> y_1 = [2, 3]
>>> train_size = 0.8
>>> seed = 1855
>>> split = True
>>> X_train, y_train, X_test, y_test = prepare_data(data=df,
...                                                 classification=classification,
...                                                 columns_drop=columns_drop,
...                                                 target=target, y_0=y_0, y_1=y_1,
...                                                 train_size=train_size,
...                                                 seed=seed, split=split)
>>>
>>> # prepare data for a regression task (e.g. regression with MEM_score as target)
>>> classification = False
>>> columns_drop = ["ConnID", "Repseudonym", "siteid", "visdat",
...                 "prmdiag", "Apoe", "IDs"]
>>> target = "MEM_score"
>>> train_size = 0.8
>>> seed = 1855
>>> split = True
>>> X_train, y_train, X_test, y_test = prepare_data(data=df,
...                                                 classification=classification,
...                                                 columns_drop=columns_drop,
...                                                 target=target,
...                                                 train_size=train_size,
...                                                 seed=seed, split=split)

Args

data
A pd.DataFrame
classification
boolean; True for a classification task, False for a regression task
columns_drop
which columns should be dropped (all other columns will be used for modelling)
target
name of the target variable
y_0
(only relevant for classification task) which values of target should be treated as 0
y_1
(only relevant for classification task) which values of target should be treated as 1
train_size
proportion of the data used for training (default 0.8)
seed
seed for reproducibility of train/test split
split
whether to perform a train/test split
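
The interplay of target, y_0, y_1 and columns_drop can be illustrated with a minimal sketch. It is not the implementation of prepare_data: binarize_target and the toy frame are hypothetical, and it assumes that rows whose target value appears in neither y_0 nor y_1 are discarded.

```python
import pandas as pd


def binarize_target(data, target, y_0, y_1, columns_drop):
    """Hypothetical helper: map target codes to 0/1 and drop unused columns."""
    # Assumption: rows whose target value is in neither y_0 nor y_1 are discarded.
    kept = data[data[target].isin(y_0 + y_1)]
    # 1 for codes listed in y_1, 0 for codes listed in y_0.
    y = kept[target].isin(y_1).astype(int)
    # errors="ignore" skips names in columns_drop that are not present.
    X = kept.drop(columns=columns_drop + [target], errors="ignore")
    return X, y


# Toy data (invented for illustration only).
toy = pd.DataFrame({
    "ConnID": [1, 2, 3, 4],
    "prmdiag": [0, 1, 2, 3],
    "feat_a": [0.1, 0.2, 0.3, 0.4],
})
X, y = binarize_target(toy, target="prmdiag", y_0=[0], y_1=[2, 3],
                       columns_drop=["ConnID"])
```

With the default y_0 = [0] and y_1 = [2, 3], the toy row with prmdiag == 1 belongs to neither class, so under the stated assumption it does not appear in X or y.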

Returns

Xtrain, ytrain, Xtest, ytest
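
The role of train_size and seed, and the order of the returned objects, can be sketched as follows. This is an illustration under assumptions, not the library's code: split_frame is a hypothetical helper, it uses pandas sampling rather than whatever splitter prepare_data calls internally, and it assumes a unique row index.

```python
import pandas as pd


def split_frame(X, y, train_size=0.8, seed=123):
    """Hypothetical helper: seeded row-wise train/test split."""
    # Sample a train_size fraction of the rows; the seed fixes which rows are drawn.
    X_train = X.sample(frac=train_size, random_state=seed)
    # Every row that was not sampled goes to the test set.
    X_test = X.drop(index=X_train.index)
    # Return order follows the documented one: X_train, y_train, X_test, y_test.
    return X_train, y.loc[X_train.index], X_test, y.loc[X_test.index]


# Toy data (invented for illustration only).
X = pd.DataFrame({"feat_a": range(10)})
y = pd.Series([0, 1] * 5)
X_train, y_train, X_test, y_test = split_frame(X, y, train_size=0.8, seed=1855)
```

Calling the helper again with the same seed selects the same rows, which is the reproducibility the seed argument is meant to provide.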