Data Transformations¶
In this tutorial we cover some basic but important functionalities of the JADBio Python API client regarding data transformations. Specifically, we will show how to:
- Split a dataset in a stratified manner into train and test (locally)
- Change feature types of an uploaded dataset
Getting started¶
Start by importing all necessary libraries at the top of the script:
# ----------------- IMPORTS ------------------
from jadbio import client as jad
import os
# --------------------------------------------
Next, assign your credentials to some variables:
# --------------- CREDENTIALS ----------------
EMAIL = '[email protected]'
PASSWORD = '*********'
# --------------------------------------------
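To avoid keeping credentials in the script itself, you can instead read them from environment variables (the variable names JADBIO_EMAIL and JADBIO_PASSWORD below are our own convention, not part of the API):
# Read credentials from the environment instead of hard-coding them
EMAIL = os.environ['JADBIO_EMAIL']
PASSWORD = os.environ['JADBIO_PASSWORD']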
And login to JADBio's server:
# Login & initialize session
client = jad.JadbioClient(EMAIL, PASSWORD)
# Get Version
client.get_version()
'1.2-beta'
Split a dataset¶
Some transformations available through JADBio's online platform under the Transform Data tab are not yet implemented in the API, but will be soon. In the meantime, you can replicate this functionality with scikit-learn, as shown below. First, let's take a look at the data:
from sklearn.model_selection import StratifiedShuffleSplit
import pandas as pd
DATA_PATH = '/where/your/dataset/is'
DATASET = 'Alzheimer_GSE46579'
# Load a dataset
data = pd.read_csv(os.path.join(DATA_PATH, DATASET + '.csv'),
index_col = 0) # read data & set first column as row IDs
data
 | diagnosis | hsa-miR-30a-3p | hsa-miR-550a-3p | hsa-miR-29a-3p | hsa-miR-378e | hsa-miR-155-5p | hsa-miR-628-3p | hsa-miR-26a-5p | hsa-miR-106b-5p | hsa-miR-4781-3p | ... | brain-mir-192 | brain-mir-96 | brain-mir-53 | brain-mir-93 | brain-mir-112 | brain-mir-167 | brain-mir-24 | brain-mir-159 | brain-mir-328 | brain-mir-225
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
AD | Alzheimer | 172.300000 | 467.342857 | 41.828571 | 1.757143 | 30.357143 | 2.935714 | 7467.342857 | 50.814286 | 20.671429 | ... | 9.314286 | 2.935714 | 13.700000 | 63.014286 | 7.314286 | 4.528571 | 24.735714 | 429.757143 | 11.142857 | 0.685714 |
AD1 | Alzheimer | 132.885714 | 988.000000 | 79.664286 | 1.564286 | 15.871429 | 0.285714 | 3546.471429 | 118.921429 | 22.371429 | ... | 6.714286 | 5.157143 | 3.235714 | 41.828571 | 14.028571 | 3.235714 | 19.000000 | 310.414286 | 7.757143 | 5.785714 |
AD2 | Alzheimer | 282.371429 | 818.942857 | 66.121429 | 3.321429 | 19.285714 | 2.042857 | 14464.157140 | 41.464286 | 36.692857 | ... | 11.628571 | 8.157143 | 8.157143 | 55.028571 | 18.542857 | 3.321429 | 27.928571 | 297.985714 | 16.514286 | 2.042857 |
AD3 | Alzheimer | 179.371429 | 557.442857 | 73.135714 | 0.728571 | 58.828571 | 4.142857 | 8648.271429 | 18.742857 | 24.064286 | ... | 6.271429 | 0.728571 | 11.900000 | 32.028571 | 15.428571 | 7.471429 | 36.985714 | 275.814286 | 16.592857 | 4.142857 |
AD4 | Alzheimer | 117.857143 | 1055.000000 | 65.542857 | 0.500000 | 21.107143 | 1.771429 | 7038.985714 | 73.928571 | 32.728571 | ... | 8.185714 | 3.192857 | 13.928571 | 101.714286 | 20.314286 | 3.192857 | 38.485714 | 275.814286 | 15.328571 | 1.771429 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
AD43 | Alzheimer | 106.742857 | 349.342857 | 49.000000 | 2.071429 | 35.057143 | 3.257143 | 8317.371429 | 66.121429 | 26.400000 | ... | 3.257143 | 2.071429 | 20.392857 | 58.335714 | 14.571429 | 2.071429 | 3.257143 | 159.450000 | 19.085714 | 0.814286 |
AD44 | Alzheimer | 118.921429 | 212.042857 | 67.278571 | 7.085714 | 48.514286 | 3.442857 | 5995.142857 | 202.771429 | 24.885714 | ... | 1.335714 | 0.750000 | 6.435714 | 30.357143 | 4.757143 | 0.750000 | 4.757143 | 101.714286 | 10.121429 | 2.285714 |
AD45 | Alzheimer | 175.642857 | 492.985714 | 61.242857 | 1.714286 | 18.171429 | 4.107143 | 8317.371429 | 47.671429 | 37.728571 | ... | 13.571429 | 7.992857 | 15.507143 | 75.857143 | 19.185714 | 1.714286 | 46.621429 | 250.485714 | 11.485714 | 1.714286 |
AD46 | Alzheimer | 157.785714 | 357.028571 | 26.814286 | 1.735714 | 39.642857 | 17.400000 | 8317.371429 | 34.714286 | 39.642857 | ... | 11.485714 | 4.735714 | 8.921429 | 74.892857 | 22.528571 | 1.735714 | 68.600000 | 297.985714 | 17.400000 | 4.735714 |
AD47 | Alzheimer | 228.478571 | 330.900000 | 5.742857 | 5.742857 | 5.742857 | 40.035714 | 3416.142857 | 95.464286 | 5.742857 | ... | 5.742857 | 5.742857 | 5.742857 | 95.464286 | 40.035714 | 5.742857 | 5.742857 | 612.371429 | 40.035714 | 5.742857 |
70 rows × 504 columns
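Before splitting, it is worth inspecting the class distribution of the outcome column, since a stratified split will preserve these proportions in both parts:
# Inspect the class balance of the outcome variable
data['diagnosis'].value_counts()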
Now, let's use the diagnosis column to stratify the data into the train and test datasets we wish to generate.
# Stratified split using 20% of samples as test dataset
seed = 2022 # set seed for reproducibility
protocol = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
# Get splits
stratify_by = data['diagnosis']
for train_idx, test_idx in protocol.split(data, y=stratify_by):
data_train = data.iloc[train_idx, :]
data_test = data.iloc[test_idx, :]
# Get data shapes
df = pd.DataFrame({'Shape': [data.shape, data_train.shape, data_test.shape]},
index=['Data(original)', 'Data(train)', 'Data(test)'])
df
 | Shape
---|---
Data(original) | (70, 504) |
Data(train) | (56, 504) |
Data(test) | (14, 504) |
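As a quick sanity check, the class proportions should be (approximately) identical across the original data and the two splits:
# Verify that stratification preserved the class proportions
pd.DataFrame({'original': data['diagnosis'].value_counts(normalize=True),
              'train': data_train['diagnosis'].value_counts(normalize=True),
              'test': data_test['diagnosis'].value_counts(normalize=True)})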
Now, we can save these datasets locally and upload them to the JADBio server (see here for instructions on how to upload the data):
# Save train & test data
data_train.to_csv(os.path.join(DATA_PATH, "{}_{}.csv".format(DATASET, 'train')))
data_test.to_csv(os.path.join(DATA_PATH, "{}_{}.csv".format(DATASET, 'test')))
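For convenience, the whole split-and-save procedure can be wrapped in a small helper function. The sketch below is our own; the function name and defaults are not part of JADBio:
def stratified_split_and_save(csv_path, label, test_size=0.2, seed=2022):
    """Stratified train/test split of a CSV dataset, saved next to the original."""
    df = pd.read_csv(csv_path, index_col=0)
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size,
                                      random_state=seed)
    train_idx, test_idx = next(splitter.split(df, y=df[label]))
    base, ext = os.path.splitext(csv_path)
    df.iloc[train_idx, :].to_csv(base + '_train' + ext)
    df.iloc[test_idx, :].to_csv(base + '_test' + ext)

# Example usage
stratified_split_and_save(os.path.join(DATA_PATH, DATASET + '.csv'), 'diagnosis')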
Change feature types¶
When uploading a dataset, JADBio will automatically assign a data type to each feature based on some rules regarding its values (see here). This action assigns only the numerical and/or categorical data types to features, while event, timeToEvent (necessary for Survival analysis) and identifier must be declared explicitly. Below we show how to change feature types according to your needs:
First, we need to choose a dataset from a project.
# List projects
my_projects = client.get_projects(count=100)
pDF = pd.DataFrame({'Name': [p['name'] for p in my_projects['data']],
'pid': [p['projectId'] for p in my_projects['data']]})
pDF
Name | pid | |
---|---|---|
0 | MyNewProject | 3892 |
1 | DemoUpload | 3987 |
2 | LiveDemo | 4073 |
pid = '3987' # choose a project ID
my_datasets = client.get_datasets(project_id=pid,
                                  count=100)
dDF = pd.DataFrame({'Name': [d['name'] for d in my_datasets['data']],
'did': [d['datasetId'] for d in my_datasets['data']]})
dDF
Name | did | |
---|---|---|
0 | AlzheimersDU | 21364 |
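As an aside, a dataset ID can also be looked up programmatically from the listing above, rather than copied by hand (a minimal sketch):
# Look up a dataset ID by name
did = dDF.loc[dDF['Name'] == 'AlzheimersDU', 'did'].iloc[0]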
Now, let's pick a dataset, e.g. AlzheimersDU: '21364', and create some changes. For this we will create a list of dictionaries with the following format:
{
'matcher': dictionary,
'newType': string
}
Each dictionary of the changes list matches some features of the source dataset, as specified by the 'matcher' field. The types of those features in the new dataset will be changed to the type given by 'newType', whose value must be one of 'numerical', 'categorical', 'timeToEvent', 'event', or 'identifier'. If some feature is matched by multiple matchers, the last matching entry in the list of dictionaries determines its new type.
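For instance, for a survival analysis you would declare the time-to-event and event columns explicitly. The sketch below is hypothetical; the column names 'time' and 'status' are purely illustrative:
# Hypothetical survival-analysis declaration (column names are illustrative)
survival_changes = [{ 'matcher': { 'byName': ['time'] }, 'newType': 'timeToEvent' },
                    { 'matcher': { 'byName': ['status'] }, 'newType': 'event' }]
Returning to our example: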
did = '21364' # Unique dataset ID
changes = [] # start with an empty list of changes
# Change by feature name
feature_names = ['hsa-miR-30a-3p', 'hsa-miR-550a-3p']
changes.append({ 'matcher': { 'byName': feature_names }, 'newType': 'categorical' })
# Change by index
idx = [5, 20, 100]
changes.append({ 'matcher': { 'byIndex': idx }, 'newType': 'categorical' })
# Change all features of a specific data type
current_type = 'numerical'
changes.append({ 'matcher': { 'byCurrentType': current_type }, 'newType': 'categorical' })
# review changes
changes
[{'matcher': {'byName': ['hsa-miR-30a-3p', 'hsa-miR-550a-3p']}, 'newType': 'categorical'}, {'matcher': {'byIndex': [5, 20, 100]}, 'newType': 'categorical'}, {'matcher': {'byCurrentType': 'numerical'}, 'newType': 'categorical'}]
Finally, apply the changes and save the new dataset with a different name:
new_name = 'AlzheimersDU_cat'
task_id = client.change_feature_types(did, new_name, changes)
task_id
'16635'
NOTE: The new dataset will have a new dataset ID. You can get the new ID by typing:
status = client.get_task_status(task_id)
status['datasetId']
'22203'
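If the transformation task has not finished when you query it, the datasetId field may not be available yet. A simple polling sketch (assuming, as in the example above, that the field appears once the task completes):
import time

# Poll the task status until the new dataset ID becomes available
new_did = None
while new_did is None:
    status = client.get_task_status(task_id)
    new_did = status.get('datasetId')  # None until the task has finished (assumed)
    if new_did is None:
        time.sleep(5)  # wait a bit before asking again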