Data Transformations¶
In this tutorial we cover some basic but important functionalities of the JADBio Python API client regarding data transformations. Specifically, we will show how to:
- Split a dataset in a stratified manner into train and test (locally)
- Change feature types of an uploaded dataset
Getting started¶
Start by importing all necessary libraries at the top of the script:
# ----------------- IMPORTS ------------------
from jadbio import client as jad
import os
# --------------------------------------------
Next, assign your credentials to some variables:
# --------------- CREDENTIALS ----------------
EMAIL = '[email protected]'
PASSWORD = '*********'
# --------------------------------------------
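To avoid keeping credentials in the script itself, you can instead read them from environment variables (the variable names JADBIO_EMAIL and JADBIO_PASSWORD below are our own convention, not part of the API):
# Read credentials from the environment instead of hard-coding them
EMAIL = os.environ['JADBIO_EMAIL']
PASSWORD = os.environ['JADBIO_PASSWORD']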
And login to JADBio's server:
# Login & initialize session
client = jad.JadbioClient(EMAIL, PASSWORD)
# Get Version
client.get_version()
'1.2-beta'
Split a dataset¶
Some transformations available through JADBio's online platform under the Transform Data tab are not yet implemented in the API, but will be soon. In the meantime, you can replicate this functionality with scikit-learn, as shown below. First, let's take a look at the data:
from sklearn.model_selection import StratifiedShuffleSplit
import pandas as pd
DATA_PATH = '/where/your/dataset/is'
DATASET = 'Alzheimer_GSE46579'
# Load a dataset
data = pd.read_csv(os.path.join(DATA_PATH, DATASET + '.csv'),
index_col = 0) # read data & set first column as row IDs
data
 | diagnosis | hsa-miR-30a-3p | hsa-miR-550a-3p | hsa-miR-29a-3p | hsa-miR-378e | hsa-miR-155-5p | hsa-miR-628-3p | hsa-miR-26a-5p | hsa-miR-106b-5p | hsa-miR-4781-3p | ... | brain-mir-192 | brain-mir-96 | brain-mir-53 | brain-mir-93 | brain-mir-112 | brain-mir-167 | brain-mir-24 | brain-mir-159 | brain-mir-328 | brain-mir-225
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
AD | Alzheimer | 172.300000 | 467.342857 | 41.828571 | 1.757143 | 30.357143 | 2.935714 | 7467.342857 | 50.814286 | 20.671429 | ... | 9.314286 | 2.935714 | 13.700000 | 63.014286 | 7.314286 | 4.528571 | 24.735714 | 429.757143 | 11.142857 | 0.685714 |
AD1 | Alzheimer | 132.885714 | 988.000000 | 79.664286 | 1.564286 | 15.871429 | 0.285714 | 3546.471429 | 118.921429 | 22.371429 | ... | 6.714286 | 5.157143 | 3.235714 | 41.828571 | 14.028571 | 3.235714 | 19.000000 | 310.414286 | 7.757143 | 5.785714 |
AD2 | Alzheimer | 282.371429 | 818.942857 | 66.121429 | 3.321429 | 19.285714 | 2.042857 | 14464.157140 | 41.464286 | 36.692857 | ... | 11.628571 | 8.157143 | 8.157143 | 55.028571 | 18.542857 | 3.321429 | 27.928571 | 297.985714 | 16.514286 | 2.042857 |
AD3 | Alzheimer | 179.371429 | 557.442857 | 73.135714 | 0.728571 | 58.828571 | 4.142857 | 8648.271429 | 18.742857 | 24.064286 | ... | 6.271429 | 0.728571 | 11.900000 | 32.028571 | 15.428571 | 7.471429 | 36.985714 | 275.814286 | 16.592857 | 4.142857 |
AD4 | Alzheimer | 117.857143 | 1055.000000 | 65.542857 | 0.500000 | 21.107143 | 1.771429 | 7038.985714 | 73.928571 | 32.728571 | ... | 8.185714 | 3.192857 | 13.928571 | 101.714286 | 20.314286 | 3.192857 | 38.485714 | 275.814286 | 15.328571 | 1.771429 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
AD43 | Alzheimer | 106.742857 | 349.342857 | 49.000000 | 2.071429 | 35.057143 | 3.257143 | 8317.371429 | 66.121429 | 26.400000 | ... | 3.257143 | 2.071429 | 20.392857 | 58.335714 | 14.571429 | 2.071429 | 3.257143 | 159.450000 | 19.085714 | 0.814286 |
AD44 | Alzheimer | 118.921429 | 212.042857 | 67.278571 | 7.085714 | 48.514286 | 3.442857 | 5995.142857 | 202.771429 | 24.885714 | ... | 1.335714 | 0.750000 | 6.435714 | 30.357143 | 4.757143 | 0.750000 | 4.757143 | 101.714286 | 10.121429 | 2.285714 |
AD45 | Alzheimer | 175.642857 | 492.985714 | 61.242857 | 1.714286 | 18.171429 | 4.107143 | 8317.371429 | 47.671429 | 37.728571 | ... | 13.571429 | 7.992857 | 15.507143 | 75.857143 | 19.185714 | 1.714286 | 46.621429 | 250.485714 | 11.485714 | 1.714286 |
AD46 | Alzheimer | 157.785714 | 357.028571 | 26.814286 | 1.735714 | 39.642857 | 17.400000 | 8317.371429 | 34.714286 | 39.642857 | ... | 11.485714 | 4.735714 | 8.921429 | 74.892857 | 22.528571 | 1.735714 | 68.600000 | 297.985714 | 17.400000 | 4.735714 |
AD47 | Alzheimer | 228.478571 | 330.900000 | 5.742857 | 5.742857 | 5.742857 | 40.035714 | 3416.142857 | 95.464286 | 5.742857 | ... | 5.742857 | 5.742857 | 5.742857 | 95.464286 | 40.035714 | 5.742857 | 5.742857 | 612.371429 | 40.035714 | 5.742857 |
70 rows × 504 columns
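Before splitting, it is worth inspecting the class distribution of the outcome column, since a stratified split will preserve these proportions in both parts:
# Inspect the class balance of the outcome variable
data['diagnosis'].value_counts()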
Now, let's use the diagnosis column to stratify the data into the train and test datasets we wish to generate.
# Stratified split using 20% of samples as test dataset
seed = 2022 # set seed for reproducibility
protocol = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
# Get splits
stratify_by = data['diagnosis']
for train_idx, test_idx in protocol.split(data, y=stratify_by):
data_train = data.iloc[train_idx, :]
data_test = data.iloc[test_idx, :]
# Get data shapes
df = pd.DataFrame({'Shape': [data.shape, data_train.shape, data_test.shape]},
index=['Data(original)', 'Data(train)', 'Data(test)'])
df
 | Shape
---|---
Data(original) | (70, 504) |
Data(train) | (56, 504) |
Data(test) | (14, 504) |
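As a quick sanity check, the class proportions should be (approximately) identical across the original data and the two splits:
# Verify that stratification preserved the class proportions
pd.DataFrame({'original': data['diagnosis'].value_counts(normalize=True),
              'train': data_train['diagnosis'].value_counts(normalize=True),
              'test': data_test['diagnosis'].value_counts(normalize=True)})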
Now, we can save these datasets locally and upload them to the JADBio server (see here for instructions on how to upload the data):
# Save train & test data
data_train.to_csv(os.path.join(DATA_PATH, "{}_{}.csv".format(DATASET, 'train')))
data_test.to_csv(os.path.join(DATA_PATH, "{}_{}.csv".format(DATASET, 'test')))
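For convenience, the whole split-and-save procedure can be wrapped in a small helper function. The sketch below is our own; the function name and defaults are not part of JADBio:
def stratified_split_and_save(csv_path, label, test_size=0.2, seed=2022):
    """Stratified train/test split of a CSV dataset, saved next to the original."""
    df = pd.read_csv(csv_path, index_col=0)
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size,
                                      random_state=seed)
    train_idx, test_idx = next(splitter.split(df, y=df[label]))
    base, ext = os.path.splitext(csv_path)
    df.iloc[train_idx, :].to_csv(base + '_train' + ext)
    df.iloc[test_idx, :].to_csv(base + '_test' + ext)

# Example usage
stratified_split_and_save(os.path.join(DATA_PATH, DATASET + '.csv'), 'diagnosis')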
Change feature types¶
When uploading a dataset, JADBio will automatically assign a data type to each feature based on some rules regarding its values (see here). This action assigns only the numerical and/or categorical data types to features, while event, timeToEvent (necessary for Survival analysis) and identifier must be declared explicitly. Below we show how to change feature types according to your needs:
First, we need to choose a dataset from a project.
# List projects
my_projects = client.get_projects(count=100)
pDF = pd.DataFrame({'Name': [p['name'] for p in my_projects['data']],
'pid': [p['projectId'] for p in my_projects['data']]})
pDF
Name | pid | |
---|---|---|
0 | MyNewProject | 3892 |
1 | DemoUpload | 3987 |
2 | LiveDemo | 4073 |
pid = '3987' # choose a project ID
my_datasets = client.get_datasets(project_id=pid,
                                  count=100)
dDF = pd.DataFrame({'Name': [d['name'] for d in my_datasets['data']],
'did': [d['datasetId'] for d in my_datasets['data']]})
dDF
Name | did | |
---|---|---|
0 | AlzheimersDU | 21364 |
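As an aside, a dataset ID can also be looked up programmatically from the listing above, rather than copied by hand (a minimal sketch):
# Look up a dataset ID by name
did = dDF.loc[dDF['Name'] == 'AlzheimersDU', 'did'].iloc[0]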
Now, let's pick a dataset, e.g. AlzheimersDU: '21364', and create some changes. For this we will create a list of dictionaries with the following format:
{
'matcher': dictionary,
'newType': string
}
Each dictionary of the changes list matches some features of the source dataset, as specified by the 'matcher' field. The types of those features in the new dataset will be changed to the type given by 'newType', whose value must be one of 'numerical', 'categorical', 'timeToEvent', 'event', or 'identifier'. If some feature is matched by multiple matchers, the last matching entry in the list of dictionaries determines its new type.
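For instance, for a survival analysis you would declare the time-to-event and event columns explicitly. The sketch below is hypothetical; the column names 'time' and 'status' are purely illustrative:
# Hypothetical survival-analysis declaration (column names are illustrative)
survival_changes = [{ 'matcher': { 'byName': ['time'] }, 'newType': 'timeToEvent' },
                    { 'matcher': { 'byName': ['status'] }, 'newType': 'event' }]
Returning to our example: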
did = '21364' # Unique dataset ID
changes = [] # start with an empty list of changes
# Change by feature name
feature_names = ['hsa-miR-30a-3p', 'hsa-miR-550a-3p']
changes.append({ 'matcher': { 'byName': feature_names }, 'newType': 'categorical' })
# Change by index
idx = [5, 20, 100]
changes.append({ 'matcher': { 'byIndex': idx }, 'newType': 'categorical' })
# Change all features of a specific data type
current_type = 'numerical'
changes.append({ 'matcher': { 'byCurrentType': current_type }, 'newType': 'categorical' })
# review changes
changes
[{'matcher': {'byName': ['hsa-miR-30a-3p', 'hsa-miR-550a-3p']}, 'newType': 'categorical'}, {'matcher': {'byIndex': [5, 20, 100]}, 'newType': 'categorical'}, {'matcher': {'byCurrentType': 'numerical'}, 'newType': 'categorical'}]
Finally, apply the changes and save the new dataset with a different name:
new_name = 'AlzheimersDU_cat'
task_id = client.change_feature_types(did, new_name, changes)
task_id
'16635'
NOTE: The new dataset will have a new dataset ID. You can get the new ID by typing:
status = client.get_task_status(task_id)
status['datasetId']
'22203'
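If the transformation task has not finished when you query it, the datasetId field may not be available yet. A simple polling sketch (assuming, as in the example above, that the field appears once the task completes):
import time

# Poll the task status until the new dataset ID becomes available
new_did = None
while new_did is None:
    status = client.get_task_status(task_id)
    new_did = status.get('datasetId')  # None until the task has finished (assumed)
    if new_did is None:
        time.sleep(5)  # wait a bit before asking again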