Fine-tune on your own data
This section shows how to fine-tune the model on your own data for regression tasks such as retention time (RT) prediction.
Setup
Please set up the environment as shown in the Source code setup page.
Step 1: Data preparation
Prepare the molecular property data as a CSV file in the following format:
,id,smiles,prop
0,0382_00004,NC(=O)N1c2ccccc2[C@H](O)[C@@H](O)c2ccccc21,5.79
1,0382_00005,CN(C)[C@@H]1C(=O)C(C(N)=O)=C(O)[C@@]2(O)C(=O)C3=C(O)c4c(O)ccc(Cl)c4[C@@](C)(O)[C@H]3C[C@@H]12,4.5
2,0382_00008,Cc1onc(-c2c(Cl)cccc2Cl)c1C(=O)N[C@@H]1C(=O)N2[C@@H](C(=O)O)C(C)(C)S[C@H]12,7.8
3,0382_00009,C[C@H]1c2cccc(O)c2C(O)=C2C(=O)[C@]3(O)C(O)=C(C(N)=O)C(=O)[C@@H](N(C)C)[C@@H]3[C@@H](O)[C@@H]21,6.2
4,0382_00010,C#C[C@]1(O)CC[C@H]2[C@@H]3CCc4cc(O)ccc4[C@H]3CC[C@@]21C,9.46
5,0382_00012,Cc1onc(-c2ccccc2)c1C(=O)N[C@@H]1C(=O)N2[C@@H](C(=O)O)C(C)(C)S[C@H]12,6.9
Here, the prop column holds the RT or collision cross section (CCS) value for each molecule. Split your data into train and test CSV files, then convert each one to a pickle (.pkl) file using:
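from molnetpack import csv2pkl_wfilter
import yaml, pickle

# Load the preprocessing configuration shipped with the source code.
with open('./molnetpack/config/preprocess_etkdgv3.yml') as f:
    cfg = yaml.safe_load(f)

# Convert each split from CSV to pkl, closing the output file after writing.
for split in ['train', 'test']:
    data = csv2pkl_wfilter(f'<path_to_{split}.csv>', cfg['encoding'])
    with open(f'<path_to_{split}.pkl>', 'wb') as out:
        pickle.dump(data, out)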
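The train/test split mentioned above is left to the reader. One way to do it is sketched below using only the Python standard library. The helper name `split_csv`, the 10% test fraction, and the random (rather than scaffold-based or stratified) split are all assumptions for illustration, not part of molnetpack. Rows are copied unchanged, including the leading index column.

```python
import csv
import random


def split_csv(src_path, train_path, test_path, test_fraction=0.1, seed=42):
    """Randomly split a CSV file into train and test files, keeping the header.

    Hypothetical helper for illustration; not part of molnetpack.
    """
    with open(src_path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [row for row in reader if row]  # skip blank lines

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_fraction))
    test_rows, train_rows = rows[:n_test], rows[n_test:]

    # Write both splits with the original header row.
    for path, subset in [(train_path, train_rows), (test_path, test_rows)]:
        with open(path, 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)
    return len(train_rows), len(test_rows)
```

Calling `split_csv('<path_to_all.csv>', '<path_to_train.csv>', '<path_to_test.csv>')` produces the two CSV files that the conversion step expects.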
Step 2: Training
Fine-tune the pretrained model on your training data.
Using the command-line script:
python scripts/train.py --task rt \
    --train_data <path_to_train.pkl> \
    --test_data <path_to_test.pkl> \
    --checkpoint_path <path_to_save_checkpoint> \
    --transfer \
    --resume_path <path_to_pretrained_model> \
    --seed 42
Using the Python API:
import torch
from molnetpack import MolNet

# Use the first GPU when available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
molnet_engine = MolNet(device, seed=42)

molnet_engine.train(
    task='rt',
    train_data='<path_to_train.pkl>',
    valid_data='<path_to_test.pkl>',
    checkpoint_path='<path_to_save_checkpoint>',
    resume_path='<path_to_pretrained_model>',
    transfer=True,
    use_scaler=True,
)
Step 3: Running prediction
Use the fine-tuned model to predict property values for unlabeled data.
Using the command-line script:
python scripts/predict.py --task prop \
    --test_data <path_to_csv_or_pkl> \
    --resume_path <path_to_checkpoint> \
    --result_path <path_to_results.csv> \
    --seed 42
Using the Python API:
# Right after training, molnet_engine already holds the fine-tuned model, so no reload is needed.
# To predict with an existing checkpoint instead, pass its path as path_to_checkpoint.
molnet_engine.load_data('<path_to_csv_or_pkl>')
rt_df = molnet_engine.pred_rt(
    path_to_results='<path_to_results.csv>',
    path_to_checkpoint='<path_to_checkpoint>',
)
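When the molecules passed to prediction also have measured values, as the test split from Step 1 does, the predictions can be scored against them. The sketch below computes the mean absolute error (MAE) and root mean squared error (RMSE) from two plain lists. The helper name `regression_errors` is hypothetical, and because this page does not describe the column layout of the results file, pulling the predicted values out of it is left to the reader. The predicted numbers in the example are made up.

```python
import math


def regression_errors(y_true, y_pred):
    """Return (MAE, RMSE) for two equal-length sequences of numbers.

    Hypothetical helper for illustration; not part of molnetpack.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError('inputs must be non-empty and of equal length')
    diffs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse


# Measured values taken from the example CSV; predictions are invented for illustration.
measured = [5.79, 4.5, 7.8]
predicted = [5.5, 4.9, 7.6]
mae, rmse = regression_errors(measured, predicted)
print(f'MAE={mae:.3f}, RMSE={rmse:.3f}')
```

Both metrics are in the same unit as the prop column, so they can be read directly as an average error in RT or CCS.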