Train your own model for MS/MS prediction
Setup
Set up the environment as described on the Source code setup page.
Step 1: Obtain the Pretrained Model
Download the pretrained model (molnet_pre_etkdgv3.pt.zip) from Releases and unzip it; Step 4 expects the checkpoint at ./check_point/molnet_pre_etkdgv3.pt. Alternatively, pretrain the model from scratch on the QM9 dataset; see the Pretraining 3DMolMS on QM9 page for details.
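If you prefer to unpack the archive from Python, the following is a minimal sketch using only the standard library. The helper name extract_checkpoint is hypothetical, and the sketch assumes the archive holds molnet_pre_etkdgv3.pt at its top level and that ./check_point/ is the target directory (the location used by the training commands in Step 4).

```python
import zipfile
from pathlib import Path


def extract_checkpoint(zip_path, out_dir="./check_point"):
    """Extract every member of the archive into out_dir and return the extracted paths.

    Hypothetical helper: it is not part of molnetpack.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create ./check_point/ if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out)
        names = archive.namelist()
    return [out / name for name in names]


# Example call (assumes the zip was downloaded into the current directory):
# extract_checkpoint("molnet_pre_etkdgv3.pt.zip")
```

After extraction, check that ./check_point/molnet_pre_etkdgv3.pt exists before running the transfer-learning command.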
Step 2: Prepare the Datasets
Download and organize the datasets into the ./data/ directory. The current version uses four datasets:
Agilent DPCL, provided by Agilent Technologies.
NIST20, available under license for academic use.
MoNA, publicly available.
Waters QTOF, our own experimental dataset.
The data directory structure should look like this:
|- data
  |- origin
    |- Agilent_Combined.sdf
    |- Agilent_Metlin.sdf
    |- hr_msms_nist.SDF
    |- MoNA-export-All_LC-MS-MS_QTOF.sdf
    |- MoNA-export-All_LC-MS-MS_Orbitrap.sdf
    |- waters_qtof.mgf
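Before preprocessing, it can help to confirm that every expected file is in place. The sketch below is one way to do that; the helper missing_files is hypothetical, and it assumes the files listed above sit directly inside ./data/origin/.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = [
    "Agilent_Combined.sdf",
    "Agilent_Metlin.sdf",
    "hr_msms_nist.SDF",
    "MoNA-export-All_LC-MS-MS_QTOF.sdf",
    "MoNA-export-All_LC-MS-MS_Orbitrap.sdf",
    "waters_qtof.mgf",
]


def missing_files(origin_dir="./data/origin", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in origin_dir."""
    origin = Path(origin_dir)
    return [name for name in expected if not (origin / name).is_file()]


# Example call:
# print(missing_files())  # an empty list means every file was found
```

If you only have access to some of the datasets, pass a shorter list as expected.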
Step 3: Preprocess the Datasets
Run the following command to preprocess the datasets. Specify the datasets with --dataset and the instrument types with --instrument_type (the example processes both qtof and orbitrap). Add --maxmin_pick to select the training molecules with the MaxMin algorithm; without it, the selection is random. The dataset configurations are in ./molnetpack/config/preprocess_etkdgv3.yml.
python scripts/preprocess.py --task msms \
--dataset agilent nist mona waters gnps \
--instrument_type qtof orbitrap \
--data_config_path ./molnetpack/config/preprocess_etkdgv3.yml \
--mgf_dir ./data/mgf_debug/
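For readers unfamiliar with MaxMin selection, the sketch below illustrates the general idea: starting from one molecule, repeatedly add the candidate whose nearest already-picked neighbour is farthest away, so the picked set covers chemical space more evenly than a random draw. This is a generic illustration and not the code in scripts/preprocess.py; the function names, the use of Python sets as stand-in fingerprints, and the Tanimoto distance are all assumptions made for the example.

```python
import random


def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def maxmin_pick(points, n_pick, distance=tanimoto_distance, first=None, seed=0):
    """Greedy MaxMin selection; returns the indices of the picked points."""
    if not 0 < n_pick <= len(points):
        raise ValueError("n_pick must be between 1 and len(points)")
    if first is None:
        first = random.Random(seed).randrange(len(points))
    picked = [first]
    # min_dist[i] holds the distance from point i to its nearest picked point.
    min_dist = [distance(p, points[first]) for p in points]
    while len(picked) < n_pick:
        candidates = [i for i in range(len(points)) if i not in picked]
        # Pick the candidate that is farthest from everything picked so far.
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], distance(p, points[nxt]))
    return picked


# Toy fingerprints: index 2 shares no bits with the others, so it is picked early.
fingerprints = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}, {1, 2}]
selected = maxmin_pick(fingerprints, n_pick=3, first=0)
```

In practice, cheminformatics toolkits provide optimized MaxMin pickers for real fingerprints; the loop above is only meant to show the selection rule.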
Step 4: Train the Model
Use the following commands to train the model. Configuration settings for the model and training process are located in ./molnetpack/config/molnet.yml.
Using the command-line script:
# Train the model from the pretrained checkpoint (transfer learning):
# Q-TOF:
python scripts/train.py --task msms \
--train_data ./data/qtof_etkdgv3_train.pkl \
--test_data ./data/qtof_etkdgv3_test.pkl \
--checkpoint_path ./check_point/molnet_qtof_etkdgv3_tl.pt \
--transfer \
--resume_path ./check_point/molnet_pre_etkdgv3.pt
# Orbitrap models can be trained the same way, using the Orbitrap data files.
# Train the model from scratch
# Q-TOF:
python scripts/train.py --task msms \
--train_data ./data/qtof_etkdgv3_train.pkl \
--test_data ./data/qtof_etkdgv3_test.pkl \
--checkpoint_path ./check_point/molnet_qtof_etkdgv3.pt \
--ex_model_path ./check_point/molnet_qtof_etkdgv3_jit.pt --device 0
# Orbitrap:
python scripts/train.py --task msms \
--train_data ./data/orbitrap_etkdgv3_train.pkl \
--test_data ./data/orbitrap_etkdgv3_test.pkl \
--checkpoint_path ./check_point/molnet_orbitrap_etkdgv3.pt \
--ex_model_path ./check_point/molnet_orbitrap_etkdgv3_jit.pt --device 0
Using the Python API:
import torch
from molnetpack import MolNet
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
molnet_engine = MolNet(device, seed=42)
# Fine-tune from a pretrained checkpoint (Q-TOF):
molnet_engine.train(
    task='msms',
    train_data='./data/qtof_etkdgv3_train.pkl',
    valid_data='./data/qtof_etkdgv3_test.pkl',
    checkpoint_path='./check_point/molnet_qtof_etkdgv3_tl.pt',
    resume_path='./check_point/molnet_pre_etkdgv3.pt',
    transfer=True,
)
Step 5: Evaluation
Let’s evaluate the model trained above!
Using the command-line scripts:
# Predict the spectra:
# Q-TOF:
python scripts/predict.py --task msms \
--test_data ./data/qtof_etkdgv3_test.pkl \
--resume_path ./check_point/molnet_qtof_etkdgv3.pt \
--result_path ./result/pred_qtof_etkdgv3_test.mgf
# Orbitrap:
python scripts/predict.py --task msms \
--test_data ./data/orbitrap_etkdgv3_test.pkl \
--resume_path ./check_point/molnet_orbitrap_etkdgv3.pt \
--result_path ./result/pred_orbitrap_etkdgv3_test.mgf
# Evaluate the cosine similarity between experimental spectra and predicted spectra:
# Q-TOF:
python scripts/eval.py ./data/qtof_etkdgv3_test.pkl ./result/pred_qtof_etkdgv3_test.mgf \
--result_path ./eval_qtof_etkdgv3_test.csv --plot_path ./eval_qtof_etkdgv3_test.png
# Orbitrap:
python scripts/eval.py ./data/orbitrap_etkdgv3_test.pkl ./result/pred_orbitrap_etkdgv3_test.mgf \
--result_path ./eval_orbitrap_etkdgv3_test.csv --plot_path ./eval_orbitrap_etkdgv3_test.png
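To make the metric concrete, here is a minimal sketch of cosine similarity between two spectra: each peak list is summed into fixed-width m/z bins, and the cosine of the angle between the two binned vectors is returned. It is an illustration only and not the implementation in scripts/eval.py; the function names, the 0.2 bin width, and the 1500 m/z upper limit are assumptions chosen for the example.

```python
import math


def bin_spectrum(peaks, bin_width=0.2, max_mz=1500.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width bins."""
    n_bins = int(math.ceil(max_mz / bin_width))
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:  # peaks outside the range are ignored
            vector[idx] += intensity
    return vector


def cosine_similarity(peaks_a, peaks_b, bin_width=0.2, max_mz=1500.0):
    """Cosine similarity of two binned spectra; 0.0 if either one is empty."""
    a = bin_spectrum(peaks_a, bin_width, max_mz)
    b = bin_spectrum(peaks_b, bin_width, max_mz)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


# Example: the second spectrum is the first with doubled intensities.
experimental = [(100.05, 1.0), (200.05, 2.0)]
predicted = [(100.05, 2.0), (200.05, 4.0)]
score = cosine_similarity(experimental, predicted)
```

Because cosine similarity ignores overall scale, spectra that differ only by a constant intensity factor score as identical, while spectra with no shared bins score 0.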
Using the Python API:
# After training, the model is immediately ready; no checkpoint reload is needed.
# Alternatively, load an existing checkpoint before this step.
# Load the test data and predict the spectra:
molnet_engine.load_data('./data/qtof_etkdgv3_test.pkl')
pred_df = molnet_engine.pred_msms(
    path_to_results='./result/pred_qtof_etkdgv3_test.mgf',
    instrument='qtof',
)
# Evaluate cosine similarity against ground truth:
results_df = molnet_engine.evaluate(
    test_pkl='./data/qtof_etkdgv3_test.pkl',
    pred_mgf='./result/pred_qtof_etkdgv3_test.mgf',
    result_path='./eval_qtof_etkdgv3_test.csv',
    plot_path='./eval_qtof_etkdgv3_test.png',
)
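The predictions are written in MGF format, so they can also be inspected without molnetpack. The reader below is a minimal sketch that assumes the conventional MGF layout (BEGIN IONS, KEY=VALUE header lines, whitespace-separated m/z and intensity pairs, END IONS); the function name read_mgf is hypothetical, and the exact header fields written by the prediction step are not assumed.

```python
def read_mgf(path):
    """Parse an MGF file into a list of {'params': dict, 'peaks': list} records."""
    spectra = []
    current = None
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line == "BEGIN IONS":
                current = {"params": {}, "peaks": []}
            elif line == "END IONS":
                if current is not None:
                    spectra.append(current)
                current = None
            elif current is not None:
                if "=" in line:
                    # Header line such as TITLE=... or PEPMASS=...
                    key, value = line.split("=", 1)
                    current["params"][key.strip().upper()] = value.strip()
                else:
                    # Peak line: m/z followed by intensity
                    parts = line.split()
                    if len(parts) >= 2:
                        current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra


# Example call:
# spectra = read_mgf("./result/pred_qtof_etkdgv3_test.mgf")
# print(len(spectra), spectra[0]["params"], spectra[0]["peaks"][:5])
```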