Train your own model for MS/MS prediction

Setup

Set up the environment as described on the Source code setup page.

Step 1: Obtain the Pretrained Model

Download the pretrained model (molnet_pre_etkdgv3.pt.zip) from Releases. Alternatively, you can pretrain the model from scratch; for details on pretraining on the QM9 dataset, refer to the Pretraining 3DMolMS on QM9 page.
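
The release asset is a zip archive, so it has to be unpacked before the training commands can read it. A minimal sketch using Python's standard zipfile module is shown here. The destination ./check_point/ is an assumption inferred from the --resume_path used in Step 4, and the helper name extract_checkpoint is hypothetical (it is not part of molnetpack).

```python
import zipfile
from pathlib import Path


def extract_checkpoint(zip_path, dest_dir="./check_point"):
    """Unpack a downloaded checkpoint archive and return the extracted .pt paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Report only the model files, in case the archive holds other entries.
    return sorted(dest / name for name in names if name.endswith(".pt"))


if __name__ == "__main__":
    archive_path = Path("molnet_pre_etkdgv3.pt.zip")  # assumed download location
    if archive_path.exists():
        print(extract_checkpoint(archive_path))
    else:
        print(f"{archive_path} not found; download it from Releases first")
```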

Step 2: Prepare the Datasets

Download and organize the datasets into the ./data/ directory. The current version uses four datasets:

Agilent DPCL, provided by Agilent Technologies.
NIST20, available under license for academic use.
MoNA, publicly available.
Waters QTOF, our own experimental dataset.

The data directory structure should look like this:

|- data
  |- origin
    |- Agilent_Combined.sdf
    |- Agilent_Metlin.sdf
    |- hr_msms_nist.SDF
    |- MoNA-export-All_LC-MS-MS_QTOF.sdf
    |- MoNA-export-All_LC-MS-MS_Orbitrap.sdf
    |- waters_qtof.mgf
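
Before preprocessing, it can help to confirm that the raw files are where the scripts expect them. The sketch below checks for the file names listed in the tree above. The helper name missing_files is hypothetical, and if you only preprocess a subset of the datasets you may not need every file in the list.

```python
from pathlib import Path

# File names copied from the directory tree above.
EXPECTED_FILES = [
    "Agilent_Combined.sdf",
    "Agilent_Metlin.sdf",
    "hr_msms_nist.SDF",
    "MoNA-export-All_LC-MS-MS_QTOF.sdf",
    "MoNA-export-All_LC-MS-MS_Orbitrap.sdf",
    "waters_qtof.mgf",
]


def missing_files(origin_dir="./data/origin", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in origin_dir."""
    origin = Path(origin_dir)
    return [name for name in expected if not (origin / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing raw files:", ", ".join(missing))
    else:
        print("All raw files found.")
```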

Step 3: Preprocess the Datasets

Run the following command to preprocess the datasets. Specify the datasets with --dataset and the instrument types with --instrument_type (qtof, orbitrap, or both, as in the command below). Add --maxmin_pick to select training molecules with the MaxMin algorithm; without it, molecules are selected at random. The dataset configurations are in ./molnetpack/config/preprocess_etkdgv3.yml.

python scripts/preprocess.py --task msms \
--dataset agilent nist mona waters gnps \
--instrument_type qtof orbitrap \
--data_config_path ./molnetpack/config/preprocess_etkdgv3.yml \
--mgf_dir ./data/mgf_debug/
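
To illustrate what the MaxMin option mentioned above does: MaxMin selection greedily picks items so that each new pick is the one farthest from everything already picked, which yields a diverse subset. The sketch below is a generic pure-Python version on toy 1-D values. It is not the package's implementation; the function name and the toy distance are assumptions, and a real pipeline would presumably compare molecular fingerprints instead.

```python
def maxmin_pick(items, n_pick, distance, first=0):
    """Greedy MaxMin selection: return indices of n_pick mutually distant items."""
    if not 0 < n_pick <= len(items):
        raise ValueError("n_pick must be between 1 and len(items)")
    picked = [first]
    # min_dist[i] = distance from item i to its nearest already-picked item.
    min_dist = [distance(items[i], items[first]) for i in range(len(items))]
    while len(picked) < n_pick:
        # Choose the candidate whose nearest picked neighbour is farthest away.
        candidates = [i for i in range(len(items)) if i not in picked]
        best = max(candidates, key=lambda i: min_dist[i])
        picked.append(best)
        for i in range(len(items)):
            min_dist[i] = min(min_dist[i], distance(items[i], items[best]))
    return picked


if __name__ == "__main__":
    # Toy 1-D "descriptors" forming three clusters near 0, 5 and 10.
    points = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
    print(maxmin_pick(points, 3, lambda a, b: abs(a - b)))
```

With three picks on this toy input, the selection takes one point from each cluster rather than three near-identical neighbours, which is the behaviour that distinguishes MaxMin from random selection.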

Step 4: Train the Model

Use the following commands to train the model. Configuration settings for the model and training process are located in ./molnetpack/config/molnet.yml.

Using the command-line script:

# Fine-tune from the pretrained model (transfer learning):
# Q-TOF:
python scripts/train.py --task msms \
--train_data ./data/qtof_etkdgv3_train.pkl \
--test_data ./data/qtof_etkdgv3_test.pkl \
--checkpoint_path ./check_point/molnet_qtof_etkdgv3_tl.pt \
--transfer \
--resume_path ./check_point/molnet_pre_etkdgv3.pt
# Orbitrap: use the same command with the orbitrap data files and checkpoint path.

# Train the model from scratch:
# Q-TOF:
python scripts/train.py --task msms \
--train_data ./data/qtof_etkdgv3_train.pkl \
--test_data ./data/qtof_etkdgv3_test.pkl \
--checkpoint_path ./check_point/molnet_qtof_etkdgv3.pt \
--ex_model_path ./check_point/molnet_qtof_etkdgv3_jit.pt --device 0

# Orbitrap:
python scripts/train.py --task msms \
--train_data ./data/orbitrap_etkdgv3_train.pkl \
--test_data ./data/orbitrap_etkdgv3_test.pkl \
--checkpoint_path ./check_point/molnet_orbitrap_etkdgv3.pt \
--ex_model_path ./check_point/molnet_orbitrap_etkdgv3_jit.pt --device 0
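
Because the Q-TOF and Orbitrap commands differ only in the instrument name embedded in the paths, the argument list can be generated. The sketch below assembles the scripts/train.py arguments using only flags that appear in the commands above. The helper build_train_cmd is hypothetical, the Orbitrap fine-tuning checkpoint name (molnet_orbitrap_etkdgv3_tl.pt) is an assumption made by analogy with the Q-TOF one, and --ex_model_path is left out for brevity.

```python
def build_train_cmd(instrument, transfer=False, device=None):
    """Assemble the scripts/train.py command line for one instrument type."""
    suffix = "_tl" if transfer else ""
    cmd = [
        "python", "scripts/train.py", "--task", "msms",
        "--train_data", f"./data/{instrument}_etkdgv3_train.pkl",
        "--test_data", f"./data/{instrument}_etkdgv3_test.pkl",
        "--checkpoint_path", f"./check_point/molnet_{instrument}_etkdgv3{suffix}.pt",
    ]
    if transfer:
        # Fine-tune from the pretrained weights obtained in Step 1.
        cmd += ["--transfer", "--resume_path", "./check_point/molnet_pre_etkdgv3.pt"]
    if device is not None:
        cmd += ["--device", str(device)]
    return cmd


if __name__ == "__main__":
    for instrument in ("qtof", "orbitrap"):
        # Print the command; to launch it, pass the list to
        # subprocess.run(cmd, check=True) from the repository root.
        print(" ".join(build_train_cmd(instrument, transfer=True)))
```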

Using the Python API:

import torch
from molnetpack import MolNet

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
molnet_engine = MolNet(device, seed=42)

# Fine-tune from a pretrained checkpoint (Q-TOF):
molnet_engine.train(
    task='msms',
    train_data='./data/qtof_etkdgv3_train.pkl',
    valid_data='./data/qtof_etkdgv3_test.pkl',
    checkpoint_path='./check_point/molnet_qtof_etkdgv3_tl.pt',
    resume_path='./check_point/molnet_pre_etkdgv3.pt',
    transfer=True,
)

Step 5: Evaluation

Let’s evaluate the model trained above!

Using the command-line scripts:

# Predict the spectra:
# Q-TOF:
python scripts/predict.py --task msms \
--test_data ./data/qtof_etkdgv3_test.pkl \
--resume_path ./check_point/molnet_qtof_etkdgv3.pt \
--result_path ./result/pred_qtof_etkdgv3_test.mgf
# Orbitrap:
python scripts/predict.py --task msms \
--test_data ./data/orbitrap_etkdgv3_test.pkl \
--resume_path ./check_point/molnet_orbitrap_etkdgv3.pt \
--result_path ./result/pred_orbitrap_etkdgv3_test.mgf
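
The predictions are written as MGF, a plain-text format in which each spectrum sits between BEGIN IONS and END IONS, with KEY=VALUE header lines followed by "m/z intensity" peak lines. For a quick look at the output, a minimal reader can be sketched as follows. The function name parse_mgf is hypothetical, and the header keys in the example are illustrative; which keys 3DMolMS actually writes is not stated here.

```python
def parse_mgf(lines):
    """Parse MGF text lines into a list of {'params': dict, 'peaks': list} records."""
    spectra, current = [], None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line such as TITLE=... or PEPMASS=...
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity.
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


if __name__ == "__main__":
    example = [
        "BEGIN IONS",
        "TITLE=demo",
        "PEPMASS=180.06",
        "100.0 25.0",
        "150.5 100.0",
        "END IONS",
    ]
    for spectrum in parse_mgf(example):
        print(spectrum["params"].get("TITLE"), len(spectrum["peaks"]))
```

To read a real prediction file, pass an open file handle, for example parse_mgf(open('./result/pred_qtof_etkdgv3_test.mgf')).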

# Evaluate the cosine similarity between experimental spectra and predicted spectra:
# Q-TOF:
python scripts/eval.py ./data/qtof_etkdgv3_test.pkl ./result/pred_qtof_etkdgv3_test.mgf \
--result_path ./eval_qtof_etkdgv3_test.csv --plot_path ./eval_qtof_etkdgv3_test.png
# Orbitrap:
python scripts/eval.py ./data/orbitrap_etkdgv3_test.pkl ./result/pred_orbitrap_etkdgv3_test.mgf \
--result_path ./eval_orbitrap_etkdgv3_test.csv --plot_path ./eval_orbitrap_etkdgv3_test.png

Using the Python API:

# After training, the model is ready for prediction; no checkpoint reload is needed.
# (To use an existing checkpoint instead, load it into the engine first.)
# Load the test data and predict the spectra:
molnet_engine.load_data('./data/qtof_etkdgv3_test.pkl')
pred_df = molnet_engine.pred_msms(
    path_to_results='./result/pred_qtof_etkdgv3_test.mgf',
    instrument='qtof',
)

# Evaluate cosine similarity against ground truth:
results_df = molnet_engine.evaluate(
    test_pkl='./data/qtof_etkdgv3_test.pkl',
    pred_mgf='./result/pred_qtof_etkdgv3_test.mgf',
    result_path='./eval_qtof_etkdgv3_test.csv',
    plot_path='./eval_qtof_etkdgv3_test.png',
)
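
For readers unfamiliar with the metric, the cosine similarity reported by the evaluation can be illustrated on two small peak lists. The sketch below bins each spectrum into fixed-width m/z bins and computes the cosine of the angle between the resulting intensity vectors. The bin width, m/z range, and function names are assumptions chosen for illustration; they are not necessarily the settings used by scripts/eval.py or molnet_engine.evaluate.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=1500.0):
    """Sum peak intensities into fixed-width m/z bins."""
    n_bins = int(math.ceil(max_mz / bin_width))
    vector = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vector[idx] += intensity
    return vector


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0, max_mz=1500.0):
    """Cosine similarity of two peak lists after binning; 0.0 if either is empty."""
    a = bin_spectrum(peaks_a, bin_width, max_mz)
    b = bin_spectrum(peaks_b, bin_width, max_mz)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


if __name__ == "__main__":
    experimental = [(100.2, 50.0), (150.7, 100.0)]
    predicted = [(100.4, 40.0), (200.1, 10.0)]
    print(round(cosine_similarity(experimental, predicted), 3))
```

A value of 1 means the binned spectra are proportional to each other, and 0 means they share no occupied bins.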