Multimodal Epigenetic Sequence Analysis (MESA)
Modality perforamce evaluation
circula mesa with -p option will perform modality performance evaluation. To get a glance of the performance of each modality in terms of classification, the program will test classifier(s) for each modality with cross-validation.
To quickly assess the perforamnce, the --subset option can be used to specify the number or ratio of samples to be used for the evaluation, which will randomly sample features for each modality accordingly. The --repeat option can be used to specify the number of repeats for the random test.
The --clf option can be used to specify the classifier(s) to be used for the evaluation: 1. Random Forest, 2.Logistic Regression, 3. SVM. The --label option is required to specify the label file. The --modality option is required to specify the modality name of each input matrix. See :doc:’../api/mesa’ for more details.
circula mesa \
./data/mesa_test_DHS_matrix.tsv \
./data/mesa_test_CGI_matrix.tsv \
./data/mesa_test_OCC_matrix.tsv \
./data/mesa_test_WPS_matrix.tsv \
--label ./data/mesa_test_label.tsv \
--modality "DHS meth" "CGI meth" "Occupancy" "WPS" \
--clf 1 2 3 \
-p \
--subset 10000 \
--repeat 10 \
-o ./output
Modality |
classifier_roc_auc_mean (RFC,LR,SVC) |
classifier_roc_auc_std (RFC,LR,SVC) |
best_classifier, idx |
best_roc_auc_mean |
|---|---|---|---|---|
Occupancy |
[0.8779 0.8789 0.8892] |
[0.1251 0.1357 0.128 ] |
(‘SVC’, 2) |
0.8892 |
WPS |
[0.8746 0.8728 0.8818] |
[0.1287 0.1387 0.1294] |
(‘SVC’, 2) |
0.8818 |
DHS meth |
[0.7789 0.7864 0.7906] |
[0.162 0.1467 0.1474] |
(‘SVC’, 2) |
0.7906 |
CGI meth |
[0.7768 0.7284 0.7291] |
[0.1618 0.1641 0.1671] |
(‘RandomForestClassifier’, 0) |
0.7768 |
MESA cross-validation
For the sake of model fine-tuning and evaluation, circula mesa with --cv_mesa option will perform cross-validation automatically for the MESA model based on the input moadlity matrix(s), feature size to select --size, classifier --clf, etc.
This process will output the cross-validation AUC to stdout.
# single modality
circula mesa ./data/mesa_test_DHS_matrix.tsv --label ./data/20241217_mesa_test_label.tsv --cv
@Tue Jan 28 23:53:01 2025 MESA.
@Tue Jan 28 23:53:01 2025 Loading modality(s) matrix from input: ['./data/mesa_test_DHS_matrix.tsv']
@Tue Jan 28 23:53:02 2025 Classifier for each modality is specified.
@Tue Jan 28 23:53:02 2025 [RandomForestClassifier(n_jobs=-1, random_state=0)]
@Tue Jan 28 23:53:02 2025 Loading label from input: ./data/20241217_mesa_test_label.tsv
@Tue Jan 28 23:53:02 2025 MESA cross-validation for one modality.
@Tue Jan 28 23:53:02 2025 Single modality input
@Tue Jan 28 23:55:11 2025 MESA cross-validation AUC: 0.7
Or test multiple modalities with multiple input matrices.
# single modality
circula mesa ./data/mesa_test_DHS_matrix.tsv ./data/mesa_test_CGI_matrix.tsv --label ./data/20241217_mesa_test_label.tsv --cv
@Tue Jan 28 23:58:12 2025 MESA.
@Tue Jan 28 23:58:12 2025 Loading modality(s) matrix from input: ['./data/20241217_mesa_test_DHS.tsv', './data/20241217_mesa_test_CGI.tsv']
@Tue Jan 28 23:58:14 2025 Number of classifier > or < number of modality(s). Use first classifer for all modality(s).
@Tue Jan 28 23:58:14 2025 RandomForestClassifier(n_jobs=-1, random_state=0)
@Tue Jan 28 23:58:14 2025 Loading label from input: ./data/20241217_mesa_test_label.tsv
@Tue Jan 28 23:58:14 2025 MESA cross-validation for multiple modalities.
@Tue Jan 28 23:58:14 2025 Mutiple modalities input
@Tue Jan 28 23:55:11 2025 MESA cross-validation AUC: 0.76
MESA construction
Users can construct a MESA model with circula mesa with --mesa option. This process will output the trained MESA model object, in the format of .pkl, to the specified output directory.
# multi modality
circula mesa \
./data/mesa_test_DHS_matrix.tsv \
./data/mesa_test_CGI_matrix.tsv \
./data/mesa_test_OCC_matrix.tsv \
./data/mesa_test_WPS_matrix.tsv \
--label ./data/mesa_test_label.tsv \
--modality "DHS meth" "CGI meth" "Occupancy" "WPS" \
--mesa \
-o ./output
@Wed Jan 29 00:07:36 2025 MESA.
@Wed Jan 29 00:07:36 2025 Constructing MESA model.
@Wed Jan 29 00:07:36 2025 Modality performance test skipped. Loading modality(s) matrix from input: ['./data/mesa_test_DHS_matrix.tsv', './data/mesa_test_CGI_matrix.tsv', './data/mesa_test_OCC_matrix.tsv', './data/mesa_test_WPS_matrix.tsv']
@Wed Jan 29 00:07:44 2025 Number of classifier > or < number of modality(s). Use first classifer for all modality(s).
@Wed Jan 29 00:07:44 2025 RandomForestClassifier(n_jobs=-1, random_state=0)
@Wed Jan 29 00:07:44 2025 Loading label from input: ./data/mesa_test_label.tsv
@Wed Jan 29 00:07:44 2025 Fitting base estimators.
...
@Wed Jan 29 00:07:44 2025 MESA model saved to ./output/MESA_model.pkl
If -p option is specified together with --mesa, the program will perform modality performance evaluation and construct the MESA model automatically based on the best-performing modality(s).
# multi modality
circula mesa \
./data/mesa_test_DHS_matrix.tsv \
./data/mesa_test_CGI_matrix.tsv \
./data/mesa_test_OCC_matrix.tsv \
./data/mesa_test_WPS_matrix.tsv \
--label ./data/mesa_test_label.tsv \
--modality "DHS meth" "CGI meth" "Occupancy" "WPS" \
--clf 1 2 3 \
-p --mesa --max_modality 2 \
--subset 10000 \
--repeat 10 \
-o ./output
Note
For more information on the available arguments for each step, check API section or check with --help.