Good question, and not an easy one to answer in my opinion. When I started doing metabolomics, I used only PCA or PLS-DA and thought this was the best approach, since it presents high-dimensional data in a visually appealing way. However, I found PCA and PLS-DA to be less sensitive to low-abundance metabolites. I also got hung up on model quality metrics and permutation tests; I've since relaxed my stance on their overall value, assuming I'm just using PLS-DA or OPLS to get a quick overview of the data. If I were actually using these tools to model my data and make future predictions, I would probably be more concerned with them.
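For what it's worth, that kind of quick PCA overview only takes a few lines of base R. This is just a sketch on made-up placeholder data (the matrix, group labels, and scaling choice are mine, not from any real study), but it shows the first-pass view I mean, and why scaling matters if you don't want high-abundance metabolites to dominate the first components:

```r
# Toy example only: 'metab' and 'group' are placeholder objects
# (20 samples x 100 metabolites), not data from a real study.
set.seed(1)
metab <- matrix(rnorm(20 * 100), nrow = 20,
                dimnames = list(paste0("sample", 1:20), paste0("m", 1:100)))
group <- factor(rep(c("control", "treated"), each = 10))

# Scaling each metabolite to unit variance keeps high-abundance
# signals from swamping low-abundance ones in the first components.
pca <- prcomp(metab, center = TRUE, scale. = TRUE)
plot(pca$x[, 1:2], col = group, pch = 19, xlab = "PC1", ylab = "PC2")
```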
I next switched to self-organizing maps (SOMs), which were another excellent way to represent a lot of data in a simple, easy-to-interpret map. The challenge I found was that while SOMs made great figures for talks, they were hard to mine for biomarkers.
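If you want to try that kind of map yourself, here's a rough sketch using the kohonen package; the package choice, grid size, and toy matrix are my assumptions, since there are several SOM implementations in R:

```r
# Toy example only: placeholder matrix, and 'kohonen' as one common
# SOM implementation in R (not necessarily the tool I used back then).
library(kohonen)
set.seed(1)
metab <- matrix(rnorm(20 * 100), nrow = 20)

som_fit <- som(scale(metab), grid = somgrid(4, 4, "hexagonal"))

# The codebook plot gives the compact, map-style overview described
# above; tracing map units back to individual metabolites is the
# harder "mining for biomarkers" step.
plot(som_fit, type = "codes")
```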
Next I switched to random forests, and after getting over my fear of R, I found this approach useful for the vast majority of my projects. Is it better than other modeling or classification algorithms? My answer really depends on the project. I like random forests for biomarker studies (finding significant changes between two groups), but if I'm trying to pack as much information into a figure as possible, I generally turn to OPLS or a network-based approach (MetaMapp is awesome for this).
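As a concrete illustration of what I mean by mining for biomarkers, here's roughly how a two-group screen looks with the randomForest package; the data and parameter choices below are placeholders, not a recipe:

```r
# Toy example only: placeholder two-group data, with 'randomForest'
# as one straightforward implementation of the approach.
library(randomForest)
set.seed(1)
metab <- matrix(rnorm(20 * 100), nrow = 20,
                dimnames = list(NULL, paste0("m", 1:100)))
group <- factor(rep(c("control", "treated"), each = 10))

rf <- randomForest(x = metab, y = group, ntree = 1000, importance = TRUE)

# Out-of-bag error gives a quick read on how separable the groups are.
print(rf)

# Metabolites with the largest mean decrease in accuracy are the
# candidate biomarkers to follow up on (leads, not final answers).
imp <- importance(rf, type = 1)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)
```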
What's your opinion? We're always going to be faced with the large P, small n problem.