Question: For metabolomics data sets (where samples <<< metabolites), which of several classification algorithms is most efficient?
1
2.2 years ago by
Biswapriya120
United States
Biswapriya120 wrote:

So, out of existing classification algorithms such as Support Vector Machine (SVM), K-nearest neighbor (KNN), Interval Valued Classification (IVC), and the improvised Interval Value based Particle Swarm Optimization (IVPSO) algorithm, which one is potentially the most robust in handling such data and the most reliable in terms of results (not necessarily the most popular!), and why?

 

modified 2.1 years ago • written 2.2 years ago by Biswapriya120
2
2.2 years ago by
Penn State University
Andrew Patterson180 wrote:

Good question, and not an easy one to answer in my opinion. When I started doing metabolomics, I used only PCA or PLS-DA and thought this was the very best approach, since it was a nice way to visualize high-dimensional data in a visually appealing manner. However, I found PCA and PLS-DA to be less sensitive to low-abundance metabolites. Further, I got hung up on model quality and permutation tests. I've since relaxed my stance on their overall value, assuming I'm just using PLS-DA or OPLS to get a quick overview of the data. If I were actually using these tools to model my data and make future predictions, maybe I would be more concerned with them.
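For readers unfamiliar with the permutation tests mentioned above, here is a minimal sketch of the idea, not the procedure used by any particular PLS-DA package. It assumes synthetic data (12 samples, 100 features) and swaps in a simple nearest-centroid classifier in place of PLS-DA; all function names and parameters are hypothetical. The class labels are shuffled many times, and the leave-one-out accuracy on the real labels is compared with the accuracies obtained on the shuffled labels.

```python
import random


def make_data(n_per_class=6, n_features=100, n_informative=5, shift=2.0, seed=0):
    """Synthetic 'samples << metabolites' data: two classes, few informative features."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                # Only the first n_informative features differ between classes.
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def centroid(rows):
    """Feature-wise mean of a list of samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two samples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def loo_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i in range(len(X)):
        train = [(x, lab) for k, (x, lab) in enumerate(zip(X, y)) if k != i]
        cents = {}
        for lab in sorted(set(lab for _, lab in train)):
            cents[lab] = centroid([x for x, l in train if l == lab])
        pred = min(cents, key=lambda lab: sq_dist(X[i], cents[lab]))
        correct += int(pred == y[i])
    return correct / len(X)


def permutation_test(X, y, n_permutations=100, seed=1):
    """Compare the observed accuracy with accuracies on label-shuffled data."""
    rng = random.Random(seed)
    observed = loo_accuracy(X, y)
    null = []
    for _ in range(n_permutations):
        shuffled = y[:]
        rng.shuffle(shuffled)
        null.append(loo_accuracy(X, shuffled))
    # Add-one correction so the estimated p-value is never exactly zero.
    p = (1 + sum(a >= observed for a in null)) / (1 + n_permutations)
    return observed, null, p


X, y = make_data()
observed, null, p_value = permutation_test(X, y)
print(f"observed LOO accuracy: {observed:.2f}, permutation p-value: {p_value:.3f}")
```

A small p-value here only says the classifier does better on the real labels than on shuffled ones; it says nothing about which metabolites drive the separation.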

I next switched to self-organizing maps (SOMs), which again were an excellent way to represent a lot of data in a simple, easy-to-interpret map. The challenge I found was that while SOMs made great figures that were nice for talks, they were hard to mine for biomarkers.

Next I switched to random forests and, after getting over my fear of R, I found this approach to be useful for the vast majority of my projects. Is it better than other modeling or classification algorithms? I think my answer to this question really depends on the project. I like random forests for biomarker studies (trying to find significant changes between two groups), but if I'm trying to pack as much information into a figure as possible, I generally turn to OPLS or some network-based approach (MetaMapp is awesome for this).

What's your opinion? We're always going to be faced with the large P, small n problem. 

written 2.2 years ago by Andrew Patterson180
0
2.1 years ago by
Biswapriya120
United States
Biswapriya120 wrote:

Yes. I recently heard a 'stats expert' say that PLS-DA, volcano plots, t-tests, ranking, Random Forest, VIP, and ROC can all point to "really different and unrelated" metabolites (unrelated pathway-wise, from different classes, and so on!) in a single study when seeking "that/those elusive biomarker(s)"! This is a headache for a "plant biologist" like me, who has to interface with data, mass spec, and biology to draw conclusions! He was of the view that the "statistician is the right person to judge/guide" and that there is "no clear-cut answer as to which biomarker is the chosen one!" That sounds tough on us and adds to the uncertainty. And then, if I present the best biomarker from technique 1 as the correct one, is hiding the others logical (and not unethical)?
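The disagreement between ranking methods is easy to reproduce on toy data. The sketch below is only an illustration under stated assumptions: the intensities are synthetic and already on a log scale, only two of the methods named above are compared (a Welch t-statistic and a fold-change-style mean difference), and the helper names are hypothetical. It ranks the same metabolites both ways and reports how many of the top 20 the two lists share.

```python
import math
import random
import statistics


def make_data(n_per_group=6, n_metabolites=300, seed=0):
    """Hypothetical log-scale intensities: one row per metabolite, two sample groups."""
    rng = random.Random(seed)
    control, treated = [], []
    for m in range(n_metabolites):
        noise_sd = rng.uniform(0.2, 2.0)  # metabolites differ in variability
        effect = rng.gauss(0.0, 1.0) if m < 30 else 0.0  # first 30 truly change
        control.append([rng.gauss(10.0, noise_sd) for _ in range(n_per_group)])
        treated.append([rng.gauss(10.0 + effect, noise_sd) for _ in range(n_per_group)])
    return control, treated


def welch_t(a, b):
    """Welch's t-statistic for two groups with unequal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(b) - statistics.mean(a)) / math.sqrt(va / len(a) + vb / len(b))


def top_k(scores, k):
    """Indices of the k scores with the largest absolute value."""
    return sorted(range(len(scores)), key=lambda i: abs(scores[i]), reverse=True)[:k]


control, treated = make_data()
t_scores = [welch_t(c, t) for c, t in zip(control, treated)]
# The data are on a log scale, so the mean difference acts as a log fold change.
fc_scores = [statistics.mean(t) - statistics.mean(c) for c, t in zip(control, treated)]

k = 20
top_t = set(top_k(t_scores, k))
top_fc = set(top_k(fc_scores, k))
jaccard = len(top_t & top_fc) / len(top_t | top_fc)
print(f"shared: {len(top_t & top_fc)} of {k}, Jaccard = {jaccard:.2f}")
```

The t-statistic rewards small, consistent changes in low-variance metabolites, while the mean difference rewards large but noisy ones, so partial overlap is expected even when both methods are applied correctly.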

written 2.1 years ago by Biswapriya120