[mlpack] GSoC 2014 : Introduction and Interests

Marcus Edel marcus.edel at fu-berlin.de
Mon Mar 10 13:54:49 EDT 2014


Hello,

> I was studying benchmarking and performance analysis of machine
> learning algorithms and came across an interesting idea in a research
> paper.

Can you point us to the paper?

> So, one of the things that I propose for this project is that we
> implement, say, k metrics and perform a bootstrap analysis for the
> given algorithms over these k metrics. By this, we will have a good
> idea of how probable it is for an algorithm to perform "well" given
> various metrics.

Yes, that seems reasonable.

> I have not yet decided on the metrics to use, but I am working on
> that.

I think we should offer some standard metrics, but the class should also be templatized in such a way that users can easily implement their own metrics or choose between different ones.

> I would like to have comments and feedback on the idea. Also, it
> would be great if you can tell me the algorithms/tools that we will be
> comparing for performance in the project. I can give more rigorous
> details in the proposal.


Currently there are a few classifiers in the mlpack/benchmark system (linear regression, logistic regression, least angle regression, naive Bayes classifier, etc.).

The following link lists the currently available methods in mlpack:

http://mlpack.org/doxygen.php

So maybe it's a good idea to include some additional classifiers from Shogun, Weka, scikit-learn, etc.:

http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
http://www.shogun-toolbox.org/page/features/
http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html

I hope that helps.

Thanks,

Marcus

On 10 Mar 2014, at 17:56, Anand Soni <anand.92.soni at gmail.com> wrote:

> Hi Marcus and Ryan,
> 
> I was studying benchmarking and performance analysis of machine
> learning algorithms and came across an interesting idea in a research
> paper.
> 
> Suppose we need to compare 'n' algorithms for performance. (I need
> more information about the algorithms that will be involved in this
> project). Also, suppose I have 'k' performance metrics. Obviously we
> must not infer anything by looking at an algorithm's performance based
> on just one metric.
> 
> For example, in one of my projects where I did sentiment analysis
> using ANNs (artificial neural networks), I got good accuracy while
> the precision/recall measures were poor. This means there is no
> "best algorithm": it all depends on the metrics used.
> 
> So, one of the things that I propose for this project is that we
> implement, say, k metrics and perform a bootstrap analysis for the
> given algorithms over these k metrics. By this, we will have a good
> idea of how probable it is for an algorithm to perform "well" given
> various metrics.
> 
> I have not yet decided on the metrics to use, but I am working on
> that. I would like to have comments and feedback on the idea. Also, it
> would be great if you can tell me the algorithms/tools that we will be
> comparing for performance in the project. I can give more rigorous
> details in the proposal.
> 
> Regards.
> 
> Anand Soni
> 
> On Thu, Mar 6, 2014 at 10:08 PM, Ryan Curtin <gth671b at mail.gatech.edu> wrote:
>> On Wed, Mar 05, 2014 at 08:39:10PM +0530, Anand Soni wrote:
>>> Thanks a lot Ryan!
>>> 
>>> I too, would want to have a single and nice application submitted
>>> rather than many. It was just out of interest that I was reading up on
>>> dual trees and yes, most of the literature that I found was from
>>> gatech. I also came across your paper on dual trees
>>> (http://arxiv.org/pdf/1304.4327.pdf ). Can you give me some more
>>> pointers where I can get a better understanding of dual trees?
>> 
>> There are lots of papers on dual-tree algorithms but the paper you
>> linked to is (to my knowledge) the only one that tries to describe
>> dual-tree algorithms in an abstract manner.  Here are some links to
>> other papers, but keep in mind that they focus on particular algorithms
>> and often don't devote very much space to describing exactly what a
>> dual-tree algorithm is:
>> 
>> A.G. Gray and A.W. Moore. "N-body problems in statistical learning."
>> Advances in Neural Information Processing Systems (2001): 521-527.
>> 
>> A.W. Moore.  "Nonparametric density estimation: toward computational
>> tractability."  Proceedings of the Third SIAM International Conference
>> on Data Mining (2003).
>> 
>> A. Beygelzimer, S. Kakade, and J.L. Langford.  "Cover trees for nearest
>> neighbor."  Proceedings of the 23rd International Conference on Machine
>> Learning (2006).
>> 
>> P. Ram, D. Lee, W.B. March, A.G. Gray.  "Linear-time algorithms for
>> pairwise statistical problems."  Advances in Neural Information
>> Processing Systems (2009).
>> 
>> W.B. March, P. Ram, A.G. Gray.  "Fast Euclidean minimum spanning tree:
>> algorithm, analysis, and applications."  Proceedings of the 16th ACM
>> SIGKDD International Conference on Knowledge Discovery and Data Mining
>> (2010).
>> 
>> R.R. Curtin, P. Ram.  "Dual-tree fast exact max-kernel search." (this
>> one hasn't been published yet...
>> http://www.ratml.org/pub/pdf/2013fastmks.pdf ).
>> 
>> I know that's a lot of references and probably way more than you want to
>> read, so don't feel obligated to read anything, but it will probably
>> help explain exactly what a dual-tree algorithm is... I hope!  I can
>> link to more papers too, if you want...
>> 
>>> But, of course, I am more willing to work on automatic benchmarking,
>>> on which I had a little talk with Marcus and I am brewing ideas.
>> 
>> Ok, sounds good.
>> 
>> Thanks,
>> 
>> Ryan
>> 
>> --
>> Ryan Curtin    | "Somebody dropped a bag on the sidewalk."
>> ryan at ratml.org |   - Kit
> 
> 
> 
> -- 
> Anand Soni | Junior Undergraduate | Department of Computer Science &
> Engineering | IIT Bombay | India
