[mlpack] GSoC 2014 : Introduction and Interests

Marcus Edel marcus.edel at fu-berlin.de
Tue Mar 11 15:48:16 EDT 2014


Hello,

I've gone through the paper, and I think that in our case all of the metrics except ROC could be implemented to measure performance. Keep in mind, though, that some of the metrics can only handle binary classification problems. As mentioned in the paper, one possible solution is to transform the data into binary classification problems; another is to use multi-class metrics.
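
For illustration, here is a minimal sketch of the binary reduction
(plain C++, purely hypothetical, nothing mlpack-specific): each class
in turn is treated as the positive label, so a binary-only metric such
as precision can still be applied per class, and the per-class values
could then be averaged.

#include <cstddef>
#include <iostream>
#include <vector>

// Precision for a single binary problem: TP / (TP + FP).
double Precision(const std::vector<bool>& truth,
                 const std::vector<bool>& predicted)
{
  std::size_t tp = 0, fp = 0;
  for (std::size_t i = 0; i < truth.size(); ++i)
  {
    if (predicted[i] && truth[i]) ++tp;
    else if (predicted[i] && !truth[i]) ++fp;
  }
  return (tp + fp == 0) ? 0.0 : double(tp) / double(tp + fp);
}

// "One-vs-all": treat class 'positive' as 1 and everything else as 0.
std::vector<bool> OneVsAll(const std::vector<int>& labels, int positive)
{
  std::vector<bool> binary(labels.size());
  for (std::size_t i = 0; i < labels.size(); ++i)
    binary[i] = (labels[i] == positive);
  return binary;
}

int main()
{
  std::vector<int> truth     = { 0, 1, 2, 1, 0, 2 };
  std::vector<int> predicted = { 0, 1, 1, 1, 0, 2 };

  // One binary precision value per class.
  for (int c = 0; c <= 2; ++c)
    std::cout << "precision(class " << c << ") = "
              << Precision(OneVsAll(truth, c), OneVsAll(predicted, c))
              << std::endl;
}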

Regards,

Marcus


On 10 Mar 2014, at 19:34, Anand Soni <anand.92.soni at gmail.com> wrote:

> Marcus,
> 
> I was talking about the following paper from Cornell University:
> 
> http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf
> 
> I want my implementations to be based on this paper and possibly some
> other ideas. The paper points to some standard metrics too. I would
> like to use some (or all) of them depending on the feasibility. Can
> you have a look at the metrics and tell me if some of them are
> irrelevant for us?
> 
> Also, I will look at the classifiers you have pointed me to. Thanks a lot!
> 
> Regards.
> 
> Anand Soni
> 
> On Mon, Mar 10, 2014 at 11:24 PM, Marcus Edel <marcus.edel at fu-berlin.de> wrote:
>> Hello,
>> 
>>> I was studying benchmarking and performance analysis of machine
>>> learning algorithms and came across an interesting idea in a research
>>> paper.
>> 
>> Can you point us to the paper?
>> 
>>> So, one of the things that I propose for this project is that we
>>> implement, say, k metrics and perform a bootstrap analysis for the
>>> given algorithms over these k metrics. By this, we will have a good
>>> idea about how probable it is for an algorithm to perform "well" given
>>> various metrics.
>> 
>> Yes, that seems reasonable.
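>> 
>> A rough sketch of what the bootstrap of a single metric could look
>> like (plain C++, purely illustrative, not existing benchmark code):
>> resample the (truth, prediction) pairs with replacement many times
>> and look at the distribution of the metric across the resamples
>> instead of a single point estimate.
>> 
>> #include <cstddef>
>> #include <iostream>
>> #include <random>
>> #include <vector>
>> 
>> int main()
>> {
>>   std::vector<int> truth     = { 1, 0, 1, 1, 0, 1, 0, 1, 1, 0 };
>>   std::vector<int> predicted = { 1, 0, 0, 1, 0, 1, 1, 1, 0, 0 };
>> 
>>   std::mt19937 rng(42);
>>   std::uniform_int_distribution<std::size_t> pick(0, truth.size() - 1);
>> 
>>   const std::size_t rounds = 1000;
>>   std::vector<double> scores;
>>   for (std::size_t r = 0; r < rounds; ++r)
>>   {
>>     std::size_t correct = 0;
>>     for (std::size_t i = 0; i < truth.size(); ++i)
>>     {
>>       const std::size_t j = pick(rng); // draw an index with replacement
>>       if (truth[j] == predicted[j]) ++correct;
>>     }
>>     scores.push_back(double(correct) / truth.size());
>>   }
>> 
>>   // Mean of the bootstrapped accuracy; quantiles of 'scores' would
>>   // give a confidence interval instead of a single point estimate.
>>   double mean = 0.0;
>>   for (double s : scores) mean += s;
>>   std::cout << "bootstrap mean accuracy: " << mean / rounds << std::endl;
>> }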
>> 
>>> I have not yet decided on the metrics to use, but I am working on
>>> that.
>> 
>> I think we should offer some standard metrics, and the class should also be templatized in such a way that the user can easily implement their own metrics or choose among different ones.
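>> 
>> Just to sketch what I mean (hypothetical code, nothing like this
>> exists in mlpack yet): a metric could be a policy class with a
>> static Evaluate() method, and the benchmarking code could be
>> templatized on it, so that user-defined metrics plug in the same
>> way as the standard ones.
>> 
>> #include <cstddef>
>> #include <iostream>
>> #include <vector>
>> 
>> // A metric policy only needs a static Evaluate() with this signature.
>> struct Accuracy
>> {
>>   static double Evaluate(const std::vector<int>& truth,
>>                          const std::vector<int>& predicted)
>>   {
>>     std::size_t correct = 0;
>>     for (std::size_t i = 0; i < truth.size(); ++i)
>>       if (truth[i] == predicted[i]) ++correct;
>>     return double(correct) / truth.size();
>>   }
>> };
>> 
>> // The benchmarking code is written once, against the template parameter.
>> template<typename Metric>
>> double Score(const std::vector<int>& truth,
>>              const std::vector<int>& predicted)
>> {
>>   return Metric::Evaluate(truth, predicted);
>> }
>> 
>> int main()
>> {
>>   std::vector<int> truth     = { 1, 0, 1, 1, 0 };
>>   std::vector<int> predicted = { 1, 0, 0, 1, 1 };
>> 
>>   // A user-defined metric would be swapped in the same way:
>>   // Score<MyMetric>(truth, predicted);
>>   std::cout << "accuracy: " << Score<Accuracy>(truth, predicted)
>>             << std::endl;
>> }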
>> 
>>> I would like to have comments and feedback on the idea. Also, it
>>> would be great if you could tell me the algorithms/tools whose
>>> performance we will be comparing in the project. I can give more
>>> rigorous details in the proposal.
>> 
>> 
>> Currently there are a few classifiers in the mlpack/benchmark system (linear regression, logistic regression, least angle regression, naive Bayes classifier, etc.).
>> 
>> The following link lists the currently available methods in mlpack:
>> 
>> http://mlpack.org/doxygen.php
>> 
>> So maybe it's a good idea to include some additional classifiers from Shogun, Weka, scikit-learn, etc.
>> 
>> http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
>> http://www.shogun-toolbox.org/page/features/
>> http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html
>> 
>> I hope that helps.
>> 
>> Thanks,
>> 
>> Marcus
>> 
>> On 10 Mar 2014, at 17:56, Anand Soni <anand.92.soni at gmail.com> wrote:
>> 
>>> Hi Marcus and Ryan,
>>> 
>>> I was studying benchmarking and performance analysis of machine
>>> learning algorithms and came across an interesting idea in a research
>>> paper.
>>> 
>>> Suppose we need to compare 'n' algorithms for performance. (I need
>>> more information about the algorithms that will be involved in this
>>> project.) Also, suppose I have 'k' performance metrics. Obviously, we
>>> must not infer anything by looking at an algorithm's performance based
>>> on just one metric.
>>> 
>>> For example, in one of my projects, where I did sentiment analysis
>>> using ANNs (artificial neural networks), I got good accuracy while
>>> the precision/recall figures were poor. This means there is no single
>>> "best" algorithm; it all depends on the metrics used.
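>>> 
>>> To make that concrete (illustrative numbers only, not taken from my
>>> project): on a test set with 95 negatives and 5 positives, a
>>> classifier that always predicts the majority class gets 95% accuracy
>>> but zero recall on the positive class.
>>> 
>>> #include <iostream>
>>> 
>>> int main()
>>> {
>>>   // Hypothetical confusion-matrix counts for an "always negative"
>>>   // classifier on 95 negatives and 5 positives.
>>>   const double tp = 0, fn = 5, tn = 95, fp = 0;
>>> 
>>>   const double accuracy = (tp + tn) / (tp + tn + fp + fn); // 0.95
>>>   const double recall   = tp / (tp + fn);                  // 0.0
>>> 
>>>   std::cout << "accuracy = " << accuracy
>>>             << ", recall = " << recall << std::endl;
>>> }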
>>> 
>>> So, one of the things that I propose for this project is that we
>>> implement, say, k metrics and perform a bootstrap analysis for the
>>> given algorithms over these k metrics. By this, we will have a good
>>> idea about how probable it is for an algorithm to perform "well" given
>>> various metrics.
>>> 
>>> I have not yet decided on the metrics to use, but I am working on
>>> that. I would like to have comments and feedback on the idea. Also, it
>>> would be great if you could tell me the algorithms/tools whose
>>> performance we will be comparing in the project. I can give more
>>> rigorous details in the proposal.
>>> 
>>> Regards.
>>> 
>>> Anand Soni
>>> 
>>> On Thu, Mar 6, 2014 at 10:08 PM, Ryan Curtin <gth671b at mail.gatech.edu> wrote:
>>>> On Wed, Mar 05, 2014 at 08:39:10PM +0530, Anand Soni wrote:
>>>>> Thanks a lot Ryan!
>>>>> 
>>>>> I, too, would want to have a single, nice application submitted
>>>>> rather than many. It was just out of interest that I was reading up on
>>>>> dual trees and yes, most of the literature that I found was from
>>>>> gatech. I also came across your paper on dual trees
>>>>> (http://arxiv.org/pdf/1304.4327.pdf). Can you give me some more
>>>>> pointers to where I can get a better understanding of dual trees?
>>>> 
>>>> There are lots of papers on dual-tree algorithms, but the paper you
>>>> linked to is (to my knowledge) the only one that tries to describe
>>>> dual-tree algorithms in an abstract manner.  Here are some links to
>>>> other papers, but keep in mind that they focus on particular algorithms
>>>> and often don't devote very much space to describing exactly what a
>>>> dual-tree algorithm is:
>>>> 
>>>> A.G. Gray and A.W. Moore. "N-body problems in statistical learning."
>>>> Advances in Neural Information Processing Systems (2001): 521-527.
>>>> 
>>>> A.W. Moore.  "Nonparametric density estimation: toward computational
>>>> tractability."  Proceedings of the Third SIAM International Conference
>>>> on Data Mining (2003).
>>>> 
>>>> A. Beygelzimer, S. Kakade, and J.L. Langford.  "Cover trees for nearest
>>>> neighbor."  Proceedings of the 23rd International Conference on Machine
>>>> Learning (2006).
>>>> 
>>>> P. Ram, D. Lee, W.B. March, A.G. Gray.  "Linear-time algorithms for
>>>> pairwise statistical problems."  Advances in Neural Information
>>>> Processing Systems (2009).
>>>> 
>>>> W.B. March, P. Ram, A.G. Gray.  "Fast Euclidean minimum spanning tree:
>>>> algorithm, analysis, and applications."  Proceedings of the 16th ACM
>>>> SIGKDD International Conference on Knowledge Discovery and Data Mining
>>>> (2010).
>>>> 
>>>> R.R. Curtin, P. Ram.  "Dual-tree fast exact max-kernel search." (this
>>>> one hasn't been published yet...
>>>> http://www.ratml.org/pub/pdf/2013fastmks.pdf ).
>>>> 
>>>> I know that's a lot of references and probably way more than you want to
>>>> read, so don't feel obligated to read anything, but it will probably
>>>> help explain exactly what a dual-tree algorithm is... I hope!  I can
>>>> link to more papers too, if you want...
>>>> 
>>>>> But, of course, I am more willing to work on automatic benchmarking,
>>>>> on which I had a little talk with Marcus and I am brewing ideas.
>>>> 
>>>> Ok, sounds good.
>>>> 
>>>> Thanks,
>>>> 
>>>> Ryan
>>>> 
>>>> --
>>>> Ryan Curtin    | "Somebody dropped a bag on the sidewalk."
>>>> ryan at ratml.org |   - Kit
>>> 
>>> 
>>> 
>>> --
>>> Anand Soni | Junior Undergraduate | Department of Computer Science &
>>> Engineering | IIT Bombay | India
>> 
> 
> 
> 
> -- 
> Anand Soni | Junior Undergraduate | Department of Computer Science &
> Engineering | IIT Bombay | India
