[mlpack] GSoC 2014 : Introduction and Interests

Marcus Edel marcus.edel at fu-berlin.de
Wed Mar 12 06:08:49 EDT 2014


Hello Anand,

> Both these methods are feasible. There are other complex methods but I
> prefer one of the above two.

Choose the way you like the most, but is there any reason to avoid multiclass metrics such as the Matthews correlation coefficient (MCC)?
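Just to make that concrete: MCC only needs the confusion matrix, so even the binary form is a few lines of code. Here is a rough Python sketch (purely illustrative, not part of the benchmark scripts; the names are made up):

import math

def mcc(true_labels, predicted_labels):
    # Binary Matthews correlation coefficient for 0/1 labels.
    tp = fp = tn = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:
            fn += 1

    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If any marginal sum is zero the MCC is undefined; return 0 by convention.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator

print(mcc([1, 0, 1, 1, 0], [1, 0, 1, 1, 0]))  # perfect prediction -> 1.0

The multiclass generalization works the same way, just on the full k x k confusion matrix.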

> Also, I would like to know and discuss how I will apply these tests
> (once implemented) on the required algorithms. Will I have results
> available from all the algorithms already run on several datasets, or
> will I have to run and generate the result data and then apply the
> metrics?
> If the latter is the case, will I have a code base and datasets to start
> running the algorithms?

Currently we measure only the runtime of several algorithms on several datasets. The runtime information is stored in an SQLite database. To apply the metrics, you will have to extend the existing code to collect the required information. We use the following code to extract the runtime information for the mlpack Naive Bayes Classifier:

https://github.com/zoq/benchmarks/blob/master/methods/mlpack/nbc.py

And yes, there is already a codebase and several datasets available. As I said before, maybe you can also implement some additional classifiers.
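To give you a rough idea of the kind of extension I mean (just a sketch with made-up file names, not the actual benchmark interface), a metric would be computed from the true and predicted labels of a run, e.g.:

import csv

def load_labels(filename):
    # Read a single-column CSV file of integer class labels.
    with open(filename) as f:
        return [int(row[0]) for row in csv.reader(f) if row]

def accuracy(true_labels, predicted_labels):
    # Fraction of correctly classified instances.
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return float(correct) / len(true_labels)

# Hypothetical paths; the real script would take the labels produced by
# the executed method and store the score next to the runtime.
true_labels = load_labels("datasets/iris_labels.csv")
predicted_labels = load_labels("output/nbc_predictions.csv")
print("accuracy:", accuracy(true_labels, predicted_labels))

The same pattern would work for MCC, precision/recall, and so on.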

Hope that helps!

Thanks,

Marcus


On 11 Mar 2014, at 21:51, Anand Soni <anand.92.soni at gmail.com> wrote:

> Hello Marcus,
> 
> Thanks for going through the paper. I plan to implement many of the
> metrics mentioned in the paper. As far as the binary classification
> metrics are concerned, I have the following two approaches in mind to
> convert a multi-class classification problem to a binary one:
> 
> a) Given a multi-class problem with 'k' classes, we label each class
> as Ci for 'i' in 1 to 'k'. Now, consider a particular class with label
> Ci: all examples in Ci are considered positive and all others
> negative. We do this for each class and end up with 'k' hypotheses
> which need to be combined.
> 
> b) Another approach is to pick two classes out of the 'k' classes and
> take one as the positive class and the other as the negative. Here, we
> end up with kC2 hypotheses which need to be combined.
> 
> Both these methods are feasible. There are other complex methods but I
> prefer one of the above two.
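Just so we mean the same thing by (a) and (b), here is a minimal Python sketch of the two relabelings (the helper names are made up, nothing from the project):

def one_vs_all(labels, positive_class):
    # Approach (a): the chosen class becomes positive (1), all other
    # classes become negative (0); repeating this for each of the 'k'
    # classes yields 'k' binary problems.
    return [1 if label == positive_class else 0 for label in labels]

def one_vs_one(labels, class_a, class_b):
    # Approach (b): keep only the examples of two classes and relabel
    # them; repeating this for every pair yields k*(k-1)/2 = kC2
    # binary problems.
    return [(i, 1 if label == class_a else 0)
            for i, label in enumerate(labels)
            if label in (class_a, class_b)]

labels = [0, 1, 2, 1, 0, 2]
print(one_vs_all(labels, positive_class=1))      # [0, 1, 0, 1, 0, 0]
print(one_vs_one(labels, class_a=0, class_b=2))  # [(0, 1), (2, 0), (4, 1), (5, 0)]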
> 
> Also, I would like to know and discuss how I will apply these tests
> (once implemented) on the required algorithms. Will I have results
> available from all the algorithms already run on several datasets, or
> will I have to run and generate the result data and then apply the
> metrics?
> If the latter is the case, will I have a code base and datasets to start
> running the algorithms?
> 
> Thanks and regards,
> Anand
> 
> On Wed, Mar 12, 2014 at 1:18 AM, Marcus Edel <marcus.edel at fu-berlin.de> wrote:
>> Hello,
>> 
>> I've gone through the paper, and I think in our case all metrics except the ROC could be implemented to measure the performance. But keep in mind that some of the metrics can only handle binary classification problems. As mentioned in the paper, one possible solution is to transform the data into binary classification problems; another is to use multiclass metrics.
>> 
>> Regards,
>> 
>> Marcus
>> 
>> 
>> On 10 Mar 2014, at 19:34, Anand Soni <anand.92.soni at gmail.com> wrote:
>> 
>>> Marcus,
>>> 
>>> I was talking about the following paper from Cornell University:
>>> 
>>> http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf
>>> 
>>> I want my implementations to be based on this paper and possibly some
>>> other ideas. The paper points to some standard metrics too. I would
>>> like to use some (or all) of them depending on the feasibility. Can
>>> you have a look at the metrics and tell me if some of them are
>>> irrelevant for us?
>>> 
>>> Also, I will look at the classifiers you have pointed me to. Thanks a lot!
>>> 
>>> Regards.
>>> 
>>> Anand Soni
>>> 
>>> On Mon, Mar 10, 2014 at 11:24 PM, Marcus Edel <marcus.edel at fu-berlin.de> wrote:
>>>> Hello,
>>>> 
>>>>> I was studying benchmarking and performance analysis of machine
>>>>> learning algorithms and came across an interesting idea in a research
>>>>> paper.
>>>> 
>>>> Can you point us to the paper?
>>>> 
>>>>> So, one of the things that I propose for this project is that we
>>>>> implement, say, k metrics and perform a bootstrap analysis for the
>>>>> given algorithms over these k metrics. By this, we will have a good
>>>>> idea about how probable it is for an algorithm to perform "well" given
>>>>> various metrics.
>>>> 
>>>> Yes, that seems reasonable.
>>>> 
>>>>> I have not yet decided on the metrics to use, but I am working on
>>>>> that.
>>>> 
>>>> I think we should offer some standard metrics, and the class should also be templatized in such a way that the user can easily implement their own metrics or choose different metrics.
>>>> 
>>>>> I would like to have comments and feedback on the idea. Also, it
>>>>> would be great if you can tell me the algorithms/tools that we will be
>>>>> comparing for performance in the project. I can give more rigorous
>>>>> details in the proposal.
>>>> 
>>>> 
>>>> Currently there are a few classifiers in the mlpack/benchmark system (linear regression, logistic regression, least angle regression, naive Bayes classifier, etc.).
>>>> 
>>>> The following link lists the currently available methods in mlpack:
>>>> 
>>>> http://mlpack.org/doxygen.php
>>>> 
>>>> So maybe it's a good idea to include some additional classifiers from Shogun, Weka, scikit-learn, etc.
>>>> 
>>>> http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
>>>> http://www.shogun-toolbox.org/page/features/
>>>> http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html
>>>> 
>>>> I hope that helps.
>>>> 
>>>> Thanks,
>>>> 
>>>> Marcus
>>>> 
>>>> On 10 Mar 2014, at 17:56, Anand Soni <anand.92.soni at gmail.com> wrote:
>>>> 
>>>>> Hi Marcus and Ryan,
>>>>> 
>>>>> I was studying benchmarking and performance analysis of machine
>>>>> learning algorithms and came across an interesting idea in a research
>>>>> paper.
>>>>> 
>>>>> Suppose we need to compare 'n' algorithms for performance. (I need
>>>>> more information about the algorithms that will be involved in this
>>>>> project). Also, suppose I have 'k' performance metrics. Obviously we
>>>>> must not infer anything by looking at an algorithm's performance based
>>>>> on just one metric.
>>>>> 
>>>>> For example, in one of my projects where I did sentiment analysis
>>>>> using ANNs (artificial neural networks), I got good accuracy while
>>>>> the precision and recall figures were not as good. This means
>>>>> there is no "best algorithm". It all depends on the metrics used.
>>>>> 
>>>>> So, one of the things that I propose for this project is that we
>>>>> implement, say, k metrics and perform a bootstrap analysis for the
>>>>> given algorithms over these k metrics. By this, we will have a good
>>>>> idea about how probable it is for an algorithm to perform "well" given
>>>>> various metrics.
>>>>> 
>>>>> I have not yet decided on the metrics to use, but I am working on
>>>>> that. I would like to have comments and feedback on the idea. Also, it
>>>>> would be great if you can tell me the algorithms/tools that we will be
>>>>> comparing for performance in the project. I can give more rigorous
>>>>> details in the proposal.
>>>>> 
>>>>> Regards.
>>>>> 
>>>>> Anand Soni
>>>>> 
>>>>> On Thu, Mar 6, 2014 at 10:08 PM, Ryan Curtin <gth671b at mail.gatech.edu> wrote:
>>>>>> On Wed, Mar 05, 2014 at 08:39:10PM +0530, Anand Soni wrote:
>>>>>>> Thanks a lot Ryan!
>>>>>>> 
>>>>>>> I too, would want to have a single and nice application submitted
>>>>>>> rather than many. It was just out of interest that I was reading up on
>>>>>>> dual trees and yes, most of the literature that I found was from
>>>>>>> gatech. I also came across your paper on dual trees
>>>>>>> (http://arxiv.org/pdf/1304.4327.pdf ). Can you give me some more
>>>>>>> pointers where I can get a better understanding of dual trees?
>>>>>> 
>>>>>> There are lots of papers on dual-tree algorithms but the paper you
>>>>>> linked to is (to my knowledge) the only one that tries to describe
>>>>>> dual-tree algorithms in an abstract manner.  Here are some links to
>>>>>> other papers, but keep in mind that they focus on particular algorithms
>>>>>> and often don't devote very much space to describing exactly what a
>>>>>> dual-tree algorithm is:
>>>>>> 
>>>>>> A.G. Gray and A.W. Moore. "N-body problems in statistical learning."
>>>>>> Advances in Neural Information Processing Systems (2001): 521-527.
>>>>>> 
>>>>>> A.W. Moore.  "Nonparametric density estimation: toward computational
>>>>>> tractability."  Proceedings of the Third SIAM International Conference
>>>>>> on Data Mining (2003).
>>>>>> 
>>>>>> A. Beygelzimer, S. Kakade, and J.L. Langford.  "Cover trees for nearest
>>>>>> neighbor."  Proceedings of the 23rd International Conference on Machine
>>>>>> Learning (2006).
>>>>>> 
>>>>>> P. Ram, D. Lee, W.B. March, A.G. Gray.  "Linear-time algorithms for
>>>>>> pairwise statistical problems."  Advances in Neural Information
>>>>>> Processing Systems (2009).
>>>>>> 
>>>>>> W.B. March, P. Ram, A.G. Gray.  "Fast Euclidean minimum spanning tree:
>>>>>> algorithm, analysis, and applications."  Proceedings of the 16th ACM
>>>>>> SIGKDD International Conference on Knowledge Discovery and Data Mining
>>>>>> (2010).
>>>>>> 
>>>>>> R.R. Curtin, P. Ram.  "Dual-tree fast exact max-kernel search." (this
>>>>>> one hasn't been published yet...
>>>>>> http://www.ratml.org/pub/pdf/2013fastmks.pdf ).
>>>>>> 
>>>>>> I know that's a lot of references and probably way more than you want to
>>>>>> read, so don't feel obligated to read anything, but it will probably
>>>>>> help explain exactly what a dual-tree algorithm is... I hope!  I can
>>>>>> link to more papers too, if you want...
>>>>>> 
>>>>>>> But, of course, I am more willing to work on automatic benchmarking,
>>>>>>> on which I had a little talk with Marcus and I am brewing ideas.
>>>>>> 
>>>>>> Ok, sounds good.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Ryan
>>>>>> 
>>>>>> --
>>>>>> Ryan Curtin    | "Somebody dropped a bag on the sidewalk."
>>>>>> ryan at ratml.org |   - Kit
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Anand Soni | Junior Undergraduate | Department of Computer Science &
>>>>> Engineering | IIT Bombay | India
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Anand Soni | Junior Undergraduate | Department of Computer Science &
>>> Engineering | IIT Bombay | India
>> 
> 
> 
> 
> -- 
> Anand Soni | Junior Undergraduate | Department of Computer Science &
> Engineering | IIT Bombay | India


