[mlpack] GSoC 2014 : Introduction and Interests

Anand Soni anand.92.soni at gmail.com
Tue Mar 11 16:51:23 EDT 2014


Hello Marcus,

Thanks for going through the paper. I plan to implement many of the
metrics mentioned in it. As far as the binary classification metrics
are concerned, I have the following two approaches in mind for
converting a multi-class classification problem into binary ones:

a) Given a multi-class problem with 'k' classes, label each class Ci
for 'i' in 1 to 'k'. Now, for a particular class Ci, consider all
examples in Ci as positive and all others as negative. Doing this for
each class gives 'k' binary hypotheses (one-vs-rest) which then need
to be combined.

b) Alternatively, pick two of the 'k' classes, take one as the
positive class and the other as the negative class, and ignore the
rest. Here we end up with C(k,2) = k(k-1)/2 binary hypotheses
(one-vs-one) which need to be combined.

Both of these methods are feasible. There are other, more complex
methods, but I would prefer one of the above two; a small sketch of
both conversions is below.
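
To make both schemes concrete, here is a minimal sketch of the label
conversions (plain numpy, illustrative names only; this is not mlpack
or benchmark-system code):

    import numpy as np
    from itertools import combinations

    def one_vs_rest(labels, positive_class):
        # Approach (a): the chosen class is positive (1), all others negative (0).
        return (labels == positive_class).astype(int)

    def one_vs_one(labels, class_a, class_b):
        # Approach (b): keep only the two chosen classes; class_a is positive (1).
        mask = (labels == class_a) | (labels == class_b)
        return (labels[mask] == class_a).astype(int)

    labels = np.array([0, 2, 1, 2, 0, 1])
    classes = np.unique(labels)
    # 'k' one-vs-rest problems:
    ovr = [one_vs_rest(labels, c) for c in classes]
    # k(k-1)/2 one-vs-one problems:
    ovo = [one_vs_one(labels, a, b) for a, b in combinations(classes, 2)]

Each binary problem would then be scored with the binary metric, and
the resulting scores combined, e.g. by averaging.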

Also, I would like to discuss how I will apply these metrics (once
implemented) to the required algorithms. Will results from all the
algorithms, already run on several datasets, be available, or will I
have to run the algorithms and generate the result data myself before
applying the metrics? If the latter is the case, will I have a code
base and datasets to start running the algorithms?
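
For instance, once an algorithm's predictions and the true labels for
a dataset are available as files (the file names below are only
placeholders), applying a binary metric reduces to something like:

    import numpy as np

    # Placeholder file names; the actual result format is still open.
    truth = np.genfromtxt("labels.csv", dtype=int)
    predicted = np.genfromtxt("predictions.csv", dtype=int)

    tp = np.sum((predicted == 1) & (truth == 1))
    accuracy = np.mean(truth == predicted)
    precision = float(tp) / max(np.sum(predicted == 1), 1)
    recall = float(tp) / max(np.sum(truth == 1), 1)

so the main open point for me is where that result data will come from.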

Thanks and regards,
Anand

On Wed, Mar 12, 2014 at 1:18 AM, Marcus Edel <marcus.edel at fu-berlin.de> wrote:
> Hello,
>
> I've gone through the paper, and I think in our case all metrics except the ROC could be implemented to measure performance. But keep in mind that some of the metrics can only handle binary classification problems. As mentioned in the paper, one possible solution is to transform the data into binary classification problems; another is to use multi-class metrics.
>
> Regards,
>
> Marcus
>
>
> On 10 Mar 2014, at 19:34, Anand Soni <anand.92.soni at gmail.com> wrote:
>
>> Marcus,
>>
>> I was talking about the following paper from Cornell University:
>>
>> http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf
>>
>> I want my implementations to be based on this paper and possibly some
>> other ideas. The paper points to some standard metrics too. I would
>> like to use some (or all) of them depending on the feasibility. Can
>> you have a look at the metrics and tell me if some of them are
>> irrelevant for us?
>>
>> Also, I will look at the classifiers you have pointed me to. Thanks a lot!
>>
>> Regards.
>>
>> Anand Soni
>>
>> On Mon, Mar 10, 2014 at 11:24 PM, Marcus Edel <marcus.edel at fu-berlin.de> wrote:
>>> Hello,
>>>
>>>> I was studying benchmarking and performance analysis of machine
>>>> learning algorithms and came across an interesting idea in a research
>>>> paper.
>>>
>>> Can you point us to the paper?
>>>
>>>> So, one of the things that I propose for this project is that we
>>>> implement, say, k metrics and perform a bootstrap analysis for the
>>>> given algorithms over these k metrics. By this, we will have a good
>>>> idea about how probable it is for an algorithm to perform "well" given
>>>> various metrics.
>>>
>>> Yes, that seems reasonable.
>>>
>>>> I have not yet decided on the metrics to use, but I am working on
>>>> that.
>>>
>>> I think we should offer some standard metrics, and the class should also be templatized in such a way that the user can easily implement their own metrics or choose among different ones.
>>>
>>>> I would like to have comments and feedback on the idea. Also, it
>>>> would be great if you can tell me the algorithms/tools that we will be
>>>> comparing for performance in the project. I can give more rigorous
>>>> details in the proposal.
>>>
>>>
>>> Currently there are a few classifiers in the mlpack/benchmark system (linear regression, logistic regression, least angle regression, naive Bayes classifier, etc.).
>>>
>>> The following link lists the currently available methods in mlpack:
>>>
>>> http://mlpack.org/doxygen.php
>>>
>>> So maybe it's a good idea to include some additional classifiers from shogun, weka, scikit, etc.
>>>
>>> http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
>>> http://www.shogun-toolbox.org/page/features/
>>> http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html
>>>
>>> I hope that helps.
>>>
>>> Thanks,
>>>
>>> Marcus
>>>
>>> On 10 Mar 2014, at 17:56, Anand Soni <anand.92.soni at gmail.com> wrote:
>>>
>>>> Hi Marcus and Ryan,
>>>>
>>>> I was studying benchmarking and performance analysis of machine
>>>> learning algorithms and came across an interesting idea in a research
>>>> paper.
>>>>
>>>> Suppose we need to compare 'n' algorithms for performance. (I need
>>>> more information about the algorithms that will be involved in this
>>>> project). Also, suppose I have 'k' performance metrics. Obviously we
>>>> must not infer too much from an algorithm's performance on just one
>>>> metric.
>>>>
>>>> For example, in one of my projects where I did sentiment analysis
>>>> using ANNs (artificial neural networks), I got good accuracy while
>>>> the precision/recall figures were poor. This shows there is no single
>>>> "best" algorithm; it all depends on the metrics used.
>>>>
>>>> So, one of the things that I propose for this project is that we
>>>> implement, say, k metrics and perform a bootstrap analysis for the
>>>> given algorithms over these k metrics. By this, we will have a good
>>>> idea about how probable it is for an algorithm to perform "well" given
>>>> various metrics.
>>>>
>>>> I have not yet decided on the metrics to use, but I am working on
>>>> that. I would like to have comments and feedback on the idea. Also, it
>>>> would be great if you can tell me the algorithms/tools that we will be
>>>> comparing for performance in the project. I can give more rigorous
>>>> details in the proposal.
>>>>
>>>> Regards.
>>>>
>>>> Anand Soni
>>>>
>>>> On Thu, Mar 6, 2014 at 10:08 PM, Ryan Curtin <gth671b at mail.gatech.edu> wrote:
>>>>> On Wed, Mar 05, 2014 at 08:39:10PM +0530, Anand Soni wrote:
>>>>>> Thanks a lot Ryan!
>>>>>>
>>>>>> I, too, would rather submit a single, well-prepared application
>>>>>> than many. It was just out of interest that I was reading up on
>>>>>> dual trees and yes, most of the literature that I found was from
>>>>>> gatech. I also came across your paper on dual trees
>>>>>> (http://arxiv.org/pdf/1304.4327.pdf ). Can you give me some more
>>>>>> pointers where I can get a better understanding of dual trees?
>>>>>
>>>>> There are lots of papers on dual-tree algorithms but the paper you
>>>>> linked to is (to my knowledge) the only one that tries to describe
>>>>> dual-tree algorithms in an abstract manner.  Here are some links to
>>>>> other papers, but keep in mind that they focus on particular algorithms
>>>>> and often don't devote very much space to describing exactly what a
>>>>> dual-tree algorithm is:
>>>>>
>>>>> A.G. Gray and A.W. Moore. "N-body problems in statistical learning."
>>>>> Advances in Neural Information Processing Systems (2001): 521-527.
>>>>>
>>>>> A.W. Moore.  "Nonparametric density estimation: toward computational
>>>>> tractability."  Proceedings of the Third SIAM International Conference
>>>>> on Data Mining (2003).
>>>>>
>>>>> A. Beygelzimer, S. Kakade, and J.L. Langford.  "Cover trees for nearest
>>>>> neighbor."  Proceedings of the 23rd International Conference on Machine
>>>>> Learning (2006).
>>>>>
>>>>> P. Ram, D. Lee, W.B. March, A.G. Gray.  "Linear-time algorithms for
>>>>> pairwise statistical problems."  Advances in Neural Information
>>>>> Processing Systems (2009).
>>>>>
>>>>> W.B. March, P. Ram, A.G. Gray.  "Fast Euclidean minimum spanning tree:
>>>>> algorithm, analysis, and applications."  Proceedings of the 16th ACM
>>>>> SIGKDD International Conference on Knowledge Discovery and Data Mining
>>>>> (2010).
>>>>>
>>>>> R.R. Curtin, P. Ram.  "Dual-tree fast exact max-kernel search." (this
>>>>> one hasn't been published yet...
>>>>> http://www.ratml.org/pub/pdf/2013fastmks.pdf ).
>>>>>
>>>>> I know that's a lot of references and probably way more than you want to
>>>>> read, so don't feel obligated to read anything, but it will probably
>>>>> help explain exactly what a dual-tree algorithm is... I hope!  I can
>>>>> link to more papers too, if you want...
>>>>>
>>>>>> But, of course, I am more inclined to work on automatic benchmarking,
>>>>>> about which I had a short talk with Marcus, and I am brewing ideas.
>>>>>
>>>>> Ok, sounds good.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Ryan
>>>>>
>>>>> --
>>>>> Ryan Curtin    | "Somebody dropped a bag on the sidewalk."
>>>>> ryan at ratml.org |   - Kit
>>>>
>>>>
>>>>
>>>> --
>>>> Anand Soni | Junior Undergraduate | Department of Computer Science &
>>>> Engineering | IIT Bombay | India
>>>
>>
>>
>>
>> --
>> Anand Soni | Junior Undergraduate | Department of Computer Science &
>> Engineering | IIT Bombay | India
>



-- 
Anand Soni | Junior Undergraduate | Department of Computer Science &
Engineering | IIT Bombay | India


