[mlpack] GSoC 2014 : Introduction and Interests

Anand Soni anand.92.soni at gmail.com
Mon Mar 10 14:34:52 EDT 2014


Marcus,

I was talking about the following paper from Cornell University:

http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf

I want to base my implementations on this paper and possibly some
other ideas. The paper also points to some standard metrics; I would
like to use some (or all) of them, depending on feasibility. Could
you have a look at the metrics and tell me if any of them are
irrelevant for us?

Also, I will look at the classifiers you have pointed me to. Thanks a lot!

Regards.

Anand Soni

On Mon, Mar 10, 2014 at 11:24 PM, Marcus Edel <marcus.edel at fu-berlin.de> wrote:
> Hello,
>
>> I was studying benchmarking and performance analysis of machine
>> learning algorithms and came across an interesting idea in a research
>> paper.
>
> Can you point us to the paper?
>
>> So, one of the things that I propose for this project is that we
>> implement, say, k metrics and perform a bootstrap analysis for the
>> given algorithms over these k metrics. By this, we will have a good
>> idea about how probable it is for an algorithm to perform "well" given
>> various metrics.
>
> Yes, that seems reasonable.
>
>> I have not yet decided on the metrics to use, but I am working on
>> that.
>
> I think we should offer some standard metrics, and the class should also
> be templatized in such a way that the user can easily implement their own
> metrics or choose between different metrics.
>
>> I would like to have comments and feedback on the idea. Also, it
>> would be great if you can tell me the algorithms/tools that we will be
>> comparing for performance in the project. I can give more rigorous
>> details in the proposal.
>
>
> Currently there are a few classifiers in the mlpack/benchmark system
> (linear regression, logistic regression, least angle regression, naive
> Bayes classifier, etc.).
>
> The following link lists the currently available methods in mlpack:
>
> http://mlpack.org/doxygen.php
>
> So maybe it's a good idea to include some additional classifiers from
> Shogun, Weka, scikit-learn, etc.
>
> http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
> http://www.shogun-toolbox.org/page/features/
> http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html
>
> I hope that helps.
>
> Thanks,
>
> Marcus
>
> On 10 Mar 2014, at 17:56, Anand Soni <anand.92.soni at gmail.com> wrote:
>
>> Hi Marcus and Ryan,
>>
>> I was studying benchmarking and performance analysis of machine
>> learning algorithms and came across an interesting idea in a research
>> paper.
>>
>> Suppose we need to compare 'n' algorithms for performance. (I need
>> more information about the algorithms that will be involved in this
>> project). Also, suppose I have 'k' performance metrics. Obviously, we
>> should not infer anything from an algorithm's performance on just one
>> metric.
>>
>> For example, in one of my projects, where I did sentiment analysis
>> using ANNs (artificial neural networks), I got good accuracy while the
>> precision and recall figures were poor. This means there is no "best
>> algorithm"; it all depends on the metrics used.
>>
>> So, one of the things that I propose for this project is that we
>> implement, say, k metrics and perform a bootstrap analysis for the
>> given algorithms over these k metrics. By this, we will have a good
>> idea about how probable it is for an algorithm to perform "well" given
>> various metrics.
>>
>> I have not yet decided on the metrics to use, but I am working on
>> that. I would like to have comments and feedback on the idea. Also, it
>> would be great if you can tell me the algorithms/tools that we will be
>> comparing for performance in the project. I can give more rigorous
>> details in the proposal.
>>
>> Regards.
>>
>> Anand Soni
>>
>> On Thu, Mar 6, 2014 at 10:08 PM, Ryan Curtin <gth671b at mail.gatech.edu> wrote:
>>> On Wed, Mar 05, 2014 at 08:39:10PM +0530, Anand Soni wrote:
>>>> Thanks a lot Ryan!
>>>>
>>>> I, too, would rather submit a single, strong application than many.
>>>> It was just out of interest that I was reading up on
>>>> dual trees and yes, most of the literature that I found was from
>>>> gatech. I also came across your paper on dual trees
>>>> (http://arxiv.org/pdf/1304.4327.pdf). Can you give me some more
>>>> pointers where I can get a better understanding of dual trees?
>>>
>>> There are lots of papers on dual-tree algorithms but the paper you
>>> linked to is (to my knowledge) the only one that tries to describe
>>> dual-tree algorithms in an abstract manner.  Here are some links to
>>> other papers, but keep in mind that they focus on particular algorithms
>>> and often don't devote very much space to describing exactly what a
>>> dual-tree algorithm is:
>>>
>>> A.G. Gray and A.W. Moore. "N-body problems in statistical learning."
>>> Advances in Neural Information Processing Systems (2001): 521-527.
>>>
>>> A.W. Moore.  "Nonparametric density estimation: toward computational
>>> tractability."  Proceedings of the Third SIAM International Conference
>>> on Data Mining (2003).
>>>
>>> A. Beygelzimer, S. Kakade, and J.L. Langford.  "Cover trees for nearest
>>> neighbor."  Proceedings of the 23rd International Conference on Machine
>>> Learning (2006).
>>>
>>> P. Ram, D. Lee, W.B. March, A.G. Gray.  "Linear-time algorithms for
>>> pairwise statistical problems."  Advances in Neural Information
>>> Processing Systems (2009).
>>>
>>> W.B. March, P. Ram, A.G. Gray.  "Fast Euclidean minimum spanning tree:
>>> algorithm, analysis, and applications."  Proceedings of the 16th ACM
>>> SIGKDD International Conference on Knowledge Discovery and Data Mining
>>> (2010).
>>>
>>> R.R. Curtin, P. Ram.  "Dual-tree fast exact max-kernel search." (this
>>> one hasn't been published yet...
>>> http://www.ratml.org/pub/pdf/2013fastmks.pdf).
>>>
>>> I know that's a lot of references and probably way more than you want to
>>> read, so don't feel obligated to read anything, but it will probably
>>> help explain exactly what a dual-tree algorithm is... I hope!  I can
>>> link to more papers too, if you want...
>>>
>>>> But, of course, I am more willing to work on automatic benchmarking,
>>>> on which I had a little talk with Marcus and I am brewing ideas.
>>>
>>> Ok, sounds good.
>>>
>>> Thanks,
>>>
>>> Ryan
>>>
>>> --
>>> Ryan Curtin    | "Somebody dropped a bag on the sidewalk."
>>> ryan at ratml.org |   - Kit
>>
>>
>>
>> --
>> Anand Soni | Junior Undergraduate | Department of Computer Science &
>> Engineering | IIT Bombay | India
>



-- 
Anand Soni | Junior Undergraduate | Department of Computer Science &
Engineering | IIT Bombay | India


