[mlpack] GSoC 2014 : Introduction and Interests

Anand Soni anand.92.soni at gmail.com
Wed Mar 12 08:32:51 EDT 2014


Hi Marcus,

No, we are not avoiding multi-class metrics. In fact, I will take both
multi-class metrics (for multi-class classifiers) and binary metrics
(for both binary and multi-class classifiers) into consideration while
implementing.

As far as I know, the Matthews correlation coefficient is also a
measure for two-class classification. It is calculated from the
numbers of true and false positives and negatives, which makes sense
only for a binary classifier. However, since I plan to convert
multi-class classification data to two-class data (as described in my
last mail), I can very well use this metric too.
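
To make this concrete, here is a minimal sketch (in Python, since the
benchmark scripts are Python) of computing MCC from the four entries of
a binary confusion matrix; the function name and signature are only
illustrative, not existing benchmark code:

import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient from binary confusion counts.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # undefined case; returning 0 is a common convention
    return (tp * tn - fp * fn) / denom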

And thanks for answering my query about the code base and data. I
think things will become clearer once I start working on the code
base and the metric implementations. The only question that remains
now is: which metrics will I use? I will let you know soon.

I just wanted to discuss things enough to make certain details clear
before submitting the proposal. I hope this is fine.

Regards.
Anand

On Wed, Mar 12, 2014 at 3:38 PM, Marcus Edel <marcus.edel at fu-berlin.de> wrote:
> Hello Anand,
>
>
> Both of these methods are feasible. There are other, more complex
> methods, but I prefer one of the above two.
>
>
> Choose the way you like the most, but is there any reason why you avoid
> multi-class metrics such as the Matthews correlation coefficient (MCC)?
>
> Also, I would like to know and discuss how I will apply these tests
> (once implemented) to the required algorithms. Will I have results
> available from all the algorithms already run on several datasets, or
> will I have to run them, generate result data, and then apply the
> metrics?
> If the latter is the case, will I have a code base and datasets to start
> running the algorithms?
>
>
> Currently we measure only the runtime for several algorithms and datasets.
> The runtime information is stored in an SQLite database. To apply the metrics,
> you will have to extend the existing code to get the required information. We
> use the following code to extract the runtime information for the mlpack Naive
> Bayes Classifier:
>
> https://github.com/zoq/benchmarks/blob/master/methods/mlpack/nbc.py
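>
> As a rough illustration of the kind of extension meant here (the file
> names and the helper below are only an assumption, not the actual
> benchmark API), a metric could be computed from the labels a classifier
> run writes out, instead of recording just the runtime:
>
> import numpy as np
>
> def accuracy_from_files(true_labels_file, predicted_labels_file):
>     # Hypothetical helper: compare ground-truth and predicted labels.
>     truth = np.genfromtxt(true_labels_file, delimiter=',')
>     predictions = np.genfromtxt(predicted_labels_file, delimiter=',')
>     return float(np.mean(truth == predictions))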
>
> But there are already a codebase and several datasets available. As I said
> before, maybe you can implement more classifiers.
>
> Hope that helps!
>
> Thanks,
>
> Marcus
>
>
> On 11 Mar 2014, at 21:51, Anand Soni <anand.92.soni at gmail.com> wrote:
>
> Hello Marcus,
>
> Thanks for going through the paper. I plan to implement many of the
> metrics mentioned in the paper. As far as the binary classification
> metrics are concerned, I have the following two approaches in mind for
> converting a multi-class classification problem into a binary one:
>
> a) Given a multi-class problem with 'k' classes, we label each class
> Ci for 'i' in 1 to 'k'. Now, consider a particular class with label
> Ci: all examples in Ci are considered positive and all others
> negative. We do this for each class and end up with 'k' binary
> (one-vs-all) hypotheses which need to be combined.
>
> b) Another approach is to pick each pair of classes from the 'k'
> classes, taking one as the positive class and the other as the
> negative class (one-vs-one). Here, we end up with C(k,2) hypotheses
> which need to be combined.
>
> Both of these methods are feasible. There are other, more complex
> methods, but I prefer one of the above two.
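>
> To sketch what I have in mind (hypothetical helper names, assuming
> numpy-style label vectors as used in the Python benchmark scripts):
>
> import itertools
> import numpy as np
>
> def one_vs_all_labels(labels):
>     # Approach (a): for each class c, positives are the samples of c.
>     return {c: (labels == c).astype(int) for c in np.unique(labels)}
>
> def one_vs_one_splits(labels):
>     # Approach (b): for each pair (a, b), keep only their samples and
>     # mark class a as positive.
>     splits = {}
>     for a, b in itertools.combinations(np.unique(labels), 2):
>         mask = (labels == a) | (labels == b)
>         splits[(a, b)] = (mask, (labels[mask] == a).astype(int))
>     return splits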
>
> Also, I would like to know and discuss how I will apply these tests
> (once implemented) to the required algorithms. Will I have results
> available from all the algorithms already run on several datasets, or
> will I have to run them, generate result data, and then apply the
> metrics?
> If the latter is the case, will I have a code base and datasets to start
> running the algorithms?
>
> Thanks and regards,
> Anand
>
> On Wed, Mar 12, 2014 at 1:18 AM, Marcus Edel <marcus.edel at fu-berlin.de>
> wrote:
>
> Hello,
>
> I've gone through the paper, and I think in our case all metrics except the
> ROC could be implemented to measure the performance. But keep in mind that
> some of the metrics can only handle binary classification problems. As
> mentioned in the paper, one possible solution is to transform the data into
> binary classification problems; another solution is to use multi-class
> metrics.
>
> Regards,
>
> Marcus
>
>
> On 10 Mar 2014, at 19:34, Anand Soni <anand.92.soni at gmail.com> wrote:
>
> Marcus,
>
> I was talking about the following paper from Cornell University:
>
> http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf
>
> I want my implementations to be based on this paper and possibly some
> other ideas. The paper points to some standard metrics too. I would
> like to use some (or all) of them, depending on feasibility. Can
> you have a look at the metrics and tell me if some of them are
> irrelevant for us?
>
> Also, I will look at the classifiers you have pointed me to. Thanks a lot!
>
> Regards.
>
> Anand Soni
>
> On Mon, Mar 10, 2014 at 11:24 PM, Marcus Edel <marcus.edel at fu-berlin.de>
> wrote:
>
> Hello,
>
> I have been studying benchmarking and performance analysis of machine
> learning algorithms and came across an interesting idea in a research
> paper.
>
>
> Can you point us to the paper?
>
> So, one of the things that I propose for this project is that we
> implement, say, k metrics and perform a bootstrap analysis for the
> given algorithms over these k metrics. By this, we will have a good
> idea of how probable it is for an algorithm to perform "well" across
> various metrics.
>
>
> Yes, that seems reasonable.
>
> I have not yet decided on the metrics to use, but I am working on
> that.
>
>
> I think we should offer some standard metrics, and the class should also be
> templatized in such a way that the user can easily implement their own
> metrics or choose between different metrics.
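>
> On the benchmark side, the Python analogue of this idea would be to let
> the run scripts take the metric as a parameter; the names below are
> purely a hypothetical sketch:
>
> def evaluate(truth, predictions, metric):
>     # 'metric' is any user-supplied callable, e.g. accuracy or MCC.
>     return metric(truth, predictions)
>
> def accuracy(truth, predictions):
>     # Fraction of predictions that match the ground truth.
>     return sum(t == p for t, p in zip(truth, predictions)) / float(len(truth))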
>
> I would like to have comments and feedback on the idea. Also, it
> would be great if you could tell me which algorithms/tools we will be
> comparing for performance in this project. I can give more rigorous
> details in the proposal.
>
>
>
> Currently there are a few classifiers in the mlpack/benchmark system (linear
> regression, logistic regression, least angle regression, naive Bayes
> classifier, etc.).
>
> The following link lists the currently available methods in mlpack:
>
> http://mlpack.org/doxygen.php
>
> So maybe it's a good idea to include some additional classifiers from
> shogun, weka, scikit, etc.
>
> http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
> http://www.shogun-toolbox.org/page/features/
> http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html
>
> I hope that helps.
>
> Thanks,
>
> Marcus
>
> On 10 Mar 2014, at 17:56, Anand Soni <anand.92.soni at gmail.com> wrote:
>
> Hi Marcus and Ryan,
>
> I have been studying benchmarking and performance analysis of machine
> learning algorithms and came across an interesting idea in a research
> paper.
>
> Suppose we need to compare 'n' algorithms for performance. (I need
> more information about the algorithms that will be involved in this
> project.) Also, suppose I have 'k' performance metrics. Obviously, we
> must not infer anything by looking at an algorithm's performance on
> just one metric.
>
> For example, in one of my projects where I did sentiment analysis
> using ANNs (artificial neural networks), I got good accuracy while
> the precision/recall figures were poor. This means there is no single
> "best algorithm"; it all depends on the metrics used.
>
> So, one of the things that I propose for this project is that we
> implement, say, k metrics and perform a bootstrap analysis for the
> given algorithms over these k metrics. By this, we will have a good
> idea of how probable it is for an algorithm to perform "well" across
> various metrics.
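>
> As a small sketch of the bootstrap step (a hypothetical helper, assuming
> we already have one metric score per dataset for a given algorithm):
>
> import numpy as np
>
> def bootstrap_metric(scores, n_resamples=1000, seed=0):
>     # Resample per-dataset scores with replacement and report the mean
>     # together with a 95% percentile interval.
>     rng = np.random.RandomState(seed)
>     scores = np.asarray(scores, dtype=float)
>     means = [rng.choice(scores, size=len(scores), replace=True).mean()
>              for _ in range(n_resamples)]
>     return np.mean(means), np.percentile(means, [2.5, 97.5])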
>
> I have not yet decided on the metrics to use, but I am working on
> that. I would like to have comments and feedback on the idea. Also, it
> would be great if you could tell me which algorithms/tools we will be
> comparing for performance in this project. I can give more rigorous
> details in the proposal.
>
> Regards.
>
> Anand Soni
>
> On Thu, Mar 6, 2014 at 10:08 PM, Ryan Curtin <gth671b at mail.gatech.edu>
> wrote:
>
> On Wed, Mar 05, 2014 at 08:39:10PM +0530, Anand Soni wrote:
>
> Thanks a lot Ryan!
>
> I, too, would rather have a single, well-prepared application submitted
> than many. It was just out of interest that I was reading up on
> dual trees, and yes, most of the literature that I found was from
> Georgia Tech. I also came across your paper on dual trees
> (http://arxiv.org/pdf/1304.4327.pdf ). Can you give me some more
> pointers to where I can get a better understanding of dual trees?
>
>
> There are lots of papers on dual-tree algorithms but the paper you
> linked to is (to my knowledge) the only one that tries to describe
> dual-tree algorithms in an abstract manner.  Here are some links to
> other papers, but keep in mind that they focus on particular algorithms
> and often don't devote very much space to describing exactly what a
> dual-tree algorithm is:
>
> A.G. Gray and A.W. Moore. "N-body problems in statistical learning."
> Advances in Neural Information Processing Systems (2001): 521-527.
>
> A.W. Moore.  "Nonparametric density estimation: toward computational
> tractability."  Proceedings of the Third SIAM International Conference
> on Data Mining (2003).
>
> A. Beygelzimer, S. Kakade, and J.L. Langford.  "Cover trees for nearest
> neighbor."  Proceedings of the 23rd International Conference on Machine
> Learning (2006).
>
> P. Ram, D. Lee, W.B. March, A.G. Gray.  "Linear-time algorithms for
> pairwise statistical problems."  Advances in Neural Information
> Processing Systems (2009).
>
> W.B. March, P. Ram, A.G. Gray.  "Fast Euclidean minimum spanning tree:
> algorithm, analysis, and applications."  Proceedings of the 16th ACM
> SIGKDD International Conference on Knowledge Discovery and Data Mining
> (2010).
>
> R.R. Curtin, P. Ram.  "Dual-tree fast exact max-kernel search." (this
> one hasn't been published yet...
> http://www.ratml.org/pub/pdf/2013fastmks.pdf ).
>
> I know that's a lot of references and probably way more than you want to
> read, so don't feel obligated to read anything, but it will probably
> help explain exactly what a dual-tree algorithm is... I hope!  I can
> link to more papers too, if you want...
>
> But, of course, I am more willing to work on automatic benchmarking,
> about which I had a brief talk with Marcus, and I am brewing ideas.
>
>
> Ok, sounds good.
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin    | "Somebody dropped a bag on the sidewalk."
> ryan at ratml.org |   - Kit
>
>
>
>
> --
> Anand Soni | Junior Undergraduate | Department of Computer Science &
> Engineering | IIT Bombay | India
>
>
>
>
>
> --
> Anand Soni | Junior Undergraduate | Department of Computer Science &
> Engineering | IIT Bombay | India
>
>
>
>
>
> --
> Anand Soni | Junior Undergraduate | Department of Computer Science &
> Engineering | IIT Bombay | India
>
>



-- 
Anand Soni | Junior Undergraduate | Department of Computer Science &
Engineering | IIT Bombay | India


