[mlpack] GSoC 2014 : Introduction and Interests

Anand Soni anand.92.soni at gmail.com
Thu Mar 13 06:05:55 EDT 2014


Hi Marcus,

I have decided on the metrics to be implemented. For the most part, I will
be implementing and testing the following metrics:

a) Accuracy
b) F-measure
c) Lift
d) Precision
e) Recall
f) Matthews correlation coefficient (which turned out to be a good
suggestion from you)
g) Mean squared error
h) Mean predictive information (very similar to cross-entropy, but
more easily interpreted)

So, there are 8 metrics in total. The last two are probabilistic
measures. I will talk about these metrics in detail in the proposal.
Please tell me if you want something to be added or removed.
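
To make this concrete, here is a rough Python sketch (purely illustrative,
not code from mlpack or the benchmark system; the function names are my own)
of how the confusion-matrix-based metrics above could be computed:

    import math

    def confusion(true_labels, predicted_labels, positive=1):
        # Count true/false positives and negatives for a binary problem.
        tp = fp = tn = fn = 0
        for t, p in zip(true_labels, predicted_labels):
            if p == positive and t == positive:
                tp += 1
            elif p == positive:
                fp += 1
            elif t == positive:
                fn += 1
            else:
                tn += 1
        return tp, fp, tn, fn

    def accuracy(tp, fp, tn, fn):
        return (tp + tn) / float(tp + fp + tn + fn)

    def precision(tp, fp):
        return tp / float(tp + fp) if (tp + fp) > 0 else 0.0

    def recall(tp, fn):
        return tp / float(tp + fn) if (tp + fn) > 0 else 0.0

    def f_measure(tp, fp, fn):
        p, r = precision(tp, fp), recall(tp, fn)
        return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

    def mcc(tp, fp, tn, fn):
        # Matthews correlation coefficient; taken as 0 when the denominator is 0.
        denom = math.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
        return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

    def mean_squared_error(true_values, predicted_values):
        n = len(true_values)
        return sum((t - p) ** 2
                   for t, p in zip(true_values, predicted_values)) / float(n)

    # Example usage:
    # tp, fp, tn, fn = confusion([1, 0, 1, 1], [1, 0, 0, 1])
    # print(accuracy(tp, fp, tn, fn), f_measure(tp, fp, fn), mcc(tp, fp, tn, fn))

Lift and mean predictive information need a little more than the confusion
matrix (the base rate and the predicted probabilities, respectively), but
they would plug into the same structure.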

Also, does mlpack already have a classifier based on feed-forward
neural networks and the backpropagation algorithm? If not, and if time
permits, I would like to add one as a classifier for mlpack.

Thanks a lot.

Regards,
Anand

On Wed, Mar 12, 2014 at 6:02 PM, Anand Soni <anand.92.soni at gmail.com> wrote:
> Hi Marcus,
>
> No, we are not avoiding multi-class metrics. In fact, I will take both
> multi-class metrics (for multi-class classifiers) and binary metrics
> (for both binary and multi-class classifiers) into consideration while
> implementing.
>
> As far as I know, the Matthews correlation coefficient is also a measure
> for two-class classification. It is calculated from the numbers of true
> and false positives and negatives, which only makes sense for a binary
> classifier. However, since I plan to convert multi-class classification
> data to two-class classification data (as described in my last mail), I
> can very well use this metric too.
>
> And thanks for answering my query about the code base and data. I
> think things will become clearer once I start working on the code base
> and the metric implementations. The only question that remains now is:
> which metrics will I use? I will let you know soon.
>
> I just wanted to discuss enough and make certain details clear before
> submitting the proposal. I hope this is fine.
>
> Regards.
> Anand
>
> On Wed, Mar 12, 2014 at 3:38 PM, Marcus Edel <marcus.edel at fu-berlin.de> wrote:
>> Hello Anand,
>>
>>
>> Both of these methods are feasible. There are other, more complex
>> methods, but I prefer one of the above two.
>>
>>
>> Choose the way you like the most, but is there any reason to avoid
>> multi-class metrics such as the Matthews correlation coefficient (MCC)?
>>
>> Also, I would like to know and discuss how I will apply these tests
>> (once implemented) to the required algorithms. Will I have results
>> already available from all the algorithms run on several datasets, or
>> will I have to run the algorithms and generate the result data and then
>> apply the metrics?
>> If the latter is the case, will I have a code base and datasets to start
>> running the algorithms?
>>
>>
>> Currently we only measure the runtime for several algorithms and datasets.
>> The runtime information is stored in an SQLite database. To apply the metrics,
>> you will have to extend the existing code to collect the required information.
>> We use the following code to extract the runtime information for the mlpack
>> Naive Bayes Classifier:
>>
>> https://github.com/zoq/benchmarks/blob/master/methods/mlpack/nbc.py
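>>
>> (As a purely hypothetical sketch, not the actual benchmark schema: per-run
>> metric values could be stored next to the runtimes using the standard
>> sqlite3 module. The table and column names below are made up.)
>>
>>     import sqlite3
>>
>>     conn = sqlite3.connect("benchmark.db")
>>     conn.execute("CREATE TABLE IF NOT EXISTS metric_results "
>>                  "(method TEXT, dataset TEXT, metric TEXT, value REAL)")
>>
>>     def save_metric(method, dataset, metric, value):
>>         # Store one metric value for one (method, dataset) run.
>>         conn.execute("INSERT INTO metric_results VALUES (?, ?, ?, ?)",
>>                      (method, dataset, metric, value))
>>         conn.commit()
>>
>>     # e.g. after running the Naive Bayes classifier on some dataset:
>>     save_metric("NBC", "some_dataset", "accuracy", 0.95)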
>>
>> But there is already a codebase and several datasets. As I said before,
>> maybe you can implement more classifiers.
>>
>> Hope that helps!
>>
>> Thanks,
>>
>> Marcus
>>
>>
>> On 11 Mar 2014, at 21:51, Anand Soni <anand.92.soni at gmail.com> wrote:
>>
>> Hello Marcus,
>>
>> Thanks for going through the paper. I plan to implement many of the
>> metrics mentioned in it. As far as the binary classification metrics
>> are concerned, I have the following two approaches in mind for
>> converting a multi-class classification problem to a binary one:
>>
>> a) Given a multi-class problem with 'k' classes, label each class Ci
>> for 'i' in 1 to 'k'. Now consider a particular class Ci: all examples
>> in Ci are treated as positive and all others as negative. Doing this
>> for each class yields 'k' hypotheses (one-vs-all), which then need to
>> be combined.
>>
>> b) Another approach is to pick two classes out of the 'k' classes and
>> treat one as the positive class and the other as the negative class.
>> Here we end up with kC2 hypotheses (one-vs-one), which need to be
>> combined.
>>
>> Both of these methods are feasible. There are other, more complex
>> methods, but I prefer one of the above two.
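>>
>> As a rough illustration of approach (a), the binarization step itself is
>> simple; the names below are made up, not taken from any existing code:
>>
>>     # Example multi-class labels (illustrative only).
>>     true_labels      = [0, 1, 2, 1, 0, 2]
>>     predicted_labels = [0, 2, 2, 1, 0, 1]
>>
>>     def one_vs_all(labels, positive_class):
>>         # Map multi-class labels to binary: 1 for the chosen class, 0 otherwise.
>>         return [1 if label == positive_class else 0 for label in labels]
>>
>>     # One binary problem per class; the k binary evaluations are then
>>     # combined (e.g. averaged) into a single multi-class score.
>>     classes = sorted(set(true_labels))
>>     binary_problems = [(one_vs_all(true_labels, c),
>>                         one_vs_all(predicted_labels, c)) for c in classes]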
>>
>> Also, I would like to know and discuss how I will apply these tests
>> (once implemented) to the required algorithms. Will I have results
>> already available from all the algorithms run on several datasets, or
>> will I have to run the algorithms and generate the result data and then
>> apply the metrics?
>> If the latter is the case, will I have a code base and datasets to start
>> running the algorithms?
>>
>> Thanks and regards,
>> Anand
>>
>> On Wed, Mar 12, 2014 at 1:18 AM, Marcus Edel <marcus.edel at fu-berlin.de>
>> wrote:
>>
>> Hello,
>>
>> I've gone through the paper, and I think that in our case all of the metrics
>> except the ROC could be implemented to measure performance. But keep in mind
>> that some of the metrics can only handle binary classification problems. As
>> mentioned in the paper, one possible solution is to transform the data into
>> binary classification problems; another is to use multi-class metrics.
>>
>> Regards,
>>
>> Marcus
>>
>>
>> On 10 Mar 2014, at 19:34, Anand Soni <anand.92.soni at gmail.com> wrote:
>>
>> Marcus,
>>
>> I was talking about the following paper from Cornell University:
>>
>> http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icml06.pdf
>>
>> I want my implementations to be based on this paper and possibly some
>> other ideas. The paper points to some standard metrics too. I would
>> like to use some (or all) of them, depending on feasibility. Can you
>> have a look at the metrics and tell me if any of them are irrelevant
>> for us?
>>
>> Also, I will look at the classifiers you have pointed me to. Thanks a lot!
>>
>> Regards.
>>
>> Anand Soni
>>
>> On Mon, Mar 10, 2014 at 11:24 PM, Marcus Edel <marcus.edel at fu-berlin.de>
>> wrote:
>>
>> Hello,
>>
>> I was studying on bench-marking and performance analysis of machine
>> learning algorithms and came across an interesting idea in a research
>> paper.
>>
>>
>> Can you point us to the paper?
>>
>> So, one of the things that I propose for this project is that we
>> implement, say, k metrics and perform a bootstrap analysis for the
>> given algorithms over these k metrics. This will give us a good idea
>> of how probable it is for an algorithm to perform "well" under
>> various metrics.
>>
>>
>> Yes, that seems reasonable.
>>
>> I have not yet decided on the metrics to use, but I am working on
>> that.
>>
>>
>> I think we should offer some standard metrics, and the class should also be
>> templatized in such a way that the user can easily implement their own
>> metrics or choose among different metrics.
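>>
>> (A rough Python illustration of that interchangeable-metric idea; the
>> actual design is of course up for discussion, and the names below are
>> made up. Any object exposing an evaluate(true, predicted) function could
>> be plugged into a benchmark run.)
>>
>>     class Accuracy(object):
>>         @staticmethod
>>         def evaluate(true_labels, predicted_labels):
>>             correct = sum(1 for t, p in zip(true_labels, predicted_labels)
>>                           if t == p)
>>             return correct / float(len(true_labels))
>>
>>     class ZeroOneLoss(object):
>>         @staticmethod
>>         def evaluate(true_labels, predicted_labels):
>>             return 1.0 - Accuracy.evaluate(true_labels, predicted_labels)
>>
>>     def run_metrics(true_labels, predicted_labels,
>>                     metrics=(Accuracy, ZeroOneLoss)):
>>         # Apply every chosen metric; users add their own by passing
>>         # another class with an evaluate() function.
>>         return dict((m.__name__, m.evaluate(true_labels, predicted_labels))
>>                     for m in metrics)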
>>
>> I would like to have comments and feedback on the idea. Also, it
>> would be great if you could tell me which algorithms/tools we will be
>> comparing for performance in the project. I can give more rigorous
>> details in the proposal.
>>
>>
>>
>> Currently there are a few classifiers in the mlpack/benchmark system (linear
>> regression, logistic regression, least angle regression, naive Bayes
>> classifier, etc.).
>>
>> The following link lists the currently available methods in mlpack:
>>
>> http://mlpack.org/doxygen.php
>>
>> So maybe it's a good idea to include some additional classifiers from
>> shogun, weka, scikit, etc.
>>
>> http://scikit-learn.org/stable/supervised_learning.html#supervised-learning
>> http://www.shogun-toolbox.org/page/features/
>> http://weka.sourceforge.net/doc.dev/weka/classifiers/Classifier.html
>>
>> I hope that helps.
>>
>> Thanks,
>>
>> Marcus
>>
>> On 10 Mar 2014, at 17:56, Anand Soni <anand.92.soni at gmail.com> wrote:
>>
>> Hi Marcus and Ryan,
>>
>> I was studying on bench-marking and performance analysis of machine
>> learning algorithms and came across an interesting idea in a research
>> paper.
>>
>> Suppose we need to compare 'n' algorithms for performance (I need
>> more information about the algorithms that will be involved in this
>> project). Also, suppose I have 'k' performance metrics. Obviously, we
>> must not infer anything by looking at an algorithm's performance based
>> on just one metric.
>>
>> For example, in one of my projects where I did sentiment analysis
>> using ANNs (artificial neural networks), I got good accuracy while
>> the precision/recall figures were poor. This means there is no single
>> "best algorithm"; it all depends on the metrics used.
>>
>> So, one of the things that I propose for this project is that we
>> implement, say, k metrics and perform a bootstrap analysis for the
>> given algorithms over these k metrics. This will give us a good idea
>> of how probable it is for an algorithm to perform "well" under
>> various metrics.
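>>
>> A rough sketch of what such a bootstrap analysis could look like (the data
>> and names below are purely illustrative):
>>
>>     import random
>>
>>     # scores maps algorithm name -> list of metric values, one per
>>     # (dataset, metric) pair, where higher is taken to mean better.
>>     def bootstrap_best_probability(scores, trials=1000):
>>         # Estimate, for each algorithm, the probability that it has the
>>         # best mean score when the results are resampled with replacement.
>>         algorithms = list(scores)
>>         n = len(next(iter(scores.values())))
>>         wins = dict((a, 0) for a in algorithms)
>>         for _ in range(trials):
>>             sample = [random.randrange(n) for _ in range(n)]
>>             means = dict((a, sum(scores[a][i] for i in sample) / float(n))
>>                          for a in algorithms)
>>             wins[max(means, key=means.get)] += 1
>>         return dict((a, wins[a] / float(trials)) for a in algorithms)
>>
>>     # e.g. two algorithms evaluated on the same four (dataset, metric) pairs:
>>     scores = {"nbc":      [0.91, 0.72, 0.88, 0.64],
>>               "logistic": [0.89, 0.79, 0.85, 0.70]}
>>     print(bootstrap_best_probability(scores))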
>>
>> I have not yet decided on the metrics to use, but I am working on
>> that. I would like to have comments and feedback on the idea. Also, it
>> would be great if you could tell me which algorithms/tools we will be
>> comparing for performance in the project. I can give more rigorous
>> details in the proposal.
>>
>> Regards.
>>
>> Anand Soni
>>
>> On Thu, Mar 6, 2014 at 10:08 PM, Ryan Curtin <gth671b at mail.gatech.edu>
>> wrote:
>>
>> On Wed, Mar 05, 2014 at 08:39:10PM +0530, Anand Soni wrote:
>>
>> Thanks a lot Ryan!
>>
>> I, too, would rather submit a single, well-prepared application than
>> many. It was just out of interest that I was reading up on dual trees,
>> and yes, most of the literature that I found was from gatech. I also
>> came across your paper on dual trees
>> (http://arxiv.org/pdf/1304.4327.pdf ). Can you give me some more
>> pointers to get a better understanding of dual trees?
>>
>>
>> There are lots of papers on dual-tree algorithms but the paper you
>> linked to is (to my knowledge) the only one that tries to describe
>> dual-tree algorithms in an abstract manner.  Here are some links to
>> other papers, but keep in mind that they focus on particular algorithms
>> and often don't devote very much space to describing exactly what a
>> dual-tree algorithm is:
>>
>> A.G. Gray and A.W. Moore. "N-body problems in statistical learning."
>> Advances in Neural Information Processing Systems (2001): 521-527.
>>
>> A.W. Moore.  "Nonparametric density estimation: toward computational
>> tractability."  Proceedings of the Third SIAM International Conference
>> on Data Mining (2003).
>>
>> A. Beygelzimer, S. Kakade, and J.L. Langford.  "Cover trees for nearest
>> neighbor."  Proceedings of the 23rd International Conference on Machine
>> Learning (2006).
>>
>> P. Ram, D. Lee, W.B. March, A.G. Gray.  "Linear-time algorithms for
>> pairwise statistical problems."  Advances in Neural Information
>> Processing Systems (2009).
>>
>> W.B. March, P. Ram, A.G. Gray.  "Fast Euclidean minimum spanning tree:
>> algorithm, analysis, and applications."  Proceedings of the 16th ACM
>> SIGKDD International Conference on Knowledge Discovery and Data Mining
>> (2010).
>>
>> R.R. Curtin, P. Ram.  "Dual-tree fast exact max-kernel search." (this
>> one hasn't been published yet...
>> http://www.ratml.org/pub/pdf/2013fastmks.pdf ).
>>
>> I know that's a lot of references and probably way more than you want to
>> read, so don't feel obligated to read anything, but it will probably
>> help explain exactly what a dual-tree algorithm is... I hope!  I can
>> link to more papers too, if you want...
>>
>> But, of course, I am more inclined to work on automatic benchmarking,
>> about which I had a short talk with Marcus, and I am brewing ideas.
>>
>>
>> Ok, sounds good.
>>
>> Thanks,
>>
>> Ryan
>>
>> --
>> Ryan Curtin    | "Somebody dropped a bag on the sidewalk."
>> ryan at ratml.org |   - Kit
>>
>>
>>
>>
>> --
>> Anand Soni | Junior Undergraduate | Department of Computer Science &
>> Engineering | IIT Bombay | India
>>
>>
>>
>>
>>
>> --
>> Anand Soni | Junior Undergraduate | Department of Computer Science &
>> Engineering | IIT Bombay | India
>>
>>
>>
>>
>>
>> --
>> Anand Soni | Junior Undergraduate | Department of Computer Science &
>> Engineering | IIT Bombay | India
>>
>>
>
>
>
> --
> Anand Soni | Junior Undergraduate | Department of Computer Science &
> Engineering | IIT Bombay | India



-- 
Anand Soni | Junior Undergraduate | Department of Computer Science &
Engineering | IIT Bombay | India


