[mlpack] allknn, counting the number of pruned nodes

Thu Aug 15 14:50:37 EDT 2013

How can I download trunk?  I only see links to the released versions off
mlpack.org

On Wed, Aug 14, 2013 at 1:47 PM, Ryan Curtin <gth671b at mail.gatech.edu>wrote:

> On Tue, Aug 13, 2013 at 03:34:27PM -0700, Mike Izbicki wrote:
> > Thanks, Ryan!  This is super helpful.
> >
> > Let me explain a little more about what I'm doing.  I'm implementing a
> > version of ball trees that I've been toying around with.  I think I
> > mentioned them when we talked at ICML, but I've fleshed the idea out a
> lot
> > more since then.  The basic idea is that they are isomorphic to a sorted
> > vector.  This property makes it really easy to construct the tree
> online/in
> > parallel, the tree is always perfectly balanced, and that it's really
> easy
> > to take advantage of some cache tricks.  At least the nearest neighbor
> dual
> > tree algorithm is pretty easy to parallelize with this scheme, but I'm
> not
> > sure about the other ones.  The only downside is that I'm not sure how
> much
> > pruning can actually be done on these trees when searching in practice,
> so
> > that's what I want to test empirically.
>
> When you say isomorphic to a sorted vector, what this means to me is
> that your tree lays out the points in such a way that each node's
> descendant points are contiguous in memory.  But I'm pretty sure I'm not
> understanding that right.
>
> I have not yet approached the problem of parallelization for dual-tree
> algorithms in the way I want to.  There is one publication I do know of
> that considered this in the context of kernel summations by a former
> labmate:
>
>   D. Lee, R. Vuduc, A.G. Gray. "A Distributed Kernel Summation Framework
>   for General-Dimension Machine Learning", SDM 2012.
>   https://sites.google.com/site/dongryel/SDM2012_distributed.pdf
>
> In my opinion load balancing is by far the hardest part of the whole
> parallelization process.
>
> > I've got a toy implementation that just counts how many data points can
> be
> > ignored in different scenarios.  I want to compare it to some other tree
> > types just to see how many data points each tree has to go through.  I
> have
> > no idea what even a good ballpark figure is for this.  If it turns out to
> > be similar, then I'll start doing optimizations and see how close the
> > actual run times are.
> >
> >> The traverser actually collects the number of prunes nodes, but in my
> >> experience this really isn't a great measure of how good a tree object
> is.
> >> A better measure might be the number of nodes actually visited (there
> are
> >> what, around O(n^2 log^2 n) maximum?).
> >
> > Yeah, by pruned nodes, I had meant including all the nodes in the
> subtrees
> > too, so I think these measures are really the same thing.  What I'm not
> > clear on, though, is why is the number of nodes actually visited not
> equal
> > to the number of times base case is called?
>
> Ok, so, that's not actually what my code was counting.  My code counts
> the number of times that a prune is made; so when an entire subtree is
> pruned, it only increments the counter by one.  This will actually be
> more difficult and I'm not sure it makes sense (see next bit of the
> reply).
>
> >> Other statistics might show better comparisons than the number of pruned
> >> nodes, such as the number of base case evaluations and the number of
> >> node-to-node distance calculations.  I can happily furnish code to
> output
> >> this on demand, if this is what you want.  It will take me five minutes
> and
> >> is no problem at all.
> >
> > This would be totally awesome!  That's exactly what I'm looking for.
>
> The easiest thing for me to count is the number of node combinations
> visited by the traversal.  I can't count the number of node combinations
> that were pruned, because there's no way to count how many node
> combinations the traversal *should* have visited.  Consider this (for a
> kd-tree):
>
> At the node combination (q, r), we have a few ways we can recurse,
> depending on how the traversal is written and how the tree is laid out:
>
>  * (q, r.left) and (q, r.right)
>  * (q.left, r) and (q.right, r)
>  * (q.left, r.left), (q.left, r.right), (q.right, r.left),
>        (q.right, r.right)
>
> So we can end up recursing to two leaf nodes (q_l, r_l) from the root
> nodes (q_root, r_root) in a multitude of ways; if we only use the first
> two recursion possibilities, we end up with a traversal that visited
> many more node combinations than if we only used the last recursion
> possibility.
>
> This means that there's no reasonable way to calculate the number of
> node combinations that are pruned, or the number of node combinations
> that would be visited if nothing was pruned.  One option might be to run
> the recursion but never prune and count the number of node combinations
> that are visited, but this gives you a traversal-dependent and
> implementation-dependent number.  That would be rather time-consuming,
> though.
>
> Nonetheless, I've attached a patch (against trunk) that counts the
> number of nodes that are visited, the number of base case computations,
> and the number of times a node combination is scored.  The last number
> is of interest because it should be exactly the number of times the
> minimum distance between two nodes is calculated; the number of base
> case computations should be the number of point-to-point distance
> calculations.
>
> When you patch, you'll just have to run the allknn executable with the
> '-v' option and it should output the numbers you need.
>
> These are useful numbers, and I need to think about it a bit, but I
> suspect I will end up adding them to the trunk implementation once I
> ensure that the additional runtime overhead is negligible, which it
> realistically should be.
>
> > Also, are you familiar with recursion schemes at all?  Your ICML paper
> > inspired me to ask this stack exchange question about recursion schemes
> and
> > pruning (
> >
> http://cstheory.stackexchange.com/questions/18374/recursion-schemes-that-prune
> ).
> > There's a really good answer that shows that it's equivalent to
> > "hylomorphisms."  That might be an interesting perspective for you to
> think
> > about the dual tree algorithms from.
>
> I am not familiar with recursion schemes -- but expect that I will be
> soon.  Thanks for the link.  :)
>
> --
> Ryan Curtin       | "And they say there is no fate, but there is: it's
> ryan at igglybob.com | what you create." - Minister
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.cc.gatech.edu/pipermail/mlpack/attachments/20130815/074b202c/attachment-0003.html>