[mlpack] allknn, counting the number of pruned nodes

Wed Aug 14 16:47:04 EDT 2013

On Tue, Aug 13, 2013 at 03:34:27PM -0700, Mike Izbicki wrote:
> Thanks, Ryan!  This is super helpful.
> 
> Let me explain a little more about what I'm doing.  I'm implementing a
> version of ball trees that I've been toying around with.  I think I
> mentioned them when we talked at ICML, but I've fleshed the idea out a lot
> more since then.  The basic idea is that they are isomorphic to a sorted
> vector.  This property makes it really easy to construct the tree online/in
> parallel, the tree is always perfectly balanced, and that it's really easy
> to take advantage of some cache tricks.  At least the nearest neighbor dual
> tree algorithm is pretty easy to parallelize with this scheme, but I'm not
> sure about the other ones.  The only downside is that I'm not sure how much
> pruning can actually be done on these trees when searching in practice, so
> that's what I want to test empirically.

When you say isomorphic to a sorted vector, what this means to me is
that your tree lays out the points in such a way that each node's
descendant points are contiguous in memory.  But I'm pretty sure I'm not
understanding that right.

I have not yet approached the problem of parallelization for dual-tree
algorithms in the way I want to.  There is one publication I do know of
that considered this in the context of kernel summations by a former
labmate:

  D. Lee, R. Vuduc, A.G. Gray. "A Distributed Kernel Summation Framework
  for General-Dimension Machine Learning", SDM 2012.
  https://sites.google.com/site/dongryel/SDM2012_distributed.pdf

In my opinion load balancing is by far the hardest part of the whole
parallelization process.

> I've got a toy implementation that just counts how many data points can be
> ignored in different scenarios.  I want to compare it to some other tree
> types just to see how many data points each tree has to go through.  I have
> no idea what even a good ballpark figure is for this.  If it turns out to
> be similar, then I'll start doing optimizations and see how close the
> actual run times are.
> 
>> The traverser actually collects the number of prunes nodes, but in my
>> experience this really isn't a great measure of how good a tree object is.
>> A better measure might be the number of nodes actually visited (there are
>> what, around O(n^2 log^2 n) maximum?).
> 
> Yeah, by pruned nodes, I had meant including all the nodes in the subtrees
> too, so I think these measures are really the same thing.  What I'm not
> clear on, though, is why is the number of nodes actually visited not equal
> to the number of times base case is called?

Ok, so, that's not actually what my code was counting.  My code counts
the number of times that a prune is made; so when an entire subtree is
pruned, it only increments the counter by one.  This will actually be
more difficult and I'm not sure it makes sense (see next bit of the
reply).

>> Other statistics might show better comparisons than the number of pruned
>> nodes, such as the number of base case evaluations and the number of
>> node-to-node distance calculations.  I can happily furnish code to output
>> this on demand, if this is what you want.  It will take me five minutes and
>> is no problem at all.
> 
> This would be totally awesome!  That's exactly what I'm looking for.

The easiest thing for me to count is the number of node combinations
visited by the traversal.  I can't count the number of node combinations
that were pruned, because there's no way to count how many node
combinations the traversal *should* have visited.  Consider this (for a
kd-tree):

At the node combination (q, r), we have a few ways we can recurse,
depending on how the traversal is written and how the tree is laid out:

 * (q, r.left) and (q, r.right)
 * (q.left, r) and (q.right, r)
 * (q.left, r.left), (q.left, r.right), (q.right, r.left),
       (q.right, r.right)

So we can end up recursing to two leaf nodes (q_l, r_l) from the root
nodes (q_root, r_root) in a multitude of ways; if we only use the first
two recursion possibilities, we end up with a traversal that visited
many more node combinations than if we only used the last recursion
possibility.

This means that there's no reasonable way to calculate the number of
node combinations that are pruned, or the number of node combinations
that would be visited if nothing was pruned.  One option might be to run
the recursion but never prune and count the number of node combinations
that are visited, but this gives you a traversal-dependent and
implementation-dependent number.  That would be rather time-consuming,
though.

Nonetheless, I've attached a patch (against trunk) that counts the
number of nodes that are visited, the number of base case computations,
and the number of times a node combination is scored.  The last number
is of interest because it should be exactly the number of times the
minimum distance between two nodes is calculated; the number of base
case computations should be the number of point-to-point distance
calculations.

When you patch, you'll just have to run the allknn executable with the
'-v' option and it should output the numbers you need.

These are useful numbers, and I need to think about it a bit, but I
suspect I will end up adding them to the trunk implementation once I
ensure that the additional runtime overhead is negligible, which it
realistically should be.

> Also, are you familiar with recursion schemes at all?  Your ICML paper
> inspired me to ask this stack exchange question about recursion schemes and
> pruning (
> http://cstheory.stackexchange.com/questions/18374/recursion-schemes-that-prune).
> There's a really good answer that shows that it's equivalent to
> "hylomorphisms."  That might be an interesting perspective for you to think
> about the dual tree algorithms from.

I am not familiar with recursion schemes -- but expect that I will be
soon.  Thanks for the link.  :)

-- 
Ryan Curtin       | "And they say there is no fate, but there is: it's
ryan at igglybob.com | what you create." - Minister
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tree-stats.patch
Type: text/x-diff
Size: 5929 bytes
Desc: not available
URL: <http://mailman.cc.gatech.edu/pipermail/mlpack/attachments/20130814/49696672/attachment-0001.patch>