[mlpack] allknn, counting the number of pruned nodes

Mike Izbicki mike at izbicki.me
Tue Aug 13 18:34:27 EDT 2013


Thanks, Ryan!  This is super helpful.

Let me explain a little more about what I'm doing.  I'm implementing a
version of ball trees that I've been toying around with.  I think I
mentioned them when we talked at ICML, but I've fleshed the idea out a lot
more since then.  The basic idea is that they are isomorphic to a sorted
vector.  This property makes it really easy to construct the tree online or
in parallel, keeps the tree perfectly balanced, and makes it easy to take
advantage of some cache tricks.  The nearest neighbor dual-tree algorithm,
at least, is pretty easy to parallelize with this scheme, but I'm not sure
about the other ones.  The only downside is that I'm not sure how much
pruning can actually be done on these trees when searching in practice, so
that's what I want to test empirically.
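
To give a rough idea of the kind of indexing trick I mean (this is only a toy
sketch in one dimension, not my actual ball tree implementation), a perfectly
balanced tree can live implicitly on top of a sorted array, with a "node"
being nothing more than an index range:

    #include <cstddef>
    #include <vector>

    // Toy sketch: an implicit, perfectly balanced binary tree laid out over a
    // sorted vector.  A node is just a half-open index range [lo, hi); its
    // children are the two halves around the midpoint, so no pointers or
    // per-node allocations are needed, and the layout is cache-friendly.
    struct ImplicitNode
    {
      std::size_t lo, hi;

      ImplicitNode(std::size_t lo, std::size_t hi) : lo(lo), hi(hi) { }

      bool IsLeaf() const { return hi - lo <= 1; }
      std::size_t Mid() const { return lo + (hi - lo) / 2; }
      ImplicitNode Left() const { return ImplicitNode(lo, Mid()); }
      ImplicitNode Right() const { return ImplicitNode(Mid(), hi); }
    };

    // The "tree" over a sorted dataset is just the root range; keeping the
    // vector sorted is the only maintenance the structure ever needs.
    inline ImplicitNode Root(const std::vector<double>& sortedData)
    {
      return ImplicitNode(0, sortedData.size());
    }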

I've got a toy implementation that just counts how many data points can be
ignored in different scenarios.  I want to compare it to some other tree
types just to see how many data points each tree has to go through.  I have
no idea even what a good ballpark figure is for this.  If the counts turn
out to be similar, then I'll start doing optimizations and see how close the
actual run times are.

> The traverser actually collects the number of pruned nodes, but in my
> experience this really isn't a great measure of how good a tree object is.
> A better measure might be the number of nodes actually visited (there are
> what, around O(n^2 log^2 n) maximum?).

Yeah, by pruned nodes I meant including all the nodes in the pruned subtrees
too, so I think these measures are really the same thing.  What I'm not
clear on, though, is why the number of nodes actually visited isn't equal
to the number of times the base case is called.

> Other statistics might show better comparisons than the number of pruned
> nodes, such as the number of base case evaluations and the number of
> node-to-node distance calculations.  I can happily furnish code to output
> this on demand, if this is what you want.  It will take me five minutes and
> is no problem at all.

This would be totally awesome!  That's exactly what I'm looking for.

Also, are you familiar with recursion schemes at all?  Your ICML paper
inspired me to ask this Stack Exchange question about recursion schemes and
pruning:
http://cstheory.stackexchange.com/questions/18374/recursion-schemes-that-prune
There's a really good answer showing that this kind of pruned recursion is
equivalent to a "hylomorphism."  That might be an interesting perspective
from which to think about dual-tree algorithms.
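
To make the connection a little more concrete (this is just a toy C++ example
of the fused unfold-then-fold shape; it isn't taken from the linked answer),
pruning shows up as the "unfold" step deciding not to produce any children:

    #include <iostream>

    // Toy hylomorphism: count how many integers in [lo, hi) are below a
    // threshold by recursive bisection.  The recursion implicitly unfolds a
    // call tree from the interval (the coalgebra) and folds the results back
    // together with '+' (the algebra); pruning is the unfold step answering
    // for a whole interval at once instead of recursing further.
    long CountBelow(long lo, long hi, long threshold)
    {
      if (hi <= lo)        return 0;        // empty interval
      if (hi <= threshold) return hi - lo;  // prune: everything counts
      if (lo >= threshold) return 0;        // prune: nothing counts

      const long mid = lo + (hi - lo) / 2;
      return CountBelow(lo, mid, threshold) + CountBelow(mid, hi, threshold);
    }

    int main()
    {
      std::cout << CountBelow(0, 100, 42) << std::endl;  // prints 42
      return 0;
    }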


On Tue, Aug 13, 2013 at 2:31 PM, Ryan Curtin <gth671b at mail.gatech.edu> wrote:

> On Tue, Aug 13, 2013 at 01:40:20PM -0700, Mike Izbicki wrote:
> > I've been playing around with the allknn utility, which is quite nice.
> > What I would like is a way to output the total number of nodes in the
> tree
> > that were pruned during the find nearest neighbors traversal.
> >
> > Is this number easy to get somewhere?  I'm willing to modify the C++
> code,
> > but I don't see this number in the documentation.
> >
> > My purpose in doing this is twofold.  First, this seems like a good
> measure
> > of the quality of the kd-tree.  Second, I'm implementing a variant of
> > kd-trees, and would like to compare my implementation to yours.
>
> Hey Mike,
>
> It was good to meet you at ICML a month or so ago.  :)
>
> Let me first explain a little about how the trees in mlpack are coded.
> This is going to be a long email, so I apologize in advance.
> I'm going to refer to dual-tree algorithms (where you are doing nearest
> neighbor search on a batch of query points, and you build a tree on the
> queries too), but what is written should extend with ease to the
> single-tree case (if not, I'm happy to retarget the email).
>
> A tree is independent of its traversal, and the traversal is in turn
> independent of the algorithm.  So, for allknn, the actual "nearest
> neighbor search" part is abstracted away from *both* the tree and the
> traversal.
> The algorithm itself is described by a Score() and BaseCase() function;
> the Score() function evaluates whether or not a particular combination
> of query node and reference node can be pruned, and the BaseCase()
> function runs an actual distance comparison between two points.
>
> In the code, the Score() and BaseCase() functions for allknn are
> implemented in
> src/mlpack/methods/neighbor_search/neighbor_search_rules.hpp and related
> files.
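>
> Roughly, a rules class has this shape (just a sketch of the idea, not the
> exact signatures from neighbor_search_rules.hpp):
>
>     #include <cstddef>
>
>     // The algorithm is only these two hooks; the tree and the traversal
>     // never need to know what they actually compute.
>     template<typename TreeType>
>     class ExampleRules
>     {
>      public:
>       // Point-to-point work: compare one query point and one reference
>       // point (for allknn, a distance evaluation plus bookkeeping).
>       double BaseCase(const std::size_t queryIndex,
>                       const std::size_t referenceIndex);
>
>       // Node-to-node work: return a score used to prioritize recursion,
>       // or a "prune" value to tell the traverser to skip this combination.
>       double Score(TreeType& queryNode, TreeType& referenceNode);
>     };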
>
> The kd-tree dual-tree traversal is a dual depth-first traversal; you can
> find its implementation in
> src/mlpack/core/tree/binary_space_tree/dual_tree_traversal.hpp and
> related files.
>
> The traverser actually collects the number of pruned nodes, but in my
> experience this really isn't a great measure of how good a tree object
> is.  A better measure might be the number of nodes actually visited
> (there are what, around O(n^2 log^2 n) maximum?).  This is because if
> you prune a small number of nodes, you may have pruned either huge
> subtrees or tiny subtrees -- so a small number of prunes can mean either
> very fast or very slow runtime.  The number of prunes is printed
> now, in neighbor_search_impl.hpp, at line 208, but it prints to the
> debug log and doesn't show up without debugging symbols.  You could
> change it to Log::Info and it'll show up when you run allknn with the -v
> option.
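>
> In other words, something along these lines (the variable name here is
> just illustrative, not the exact line in the file):
>
>     // Before: only visible in debug builds.
>     Log::Debug << numPrunes << " node combinations were pruned." << std::endl;
>     // After: printed whenever allknn is run with -v.
>     Log::Info << numPrunes << " node combinations were pruned." << std::endl;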
>
> The other statistic I usually collect, the number of times BaseCase() is
> called, also does not give a complete picture on its own.  I usually
> collect this number either by modifying the NeighborSearchRules class to
> count the number of times BaseCase() is called and then having the
> NeighborSearch class display that count when the traversal completes, or
> by compiling with profiling information (and -O0 -DDEBUG so nothing gets
> inlined) and then using gprof to look at how many times BaseCase() was
> called.
>
> The number of base case computations is definitely a very useful
> statistic, but I think for a full picture it should be combined with the
> number of distance evaluations between tree nodes.  Scoring a node
> combination to see if it can be pruned entails calculating the minimum
> distance between the two nodes, which is (for kd-trees) an O(d)
> calculation.  So the traverser could be modified to count the number of
> node-to-node distance evaluations and output that too.  Together with the
> number of base cases, I think that gives a clear picture of how the tree
> actually performs.
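>
> As a sketch of what I mean (the member names are illustrative, not the
> actual mlpack code), the rules class can simply carry two counters that
> get bumped as the traversal does its work:
>
>     #include <cstddef>
>
>     // Sketch: count the two kinds of work done during a dual-tree
>     // traversal -- point-to-point base cases and O(d) node-to-node
>     // minimum-distance evaluations.
>     template<typename TreeType>
>     class CountingRules
>     {
>      public:
>       CountingRules() : baseCases(0), nodeDistances(0) { }
>
>       double BaseCase(const std::size_t /* queryIndex */,
>                       const std::size_t /* referenceIndex */)
>       {
>         ++baseCases;  // one point-to-point distance evaluation
>         // ... actual nearest-neighbor base case would go here ...
>         return 0.0;
>       }
>
>       double Score(TreeType& /* queryNode */, TreeType& /* referenceNode */)
>       {
>         ++nodeDistances;  // one node-to-node minimum-distance evaluation
>         // ... compute the min distance, compare to the bound, maybe prune ...
>         return 0.0;
>       }
>
>       std::size_t BaseCases() const { return baseCases; }
>       std::size_t NodeDistances() const { return nodeDistances; }
>
>      private:
>       std::size_t baseCases;
>       std::size_t nodeDistances;
>     };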
>
> I think comparing construction time is a little more difficult, but I
> don't know if that's what you intend to do.
>
> One more thing: when I talk about the Score() function for allknn, it
> takes the minimum distance between two nodes and compares it with some
> cached bound that is a somewhat complex function (I can provide a
> reference for that if you like).  This bound function takes some time to
> calculate, and while it does help keep the number of base case
> evaluations down, it seems to slow the algorithm down in real time.  I
> don't have hard numbers; I haven't yet had a chance to investigate it
> (but I will before mlpack-1.0.7).  The bounding function in mlpack-1.0.3
> is simpler, and can incur more base case evaluations, but it is evaluated
> quickly and seems to perform better in terms of actual running time.
> It may be better to use that version for now.
>
> One last thing: if you implement your proposed tree type in mlpack and
> write a dual-tree traversal for it, then suddenly it works not only for
> allknn, but also for allkfn, Euclidean minimum spanning tree
> calculation, range search, rank-approximate nearest neighbor search, and
> any other tree-based algorithms someone might eventually implement in
> mlpack (I have several on my list).
>
> ----
>
> So, tl;dr:
>
>  * The number of pruned nodes is currently collected but only displayed
>    if compiled with debugging symbols.  Change the Log::Debug to
>    Log::Info in
>    src/mlpack/methods/neighbor_search/neighbor_search_impl.hpp:208 and
>    then run with '-v' and it will be output.
>
>  * Other statistics might show better comparisons than the number of
>    pruned nodes, such as the number of base case evaluations and the
>    number of node-to-node distance calculations.  I can happily furnish
>    code to output this on demand, if this is what you want.  It will
>    take me five minutes and is no problem at all.
>
>  * I think the bounding rules in mlpack-1.0.3 will perform better in
>    terms of real runtime than the ones currently in trunk, but if you
>    are simply comparing tree types I don't think it will make much
>    difference.
>
>  * If you write a tree and traverser in conformance with the mlpack
>    specs, which are somewhat undocumented because I am still
>    trying to formalize tree-independent dual-tree algorithms, then your
>    tree type will work with all mlpack dual-tree algorithms without any
>    further work from you.
>
>  * I spend virtually all of my time thinking about and reading about
>    tree-based algorithms.  If you are looking for references into
>    existing types of trees, interesting applications of trees,
>    historical references to early tree-like data structures, and
>    kind-of-related-but-not-quite-tree-based approaches to nearest
>    neighbor search, I have about a thousand, and will happily talk about
>    dual-tree algorithms until I keel over dead.
>
> Hopefully this is helpful and makes any shred of sense whatsoever.  If
> not, let me know.  :)
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin       | "And they say there is no fate, but there is: it's
> ryan at igglybob.com | what you create." - Minister
>