[mlpack] allknn, counting the number of pruned nodes

Tue Aug 13 17:31:11 EDT 2013

On Tue, Aug 13, 2013 at 01:40:20PM -0700, Mike Izbicki wrote:
> I've been playing around with the allknn utility, which is quite nice.
> What I would like is a way to output the total number of nodes in the tree
> that were pruned during the find nearest neighbors traversal.
> 
> Is this number easy to get somewhere?  I'm willing to modify the C++ code,
> but I don't see this number in the documentation.
> 
> My purpose in doing this is twofold.  First, this seems like a good measure
> of the quality of the kd-tree.  Second, I'm implementing a variant of
> kd-trees, and would like to compare my implementation to yours.

Hey Mike,

It was good to meet you at ICML a month or so ago.  :)

Let me first do a little explanation of how the trees in mlpack are
coded.  This is going to be a long email, so, I apologize in advance.
I'm going to refer to dual-tree algorithms (where you are doing nearest
neighbor search on a batch of query points, and you build a tree on the
queries too), but what is written should extend with ease to the
single-tree case (if not, I'm happy to retarget the email).

A tree is independent from its traversal, and that is also independent
from the algorithm.  So, for allknn, the actual "nearest neighbor
search" part is abstracted away from *both* the tree and the traversal.
The algorithm itself is described by a Score() and BaseCase() function;
the Score() function evaluates whether or not a particular combination
of query node and reference node can be pruned, and the BaseCase()
function runs an actual distance comparison between two points.

Implementationally, the Score() and BaseCase() functions are implemented
for allknn in
src/mlpack/methods/neighbor_search/neighbor_search_rules.hpp and related
files.

The kd-tree dual-tree traversal is a dual depth-first traversal; you can
find its implementation in
src/mlpack/core/tree/binary_space_tree/dual_tree_traversal.hpp and
related files.

The traverser actually collects the number of prunes nodes, but in my
experience this really isn't a great measure of how good a tree object
is.  A better measure might be the number of nodes actually visited
(there are what, around O(n^2 log^2 n) maximum?).  This is because if
you prune a small number of nodes, you may have either pruned huge
subtrees or tiny subtrees -- so a small number of prunes can mean both
very fast runtime or very slow runtime.  The number of prunes is printed
now, in neighbor_search_impl.hpp, at line 208, but it prints to the
debug log and doesn't show up without debugging symbols.  You could
change it to Log::Info and it'll show up when you run allknn with the -v
option.

The other statistic I usually collect, which is the number of times
BaseCase() is called, also does not give a complete picture.  I usually
collect this number either by modifying the NeighborSearchRules class to
count the number of times BaseCase() is called and then have the
NeighborSearch class display that when the traversal completes, or I
compile with profiling information (and -O0 -DDEBUG so nothing gets
inlined) and then use gprof to look at how many times BaseCase() was
called.

The number of base case computations is definitely a very useful
statistic, but I think for a full picture it should be combined with the
number of distance evaluations between tree nodes.  When scoring a node
combination to see if it can be pruned, this entails calculation of the
minimum distance between the two nodes; this is (for kd-trees) an O(d)
calculation.  So the traverser could be modified to count the number of
distance evaluations and output that too.  With both that and the number
of base cases, I think that gives a clear picture of how the tree
actually performs.

I think comparing construction time is a little more difficult, though,
but I don't know if that's what you intend to do.

One more thing: when I talk about the Score() function for allknn, it
takes the minimum distance between two nodes and compares it with some
cached bound that is a somewhat complex function (I can provide a
reference for that if you like).  This bound function takes some time to
calculate, and while it does help keep the number of base case
evaluations down, it seems to slow the algorithm down in real time.  I
don't have hard numbers; I haven't yet had a chance to investigate it
(but I will before mlpack-1.0.7).  The bounding function in mlpack-1.0.3
is simpler, and can incur more base case evaluations, but is evaluated
quickly and seems to perform better in terms of real-time running time.
It may be better to use that version for now.

Last one more thing: if you implement your proposed tree type in mlpack,
and write a dual-tree traversal for it, then suddenly not only does it
work for allknn, but allkfn, Euclidean minimum spanning tree
calculation, range search, rank-approximate nearest neighbor search, and
any other tree-based algorithms someone might eventually implement in
mlpack (I have several on my list).

----

So, tl;dr:

 * The number of pruned nodes is currently collected but only displayed
   if compiled with debugging symbols.  Change the Log::Debug to
   Log::Info in
   src/mlpack/methods/neighbor_search/neighbor_search_impl.hpp:208 and
   then run with '-v' and it will be output.

 * Other statistics might show better comparisons than the number of
   pruned nodes, such as the number of base case evaluations and the
   number of node-to-node distance calculations.  I can happily furnish
   code to output this on demand, if this is what you want.  It will
   take me five minutes and is no problem at all.

 * The bounding rules in mlpack-1.0.3 I think will perform better in
   terms of real runtime than the ones in trunk right now, but if you
   are simply comparing tree types I don't think it will make much
   difference.

 * If you write a tree and traverser in conformance with the mlpack
   specs, which are somewhat undocumented a bit because I am still
   trying to formalize tree-independent dual-tree algorithms, then your
   tree type will work with all mlpack dual-tree algorithms without any
   further work from you.

 * I spend virtually all of my time thinking about and reading about
   tree-based algorithms.  If you are looking for references into
   existing types of trees, interesting applications of trees,
   historical references to early tree-like data structures, and
   kind-of-related-but-not-quite-tree-based approaches to nearest
   neighbor search, I have about a thousand, and will happily talk about
   dual-tree algorithms until I keel over dead.

Hopefully this is helpful and makes any shred of sense whatsoever.  If
not let me know.  :)

Thanks,

Ryan

-- 
Ryan Curtin       | "And they say there is no fate, but there is: it's
ryan at igglybob.com | what you create." - Minister