[mlpack] GSoC Project: LMNN (via LRSDP) & BoostMetric Implementation

Ryan Curtin ryan at ratml.org
Fri May 11 09:57:28 EDT 2018


On Thu, May 10, 2018 at 08:34:34PM +0530, Manish Kumar wrote:
> Hi Ryan,
> 
> This is regarding the constraints generation algorithm for LMNN. So, here's
> what I think could be an option for generating {data point, target
> neighbors, impostors} triplets:
> 1) We can use KNN to search for N - 1 neighbors for all the N data points.
> (The runtime of the KNN search will determine the major part of the runtime
> of the whole algorithm.)
> 2) Then we could select k targetNeighbors and impostors corresponding to
> each data point.
> 3) Finally, generate triplets using the targetNeighbors and impostors.
> 
> Here is some sample code https://pastebin.com/Vb5VJawv for the above 3 steps.
> 
> I have compared the runtime of the above code with BoostMetric's KNN-triplet
> generation algorithm, and this one shows a significant speedup (0.15s vs.
> 5.82s on a dataset of 125 points with k = 5).
> 
> Let me know your view on this.

Hey Manish,

If we do KNN for (N - 1) neighbors, there is no way to escape the O(N^2)
runtime of the algorithm, even with trees (actually trees will be slower
than naive search here, for reasonably sized N).  So even if the speedup
is significant for 125 points, I doubt that it will be for 125k points.

I have a different idea though.  The goal of the KNN search is to find
k neighbors with the same class, and k neighbors with some other class.
Therefore, we could (in the simplest instantiation) do multiple
searches.  For simplicity let's just assume we have two classes.  In
this case, split the dataset into two datasets: class 0 and class 1.
Then use KNN four times with k = k:

 * With class 0 as the reference set and class 0 as the query set.
 * With class 0 as the reference set and class 1 as the query set.
 * With class 1 as the reference set and class 0 as the query set.
 * With class 1 as the reference set and class 1 as the query set.

So, basically, we run KNN four times with k = k instead of once with
k = (N - 1), and for reasonably sized to large datasets, this is likely
to save us a huge amount of runtime.
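
To make the two-class version concrete, here is a rough, untested sketch
using mlpack's KNN typedef (the function and variable names are just
placeholders, and the results are overwritten instead of stored to keep
it short):

    #include <mlpack/core.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>

    using namespace mlpack::neighbor;

    // 'class0' and 'class1' hold each class's points as columns.
    void TargetsAndImpostors(const arma::mat& class0,
                             const arma::mat& class1,
                             const size_t k)
    {
      arma::Mat<size_t> neighbors;
      arma::mat distances;

      // Only two trees are built; each is reused for two searches.
      KNN knn0(class0);
      KNN knn1(class1);

      // Target neighbors: same-class searches.  The Search(k, ...)
      // overload skips each query point itself.
      knn0.Search(k, neighbors, distances);
      knn1.Search(k, neighbors, distances);

      // Impostors: cross-class searches.
      knn0.Search(class1, k, neighbors, distances);  // Class 1 queries.
      knn1.Search(class0, k, neighbors, distances);  // Class 0 queries.
    }

Note that even though there are four searches, only two trees need to be
built, one per class.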

We can generalize to multiple classes, basically by running KNN twice
for the points in each class: once with the same-class points as the
reference set, and once with all points not in the same class as the
reference set.  With some clever memory management, it is possible to do
this without making copies of the dataset---instead, you sort the
dataset so that each class's points are contiguous in memory, then
create Armadillo alias matrices and use NeighborSearch with those.
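
Something along these lines (again, a rough sketch with placeholder
names, not existing code) is what I have in mind, assuming a
column-major dataset and a row of labels:

    #include <mlpack/core.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>

    using namespace mlpack::neighbor;

    void SameClassNeighbors(const arma::mat& dataset,
                            const arma::Row<size_t>& labels,
                            const size_t k)
    {
      // Reorder columns so points with the same label are contiguous.
      arma::uvec order = arma::stable_sort_index(labels);
      arma::mat sorted = dataset.cols(order);
      arma::Row<size_t> sortedLabels = labels.cols(order);

      arma::Mat<size_t> neighbors;
      arma::mat distances;

      size_t begin = 0;
      while (begin < sorted.n_cols)
      {
        // Find the end of this class's contiguous block of columns.
        size_t end = begin;
        while (end < sorted.n_cols &&
               sortedLabels[end] == sortedLabels[begin])
          ++end;

        // Alias over the block: the alias shares memory with 'sorted'
        // (copy_aux_mem = false), so no extra copy of the data is made.
        arma::mat block(sorted.colptr(begin), sorted.n_rows, end - begin,
                        false, true);

        // Same-class (target neighbor) search for this class.  The
        // impostor search would use the remaining columns as the
        // reference set.
        KNN knn(block);
        knn.Search(k, neighbors, distances);

        begin = end;
      }
    }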

Alternately, the fastest option (with trees at least) is to write a new
NeighborSearchRules that's adapted to search for k points of the same
class and k points of a separate class.  That could be time-consuming,
and I suspect it wouldn't be *too* much faster than what I suggested
above.

I think what I am proposing is a minor improvement on what BoostMetric
appears to be doing, based on what they wrote in the paper.  Did you
compare against an existing BoostMetric implementation, or did you
implement their strategy yourself?

I hope this helps.  Let me know if I can clarify anything.

Thanks,

Ryan

-- 
Ryan Curtin    | "I like fish sticks."
ryan at ratml.org |  - Benny

