[mlpack] GSoC 2016 - Project Questions

Ryan Curtin ryan at ratml.org
Thu Mar 3 11:56:21 EST 2016


On Wed, Mar 02, 2016 at 08:23:43PM +0000, Yannis Mentekidis wrote:
> Hello developers,
> 
> My name is Yannis Mentekidis; I am a final-year student in Electrical and
> Computer Engineering at Aristotle University, Thessaloniki, Greece.

Hi Yannis,

I can only answer the approximate nearest neighbor search part of your
email well, but I will try to answer both. :)

> I've gone through your project description and your ideas page, and I would
> love to work with you for the summer! I've started going through your
> source code and compiled the library to familiarize myself with the project.
> 
> Two projects are especially relevant to my interests: Approximate Nearest
> Neighbors and Neuroevolutional Algorithms.
> 
> For my thesis, I have worked on efficient algorithms for the Nearest
> Neighbor Search problem, and I am familiar with approximate NNS – for
> example, LSH and Multiprobe LSH, as well as Compressed Hashing (from CVPR
> 2013). I would be interested in implementing some or all of these, but I
> would also like to try the ideas presented in “Modeling LSH for
> performance tuning (Dong et al, 2008)”. You also mention dual trees, which
> I am less familiar with.
> 
> *My first question is*: would such an implementation (Multiprobe
> LSH/Performance Tuning for LSH) be of interest to the project? Or is it
> preferable to implement a larger number of approximate methods and give
> more freedom of choice to the end user?

Yes, definitely, other forms of LSH or improvements to the existing LSH
implementation would be a good way to go.  Dong's paper is where LSHKIT
came from, right?  If so, it might be interesting to implement that, but
only if we are pretty sure we can outperform LSHKIT (i.e., if we can make
a faster implementation).
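
For reference, using the existing implementation looks roughly like the
sketch below; the parameter values are placeholders, so check the
headers for the exact signatures and sensible defaults:

    #include <mlpack/core.hpp>
    #include <mlpack/methods/lsh/lsh_search.hpp>

    using namespace mlpack::neighbor;

    int main()
    {
      // Toy reference set: 10-dimensional points, one per column.
      arma::mat referenceSet(10, 1000, arma::fill::randu);

      // 10 projections per table, 30 tables.  These values are
      // placeholders; choosing them well is exactly the tuning problem
      // discussed above.
      LSHSearch<> lsh(referenceSet, 10, 30);

      // Approximate 5 nearest neighbors of every reference point.
      arma::Mat<size_t> neighbors;
      arma::mat distances;
      lsh.Search(5, neighbors, distances);
    }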

One thing I found with other LSH implementations is that most of them
are a little difficult to work with... a user is most likely going to
want to say "I want k approximate nearest neighbors with a desired
maximum of e% relative error", so they specify k and e, and then the
algorithm returns results within the desired error bound.  But some LSH
packages do not offer this kind of interface, which makes them hard to
use (E2LSH is one example).
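
To make that concrete, the kind of interface I'd like to see would be
something like the sketch below.  This is purely hypothetical; nothing
with these names exists in mlpack yet:

    #include <mlpack/core.hpp>

    // Hypothetical names, just to illustrate the interface: the user
    // gives k and a maximum relative error, and the implementation is
    // responsible for tuning its own parameters to meet that bound.
    template<typename MatType = arma::mat>
    class ApproxKNN
    {
     public:
      explicit ApproxKNN(const MatType& referenceSet);

      // For each query point, return k neighbors whose distances are
      // within (1 + maxRelativeError) of the true k-NN distances.
      void Search(const MatType& querySet,
                  const size_t k,
                  const double maxRelativeError,
                  arma::Mat<size_t>& neighbors,
                  arma::mat& distances);
    };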

Here are some interesting runtime comparisons against LSHKIT that I did
last year (on pages 9 and 10); I compared some tree-based algorithms and
also multiprobe LSH:

http://ratml.org/pub/pdf/2015faster.pdf

I think LSH would probably be faster in higher dimensions, but that
wasn't the regime of interest for the study I linked to.  Also, that
paper might be a decent place to start reading about dual-tree nearest
neighbor search (and the papers it references might be interesting as
well).
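
If you want to poke at the dual-tree code itself, the exact k-NN class
is a good entry point; usage is roughly like this (again, check the
headers for the exact signatures):

    #include <mlpack/core.hpp>
    #include <mlpack/methods/neighbor_search/neighbor_search.hpp>

    using namespace mlpack::neighbor;

    int main()
    {
      arma::mat referenceSet(10, 1000, arma::fill::randu);

      // AllkNN does dual-tree exact k-NN (kd-trees by default); it is a
      // good exact baseline to measure approximate methods against.
      AllkNN knn(referenceSet);

      arma::Mat<size_t> neighbors;
      arma::mat distances;
      knn.Search(3, neighbors, distances);
    }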

Anyway, maybe you would be the right person for this task?

https://github.com/mlpack/mlpack/issues/261

I think that if you have spent a lot of time studying and thinking about
LSH, then this one may not be so difficult.

> *Another question* is inspired by an old ticket I found on GitHub, where
> you mention interest in parallelization. I would also be interested in, as
> a side goal, implementing a parallel version of a few of the approximate
> algorithms. Is that something interesting to you guys?

Yes, parallelization is interesting, but should probably be done via
OpenMP.  It can be really difficult to read parallel code, so if we make
the code too complex, then suddenly developers need to have good
knowledge of C++, parallel programming, and also machine learning.  It's
already hard enough to find people at the intersection of C++ and
machine learning. :)

In many cases OpenMP can do pretty well, too.  I think it would probably
be pretty easy to OpenMP-ize the current LSH implementation.
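
Just to sketch the pattern, here is a toy brute-force search (not the
actual LSH code) where the loop over queries is parallelized with a
single pragma:

    #include <armadillo>

    int main()
    {
      // Toy data: 10-dimensional points, one point per column.
      arma::mat refs(10, 1000, arma::fill::randu);
      arma::mat queries(10, 200, arma::fill::randu);
      arma::uvec nearest(queries.n_cols);

      // Queries are independent: each iteration reads shared data and
      // writes only to its own slot of 'nearest', so no locks are needed.
      #pragma omp parallel for
      for (size_t i = 0; i < queries.n_cols; ++i)
      {
        double bestDist = arma::datum::inf;
        size_t bestIndex = 0;
        for (size_t j = 0; j < refs.n_cols; ++j)
        {
          const double d = arma::norm(queries.col(i) - refs.col(j));
          if (d < bestDist) { bestDist = d; bestIndex = j; }
        }
        nearest(i) = bestIndex;
      }

      nearest.print("nearest neighbor index per query:");
    }

Compile with -fopenmp and the query loop runs in parallel; compile
without it and the same code runs serially.  That independence between
queries is why I think the LSH implementation would be easy to
OpenMP-ize without making the code much harder to read.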

> The other subject, Neuroevolutional Algorithms, I have less experience
> with, but I would like to work on evolutionary computation, and I have
> implemented feedforward neural networks in the past, so this is also
> something I could see myself working on.

Here's a response by Marcus to another prospective applicant about the
deep learning modules project, which has some overlap; the reading list
there might be useful:

https://mailman.cc.gatech.edu/pipermail/mlpack/2016-February/000738.html

> *My final question* is regarding procedure: Since I am interested in both
> these projects, should I submit two different applications to Google, or
> just one where I mention both projects?

http://www.google-melange.com/gsoc/document/show/gsoc_program/google/gsoc2015/help_page#6._Can_a_student_submit_more_than_one

You're more than welcome to submit two proposals, but often it's better
to focus on just one, since we can only accept one proposal per student.
:)

However, the choice is yours, and if you think you can put together two
strong proposals, feel free to do so.

> I will be lurking in your IRC channel and on GitHub in the following
> weeks; hopefully we will have the chance to work together!

Sounds good; I look forward to hearing more from you in the coming
weeks.

Thanks!

Ryan

-- 
Ryan Curtin    | "Hold still."
ryan at ratml.org |   - Mr. Blonde


