[mlpack] Google Summer of Code project proposal

Leinoff, Alexander alexander-leinoff at uiowa.edu
Thu Mar 24 16:13:12 EDT 2016


Hi Ryan!
Thanks for getting back to me! To answer your questions: yes, yes, yes, and yes.

 > One of my interests has always been to have Jenkins build mlpack against
all versions of its dependencies and run the tests, to try and find
subtle bugs.

This is actually one of the main purposes of the CTest/CDash system! Any platform that can run CMake can also run CTest. So if you have a number of different platforms set up, you can use a scheduler to build mlpack and run the tests every night under a variety of configurations, and the reports will be automatically aggregated on the CDash server, which keeps the test results in its history. This makes it very easy to see when a bug was introduced, which platforms it affects, and which tests it caused to fail.
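
For example, hooking a project up to a CDash server only takes a small CTestConfig.cmake at the top of the source tree. Here is a minimal sketch (the drop site below is a placeholder, not an existing mlpack dashboard):

    set(CTEST_PROJECT_NAME "mlpack")
    set(CTEST_NIGHTLY_START_TIME "00:00:00 EST")
    set(CTEST_DROP_METHOD "http")
    set(CTEST_DROP_SITE "my.cdash.org")
    set(CTEST_DROP_LOCATION "/submit.php?project=mlpack")
    set(CTEST_DROP_SITE_CDASH TRUE)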

> how can we utilize the hardware that we already have?
I would suggest setting up each of your machines as a nightly build client, with each one building mlpack and running the tests every night (with different dependency versions too, if you want to check that) and submitting its results to the dashboard. That way you would see each morning which tests passed and failed on each platform, along with any build problems, so obscure platform-specific bugs become visible very quickly.
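
Concretely, each machine would just run a small CTest "dashboard script" from cron (or any other scheduler) every night. A minimal sketch, with illustrative paths and machine names:

    # nightly.cmake -- run as "ctest -S nightly.cmake" from cron
    set(CTEST_SITE "masterblaster")               # machine name shown on CDash
    set(CTEST_BUILD_NAME "Linux-x86_64-gcc")      # configuration label
    set(CTEST_SOURCE_DIRECTORY "$ENV{HOME}/mlpack")
    set(CTEST_BINARY_DIRECTORY "$ENV{HOME}/mlpack-build")
    set(CTEST_CMAKE_GENERATOR "Unix Makefiles")
    set(CTEST_UPDATE_COMMAND "git")

    ctest_start(Nightly)     # open a new nightly entry on the dashboard
    ctest_update()           # pull the latest source
    ctest_configure()        # run cmake
    ctest_build()            # compile
    ctest_test()             # run all registered tests
    ctest_submit()           # push results to CDash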


> what changes will need to be made to the mlpack codebase to support
   your project?

One of the major changes will be converting the existing testing process into a more granular one using CTest. Right now, after a pull request, all the tests are run on Travis via the command mlpack_test, which executes a single binary containing all of the tests. With CTest, each of your 617 test cases becomes a separate test, and they are all executed via the command “ctest”. After the test execution, the results are submitted to your online dashboard, where you can easily view timings, outputs, and reasons for failure for each test. This history is saved, so a month from now you could go back and review the results. That way it is very easy for a developer to see which test failed, why, and whether that test has a history of failure on a particular platform.
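
Since mlpack_test is a Boost.Test binary, this split doesn't require touching the test code itself: each Boost test suite can be registered as its own CTest test using the binary's --run_test filter. A sketch of what the tests' CMakeLists.txt could look like (the suite names here are just illustrative):

    include(CTest)

    # One CTest test per Boost.Test suite inside the existing binary.
    foreach(suite KMeansTest LARSTest LoadSaveTest)
      add_test(NAME ${suite}
               COMMAND mlpack_test --run_test=${suite})
    endforeach()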

 > how can we present the information gathered by the automatic build
   system in a concise and manageable way?

Once the tests have been migrated to CTest, the dashboard submission process is automatic. CDash parses each submission into a standard presentation format showing the build site, environment, which tests passed or failed, test timings, test history, and test output. In addition, the test results are written to an XML file, which could be parsed by other tools to extract whatever data you want.
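
Assuming a CTestConfig.cmake like the sketch above, a developer could also push a one-off entry to the dashboard by hand:

    cd mlpack-build
    ctest -D Experimental   # configure, build, test, and submit in one step

The raw Test.xml is left under Testing/ in the build tree, so any other tooling can read it directly.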

> Benchmarking?
CTest records timing information for every individual test; the timings are reported to CDash and are easily available there.
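
CTest can also act on those timings rather than just record them; for example, a runaway test can be failed automatically (suite name illustrative):

    # Fail the test (and flag it on CDash) if it runs longer than 600 seconds.
    set_tests_properties(KMeansTest PROPERTIES TIMEOUT 600)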

I'll make sure that these points are addressed in my proposal. Do these answers address your questions satisfactorily? Do you have any other concerns? Let me know.

Thanks!
Alex


________________________________________
From: Ryan Curtin <ryan at ratml.org>
Sent: Wednesday, March 23, 2016 2:01 PM
To: Leinoff, Alexander
Cc: mlpack at cc.gatech.edu
Subject: Re: Google Summer of Code project proposal

On Tue, Mar 22, 2016 at 10:49:05PM +0000, Leinoff, Alexander wrote:
> Hi!
> If you have time, I’d like to get your feedback on a project idea and
> proposal for the google summer of code. My name is Alexander Leinoff (
> I made a couple of commits yesterday ), and I’m in my final year as an
> undergraduate studying Computer Engineering at The University of Iowa.
> I’d like to submit a project proposal that’s a little off of the
> books, although I think it will end up covering the “More Diverse
> Build Slaves” and possibly the “Profiling for further optimization”
> Summer of Code Ideas. My main goal will be to set up a robust and
> constant cross-platform testing framework, with an online dashboard
> presentation using cmake, ctest, and cdash, such as the one used by
> ITK: https://open.cdash.org/index.php?project=Insight . I’d also like
> to implement a SuperBuild environment with cmake to automatically
> build project dependencies. I think your project could really use a
> more understandable and easy to use (and visualize) testing process,
> and I’d like to help you implement it! Let me know if this is
> something I should be pursuing for the Summer of Code project, or if I
> should be focusing more on the listed ideas as they are stated. I’ve
> shared a draft of my proposal via the Google Summer of Code submission
> page, it still needs some work, but please check it out and let me
> know what you think. Any feedback is greatly appreciated!

Hi Alex,

Thanks for your contributions over the past couple of days.

I'm perfectly fine with proposals for projects that are not what's
written on the Ideas page.

The dashboard for ITK looks pretty nice.  Currently we use Jenkins, set
up at http://big.mlpack.org:7780/ (it is on a slow connection,
unfortunately).  It used to be that we had more systems set up, but
since I finished my Ph.D. I no longer have the resources to set all
those systems up:

http://ratml.org/misc_img/build_farm_new.jpg

Now I am at Symantec, and they can support mlpack, but with different
hardware.  Let me describe what I have access to:

 * "masterblaster", 2x Xeon E5-2699v3 (72 cores) with 256GB RAM and
   3-4TB storage
 * "big", "samedi", "cabbie", "dambala", and "shoeshine": HP i5 desktops
   with 8-16GB of RAM; older desktops from ~2010
 * 3 unnamed Sun SPARC Enterprise T5220s, each with 64 cores and I think
   128GB of RAM?  They've never been powered on, so I need to set them
   up.  I found them in a closet; they were going to be thrown away.

I should be able to get all of these set up in a way that they are
externally accessible by the beginning of the summer.  (Right now, some
of these can only be accessed internally.)

The five desktops are part of the old build farm.  Two of them, cabbie
and shoeshine, are used for benchmarking using the benchmarking system
that Marcus built in GSoC 2013 and Anand improved in GSoC 2014:

https://github.com/zoq/benchmarks
http://www.mlpack.org/benchmarks.html

One of my interests has always been to have Jenkins build mlpack against
all versions of its dependencies and run the tests, to try and find
subtle bugs.  I don't know how well Jenkins will play with CTest and
CDash, which are products I've never used.  The ITK dashboard you linked
to looks nice; Jenkins can give similar output.  (I wouldn't be
surprised if both can be used in tandem.)

So we should definitely work out the details, but we can also change
some things around after the proposal deadline if necessary.  I'll take
a look over your proposal when I have a chance (next day or two?), but I
will be looking for how we can work out the following things:

 * how can we utilize the hardware that we already have?

 * can we automate the benchmarking process better, and integrate the
   benchmarking system Marcus built well?

 * what changes will need to be made to the mlpack codebase to support
   your project?

 * how can we present the information gathered by the automatic build
   system in a concise and manageable way?

Anyway, it may not be possible to answer all these questions, but we
should at least try.  I think you are absolutely correct that mlpack
could use a better CI infrastructure, so I am excited to see what you
can put together for a proposal.

Thanks for getting in touch!  Let me know if I can clarify anything.

Ryan

--
Ryan Curtin    | "The enemy cannot press a button... if you have
ryan at ratml.org | disabled his hand." - Sgt. Zim

