[mlpack] mlpack video meeting: April 4

Thu Apr 4 20:48:28 EDT 2019

Hello everyone,

It was great to see and talk to some of you today. Here are the notes from
today's meeting.

Anyone is welcome to respond to individual points that are listed below, to
either continue (or start) a discussion.

Notes:

As time has gone by, mlpack has become a lot bigger and there are a lot more
people contributing nowadays, so it's getting very hard to keep track of
everything that is changing. To get everyone on the same page, Ryan went
through the changes that happened from January 2019 until now:

https://www.ratml.org/misc/mlpack-meeting-slides.pdf

Additions to the slides:

Overview
------------------------------------------------

"mlpack is nearly 12 years old now (the first code was in 2007)."

That's 10 years of work on mlpack for Ryan.

"2750 stars on Github, 97k downloads (according to my server logs so that’s an
undercount), 142 contributors and counting..."

It's an undercount because we can't count what's downloaded via GitHub, PyPI,
Conda, or anything else. It would be nice to have more data in that direction,
so if anybody would like to look into that, please feel free.

GitHub says 142 contributors, but there are probably more because
contributions from non-GitHub users aren't counted.

New website
------------------------------------------------

The big motivation behind the new website was to automate the deployment
process. It used to be that when a new version was released, Ryan had to
manually edit a bunch of HTML files in order to update the version (a very
tedious process). What we have now is a Jekyll-based website, which is
regenerated nightly to update the Doxygen documentation. Adding a new version
is done by putting the new release in place; the website is then automatically
rebuilt through a Jenkins (build server) job. It's not perfect, so if you see
any problem, please don't hesitate to open a bug report so that we can get it
fixed.

Kernel Density Estimation
------------------------------------------------

Huge thanks to robertohueso: kernel density estimation using dual-tree
algorithms is now part of the codebase. It can be used from the command line or
from Python via the KDE binding. You can also use it from C++ directly; it's in
the methods/kde directory.
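
For anyone unfamiliar with the method, here is a minimal sketch of what a
kernel density estimate computes. This is NOT mlpack's implementation or API:
it is a brute-force, one-dimensional, pure-Python version with a Gaussian
kernel, and the names gaussian_kde, reference, query, and bandwidth are chosen
here only for illustration. The dual-tree approach is a way to get the same
kind of estimate without evaluating every reference/query pair.

```python
import math


def gaussian_kde(reference, query, bandwidth):
    """Brute-force 1-D Gaussian kernel density estimate at each query point.

    Illustrative only: every (query, reference) pair is evaluated, which is
    the quadratic cost that tree-based methods try to avoid.
    """
    norm = 1.0 / (len(reference) * bandwidth * math.sqrt(2.0 * math.pi))
    estimates = []
    for q in query:
        total = sum(math.exp(-0.5 * ((q - r) / bandwidth) ** 2)
                    for r in reference)
        estimates.append(norm * total)
    return estimates


# Made-up sample data: most of the mass is near zero.
reference = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.1]
query = [0.0, 5.0]
densities = gaussian_kde(reference, query, bandwidth=0.5)
print(densities)
```

The estimate at 0.0 should come out much larger than the one at 5.0, since
5.0 is far from all of the reference points.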

Neural Network / Reinforcement Learning
------------------------------------------------

A lot of things are in progress, and the data shown on the slide is just
what has been merged so far. Excited to see everything else merged as well.

Python binding fixes
------------------------------------------------

We now check the type of parameters passed to Python bindings. For example, if
you pass in a bool where you are supposed to pass in a matrix, it will now
throw an exception instead of trying to run and ending in a segmentation fault.
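
To illustrate the idea (this is not the actual binding code), here is a small
pure-Python sketch of validating a parameter before any work is done. The
names check_matrix_param and run_method are hypothetical, and a list of lists
stands in for a real matrix type.

```python
def check_matrix_param(name, value):
    """Raise TypeError unless value looks like a matrix (rows of numbers)."""
    # bool is checked explicitly so that True/False are rejected up front.
    if isinstance(value, bool) or not isinstance(value, (list, tuple)):
        raise TypeError(
            f"parameter '{name}' must be a matrix, got {type(value).__name__}")
    for row in value:
        if not isinstance(row, (list, tuple)):
            raise TypeError(f"parameter '{name}' must contain rows of numbers")
        for x in row:
            if isinstance(x, bool) or not isinstance(x, (int, float)):
                raise TypeError(
                    f"parameter '{name}' must contain only numbers")
    return value


def run_method(reference):
    """Hypothetical binding entry point: validate first, then do the work."""
    check_matrix_param("reference", reference)
    return len(reference)  # stand-in for the real computation


try:
    run_method(True)
except TypeError as err:
    print("rejected:", err)

print(run_method([[1.0, 2.0], [3.0, 4.0]]))
```

The point of the design is simply that a bad argument fails early with a
readable message, before anything is handed to compiled code.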

Missing things that haven't been mentioned on the slides:

Previously we had some benchmarks on the website that aren't included anymore;
we will fix that soon. We also have to check whether those are up to date.

There is a licensing issue between mlpack and Armadillo: Armadillo switched
from the Mozilla Public License 2.0 to the Apache License 2.0, and mlpack is
still referencing the old license. We will have to look into that and update
the license reference.

The next release of mlpack will be 3.1. It has taken far too long, and part of
the problem was the tedious deployment process. There are a couple of open PRs
that are close to being finished and will probably go into mlpack 3.1. An
interesting note: almost all open PRs add new features, so we don't have to
wait for or prioritize PRs that fix bugs.

* All PRs mentioned in the slides that are already merged will go into the new
  release; that includes all the work from last year's GSoC.
* If anybody can think of something that shouldn't go into the release, please
  let us know.
* Some of the smaller PRs will be part of a patch release after mlpack 3.1.

Discussion topics: mlpack is a big project that can go in a lot of different
directions, but it would be useful to settle on a handful of directions and to
have people who are interested in taking the lead on them. Some interesting
directions are:

* Low-power and embedded devices:

- mlpack is written in C++, so deployment is fairly easy since you can
  compile it for a specific device. There is a GSoC project that goes in that
  direction. The opportunity here is to produce something that can be easily
  deployed and optimized for specific devices. For example, TensorFlow, which is
  used around the world, is a huge toolkit and is not necessarily optimized for
  this. Even the light version that they provide is aimed at iOS and Android, so
  it's not quite suited to embedded devices.
- gmanlan is working on a tiny version of mlpack (dependencies removed, but it
  still depends on BLAS and LAPACK). Some methods don't need BLAS/LAPACK, so
  this could be reduced even more.
- A couple of ideas to reduce the size and the number of dependencies: strip out
  all of the unused functions and link statically. Maybe there are some
  lightweight BLAS/LAPACK replacements, or we could inline the needed
  BLAS/LAPACK routines directly into the code.
- There are some toolkits out there, but mostly they are built for neural
  networks; mlpack, in contrast, provides more than neural networks.

* Automatic selection of methods:

- mlpack has many different implementations that give you the same result
  (KMeans is one example), but it's not clear which one is best. It could be an
  interesting project, both from the code side and from the research side, for
  someone to try to build a heuristic that automatically chooses the best
  algorithm for a given dataset.
- Another idea that goes in the same direction is to use meta machine learning.
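
One very simple form such a heuristic could take is to time each candidate on
a small sample and pick the fastest. The sketch below only illustrates that
idea and is not mlpack code: pick_fastest is a hypothetical helper, and two
sorting routines stand in for interchangeable implementations that return the
same result (such as the different KMeans variants).

```python
import time


def pick_fastest(candidates, sample):
    """Time each candidate on a sample; return (best name, all timings)."""
    timings = {}
    for name, func in candidates.items():
        start = time.perf_counter()
        func(sample)
        timings[name] = time.perf_counter() - start
    return min(timings, key=timings.get), timings


def sort_builtin(values):
    """Stand-in implementation A: the built-in sort."""
    return sorted(values)


def sort_insertion(values):
    """Stand-in implementation B: insertion sort, same result but slower."""
    result = []
    for v in values:
        i = len(result)
        while i > 0 and result[i - 1] > v:
            i -= 1
        result.insert(i, v)
    return result


# A reversed list is the worst case for the insertion sort above.
sample = list(range(1500, 0, -1))
candidates = {"builtin": sort_builtin, "insertion": sort_insertion}
best, timings = pick_fastest(candidates, sample)
print("selected:", best)
```

A real heuristic would of course need to consider properties of the data
(size, dimensionality, number of clusters) rather than a single timing run,
which is where the research side comes in.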

* Better Windows accessibility:

- We have a bunch of Windows users, but the current Windows support is limited;
  the main reason is the lack of expertise in that area.
- mlpack is currently missing an easy deployment and build system for Windows,
  something like a pip install.
- A Windows package manager could help: Chocolatey, or NuGet if you are using
  Visual Studio.
- Windows is also lacking pip install for the Python bindings.
- New users like to test mlpack as quickly as possible, so starting with binary
  packages might be a good start.

* NumFOCUS:
- NumFOCUS is an organization that we have talked to at different events.
- They would handle donations and help us organize workshops and/or hackathons.
- It is similar to the Apache Software Foundation.

Let us know if you have any opinion on this one, we are not going to rush
anything here.

* Automated release process:

- There are a couple of dev-ops related tasks open, so if anyone is interested
  in that please feel free to send us a mail or join the IRC channel.

* Season of Docs / Google Code-in:
- If someone would like to take part in those opportunities, please feel free
  to talk with us.

* Arbitrary precision data support:
- Low-precision machine learning is very popular now; people like to train
  neural networks with 8-bit floating point numbers, etc.
- mlpack mostly requires Armadillo double-precision matrices everywhere; there
  are some issues open about templatizing the whole API.
- Armadillo only supports 32- and 64-bit floating point, not lower precisions.
- There could be an opportunity to make an Armadillo-compatible layer/library
  for low-precision floating point.
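
To give a feeling for what lower precision means in practice, here is a small
sketch that rounds ordinary 64-bit Python floats to IEEE 754 16-bit floats
using the standard struct module. It has nothing to do with Armadillo or
mlpack; to_half and the example values are chosen here for illustration only,
and 16-bit is used because Python has no built-in 8-bit float format.

```python
import struct


def to_half(x):
    """Round a 64-bit Python float to the nearest IEEE 754 16-bit float."""
    # The "e" format code packs a value as a half-precision float.
    return struct.unpack("e", struct.pack("e", x))[0]


# Made-up values standing in for model weights.
weights = [0.1, 0.25, 1.0 / 3.0, 1000.001]
for w in weights:
    h = to_half(w)
    print(f"{w!r} -> {h!r} (abs error {abs(w - h):.3e})")
```

Values such as 0.25 that are exact in binary survive unchanged, while others
pick up a rounding error that grows with the magnitude of the number.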

* Making cutting-edge neural networks available:

- There are some open PRs for Neural Turing Machines, Highway Networks, and a
  couple of other models that build on top of the existing API.
- The models repository is a good place to show what can be done with mlpack.

* Improve the visibility of implemented methods:
- It's somewhat difficult to figure out what mlpack implements and supports; in
  some cases you have to search through the code.

Happy to clarify anything.

Thanks,
Marcus

> On 4. Apr 2019, at 15:32, Ryan Curtin <ryan at ratml.org> wrote:
> 
> On Fri, Mar 29, 2019 at 11:51:24PM -0400, Ryan Curtin wrote:
>> Hey there everyone,
>> 
>> After some discussion, schedule wrangling, and playing with
>> videoconferencing software, we've decided that we'll have the first
>> mlpack video meeting on
>> 
>>    Thursday, April 4 at 1600-1700 UTC
>> 
>> (so to convert that to some common time zones, from west to east: 9am
>> PDT, 12pm EDT, 4pm GMT, 6pm CEST, 7pm MSK, 9:30pm IST), and we'll use
>> the open-source Jitsi videoconferencing software to meet at
>> 
>> https://meet.mlpack.org/mlpack-meeting
> 
> Hey everyone,
> 
> It turns out when I made the URL above that I did not realize that the
> Jitsi software disallows hyphens in the meeting name.  So, instead,
> if you're able to attend the meeting, let's meet at this URL instead:
> 
> https://meet.mlpack.org/mlpackmeeting
> 
> Like I mentioned before, I'll post notes afterwards and we can follow up
> on the mailing list or IRC or wherever if needed, so if you can't make
> it no worries.
> 
> Sorry for the confusion and see you in 2.5 hours!  Nothing ever goes
> perfect the first time.  We will figure out what other difficulties we
> will have shortly enough. :)
> 
> Thanks,
> 
> Ryan
> 
> -- 
> Ryan Curtin    | "Moo."
> ryan at ratml.org |   - Eugene Belford