[mlpack] Improve mlpack's tree ensemble support - GSoC 2021
Ryan Curtin
ryan at ratml.org
Mon Mar 29 08:03:11 EDT 2021
On Mon, Mar 29, 2021 at 10:59:32AM +0530, RISHABH GARG wrote:
> Hey Ryan, thanks for the feedback.
>
>
> I also agree with you. XGBoost is one of the most widely used ML
> algorithms. It would be really great for mlpack to have it, and this
> will undoubtedly attract more users to mlpack. This discussion with you
> has changed my perspective, and I think we can prioritise XGBoost over
> the others.
>
>
> As you mentioned in the previous mail, it will be straightforward to
> implement the core XGBoost algorithm given mlpack's flexible tree
> implementation. But how can we implement optimisations like cache-aware
> access and out-of-core computation with Armadillo matrices? I remember I
> had a chat with you about this, and you briefly mentioned that it can be
> done with a simple tweak. Can you please elaborate on it a bit?
I wouldn't worry about out-of-core learning for your proposal---ideally,
we should just be able to demonstrate that the performance of our
implementation is comparable to XGBoost's.
That said, if you are interested in doing out-of-core learning, the way
I know to do it is to create a file of the right size on disk (e.g.
n_rows * n_cols * sizeof(double) bytes). Then, in your program, use
mmap() to memory-map the file. This will give you a pointer to some
memory, which you can cast to a double*. You can then use the Armadillo
advanced constructor that takes a memory pointer to create an Armadillo
matrix wrapped around the mmap()-ed file. Now, ta-da, you have an
out-of-core matrix. :) (But there are some restrictions: you can't
resize it, and operations on that matrix that produce a new matrix
will not be mmap()-ed.)
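
Here's a rough sketch of the idea (untested, with error handling
omitted; the filename and matrix dimensions are just placeholders):

#include <armadillo>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
  // Placeholder sizes and filename---adjust for your dataset.
  const arma::uword nRows = 1000;
  const arma::uword nCols = 100;
  const size_t numBytes = nRows * nCols * sizeof(double);

  // Create a backing file of the right size on disk.
  const int fd = open("data.bin", O_RDWR | O_CREAT, 0644);
  ftruncate(fd, numBytes);

  // Memory-map the file; the kernel pages data in and out as needed.
  double* mem = (double*) mmap(NULL, numBytes,
      PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

  // Wrap the mapped memory with Armadillo's advanced constructor.
  // copy_aux_mem = false: use the mapped memory directly (no copy);
  // strict = true: the matrix is bound to this memory and can't be
  // resized.
  arma::mat x(mem, nRows, nCols, false, true);

  // Use x like any other matrix; writes go back to the file.
  x.randu();

  munmap(mem, numBytes);
  close(fd);
}

Note that any expression producing a new matrix (e.g. x * x.t()) will
allocate regular RAM, so for truly huge data you'd need to keep the
intermediate results out-of-core too.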
Anyway, hope that is helpful!
Thanks,
Ryan
--
Ryan Curtin | "If it's something that can be stopped, then just try to stop it!"
ryan at ratml.org | - Skull Kid