[mlpack] How to get started on CF idea of GSoC 2018

Sat Feb 17 08:45:14 EST 2018

Hi Ryan,

I have done some research on the collaborative filtering literature, and
here are my thoughts.

I think removing global effects, as described in the reference paper, is
the idea most worth implementing. It is relatively easy to implement and it
leads to a significant improvement in rating prediction. I did a small
experiment on the GroupLens-100k dataset: I centered the ratings by
subtracting the overall mean and then ran the cf algorithm with different
factorizers. The RMSE of the default factorizer (NMFALSFactorizer)
decreased from 2.83887 to 1.08704, and that of RegularizedSVD decreased
from 1.1613 to 1.11595. I am still thinking about a good way to
incorporate this into the current cf code so that it can easily be
extended to remove other global effects.

One problem I also noticed is that the default factorizer
(NMFALSFactorizer) gives a poor rating prediction result (RMSE 2.83887), as
shown above. I had a look at the predicted ratings and found that most of
the predictions are close to zero (the rating scale is 1-5). I am not
familiar with the mathematics behind the update rule of this factorizer,
but I guess the reason may be that the factorizer is trying to fit zeros in
the places where ratings are missing. That could also explain why there is
a significant improvement after centering the raw ratings.

There are other SVD-related algorithms that are worth implementing:
1) BiasSVD is a method similar to RegularizedSVD. The difference is that
BiasSVD also models the user and item rating biases.
2) SVD++ improves on BiasSVD by taking implicit feedback into account.
It allows modelling the effect of boolean-valued implicit feedback. A nice
aspect of this is that the rating itself can be regarded as a kind of
implicit feedback (whether the user rated the item). So if no other
implicit feedback (e.g. whether the user browsed the item) is provided,
SVD++ can still be used with the rating as implicit feedback.
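
For reference, the prediction rules of the two models, written in the
notation I recall from the paper linked in this message (mu is the overall
mean, b_u and b_i are the user and item biases, p_u and q_i are the latent
factor vectors, N(u) is the set of items for which user u gave implicit
feedback, and y_j are the implicit item factors):

```latex
% BiasSVD: biases plus the usual factor interaction
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u

% SVD++: the user vector is augmented with implicit feedback
\hat{r}_{ui} = \mu + b_u + b_i
  + q_i^{T} \Bigl( p_u + |N(u)|^{-1/2} \sum_{j \in N(u)} y_j \Bigr)
```

With the biases and the y_j terms dropped, both reduce to the plain inner
product q_i^T p_u.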

But these two algorithms are not matrix factorizations of the form V =
W * H, which we could plug directly into the current cf code. One solution
is to add a new member like "bool UseFactorizerSpecificRatingFunction" to
struct FactorizerTraits and use SFINAE to select the right code path. A
function like "double getRating(user, item)" would then need to be
implemented in the BiasSVD/SVD++ classes. I would like to hear some
suggestions on this :)

These two algorithms can be found in this paper:
http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf

A progress indicator would be a useful tool. Some algorithms take a
relatively long time to compute, such as cf with SVDBatchFactorizer. With a
progress indicator (maybe in the form of a progress bar?) the user will
have a rough idea of how much time the process needs.

For now, I think I can focus on removing simple global effects (overall
mean, user/item main effects) or on BiasSVD. What do you think?

Thank you!

Best,
Wenhao

On Wed, Feb 7, 2018 at 10:53 PM Ryan Curtin <ryan at ratml.org> wrote:

> On Tue, Feb 06, 2018 at 05:31:14PM +0000, Wenhao Huang wrote:
> > Thanks a lot Ryan! I am going through the code in the cf module. And do
> > you know any current relevant issues that I can have a look at or even
> > start working on, to better my understanding of the implementation of
> > the cf algorithm in mlpack?
>
> Hi Wenhao,
>
> At this time there are not any open issues that I am aware of for CF.
> However, there are always improvements that can be made to the code, so
> I would encourage you to explore it and see if you can find any speedups
> or propose any functionality improvements.  For instance, maybe one idea
> is adding another simple factorizer (unfortunately I don't have one
> handy to suggest), or to profile the code and see if you can find any
> slow parts.
>
> I hope this is helpful. :)
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin    | "Moo."
> ryan at ratml.org |   - Eugene Belford
>