[mlpack] How to get started on CF idea of GSoC 2018

Ryan Curtin ryan at ratml.org
Tue Feb 20 08:33:12 EST 2018


On Sat, Feb 17, 2018 at 01:45:14PM +0000, Wenhao Huang wrote:
> Hi Ryan,
> 
> I have done some research on collaborative filtering literature and here
> are my thoughts.
> 
> I think removing global effects, as described in the reference paper, is
> worth implementing the most. It is relatively easy to implement and it leads
> to significant improvement in rating prediction. I did a small experiment
> on the GroupLens-100k dataset. I centered the ratings by subtracting the
> overall mean and then ran the CF algorithm with different factorizers. The
> RMSE of the default factorizer (NMFALSFactorizer) decreases from 2.83887
> to 1.08704, and that of RegularizedSVD decreases from 1.1613 to 1.11595. So
> far I am still thinking about a good way to incorporate this into the
> current cf code so that it would be flexible to extend to removing other
> global effects.
> 
> One problem I also noticed is that the default factorizer
> (NMFALSFactorizer) gives poor rating prediction results (2.83887), as shown
> above. I had a look at the predicted ratings and found that most of the
> predictions are close to zero (the rating scale is 1-5). I am not familiar
> with the mathematics behind the updating rule of this factorizer, but I
> guess the reason may be that the factorizer is trying to fit zero in the
> place where ratings are missing. That could also explain why there is a
> significant improvement after normalizing the raw ratings.
> 
> There are other svd-related algorithms that are worth implementing:
> 1) BiasSVD is a method similar to RegularizedSVD. The difference is that
> BiasSVD also considers the user/item rating bias.
> 2) SVD++ improves BiasSVD by taking implicit feedback into consideration.
> It allows modelling the effect of boolean-valued implicit feedback. A nice
> aspect of this is that the rating itself can be regarded as a kind of
> implicit feedback (whether the user rated the item). So even if no other
> implicit feedback (e.g., whether the user browsed the item) is provided,
> SVD++ can still be used with the rating as implicit feedback.
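> 
> For reference, here is how I read the prediction rules in the paper cited
> below (my transcription, so please double-check against the paper):
> 
> ```latex
> % BiasSVD: global mean plus user/item biases plus the latent interaction.
> \hat{r}_{ui} = \mu + b_u + b_i + q_i^\top p_u
> 
> % SVD++: the user factor is augmented with implicit-feedback item factors
> % y_j, where N(u) is the set of items user u gave implicit feedback on.
> \hat{r}_{ui} = \mu + b_u + b_i
>              + q_i^\top \Big( p_u + |N(u)|^{-\frac{1}{2}} \sum_{j \in N(u)} y_j \Big)
> ```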
> 
> But these two algorithms are not matrix factorizations of the form V =
> W * H that we can directly plug into the current CF code. One solution is
> to add a new member like "bool UseFactorizerSpecificRatingFunction" to
> struct FactorizerTraits, and use SFINAE to dispatch on it. Then a function
> like "double getRating(user, item)" would need to be implemented in the
> BiasSVD/SVD++ class. I would like to hear some suggestions on this :)
> 
> These two algorithms can be found in this paper:
> http://www.cs.rochester.edu/twiki/pub/Main/HarpSeminar/Factorization_Meets_the_Neighborhood-_a_Multifaceted_Collaborative_Filtering_Model.pdf
> 
> A progress indicator should also be a useful tool. Some algorithms take a
> relatively long time to compute, such as CF with SVDBatchFactorizer.
> With a progress indicator (maybe in the form of a progress bar?) the user
> would have a rough idea of how much time the process needs.
> 
> For now, I think maybe I can focus on removing simple global effects
> (overall mean, user/item main effects) or BiasSVD. What do you think?

Hi Wenhao,

Thanks for the detailed analysis.  I agree that an option to remove
global effects could be useful, and this may help the NMF factorizer.  In
fact, the computation that the NMF factorizer performs is minimizing
|| V - W*H ||, where V is the existing rating matrix and W and H are the
component low-rank matrices.  V is represented in the code as a sparse
matrix, so any non-existent ratings are implicitly zeros; thus it is not
too surprising that the reconstructed matrix W*H has many close-to-zero
values.  That is not necessarily a bad thing in every case.  Still,
global effect removal, as you pointed out, can help a lot.

For BiasSVD and SVD++, I do agree that they would be useful, but we
would need to think a little about the best way to fit them in.  I
think all the existing CF code uses an AMF<...> factorizer.  Perhaps it
would be best to add some generic functions like GetRating(user, item),
as you suggested, to AMF<...> as well as to any other decomposition
technique you might add.  That way we can avoid SFINAE, and instead
the technique used for CF can be a policy class template parameter.  But
some thought needs to go into how to support any type of CF technique
(or whether there are enough similarities between CF techniques to even
make that reasonable).

A progress indicator would be nice, but it's difficult to tell in advance
how long an optimizer will take to converge, so I don't think we could
make a very accurate one.  However, improving the output that is printed
by the optimizers could be a step in the right direction.

I hope the reply is helpful; let me know if I can clarify anything.
Thanks for your in-depth investigation. :)

Thanks!

Ryan

-- 
Ryan Curtin    | "Because for the people who knew what it was, it
ryan at ratml.org | would be attacking their brain."
