Tuesday, April 6, 2021

python: Building a simple recommender system with SVD (Singular Value Decomposition) (feat. surprise)

svd_example
In [15]:
# !pip install surprise
In [21]:
from surprise import SVD
from surprise import Dataset
from surprise import dump
from surprise import accuracy
from surprise import Reader
import pandas as pd 
from collections import defaultdict
In [6]:
# data = Dataset.load_builtin('ml-100k')

Reading the training data

Let's use movie ratings data.

Download the MovieLens data: https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
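
If you'd rather fetch and unpack the archive from Python instead of the browser, the standard library is enough. This is only a sketch: the ~/SVD target directory is an assumption chosen to match the read_csv path used below, and the archive extracts into an ml-latest-small/ subfolder, so ratings.csv still needs to be copied up (or the path below adjusted).

import os
import urllib.request
import zipfile

url = 'https://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
dest = os.path.expanduser('~/SVD')                  # assumed working directory (matches read_csv below)
os.makedirs(dest, exist_ok=True)

zip_path = os.path.join(dest, 'ml-latest-small.zip')
urllib.request.urlretrieve(url, zip_path)           # download the archive

with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(dest)                             # extracts to ~/SVD/ml-latest-small/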

In [10]:
ratingsDf = pd.read_csv('~/SVD/ratings.csv')
In [16]:
ratingsDf.head()
Out[16]:
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931

Training the model

In [12]:
### Format the training data from the DataFrame
DfOrgData = ratingsDf[['userId','movieId','rating']]

r_min = DfOrgData['rating'].min()
r_max = DfOrgData['rating'].max()
reader = Reader(rating_scale=(r_min, r_max))
data = Dataset.load_from_df(DfOrgData[['userId', 'movieId', 'rating']],reader)

trainset = data.build_full_trainset()
testset = trainset.build_testset()

algo = SVD()

### Train SVD on the trainset
algo.fit(trainset)
Out[12]:
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f9b3a4634a8>
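
The dump module imported at the top isn't used anywhere else in this post, but it is the surprise way to persist a trained model. A minimal sketch (the file name svd_model.dump is arbitrary):

### Save the fitted model to disk; dump() can also store predictions alongside it
dump.dump('svd_model.dump', algo=algo)

### dump.load() returns a (predictions, algo) tuple; predictions is None here since none were saved
_, loaded_algo = dump.load('svd_model.dump')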

Evaluating the results

In [17]:
### Measure RMSE on the testset
predictions = algo.test(testset)
accuracy.rmse(predictions)
RMSE: 0.6411
Out[17]:
0.6411301153941434
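
One caveat: testset was built with trainset.build_testset(), so this RMSE is computed on the very ratings the model was fitted on and will look optimistic. For an estimate on unseen ratings, surprise's model_selection module offers train_test_split and cross_validate; a rough sketch (the 25% split size and random_state are arbitrary choices):

from surprise.model_selection import train_test_split, cross_validate

### Hold out 25% of the ratings that the model never sees during fitting
train_part, test_part = train_test_split(data, test_size=0.25, random_state=42)
holdout_algo = SVD()
holdout_algo.fit(train_part)
accuracy.rmse(holdout_algo.test(test_part))

### Or average the RMSE over 5 cross-validation folds
cross_validate(SVD(), data, measures=['RMSE'], cv=5, verbose=True)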

Finding the best parameters with grid search

In [44]:
from surprise.model_selection import GridSearchCV, cross_validate

param_grid = {'n_factors': [50, 75], 'lr_all': [0.5, 0.05], 'reg_all': [0.06, 0.04]} 
gs = GridSearchCV(algo_class=SVD, measures=['RMSE'], param_grid=param_grid) 
gs.fit(data) 
print('\n###################') 
print('Best Score :', gs.best_score['rmse']) 
print('Best Parameters :', gs.best_params['rmse']) 
print('#####################')
###################
Best Score : 0.864674558789791
Best Parameters : {'n_factors': 75, 'lr_all': 0.05, 'reg_all': 0.06}
#####################

Building the final model with the best parameters

In [47]:
best_params = gs.best_params['rmse']
In [66]:
final_algo = SVD(n_factors=best_params['n_factors'], lr_all=best_params['lr_all'], reg_all=best_params['reg_all'])

### Train SVD with the tuned parameters
final_algo.fit(trainset)

### Measure the final RMSE
predictions = final_algo.test(testset)
accuracy.rmse(predictions)
RMSE: 0.4530
Out[66]:
0.45299748930769984
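
As a side note, GridSearchCV also exposes a best_estimator dict (keyed by measure) that holds an algorithm instance already configured with the best parameters, so the manual re-instantiation above can be written a bit more compactly:

### Equivalent shortcut: an SVD instance pre-configured with the best parameters for RMSE
best_algo = gs.best_estimator['rmse']
best_algo.fit(trainset)
accuracy.rmse(best_algo.test(testset))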

Inspecting all prediction results

In [67]:
pd.DataFrame(predictions)
Out[67]:
          uid     iid  r_ui       est                    details
0           1       1   4.0  4.183936  {'was_impossible': False}
1           1       3   4.0  4.068512  {'was_impossible': False}
2           1       6   4.0  4.570216  {'was_impossible': False}
3           1      47   5.0  5.000000  {'was_impossible': False}
4           1      50   5.0  4.902995  {'was_impossible': False}
...       ...     ...   ...       ...                        ...
100831    610  166534   4.0  4.005984  {'was_impossible': False}
100832    610  168248   5.0  4.764278  {'was_impossible': False}
100833    610  168250   5.0  4.579400  {'was_impossible': False}
100834    610  168252   5.0  4.708238  {'was_impossible': False}
100835    610  170875   3.0  2.907783  {'was_impossible': False}

100836 rows × 5 columns
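
With the predictions in a DataFrame it's also easy to see where the model misses most; a small sketch that sorts by absolute error (the column names are the Prediction fields shown above):

### Show the predictions with the largest absolute error first
predDf = pd.DataFrame(predictions)
predDf['abs_err'] = (predDf['r_ui'] - predDf['est']).abs()
predDf.sort_values('abs_err', ascending=False).head()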

Checking the predicted rating for user 1, item 2

In [33]:
userid = 1 
movieid = 2
sample_test = [(userid, movieid, 0)]
algo.test(sample_test)
Out[33]:
[Prediction(uid=1, iid=2, r_ui=0, est=4.35294054528359, details={'was_impossible': False})]
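
surprise also has a predict() method for a single user-item pair; it takes the raw ids directly and doesn't need the placeholder rating of 0 used above:

### Single prediction without building a test tuple; returns the same Prediction namedtuple
pred = algo.predict(uid=1, iid=2)
print(pred.est)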

Printing only the top N

In [19]:
def get_top_n(predictions, n=10):
    # Store each user's predictions in a defaultdict
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Sort and keep only the top N
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n
In [37]:
topNresults = get_top_n(predictions, 5)
In [42]:
topNresults[1]
Out[42]:
[(260, 5.0), (1136, 5.0), (1196, 5.0), (1197, 5.0), (2329, 5.0)]
In [43]:
topNresults[2]
Out[43]:
[(58559, 4.345062592243402),
 (80906, 4.248299909098796),
 (79132, 4.238211813290691),
 (112552, 4.218742060225583),
 (318, 4.205383400895031)]
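
Keep in mind that predictions came from testset, i.e. from pairs the users actually rated, so these top-N lists just re-rank movies each user has already seen. To recommend movies a user has not rated yet, the trainset offers build_anti_testset(); combined with movies.csv from the same MovieLens archive the ids can be shown as titles. A rough sketch, assuming movies.csv sits next to ratings.csv at ~/SVD/movies.csv (the anti-testset covers every unrated pair, several million here, so this step is slow):

### Predict over every user-item pair that is NOT in the trainset
anti_testset = trainset.build_anti_testset()
unseen_predictions = final_algo.test(anti_testset)
top_n_unseen = get_top_n(unseen_predictions, 5)

### Map movieIds to titles (movies.csv ships in the same zip as ratings.csv)
moviesDf = pd.read_csv('~/SVD/movies.csv')
id_to_title = dict(zip(moviesDf['movieId'], moviesDf['title']))
print([(id_to_title[iid], round(est, 3)) for iid, est in top_n_unseen[1]])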
