Setting the optimal neighbor size
2022. 3. 22. 13:05ㆍAI/Big data
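Optimal neighbor size for CF
A neighborhood that is too small or too large both hurt prediction accuracy
We need to find the appropriate size
If the neighborhood is too small, the prediction overfits to a handful of users
The optimal neighbor size differs by domain
- It can differ by product, such as movies or clothing
- It can differ by customer characteristics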
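To make the trade-off concrete, here is a minimal sketch with made-up numbers. The arrays `sims` and `ratings` and the helper `predict_top_k` are hypothetical and only illustrate how a similarity-weighted mean shifts as the neighborhood grows.

```python
import numpy as np

# Hypothetical toy data: similarity of 6 other users to the target user,
# and the rating each of them gave to one movie.
sims = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
ratings = np.array([5.0, 4.0, 4.0, 2.0, 1.0, 1.0])

def predict_top_k(sims, ratings, k):
    """Similarity-weighted mean rating over the k most similar users."""
    top = np.argsort(sims)[-k:]  # indices of the k largest similarities
    return np.dot(sims[top], ratings[top]) / sims[top].sum()

predictions = {k: predict_top_k(sims, ratings, k) for k in (1, 3, 6)}
for k, p in predictions.items():
    print(f"k = {k}: predicted rating = {p:.3f}")
```

With k = 1 the prediction simply copies the single nearest neighbor's rating, while with k = 6 the low-similarity users start pulling the estimate toward their own ratings.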
brute force
data loading
import numpy as np
import pandas as pd
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv('u.user', sep='|', names=u_cols, encoding='latin-1')
i_cols = ['movie_id', 'title', 'release date', 'video release date', 'IMDB URL', 'unknown',
          'Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary',
          'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
          'Thriller', 'War', 'Western']
movies = pd.read_csv('u.item', sep='|', names=i_cols, encoding='latin-1')
r_cols = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_csv('u.data', sep='\t', names=r_cols, encoding='latin-1')
data cleansing
ratings = ratings.drop('timestamp', axis=1)
movies = movies[['movie_id', 'title']]
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
split train/test sets (stratified by user)
x = ratings.copy()
y = ratings['user_id']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, stratify=y)
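The effect of `stratify` can be checked on a small example. The sketch below uses a hypothetical toy DataFrame `toy` and fixes `random_state=0` for reproducibility; the original split does not fix a seed, so its results vary from run to run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy ratings: 3 users with 8 ratings each.
toy = pd.DataFrame({
    'user_id': [u for u in (1, 2, 3) for _ in range(8)],
    'movie_id': list(range(8)) * 3,
    'rating': [3, 4, 5, 2, 1, 4, 3, 5] * 3,
})

# Stratifying on user_id keeps each user's share of ratings equal in both sets.
train, test = train_test_split(
    toy, test_size=0.25, stratify=toy['user_id'], random_state=0
)

print(train['user_id'].value_counts().sort_index())
print(test['user_id'].value_counts().sort_index())
```

Because every user appears in the training set, every user in the test set has a row in the similarity matrix built from the training data.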
define RMSE, kNN function
def RMSE(y_true, y_pred):
    return np.sqrt(np.mean((np.array(y_true) - np.array(y_pred))**2))

def score(model, neighbor_size=0):
    id_pairs = zip(x_test['user_id'], x_test['movie_id'])
    y_pred = np.array([model(user, movie, neighbor_size) for (user, movie) in id_pairs])
    y_true = np.array(x_test['rating'])
    return RMSE(y_true, y_pred)

def cf_knn(user_id, movie_id, neighbor_size=0):
    if movie_id not in rating_matrix:
        return 3.0  # movie has no ratings in the training set: fall back to a default
    sim_scores = user_similarity[user_id].copy()  # similarities between this user and all users
    movie_ratings = rating_matrix[movie_id].copy()  # all users' ratings for this movie
    none_rating_idx = movie_ratings[movie_ratings.isnull()].index  # users who did not rate the movie
    movie_ratings = movie_ratings.drop(none_rating_idx)
    sim_scores = sim_scores.drop(none_rating_idx)  # drop the same users from the similarity vector
    if neighbor_size == 0:
        # similarity-weighted mean over all users who rated the movie
        return np.dot(sim_scores, movie_ratings) / sim_scores.sum()
    if len(sim_scores) > 1:
        neighbor_size = min(neighbor_size, len(sim_scores))
        sim_scores = np.array(sim_scores)
        movie_ratings = np.array(movie_ratings)
        user_idx = np.argsort(sim_scores)  # indices sorted by ascending similarity
        sim_scores = sim_scores[user_idx][-neighbor_size:]  # keep the neighbor_size most similar users
        movie_ratings = movie_ratings[user_idx][-neighbor_size:]
        mean_rating = np.dot(sim_scores, movie_ratings) / sim_scores.sum()
    else:
        mean_rating = 3.0
    return mean_rating
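One edge case worth noting: the weighted mean divides by `sim_scores.sum()`, which can in principle be 0 when none of the users who rated the movie share a rated movie with the target user, and the result is then NaN. A possible guard is sketched below; `safe_weighted_mean` and its `default` parameter are hypothetical names, not part of the original code, and the 3.0 default just mirrors the fallback value used in `cf_knn`.

```python
import numpy as np

def safe_weighted_mean(sim_scores, movie_ratings, default=3.0):
    """Similarity-weighted mean that returns `default` when the weights sum to zero.

    Hypothetical helper for illustration; not part of the original code.
    """
    sim_scores = np.asarray(sim_scores, dtype=float)
    movie_ratings = np.asarray(movie_ratings, dtype=float)
    total = sim_scores.sum()
    if total == 0:
        return default  # no usable neighbor: avoid dividing by zero
    return float(np.dot(sim_scores, movie_ratings) / total)

print(safe_weighted_mean([1.0, 3.0], [5, 1]))  # ordinary weighted mean
print(safe_weighted_mean([0.0, 0.0], [5, 1]))  # falls back to the default
```

The reported RMSE values suggest this case did not occur in the run shown in this post, since a single NaN prediction would make the RMSE NaN.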
build the rating matrix and calculate similarities
from sklearn.metrics.pairwise import cosine_similarity
rating_matrix = x_train.pivot_table(values='rating', index='user_id', columns='movie_id')
matrix_dummy = rating_matrix.copy().fillna(0)
user_similarity = cosine_similarity(matrix_dummy, matrix_dummy)
user_similarity = pd.DataFrame(user_similarity, index=rating_matrix.index, columns=rating_matrix.index)
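To see what filling missing ratings with 0 does to cosine similarity, here is a small sketch on a hypothetical 3-user matrix (`toy_matrix` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3-user x 3-movie rating matrix; NaN means "not rated".
toy_matrix = pd.DataFrame(
    [[5.0, 4.0, np.nan],
     [5.0, 4.0, np.nan],
     [np.nan, np.nan, 3.0]],
    index=pd.Index([1, 2, 3], name='user_id'),
    columns=pd.Index([10, 20, 30], name='movie_id'),
)

filled = toy_matrix.fillna(0)  # same zero-fill step as matrix_dummy
toy_similarity = pd.DataFrame(
    cosine_similarity(filled, filled),
    index=toy_matrix.index, columns=toy_matrix.index,
)
print(toy_similarity.round(3))
```

Users 1 and 2 have identical ratings and get similarity 1, while user 3 shares no rated movie with them and gets similarity 0. Zero-filling therefore treats "not rated" the same as a rating of 0, which is a simplification to keep in mind.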
for num_nei in range(10, 70, 10):
    print("num_nei = %d : RMSE = %.4f" % (num_nei, score(cf_knn, num_nei)))
num_nei = 10 : RMSE = 1.0201
num_nei = 20 : RMSE = 1.0057
num_nei = 30 : RMSE = 1.0037
num_nei = 40 : RMSE = 1.0032
num_nei = 50 : RMSE = 1.0037
num_nei = 60 : RMSE = 1.0038
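In this run the RMSE is lowest at a neighbor size of 40, although the values from 30 to 60 differ only in the fourth decimal place. As a small sketch, the best size can also be picked programmatically; the `results` dict below just copies the numbers printed above, and since the split is not seeded, another run may give a different winner.

```python
# RMSE values copied from the run shown above (neighbor size -> RMSE).
results = {10: 1.0201, 20: 1.0057, 30: 1.0037, 40: 1.0032, 50: 1.0037, 60: 1.0038}

# The brute-force choice is simply the size with the smallest RMSE.
best_size = min(results, key=results.get)
print(f"best neighbor size = {best_size}, RMSE = {results[best_size]:.4f}")
# prints: best neighbor size = 40, RMSE = 1.0032
```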