Sunday, December 9, 2007

Search Engine and Ranking

Crawling

Start from an HTML, save the file, crawl its links

def collect_doc(doc)
-- save doc
-- for each link in doc
---- collect_doc(link.href)

Build Word Index

Build a set of tuples

{
word1 => {doc1 => [pos1, pos2, pos3], doc2 => [pos4]}
word2 => {...}
}

def build_index(doc)
-- for each word in doc.extract_words
---- index[word][doc] << style="font-weight: bold;" size="5">Search

Given a set of words, locate the relevant docs

A simple example ...

def search(query)
-- Find the most selective word within query
-- for each doc in index[most_selective_word].keys
---- if doc.has_all_words(query)
------ result << doc
-- return result


Ranking

There are many scoring functions and the result need to be able to combine these scoring functions in a flexible way

def scoringFunction(query, doc)
-- do something from query and doc
-- return score

Scoring function can be based on word counts, page rank, or where the location of words within the doc ... etc.
Scoring need to be normalized, say within the same range.
weight is between 0 to 1 and the sum of all weights equals to 1

def score(query, weighted_functions, docs)
-- scored_docs = []
-- for each weighted_function in weighted_functions
---- for each doc in docs
------ scored_docs[doc] += (weight_function[:func](query, doc) * weight_function[:weight])
-- return scored_docs.sort(:rank)

Collaborative Filtering

Given a set of user's rating on movies, how to recommend movies to a new user ?

Ideas from the book: Collective Intelligence chapter one

Given a set of ratings per user
[ person1 => {movieA => rating-1A, movieB => rating-1B},
person2 => {movieX => rating-2X, movieY => rating-2Y, movieA => rating-2A}
...
]

Determine similarity

person1 and person2 are similar if they have rate the same movies with similar ratings

person_distance =
square_root of sum of
-- for each movie_name in person1.ratings.keys
---- if (person2.ratings.keys contains movie_name)
------ square(person1.ratings[movie_name] - person1.ratings[movie_name])


Person similarity =
0 if no common movies in corresponding ratings
1 / (1 + person_distance) otherwise

How to find similar persons to personK ?
Calculate every other person's similarity to personK, and sorted by similarity.

How about movie similarity ?

Invert the set of rankings to ...
[ movieA => {person1 => rating-1A, person2 => rating-2A},
movieB => {person1 => rating-1B}
movieX => {person2 => rating-2X}
movieY => {person2 => rating-2Y}
...
]

movie_distance =
square_root of sum of
-- for each person_name in movieX.ratings.keys
---- if (movieX.ratings.keys contains person_name)
------ square(movieX.ratings[person_name] - movieY.ratings[person_name])


Movie similarity =
0 if no common persons in corresponding ratings
1 / (1 + movie_distance) otherwise

How to find similar movies to movieX ?
Calculate every other movie's similarity to movieX, and sorted by similarity.

Making recommendations

Lets say there is a new personK provide his ratings. How do we recommend movies that may interests him ?

User-based filtering

For each person in persons
-- similarity = person_similarity(personK, person)
-- For each movie_name in person.ratings.keys
---- weighted_ratings[movie_name] += (similarity * person.ratings[movie_name])
---- sum_similarity[movie_name] += similarity

For each movie_name in weighted_ratings.keys
-- weighted_ratings[movie_name] /= sum_similarity[movie_name]

return weighted_ratings.sort_by(:rating)


Item-based filtering

Pre-calculate the movie similarity

For each movie in movies
-- neighbors[movie] = rank_similarity(movie, no_of_close_neighbors)

{
movie1 => {movieA => similarity1A, movieB => similarity1B}
movie2 => {movieC => similarity2C, movieD => similarity2D}
}

At run time ...
personK => {movieX => rating-kX, movieY => rating-kY}

For each movie in personK.ratings.keys
-- for each close_movie in neighbors[movie]
---- weight_ratings[close_movie] += neighbors[movie][close_movie] * personK.ratings[movie]
---- similarity_sum[close_movie] += neighbors[movie][close_movie]

For each target_movie in weight_ratings.keys
-- weight_ratings[target_movie] /= similarity_sum[target_movie]

return weight_ratings.sort_by(:rating)