Friday, February 18, 2005

My research

My research this semester, under Prof. Christopher Olston is on search engine index freshness - particularly the discovery of new pages on the web, using the Web Information Collector (WIC) algorithm.

The general idea is that there is a fixed resource limitation while trying to index the web. For example, bandwidth is limited so you can only download 3 pages per site per week.

What would be the best algorithm to find all of these pages given a certain set of requirements? Timeliness and completeness are at odds, so you must optimize for one over the other.

The WIC algorithm is a greedy algorithm that improves on the naive algorithm - instead of only downloading the same best 3 pages per week, it balances what it has recently viewed to balance completeness and timliness and rotates which page it downloads.

I got results today on WIC vs the naive algorithm, and WIC kicks its ass for small resource limitation. It's about 70% better when the resource limitation is 1 page, and it's always at least 15% better from my estimation.

Peace

Matt

1 comment:

Anonymous said...

Crap. I'm trying so hard to understand what that means, and it doesn't immediately make sense.

Hmph. I don't want to be dumb.

H