Does anyone know where to find information about the web text
classification systems used at Yahoo! or Excite? How do they
categorize text automatically? How did they do it in the very
beginning? What is their history, and how are they evolving now?
I can tell you something about Excite -- I was employee #10 there and the
technical lead for NewsTracker (in the beginning, I more or less was the entire
NewsTracker project). I left in April of 1997, but when I was there,
both searching and the NewsTracker news story classifications were based on a
variant of the vector-space model. In the vector-space model you represent a
document by a normalized vector of word counts in the document, with
less-common words weighted more heavily. A query has the same representation,
and you score each document by the angle between its vector and the query
vector: the smaller the angle, the better the match. A Boolean query is
basically a filtering predicate (the Boolean part) plus a ranking step: the
words in the query form the vector that ranks whichever documents make it
through the filter.
There's a book by Salton (don't recall the title just now) that's a good
reference on this stuff.
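
To make the description above concrete, here is a minimal sketch in Python of
vector-space scoring with a Boolean filter in front of it. This is only an
illustration under assumptions, not Excite's actual code: the whitespace
tokenizer, the log(N/df) + 1 rarity weighting, and the function names
(idf_weights, to_vector, cosine, search) are all made up for the example, and
the filter is reduced to "every required word must appear". Ranking by cosine
is equivalent to ranking by angle, since a larger cosine means a smaller angle.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    return text.lower().split()


def idf_weights(docs):
    # Rarer words get larger weights. The log(N/df) + 1 form is one common
    # choice, assumed here; the original post does not say which was used.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {word: math.log(n / count) + 1.0 for word, count in df.items()}


def to_vector(text, idf):
    # Weighted word counts, scaled to unit length. Words never seen in the
    # collection get weight 0.
    counts = Counter(tokenize(text))
    vec = {w: c * idf.get(w, 0.0) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}


def cosine(a, b):
    # Dot product of two unit-length sparse vectors = cosine of their angle.
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def search(query, docs, required=()):
    # Boolean part: keep only documents containing every required word.
    # Ranking part: order the survivors by cosine with the query vector.
    idf = idf_weights(docs)
    q = to_vector(query, idf)
    results = []
    for i, doc in enumerate(docs):
        words = set(tokenize(doc))
        if not all(r in words for r in required):
            continue
        results.append((cosine(q, to_vector(doc, idf)), i))
    return sorted(results, reverse=True)


docs = [
    "stock market falls on rate fears",
    "market rally lifts tech stock prices",
    "local team wins championship game",
]
for score, i in search("stock market", docs, required=("tech",)):
    print(round(score, 3), docs[i])
```

A real system would compute the weights and document vectors once at indexing
time rather than on every query; the sketch recomputes them only to stay short.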