Re: [UAI] web text classification applications in real life

From: Kevin S. Van Horn (ksvhsoft@xmission.com)
Date: Sun Mar 26 2000 - 08:22:40 PST

        Does anyone know where to find information about web text
       classification applications at Yahoo! or Excite? How do they
       categorize text automatically? How did they do it in the very
       beginning? What is their history, and how are they evolving now?

    I can tell you something about Excite -- I was employee #10 there and the
    technical lead for NewsTracker (in the beginning, I more or less was the entire
    NewsTracker project). I left in April of 1997, but when I was there,
    both searching and the NewsTracker news story classifications were based on a
    variant of the vector-space model. In the vector-space model you represent a
    document by a normalized vector of word counts in the document, with
    less-common words weighted more heavily. A query has the same representation,
    and you score documents by the angle between the query vector and the
    document vector (the smaller the angle, the better the match). A Boolean
    query is basically a combination of a filtering predicate (the Boolean
    part) and a ranking step: the words in the query are used to construct
    the vector that ranks those documents that make it through the filter.
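
    A minimal Python sketch of the scoring described above, offered only as
    an illustration. It is not Excite's code: the smoothed IDF formula, the
    whitespace tokenizer, the AND-only filter, and every name in it
    (idf_weights, vectorize, cosine, search, must_contain, DOCS) are
    assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for real tokenization.
    return text.lower().split()


def idf_weights(docs):
    # Count how many documents each word appears in.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    # Assumption: a smoothed IDF, so less-common words get larger weights.
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def vectorize(text, idf):
    # Weighted word counts, normalized to unit length.
    counts = Counter(tokenize(text))
    vec = {w: c * idf[w] for w, c in counts.items() if w in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else {}


def cosine(a, b):
    # For unit vectors the dot product is the cosine of the angle between them.
    return sum(v * b.get(w, 0.0) for w, v in a.items())


def search(query, docs, must_contain=()):
    # must_contain is a hypothetical AND-only stand-in for the Boolean filter.
    idf = idf_weights(docs)
    q = vectorize(query, idf)
    results = []
    for i, doc in enumerate(docs):
        words = set(tokenize(doc))
        if not all(term in words for term in must_contain):
            continue  # filtered out by the Boolean predicate
        results.append((cosine(q, vectorize(doc, idf)), i))
    # Highest cosine (smallest angle) first.
    return sorted(results, reverse=True)


DOCS = [
    "stock market falls as investors sell shares",
    "local team wins the championship game",
    "investors cheer as the stock market rallies",
]
```

    Calling search("stock market", DOCS) returns (score, index) pairs for all
    three documents, with the sports story last because it shares no words
    with the query. Adding must_contain=("rallies",) drops the documents
    that fail the filter before anything is ranked.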

    There's a book by Salton (don't recall the title just now) that's a good
    reference on this stuff.


