Alexandros Ntoulas has been a Researcher at Microsoft Research in Mountain View, California, since 2006.
He received his Ph.D. and M.Sc. degrees in Computer Science from the University of California, Los Angeles (UCLA) in 2006 and 2003, respectively.
He also received an Engineering Diploma from the Computer Engineering and Informatics Department (CEID) of Patras University, Greece, in 2000.
He was the co-founder of Infocious,
a Web Search Engine indexing more than 2.5 billion pages that applied semantic analysis techniques in order to provide
highly relevant results and a better search experience to the user.
The company relaunched as Lingo Spot in 2007.
His research falls in the area of Web Information Systems. In particular, he is interested in systems and algorithms that facilitate the monitoring, collection, management, mining and searching of information on the World Wide Web.
He is currently working on task-centric Web search, a project that aims to detect, model and support the task-oriented nature of many search sessions on the Web. Such task-oriented searches typically span several days of search activity spent researching, learning, comparing and deciding, and may involve other people from the user's social network.
He is also working on a search prototype that aims at identifying and exploiting structure in keyword queries and mixing Web results with results coming from structured data sources that are abundant in the Hidden Web.
He is the recipient of the Best Paper Award at the ICDE 2005 conference, and a runner-up for the Best Paper Award at WWW 2009.
The Web has brought together a wide variety of digital information into
publicly accessible media. However, because of the sheer quantity and varying
quality of available information, the user often feels overwhelmed and
disoriented during his pursuit of information.
My research focuses on the monitoring, collection, sharing, mining and
searching of information in order to help the users identify and extract
the desired information quickly and effectively through simple and intuitive methods.
Here is a list of some of the projects that I have worked on:
Hidden-Web Crawling: Search engines employ automated programs called
crawlers to download pages from the Web. Typical crawlers today follow links
from one Web page to another and download every page in their path. However, an
ever-increasing amount of information on the Web is accessible only through
search interfaces; such information is called the Hidden or Deep Web. For
example, in PubMed (www.pubmed.org) users can access pages of high-quality
papers on medical research only after issuing a set of keywords. Since there
are no static links to the Hidden-Web pages, current search engines cannot
index them, thus depriving users of potentially valuable
information. In my research, I studied how we can build an effective Hidden-Web
crawler that can autonomously discover and download pages from the Hidden Web.
[ JCDL 2005 | extended version | slides ]
Indexing Optimizations: Search engines typically create and maintain
large-scale indexes that are used to answer thousands of user queries per
second. Given the vast amount of information available on the Web, such indexes
can easily grow very large and become very costly to operate. In my research, I
proposed and evaluated algorithms for reducing the size of an index, without
sacrificing the quality of results that we return to the users.
[ SIGIR 2007 | slides ]
Web Spam Detection: One of the most important goals of Web search
engines is to return highly relevant results to the users. However, given the
potential monetary value of the traffic that search engines direct to Web sites,
some Web site operators craft spam Web pages that are useless to human users
and exist for the sole purpose of fooling search engines into ranking
them highly, in the hope of attracting traffic. At Microsoft Research,
we studied the characteristics of Web spam and proposed fast and highly
accurate algorithms for removing spam from the search engine results.
[ WWW 2006 | slides ]
Data Synchronization: Information on the Web is constantly updated.
Therefore, once the search engine’s crawler has downloaded pages and stored
them locally, the crawler has to refresh the pages periodically. In my research,
I performed large-scale experimental studies on several million pages
collected weekly from the Web over a period of one year. We induced models that
capture the evolution of Web sites and Web-accessible textual databases. The
models were then used to predict when we should refresh the Web pages.
Additionally, since the enormous size of the Web limits most crawlers to
downloading only a subset of the entire Web, I studied sampling-based
algorithms for determining which subset of the Web the crawler should focus on.
[ VLDB 2002, slides | WWW 2004, slides | ICDE 2005, slides ]
The Infocious Web Search Engine: As part of my research, I worked on the
implementation of a full-fledged commercial Web search engine called Infocious, which blends my research in
crawling, data synchronization and indexing along with a variety of natural
language processing (NLP) techniques in order to improve the quality of results
presented to the users. The search engine
performs highly efficient crawling and indexing of Web data, operates in a
distributed fashion over a cluster of commodity machines,
provides failover capabilities that guarantee 24/7 availability of the service,
and gracefully scales to the size of the Web, currently indexing more than 2
billion pages.
[ WWW 2005 | slides ]
Automatic Web Directory Construction: Web Directories provide an
alternative to search engines for locating relevant information on the
Web. Typically, Web Directories rely on humans putting in significant time and
effort into finding important pages on the Web and categorizing them in the
Directory. I studied ways for automating the creation of a Web
Directory by assigning every page from a given collection of
pages to a given subject hierarchy. Our method is based on the identification
of important sequences of terms within Web pages (called lexical chains),
which are then used to assign the pages to categories in the hierarchy.
[ APWeb 2006 ]
Releasing Search Queries and Clicks Privately: Researchers working within a real Web search engine have the privilege of accessing real query logs containing the queries and clicks that users performed over long periods of time. Although such data is of tremendous value for research in the WWW and Social Networks areas, it is kept strictly confidential by search engine companies due to privacy concerns. At Microsoft Research, we worked on algorithms that would allow us to release queries and clicks from a real query log to third parties (users or researchers) while rigorously preserving user privacy.
[ WWW 2009 ]