ProBase


Probase's 2.7 million concepts are harvested from a large corpus of 1.68 billion Web pages authored by millions of people, so it likely includes most, if not all, of the concepts about worldly facts that human beings have formed in their minds. With such a rich concept space, Probase has a much better chance of understanding writings (including queries in keywords or in natural language) created by human beings. Indeed, we studied two years' worth of search logs from a major search engine and found that 85% of the searches contain concepts and/or instances that exist in Probase.

Application 1 (Instantiation): Topic Search


Probase can be a powerful tool for interpreting the user intent behind a search. Consider the following search queries: i) politicians commit crimes; and ii) tennis players win grand slams. The user intent of these queries is clear. However, current keyword-based search engines cannot deliver good results, as they return only pages with exact, word-for-word matches for phrases such as "politicians", "crimes", "tennis players" and "grand slams".

Here are some screen snapshots of a semantic search engine prototype built on top of Probase. The prototype supports topic search, namely, queries that contain concepts. Example queries include companies buy tech companies, which contains concepts and ordinary keywords, and tech companies slogan, which contains both concepts and attributes.

For the first type of query, the search engine returns results by rewriting the query according to the taxonomic information stored in Probase; for the second type, it directly returns a table containing entities that are instances of the concept, together with their corresponding attribute values. The latter is achieved by integrating the information in Probase with Freebase. The following screen snapshots show the results returned by the prototype search engine on some example queries (in the left column of each picture); for comparison, the results returned by a major search engine are shown in the right column.


Some Technical Details: Thanks to Probase's broad coverage of concepts, it is easy to identify the concepts within user queries. The queries are then rewritten by substituting the concepts with their instances. The transformed queries are submitted to ordinary search engines (Type I queries) or searched directly in Freebase (Type II queries). The typicality T(i|x) is used to pick appropriate rewrites and to rank the results. The evaluation shows that, on average, about 80% of the returned results are judged relevant by users, compared with less than 50% for either Google or Bing.
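To make the rewriting step concrete, here is a minimal Python sketch of Type I query rewriting. The typicality table and its values are hypothetical toy data standing in for Probase, not its actual API; the real system works over millions of concepts.

    # Toy stand-in for Probase: concept -> {instance: T(i|x)}.
    TYPICALITY = {
        "politicians": {"Barack Obama": 0.05, "Hillary Clinton": 0.04},
        "tech companies": {"Microsoft": 0.06, "Google": 0.05},
    }

    def rewrite_query(query, top_k=2):
        """Substitute each concept found in the query with its top-k most
        typical instances, producing one rewritten query per instance."""
        rewrites = [query]
        for concept, instances in TYPICALITY.items():
            if concept in query:
                top = sorted(instances, key=instances.get, reverse=True)[:top_k]
                rewrites = [q.replace(concept, inst)
                            for q in rewrites for inst in top]
        return rewrites

    print(rewrite_query("politicians commit crimes"))
    # ['Barack Obama commit crimes', 'Hillary Clinton commit crimes']

Each rewritten query is then submitted to an ordinary search engine, and T(i|x) also serves to rank the merged results.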

Application 2 (Instantiation): Information Extraction


Marius Pasca developed a weakly-supervised framework that can harvest attributes of concepts from Web text (ref. Marius Pasca, "Organizing and Searching the World Wide Web of Facts - Step Two: Harnessing the Wisdom of the Crowds", WWW, 2007). Although this approach has good performance and scalability, it requires a set of "seed" instances and attributes for each concept, which must be selected manually. Probase overcomes this difficulty by automatically selecting instances with high T(i|x) as seeds. Figure 1 compares the precision of the top 20 attributes obtained this way on the 31 concepts for which precision is reported in the WWW'07 paper. On average, we achieve 88.3% precision, comparable to the 86.2% average reported there; by leveraging Probase, however, the extraction procedure becomes completely automatic.
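As an illustration, seed selection reduces to a top-k lookup on typicality. The sketch below assumes the same hypothetical concept-to-instances typicality table as above; the names are illustrative, not Probase's actual interface.

    def select_seeds(typicality, concept, k=5):
        """Pick the k instances with the highest T(i|x) as seeds for the
        weakly-supervised attribute-extraction framework."""
        scores = typicality[concept]  # {instance: T(i|x)}
        return sorted(scores, key=scores.get, reverse=True)[:k]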

Figure 1: Precision of the top 20 attributes

Application 3 (Abstraction): Short Text Understanding


Understanding short text (e.g., web search queries, tweets, anchor texts) is important to many applications. Statistical approaches such as topic models typically do not work well for short text because it contains too little content. Probase enables machines to conceptualize from a set of words by performing Bayesian analysis based on the typicality T(i|x).

Can other knowledgebases be used for this task of short text understanding?

Figure 2 compares Probase with a few other taxonomies, including WordNet, Wikipedia, and Freebase. WordNet specializes in the linguistics of English words. For the word "cat", WordNet has detailed descriptions of its various senses, although many of them are rarely used or even unknown to many people (e.g., gossip and woman as concepts for "cat"). Moreover, it contains no information for entities such as "IBM", which is not considered a word. Wikipedia and Freebase, on the other hand, contain a limited number of concepts for the word "cat". In fact, the categories there are biased and sometimes inaccurate; for example, Freebase's concept space is biased toward entertainment- and media-related concepts. More importantly, the categories in WordNet, Wikipedia, and Freebase are not ranked or scored, so users cannot tell how they differ in importance or typicality. In comparison, the concepts in Probase are more consistent with humans' common knowledge: concepts such as gossip and woman for "cat" are either not included or ranked very low, because people rarely associate them with "cat". In addition, for a word such as "language", Probase indicates that it can be both an instance in its own right and an attribute of some concepts. Thus, Probase provides additional information that is not available from WordNet, Wikipedia, or Freebase.

Figure 2: Comparison between different knowledgebases

A Bayesian Inference Model for Short Text Conceptualization

Conceptualizing Instances: Given a set of observed instances E = {e_i, i = 1, ..., M}, the task is to abstract a set of the most representative concepts that best describe the instances. The probability of each concept c_k is estimated using a naive Bayes model (note that P(e_i | c_k) is the typicality T(i|x)):

P(c_k \mid E) \propto P(c_k) \prod_{i=1}^{M} P(e_i \mid c_k)

Conceptualizing Attributes: Given a set of observed attributes A = {a_j, j = 1, ..., N}, the probability of concepts is estimated in a similar way:

P(c_k \mid A) \propto P(c_k) \prod_{j=1}^{N} P(a_j \mid c_k)

where each factor is obtained via Bayes' rule:

P(a_j \mid c_k) = \frac{P(c_k \mid a_j) \, P(a_j)}{P(c_k)}

Mixture Models: In general, given a set of terms T = {t_l, l = 1, ..., L}, where each term t_l can be either an instance or an attribute (but the type of the term is unknown), the probability of concepts is estimated as:

P(c_k \mid T) \propto P(c_k) \prod_{l=1}^{L} P(t_l \mid c_k)

and hence

P(c_k \mid T) \propto P(c_k) \prod_{l=1}^{L} \left[ P(t_l \mid c_k, \text{instance}) \, P(\text{instance} \mid c_k) + P(t_l \mid c_k, \text{attribute}) \, P(\text{attribute} \mid c_k) \right]

where each term's likelihood marginalizes over its unknown type:

P(t_l \mid c_k) = \sum_{z \in \{\text{instance},\, \text{attribute}\}} P(t_l \mid c_k, z) \, P(z \mid c_k)
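The following Python sketch implements the mixture model above on toy data. All probability tables are hypothetical stand-ins for what Probase provides; in particular, P(t_l | c_k, instance) corresponds to the typicality T(i|x).

    import math

    # Toy stand-ins for the Probase-derived probabilities.
    P_CONCEPT = {"animal": 0.4, "company": 0.6}    # P(c_k)
    P_TYPE = {"instance": 0.7, "attribute": 0.3}   # P(z | c_k), simplified here as P(z)
    P_TERM = {                                     # P(t_l | c_k, z)
        "instance": {"cat": {"animal": 0.3, "company": 0.01},
                     "apple": {"animal": 0.001, "company": 0.2}},
        "attribute": {"tail": {"animal": 0.1, "company": 0.001}},
    }

    def conceptualize(terms):
        """Rank concepts for terms of unknown type, marginalizing each
        term's likelihood over {instance, attribute} (naive Bayes)."""
        scores = {}
        for c, prior in P_CONCEPT.items():
            log_score = math.log(prior)
            for t in terms:
                likelihood = sum(
                    P_TERM[z].get(t, {}).get(c, 1e-9) * P_TYPE[z]
                    for z in ("instance", "attribute"))
                log_score += math.log(likelihood)
            scores[c] = log_score
        return sorted(scores, key=scores.get, reverse=True)

    print(conceptualize(["cat", "tail"]))   # ['animal', 'company']

When a term's type is known (cases I and A in Figure 3 below), the sum collapses to the single corresponding component.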

Term Conceptualization Examples

Figure 3 shows several examples of the general term conceptualization task, leveraging the Bayesian inference models described above.

Figure 3: Term conceptualization examples. (I: known type as instance. A: known type as attribute. U: unknown type.)

Clustering Twitter Messages

The short text conceptualization approach has been used to cluster Twitter messages. 605,501 tweets were collected and pre-processed to detect Probase entities. When multiple entities can be detected in a single piece of text, the longest entity is chosen. For example, "President Barack Obama" is treated as one entity rather than "President", "Barack Obama", or "Obama", although all of these terms are in the knowledgebase. Several examples of tweets and their corresponding concepts are shown in Figure 4.
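A greedy longest-match scan captures this pre-processing step. The sketch below uses a small set in place of Probase's vocabulary; a real implementation would match against millions of terms (e.g., via a trie).

    # Toy stand-in for the Probase term vocabulary (lower-cased).
    ENTITIES = {"president", "obama", "barack obama", "president barack obama"}

    def detect_entities(text, max_len=5):
        """Scan left to right, always preferring the longest entity that
        starts at the current token."""
        tokens = text.lower().split()
        found, i = [], 0
        while i < len(tokens):
            for n in range(min(max_len, len(tokens) - i), 0, -1):
                span = " ".join(tokens[i:i + n])
                if span in ENTITIES:
                    found.append(span)
                    i += n
                    break
            else:       # no entity starts here; move on
                i += 1
        return found

    print(detect_entities("President Barack Obama wins election"))
    # ['president barack obama']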

Figure 4: Twitter conceptualization examples.

Because the tweet data has no ground-truth labels, clustering problems are designed to evaluate the effectiveness of conceptualization as follows: tweets are collected using a set of hashtag keys and then grouped into categories based on those keys. Specifically, two clustering problems are defined in this way.

Several methods are compared on the two clustering problems, including statistical methods such as LDA and methods that use knowledgebases, namely WordNet, Freebase, Wikipedia, and Probase.

The clustering quality is evaluated using purity (ref. Y. Zhao and G. Karypis, Criterion Functions for Document Clustering: Experiments and Analysis, UMN CS TR, 2002), adjusted Rand index (ARI) (ref. L. Hubert and P. Arabie, Comparing Partitions, Journal of Classification, 2(1): 193-218, 1985), and normalized mutual information (NMI) (ref. A. Strehl and J. Ghosh, Cluster Ensembles - A Knowledge Reuse Framework for Combining Multiple Partitions, Journal of Machine Learning Research, 3: 583-617, 2002). Purity assigns each cluster to its dominant class and measures the fraction of items assigned correctly; ARI penalizes both false positive and false negative clustering decisions; and NMI has an information-theoretic interpretation and is increasingly widely used. Larger purity/ARI/NMI scores mean better clustering results. Figures 5, 6, and 7 show the comparison on purity, ARI, and NMI scores, respectively. As the results show, the Probase-based approach outperforms all other approaches on both clustering problems.
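For reference, the three metrics are easy to reproduce: ARI and NMI ship with scikit-learn, and purity follows directly from its definition. The labels below are toy data for illustration.

    import numpy as np
    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    def purity(labels_true, labels_pred):
        """Assign each predicted cluster to its dominant true class and
        return the fraction of items covered by those assignments."""
        labels_true = np.asarray(labels_true)
        labels_pred = np.asarray(labels_pred)
        hits = 0
        for cluster in np.unique(labels_pred):
            members = labels_true[labels_pred == cluster]
            hits += np.bincount(members).max()
        return hits / len(labels_true)

    y_true = [0, 0, 1, 1, 2, 2]   # hashtag-derived categories
    y_pred = [0, 0, 1, 2, 2, 2]   # clusters produced by some method
    print(purity(y_true, y_pred))                       # 0.833...
    print(adjusted_rand_score(y_true, y_pred))
    print(normalized_mutual_info_score(y_true, y_pred))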

Figure 5: Clustering purity scores on Twitter data.

Figure 6: Clustering ARI scores on Twitter data.

Figure 7: Clustering NMI scores on Twitter data.

Application 4 (Abstraction): Understanding Web Tables


There are billions of tables on the Web, and they contain much valuable information. Tables are relatively well structured, which makes them easier to understand than natural language text. Unfortunately, most of this information can be understood only by humans, not by machines. Consider the following example table:

Barack Obama    | Aug 4, 1961  | Illinois
Hillary Clinton | Oct 16, 1947 | New York

It is easy for human beings to recognize that the first column contains names of American politicians, while the second and third columns are their birth dates and birth states. But how can machines infer this?

With the help of Probase, it is possible to unlock the information in such tables, and the information, once understood, can in turn be used to enrich Probase. Specifically, T(x|i) is used to infer the typical concept, and hence the likely header, of a given column of instances in a table. Instances not already in Probase are then added under the inferred concept. The evaluation shows 96% precision on this task on average.
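A minimal sketch of the column-labeling step: score each candidate concept by accumulating T(x|i) over the cells and pick the best. The lookup table is hypothetical toy data standing in for Probase, and a simple sum is used as the score here.

    # Toy stand-in: instance -> {concept: T(x|i)}.
    T_X_GIVEN_I = {
        "Barack Obama":    {"politician": 0.6, "president": 0.3},
        "Hillary Clinton": {"politician": 0.7, "senator": 0.2},
    }

    def infer_column_concept(cells):
        """Infer the likely header concept for a column of instances by
        accumulating each cell's typicality T(x|i) per concept."""
        scores = {}
        for cell in cells:
            for concept, t in T_X_GIVEN_I.get(cell, {}).items():
                scores[concept] = scores.get(concept, 0.0) + t
        return max(scores, key=scores.get) if scores else None

    print(infer_column_concept(["Barack Obama", "Hillary Clinton"]))
    # politician

Cells with no Probase entry contribute nothing to the vote; once a concept is chosen, those unseen cells can be added to Probase under it.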
