This page summarizes our experimental results on the taxonomy we built. In addition, the proofs of Theorems 1 and 2 in Section 3.6, which relate to the taxonomy construction framework, can be found here.
|name||# of concepts||# of isA pairs|
|ResearchCyc||≈ 120,000||< 5,000,000|
We extract 326,110,911 sentences from a corpus containing 1,679,189,480 web pages. To the best of our knowledge, the scale of our corpus is one order of magnitude larger than the previously known largest corpus. The inferred taxonomy contains 2,653,872 distinct concepts, 16,218,369 distinct concept-instance pairs, and 4,539,176 distinct concept-subconcept pairs (20,757,545 pairs in total).
For comparison, Table 1 shows statistics of several well-known open-domain taxonomies alongside Probase. For WordNet, we count only the noun sub-taxonomy, and we have converted synsets to their lexical forms. For Freebase, the statistics are obtained from a version downloaded in early March 2010. More than 3,000 topics in this data source are incorrect and cannot be found on the official Freebase website, and are therefore ignored in our analysis. For ResearchCyc, the number of isA pairs shown is in fact the number of all relationships, since the exact number of isA pairs is not reported. For YAGO, the statistics are obtained from its latest version (Dec. 2010), and the number of isA pairs is inferred by summing the reported numbers of SubConceptOf and Type relations.
For completeness, Table 1 also includes statistics for KnowItAll, TextRunner, OMCS, and NELL. However, these frameworks are not intended to build a taxonomy of the kind we want, but to extract general facts that may indicate various relationships between concepts or entities. It is therefore usually hard to tell concepts from entities, and, unless it is reported, hard to tell how many of the extracted pairs are isA pairs.
Given that Probase has many more concepts than any other taxonomy, a reasonable question to ask is whether it is more effective in understanding text. We measure one aspect of this effectiveness by examining Probase's concept coverage on web search queries. For the purpose of comparison, we define a concept to be relevant if it appears at least once in web queries. We analyzed Bing's query log from a two-year period, sorted the queries in decreasing order of frequency (i.e., the number of times they were issued through Bing), and computed the number of relevant concepts in Probase and in four other general-purpose open-domain taxonomies (WordNet, WikiTaxonomy, YAGO, and Freebase) with respect to the top 50 million queries. Figure 1 shows the result.
In total, 664,775 concepts are considered relevant in Probase, compared to 70,656 in YAGO. This reflects the well-known long-tail phenomenon of user queries. While a small number of basic concepts (e.g., company, city, country) representing common-sense knowledge appear very frequently in user queries, Web users also mention other, less well-known concepts. Probase does a better job of capturing these long-tail concepts and hence has a better chance of understanding these user queries.
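The relevance definition above (a concept is relevant if it appears at least once in the queries) can be sketched as follows. This is only an illustration: the function name `relevant_concepts`, the whitespace tokenization, and the 4-word phrase limit are assumptions, not the procedure actually used on the query log.

```python
def relevant_concepts(queries, concepts, max_len=4):
    """Return the concepts that occur in at least one query.

    A concept "occurs" here if it equals a contiguous run of words in the
    lowercased query (an assumed matching rule for this sketch).
    """
    found = set()
    for query in queries:
        words = query.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = " ".join(words[i:i + n])
                if phrase in concepts:
                    found.add(phrase)
    return found


# Toy data, not taken from the query log.
concepts = {"city", "company", "tropical fruit"}
queries = ["largest city in europe", "tropical fruit list", "city map"]
relevant = relevant_concepts(queries, concepts)
print(len(relevant), sorted(relevant))
```

Counting `len(relevant)` for each taxonomy over the same query set gives the kind of comparison plotted in Figure 1.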
We next measure the taxonomy coverage of queries by Probase. A query is said to be covered by a taxonomy if the query contains at least one concept or instance within the taxonomy. Figure 2 compares the coverage of queries by Probase taxonomy against the other four aforementioned taxonomies. Probase outperforms all the other taxonomies on the coverage of top 10 million to top 50 million queries. In all, Probase covers 40,517,506 (or, 81.04%) of the top 50 million queries.
We further measure concept coverage, which is the number of queries containing at least one concept in the taxonomy. Figure 3 compares the concept coverage by Probase against the other four taxonomies. Again, Probase outperforms all the others. Note that, although Freebase presents comparable taxonomy coverage with Probase in Figure 2, its concept coverage is much smaller.
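The two coverage measures defined above differ only in the set of terms a query is matched against: concepts plus instances for taxonomy coverage, concepts alone for concept coverage. A minimal sketch, assuming simple word n-gram matching (the helper names and the matching rule are hypothetical):

```python
def contains_term(query, terms, max_len=4):
    """True if any contiguous word n-gram of the query is in `terms`."""
    words = query.lower().split()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            if " ".join(words[i:i + n]) in terms:
                return True
    return False


def coverage(queries, terms):
    """Return (number of covered queries, fraction of covered queries)."""
    covered = sum(1 for q in queries if contains_term(q, terms))
    fraction = covered / len(queries) if queries else 0.0
    return covered, fraction


# Toy data, not taken from Probase or the query log.
concepts = {"city", "company"}
instances = {"seattle", "microsoft"}
queries = ["hotels in seattle", "largest city in europe", "how to tie a tie"]

taxonomy_cov = coverage(queries, concepts | instances)  # concepts or instances
concept_cov = coverage(queries, concepts)               # concepts only
print(taxonomy_cov, concept_cov)
```

A taxonomy rich in instances but poor in concepts would score well on the first measure and poorly on the second, which is the pattern reported for Freebase.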
There are two kinds of isA relationships in Probase: the concept-subconcept relationship, which corresponds to the edges connecting internal nodes in the hierarchy, and the concept-instance relationship, which corresponds to the edges connecting to a leaf node.
Table 2 compares the concept-subconcept relationship space of Probase with that of the other taxonomies. The level of a concept is defined to be one plus the length of the longest path from it to a leaf concept (i.e., a concept without any subconcepts/children). All leaf concepts thus receive a level of 1. Table 2 shows that even with an order of magnitude more concepts, Probase still has a hierarchical complexity comparable to that of the other taxonomies. The exception is Freebase, which exhibits trivial values on these metrics because it has no isA relationships among its concepts at all.
|# of isA pairs||Avg # of children||Avg # of parents||Avg level|
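The level definition used for Table 2 (one plus the length of the longest path to a leaf concept) can be computed recursively. The sketch below assumes the concept-subconcept graph is acyclic and is given as a hypothetical `children` mapping; it is an illustration, not the code used to produce the table.

```python
from functools import lru_cache


def concept_levels(children):
    """Map each concept to 1 + the longest path length to a leaf concept.

    `children` maps a concept to its direct subconcepts. The graph is
    assumed to be acyclic; a cycle would cause infinite recursion.
    """
    @lru_cache(maxsize=None)
    def level(concept):
        kids = children.get(concept, ())
        # A leaf concept has no subconcepts, so max() falls back to 0.
        return 1 + max((level(k) for k in kids), default=0)

    nodes = set(children) | {k for kids in children.values() for k in kids}
    return {c: level(c) for c in nodes}


# Toy hierarchy, not taken from Probase.
children = {
    "organization": ["company", "university"],
    "company": ["tech company"],
}
levels = concept_levels(children)
avg_level = sum(levels.values()) / len(levels)
print(levels, avg_level)
```

For very deep hierarchies an iterative traversal in reverse topological order would avoid Python's recursion limit.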
We also compare Probase and Freebase on the concept-instance relationships. We choose Freebase since it is the only existing taxonomy with a comparable scale in instance space (24,483,434 concept-instance pairs, see Table 1). We define concept size to be the number of instances directly under a concept node. Figure 4 (logarithmic scale on the Y-axis) compares the distributions of concept sizes in Probase and Freebase. While Freebase focuses on a few very popular concepts like track and book, which include over two million instances, Probase has many more medium-sized and small concepts. In fact, the top 10 concepts in Freebase contain 17,174,891 concept-instance pairs, or 70% of all the pairs it has. In contrast, the top 10 concepts in Probase contain only 727,136 pairs, or 4.5% of its total. Therefore, Probase provides much broader coverage of diverse topics, while Freebase is more informative on specific topics. On the other hand, the instances of large concepts in Freebase like book are mostly from specific websites like Amazon, and could easily be imported into Probase.
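The two percentages quoted above follow directly from the reported pair counts. The short check below recomputes them and also shows a hypothetical helper, `top_k_share`, for computing the same share from a list of concept sizes.

```python
def top_k_share(sizes, k=10):
    """Fraction of all concept-instance pairs held by the k largest concepts."""
    ordered = sorted(sizes, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Shares recomputed from the figures reported in the text.
freebase_share = 17_174_891 / 24_483_434
probase_share = 727_136 / 16_218_369
print(f"Freebase top 10: {freebase_share:.1%}")
print(f"Probase top 10:  {probase_share:.1%}")

# Toy size list (not real data) illustrating the helper.
toy_share = top_k_share([50, 30, 10, 5, 5], k=2)
```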
Moreover, due to the transitive nature of the isA relationship, we can assume that all instances of a subconcept conceptually also belong to its superconcept. Freebase, however, lacks such concept-subconcept information, and the pairs it contains are merely concept-instance pairs. So if we propagate all instances up through the Probase taxonomy, the number of instances in each concept becomes much larger. We take all the distinct concepts in Probase that also exist in Freebase and divide them into 7 groups by their Freebase sizes, i.e., the number of instances in each Freebase concept. Figure 5 depicts the relative sizes of Probase concepts in these 7 groups. It indicates that Probase clearly contains more instances in medium to smaller classes, but slightly fewer instances for very large and popular concepts.
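Propagating instances up the hierarchy, as described above, amounts to taking the union of each concept's direct instances with those of all its descendants. A minimal sketch, assuming an acyclic hierarchy and hypothetical `children` and `direct` mappings:

```python
def propagate_instances(children, direct):
    """Return each concept's direct instances plus those of all descendants.

    `children` maps a concept to its direct subconcepts and `direct` maps
    a concept to its directly attached instances. The hierarchy is assumed
    to be acyclic.
    """
    memo = {}

    def collect(concept):
        if concept not in memo:
            gathered = set(direct.get(concept, ()))
            for kid in children.get(concept, ()):
                gathered |= collect(kid)
            memo[concept] = gathered
        return memo[concept]

    nodes = set(children) | set(direct)
    nodes |= {k for kids in children.values() for k in kids}
    return {c: collect(c) for c in nodes}


# Toy data, not taken from Probase.
children = {"company": ["tech company"]}
direct = {
    "company": {"Walmart"},
    "tech company": {"Microsoft", "Google"},
}
propagated = propagate_instances(children, direct)
print({c: len(v) for c, v in propagated.items()})
```

Using sets means an instance reachable through several subconcepts is counted once per concept.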
To estimate the correctness of the isA pairs within Probase, we create a benchmark dataset containing 40 concepts in various domains. The concept size varies from 21 instances (for aircraft model) to 85,391 (for company), with a median of 917. Benchmarks with a similar number of concepts and similar domain coverage have also been reported in previous information extraction research. For each concept, we randomly pick up to 50 instances/subconcepts and ask human judges to evaluate their correctness, and hence also the precision of the extraction algorithm. Figure 6 shows the result. The average precision of all pairs in the benchmark is 92.8%, which outperforms the precision reported for other prominent information extraction frameworks like KnowItAll (64% on average), NELL (74%) and TextRunner (80% on average). It is not fair to directly compare our results with Wikipedia-based frameworks like WikiTaxonomy (86% precision) and YAGO (95% precision), whose data sources are much cleaner. Nevertheless, only YAGO has a better overall precision than Probase.
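The sampling and scoring procedure above can be sketched as follows. The function names, the fixed random seed, and the synthetic judge labels are assumptions made for illustration; the real labels come from human judges.

```python
import random


def sample_for_judging(pairs_by_concept, k=50, seed=0):
    """Pick up to k items uniformly at random from each concept."""
    rng = random.Random(seed)
    return {
        concept: rng.sample(sorted(items), min(k, len(items)))
        for concept, items in pairs_by_concept.items()
    }


def precision(labels):
    """Fraction of judged pairs labeled correct (True)."""
    labels = list(labels)
    return sum(labels) / len(labels)


# Toy benchmark with two concepts (item names are placeholders).
pairs_by_concept = {
    "aircraft model": {f"model {i}" for i in range(21)},
    "company": {f"company {i}" for i in range(200)},
}
samples = sample_for_judging(pairs_by_concept)

# Synthetic labels standing in for human judgments.
judged = {c: [i % 10 != 0 for i in range(len(s))] for c, s in samples.items()}

per_concept = {c: precision(labels) for c, labels in judged.items()}
overall = precision(label for labels in judged.values() for label in labels)
print(per_concept, overall)
```

Here `overall` pools all judged pairs, matching the "all pairs in the benchmark" wording, while `per_concept` gives the per-concept breakdown of the kind shown in Figure 6.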
As a detailed case study, since KnowItAll also uses Hearst patterns to extract isA relationships, we compare our precision with that of KnowItAll on actor, city and film, three concepts that are common to both systems. Table 3 shows that Probase has a notable advantage in isA extraction precision over KnowItAll.
We experimented with the simple model in Equation (1) for computing the plausibility of a claim, using the benchmark concepts. We expect the plausibility to be approximately equal to the actual percentage of true claims as the amount of evidence grows. This is verified in Figure 7. The average plausibility matches the actual percentage of true claims (checked by human judges) quite well, except when there is only one piece of evidence. Figure 7 has an uneven scale on the x-axis because the frequency distribution of claims in Probase has a long tail.
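The comparison described above groups claims by their evidence count and compares the mean plausibility with the fraction of claims judged true. The sketch below takes plausibility values as given inputs and does not reproduce Equation (1); the function name and toy values are hypothetical.

```python
from collections import defaultdict


def calibration_by_evidence(claims):
    """Group claims by evidence count.

    `claims` is an iterable of (evidence_count, plausibility, is_true).
    Returns {evidence_count: (mean plausibility, fraction judged true)}.
    """
    buckets = defaultdict(list)
    for count, plausibility, is_true in claims:
        buckets[count].append((plausibility, is_true))
    result = {}
    for count in sorted(buckets):
        rows = buckets[count]
        mean_plausibility = sum(p for p, _ in rows) / len(rows)
        true_fraction = sum(1 for _, t in rows if t) / len(rows)
        result[count] = (mean_plausibility, true_fraction)
    return result


# Toy values, not measurements from Probase.
claims = [
    (1, 0.9, False), (1, 0.9, True),
    (3, 0.75, True), (3, 0.75, True), (3, 0.75, False), (3, 0.75, True),
]
calibration = calibration_by_evidence(claims)
for count, (mean_p, frac_true) in calibration.items():
    print(count, round(mean_p, 2), round(frac_true, 2))
```

A well-calibrated model yields pairs whose two values are close. In practice the evidence counts would likely be binned, since few claims have very high counts.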
Top 50 instances for selected concepts in the benchmark, ordered by decreasing typicality.
We further conduct a user study for this relatively subjective measure. First, we pick 10 concepts, and the top 50 instances for each concept, according to their typicality. Then, we invite 4 users to manually score the typicality of the instances (with order shuffled) in their respective concepts, as 3 (very representative), 2 (correct but not very representative), 1 (unknown), and 0 (incorrect).
We divide the 50 instances of each concept into 5 groups by their typicality ranks (i.e. top 10 instances from each concept go to Group 1, second 10 instances from each concept go to Group 2, and so on), and then compute the average judge scores assigned to instances within each group. Figure 8 shows that the typicality of the instances in their concepts, as perceived by human judges, decreases with computed typicality, which means our definition of the typicality is sensible.
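The grouping described above can be sketched as follows: instances are assumed to be listed in decreasing order of computed typicality, each with the scores given by the judges. The function name and the toy scores are hypothetical, and the example uses a group size of 2 instead of 10 to stay short.

```python
def average_score_by_rank_group(ranked_scores, group_size=10):
    """Average judge score per rank group, pooled over all concepts.

    `ranked_scores` maps a concept to a list of per-instance judge-score
    lists, ordered by decreasing computed typicality. Group 1 holds the
    first `group_size` instances of every concept, Group 2 the next, etc.
    """
    totals, counts = {}, {}
    for instances in ranked_scores.values():
        for rank, judge_scores in enumerate(instances):
            group = rank // group_size + 1
            totals[group] = totals.get(group, 0) + sum(judge_scores)
            counts[group] = counts.get(group, 0) + len(judge_scores)
    return {g: totals[g] / counts[g] for g in sorted(totals)}


# Toy scores from two judges on a 0-3 scale (not the study's data).
ranked_scores = {
    "city": [[3, 3], [3, 2], [2, 1], [0, 1]],
    "company": [[3, 3], [2, 3], [2, 2], [1, 0]],
}
group_means = average_score_by_rank_group(ranked_scores, group_size=2)
print(group_means)
```

A decreasing sequence of group means is the pattern Figure 8 reports.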
Please see the following link for applications of the typicality:
Application of Probase