The RiST/ViST Algorithm for

Querying DBLP Data

 

 

This distribution contains only the querying program of the RiST/ViST algorithm. It comes with the index files for the DBLP data, which contains more than 400 K documents.

 

The package contains the following files:

 

q       The executable for querying XML database: search.exe. You may also need the cygwin runtime library since it was compiled under the Cygwin environment.

q       The index files for the DBLP dataset: the ancestor/descendant index and the document id index.

q       The preprocessed DBLP data file in sequence form. (The query algorithm does not use this file)

 

 

Usage:

 

                search.exe dblp.idx0 dblp.idx query-sequence

 

                The query sequence is a sequence of pairs. For instance, “0,0 10,2 39,2 –4005,3”

 

                The algorithm does not accept queries in XML form (although an conversion can be very easily implemented). The nodes and the paths of the DBLP data are encoded by the index algorithm. The encodings can be found here.

 

Examples:

 

q       [~/projects/sequence] $ ./search dblp.idx0 dblp.idx "0,0 8,2"
[0,2] 3 [4,145490]
Elapsed time (read index head):  0 sec 58000 usec
Elapsed time (searching):  0 sec 1000 usec
Elapsed time ( total ):  0 sec 59000 usec

The answer “[0,2] 3 [4,145490]” means documents 0 to 2, document 3, and document 4 to 145490 satisfy the query. (document i is the i-th document in the DBLP data file)

 

The elapsed time shown in the output does not include the time in displaying the results.

q       [~/projects/sequence] $ ./search dblp.idx0 dblp.idx "0,0 8,2 11,2 11,2 11,2 -29,3"
8 39 1347 1970 [6899,6900] [10381,10382] 13750 [17715,17719] 26395 [26575,26579] 37834
Elapsed time (read index head):  0 sec 59000 usec
Elapsed time (searching):  0 sec 1000 usec
Elapsed time ( total ):  0 sec 60000 usec

 

Note:

                The program is for demonstration purpose, and it is not fully optimized for performance comparison. The querying algorithm performs sequence matching only, and it does not handle duplicates, false alarms, etc.

 

 

Haixun Wang, haixun@us.ibm.com