The RiST/ViST Algorithm for Querying DBLP Data
This distribution contains only the querying program of the RiST/ViST algorithm. It comes with the index files for the DBLP data, which contains more than 400 K documents.
The package contains the following files:
- The executable for querying the XML database: search.exe. You may also need the Cygwin runtime library, since the program was compiled under the Cygwin environment.
- The index files for the DBLP dataset: the ancestor/descendant index and the document id index.
- The preprocessed DBLP data file in sequence form. (The query algorithm does not use this file.)
Usage:
search.exe dblp.idx0 dblp.idx query-sequence
The query sequence is a sequence of pairs, for instance “0,0 10,2 39,2 -4005,3”.
The algorithm does not accept queries in XML form (although a conversion can be implemented very easily). The nodes and the paths of the DBLP data are encoded by the index algorithm. The encodings can be found here.
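Since the query is passed as a single quoted string of comma-separated pairs, a small helper can make it easier to build or check queries before running the program. The following is a minimal sketch in Python; the function names are hypothetical and not part of this distribution, and the sketch makes no assumption about what the two numbers in each pair mean.

```python
def parse_query_sequence(text):
    """Split a query string such as "0,0 10,2" into a list of integer pairs.

    Hypothetical helper: it only checks the textual shape "a,b a,b ..."
    and does not interpret the encoded values.
    """
    pairs = []
    for token in text.split():
        first, second = token.split(",")
        pairs.append((int(first), int(second)))
    return pairs


def format_query_sequence(pairs):
    """Turn a list of integer pairs back into the command-line string."""
    return " ".join(f"{a},{b}" for a, b in pairs)


if __name__ == "__main__":
    query = parse_query_sequence("0,0 10,2 39,2 -4005,3")
    print(query)
    print(format_query_sequence(query))
```

The formatted string can then be passed as the third argument to the search program, quoted so that the shell treats it as one argument.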
Examples:
[~/projects/sequence]
$ ./search dblp.idx0 dblp.idx "0,0 8,2"
[0,2] 3 [4,145490]
Elapsed time (read index head): 0 sec 58000 usec
Elapsed time (searching): 0 sec 1000 usec
Elapsed time ( total ): 0 sec 59000 usec
The answer “[0,2] 3 [4,145490]” means that documents 0 to 2, document 3, and documents 4 to 145490 satisfy the query. (Document i is the i-th document in the DBLP data file.)
The elapsed time shown in the output does not include the time spent displaying the results.
[~/projects/sequence]
$ ./search dblp.idx0 dblp.idx "0,0 8,2 11,2 11,2 11,2 -29,3"
8 39 1347 1970 [6899,6900] [10381,10382] 13750 [17715,17719] 26395 [26575,26579] 37834
Elapsed time (read index head): 0 sec 59000 usec
Elapsed time (searching): 0 sec 1000 usec
Elapsed time ( total ): 0 sec 60000 usec
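The answer line in the examples above mixes single document ids with inclusive ranges written as [first,last]. As an illustration of how such a line could be post-processed, here is a minimal Python sketch that expands it into (first, last) ranges and counts the matching documents. The function names are hypothetical and not part of this distribution, and the sketch assumes it is given only the answer line, not the "Elapsed time" lines.

```python
import re

# Matches either an inclusive range "[first,last]" or a single document id.
_TOKEN = re.compile(r"\[(\d+),(\d+)\]|(\d+)")


def parse_answer(text):
    """Return the answer as a list of inclusive (first, last) ranges.

    A single id such as "3" is returned as the range (3, 3).
    """
    ranges = []
    for match in _TOKEN.finditer(text):
        if match.group(3) is not None:
            doc = int(match.group(3))
            ranges.append((doc, doc))
        else:
            ranges.append((int(match.group(1)), int(match.group(2))))
    return ranges


def count_documents(ranges):
    """Count the documents covered by a list of inclusive ranges."""
    return sum(last - first + 1 for first, last in ranges)


if __name__ == "__main__":
    answer = parse_answer("[0,2] 3 [4,145490]")
    print(answer)
    print(count_documents(answer))
```

For the first example, the three entries cover documents 0 through 145490, that is, 145491 documents in total.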
Note:
The program is for demonstration purposes and is not fully optimized for performance comparison. The querying algorithm performs sequence matching only; it does not handle duplicates, false alarms, etc.
Haixun Wang, haixun@us.ibm.com