How to download the library
The library can be downloaded from its git repository with the following command:
[prettify]git clone git://github.com/maropu/integer_encoding_library.git[/prettify]
We provide several datasets for testing the library. Each dataset consists of a collection of posting lists obtained by indexing collections of web pages. Lists with fewer than 33 elements or more than 200,000,000 elements were excluded from the datasets.
Each dataset is compressed using Delta coding. After untarring the archive, the collection of lists can be decompressed with our decoder (see the tutorial for further instructions). The decompressed file is the concatenation of the lists. Each list is a sequence of integers sorted in increasing order, with each element stored in binary as a 32-bit value. Each list is preceded by its length, also written in binary as a 32-bit value. The code in encoders.cpp shows how to process a (decompressed) dataset.
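The layout of a decompressed dataset described above can be read with a few lines of C++. The following is only a minimal sketch and is not taken from encoders.cpp: the names read_posting_list, kMaxListLength and round_trip_example are hypothetical, and the code assumes that the 32-bit values are stored in the byte order of the machine reading them, which this page does not specify.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <vector>

// Upper bound taken from the dataset description (lists longer than
// 200,000,000 elements were excluded); used here only as a sanity check.
constexpr uint32_t kMaxListLength = 200000000u;

// Reads one posting list from `in`: a 32-bit length followed by that many
// 32-bit integers. Returns false at end of input, on a truncated list, or
// when the length exceeds kMaxListLength.
// Assumption: values are stored in the host byte order.
bool read_posting_list(std::istream& in, std::vector<uint32_t>& list) {
  list.clear();
  uint32_t length = 0;
  if (!in.read(reinterpret_cast<char*>(&length), sizeof(length))) {
    return false;  // no more lists
  }
  if (length > kMaxListLength) {
    return false;  // implausible length, probably not a dataset file
  }
  if (length == 0) {
    return true;
  }
  list.resize(length);
  const auto bytes = static_cast<std::streamsize>(length * sizeof(uint32_t));
  if (!in.read(reinterpret_cast<char*>(list.data()), bytes)) {
    list.clear();
    return false;  // file ended in the middle of a list
  }
  return true;
}

// Builds two tiny lists in memory, laid out as described above, and reads
// them back. Real lists have at least 33 elements; these are only toys.
void round_trip_example() {
  const uint32_t raw[] = {3, 1, 5, 9,   // length 3, then the elements
                          2, 4, 8};     // length 2, then the elements
  std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
  buffer.write(reinterpret_cast<const char*>(raw), sizeof(raw));

  std::vector<uint32_t> list;
  bool ok = read_posting_list(buffer, list);
  assert(ok && list == std::vector<uint32_t>({1, 5, 9}));
  ok = read_posting_list(buffer, list);
  assert(ok && list == std::vector<uint32_t>({4, 8}));
  ok = read_posting_list(buffer, list);
  assert(!ok);  // end of input
  (void)ok;
}
```

To walk through a real dataset, one would open the decompressed file as a std::ifstream in binary mode and call read_posting_list in a loop until it returns false.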
The following are the currently available datasets.
- Gov2.sort consists of posting lists for an index built on .gov2, one of the document collections available in TREC’s Terabyte Track. Document IDs are obtained by sorting the URLs of the documents lexicographically.
[number of integers: 5.3 billion, compressed size: 2.5 Gbytes]
- Yandex.int is a dataset that has been kindly provided by Yandex.
[number of integers: 1.3 billion, compressed size: 0.7 Gbytes]
We plan to provide other datasets soon. If you have other datasets worth publishing here, we encourage you to contact us.