TY - GEN

T1 - The SBC-Tree

T2 - 11th International Conference on Extending Database Technology, EDBT 2008

AU - Eltabakh, Mohamed Y.

AU - Hon, Wing Kai

AU - Shah, Rahul

AU - Aref, Walid G.

AU - Vitter, Jeffrey S.

N1 - Publisher Copyright:
© 2008 ACM.

PY - 2008

Y1 - 2008

N2 - Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, Prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. Substring matching, Prefix matching, and range search execute in an optimal O(Formula Presented) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. The SBC-tree is also dynamic and supports insert and delete operations effciently. The insertion and deletion of all suffixes of a compressed sequence of length m take O(mlogB(N + m)) amortized I/O operations. The SBC-tree index is realized inside PostgreSQL. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, while retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.

AB - Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, Prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. Substring matching, Prefix matching, and range search execute in an optimal O(Formula Presented) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. The SBC-tree is also dynamic and supports insert and delete operations effciently. The insertion and deletion of all suffixes of a compressed sequence of length m take O(mlogB(N + m)) amortized I/O operations. The SBC-tree index is realized inside PostgreSQL. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, while retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.

UR - http://www.scopus.com/inward/record.url?scp=43349103348&partnerID=8YFLogxK

U2 - 10.1145/1353343.1353407

DO - 10.1145/1353343.1353407

M3 - Conference contribution

AN - SCOPUS:43349103348

SN - 9781595939265

T3 - Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings

SP - 523

EP - 534

BT - Advances in Database Technology - EDBT 2008 - 11th International Conference on Extending Database Technology, Proceedings

PB - Association for Computing Machinery

Y2 - 25 March 2008 through 29 March 2008

ER -