Regular Expression Search on Compressed Text

zearch is a regular expression engine that takes a regular expression and a grammar-compressed text file and returns every line of the uncompressed text containing a match.

zearch implements the algorithm presented at the Data Compression Conference (DCC) in 2019. A link to the article is given here.

Limitations

no match across lines
no support for invert match option
only regular languages (like RE2) — e.g. no backreferences
no zero-width characters
no named character classes (e.g. alnum, digit, lower, …)
no UTF-8 support, only ASCII characters
no highlighted output, only matching lines are reported in full

Compiling

Quick Start: Install All Tools

The benchmark/install_tools.sh script installs all dependencies and builds zearch in one step. It installs:

Build tools: gcc, make
libaugeas-dev — NFA library required to compile zearch
Compression tools: zstd, lz4, gzip, ncompress
Search baselines: ripgrep, grep, libhyperscan-dev
Data tools: wget, unzip, m4, python3, pip, requests
Re-Pair compressor (built from source at benchmark/repair110811/)
graphs/hyperscan binary (built from source)
zearch binary (built from the project root)

cd benchmark
bash install_tools.sh

After the script completes, zearch is ready to use from the project root.

Manual Installation

Installing libfa

Install the tools and libraries required to build augeas:

sudo apt-get install bison flex libreadline-dev libxml2-dev

Clone the augeas project repository:

git clone git@github.com:hercules-team/augeas.git

From the augeas root folder run:

./autogen.sh
make
sudo make install

Update the library path so that zearch can find libfa:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Installing Re-Pair

Re-Pair is the compressor used to produce .rp input files for zearch. It must be built from source:

wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/re-pair/repair110811.tar.gz
tar xzf repair110811.tar.gz -C benchmark/
make -C benchmark/repair110811

Compiling Zearch

From the root folder of this repository run make zearch stats debug, which generates three executables:

zearch — the main program
stats — works like zearch but prints memory usage statistics (JSON on stdout)
debug — prints verbose debugging information to stderr

These three tools are invoked the same way:

./{zearch,debug,stats} [-m] <option> <input_regex> <input_file>

where:

-m is optional. When present, the NFA is determinized and minimized after construction. This does not always improve performance.
option can be -c, -l, -a, or -b to print the number of matching lines, the matching lines, both, or simply whether at least one match exists.
input_regex is a regular expression following the format accepted by libfa.
input_file is a Re-Pair-compressed file (.rp).

Example:

# Compress a file
benchmark/repair110811/repair myfile.txt    # produces myfile.txt.rp

# Count matching lines
./zearch -c "hello.*world" myfile.txt.rp

# Print matching lines
./zearch -l "error|warning" logfile.txt.rp

# Boolean existence check
./zearch -b "^admin" users.txt.rp

Comparison with Other Tools

We have compared the performance of zearch with the state-of-the-art approach for regular expression matching on compressed text. The comparison measures only the running time to count matching lines. The state-of-the-art baseline is:

{lz4,zstd} -dc compressed_file | {grep,rg,hyperscan} -c regex

The experiments show that zearch outperforms this baseline even when decompression and search are done in parallel. Pre-computed results are available here.

To reproduce the experiments, follow the steps below.

Step 1 — Install All Tools

Run the install script from the benchmark/ directory (see Quick Start above).

Step 2 — Prepare Benchmark Files

The benchmark/generate_files.sh script downloads real-world datasets (NASA web logs, OpenSubtitles, Project Gutenberg books, Google N-grams, CSV data) and produces compressed test files (.rp, .zst, .lz4, .gz, .Z) at sizes from 1 KB to 500 MB.

Requirements: approximately 12 GB of free disk space.

cd benchmark
bash generate_files.sh

The script will ask for confirmation before downloading. Pass --quick to skip files larger than 100 KB (useful for testing):

bash generate_files.sh --quick

The script creates sub-directories benchmark/logs/, benchmark/subs/, benchmark/gutenberg/, and optionally benchmark/json/, benchmark/csv/, benchmark/yes/, each containing the original text files and all compressed variants.

Step 3 — Run the Benchmarks and Generate Graphs

The graphs/generate_graphs.sh script runs all benchmarks across datasets and sizes, writes timing results to JSON files in graphs/, and produces an index.html with embedded D3.js charts.

cd graphs
bash generate_graphs.sh                       # all file types, all sizes
bash generate_graphs.sh --small               # sizes up to 1MB only
bash generate_graphs.sh --types logs,subs     # specific file types only
bash generate_graphs.sh --small --types csv   # combine flags

Available --types values: subs, gutenberg, csv, logs, qwerty, all (default).

Alternatively, graphs/generate_table.sh runs a smaller fixed benchmark (files up to 100 KB) that also records compression ratios and decompression speeds:

cd graphs
bash generate_table.sh

Both scripts expect:

../zearch binary at the project root
../benchmark/repair110811/repair and despair binaries
../benchmark/{logs,subs,gutenberg,...}/original*.txt files generated by generate_files.sh
graphs/hyperscan binary (built by install_tools.sh)

Once complete, open graphs/index.html in a browser to view the results.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
.claude		.claude
benchmark		benchmark
graphs		graphs
src		src
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE.txt		LICENSE.txt
Makefile		Makefile
README.md		README.md
main.pdf		main.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Regular Expression Search on Compressed Text

Limitations

Compiling

Quick Start: Install All Tools

Manual Installation

Installing libfa

Installing Re-Pair

Compiling Zearch

Comparison with Other Tools

Step 1 — Install All Tools

Step 2 — Prepare Benchmark Files

Step 3 — Run the Benchmarks and Generate Graphs

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Regular Expression Search on Compressed Text

Limitations

Compiling

Quick Start: Install All Tools

Manual Installation

Installing libfa

Installing Re-Pair

Compiling Zearch

Comparison with Other Tools

Step 1 — Install All Tools

Step 2 — Prepare Benchmark Files

Step 3 — Run the Benchmarks and Generate Graphs

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages