zearch is a regular expression engine that takes a regular expression and a grammar-compressed text file and returns every line of the uncompressed text containing a match.
zearch implements the algorithm presented at the Data Compression Conference (DCC) in 2019. A link to the article is given here.
- no match across lines
- no support for invert match option
- only regular languages (like RE2) — e.g. no backreferences
- no zero-width characters
- no named character classes (e.g. alnum, digit, lower, …)
- no UTF-8 support, only ASCII characters
- no highlighted output, only matching lines are reported in full
The benchmark/install_tools.sh script installs all dependencies and builds zearch in one step. It installs:
- Build tools:
gcc,make libaugeas-dev— NFA library required to compile zearch- Compression tools:
zstd,lz4,gzip,ncompress - Search baselines:
ripgrep,grep,libhyperscan-dev - Data tools:
wget,unzip,m4,python3,pip,requests - Re-Pair compressor (built from source at
benchmark/repair110811/) graphs/hyperscanbinary (built from source)zearchbinary (built from the project root)
cd benchmark
bash install_tools.shAfter the script completes, zearch is ready to use from the project root.
Install the tools and libraries required to build augeas:
sudo apt-get install bison flex libreadline-dev libxml2-devClone the augeas project repository:
git clone git@github.com:hercules-team/augeas.gitFrom the augeas root folder run:
./autogen.sh
make
sudo make installUpdate the library path so that zearch can find libfa:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/libRe-Pair is the compressor used to produce .rp input files for zearch. It must be built from source:
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/re-pair/repair110811.tar.gz
tar xzf repair110811.tar.gz -C benchmark/
make -C benchmark/repair110811From the root folder of this repository run make zearch stats debug, which generates three executables:
zearch— the main programstats— works like zearch but prints memory usage statistics (JSON on stdout)debug— prints verbose debugging information to stderr
These three tools are invoked the same way:
./{zearch,debug,stats} [-m] <option> <input_regex> <input_file>
where:
-mis optional. When present, the NFA is determinized and minimized after construction. This does not always improve performance.optioncan be-c,-l,-a, or-bto print the number of matching lines, the matching lines, both, or simply whether at least one match exists.input_regexis a regular expression following the format accepted by libfa.input_fileis a Re-Pair-compressed file (.rp).
Example:
# Compress a file
benchmark/repair110811/repair myfile.txt # produces myfile.txt.rp
# Count matching lines
./zearch -c "hello.*world" myfile.txt.rp
# Print matching lines
./zearch -l "error|warning" logfile.txt.rp
# Boolean existence check
./zearch -b "^admin" users.txt.rpWe have compared the performance of zearch with the state-of-the-art approach for regular expression matching on compressed text. The comparison measures only the running time to count matching lines. The state-of-the-art baseline is:
{lz4,zstd} -dc compressed_file | {grep,rg,hyperscan} -c regex
The experiments show that zearch outperforms this baseline even when decompression and search are done in parallel. Pre-computed results are available here.
To reproduce the experiments, follow the steps below.
Run the install script from the benchmark/ directory (see Quick Start above).
The benchmark/generate_files.sh script downloads real-world datasets (NASA web logs, OpenSubtitles, Project Gutenberg books, Google N-grams, CSV data) and produces compressed test files (.rp, .zst, .lz4, .gz, .Z) at sizes from 1 KB to 500 MB.
Requirements: approximately 12 GB of free disk space.
cd benchmark
bash generate_files.shThe script will ask for confirmation before downloading. Pass --quick to skip files larger than 100 KB (useful for testing):
bash generate_files.sh --quickThe script creates sub-directories benchmark/logs/, benchmark/subs/, benchmark/gutenberg/, and optionally benchmark/json/, benchmark/csv/, benchmark/yes/, each containing the original text files and all compressed variants.
The graphs/generate_graphs.sh script runs all benchmarks across datasets and sizes, writes timing results to JSON files in graphs/, and produces an index.html with embedded D3.js charts.
cd graphs
bash generate_graphs.sh # all file types, all sizes
bash generate_graphs.sh --small # sizes up to 1MB only
bash generate_graphs.sh --types logs,subs # specific file types only
bash generate_graphs.sh --small --types csv # combine flagsAvailable --types values: subs, gutenberg, csv, logs, qwerty, all (default).
Alternatively, graphs/generate_table.sh runs a smaller fixed benchmark (files up to 100 KB) that also records compression ratios and decompression speeds:
cd graphs
bash generate_table.shBoth scripts expect:
../zearchbinary at the project root../benchmark/repair110811/repairanddespairbinaries../benchmark/{logs,subs,gutenberg,...}/original*.txtfiles generated bygenerate_files.shgraphs/hyperscanbinary (built byinstall_tools.sh)
Once complete, open graphs/index.html in a browser to view the results.