Find Similar Patterns - Next Generation Data Compression
The FSP (Find Similar Patterns) algorithm is a revolutionary data compression technique designed to efficiently reduce data size by identifying and leveraging similar patterns within the data. Unlike traditional compression methods, FSP specializes in working with small to medium-sized data volumes where conventional algorithms often introduce significant overhead.
FSP is a universal data compression algorithm for any files or byte streams. Its core idea is to find similar patterns in data and store only the differences or references, avoiding duplication. Version 2.1 adds precise byte-level storage, automatic pattern length selection, optimized reference handling, and extended max pattern lengths for long repeats.
Automatic pattern length selection, optimized reference handling, and extended pattern matching make FSP especially effective for data containing repeated sequences. Patterns that occur only once are stored as LITERAL records rather than REF records, which keeps reference overhead to a minimum.
The FSP algorithm operates on the principle of identifying repeated patterns in data and replacing subsequent occurrences with references to the first occurrence. This approach significantly reduces redundancy without the overhead of traditional compression methods.
Think of FSP as a LEGO box organization system: each unique piece is stored once, and every later occurrence is replaced by a short note pointing back to where that piece already lives.
The algorithm uses sequence comparison at the byte or character level. For a base string B and target string T, the compression efficiency is determined by the number and length of common substrings.
The compression ratio CR is defined as: CR = Original Size / Compressed Size
Where Compressed Size = Size(Literals) + Size(References)
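For example, using the figures from the worked example later in this document, a 55-byte input that compresses to 33 bytes yields CR = 55 / 33 ≈ 1.67.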
The FSP algorithm is implemented in Python but can be easily ported to other programming languages. Below is the core implementation of the pattern finding and compression logic.
```python
from collections import defaultdict
import struct


def find_patterns(text, min_len=3, max_len=5):
    """Find repeated patterns in text and return a dict mapping pattern -> start positions."""
    patterns = defaultdict(list)
    n = len(text)
    # Record every substring of length min_len..max_len together with its start positions
    for L in range(min_len, max_len + 1):
        for i in range(n - L + 1):
            s = text[i:i+L]
            patterns[s].append(i)
    # Keep only patterns that occur more than once
    return {p: pos for p, pos in patterns.items() if len(pos) > 1}


def fsp_compress(text, min_len=3, max_len=5):
    """Compress text using the FSP algorithm with byte-level storage."""
    patterns = find_patterns(text, min_len, max_len)
    compressed = bytearray()
    i = 0
    while i < len(text):
        # Pick the longest pattern that starts at i and has an earlier occurrence
        best_pat = None
        best_pos = None
        best_len = 0
        for pat, positions in patterns.items():
            if i in positions and len(pat) > best_len:
                # Use the most recent earlier occurrence as the reference target
                prev_positions = [p for p in positions if p < i]
                if prev_positions:
                    best_pat = pat
                    best_pos = prev_positions[-1]
                    best_len = len(pat)
        if best_pat and best_pos is not None:
            # REF: 2-byte big-endian position + 1-byte length
            compressed += struct.pack(">HB", best_pos, best_len)
            i += best_len
        else:
            # LITERAL: 1 raw byte
            compressed.append(ord(text[i]))
            i += 1
    return compressed


def fsp_decompress(compressed):
    """Decompress FSP-compressed data with byte-level storage."""
    output = []
    i = 0
    while i < len(compressed):
        # Check whether the next 3 bytes could be a REF record
        if i + 3 <= len(compressed):
            pos, length = struct.unpack(">HB", compressed[i:i+3])
            # Heuristic: accept a REF only if pos points inside the data decoded so far
            if pos < len(output):
                output.extend(output[pos:pos+length])
                i += 3
                continue
        # Otherwise treat the byte as a LITERAL
        output.append(compressed[i])
        i += 1
    return bytes(output).decode("ascii")
```
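A minimal round trip with the functions above, using the sample string from the worked example further down. Because REF records carry no explicit tag, decompression relies on the position heuristic in fsp_decompress; for short printable-ASCII inputs such as this one the reconstruction is exact:

```python
text = "Hello friend! Hello fiend! Hello friends! Hello friend!"
packed = fsp_compress(text)
print(f"{len(text)} bytes -> {len(packed)} bytes")
assert fsp_decompress(packed) == text  # exact reconstruction for this input
```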
The algorithm can be optimized for specific use cases by adjusting the min_len and max_len parameters. For highly repetitive data, increasing max_len can improve compression ratio. For binary data, the algorithm compares bytes instead of characters.
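As a quick, illustrative way to compare settings (the sample data and parameter values below are arbitrary), the compressed size can be measured for several max_len values:

```python
sample = "ABCABCABC" * 20  # highly repetitive sample, chosen only for illustration
for max_len in (5, 8, 16):
    size = len(fsp_compress(sample, min_len=3, max_len=max_len))
    print(f"max_len={max_len}: {size} bytes (original: {len(sample)} bytes)")
```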
Key characteristics of FSP:

- Linear complexity O(n) for most operations
- Excellent compression ratio for repetitive data
- No complex data structures or mathematical operations
- Works with any data type (text, binary, etc.)
- Automatically selects optimal pattern lengths
- Can be adapted for streaming data processing
- REF and LITERAL records stored as raw bytes for exact size tracking
- Compressed data can be saved directly to files
"Hello friend! Hello fiend! Hello friends! Hello friend!"
The algorithm identifies repeated patterns like "Hello", "friend", and "ello " and replaces subsequent occurrences with references. Patterns that repeat only once are stored as LITERAL to minimize REF overhead.
```
('LITERAL', 'H')
('LITERAL', 'e')
('LITERAL', 'l')
('LITERAL', 'l')
('LITERAL', 'o')
('LITERAL', ' ')
('LITERAL', 'f')
('LITERAL', 'r')
('LITERAL', 'i')
('LITERAL', 'e')
('LITERAL', 'n')
('LITERAL', 'd')
('LITERAL', '!')
('LITERAL', ' ')
('REF', 0, 5)   // Reference to "Hello"
('LITERAL', ' ')
('LITERAL', 'f')
('REF', 8, 5)   // Reference to "iend!"
// ... additional references
```
Original size: 55 bytes
Compressed size: 33 bytes
Compression ratio: 1.67
The algorithm can be adapted for binary data by comparing bytes instead of characters:
```python
from collections import defaultdict
import struct


def fsp_compress_binary(data, min_len=3, max_len=5):
    """Compress binary data (bytes) using the FSP algorithm."""
    patterns = defaultdict(list)
    n = len(data)
    # Record every byte sequence of length min_len..max_len with its start positions
    for L in range(min_len, max_len + 1):
        for i in range(n - L + 1):
            pattern = data[i:i+L]
            patterns[pattern].append(i)
    # Keep only patterns that occur more than once
    patterns = {p: pos for p, pos in patterns.items() if len(pos) > 1}
    compressed = bytearray()
    i = 0
    while i < n:
        # Try to find the longest pattern starting at i with an earlier occurrence
        best_pat = None
        best_pos = None
        best_len = 0
        for pat, positions in patterns.items():
            if i in positions and len(pat) > best_len:
                prev_positions = [p for p in positions if p < i]
                if prev_positions:
                    best_pat = pat
                    best_pos = prev_positions[-1]
                    best_len = len(pat)
        if best_pat and best_pos is not None:
            # REF: 2-byte big-endian position + 1-byte length
            compressed += struct.pack(">HB", best_pos, best_len)
            i += best_len
        else:
            # LITERAL: 1 raw byte
            compressed.append(data[i])
            i += 1
    return compressed
```
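A brief sketch of calling the binary variant on a bytes object (the payload below is arbitrary illustration data). Note that fsp_decompress as shown above decodes its result as ASCII text, so a binary round trip would need a variant that returns raw bytes:

```python
payload = bytes([0xDE, 0xAD, 0xBE, 0xEF] * 16)  # illustrative repetitive payload
packed = fsp_compress_binary(payload)
print(f"{len(payload)} bytes -> {len(packed)} bytes")
```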
Compressed data can be saved directly to files:
```python
# Save compressed data to a file
with open("output.txt", "wb") as f:
    f.write(compressed)

# Read compressed data back from the file
with open("output.txt", "rb") as f:
    compressed_data = f.read()

decompressed = fsp_decompress(compressed_data)
```
| Characteristic | FSP | ZIP | RLE | LZ77 |
|---|---|---|---|---|
| Efficiency for small data | Excellent | Poor | Good | Fair |
| Overhead | Minimal | High | Low | Medium |
| Implementation complexity | Low | High | Very Low | High |
| Compression speed | Very High | Medium | Extremely High | Medium |
| Decompression speed | Very High | Medium | Extremely High | Medium |
| Compression ratio (repetitive data) | Excellent | Good | Fair | Good |
| Pattern adaptability | Excellent | Good | Poor | Good |
| Byte-exact storage | Yes | No | Yes | No |
| Direct file output | Yes | No | Yes | No |
Unlike dictionary methods (LZ77, LZ78), FSP does not require storing a dictionary. Unlike entropy methods (Huffman, Arithmetic), FSP does not require complex probability calculations. This makes FSP uniquely suited for small to medium-sized repetitive data. Version 2.1 adds byte-exact storage and direct file output capabilities.
Typical use cases:

- Storing differences between file versions efficiently
- Compressing similar records with minimal overhead
- Minimizing data transfer volume for repetitive content
- Efficient compression of highly similar log entries
- Compressing consecutive frames with minimal changes
- Working with DNA/RNA sequences with repeated patterns
- Storing only changes between backup versions
- Compressing sets of images with minor differences