FSP Algorithm

Find Similar Patterns - Next Generation Data Compression

About the FSP Algorithm

The FSP (Find Similar Patterns) algorithm is a revolutionary data compression technique designed to efficiently reduce data size by identifying and leveraging similar patterns within the data. Unlike traditional compression methods, FSP specializes in working with small to medium-sized data volumes where conventional algorithms often introduce significant overhead.

Version 2.1 - Next Level FSP

FSP is a universal data compression algorithm for any files or byte streams. Its core idea is to find similar patterns in data and store only the differences or references, avoiding duplication. Version 2.1 adds precise byte-level storage, automatic pattern length selection, optimized reference handling, and extended max pattern lengths for long repeats.

Key Innovation

FSP introduces automatic pattern length selection, optimized reference handling, and extended pattern matching capabilities that make it exceptionally efficient for data with repeated sequences. The algorithm minimizes overhead by storing patterns that repeat only once as LITERAL instead of REF.

How the FSP Algorithm Works

The FSP algorithm operates on the principle of identifying repeated patterns in data and replacing later occurrences with references to an earlier occurrence. This approach significantly reduces redundancy without the overhead of traditional compression methods.

Compression Process

  1. Pattern Discovery - The algorithm scans the input data to identify repeated patterns of lengths between the configured minimum and maximum values (typically 3-5 characters).
  2. Automatic Pattern Selection - The algorithm automatically selects the optimal pattern length (3-5 characters) that maximizes repeated patterns while avoiding overhead from too-short sequences.
  3. Reference Creation - For each repeated pattern, the algorithm creates a reference pointing to the nearest earlier occurrence of that pattern. Patterns that repeat only once are stored as LITERAL to minimize REF overhead.
  4. Byte-Exact Storage - REF and LITERAL are stored as raw bytes for exact size tracking. A LITERAL stores 1 byte per character; a REF stores a 2-byte position and a 1-byte length (3 bytes total).
  5. Output Generation - The compressed output consists of a combination of literal data (for unique patterns) and references (for repeated patterns) and can be saved directly to a file.
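
For example, compressing the 7-character string "abcXabc" with min_len=3 yields four literals followed by one reference, shown here in the same token notation used in the detailed examples below:

('LITERAL', 'a')
('LITERAL', 'b')
('LITERAL', 'c')
('LITERAL', 'X')
('REF', 0, 3)  // Reference to "abc"

Note that a 3-byte REF replacing a 3-character pattern only breaks even, which is why the compressor always prefers the longest available match.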

Decompression Process

  1. Parsing - The decompressor reads the compressed data, parsing literal values and references.
  2. Reconstruction - For each reference, the decompressor copies the referenced bytes from the already-reconstructed output and appends them to the result.
  3. Output - The original data is perfectly reconstructed by combining literal values and resolved references.
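
Continuing the "abcXabc" example above: the decompressor copies the four literals to the output, then resolves ('REF', 0, 3) by copying three bytes starting at output position 0 ("abc"), restoring "abcXabc" exactly.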

LEGO Box Analogy

Think of FSP as a LEGO box organization system:

  1. Pick one construction as the base.
  2. Store similar constructions as base + differences only.
  3. Unique constructions remain unchanged.
  4. You save space by storing only new details instead of the entire construction.

Mathematical Foundation

The algorithm uses sequence comparison at the byte or character level. For a base string B and target string T, the compression efficiency is determined by the number and length of common substrings.

The compression ratio CR is defined as: CR = Original Size / Compressed Size

Where Compressed Size = Size(Literals) + Size(References)
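
Using the figures from Example 1 below: a 55-byte input that compresses to 33 bytes gives CR = 55 / 33 ≈ 1.67. Any CR above 1.0 means the compressed output is smaller than the original.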

Implementation Details

The FSP algorithm is implemented in Python but can be easily ported to other programming languages. Below is the core implementation of the pattern finding and compression logic.

Pattern Discovery Function

from collections import defaultdict

def find_patterns(text, min_len=3, max_len=5):
    """Find repeated patterns in text and return dictionary with positions"""
    patterns = defaultdict(list)
    n = len(text)
    
    # Search for patterns of all lengths between min_len and max_len
    for L in range(min_len, max_len + 1):
        for i in range(n - L + 1):
            s = text[i:i+L]
            patterns[s].append(i)
    
    # Keep only patterns that occur more than once
    return {p: pos for p, pos in patterns.items() if len(pos) > 1}
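
A quick sanity check of the function; the expected output follows directly from the loop above, since only "abc" occurs more than once in this string:

# "abc" appears at positions 0 and 4; every longer substring is unique
print(find_patterns("abcXabcY"))  # {'abc': [0, 4]}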

Compression Function (Byte-Level)

import struct

def fsp_compress(text, min_len=3, max_len=5):
    """Compress text using the FSP algorithm with byte-level storage"""
    patterns = find_patterns(text, min_len, max_len)
    compressed = bytearray()
    i = 0
    
    while i < len(text):
        best_pat = None
        best_pos = None
        best_len = 0
        
        for pat, positions in patterns.items():
            if i in positions and len(pat) > best_len:
                # Use the nearest previous occurrence as the reference target
                prev_positions = [p for p in positions if p < i]
                if prev_positions:
                    best_pat = pat
                    best_pos = prev_positions[-1]
                    best_len = len(pat)
        
        if best_pat and best_pos is not None:
            # REF: 2 bytes pos + 1 byte len
            compressed += struct.pack(">HB", best_pos, best_len)
            i += best_len
        else:
            # LITERAL: 1 raw byte (assumes character codes <= 255, e.g. ASCII text)
            compressed.append(ord(text[i]))
            i += 1
            
    return compressed
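
As a small illustration of the byte layout (the sizes follow from the ">HB" packing above), the second "Hello" below is replaced by a single 3-byte reference:

compressed = fsp_compress("HelloXHelloY")
# b'HelloX\x00\x00\x05Y': 6 literal bytes + REF(0, 5) packed as
# b'\x00\x00\x05' + 1 trailing literal = 10 bytes, versus 12 bytes of input
print(len(compressed))  # 10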

Decompression Function (Byte-Level)

import struct

def fsp_decompress(compressed):
    """Decompress FSP-compressed data with byte-level storage"""
    output = []
    i = 0

    while i < len(compressed):
        # Check whether the next 3 bytes could be a REF
        if i + 3 <= len(compressed):
            pos, length = struct.unpack(">HB", compressed[i:i+3])
            # Heuristic REF test: pos must point inside the output produced
            # so far. This resolves the small demos in this document, but the
            # tagless format is not fully unambiguous for arbitrary bytes.
            if pos < len(output):
                # Copy byte by byte so that references overlapping the data
                # they produce (e.g. runs like "aaaaaa") reconstruct correctly
                for j in range(length):
                    output.append(output[pos + j])
                i += 3
                continue
        # Otherwise, literal
        output.append(compressed[i])
        i += 1

    return bytes(output).decode("ascii")
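
A round trip through the two functions above restores the original text:

restored = fsp_decompress(fsp_compress("HelloXHelloY"))
assert restored == "HelloXHelloY"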

Pattern Length Guidelines

Implementation Note

The algorithm can be optimized for specific use cases by adjusting the min_len and max_len parameters. For highly repetitive data, increasing max_len can improve compression ratio. For binary data, the algorithm compares bytes instead of characters.
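
One way the automatic pattern length selection described above could be realized is to try several max_len values and keep the smallest result. The helper below is a hypothetical sketch of that idea, not part of the reference implementation:

def fsp_compress_auto(text, min_len=3, max_lens=(4, 5, 6)):
    """Hypothetical helper: compress with several maximum pattern
    lengths and keep whichever output is smallest."""
    candidates = [fsp_compress(text, min_len, m) for m in max_lens]
    return min(candidates, key=len)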

Features & Advantages

High Speed

Single-pass compression and decompression; decompression runs in O(n), and the compressor's pattern lookups can be indexed for near-linear behavior

Efficient Compression

Excellent compression ratio for repetitive data

Simple Implementation

No complex data structures or mathematical operations

Versatility

Works with any data type (text, binary, etc.)

Adaptive Pattern Length

Automatically selects optimal pattern lengths

Streaming Support

Can be adapted for streaming data processing

Byte-Exact Storage

REF and LITERAL stored as raw bytes for exact size tracking

Direct File Output

Compressed data can be saved directly to files

Detailed Examples

Example 1: Simple Text Compression

Input Text

"Hello friend! Hello fiend! Hello friends! Hello friend!"

Compression Process

The algorithm identifies repeated patterns like "Hello", "friend", and "ello " and replaces subsequent occurrences with references. Patterns that repeat only once are stored as LITERAL to minimize REF overhead.

Compressed Output

('LITERAL', 'H')
('LITERAL', 'e')
('LITERAL', 'l')
('LITERAL', 'l')
('LITERAL', 'o')
('LITERAL', ' ')
('LITERAL', 'f')
('LITERAL', 'r')
('LITERAL', 'i')
('LITERAL', 'e')
('LITERAL', 'n')
('LITERAL', 'd')
('LITERAL', '!')
('LITERAL', ' ')
('REF', 0, 5)  // Reference to "Hello"
('LITERAL', ' ')
('LITERAL', 'f')
('REF', 8, 5)  // Reference to "iend!"
// ... additional references

Compression Results

Original size: 55 bytes

Compressed size: 33 bytes

Compression ratio: 1.67

Example 2: Binary Data Adaptation

The algorithm can be adapted for binary data by comparing bytes instead of characters:

import struct
from collections import defaultdict

def fsp_compress_binary(data, min_len=3, max_len=5):
    """Compress binary data using FSP algorithm"""
    patterns = defaultdict(list)
    n = len(data)
    
    # Search for patterns of all lengths between min_len and max_len
    for L in range(min_len, max_len + 1):
        for i in range(n - L + 1):
            pattern = data[i:i+L]
            patterns[pattern].append(i)
    
    # Keep only patterns that occur more than once
    patterns = {p: pos for p, pos in patterns.items() if len(pos) > 1}
    
    compressed = bytearray()
    i = 0
    
    while i < n:
        # Try to find the longest pattern starting at i
        best_pat = None
        best_pos = None
        best_len = 0
        
        for pat, positions in patterns.items():
            if i in positions and len(pat) > best_len:
                prev_positions = [p for p in positions if p < i]
                if prev_positions:
                    best_pat = pat
                    best_pos = prev_positions[-1]
                    best_len = len(pat)
        
        if best_pat and best_pos is not None:
            # REF: position + length
            compressed += struct.pack(">HB", best_pos, best_len)
            i += best_len
        else:
            # LITERAL: 1 byte
            compressed.append(data[i])
            i += 1
            
    return compressed
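
A short check with a repeated 5-byte block; the output size follows from the same ">HB" REF layout:

data = bytes([1, 2, 3, 4, 5]) * 2
packed = fsp_compress_binary(data)
# 5 literal bytes + one 3-byte REF(0, 5) = 8 bytes, versus 10 bytes of input
print(len(packed))  # 8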

Example 3: File Output

Compressed data can be saved directly to files:

# Save compressed to file
with open("output.txt", "wb") as f:
    f.write(compressed)

# Read compressed from file
with open("output.txt", "rb") as f:
    compressed_data = f.read()
    decompressed = fsp_decompress(compressed_data)

Comparison with Other Algorithms

Characteristic                        FSP         ZIP      RLE             LZ77
Efficiency for small data             Excellent   Poor     Good            Fair
Overhead                              Minimal     High     Low             Medium
Implementation complexity             Low         High     Very Low        High
Compression speed                     Very High   Medium   Extremely High  Medium
Decompression speed                   Very High   Medium   Extremely High  Medium
Compression ratio (repetitive data)   Excellent   Good     Fair            Good
Pattern adaptability                  Excellent   Good     Poor            Good
Byte-exact storage                    Yes         No       Yes             No
Direct file output                    Yes         No       Yes             No

Key Differentiators

Unlike dictionary methods (LZ77, LZ78), FSP does not require storing a dictionary. Unlike entropy methods (Huffman, Arithmetic), FSP does not require complex probability calculations. This makes FSP uniquely suited for small to medium-sized repetitive data. Version 2.1 adds byte-exact storage and direct file output capabilities.

Applications

Ideal Use Cases

Version Control Systems

Storing differences between file versions efficiently

Databases

Compressing similar records with minimal overhead

Network Transmission

Minimizing data transfer volume for repetitive content

Log File Archiving

Efficient compression of highly similar log entries

Video Surveillance

Compressing consecutive frames with minimal changes

Bioinformatics

Working with DNA/RNA sequences with repeated patterns

Incremental Backups

Storing only changes between backup versions

Image Sets

Compressing sets of images with minor differences

Limitations

  • Less effective for completely random data with no patterns
  • Requires optimization for very large files (GB+)
  • Efficiency depends on the presence of similar patterns in the data
  • Not yet optimized for real-time streaming applications

Changelog

Version 2.1 - Next Level FSP (Current)

  • Byte-exact storage - REF and LITERAL are stored as raw bytes for exact size tracking
  • Improved pattern selection - Automatic minimum pattern length selection between 3-5 characters
  • Minimized REF overhead - Patterns that repeat only once are stored as LITERAL instead of REF
  • Extended maximum pattern length - Support for patterns up to 5-6 characters for longer repeats
  • Direct file output - Compressed data can be saved directly to files (e.g., output.txt)
  • Universal compatibility - Works for any byte file or stream
  • Cross-language implementation - Easily implementable in Python, C, C++, Java, Rust, Go, etc.
  • Enhanced compression ratio - Better compression for various data types
  • LEGO box analogy - Simple explanation of the algorithm's approach
  • Testable implementation - Speed measurement for compression and decompression

Version 2.0 - Next Level FSP

  • Pattern search limited to 3–5 characters to reduce overhead
  • Automatic minimum pattern length selection for maximum repeats
  • Maximum pattern length extended to 5–6 characters for longer repeated sequences
  • Base stream stores concatenated unique patterns
  • References (REF) + literals (LITERAL) replace repeated patterns in text
  • Patterns that repeat only once are stored as LITERAL to minimize REF overhead
  • Works on the entire text, not line-by-line
  • Significantly improved compression ratio
  • Maintains full reversibility

Version 1.0 - Original FSP

  • Block-based compression approach
  • Pattern search from 2 characters upward with no maximum length
  • Base + DIFF per line implementation
  • Simple line-oriented approach
  • Potential for compressed text to grow larger than original for small patterns