FSP Algorithm

Find Similar Patterns - A New Era of Data Compression

What is the FSP Algorithm?

FSP (Find Similar Patterns) is a revolutionary data compression algorithm designed to effectively reduce data size by finding similar patterns. Unlike traditional compression methods (like ZIP, RAR, etc.), FSP specializes in working with small and medium-sized data volumes, where classic algorithms are often inefficient.

Why is FSP Unique?

Most compression algorithms add overhead, which makes them inefficient for small files. FSP uses a fundamentally different approach, avoiding this problem.

How the FSP Algorithm Works

The algorithm is based on the principle of delta encoding and finding differences between similar data blocks.

Compression Stages:

  1. Choosing the base block - the first data block is saved in full as a reference
  2. Finding differences - differences from the base block are found for each subsequent block
  3. Storage optimization - if the differences are fewer than the block size, only they are saved
  4. Storing unique blocks - if a block is very different, it is saved in full

Mathematical principles:

The algorithm uses sequence comparison at the byte or character level. For two sequences A and B, the difference is defined as a set of tuples (position, new_value).

Formal definition:

For a base string B and a target string T, the delta D is defined as:

D = {(i, T[i]) | for all i, where B[i] ≠ T[i]}

If len(T) > len(B), additional characters are added to the delta.

Detailed Algorithm Implementation

Full implementation in Python

def fsp_diff(a: str, b: str):
    """
    Finds differences between two strings (can be of different lengths)
    
    Args:
        a (str): The first string (base)
        b (str): The second string (to be compared)
    
    Returns:
        list: A list of tuples (position, new_character)
    """
    diffs = []
    max_len = max(len(a), len(b))
    
    for i in range(max_len):
        char_a = a[i] if i < len(a) else None
        char_b = b[i] if i < len(b) else None
        
        if char_a != char_b:
            diffs.append((i, char_b))
    
    return diffs


def fsp_compress(lines):
    """
    Compresses a list of strings using the FSP algorithm
    
    Args:
        lines (list): A list of strings to compress
    
    Returns:
        list: Compressed data in the format of a list of tuples (tag, data)
    """
    compressed = []
    
    if not lines:
        return compressed
    
    # Save the first string as the base
    base = lines[0]
    compressed.append(("BASE", base))
    
    # Process the remaining strings
    for line in lines[1:]:
        # Find differences from the base string
        differences = fsp_diff(base, line)
        
        # Determine what's more efficient: save the diff or the entire string
        if len(differences) < len(line):
            compressed.append(("DIFF", differences))
        else:
            compressed.append(("RAW", line))
    
    return compressed


def fsp_decompress(compressed):
    """
    Decompresses FSP-compressed data
    
    Args:
        compressed (list): Compressed data in the format of a list of tuples (tag, data)
    
    Returns:
        list: The restored list of strings
    """
    lines = []
    base = None
    
    for tag, data in compressed:
        if tag == "BASE":
            base = data
            lines.append(base)
        elif tag == "DIFF":
            if base is None:
                raise ValueError("BASE block must come before DIFF blocks")
            
            # Create a copy of the base string for modification
            current_chars = list(base)
            
            # Apply all differences
            for position, new_char in data:
                if position < len(current_chars):
                    if new_char is None:
                        # Remove the character
                        current_chars.pop(position)
                    else:
                        # Replace the character
                        current_chars[position] = new_char
                else:
                    if new_char is not None:
                        # Add a new character
                        current_chars.append(new_char)
            
            # Convert back to a string and add to the result
            reconstructed_line = ''.join(current_chars)
            lines.append(reconstructed_line)
        elif tag == "RAW":
            lines.append(data)
    
    return lines


# Example usage
if __name__ == "__main__":
    # Input data for compression
    test_data = [
        "Hello friend!",
        "Hello fiend!",
        "Hello friends!",
        "Hello friend!"
    ]
    
    print("Original data:")
    for i, line in enumerate(test_data):
        print(f"{i}: {line} (length: {len(line)} bytes)")
    
    # Compression
    compressed_data = fsp_compress(test_data)
    
    print("\nCompressed data:")
    for tag, data in compressed_data:
        if tag == "DIFF":
            print(f"{tag}: {len(data)} differences")
        else:
            print(f"{tag}: {data}")
    
    # Decompression
    decompressed_data = fsp_decompress(compressed_data)
    
    print("\nDecompressed data:")
    for i, line in enumerate(decompressed_data):
        print(f"{i}: {line}")
    
    # Check for correctness
    assert test_data == decompressed_data, "Error: Decompressed data does not match the original!"
    print("\n✓ Check passed: Data correctly restored!")
    
    # Calculate compression efficiency
    original_size = sum(len(line.encode('utf-8')) for line in test_data)
    compressed_size = 0
    
    for tag, data in compressed_data:
        if tag in ("BASE", "RAW"):
            compressed_size += len(data.encode('utf-8'))
        elif tag == "DIFF":
            # Assuming each difference takes ~2 bytes
            compressed_size += len(data) * 2
    
    ratio = original_size / compressed_size
    print(f"\nCompression efficiency:")
    print(f"Original size: {original_size} bytes")
    print(f"Compressed size: ~{compressed_size} bytes")
    print(f"Compression ratio: {ratio:.2f}")

Advantages of the FSP Algorithm

Compared to traditional methods:

Technical advantages:

Detailed Examples of Operation

Example 1: Text Data

Input data:

["Hello friend!", "Hello fiend!", "Hello friends!", "Hello friend!"]

Compression process:

  1. Save "Hello friend!" as BASE
  2. For "Hello fiend!" find differences: [(7, 'i'), (8, 'e'), (9, 'n'), (10, 'd'), (11, '!'), (12, None)]
  3. For "Hello friends!" find differences: [(12, 's'), (13, '!')]
  4. For "Hello friend!" there are no differences (empty diff)

Compression result:

[("BASE", "Hello friend!"),
 ("DIFF", [(7, 'i'), (8, 'e'), (9, 'n'), (10, 'd'), (11, '!'), (12, None)]),
 ("DIFF", [(12, 's'), (13, '!')]),
 ("DIFF", [])]

Example 2: Binary Data

The algorithm can be adapted to work with binary data by comparing bytes instead of characters:

def fsp_diff_binary(a: bytes, b: bytes):
    diffs = []
    max_len = max(len(a), len(b))
    
    for i in range(max_len):
        byte_a = a[i] if i < len(a) else None
        byte_b = b[i] if i < len(b) else None
        
        if byte_a != byte_b:
            diffs.append((i, byte_b))
    
    return diffs

Comparison with Other Algorithms

Characteristic FSP ZIP RLE LZ77
Efficiency for small data High Low Medium Low
Overhead Minimal High Low Medium
Implementation complexity Low High Very Low High
Compression speed High Medium Very High Medium
Compression ratio High for similar data High for large data Low High

Key differences:

Unlike dictionary methods (LZ77, LZ78), FSP does not require storing a dictionary. Unlike entropy methods (Huffman, Arithmetic), FSP does not require complex probability calculations.

Areas of Application

Ideal usage scenarios:

Limitations: