Find Similar Patterns - A New Era of Data Compression
FSP (Find Similar Patterns) is a data compression algorithm designed to reduce data size by finding similar patterns between records. Unlike traditional compression methods (such as ZIP or RAR), FSP specializes in small and medium-sized data volumes, where classic algorithms are often inefficient.
Most compression algorithms add fixed overhead (headers, dictionaries, frequency tables), which makes them inefficient for small files. FSP takes a different approach that avoids this problem: it stores only the per-record differences against a base record.
The algorithm is based on delta encoding: it finds the differences between similar data blocks by comparing sequences at the byte or character level. For two sequences A and B, the difference is defined as a set of tuples (position, new_value).
For a base string B and a target string T, the delta D is defined as:
D = {(i, T[i]) | for all i, where B[i] ≠ T[i]}
If len(T) > len(B), the extra characters of T are added to the delta; if len(B) > len(T), the trailing positions of B are recorded with an empty new value (None in the implementation below), marking those characters for removal.
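For example, with a base B = "cat" and a target T = "cart": positions 0 and 1 match, position 2 changes from 't' to 'r', and position 3 exists only in T, so D = {(2, 'r'), (3, 't')}.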
def fsp_diff(a: str, b: str):
    """
    Finds differences between two strings (which may be of different lengths).

    Args:
        a (str): The first string (base)
        b (str): The second string (to be compared)

    Returns:
        list: A list of tuples (position, new_character)
    """
    diffs = []
    max_len = max(len(a), len(b))
    for i in range(max_len):
        char_a = a[i] if i < len(a) else None
        char_b = b[i] if i < len(b) else None
        if char_a != char_b:
            diffs.append((i, char_b))
    return diffs
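For example, with two made-up strings, the function reports the replaced character plus the length change at the tail (an extra character when the second string is longer, a None marker when it is shorter):

print(fsp_diff("cat", "cart"))   # [(2, 'r'), (3, 't')]
print(fsp_diff("cart", "cat"))   # [(2, 't'), (3, None)]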
def fsp_compress(lines):
    """
    Compresses a list of strings using the FSP algorithm.

    Args:
        lines (list): A list of strings to compress

    Returns:
        list: Compressed data in the format of a list of tuples (tag, data)
    """
    compressed = []
    if not lines:
        return compressed
    # Save the first string as the base
    base = lines[0]
    compressed.append(("BASE", base))
    # Process the remaining strings
    for line in lines[1:]:
        # Find differences from the base string
        differences = fsp_diff(base, line)
        # Determine what's more efficient: store the diff or the entire string
        if len(differences) < len(line):
            compressed.append(("DIFF", differences))
        else:
            compressed.append(("RAW", line))
    return compressed
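One thing to note about the len(differences) < len(line) check: it compares the number of differences against the number of characters, but each stored difference needs at least a position and a character. A slightly stricter heuristic (a sketch only, using the same ~2-bytes-per-difference estimate that the efficiency calculation further below uses) could look like this:

def estimated_diff_size(differences):
    # Rough size model: a position plus a character, ~2 bytes per difference
    # (the same assumption the efficiency calculation below makes)
    return len(differences) * 2

# Inside fsp_compress, the choice between DIFF and RAW could then be:
#     if estimated_diff_size(differences) < len(line.encode("utf-8")):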
def fsp_decompress(compressed):
    """
    Decompresses FSP-compressed data.

    Args:
        compressed (list): Compressed data in the format of a list of tuples (tag, data)

    Returns:
        list: The restored list of strings
    """
    lines = []
    base = None
    for tag, data in compressed:
        if tag == "BASE":
            base = data
            lines.append(base)
        elif tag == "DIFF":
            if base is None:
                raise ValueError("BASE block must come before DIFF blocks")
            # Create a copy of the base string for modification
            current_chars = list(base)
            deletions = []
            # Apply replacements and appends first
            for position, new_char in data:
                if position < len(current_chars):
                    if new_char is None:
                        # Remember the position; popping immediately would
                        # shift the indices of the remaining differences
                        deletions.append(position)
                    else:
                        # Replace the character
                        current_chars[position] = new_char
                else:
                    if new_char is not None:
                        # Add a new character
                        current_chars.append(new_char)
            # Remove deleted characters from the end so earlier indices stay valid
            for position in sorted(deletions, reverse=True):
                current_chars.pop(position)
            # Convert back to a string and add to the result
            reconstructed_line = ''.join(current_chars)
            lines.append(reconstructed_line)
        elif tag == "RAW":
            lines.append(data)
    return lines
# Example usage
if __name__ == "__main__":
    # Input data for compression
    test_data = [
        "Hello friend!",
        "Hello fiend!",
        "Hello friends!",
        "Hello friend!"
    ]

    print("Original data:")
    for i, line in enumerate(test_data):
        print(f"{i}: {line} (length: {len(line)} bytes)")

    # Compression
    compressed_data = fsp_compress(test_data)
    print("\nCompressed data:")
    for tag, data in compressed_data:
        if tag == "DIFF":
            print(f"{tag}: {len(data)} differences")
        else:
            print(f"{tag}: {data}")

    # Decompression
    decompressed_data = fsp_decompress(compressed_data)
    print("\nDecompressed data:")
    for i, line in enumerate(decompressed_data):
        print(f"{i}: {line}")

    # Check for correctness
    assert test_data == decompressed_data, "Error: Decompressed data does not match the original!"
    print("\n✓ Check passed: Data correctly restored!")

    # Calculate compression efficiency
    original_size = sum(len(line.encode('utf-8')) for line in test_data)
    compressed_size = 0
    for tag, data in compressed_data:
        if tag in ("BASE", "RAW"):
            compressed_size += len(data.encode('utf-8'))
        elif tag == "DIFF":
            # Assuming each difference takes ~2 bytes
            compressed_size += len(data) * 2

    ratio = original_size / compressed_size
    print("\nCompression efficiency:")
    print(f"Original size: {original_size} bytes")
    print(f"Compressed size: ~{compressed_size} bytes")
    print(f"Compression ratio: {ratio:.2f}")
Input data:
["Hello friend!", "Hello fiend!", "Hello friends!", "Hello friend!"]
Compression process: the first string is stored as the base; each remaining string is compared against it. "Hello fiend!" differs from the base in 6 positions, "Hello friends!" in 2, and the final string is identical to the base, so its diff is empty.
Compression result:
[("BASE", "Hello friend!"), ("DIFF", [(7, 'i'), (8, 'e'), (9, 'n'), (10, 'd'), (11, '!'), (12, None)]), ("DIFF", [(12, 's'), (13, '!')]), ("DIFF", [])]
The algorithm can be adapted to work with binary data by comparing bytes instead of characters:
def fsp_diff_binary(a: bytes, b: bytes):
    diffs = []
    max_len = max(len(a), len(b))
    for i in range(max_len):
        byte_a = a[i] if i < len(a) else None
        byte_b = b[i] if i < len(b) else None
        if byte_a != byte_b:
            diffs.append((i, byte_b))
    return diffs
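As a quick sanity check with two short, made-up byte strings, differing bytes are reported by position (note that indexing a bytes object yields integers, so the new values appear as numbers):

print(fsp_diff_binary(b"\x01\x02\x03", b"\x01\xff\x03\x04"))
# [(1, 255), (3, 4)]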
| Characteristic | FSP | ZIP | RLE | LZ77 |
|---|---|---|---|---|
| Efficiency for small data | High | Low | Medium | Low |
| Overhead | Minimal | High | Low | Medium |
| Implementation complexity | Low | High | Very Low | High |
| Compression speed | High | Medium | Very High | Medium |
| Compression ratio | High for similar data | High for large data | Low | High |
Unlike dictionary methods (LZ77, LZ78), FSP does not require storing a dictionary. Unlike entropy methods (Huffman coding, arithmetic coding), FSP does not require complex probability calculations.