Data Compression Algorithms

VEDANT MORKAR
May 31, 2022

Data compression algorithms reduce the size of files while retaining all, or most, of the original information. They do this by eliminating redundant data or by re-encoding the data in a more efficient representation.

When compressing, you can choose between lossy and lossless methods. A lossy method permanently discards some of the data, while a lossless method preserves the original data exactly. Which type you choose depends on the quality you need from your files.

Lossless data compression algorithms are typically used for archiving and other tasks where fidelity matters. They reduce file size while guaranteeing that the original file can be fully restored whenever it is needed.

Lossless data compression can be grouped into two categories:

1. Entropy-based encoding: This technique does not depend on the specific characteristics of the medium. The algorithm begins by counting the frequency of each symbol's occurrence in the file. New codes are then assigned to the symbols of the original file; the length of each new code is variable and depends on how frequently the corresponding symbol appears in the original file.

2. Dictionary-based encoding: Dictionary-based algorithms do not encode single symbols as variable-length bit strings; they encode variable-length strings of symbols as single tokens. The encoder searches the dictionary for a string matching a portion of the original text and, when a match is found, replaces that portion with a reference to its entry in the dictionary.

Entropy Based Encoding

A. Run Length Encoding (RLE)

Run-length encoding (RLE) is a form of lossless data compression in which runs of data (sequences in which the same data value occurs in many consecutive data elements) are stored as a single data value and count, rather than as the original run. The algorithm compresses the runs in the source file while leaving non-runs (symbols that do not repeat consecutively) unchanged. For example, in the text "aaabbccccd" there is a run of three a's, a run of two b's, a run of four c's, and a single d, so the string can be stored as "a3b2c4d1".

Pseudo Code
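As a sketch of the idea, here is a minimal run-length encoder in Python; the function name and the output format (each symbol followed by its run length) are illustrative choices, not part of any standard.

```python
def rle_encode(text):
    """Encode a string as runs, e.g. 'aaabbccccd' -> 'a3b2c4d1'."""
    if not text:
        return ""
    encoded = []
    run_char, run_len = text[0], 1
    for ch in text[1:]:
        if ch == run_char:
            run_len += 1                            # extend the current run
        else:
            encoded.append(f"{run_char}{run_len}")  # close the finished run
            run_char, run_len = ch, 1
    encoded.append(f"{run_char}{run_len}")          # flush the last run
    return "".join(encoded)

print(rle_encode("aaabbccccd"))   # a3b2c4d1
```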

Time Complexity

The time complexity of the above solution is O(n), where n is the length of the input string, and it requires no extra working space beyond the output.

B. Shannon Fano

The Shannon Fano algorithm is an entropy encoding technique for lossless data compression. Named after Claude Shannon and Robert Fano, it assigns a code to each symbol based on its probability of occurrence. It is a variable-length encoding scheme, i.e. the codes assigned to the symbols are of varying length.

How does it work?

The steps of the algorithm are as follows:

  1. Create a list of probabilities or frequency counts for the given set of symbols so that the relative frequency of occurrence of each symbol is known.
  2. Sort the list of symbols in decreasing order of probability, the most probable ones to the left and least probable to the right.
  3. Split the list into two parts, with the total probability of both the parts being as close to each other as possible.
  4. Assign the value 0 to the left part and 1 to the right part.
  5. Repeat steps 3 and 4 for each part, until all the symbols are split into individual subgroups.

If c is a character, Probability(c) = Frequency(c) / sum of frequencies

Pseudo Code
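A compact Python sketch of the steps listed above; the split point is chosen so that the total frequencies of the two halves are as balanced as possible (function and variable names are my own).

```python
def shannon_fano(freqs):
    """freqs: dict symbol -> frequency. Returns dict symbol -> bit string."""
    # Steps 1-2: sort symbols by decreasing frequency
    symbols = sorted(freqs, key=freqs.get, reverse=True)
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) <= 1:
            return
        total = sum(freqs[s] for s in group)
        # Step 3: find the split point where both halves are as close as possible
        running, best_i, best_diff = 0, 1, float("inf")
        for i, s in enumerate(group[:-1], start=1):
            running += freqs[s]
            diff = abs(total - 2 * running)
            if diff < best_diff:
                best_diff, best_i = diff, i
        left, right = group[:best_i], group[best_i:]
        # Step 4: append 0 to the left part and 1 to the right part
        for s in left:
            codes[s] += "0"
        for s in right:
            codes[s] += "1"
        # Step 5: repeat on each part until every symbol stands alone
        split(left)
        split(right)

    split(symbols)
    return codes

print(shannon_fano({"a": 22, "b": 28, "c": 15, "d": 30, "e": 5}))
```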

Time Complexity

The time complexity runs similar to quick sort. The recurrence relation is

T(n) = T(k) + T(n - k) + Θ(n), where k and n - k are the sizes of the two partitions and k > 0.

Worst case: the partitions made at each step can be very unbalanced. For example, if the probabilities are {0.05, 0.1, 0.15, 0.2, 0.5}, the recurrence becomes T(n) = T(n - 1) + T(1) + Θ(n), and T(n) turns out to be Θ(n²) in this case.

Best case: if the list can be split so that the two partitions are of almost equal size, for example with probabilities {0.2, 0.25, 0.25, 0.3}, the recurrence becomes T(n) = 2T(n/2) + Θ(n), and T(n) turns out to be Θ(n log n) in this case.

C. Huffman Coding

Huffman coding was developed by David Huffman and published in 1952. It often performs better than Shannon Fano coding. In this coding, the characters in a data file are converted into binary codes. Two categories of Huffman encoding have been proposed: the static Huffman algorithm and the adaptive Huffman algorithm. The static algorithm first counts the frequencies and then generates a single tree that is shared by the compression and decompression processes. The adaptive algorithm generates the tree while counting the frequencies, so the encoder and decoder each build and update their own copy of the tree during the two processes.

Pseudo Code
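A short Python sketch of static Huffman coding built on the standard heapq module; the node representation and function names are illustrative choices.

```python
import heapq

def huffman_codes(freqs):
    """freqs: dict symbol -> frequency. Returns dict symbol -> bit string."""
    if len(freqs) == 1:                        # degenerate single-symbol case
        return {next(iter(freqs)): "0"}
    # Each heap entry: (frequency, tie-breaker, list of (symbol, code-so-far))
    heap = [(f, i, [(sym, "")]) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)                        # Build-Heap: O(n)
    counter = len(heap)
    while len(heap) > 1:                       # n - 1 merges
        f1, _, left = heapq.heappop(heap)      # two EXTRACT-MIN operations
        f2, _, right = heapq.heappop(heap)
        merged = [(s, "0" + c) for s, c in left] + [(s, "1" + c) for s, c in right]
        heapq.heappush(heap, (f1 + f2, counter, merged))   # one INSERT
        counter += 1
    return dict(heap[0][2])

print(huffman_codes({"a": 5, "b": 9, "c": 12, "d": 13, "e": 16, "f": 45}))
```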

Time Complexity

O(N): putting the n (symbol, frequency) pairs into H and creating the n leaves.

O(N): turning H into a heap using Build-Heap.

O(N log N): the for loop runs n - 1 times; each iteration performs two EXTRACT-MIN operations and one INSERT operation, each of which takes O(log N).

Therefore: O(N) + O(N) + O(N log N) ⇒ O(N log N)

Dictionary Based Encoding

A. LZ77

LZ77 was developed by Jacob Ziv and Abraham Lempel in 1977. Here the dictionary is simply a portion of the previously encoded sequence, and the encoder searches for matches within a sliding window over the input. The algorithm assumes that patterns in the input stream occur close together; any pattern that recurs over a span longer than the search buffer will not be captured. A better compression method would keep frequently appearing patterns in an explicit dictionary.

Pseudo Code
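A simplified Python sketch of LZ77 encoding over a sliding window; the window size, the (offset, length, next character) token format, and the names are illustrative assumptions rather than a fixed specification.

```python
def lz77_encode(data, window=4096, max_len=15):
    """Encode data as a list of (offset, length, next_char) tokens."""
    i, tokens = 0, []
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)                 # left edge of the search buffer
        # search the window for the longest match starting at position i
        for j in range(start, i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        next_char = data[i + best_len] if i + best_len < len(data) else ""
        tokens.append((best_off, best_len, next_char))
        i += best_len + 1                          # skip past the match + literal
    return tokens

print(lz77_encode("aacaacabcabaaac"))
```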

Time Complexity

O(M) for a text of M characters, assuming a fixed-size search window.

B. LZ78

This compression algorithm was developed by Jacob Ziv and Abraham Lempel in 1978. LZ78 inserts one- or multi-character, non-overlapping, distinct patterns of the message being encoded into a dictionary. Unlike LZ77, it keeps an explicit, open-ended dictionary.

Pseudo Code
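A minimal Python sketch of LZ78 encoding, where each output token is a (dictionary index, next character) pair and new phrases are appended to an open-ended dictionary (names are illustrative).

```python
def lz78_encode(data):
    """Encode data as a list of (index, char) tokens; index 0 means 'no prefix'."""
    dictionary = {}                # phrase -> index
    tokens, phrase = [], ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch                                    # keep extending the phrase
        else:
            tokens.append((dictionary.get(phrase, 0), ch))  # emit (prefix index, new char)
            dictionary[phrase + ch] = len(dictionary) + 1   # add the new phrase
            phrase = ""
    if phrase:                                              # flush a trailing phrase
        tokens.append((dictionary.get(phrase, 0), ""))
    return tokens

print(lz78_encode("abababababa"))
```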

Time Complexity

Given a string S of length n, an LZ78 encoding of size m for S can be computed in O((n + m) log σ) time, where σ is the number of distinct characters appearing in S.

Comparison of LZ77 and LZ78

LZ77 looks for matches inside a sliding window over the previously encoded text, so it can only exploit repetitions that occur close together, whereas LZ78 builds an explicit, open-ended dictionary of distinct phrases that is not limited by a window.

C. LZW

LZW stands for Lempel–Ziv–Welch; the algorithm was developed by Abraham Lempel, Jacob Ziv and Terry Welch in 1984 and is based on LZ78. It uses a dictionary that is indexed by codes. The dictionary is initialised with 256 entries (indexed with the byte values 0 through 255) representing the 256 single-byte characters.

Pseudo Code
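A short Python sketch of LZW encoding; the dictionary starts with the 256 single-byte codes (0 through 255) and grows as new sequences are encountered (names are illustrative).

```python
def lzw_encode(data):
    """Encode a string as a list of integer codes using LZW."""
    # initialise the dictionary with the 256 single-character codes 0..255
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    codes, current = [], ""
    for ch in data:
        if current + ch in dictionary:
            current += ch                          # longest known sequence so far
        else:
            codes.append(dictionary[current])      # emit code for the known prefix
            dictionary[current + ch] = next_code   # add the new sequence
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])          # flush the final sequence
    return codes

print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))
```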

Time Complexity

The LZW dictionary can be stored as a character trie. Stored this way, the dictionary can be traversed one node per input byte, which makes the compression algorithm essentially O(n) in the length of the input.

CONCLUSION

In this blog, we discussed the situations that call for lossless compression and looked at several compression techniques in detail, focusing on the two main families: entropy-based and dictionary-based methods. Among the entropy-based techniques, RLE is not used much on its own, while Shannon Fano and Huffman encoding are the two methods that compress better than RLE. Shannon Fano and Huffman compression are quite similar, but Huffman coding is the better of the two, since it always produces an optimal prefix code while Shannon Fano may not. Among the dictionary-based algorithms, the LZW method generally gives the best results.
