Issue
I am looking for a compression algorithm for 10-digit integers (e.g., 5289412587). The aim is to reduce the compressed result to a number with as few digits as possible (e.g., 125). The algorithm must be reversible, so that I can reconstruct the original number from the result. Is there something that is right for me? Thank you.
Solution
Version 2: Added a discussion of compression based on knowing that the values are related.
No, and maybe.
Suppose the values you want to compress are represented by the variable n, with 0 <= n <= m. If the values of n are evenly distributed over that range, all values in the range are possible, and each value is independent of the others, then no lossless compression scheme can make every value shorter.
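To make the counting argument concrete, here is a small sketch (the class name and the exact range are my own, assuming every 10-digit value from 0000000000 through 9999999999 is possible):

```java
public class PigeonholeDemo {
    public static void main(String[] args) {
        // Number of distinct 10-digit values (0 .. 9_999_999_999 in this sketch).
        double distinctValues = 1e10;

        // Minimum bits needed if every value must get its own code word.
        double minBits = Math.ceil(Math.log(distinctValues) / Math.log(2));

        // Minimum decimal digits needed, for the same reason.
        double minDigits = Math.ceil(Math.log10(distinctValues));

        System.out.println("Minimum bits:   " + (int) minBits);   // 34
        System.out.println("Minimum digits: " + (int) minDigits); // 10
    }
}
```

In other words, a scheme that mapped every 10-digit value to something like 125 would have to map two different inputs to the same output, and then it couldn't be reversed.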
However, it might be that the possible values of n in your data are not evenly distributed. That is, some values are common, but other values are rare. Maybe some don't occur at all. If that is the case, there is a possibility you could use Huffman coding or something similar.
Huffman compression represents values using a variable number of bits: the more common values are represented by fewer bits, the less common values by more bits.
One way to use Huffman coding is to sample your data and create a "standard" table for use in all your coding and decoding of n. This can reduce processing time. If you don't create a "standard" table, you would build a table each time you compress by running through your data, counting how many times each value of n occurs, and sorting by number of occurrences. This requires more processing, but the table will be fine-tuned to that particular run. It also requires that the resulting table be carried with the compressed data.
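As a rough sketch of that table-building step (the class, the method names, and the sample values below are made up, not the asker's data), this counts occurrences, builds a Huffman tree with a priority queue, and prints the bit string assigned to each value. In practice the table, or the frequency counts it was built from, has to travel with the compressed data so the decoder can rebuild the same tree:

```java
import java.util.*;

public class HuffmanTableDemo {

    // A node in the Huffman tree: either a leaf holding a value, or an internal node.
    static class Node {
        final long value;        // meaningful only for leaves
        final long count;        // total occurrences under this node
        final Node left, right;  // null for leaves

        Node(long value, long count) { this(value, count, null, null); }
        Node(long value, long count, Node left, Node right) {
            this.value = value; this.count = count; this.left = left; this.right = right;
        }
        boolean isLeaf() { return left == null && right == null; }
    }

    // Count how often each value occurs, then build the Huffman tree bottom-up.
    static Node buildTree(long[] data) {
        Map<Long, Long> freq = new HashMap<>();
        for (long v : data) freq.merge(v, 1L, Long::sum);

        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingLong(n -> n.count));
        for (Map.Entry<Long, Long> e : freq.entrySet()) pq.add(new Node(e.getKey(), e.getValue()));

        // Repeatedly merge the two least-common nodes until one tree remains.
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(-1, a.count + b.count, a, b));
        }
        return pq.poll();
    }

    // Walk the tree and record the bit string for each leaf value.
    static void collectCodes(Node node, String prefix, Map<Long, String> codes) {
        if (node == null) return;
        if (node.isLeaf()) {
            codes.put(node.value, prefix.isEmpty() ? "0" : prefix); // single-symbol edge case
            return;
        }
        collectCodes(node.left, prefix + "0", codes);
        collectCodes(node.right, prefix + "1", codes);
    }

    public static void main(String[] args) {
        // Made-up sample: a few 10-digit values, some much more common than others.
        long[] data = {5289412587L, 5289412587L, 5289412587L, 5289412587L,
                       1111111111L, 1111111111L, 9999999999L};

        Map<Long, String> codes = new TreeMap<>();
        collectCodes(buildTree(data), "", codes);
        codes.forEach((value, code) -> System.out.println(value + " -> " + code));
    }
}
```

Notice that the most common value ends up with a 1-bit code while the rare ones get longer codes; that is where the compression comes from.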
But, if your data doesn't have values that are more common than other values, Huffman coding won't be useful.
Another possibility, perhaps more likely in your case, is compression based on knowledge of the interrelationship of the values. If there are circumstances that allow you to predict one value given information about another value, it might be possible to use that for data compression. I can best explain this with two examples.
About 11 years ago, I was looking at the Topcoder (topcoder.com) website. There was a contest there to compress data from DNA sequencing. The winning algorithms did this: for each block of data, create a new block with the same length (number of elements). The first value in the new block was a copy of the first value in the original block, but each subsequent value v[i] in the new block was the difference between two adjacent values in the original block: new v[i] = original v[i] - original v[i - 1]. The set of new blocks would then be compressed using Huffman coding.
That worked because the range of differences between successive values (from the smallest to the largest difference) was much smaller than the range of an individual value, and some differences were a lot more common than others.
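A minimal sketch of that delta step (the names and sample values are mine; the winning contest code was certainly more elaborate): the encoder keeps the first value and stores only differences after it, and the decoder reverses that with a running sum. The resulting differences are what would then be fed to the Huffman coder:

```java
import java.util.Arrays;

public class DeltaCodingDemo {

    // Encode: keep the first value, then store each value as the difference from its predecessor.
    static long[] deltaEncode(long[] original) {
        long[] encoded = new long[original.length];
        encoded[0] = original[0];
        for (int i = 1; i < original.length; i++) {
            encoded[i] = original[i] - original[i - 1];
        }
        return encoded;
    }

    // Decode: reverse the process with a running (cumulative) sum.
    static long[] deltaDecode(long[] encoded) {
        long[] decoded = new long[encoded.length];
        decoded[0] = encoded[0];
        for (int i = 1; i < encoded.length; i++) {
            decoded[i] = decoded[i - 1] + encoded[i];
        }
        return decoded;
    }

    public static void main(String[] args) {
        // Made-up block of slowly varying values: the deltas span a much smaller range.
        long[] block = {5289412587L, 5289412590L, 5289412588L, 5289412595L};

        long[] deltas = deltaEncode(block);
        System.out.println("Deltas:  " + Arrays.toString(deltas));  // [5289412587, 3, -2, 7]
        System.out.println("Decoded: " + Arrays.toString(deltaDecode(deltas)));
    }
}
```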
Another example was for medical image processing. An algorithm looked at images from MRI or CT scans for closed loops. When it found a closed loop, it recorded, in order, the position of each pixel in the loop. The data was sent to another lab for analysis.
A block of compressed data would represent one loop. A block began with a copy of the location of the first pixel. The location of each subsequent pixel was coded relative to the previous pixel in the sequence. Because a loop was continuous, each pixel in the loop had to be orthogonally or diagonally adjacent to the previous one. This allowed relative locations to be coded using 3 bits. For one loop, the compression/decompression algorithm ran until the calculated position of a pixel was the same as that of the loop's first pixel. Then, the next block would be processed.
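Here is a hypothetical sketch of that 3-bit chain coding (the direction table and the names are mine, not the lab's actual format): the first pixel is stored as absolute coordinates, and every later pixel is one of the 8 orthogonal or diagonal neighbours of the previous one, so a 3-bit direction index is enough. Decoding just replays the moves from the start pixel:

```java
import java.util.ArrayList;
import java.util.List;

public class ChainCodeDemo {

    // The 8 orthogonal/diagonal neighbour offsets; an index 0..7 fits in 3 bits.
    static final int[] DX = { 1, 1, 0, -1, -1, -1, 0, 1 };
    static final int[] DY = { 0, 1, 1,  1,  0, -1, -1, -1 };

    // Encode a loop: keep the first pixel, then one 3-bit direction per subsequent pixel.
    static List<Integer> encode(int[][] pixels) {
        List<Integer> directions = new ArrayList<>();
        for (int i = 1; i < pixels.length; i++) {
            int dx = pixels[i][0] - pixels[i - 1][0];
            int dy = pixels[i][1] - pixels[i - 1][1];
            for (int d = 0; d < 8; d++) {
                if (DX[d] == dx && DY[d] == dy) { directions.add(d); break; }
            }
        }
        return directions;
    }

    // Decode: start from the first pixel and apply each direction in turn.
    static List<int[]> decode(int startX, int startY, List<Integer> directions) {
        List<int[]> pixels = new ArrayList<>();
        int x = startX, y = startY;
        pixels.add(new int[]{x, y});
        for (int d : directions) {
            x += DX[d];
            y += DY[d];
            pixels.add(new int[]{x, y});
        }
        return pixels;
    }

    public static void main(String[] args) {
        // A tiny made-up closed loop around a 2x2 square, ending back at the start pixel.
        int[][] loop = { {10, 10}, {11, 10}, {11, 11}, {10, 11}, {10, 10} };

        List<Integer> dirs = encode(loop);
        System.out.println("Directions (3 bits each): " + dirs); // [0, 2, 4, 6]

        for (int[] p : decode(10, 10, dirs)) {
            System.out.println("(" + p[0] + ", " + p[1] + ")");
        }
    }
}
```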
Answered By - Old Dog Programmer