数据压缩

压缩数据以节省储存空间,节省传输时间。同时很多文件都有很多冗余信息,这为压缩提供了很多可能性。

通用文件压缩 ·文件:GZIP,BZIP,7z ·Archivers:PKZIP ·文件系统:NTFS,HFS +,ZFS

多媒体 ·图像:GIF,JPEG ·声音:MP3 ·视频:MPEG,DivX™,HDTV

通讯 ·ITU-T T4 Group 3 Fax ·V.42bis调制解调器 ·Skype

数据库

压缩率

Compression ratio = Bits in Compressed B / bits in B.

自然语言的压缩率为50-75%或更高.

读写二进制

public class BinaryStdIn {
    boolean readBoolean() // read 1 bit of data and return as a boolean value
    char readChar() // read 8 bits of data and return as a char value
    char readChar(int r) // read r bits of data and return as a char value
    // similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)
    boolean isEmpty() // is the bitstream empty?
    void close() // close the bitstream
}

public class BinaryStdOut {
    void write(boolean b) // write the specified bit
    void write(char c) // write the specified 8-bit char
    void write(char c, int r) // write the r least significant bits of the specified char
    // similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits)
    void close() // close the bitstream
}

比如使用三种方法表达12/31/1999 1, A character stream (StdOut),

StdOut.print(month + "/" + day + "/" + year);

00110001 1 00110010 2 00101111 / 00110111 3 00110001 1 00101111 / 00110001 1 00111001 9 00111001 9 00111001 1 共 80bits 2, Three ints (BinaryStdOut)

BinaryStdOut.write(month);
BinaryStdOut.write(day);
BinaryStdOut.write(year);

00000000 00000000 00000000 00001100 12 00000000 00000000 00000000 00011111 31 00000000 00000000 00000111 11001111 1999 共96bits 3,A 4-bit field, a 5-bit field, and a 12-bit field (BinaryStdOut)

BinaryStdOut.write(month, 4);
BinaryStdOut.write(day, 5);
BinaryStdOut.write(year, 12);

1100 12 11111 13 0111110 01111 1999 共21bits

通用数据压缩算法?

不存在的,因为假如真的存在一种可以压缩所有比特串的算法,那么该算法就可以继续压缩已经被它压缩过的数据,那意味着所有比特串可以被压缩为0比特.

Run-length encoding

Simple type of redundancy in a bitstream. Long runs of repeated bits: 0000000000000001111111000000011111111111 Compression, 4-bit counts to represent alternating runs of 0s and 1s: 15 0s, then 7 1s, then 7 0s, then 11 1s. 1111 0111 0111 1011

public class RunLength
{
    // maximum run-length count
   private final static int R    = 256;
   // number of bits per count
   private final static int LG_R = 8;

   /**
     * Reads a sequence of bits from standard input; compresses
     * them using run-length coding with 8-bit run lengths; and writes the
     * results to standard output.
     */
   public static void compress()
   {
        char run = 0;
        boolean old = false;
        while (!BinaryStdIn.isEmpty()) {
            boolean b = BinaryStdIn.readBoolean();
            if (b != old) {
                BinaryStdOut.write(run, LG_R);
                run = 1;
                old = !old;
            }
            else { // 如果长度超过最大值, 写入0
                if (run == R-1) {
                    BinaryStdOut.write(run, LG_R);
                    run = 0;
                    BinaryStdOut.write(run, LG_R);
                }
                run++;
            }
        }
        BinaryStdOut.write(run, LG_R);
        BinaryStdOut.close();
   }

   /**
     * Reads a sequence of bits from standard input (that are encoded
     * using run-length encoding with 8-bit run lengths); decodes them;
     * and writes the results to standard output.
     */
   public static void expand()
   {
      boolean bit = false;
      while (!BinaryStdIn.isEmpty())
      {
         int run = BinaryStdIn.readInt(lgR);
         for (int i = 0; i < run; i++)
            BinaryStdOut.write(bit);
         bit = !bit;
      }
      BinaryStdOut.close();
   }
}