数据压缩

场景：DNA 序列（A、C、T、G），4 字符。
压缩：
ASCII（8位）(\to) 2位编码（00=A, 01=C, 10=T, 11=G）。
前置 32 位记录长度 ( N )。
代码： java public static void compress() { Alphabet DNA = new Alphabet("ACTG"); String s = BinaryStdIn.readString(); int N = s.length(); BinaryStdOut.write(N); for (int i = 0; i < N; i++) BinaryStdOut.write(DNA.toIndex(s.charAt(i)), 2); BinaryStdOut.close(); }
展开：
读 ( N )，逐 2 位解码。
代码： java public static void expand() { Alphabet DNA = new Alphabet("ACTG"); int N = BinaryStdIn.readInt(); for (int i = 0; i < N; i++) BinaryStdOut.write(DNA.toChar(BinaryStdIn.readChar(2))); BinaryStdOut.close(); }
性能：压缩率接近 25%（含 32 位开销）。
局限：需约定字母表，基因数据无更多结构可压缩。

场景：长连续重复比特（如位图）。
压缩：
记录交替 0/1 游程长度（8位，0-255）。
超 255 用 0 分隔。
代码： java public static void compress() { char cnt = 0; boolean b, old = false; while (!BinaryStdIn.isEmpty()) { b = BinaryStdIn.readBoolean(); if (b != old) { BinaryStdOut.write(cnt); cnt = 0; old = !old; } else if (cnt == 255) { BinaryStdOut.write(cnt); BinaryStdOut.write((char)0); cnt = 0; } cnt++; } BinaryStdOut.write(cnt); BinaryStdOut.close(); }
展开：
读长度，输出对应比特。
代码： java public static void expand() { boolean b = false; while (!BinaryStdIn.isEmpty()) { char cnt = BinaryStdIn.readChar(); for (int i = 0; i < cnt; i++) BinaryStdOut.write(b); b = !b; } BinaryStdOut.close(); }
性能：
示例：40 位（15个0,7个1,7个0,11个1）(\to) 16 位，压缩率 40%。
位图：1536 位(\to) 1144 位（74%），分辨率翻倍压缩率减半。
适用性：长游程有效，短游程可能膨胀。

场景：字符频率差异大（如文本）。
原理：变长前缀码，高频字符短码。
步骤：
统计频率。
构造霍夫曼树（最小堆合并低频节点）。
建编译表。
写树+长度+编码。
实现：
压缩： java public static void compress() { String s = BinaryStdIn.readString(); char[] input = s.toCharArray(); int[] freq = new int[256]; for (int i = 0; i < input.length; i++) freq[input[i]]++; Node root = buildTrie(freq); String[] st = new String[256]; buildCode(st, root, ""); writeTrie(root); BinaryStdOut.write(input.length); for (int i = 0; i < input.length; i++) for (char c : st[input[i]].toCharArray()) BinaryStdOut.write(c == '1'); BinaryStdOut.close(); }
展开：读树、长度，沿树解码。
树构造：
用 Node 类（含 ch, freq, left, right）。
最小堆合并频率最低的节点。
性能：
示例：408 位(\to) 352 位（86%），含树成本。
最优性（命题 U）：加权外部路径长度最小。
适用性：通用，文本压缩率 50%-60%。

场景：重复子字符串（如文本、位图）。
原理：变长模式(\to)定长码，无需附表。
压缩：
用 TST 维护字符串到编码表。
找最长前缀，输出编码，加新条目。
代码： java public static void compress() { String input = BinaryStdIn.readString(); TST<Integer> st = new TST<Integer>(); for (int i = 0; i < 256; i++) st.put("" + (char)i, i); int code = 257; while (input.length() > 0) { String s = st.longestPrefixOf(input); BinaryStdOut.write(st.get(s), 12); int t = s.length(); if (t < input.length() && code < 4096) st.put(input.substring(0, t + 1), code++); input = input.substring(t); } BinaryStdOut.write(256, 12); BinaryStdOut.close(); }
展开：
用数组维护编码到字符串表。
处理前瞻字符特殊情况。
性能：
示例：119 位(\to) 104 位（87%）。
《双城记》：46% 压缩率。
适用性：高效，适合多种文件。

算法	适用场景	压缩率示例	优点	局限性
双位编码	小字母表（如基因）	25%	简单，针对性强	需约定字母表
游程编码	长重复比特	74%（位图）	随分辨率提升更优	短游程无效
霍夫曼压缩	频率差异大	53%（文本）	最优前缀码，通用	需附带树
LZW 压缩	重复子字符串	46%（文本）	无需附表，高效	小文件效果差