Algorithms

Find All Collinear Points - A Pattern Recognition Problem

The Line Patterns Recognition A basic but important application of pattern recognition is to recognize line patterns in a given set of points. http://coursera.cs.princeton.edu/algs4/assignments/collinear.html. This blog will give a breif introduction to this problem and provide an enfficient solution. Codes available in algs4/collinear/src/ The problem could be described as: Given a set of n distinct points in the plane, find every (maximal) line segment that connects a subset of 4 or more of the points.. ...

Randomized Queue with Reservoir Sampling

This blog explains an apllication of randomized queue algorithms. Permutation client memory challenge A client program Permutation.java that takes an integer k as a command-line argument; reads in a sequence of strings from standard input using StdIn.readString(); and prints exactly k of them, uniformly at random. Print each item from the sequence at most once. More detail could be found at programming assignment specification and checklist, codes available in algs4/queues/src/. Randomized queue For a randomized queue, the item removed is chosen uniformly at random from items in the data structure. ...

Percolations problem

Union-find applications: Percolation Problem discriptions Percolation data type. To model a percolation system, create a data type Percolation with the following API: public class Percolation { public Percolation(int n); // create n-by-n grid, with all sites blocked public void open(int row, int col); // open site (row, col) if it is not open already public boolean isOpen(int row, int col); // is site (row, col) open? public boolean isFull(int row, int col); // is site (row, col) full? public int numberOfOpenSites(); // number of open sites public boolean percolates(); // does the system percolate? } Monte Carlo simulation. To estimate the percolation threshold, consider the following computational experiment: ...

Algorithms - Princeton

Algorithms, Part I, https://online.princeton.edu/course/algorithms-part-i Algorithms, Part II, https://online.princeton.edu/course/algorithms-part-ii Algorithms, 4th Edition by Robert Sedgewick and Kevin Wayne https://algs4.cs.princeton.edu/ Union−Find Considering the dynamic connectivity problem, modeling of multiple objects connected in a space/network. Applications involve manipulating objects of all types. ・Pixels in a digital photo. ・Computers in a network. ・Friends in a social network. ・Transistors in a computer chip. Given a set of N objects. union(a, b): connect two objects. connected(p, q): is two objects connected? find(p): Find component identifier for p (0 to N – 1) Modeling the objects: array. ...

信息处理 - 数据压缩 - 哈夫曼编码

避免歧义的编码在构建压缩编码的对应关系时，我们使用不同的数量的位来编码不同的字符. 比如摩斯密码. 如果单纯使用这种对应关系，会出现一些问题，如•••−−−•••会产生歧义: SOS? V7? IAMIE? EEWNI? 所以在实际使用中, 密码使用一些间隔来分隔代码字。那么对于不同的压缩编码, 有什么常用方法来避免歧义？方法是确保没有一个编码是另一个编码的前缀。比如使用固定长度编码。为每个编码添加特殊的stop char。使用一种具备广泛使用性的prefix-free编码。用什么数据结构来设计prefix-free编码? 用Trie构造编码一个二叉(0, 1)Trie: 叶节点是字符, 根节点到叶节点的路径就是编码. 压缩: 方法1：从叶开始; 按照路径到达根; 反向打印bits。方法2：创建键-值对的符号表。解压: 从根节点开始, 根据位值是0还是1在Trie图上游走, 直到走到叶节点，则解压出一个字符返回根节点, 继续第一步, 直到跑完所有编码. private static class Node implements Comparable<Node> { private final char ch; // used only for leaf nodes private final int freq; // used only for compress private final Node left, right; public Node(char ch, int freq, Node left, Node right) { this.ch = ch; this.freq = freq; this.left = left; this.right = right; } public boolean isLeaf() { return left == null && right == null; } // compare Nodes by frequency public int compareTo(Node that) { return this.freq - that.freq; } // Runtime - Linear in input size N public void expand() { Node root = readTrie(); // read in encoding trie int N = BinaryStdIn.readInt(); // read in number of chars for (int i = 0; i < N; i++) { Node x = root; while (!x.isLeaf()) { if (!BinaryStdIn.readBoolean()) x = x.left; else x = x.right; } BinaryStdOut.write(x.ch, 8); } BinaryStdOut.close(); } } 如何读取一个Trie：根据Trie的前序遍历序列重构. ...

信息处理 - 数据压缩

数据压缩压缩数据以节省储存空间，节省传输时间。同时很多文件都有很多冗余信息，这为压缩提供了很多可能性。通用文件压缩 ·文件：GZIP，BZIP，7z ·Archivers：PKZIP ·文件系统：NTFS，HFS +，ZFS 多媒体 ·图像：GIF，JPEG ·声音：MP3 ·视频：MPEG，DivX™，HDTV 通讯 ·ITU-T T4 Group 3 Fax ·V.42bis调制解调器 ·Skype 数据库压缩率 Compression ratio = Bits in Compressed B / bits in B. 自然语言的压缩率为50-75％或更高. 读写二进制 public class BinaryStdIn { boolean readBoolean() // read 1 bit of data and return as a boolean value char readChar() // read 8 bits of data and return as a char value char readChar(int r) // read r bits of data and return as a char value // similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits) boolean isEmpty() // is the bitstream empty? void close() // close the bitstream } public class BinaryStdOut { void write(boolean b) // write the specified bit void write(char c) // write the specified 8-bit char void write(char c, int r) // write the r least significant bits of the specified char // similar methods for byte (8 bits); short (16 bits); int (32 bits); long and double (64 bits) void close() // close the bitstream } 比如使用三种方法表达12/31/1999 1, A character stream (StdOut), ...

众数问题 - Boyer–Moore majority vote algorithm

数组中有一个数字出现的次数超过数组长度的一半，例如输入一个长度为9的数组1,2,3,2,2,2,5,4,2。由于数字2在数组中出现了5次，超过数组长度的一半，因此输出2。如果不存在则输出0。因为这个数出现次数超过了数组长度一半以上, 那么它就是数组中出现次数最多的数, 故谓之众数. class Solution: def MoreThanHalfNum_Solution(self, numbers): # write code here most = numbers[0] count = 1 for item in numbers: if item == most: count += 1 else: count -= 1 if count < 0: most = item count = 1 return 0 if numbers.count(most) <= len(numbers) / 2 else most 众数问题众数问题可以推广泛化：给定大小为n的整数数组，找到所有出现超过n / m次的元素。这种问题可以使用 Boyer-Moore 算法解决. The Boyer–Moore majority vote algorithm is an algorithm for finding the majority of a sequence of elements using linear time and constant space. It is named after Robert S. Boyer and J Strother Moore, who published it in 1981, and is a prototypical example of a streaming algorithm. ...

不同树结构的字符串符号表

各种树的变种为了适应不同的应用场景, 人们使用不同的树结构来实现符号表. 九宫格输入法对于手机的九宫格输入法, 简单的实现方式是多次敲击: 通过反复按键输入一个字母，直到出现所需的字母。但 http://www.t9.com/ 的 T9 texting 支持更高效的输入方法: ・Find all words that correspond to given sequence of numbers. ・Press 0 to see all completion options. Ex. hello ・多次敲击: 4 4 3 3 5 5 5 5 5 5 6 6 6 ・T9: 4 3 5 5 6 可以使用 8-way trie 来实现. 三元搜索Trie R较大的R-way trie的空间效率不高，读取比较大的文件往往导致内存不足。但弊端是开辟出的数组内存利用率其实不高。现在很多系统都使用Unicode，分支可高达65,536. 所以需要更高效的方法。 Ternary search tries: ・Store characters and values in nodes (not keys). ・Each node has 3 children: smaller (left), equal (middle), larger (right). Search in a TST: Follow links corresponding to each character in the key. ・If less, take left link; if greater, take right link. ・If equal, take the middle link and move to the next key character. ...

字符串符号表和三元搜索Trie

符号表在计算机科学中，符号表是一种用于语言翻译器（例如编译器和解释器）中的数据结构。在符号表中，程序源代码中的每个标识符都和它的声明或使用信息绑定在一起，比如其数据类型、作用域以及内存地址。常用哈希表来实现. 符号表的应用非常广泛, 可用于实现Set, Dictionary, 文件索引, 稀疏向量/矩阵等数据结构和相关的运算操作, 还有其他如过滤查询(Exception filter), 一致性查询(concordance queries)等操作. 字符符号表就是专门针对字符操作的符号表, API: Prefix match - Keys with prefix sh: she, shells, and shore. Wildcard match - Keys that match .he: she and the. Longest prefix - Key that is the longest prefix of shellsort: shells. public interface StringST<Value> { StringST(); create a symbol table with string keys void put(String key, Value val); put key-value pair into the symbol table Value get(String key); value paired with key void delete(String key); delete key and corresponding value Iterable<String> keys(); all keys Iterable<String> keysWithPrefix(String s); keys having s as a prefix Iterable<String> keysThatMatch(String s); keys that match s (where . is a wildcard) String longestPrefixOf(String s); longest key that is a prefix of s } 以Trie为基础的字符符号表 algs4中提供了用 R-way trie 来实现符号表(symbol table)例子: ...

“和谐” - 多模式匹配算法 - AC自动机

虽然KMP可以用于单模式匹配问题，但如果是多模式问题, KMP的性能就得不到保证。比如根据墙内法律要求, 墙内的搜索引擎需要过滤敏感词后才能合法运营。敏感词的数量不少, 如果要求包含敏感词的网页不能被搜索到, 那么搜索引擎在爬取网页信息时, 就要标记网页的文本中是否包含任意个敏感词. 这就是典型的多模匹配问题. 这种情况下如果使用Trie，那么需要遍历网页的每一个字符位置，对每一个位置进行Trie前缀匹配。如果词典的词语数量为N，每个词语长度为L，文章的长度为M，那么需要进行的计算次数是在N*M*L这个级别的. 即使把词语的长度L简化为常数级别的, 整个算法的复杂度也至少是$O(n^2)$. AC自动机可以看到，KMP算法可以避免back up（在检查字符的过程中不需要回头），而Trie可以存储多个模式的信息。如果把二者结合在一起，也许能从性能上解决多模式（任意位置）匹配问题。这就是Aho–Corasick算法（AC自动机）。 Aho–Corasick算法是由Alfred V. Aho和Margaret J.Corasick 发明的字符串搜索算法，用于在输入的一串字符串中匹配有限组字典中的子串。它与普通字符串匹配的不同点在于同时与所有字典串进行匹配。算法均摊情况下具有近似于线性的时间复杂度，约为字符串的长度加所有匹配的数量。所以算法的关键就是通过Trie把多个模式构建为一个DFA（Deterministic finite state automaton），然后让模式串末尾对应的状态作为一个DFA的终止节点。这样，对于一个要检查的长字符串（如一段网页内容），让这个字符串在DFA上跑一趟，每一个字符表示一种跳转方式，如果这段字符能够跳到任何一个终结节点, 那么就表明这段字符串匹配了至少一个模式, 如果整段字符跑完都没到达终结节点, 那么这个网页就是"和谐的". 在单模式匹配中, 用KMP构建的DFA是比较简单的, 从左到右, 开头的状态就是开始状态, 结尾的状态就是结束状态: 而多模式匹配中, 在Trie的结构基础上构建出来的DFA更像一个DFA的样子: Trie中的节点, 就类似于DFA中的状态. 如果让字符串shis在上面跑, 假如仅仅是靠Trie(也即是没有虚线标识的转移), 那么第一次从字符串的第一个字符s开始转移, 经过转移路径0 - 85 - 90之后就转不动了, 因为Trie记录的模式中没有shi, 这个时候得back up, 从第二个位置h开始再匹配一遍. 这个过程中就产生重复匹配, 而参考KMP的思路, 在匹配shi的过程中, 其实已经挖掘出了hi这个子串了, 而这个子串是跟模式his对应的, 如果有办法不回头继续匹配下去就能提高性能了. 而DFA中虚线的失败转移就是用来解决这个问题的: 当走到状态90时, 前面有了小部分子串h刚好对应状态74, 这个时候用虚线作为失败转移, 转移到74, 在状态74中寻找下一个转移i, 这样就实现了不回头继续匹配了. ...