Bitap算法

Bitap算法（或稱為shift-or、shift-and、Baeza-Yates–Gonnet算法）是一種字符串近似匹配算法。該算法可判斷給定的文本是否包含與定義模式「近似相等」的子字符串。其中，根據萊文斯坦距離 – 如果子字符串和定義模式彼此在給定距離「K」之內，則該算法認為他們近似。該算法預先計算一組位掩碼，其中每個位掩碼的每一個元素都包含一個模式。然後，它可以通過按位操作以完成大部分工作。

可以將Bitap算法作為Unix 軟體開發工具 agrep的基礎算法之一，它由Udi Manber（英語：Udi Manber）、Sun Wu和Burra Gopal開發。Udi Manber和Sun Wu的原始論文對算法進行了擴展，以處理一般正則表達式的模糊匹配。

由於算法需要的資料結構，它在小於固定長度（通常是字長）的模式上表現最佳。但是，一旦針對給定的字長「m」進行定義，則其運行時間是完全可以預測的——無論文本的結構或模式如何，它的運行時間均為O（"mn"）。

搜索精確字符串的Bitap算法在1964年由BálintDömölki發明，並由R. K. Shyamasundar在1977年進行了擴充。隨後，在1989年由Ricardo Baeza-Yates（英語：Ricardo Baeza-Yates）和Gaston Gonnet（英語：Gaston Gonnet）開發，該算法也延伸到了處理字符、通配符和不匹配字符。1991年，Manber（英語：Udi Manber）和Sun Wu對其進行了擴展，以處理插入和刪除操作（完全的模糊字符串搜索）。此算法後來在1996年由 Baeza-Yates和Navarro（英語：Gonzalo Navarro）改進。

精確搜尋

完全通用的用於搜尋精確字符串的Bitap算法偽代碼，如下所示：

algorithm bitap_search is
    input: text as a string.
           pattern as a string.
    output: string
    m := length(pattern)

    if m = 0 then
        return text

    /* 初始化位数组R */
    R := new array[m+1] of bit, initially all 0
    R[0] := 1

    for i := 0; i < length(text); i += 1 do
        /* Update the bit array. */
        for k := m; k ≥ 1; k -= 1 do
          R[k] := R[k - 1] & (text[i] = pattern[k - 1])

        if R[m] then
            return (text + i - m) + 1

    return null

如上程序所示，Bitap在自然映射到簡單的按位運算上與其他著名的字符串搜索算法不同。注意，在該實例中與多數人的直覺相反的是：值為0的每個位表示與其匹配，而值為1的每個位表示不匹配。可以使用0和1的直觀含義編寫相同的算法，但是在這種情況下，我們必須在內部循環中引入另一條指令來設置R |= 1。

為了將一般實例中的(text[i] == pattern[k-1])條件轉換為按位運算，我們需要為CHAR_MAX附加位掩碼。因此，Bitap算法在應用於查找較少字符串時性能更好。

 #include <string.h>
 #include <limits.h>
 
 const char *bitap_bitwise_search(const char *text, const char *pattern)
 {
     int m = strlen(pattern);
     unsigned long R;
     unsigned long pattern_mask[CHAR_MAX+1];
     int i;
 
     if (pattern[0] == '\0') return text;
     if (m > 31) return "The pattern is too long!";
  
     /* 初始化位数组R */
     R = ~1;
 
     /* 初始化位掩码 */
     for (i=0; i <= CHAR_MAX; ++i)
       pattern_mask[i] = ~0;
     for (i=0; i < m; ++i)
       pattern_mask[pattern[i]] &= ~(1UL << i);
 
     for (i=0; text[i] != '\0'; ++i) {
         /* 更新位数组 */
         R |= pattern_mask[text[i]];
         R <<= 1;
 
         if (0 == (R & (1UL << m)))
           return (text + i - m) + 1;
     }
 
     return NULL;
 }

模糊搜尋

為了使用Bitap算法執行模糊字符串的查找，有必要將位數組「R」擴展到第二尺度。現在，我們有了「k」個不同的數組_1..k – 數組 R_i，而不是具有隨文本長度變化的單個數組「R」。「R_i」數組包含模式前綴的表示形式，該模式的前綴與「i」或擁有更少錯誤的當前字符串的任何後綴相匹配。在這種情況下，錯誤可以是插入、刪除或替換的。有關這些操作的更多信息，請參見萊文斯坦距離。

下面的實例使用模糊Bitap算法執行模糊匹配（英語：fuzzy matching）（返回的第一個匹配值，錯誤率最高為「 k」）。但是，它僅僅關注替換，而不關注插入或刪除 – 換句話說，是[k]的漢明距離。同上一個實例，0和1的含義與它們的常規含義相反。

 #include <stdlib.h>
 #include <string.h>
 #include <limits.h>
 
 const char *bitap_fuzzy_bitwise_search(const char *text, const char *pattern, int k)
 {
     const char *result = NULL;
     int m = strlen(pattern);
     unsigned long *R;
     unsigned long pattern_mask[CHAR_MAX+1];
     int i, d;
 
     if (pattern[0] == '\0') return text;
     if (m > 31) return "The pattern is too long!";
 
     /* 初始化位数组R */
     R = malloc((k+1) * sizeof *R);
     for (i=0; i <= k; ++i)
         R[i] = ~1;
 
     /* 初始化位掩码 */
     for (i=0; i <= CHAR_MAX; ++i)
         pattern_mask[i] = ~0;
     for (i=0; i < m; ++i)
         pattern_mask[pattern[i]] &= ~(1UL << i);
 
     for (i=0; text[i] != '\0'; ++i) {
         /* 更新位数组 */
         unsigned long old_Rd1 = R[0];
 
         R[0] |= pattern_mask[text[i]];
         R[0] <<= 1;
 
         for (d=1; d <= k; ++d) {
             unsigned long tmp = R[d];
             R[d] = (old_Rd1 & (R[d] | pattern_mask[text[i]])) << 1;
             old_Rd1 = tmp;
         }
 
         if (0 == (R[k] & (1UL << m))) {
             result = (text+i - m) + 1;
             break;
         }
     }
 
     free(R);
     return result;
 }

參考文獻

^ Bálint Dömölki, An algorithm for syntactical analysis, Computational Linguistics 3, Hungarian Academy of Science pp. 29–46, 1964.
^ Bálint Dömölki, A universal compiler system based on production rules, BIT Numerical Mathematics（英語：BIT Numerical Mathematics）, 8(4), pp 262–275, 1968. doi:10.1007/BF01933436
^ R. K. Shyamasundar, Precedence parsing using Dömölki's algorithm, International Journal of Computer Mathematics（英語：International Journal of Computer Mathematics）, 6(2)pp 105–114, 1977.
^ Ricardo Baeza-Yates. "Efficient Text Searching." PhD Thesis, University of Waterloo, Canada, May 1989.
^ Udi Manber, Sun Wu. "Fast text searching with errors." Technical Report TR-91-11. Department of Computer Science, University of Arizona, Tucson, June 1991. (gzipped PostScript （頁面存檔備份，存於網際網路檔案館）)
^ Ricardo Baeza-Yates, Gastón H. Gonnet. "A New Approach to Text Searching." Communications of the ACM, 35(10): pp. 74–82, October 1992.
^ Udi Manber, Sun Wu. "Fast text search allowing errors." Communications of the ACM, 35(10): pp. 83–91, October 1992, doi:10.1145/135239.135244.
^ R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Dan Hirchsberg and Gene Myers, editors, Combinatorial Pattern Matching (CPM'96), LNCS 1075, pages 1–23, Irvine, CA, June 1996.
^ G. Myers. "A fast bit-vector algorithm for approximate string matching based on dynamic programming." Journal of the ACM 46 (3), May 1999, 395–415.
Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Modern Information Retrieval. 1999. ISBN 0-201-39829-X.

外部連結

libbitap，一個自由的實例，它展示了如何輕鬆地將算法擴展為大多數正則表達式。與上面的代碼不同，它對模式長度沒有限制。
bitap.py（頁面存檔備份，存於網際網路檔案館） - Python實現：Wu-Manber修改的Bitap算法。

閱論編字符串
String metric（英語：String metric）	字符串近似匹配 Bitap算法 Damerau–Levenshtein距離編輯距離漢明距離 Jaro–Winkler距離李距離萊文斯坦自動機萊文斯坦距離 Wagner–Fischer算法
字符串搜索算法	Apostolico–Giancarlo算法博耶-穆爾字符串搜索算法 Boyer–Moore–Horspool算法 KMP算法拉賓-卡普算法
多字符串搜索	AC自動機 Commentz-Walter算法拉賓-卡普算法
正則表達式	正則表達式引擎比較 Regular tree grammar（英語：Regular tree grammar）湯普森構造法非確定有限狀態自動機
序列比對	Hirschberg's algorithm（英語：Hirschberg's algorithm）尼德曼-翁施算法史密斯-沃特曼算法
資料結構	DAFSA（英語：Deterministic acyclic finite state automaton）後綴數組後綴自動機（英語：Suffix automaton）後綴樹 Generalized suffix tree（英語：Generalized suffix tree） Rope（英語：Rope (data structure)）三元搜索樹
其它	語法分析模式匹配 Compressed pattern matching（英語：Compressed pattern matching）最長公共子序列最長公共子串 Sequential pattern mining（英語：Sequential pattern mining）字符串排序算法（英語：:Category:String sorting algorithms）