An Efficient Uniform-Cost Normalized Edit Distance Algorithm

Report ID

1999-14

Report Authors

Abdullah N. Arslan and Omer Egecioglu

Report Date

1999-04-01

Abstract

A common model for computing the similarity of two strings X and Y of lengthsm, and n respectively with m >= n, is to transform X into Y through a sequenceof edit operations which are of three types: insertion, deletion, andsubstitution of symbols. The model assumes a given weight function whichassigns a non-negative real cost to each of these edit operations. Theamortized weight for a given edit sequence is the ratio of the total weight ofthe sequence to its length, and the minimum of this ratio over all editsequences is the normalized edit distance between X and Y. Experimentalresults suggest that for some applications normalized edit distance is betteras a similarity measure than ordinary edit distance, which ignores the lengthof the sequences. Existing algorithms for the computation of normalized editdistance with provable bounds on the complexity require O(mn^2) operations inthe worst-case. In this paper we develop an O(mnlogn) worst-case algorithm forthe normalized edit distance problem for uniform weights. In this case theweight of each edit operation is constant within the same type, exceptsubstitutions can have different costs depending on whether they are matchingor non-matching.

Document

1999-14.ps242.74 KB