PHP TOP

levenshtein

(PHP 4 >= 4.0.1, PHP 5, PHP 7, PHP 8)

levenshtein — 二つの文字列のレーベンシュタイン距離を計算する

説明

levenshtein(
    string $string1,
    string $string2,
    int $insertion_cost = 1,
    int $replacement_cost = 1,
    int $deletion_cost = 1
): int

レーベンシュタイン距離は、string1 を string2 に変換するために置換、挿入、削除しなければならない最小の文字数として定義されます。アルゴリズムの計算量は、 O(m*n) です。ここで、n および m はそれぞれ string1 および string2 の長さです。 O(max(n,m)**3) となる similar_text() よりは良いですが、まだかなりの計算量です)。

insertion_cost, replacement_cost かつ/または deletion_cost が 1 以外の場合、変換コストが最も小さいアルゴリズムを採用します。たとえば、$insertion_cost + $deletion_cost < $replacement_cost の場合、置換をせず、挿入と削除が行われます。

パラメータ

string1: レーベンシュタイン距離を計算する文字列のひとつ。
string2: レーベンシュタイン距離を計算する文字列のひとつ。
insertion_cost: 挿入のコストを定義します。
replacement_cost: 置換のコストを定義します。
deletion_cost: 削除のコストを定義します。

戻り値

この関数は、引数で指定した二つの文字列のレーベンシュタイン距離を返します。

変更履歴

バージョン	説明
8.0.0	これより前のバージョンでは、引数を2個、または5個指定して呼び出さなければなりませんでした。
8.0.0	これより前のバージョンでは、引数文字列の一つが 255 文字の制限より長い場合に `-1` を返していました。

例

例1 levenshtein() の例

<?php
// スペルミスした単語を入力します
$input = 'carrrot';

// チェックするための単語の配列
$words  = array('apple','pineapple','banana','orange',
                'radish','carrot','pea','bean','potato');

// まだ最短距離は見つかっていません
$shortest = -1;

// 最短距離を見つけるため単語をループします
foreach ($words as $word) {

    // 入力した単語と現在の単語の距離を
    // 計算します
    $lev = levenshtein($input, $word);

    // マッチするかどうかチェックします
    if ($lev == 0) {

        // 最短な単語はこれだ (マッチした)
        $closest = $word;
        $shortest = 0;

        // ループを抜ける; マッチしたものを見つけました
        break;
    }

    // もし距離が次に見つけた最短距離よりも短い場合、
    // もしくは次の最短の単語がまだ見つかっていない場合
    if ($lev <= $shortest || $shortest < 0) {
        // 最短のマッチと最短距離をセットします
        $closest  = $word;
        $shortest = $lev;
    }
}

echo "入力した単語: $input\n";
if ($shortest == 0) {
    echo "一致するものが見つかりました: $closest\n";
} else {
    echo "もしかして: $closest\n";
}

?>

上の例の出力は以下となります。

入力した単語: carrrot
もしかして: carrot

参考

soundex() - 文字列の soundex キーを計算する
similar_text() - 二つの文字列の間の類似性を計算する
metaphone() - 文字列の metaphone キーを計算する

Found A Problem?

Learn How To Improve This Page • Submit a Pull Request • Report a Bug

＋add a note

User Contributed Notes 27 notes

down

luciole75w at no dot spam dot gmail dot com ¶

11 years ago

The levenshtein function processes each byte of the input string individually. Then for multibyte encodings, such as UTF-8, it may give misleading results.

Example with a french accented word :
- levenshtein('notre', 'votre') = 1
- levenshtein('notre', 'nôtre') = 2 (huh ?!)

You can easily find a multibyte compliant PHP implementation of the levenshtein function but it will be of course much slower than the C implementation.

Another option is to convert the strings to a single-byte (lossless) encoding so that they can feed the fast core levenshtein function.

Here is the conversion function I used with a search engine storing UTF-8 strings, and a quick benchmark. I hope it will help.

<?php
// Convert an UTF-8 encoded string to a single-byte string suitable for
// functions such as levenshtein.
// 
// The function simply uses (and updates) a tailored dynamic encoding
// (in/out map parameter) where non-ascii characters are remapped to
// the range [128-255] in order of appearance.
//
// Thus it supports up to 128 different multibyte code points max over
// the whole set of strings sharing this encoding.
//
function utf8_to_extended_ascii($str, &$map)
{
    // find all multibyte characters (cf. utf-8 encoding specs)
    $matches = array();
    if (!preg_match_all('/[\xC0-\xF7][\x80-\xBF]+/', $str, $matches))
        return $str; // plain ascii string
    
    // update the encoding map with the characters not already met
    foreach ($matches[0] as $mbc)
        if (!isset($map[$mbc]))
            $map[$mbc] = chr(128 + count($map));
    
    // finally remap non-ascii characters
    return strtr($str, $map);
}

// Didactic example showing the usage of the previous conversion function but,
// for better performance, in a real application with a single input string
// matched against many strings from a database, you will probably want to
// pre-encode the input only once.
//
function levenshtein_utf8($s1, $s2)
{
    $charMap = array();
    $s1 = utf8_to_extended_ascii($s1, $charMap);
    $s2 = utf8_to_extended_ascii($s2, $charMap);
    
    return levenshtein($s1, $s2);
}
?>

Results (for about 6000 calls)
- reference time core C function (single-byte) : 30 ms
- utf8 to ext-ascii conversion + core function : 90 ms
- full php implementation : 3000 ms

down

paulrowe at iname dot com ¶

16 years ago

[EDITOR'S NOTE: original post and 2 corrections combined into 1 -- mgf]



Here is an implementation of the Levenshtein Distance calculation that only uses a one-dimensional array and doesn't have a limit to the string length. This implementation was inspired by maze generation algorithms that also use only one-dimensional arrays.



I have tested this function with two 532-character strings and it completed in 0.6-0.8 seconds. 



<?php

/*

* This function starts out with several checks in an attempt to save time.

*   1.  The shorter string is always used as the "right-hand" string (as the size of the array is based on its length).  

*   2.  If the left string is empty, the length of the right is returned.

*   3.  If the right string is empty, the length of the left is returned.

*   4.  If the strings are equal, a zero-distance is returned.

*   5.  If the left string is contained within the right string, the difference in length is returned.

*   6.  If the right string is contained within the left string, the difference in length is returned.

* If none of the above conditions were met, the Levenshtein algorithm is used.

*/

function LevenshteinDistance($s1, $s2)

{

  $sLeft = (strlen($s1) > strlen($s2)) ? $s1 : $s2;

  $sRight = (strlen($s1) > strlen($s2)) ? $s2 : $s1;

  $nLeftLength = strlen($sLeft);

  $nRightLength = strlen($sRight);

  if ($nLeftLength == 0)

    return $nRightLength;

  else if ($nRightLength == 0)

    return $nLeftLength;

  else if ($sLeft === $sRight)

    return 0;

  else if (($nLeftLength < $nRightLength) && (strpos($sRight, $sLeft) !== FALSE))

    return $nRightLength - $nLeftLength;

  else if (($nRightLength < $nLeftLength) && (strpos($sLeft, $sRight) !== FALSE))

    return $nLeftLength - $nRightLength;

  else {

    $nsDistance = range(1, $nRightLength + 1);

    for ($nLeftPos = 1; $nLeftPos <= $nLeftLength; ++$nLeftPos)

    {

      $cLeft = $sLeft[$nLeftPos - 1];

      $nDiagonal = $nLeftPos - 1;

      $nsDistance[0] = $nLeftPos;

      for ($nRightPos = 1; $nRightPos <= $nRightLength; ++$nRightPos)

      {

        $cRight = $sRight[$nRightPos - 1];

        $nCost = ($cRight == $cLeft) ? 0 : 1;

        $nNewDiagonal = $nsDistance[$nRightPos];

        $nsDistance[$nRightPos] = 

          min($nsDistance[$nRightPos] + 1, 

              $nsDistance[$nRightPos - 1] + 1, 

              $nDiagonal + $nCost);

        $nDiagonal = $nNewDiagonal;

      }

    }

    return $nsDistance[$nRightLength];

  }

}

?>

down

Johan Gennesson php at genjo dot fr ¶

8 years ago

Please, be aware that:

<?php
// Levenshtein Apostrophe (U+0027 &#39;) and Right Single Quotation Mark (U+2019 &#8217;)
echo levenshtein("'", "’");
?>

will output 3!

down

engineglue at gmail dot com ¶

12 years ago

I really like [the manual's] example for the use of the levenshtein function to match against an array. I ran into the need to specify the sensitivity of the result. There are circumstances when you want it to return false if the match is way out of line. I wouldn't want "marry had a little lamb" to match with "saw viii" simply because it was the best match in the array. Hence the need for sensitivity:



<?php

    function wordMatch($words, $input, $sensitivity){

        $shortest = -1;

        foreach ($words as $word) {

            $lev = levenshtein($input, $word);

            if ($lev == 0) {

                $closest = $word;

                $shortest = 0;

                break;

            }

            if ($lev <= $shortest || $shortest < 0) {

                $closest  = $word;

                $shortest = $lev;

            }

        }

        if($shortest <= $sensitivity){

            return $closest;

        } else {

            return 0;

        }

    }



    $word = 'PINEEEEAPPLE';



    $words  = array('apple','pineapple','banana','orange',

                    'radish','carrot','pea','bean','potato');

                    

    echo wordMatch($words, strtolower($word), 2);

?>

down

fgilles at free dot fr ¶

23 years ago

Exempla of use for a forum: users can't post messages too much uppercased



<?php

if ((strlen($subject)>10) and ( ( levenshtein ($subject, strtolower ($subject) / strlen ($subject) ) > .3 ) ){

    $subject = strtolower($subject);

}

?>

down

dale3h ¶

16 years ago

Using PHP's example along with Patrick's comparison percentage function, I have come up with a function that returns the closest word from an array, and assigns the percentage to a referenced variable:

<?php
  function closest_word($input, $words, &$percent = null) {
    $shortest = -1;
    foreach ($words as $word) {
      $lev = levenshtein($input, $word);

      if ($lev == 0) {
        $closest = $word;
        $shortest = 0;
        break;
      }

      if ($lev <= $shortest || $shortest < 0) {
        $closest  = $word;
        $shortest = $lev;
      }
    }

    $percent = 1 - levenshtein($input, $closest) / max(strlen($input), strlen($closest));

    return $closest;
  }
?>

Usage:
<?php
  $input = 'carrrot';
  $words = array('apple','pineapple','banana','orange',
                 'radish','carrot','pea','bean','potato');

  $percent = null;
  $found = closest_word($input, $words, $percent);

  printf('Closest word to "%s": %s (%s%% match)', $input, $found, round($percent * 100, 2));
?>

I found that lowercasing the array prior to comparing yields a better comparison when the case is not of importance, for example: comparing a user-inputted category to a list of existing categories.

I also found that when the percentage was above 75%, it was usually the match that I was looking for.

down

june05 at tilo-hauke dot de ¶

19 years ago

//levenshtein for arrays
function array_levenshtein($array1,$array2)
         {   $aliases= array_flip(array_values(array_unique(array_merge($array1,$array2))));
             if(count($aliases)>255) return -1;
             $stringA=''; $stringB='';
             foreach($array1 as $entry) $stringA.=chr($aliases[$entry]);
             foreach($array2 as $entry) $stringB.=chr($aliases[$entry]);
             return levenshtein($stringA,$stringB);
         }
         
// e.g. use array_levenshtein to detect special expressions in unser-inputs

echo array_levenshtein(split(" ", "my name is xxx"), split(" ","my name is levenshtein")); 

//output: 1

down

yhoko at yhoko dot com ¶

8 years ago

Note that this function might cause problems when working with multibyte charactes like in UTF-8. Example:

<?php
print( similar_text( 'hä', 'hà' ) ); // Returns 2 where only 1 character matches
?>

down

dschultz at protonic dot com ¶

24 years ago

It's also useful if you want to make some sort of registration page and you want to make sure that people who register don't pick usernames that are very similar to their passwords.

down

carey at NOSPAM dot internode dot net dot au ¶

18 years ago

I have found that levenshtein is actually case-sensitive (in PHP 4.4.2 at least).

<?php
$distance=levenshtein('hello','ELLO');
echo "$distance";
?>

Outputs: "5", instead of "1". If you are implementing a fuzzy search feature that makes use of levenshtein, you will probably need to find a way to work around this.

down

"inerte" is my hotmail.com username ¶

21 years ago

I am using this function to avoid duplicate information on my client's database.

After retrieving a series of rows and assigning the results to an array values, I loop it with foreach comparing its levenshtein() with the user supplied string.

It helps to avoid people re-registering "John Smith", "Jon Smith" or "Jon Smit".

Of course, I can't block the operation if the user really wants to, but a suggestion is displayed along the lines of: "There's a similar client with this name.", followed by the list of the similar strings.

down

justin at visunet dot ie ¶

19 years ago

<?php

    /*********************************************************************
    * The below func, btlfsa, (better than levenstien for spelling apps)
    * produces better results when comparing words like haert against
    * haart and heart.
    *
    * For example here is the output of levenshtein compared to btlfsa
    * when comparing 'haert' to 'herat, haart, heart, harte'
    *
    * btlfsa('haert','herat'); output is.. 3
    * btlfsa('haert','haart'); output is.. 3
    * btlfsa('haert','harte'); output is.. 3
    * btlfsa('haert','heart'); output is.. 2
    *
    * levenshtein('haert','herat'); output is.. 2
    * levenshtein('haert','haart'); output is.. 1
    * levenshtein('haert','harte'); output is.. 2
    * levenshtein('haert','heart'); output is.. 2
    *
    * In other words, if you used levenshtein, 'haart' would be the
    * closest match to 'haert'. Where as, btlfsa sees that it should be
    * 'heart'
    */

    function btlfsa($word1,$word2)
    {
        $score = 0;

        // For each char that is different add 2 to the score
        // as this is a BIG difference

        $remainder  = preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word1)."]/i",'',$word2);
        $remainder .= preg_replace("/[".preg_replace("/[^A-Za-z0-9\']/",' ',$word2)."]/i",'',$word1);
        $score      = strlen($remainder)*2;

        // Take the difference in string length and add it to the score
        $w1_len  = strlen($word1);
        $w2_len  = strlen($word2);
        $score  += $w1_len > $w2_len ? $w1_len - $w2_len : $w2_len - $w1_len;

        // Calculate how many letters are in different locations
        // And add it to the score i.e.
        //
        // h e a r t
        // 1 2 3 4 5
        //
        // h a e r t     a e        = 2
        // 1 2 3 4 5   1 2 3 4 5
        //

        $w1 = $w1_len > $w2_len ? $word1 : $word2;
        $w2 = $w1_len > $w2_len ? $word2 : $word1;

        for($i=0; $i < strlen($w1); $i++)
        {
            if ( !isset($w2[$i]) || $w1[$i] != $w2[$i] )
            {
                $score++;
            }
        }

        return $score;
    }

    // *************************************************************
    // Here is a full code example showing the difference

    $misspelled = 'haert';

    // Imagine that these are sample suggestions thrown back by soundex or metaphone..
    $suggestions = array('herat', 'haart', 'heart', 'harte');

    // Firstly order an array based on levenshtein
    $levenshtein_ordered = array();
    foreach ( $suggestions as $suggestion )
    {
        $levenshtein_ordered[$suggestion] = levenshtein($misspelled,$suggestion);
    }
    asort($levenshtein_ordered, SORT_NUMERIC );

    print "<b>Suggestions as ordered by levenshtein...</b><ul><pre>";
    print_r($levenshtein_ordered);
    print "</pre></ul>";

    // Secondly order an array based on btlfsa
    $btlfsa_ordered = array();
    foreach ( $suggestions as $suggestion )
    {
        $btlfsa_ordered[$suggestion] = btlfsa($misspelled,$suggestion);
    }
    asort($btlfsa_ordered, SORT_NUMERIC );

    print "<b>Suggestions as ordered by btlfsa...</b><ul><pre>";
    print_r($btlfsa_ordered);
    print "</pre></ul>";

?>

down

atx dot antrax at gmail dot com ¶

16 years ago

I have made a function that removes the length-limit of levenshtein function and ajust the result with similar_text:

<?php
function _similar($str1, $str2) {
    $strlen1=strlen($str1);
    $strlen2=strlen($str2);
    $max=max($strlen1, $strlen2);

    $splitSize=250;
    if($max>$splitSize)
    {
        $lev=0;
        for($cont=0;$cont<$max;$cont+=$splitSize)
        {
            if($strlen1<=$cont || $strlen2<=$cont)
            {
                $lev=$lev/($max/min($strlen1,$strlen2));
                break;
            }
            $lev+=levenshtein(substr($str1,$cont,$splitSize), substr($str2,$cont,$splitSize));
        }
    }
    else
    $lev=levenshtein($str1, $str2);

    $porcentage= -100*$lev/$max+100;
    if($porcentage>75)//Ajustar con similar_text
    similar_text($str1,$str2,$porcentage);

    return $porcentage;
}
?>

down

WiLDRAGoN ¶

9 years ago

Some small changes allow you to calculate multiple words.

<?php

$input = array();
$dictionary  = array();
foreach ($input as $output) {
    $shortest = -1;
    foreach ($dictionary as $word) {
        $lev = levenshtein($output, $word);
        if ($lev == 0) {
            $closest = $word;
            $shortest = 0;
        }
        if ($lev <= $shortest || $shortest < 0) {
            $closest  = $word;
            $shortest = $lev;
        }
    }
    echo "Input word: $output\n";
    if ($shortest == 0) {
        echo "Exact match found: $closest\n";
    } else {
        echo "Did you mean: $closest?\n";
    }
}

?>

down

Chaim Chaikin ¶

13 years ago

As regards to Example #1 above, would it not be more efficient to first use a simple php == comparison to check if the strings are equal even before testing the word with levenshtein().

Something like this:

<?php
// input misspelled word
$input = 'carrrot';

// array of words to check against
$words  = array('apple','pineapple','banana','orange',
                'radish','carrot','pea','bean','potato');

// no shortest distance found, yet
$shortest = -1;

// loop through words to find the closest
foreach ($words as $word) {

    // check for an exact match
    if ($input == $word) {

        // closest word is this one (exact match)
        $closest = $word;
        $shortest = 0;

        // break out of the loop; we've found an exact match
        break;
    }

    // calculate the distance between the input word,
    // and the current word
    $lev = levenshtein($input, $word);

    // if this distance is less than the next found shortest
    // distance, OR if a next shortest word has not yet been found
    if ($lev <= $shortest || $shortest < 0) {
        // set the closest match, and shortest distance
        $closest  = $word;
        $shortest = $lev;
    }
}

echo "Input word: $input\n";
if ($shortest == 0) {
    echo "Exact match found: $closest\n";
} else {
    echo "Did you mean: $closest?\n";
}

?>

down

Atelier ”慶”

WEBクリエイター歴２０年、最先端技術と安心をお届けします。

PHPのお勉強！

levenshtein

説明

パラメータ

戻り値

変更履歴

例

参考

Found A Problem?

User Contributed Notes 27 notes