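Multibyte String Functions

References

Multibyte character encodings and their related issues are very complicated and are beyond the scope of this documentation. For more detailed information on these topics, please refer to the following URLs and other resources.

Table of Contents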

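  • mb_check_encoding — Check if a string is valid for the specified encoding
  • mb_chr — Return the character for a Unicode code point
  • mb_convert_case — Perform case folding on a string
  • mb_convert_encoding — Convert a string from one character encoding to another
  • mb_convert_kana — Convert kana from one form to another ("zen-kaku", "han-kaku" and more)
  • mb_convert_variables — Convert the character encoding of variables
  • mb_decode_mimeheader — Decode a string in a MIME header field
  • mb_decode_numericentity — Decode HTML numeric entities to characters
  • mb_detect_encoding — Detect character encoding
  • mb_detect_order — Set or get the character encoding detection order
  • mb_encode_mimeheader — Encode a string for a MIME header
  • mb_encode_numericentity — Encode characters to HTML numeric entities
  • mb_encoding_aliases — Get the aliases of a known encoding type
  • mb_ereg — Regular expression match with multibyte support
  • mb_ereg_match — Check whether a multibyte string matches a regular expression
  • mb_ereg_replace — Replace by regular expression with multibyte support
  • mb_ereg_replace_callback — Replace by regular expression with multibyte support, using a callback function
  • mb_ereg_search — Check whether the predefined multibyte string matches a regular expression
  • mb_ereg_search_getpos — Get the start position of the next regular expression search
  • mb_ereg_search_getregs — Retrieve the result of the last multibyte regular expression match
  • mb_ereg_search_init — Set up the string and regular expression for a multibyte regular expression search
  • mb_ereg_search_pos — Return the position and length of the part of the predefined multibyte string that matches a regular expression
  • mb_ereg_search_regs — Return the part of the predefined multibyte string that matches a regular expression
  • mb_ereg_search_setpos — Set the start position of the next regular expression search
  • mb_eregi — Case-insensitive regular expression match with multibyte support
  • mb_eregi_replace — Case-insensitive replace by regular expression with multibyte support
  • mb_get_info — Get the internal settings of mbstring
  • mb_http_input — Detect the HTTP input character encoding
  • mb_http_output — Set or get the HTTP output character encoding
  • mb_internal_encoding — Set or get the internal character encoding
  • mb_language — Set or get the current language
  • mb_list_encodings — Return an array of all supported encodings
  • mb_ord — Get the Unicode code point of a character
  • mb_output_handler — Callback function that converts the character encoding in the output buffer
  • mb_parse_str — Parse GET/POST/COOKIE data and set global variables
  • mb_preferred_mime_name — Get the MIME charset as a string
  • mb_regex_encoding — Set or get the character encoding for multibyte regular expressions
  • mb_regex_set_options — Set or get the default options for multibyte regular expression functions
  • mb_scrub — Replace ill-formed byte sequences in a string with the substitute character
  • mb_send_mail — Send mail with encoding conversion
  • mb_split — Split a multibyte string using a regular expression
  • mb_str_pad — Pad a multibyte string to a certain length with another multibyte string
  • mb_str_split — Take a multibyte string and return an array of its characters
  • mb_strcut — Get part of a string
  • mb_strimwidth — Truncate a string to the specified width
  • mb_stripos — Find the position of the first occurrence of a string within another, case-insensitive
  • mb_stristr — Find the first occurrence of a string within another, case-insensitive
  • mb_strlen — Get the length of a string
  • mb_strpos — Find the position of the first occurrence of a string within another
  • mb_strrchr — Find the last occurrence of a character within another string
  • mb_strrichr — Find the last occurrence of a character within another string, case-insensitive
  • mb_strripos — Find the position of the last occurrence of a string within another, case-insensitive
  • mb_strrpos — Find the position of the last occurrence of a string within another
  • mb_strstr — Find the first occurrence of a string within another
  • mb_strtolower — Make a string lowercase
  • mb_strtoupper — Make a string uppercase
  • mb_strwidth — Return the width of a string
  • mb_substitute_character — Set or get the substitution character
  • mb_substr — Get part of a string
  • mb_substr_count — Count the number of substring occurrences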
User Contributed Notes 35 notes

deceze at gmail dot com
12 years ago
Please note that all the discussion about mb_str_replace in the comments is pretty pointless. str_replace works just fine with multibyte strings:

<?php

$string = '漢字はユニコード';
$needle = 'は';
$replace = 'Foo';

echo str_replace($needle, $replace, $string);
// outputs: 漢字Fooユニコード

?>

The usual problem is that the string is evaluated as a binary string, meaning PHP is not aware of encodings at all. Problems arise if you get a value "from outside" (a database, a POST request) and the needle and the haystack are not in the same encoding. That typically means the source code is not saved in the same encoding as the data you receive "from outside". The binary representations then don't match and nothing is replaced.
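
A minimal sketch of that mismatch, assuming the mbstring extension is loaded and this source file is saved as UTF-8. The Shift_JIS haystack is simulated with mb_convert_encoding(), and the variable names are illustrative only:

```php
<?php
// Simulate a value arriving "from outside" in Shift_JIS.
$utf8Text = '漢字はユニコード';
$sjisHaystack = mb_convert_encoding($utf8Text, 'SJIS', 'UTF-8');

// The needle comes from this source file, which is assumed to be UTF-8.
$needle = 'は';

// The byte sequences differ, so str_replace() finds nothing to replace.
$unchanged = str_replace($needle, 'Foo', $sjisHaystack);

// Converting the haystack to the needle's encoding first makes the bytes comparable.
$fixed = str_replace($needle, 'Foo', mb_convert_encoding($sjisHaystack, 'UTF-8', 'SJIS'));

var_dump($unchanged === $sjisHaystack); // bool(true)
echo $fixed, "\n";                      // 漢字Fooユニコード
```

Which side to convert is a design choice; converting incoming data once at the boundary usually keeps the rest of the code free of encoding checks.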
Eugene Murai
19 years ago
PHP can input and output Unicode, but in a slightly different sense from Microsoft's: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with a BOM (FF FE = chr(255).chr(254)), whereas PHP's "UTF-16" means big-endian. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel directly. Solving this problem is quite simple: just put the BOM in front of a UTF-16LE string.

Example:

$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');
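
The same idea as a small helper, with the resulting bytes dumped so the BOM is visible. This is only a sketch: the function name utf16le_with_bom() and the sample row are made up for illustration, and mbstring is assumed to be loaded.

```php
<?php
// Hypothetical helper: prefix the little-endian BOM (FF FE) to UTF-16LE text.
function utf16le_with_bom(string $utf8): string
{
    return "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}

// One sample row with two cells separated by a tab.
$bytes = utf16le_with_bom("A\tB\n");

// BOM first, then each character as two bytes, low byte first.
echo bin2hex($bytes), "\n"; // fffe4100090042000a00
```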
mdoocy at u dot washington dot edu
17 years ago
Note that some of the multi-byte functions run in O(n) time, rather than constant time as is the case for their single-byte equivalents. This includes any functionality requiring access at a specific index, since random access is not possible in a string whose number of bytes will not necessarily match the number of characters. Affected functions include: mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
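
One way to limit that cost when walking a string character by character is to split it once instead of calling mb_substr() per index. A rough sketch, assuming PHP 7.4 or later for mb_str_split(); the sample text and variable names are illustrative, and actual timings depend on the PHP version and encoding:

```php
<?php
$text = str_repeat('漢字はユニコード', 100);

// Index-based access: each mb_substr() call may have to scan from the start.
$slow = [];
$length = mb_strlen($text, 'UTF-8');
for ($i = 0; $i < $length; $i++) {
    $slow[] = mb_substr($text, $i, 1, 'UTF-8');
}

// Single pass: split once, then use ordinary array indexing.
$fast = mb_str_split($text, 1, 'UTF-8');

var_dump($slow === $fast); // bool(true)
```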
treilor at gmail dot com
10 years ago
A small note for those who follow rawsrc at gmail dot com's advice: mb_split uses regular expressions, so it may make sense to use the built-in function mb_ereg_replace instead.
Anonymous
10 years ago
Yet another single-line mb_trim() function

<?php
// Note: PHP 8.4 and later ship a native mb_trim(), so rename this function there.
function mb_trim($string, $trim_chars = '\s'){
    return preg_replace('/^['.$trim_chars.']*(?U)(.*)['.$trim_chars.']*$/u', '\\1',$string);
}
$string = ' "some text." ';
echo mb_trim($string, '\s".');
//some text
?>
mattr at telebody dot com
10 years ago
A brief note on Daniel Rhodes' mb_punctuation_trim().
The regular expression modifier u does not mean ungreedy, rather it means the pattern is in UTF-8 encoding. Instead the U modifier should be used to get ungreedy behavior. (I have not otherwise tested his code.)
See http://php.net/manual/en/reference.pcre.pattern.modifiers.php
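
A short sketch of the difference between the two modifiers, assuming this source file is saved as UTF-8 (the sample strings are illustrative):

```php
<?php
// /u: treat pattern and subject as UTF-8, so "." matches one character, not one byte.
$withoutU = preg_match('/^.$/', 'は');   // 0: the subject is three bytes long
$withU    = preg_match('/^.$/u', 'は');  // 1: one UTF-8 character

// /U: invert greediness, so ".+" stops at the first possible match.
preg_match('/<.+>/', '<a><b>', $greedy);
preg_match('/<.+>/U', '<a><b>', $ungreedy);

echo $greedy[0], "\n";   // <a><b>
echo $ungreedy[0], "\n"; // <a>
```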
Hayley Watson
6 years ago
SOME multibyte encodings can safely be used in str_replace() and the like, others cannot. It's not enough to ensure that all the strings involved use the same encoding: obviously they have to, but it's not enough. It has to be the right sort of encoding.

UTF-8 is one of the safe ones, because it was designed to be unambiguous about where each encoded character begins and ends in the string of bytes that makes up the encoded text. Some encodings are not safe: the last bytes of one character in a text followed by the first bytes of the next character may together make a valid character. str_replace() knows nothing about "characters", "character encodings" or "encoded text". It only knows about the string of bytes. To str_replace(), two adjacent characters with two-byte encodings just look like a sequence of four bytes, and it's not going to know it shouldn't try to match the middle two bytes.

While real-world examples can be found of str_replace() mangling text, it can be illustrated by using the HTML-ENTITIES encoding. It's not one of the safe ones. All of the strings being passed to str_replace() are valid HTML-ENTITIES-encoded text so the "all inputs use the same encoding" rule is satisfied.

The text is "x<y". It is represented by the byte string [78 26 6c 74 3b 79]. Note that the text has three characters, but the string has six bytes.

<?php

$string = 'x&lt;y';
mb_internal_encoding('HTML-ENTITIES');

echo "Text length: ", mb_strlen($string), "\tString length: ", strlen($string), " ... ", $string, "\n";
// Three characters, six bytes; the text reads "x<y".

$newstring = str_replace('l', 'g', $string);
echo "Text length: ", mb_strlen($newstring), "\tString length: ", strlen($newstring), " ... ", $newstring, "\n";
// Three characters, six bytes, but now the text reads "x>y"; the wrong characters have changed.

$newstring = str_replace(';', ':', $string);
echo "Text length: ", mb_strlen($newstring), "\tString length: ", strlen($newstring), " ... ", $newstring, "\n";
// Now even the length of the text is wrong and the text is trashed.

?>

Even though neither 'l' nor ';' appears in the text "x<y", str_replace() still found and changed bytes. In one case, it changed the text to "x>y" and in the other it broke the encoding completely.

One more reason to use UTF-8 if you can, I guess.
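
For a case closer to the real-world ones mentioned above, Shift_JIS is a commonly cited example: the second byte of 表 is 0x5C, the same byte as an ASCII backslash. A minimal sketch, assuming mbstring is loaded and this source file is saved as UTF-8 (the variable names are illustrative):

```php
<?php
$utf8 = '表示';
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

// In Shift_JIS the second byte of 表 is 0x5C, the same byte as an ASCII backslash.
var_dump($sjis[1] === '\\'); // bool(true)

// Byte-level replacement: there is no backslash in the text, yet a byte is changed.
$mangled = str_replace('\\', '/', $sjis);
var_dump($mangled === $sjis); // bool(false)

// Doing the replacement in UTF-8 and converting afterwards leaves the text intact.
$safe = mb_convert_encoding(str_replace('\\', '/', $utf8), 'SJIS', 'UTF-8');
var_dump($safe === $sjis); // bool(true)
```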
mitgath at gmail dot com
15 years ago
According to
http://bugs.php.net/bug.php?id=21317
here's the missing function:

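<?php
// Note: PHP 8.3 and later ship a native mb_str_pad(), so rename this function there.
function mb_str_pad ($input, $pad_length, $pad_string, $pad_style, $encoding="UTF-8") {
    // str_pad() counts bytes, so widen the target length by the
    // difference between the byte length and the character length.
    return str_pad($input,
        strlen($input)-mb_strlen($input,$encoding)+$pad_length, $pad_string, $pad_style);
}
?>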
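
A usage sketch of the same byte-offset trick under a hypothetical name, mb_str_pad_compat(), chosen so that it cannot clash with a native mb_str_pad(). It assumes a single-byte pad string; with a multibyte pad string, str_pad() would count that string's bytes and the result would come out too short.

```php
<?php
// Hypothetical name, chosen so it cannot clash with a native mb_str_pad().
function mb_str_pad_compat(
    string $input,
    int $padLength,
    string $padString = ' ',
    int $padStyle = STR_PAD_RIGHT,
    string $encoding = 'UTF-8'
): string {
    // str_pad() counts bytes, so widen the target by the byte/character difference.
    $byteLength = strlen($input) - mb_strlen($input, $encoding) + $padLength;
    return str_pad($input, $byteLength, $padString, $padStyle);
}

echo mb_str_pad_compat('日本', 5, '*'), "\n";               // 日本***
echo mb_str_pad_compat('日本', 5, '*', STR_PAD_LEFT), "\n"; // ***日本
```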
roydukkey at roydukkey dot com
14 years ago
This would be one way to create a multibyte substr_replace function

<?php
function mb_substr_replace($output, $replace, $posOpen, $posClose) {
    return mb_substr($output, 0, $posOpen).$replace.mb_substr($output, $posClose+1);
}
?>
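
A usage sketch for this function, assuming the internal encoding is UTF-8 (set explicitly here). Unlike the native substr_replace(), which takes an offset and a length, this version takes the indexes of the first and last characters to replace, both inclusive:

```php
<?php
mb_internal_encoding('UTF-8');

// Same logic as the function above; $posClose is the index of the last character replaced.
function mb_substr_replace($output, $replace, $posOpen, $posClose)
{
    return mb_substr($output, 0, $posOpen) . $replace . mb_substr($output, $posClose + 1);
}

// Replace only the character at index 2.
echo mb_substr_replace('漢字はユニコード', 'Foo', 2, 2), "\n"; // 漢字Fooユニコード

// Remove the characters at indexes 3 through 7.
echo mb_substr_replace('漢字はユニコード', '', 3, 7), "\n"; // 漢字は
```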
Ben XO
15 years ago
PHP5 has no mb_trim(), so here's one I made. It works just like trim(), but with the added bonus of PCRE character classes (including, of course, all the useful Unicode ones such as \pZ).

Unlike other approaches that I've seen to this problem, I wanted to emulate the full functionality of trim() - in particular, the ability to customise the character list.

<?php
/**
 * Trim characters from either (or both) ends of a string in a way that is
 * multibyte-friendly.
 *
 * Mostly, this behaves exactly like trim() would: for example supplying 'abc' as
 * the charlist will trim all 'a', 'b' and 'c' chars from the string, with, of
 * course, the added bonus that you can put unicode characters in the charlist.
 *
 * We are using a PCRE character-class to do the trimming in a unicode-aware
 * way, so we must escape ^, \, - and ] which have special meanings here.
 * As you would expect, a single \ in the charlist is interpreted as
 * "trim backslashes" (and duly escaped into a double-\ ). Under most circumstances
 * you can ignore this detail.
 *
 * As a bonus, however, we also allow PCRE special character-classes (such as '\s')
 * because they can be extremely useful when dealing with UCS. '\pZ', for example,
 * matches every 'separator' character defined in Unicode, including non-breaking
 * and zero-width spaces.
 *
 * It doesn't make sense to have two or more of the same character in a character
 * class, therefore we interpret a double \ in the character list to mean a
 * single \ in the regex, allowing you to safely mix normal characters with PCRE
 * special classes.
 *
 * *Be careful* when using this bonus feature, as PHP also interprets backslashes
 * as escape characters before they are even seen by the regex. Therefore, to
 * specify '\\s' in the regex (which will be converted to the special character
 * class '\s' for trimming), you will usually have to put *4* backslashes in the
 * PHP code - as you can see from the default value of $charlist.
 *
 * @param string
 * @param charlist list of characters to remove from the ends of this string.
 * @param boolean trim the left?
 * @param boolean trim the right?
 * @return String
 */
function mb_trim($string, $charlist='\\\\s', $ltrim=true, $rtrim=true)
{
    // Note: PHP 8.4 and later ship a native mb_trim(), so rename this function there.
    $both_ends = $ltrim && $rtrim;

    // Escape ^, -, ] and \ for use inside a character class, then collapse
    // each run of four backslashes to one so PCRE classes such as \s survive.
    $char_class_inner = preg_replace(
        array( '/[\^\-\]\\\]/S', '/\\\{4}/S' ),
        array( '\\\\\\0', '\\' ),
        $charlist
    );

    $work_horse = '[' . $char_class_inner . ']+';
    $ltrim && $left_pattern = '^' . $work_horse;
    $rtrim && $right_pattern = $work_horse . '$';

    if($both_ends)
    {
        $pattern_middle = $left_pattern . '|' . $right_pattern;
    }
    elseif($ltrim)
    {
        $pattern_middle = $left_pattern;
    }
    else
    {
        $pattern_middle = $right_pattern;
    }

    return preg_replace("/$pattern_middle/usSD", '', $string);
}
?>
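
For the common case of stripping Unicode separators from both ends, a much smaller sketch of the same character-class idea may be enough. The helper name trim_unicode_separators() is hypothetical and is not the function above; it assumes UTF-8 input and has no configurable character list:

```php
<?php
// Hypothetical helper: strips Unicode separators (\pZ) and whitespace (\s)
// from both ends of a UTF-8 string.
function trim_unicode_separators(string $text): string
{
    // /u treats pattern and subject as UTF-8; /D makes $ match only at the very end.
    return preg_replace('/^[\pZ\s]+|[\pZ\s]+$/uD', '', $text);
}

// Ideographic space (U+3000) and no-break space (U+00A0) around the text.
$padded = "\u{3000}\u{00A0} テキスト \u{3000}";
var_dump(trim_unicode_separators($padded) === 'テキスト'); // bool(true)
```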