utf8_encode
(PHP 4, PHP 5, PHP 7, PHP 8)
utf8_encode — Converts a string from ISO-8859-1 to UTF-8
This function has been DEPRECATED as of PHP 8.2.0. Relying on this function is highly discouraged.
Description
This function converts the string string from the ISO-8859-1 encoding to UTF-8.
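The mapping is simple because every ISO-8859-1 byte value equals the Unicode code point of the character it represents. As a rough illustration only, the conversion can be sketched byte by byte. The helper name latin1_to_utf8_manual() is hypothetical and not part of PHP, and this is not necessarily how the engine implements it:

```php
<?php
// Hypothetical helper, for illustration only: converts ISO-8859-1 bytes
// to UTF-8 by hand. Bytes 0x00-0x7F are copied unchanged; bytes 0x80-0xFF
// become a two-byte UTF-8 sequence.
function latin1_to_utf8_manual(string $input): string
{
    $output = '';
    $length = strlen($input);
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($input[$i]);
        if ($byte < 0x80) {
            // ASCII range: identical in ISO-8859-1 and UTF-8.
            $output .= $input[$i];
        } else {
            // Lead byte carries the top 2 bits, trail byte the low 6 bits.
            $output .= chr(0xC0 | ($byte >> 6)) . chr(0x80 | ($byte & 0x3F));
        }
    }
    return $output;
}

echo bin2hex(latin1_to_utf8_manual("\x5A\x6F\xEB")), "\n"; // 5a6fc3ab
```

Because no byte value is rejected, a sketch like this can never fail, which matches the behaviour described in the note that follows.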
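Note:
This function does not attempt to guess the current encoding of the provided string. Instead, it assumes the string is encoded as ISO-8859-1 (also known as "Latin 1") and converts it to UTF-8. Since every sequence of bytes is a valid ISO-8859-1 string, this function never results in an error. However, it will not produce a useful result if a different encoding was intended.
Many web pages marked as using the ISO-8859-1 character encoding actually use the similar Windows-1252 encoding, and web browsers interpret ISO-8859-1 web pages as Windows-1252. Windows-1252 features additional printable characters, such as the Euro sign (€) and curly quotes (“ ”), in place of certain ISO-8859-1 control characters. This function will not convert such Windows-1252 characters correctly. Use a different function if Windows-1252 conversion is required.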
Parameters
string
An ISO-8859-1 string.
Return Values
Returns the UTF-8 translation of string.
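Changelog
Version | Description |
---|---|
8.2.0 | This function has been deprecated. |
7.2.0 | This function has been moved from the XML extension to the core of PHP. In previous versions, it was only available if the XML extension was installed. |
Examples
Example #1 Basic example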
<?php
// Convert the string 'Zoë' from ISO 8859-1 to UTF-8
$iso8859_1_string = "\x5A\x6F\xEB";
$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>
The above example will output:
5a6fc3ab
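Notes
Note: Deprecation and alternatives
As of PHP 8.2.0, this function is deprecated and will be removed in a future version. Existing uses should be checked and replaced with appropriate alternatives.
Similar functionality can be achieved with mb_convert_encoding(), which supports ISO-8859-1 and many other character encodings.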
<?php
$iso8859_1_string = "\xEB"; // 'ë' (e with diaeresis) in ISO-8859-1
$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";
$iso8859_7_string = "\xEB"; // the same string in ISO-8859-7 represents 'λ' (Greek lower-case lambda)
$utf8_string = mb_convert_encoding($iso8859_7_string, 'UTF-8', 'ISO-8859-7');
echo bin2hex($utf8_string), "\n";
$windows_1252_string = "\x80"; // '€' (Euro sign) in Windows-1252, but not in ISO-8859-1
$utf8_string = mb_convert_encoding($windows_1252_string, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8_string), "\n";
?>
The above example will output:
c3ab
cebb
e282ac
Other options, depending on the extensions installed, are UConverter::transcode() and iconv().
The following all give the same result:
<?php
$iso8859_1_string = "\x5A\x6F\xEB"; // 'Zoë' in ISO-8859-1
$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";
$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";
$utf8_string = UConverter::transcode($iso8859_1_string, 'UTF8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";
$utf8_string = iconv('ISO-8859-1', 'UTF-8', $iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>
The above example will output:
5a6fc3ab
5a6fc3ab
5a6fc3ab
5a6fc3ab
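When migrating existing code, one possible approach is to route all calls through a single wrapper, so the choice of extension is made in one place. The following is a minimal sketch under that assumption. The function name latin1_to_utf8() is hypothetical, and the order in which the extensions are tried is an arbitrary choice:

```php
<?php
// Hypothetical wrapper (not part of PHP): converts ISO-8859-1 to UTF-8
// using mbstring if it is loaded, otherwise iconv.
function latin1_to_utf8(string $input): string
{
    if (function_exists('mb_convert_encoding')) {
        $converted = mb_convert_encoding($input, 'UTF-8', 'ISO-8859-1');
    } elseif (function_exists('iconv')) {
        $converted = iconv('ISO-8859-1', 'UTF-8', $input);
    } else {
        throw new RuntimeException('Neither mbstring nor iconv is available.');
    }
    if (!is_string($converted)) {
        throw new RuntimeException('Conversion to UTF-8 failed.');
    }
    return $converted;
}

echo bin2hex(latin1_to_utf8("\x5A\x6F\xEB")), "\n"; // 5a6fc3ab
```

Calls to utf8_encode($s) could then be replaced with latin1_to_utf8($s) without changing the result for ISO-8859-1 input.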
See Also
- utf8_decode() - Converts a string from UTF-8 to ISO-8859-1, replacing invalid or unrepresentable characters
- mb_convert_encoding() - Convert a string from one character encoding to another
- UConverter::transcode() - Convert a string from one character encoding to another
- iconv() - Convert a string from one character encoding to another
User Contributed Notes 24 notes
Please note that utf8_encode only converts a string encoded in ISO-8859-1 to UTF-8. A more appropriate name for it would be "iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do not need this function. If your text is already in UTF-8, you do not need this function. In fact, applying this function to text that is not encoded in ISO-8859-1 will most likely simply garble that text.
If you need to convert text from any encoding to any other encoding, look at iconv() instead.
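To make the garbling concrete, here is a small sketch of what happens when text that is already UTF-8 is converted a second time. It uses iconv() with ISO-8859-1 as the source encoding, which gives the same result as utf8_encode() for this input. The variable names are only illustrative:

```php
<?php
// 'Zoë' encoded as ISO-8859-1 (3 bytes).
$latin1 = "\x5A\x6F\xEB";

// Correct use: the input really is ISO-8859-1.
$once = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo bin2hex($once), "\n"; // 5a6fc3ab

// Incorrect use: the input is already UTF-8, so each byte of the
// two-byte sequence for 'ë' is re-encoded as a separate character.
$twice = iconv('ISO-8859-1', 'UTF-8', $once);
echo bin2hex($twice), "\n"; // 5a6fc383c2ab, displayed as "ZoÃ«"
```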
Here's some code that addresses the issue that Steven describes in the previous comment:
<?php
/* This structure encodes the difference between ISO-8859-1 and Windows-1252,
as a map from the UTF-8 encoding of some ISO-8859-1 control characters to
the UTF-8 encoding of the non-control characters that Windows-1252 places
at the equivalent code points. */
$cp1252_map = array(
"\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
"\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
"\xc2\x83" => "\xc6\x92", /* LATIN SMALL LETTER F WITH HOOK */
"\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
"\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
"\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
"\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
"\xc2\x88" => "\xcb\x86", /* MODIFIER LETTER CIRCUMFLEX ACCENT */
"\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
"\xc2\x8a" => "\xc5\xa0", /* LATIN CAPITAL LETTER S WITH CARON */
"\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
"\xc2\x8c" => "\xc5\x92", /* LATIN CAPITAL LIGATURE OE */
"\xc2\x8e" => "\xc5\xbd", /* LATIN CAPITAL LETTER Z WITH CARON */
"\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
"\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
"\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
"\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
"\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
"\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
"\xc2\x97" => "\xe2\x80\x94", /* EM DASH */
"\xc2\x98" => "\xcb\x9c", /* SMALL TILDE */
"\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
"\xc2\x9a" => "\xc5\xa1", /* LATIN SMALL LETTER S WITH CARON */
"\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
"\xc2\x9c" => "\xc5\x93", /* LATIN SMALL LIGATURE OE */
"\xc2\x9e" => "\xc5\xbe", /* LATIN SMALL LETTER Z WITH CARON */
"\xc2\x9f" => "\xc5\xb8" /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
);
function cp1252_to_utf8($str) {
global $cp1252_map;
return strtr(utf8_encode($str), $cp1252_map);
}
?>
My version of utf8_encode_deep,
In case you need one that returns a value without changing the original.
/**
* Convert Anything To UTF-8
* @param mixed $var The variable you want to convert.
* @param boolean $deep Deep conversion? (*Default: TRUE).
* @return mixed
*/
function anything_to_utf8($var,$deep=TRUE){
if(is_array($var)){
foreach($var as $key => $value){
if($deep){
$var[$key] = anything_to_utf8($value,$deep);
}elseif(!is_array($value) && !is_object($value) && !mb_detect_encoding($value,'utf-8',true)){
$var[$key] = utf8_encode($value); // encode the element, not the whole array
}
}
return $var;
}elseif(is_object($var)){
foreach($var as $key => $value){
if($deep){
$var->$key = anything_to_utf8($value,$deep);
}elseif(!is_array($value) && !is_object($value) && !mb_detect_encoding($value,'utf-8',true)){
$var->$key = utf8_encode($value); // encode the property value, not the whole object
}
}
return $var;
}else{
return (!mb_detect_encoding($var,'utf-8',true))?utf8_encode($var):$var;
}
}
If you need a function which converts a string array into a utf8 encoded string array then this function might be useful for you:
<?php
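function utf8_string_array_encode(&$array){
// array_walk() cannot change keys, so the array is rebuilt instead.
$result = array();
foreach($array as $key => $value){
if(is_string($key)){
$key = utf8_encode($key);
}
if(is_string($value)){
$value = utf8_encode($value);
}elseif(is_array($value)){
// Recurse into nested arrays; $value is updated by reference.
utf8_string_array_encode($value);
}
$result[$key] = $value;
}
$array = $result;
return $array;
}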
?>
For reference, it may be insightful to point out that:
utf8_encode($s)
is actually identical to:
recode_string('latin1..utf8', $s)
and:
iconv('iso-8859-1', 'utf-8', $s)
That is, utf8_encode is a specialized case of character set conversions.
If the string to be converted to UTF-8 is in an encoding other than ISO-8859-1 (such as ISO-8859-2 (Polish/Croatian)), you should use recode_string() or iconv() rather than trying to devise complex str_replace statements.
Walk through nested arrays/objects and utf8 encode all strings.
<?php
// Usage
class Foo {
public $somevar = 'whoop whoop';
}
$structure = array(
'object' => (object) array(
'entry' => 'hello wörld',
'another_array' => array(
'string',
1234,
'another string'
)
),
'string' => 'foo',
'foo_object' => new Foo
);
utf8_encode_deep($structure);
// $structure is now utf8 encoded
print_r($structure);
// The function
function utf8_encode_deep(&$input) {
if (is_string($input)) {
$input = utf8_encode($input);
} else if (is_array($input)) {
foreach ($input as &$value) {
utf8_encode_deep($value);
}
unset($value);
} else if (is_object($input)) {
$vars = array_keys(get_object_vars($input));
foreach ($vars as $var) {
utf8_encode_deep($input->$var);
}
}
}
?>
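For plain nested arrays, a similar effect can be had without the deprecated function by combining array_walk_recursive() with iconv(). This is only a sketch under the assumption that all string values are ISO-8859-1. The helper name latin1_values_to_utf8() is hypothetical, and array keys are left as they are:

```php
<?php
// Hypothetical helper: returns a copy of $data in which every string value
// has been converted from ISO-8859-1 to UTF-8. Keys are left untouched
// in this sketch.
function latin1_values_to_utf8(array $data): array
{
    // $data is a by-value copy, so the caller's array is not modified.
    array_walk_recursive($data, function (&$value) {
        if (is_string($value)) {
            $value = iconv('ISO-8859-1', 'UTF-8', $value);
        }
    });
    return $data;
}

$input = array('name' => "Zo\xEB", 'nested' => array("\xEB", 1234));
$output = latin1_values_to_utf8($input);
echo bin2hex($output['name']), "\n";      // 5a6fc3ab
echo bin2hex($output['nested'][0]), "\n"; // c3ab
```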
If you are looking for a function to replace special characters with their hex UTF-8 values (e.g. for Webservice-Security/WSS4J compliance) you might use this:
$txt = "Größe"; // must be stored as ISO-8859-1 for utf8_encode() to work as intended
$utf8 = '';
$max = strlen($txt);
for ($i = 0; $i < $max; $i++) {
if ($txt[$i] == "&"){
$neu = "&#x26;";
}
elseif ((ord($txt[$i]) < 32) or (ord($txt[$i]) > 127)){
// Encode the byte as UTF-8, then turn each %XX into a &#xXX; entity
$neu = urlencode(utf8_encode($txt[$i]));
$neu = preg_replace('#\%(..)\%(..)\%(..)#','&#x\1;&#x\2;&#x\3;',$neu);
$neu = preg_replace('#\%(..)\%(..)#','&#x\1;&#x\2;',$neu);
$neu = preg_replace('#\%(..)#','&#x\1;',$neu);
}
else {
$neu = $txt[$i];
}
$utf8 .= $neu;
} // for $i
$textnew = $utf8;
In this example, provided the source string is ISO-8859-1 encoded, $textnew will be "Gr&#xC3;&#xB6;&#xC3;&#x9F;e"
I was searching for a function similar to JavaScript's unescape(). In most cases it is OK to use the urldecode() function, but not if you've got Unicode characters in the strings. They are converted into %uXXXX sequences that urldecode() cannot handle.
I searched the web and found a function which actually converts these sequences into HTML entities (&#xxx;) that your browser can show correctly. If you're OK with that, the function can be found here: http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps
But that was not OK for me because I needed a string in my charset to make some comparisons and other stuff. So I have modified the above function and, in conjunction with the code2utf() function mentioned in another note here, I have managed to achieve my goal:
<?php
/**
* Converts a JavaScript-escaped string back into a string in the specified charset (default is UTF-8).
* Modified function from http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps
*
* @param string $source string escaped with JavaScript's escape() function
* @param string $iconv_to destination character set, used as the second parameter of the iconv() function. Default is UTF-8.
* @return string
*/
function unescape($source, $iconv_to = 'UTF-8') {
$decodedStr = '';
$pos = 0;
$len = strlen ($source);
while ($pos < $len) {
$charAt = substr ($source, $pos, 1);
if ($charAt == '%') {
$pos++;
$charAt = substr ($source, $pos, 1);
if ($charAt == 'u') {
// we got a unicode character
$pos++;
$unicodeHexVal = substr ($source, $pos, 4);
$unicode = hexdec ($unicodeHexVal);
$decodedStr .= code2utf($unicode);
$pos += 4;
}
else {
// we have an escaped ascii character
$hexVal = substr ($source, $pos, 2);
$decodedStr .= chr (hexdec ($hexVal));
$pos += 2;
}
}
else {
$decodedStr .= $charAt;
$pos++;
}
}
if ($iconv_to != "UTF-8") {
$decodedStr = iconv("UTF-8", $iconv_to, $decodedStr);
}
return $decodedStr;
}
/**
* Converts a Unicode code point number into the corresponding UTF-8 character.
* Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
*
* @param int $num
* @return utf8char
*/
function code2utf($num){
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
?>
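The bit arithmetic in code2utf() above may be easier to follow with concrete numbers. The following worked example traces the three-byte branch for U+20AC (EURO SIGN), which JavaScript's escape() emits as %u20AC. The variable names are only illustrative:

```php
<?php
// Worked example for the three-byte branch of code2utf(),
// using code point U+20AC (EURO SIGN).
$num = 0x20AC;

$byte1 = ($num >> 12) + 224;       // 0xE2: lead byte, holds the top 4 bits
$byte2 = (($num >> 6) & 63) + 128; // 0x82: continuation byte, middle 6 bits
$byte3 = ($num & 63) + 128;        // 0xAC: continuation byte, low 6 bits

echo bin2hex(chr($byte1) . chr($byte2) . chr($byte3)), "\n"; // e282ac
```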
This function may be useful to encode array keys and values (it first checks whether a string is already valid UTF-8):
<?php
function to_utf8($in)
{
if (is_array($in)) {
$out = array();
foreach ($in as $key => $value) {
$out[to_utf8($key)] = to_utf8($value);
}
return $out;
} elseif (is_string($in)) {
// Strict detection: false means the string is not valid UTF-8
if (mb_detect_encoding($in, 'UTF-8', true) === false)
return utf8_encode($in);
else
return $in;
} else {
return $in;
}
}
?>
Hope this may help.
[NOTE BY danbrown AT php DOT net: Original function written by (cmyk777 AT gmail DOT com) on 28-JAN-09.]
I tried a lot of things, but this seems to be the final fail-safe method to convert any string to proper UTF-8.
<?php
function _convert($content) {
if(!mb_check_encoding($content, 'UTF-8')
OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8' ), 'UTF-8', 'UTF-32'))) {
$content = mb_convert_encoding($content, 'UTF-8');
if (mb_check_encoding($content, 'UTF-8')) {
// log('Converted to UTF-8');
} else {
// log('Could not be converted to UTF-8');
}
}
return $content;
}
?>
// Reads a file story.txt ascii (as typed on keyboard)
// converts it to Georgian character using utf8 encoding
// if I am correct(?) just as it should be when typed on Georgian computer
// it outputs it as an html file
//
// http://www.comweb.nl/keys_to_georgian.html
// http://www.comweb.nl/keys_to_georgian.php
// http://www.comweb.nl/story.txt
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<HEAD>
<TITLE>keys to unicode code</TITLE>
<!-- this meta tag is needed -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
<!-- note: the Sylfaen font seems to be installed by default on Windows XP; it supports Georgian -->
<style TYPE="text/css">
<!--
body {font-family:sylfaen; }
-->
</style>
</HEAD>
<BODY>
<?php
$eng=array(97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,120,121,122,87,82,84,83,
67,74,90);
$geo=array(4304,4305,4330,4307,4308,4324,4306,4336,4312,4335,4313,
4314,4315,4316,4317,4318,4325,4320,4321,4322,4323,4309,
4332,4334,4327,4310,4333,4326,4311,4328,4329,4319,4331,
91,93,59,39,44,46,96);
$fc=file("story.txt");
foreach($fc as $line)
{
$spacestart=1;
for ($i=0; $i<strlen($line); $i+=1)
{
$character=ord(substr($line,$i,1));
$found=0;
for ($k=0; $k<count($eng); $k+=1)
{
if ($eng[$k]==$character)
{
print code2utf( $geo[$k] );
$found=1;
}
}
if ($found==0)
{
if ($character==126 || $character==32 || $character==10 || $character==9)
{
if ($character==9) { print ' '; }
if ($character==10) { print "<BR>\n"; }
if ($character==32)
{
if ($spacestart==1) {print ' '; } else { print " "; }
}
if ($character==126){ print "~"; }
} else
{
print substr($line,$i,1);
}
}
if ($character!=32) { $spacestart=0; }
}
}
/**
* Converts a Unicode code point number into the corresponding UTF-8 character.
* Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
*
* @param int $num
* @return utf8char
*/
function code2utf($num)
{
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
?>
</BODY>
</HTML>
// Validate Unicode UTF-8 Version 4
// This function takes as reference the table 3.6 found at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
// It also flags overlong bytes as error
function is_validUTF8($str)
{
// values of -1 represent disallowed values for the first byte in current UTF-8
static $trailing_bytes = array (
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
);
$ups = unpack('C*', $str);
if (!($aCnt = count($ups))) return true; // Empty string *is* valid UTF-8
for ($i = 1; $i <= $aCnt;)
{
if (!($tbytes = $trailing_bytes[($b1 = $ups[$i++])])) continue;
if ($tbytes == -1) return false;
$first = true;
while ($tbytes > 0 && $i <= $aCnt)
{
$cbyte = $ups[$i++];
if (($cbyte & 0xC0) != 0x80) return false;
if ($first)
{
switch ($b1)
{
case 0xE0:
if ($cbyte < 0xA0) return false;
break;
case 0xED:
if ($cbyte > 0x9F) return false;
break;
case 0xF0:
if ($cbyte < 0x90) return false;
break;
case 0xF4:
if ($cbyte > 0x8F) return false;
break;
default:
break;
}
$first = false;
}
$tbytes--;
}
if ($tbytes) return false; // incomplete sequence at EOS
}
return true;
}
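As a point of comparison, a shorter check is possible by relying on PCRE's own UTF-8 validation, which the /u pattern modifier triggers. This is a sketch rather than a drop-in replacement for the function above. The helper name is_valid_utf8_pcre() is hypothetical, and where the mbstring extension is loaded, mb_check_encoding($str, 'UTF-8') is another option:

```php
<?php
// Hypothetical helper: preg_match() with the /u modifier returns false
// when the subject is not valid UTF-8, and 1 when the empty pattern matches.
function is_valid_utf8_pcre(string $str): bool
{
    return preg_match('//u', $str) === 1;
}

var_dump(is_valid_utf8_pcre("Zo\xC3\xAB"));   // bool(true): well-formed
var_dump(is_valid_utf8_pcre("Zo\xEB"));       // bool(false): bare ISO-8859-1 byte
var_dump(is_valid_utf8_pcre("\xC0\xAF"));     // bool(false): overlong encoding
var_dump(is_valid_utf8_pcre("\xED\xA0\x80")); // bool(false): UTF-16 surrogate
```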
If you haven't guessed already: this applies to the reverse direction, utf8_decode(). If a UTF-8 character has no representation in the ISO-8859-1 codepage, a ? will be returned in its place. You might want to wrap a function around this to make sure you aren't saving a bunch of ???? into your database.