utf8_encode

(PHP 4, PHP 5, PHP 7, PHP 8)

utf8_encode - Converts a string from ISO-8859-1 to UTF-8

Warning

This function has been deprecated as of PHP 8.2.0. Relying on this function is highly discouraged.

Description

utf8_encode(string $string): string

This function converts the string string from the ISO-8859-1 encoding to UTF-8.

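Note:

This function does not attempt to guess the current encoding of the provided string. Instead, it assumes the string is encoded as ISO-8859-1 (also known as "Latin 1") and converts it to UTF-8. Since every sequence of bytes is a valid ISO-8859-1 string, this function never results in an error, but it will not produce a useful string if a different encoding was intended.

Many web pages marked as using the ISO-8859-1 character encoding actually use the similar Windows-1252 encoding, and web browsers interpret ISO-8859-1 web pages as Windows-1252. Windows-1252 adds printable characters, such as the Euro sign (€) and curly quotes (“ ”), in place of certain ISO-8859-1 control characters. This function will not convert such Windows-1252 characters correctly. Use a different function if Windows-1252 conversion is required.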
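The reason the conversion can never fail is easier to see when it is written out by hand. The following is only a sketch of the mapping, not PHP's actual implementation; `latin1_to_utf8_sketch()` is a hypothetical name introduced for illustration. Every byte below 0x80 is copied unchanged, and every other byte becomes a two-byte UTF-8 sequence for the code point with the same number.

```php
<?php
// Hypothetical helper, for illustration only: maps each ISO-8859-1 byte
// to the UTF-8 encoding of the code point with the same numeric value.
function latin1_to_utf8_sketch(string $bytes): string
{
    $out = '';
    $length = strlen($bytes);
    for ($i = 0; $i < $length; $i++) {
        $code = ord($bytes[$i]);
        if ($code < 0x80) {
            // ASCII range: identical in ISO-8859-1 and UTF-8.
            $out .= $bytes[$i];
        } else {
            // 0x80-0xFF: two-byte sequence 110xxxxx 10xxxxxx.
            $out .= chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
        }
    }
    return $out;
}

echo bin2hex(latin1_to_utf8_sketch("\x5A\x6F\xEB")), "\n"; // 5a6fc3ab
echo bin2hex(latin1_to_utf8_sketch("\x80")), "\n";         // c280 (a control character, not the Euro sign)
```

The second line of output illustrates the Windows-1252 caveat: byte 0x80 is treated as the ISO-8859-1 control character U+0080, so it does not become the Euro sign.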

Parameters

string

An ISO-8859-1 string.

Return Values

Returns the UTF-8 translation of string.

Changelog

Version  Description
8.2.0    This function has been deprecated.
7.2.0    This function has been moved from the XML extension to the core of PHP. In previous versions, it was only available if the XML extension was installed.

Example #1 Basic example

<?php
// Convert the string 'Zoë' from ISO 8859-1 to UTF-8
$iso8859_1_string = "\x5A\x6F\xEB";
$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

5a6fc3ab

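Notes

Note: Deprecation and alternatives

As of PHP 8.2.0, this function is deprecated and will be removed in a future version. Existing uses should be checked and replaced with appropriate alternatives.

Similar functionality can be achieved with mb_convert_encoding(), which supports ISO-8859-1 and many other character encodings.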

<?php
$iso8859_1_string = "\xEB"; // 'ë' (e with diaeresis) in ISO-8859-1
$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$iso8859_7_string = "\xEB"; // the same string in ISO-8859-7 represents 'λ' (Greek lower-case lambda)
$utf8_string = mb_convert_encoding($iso8859_7_string, 'UTF-8', 'ISO-8859-7');
echo bin2hex($utf8_string), "\n";

$windows_1252_string = "\x80"; // '€' (Euro sign) in Windows-1252, but not in ISO-8859-1
$utf8_string = mb_convert_encoding($windows_1252_string, 'UTF-8', 'Windows-1252');
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

c3ab
cebb
e282ac

Other options, which may be available depending on the extensions installed, are UConverter::transcode() and iconv().

The following all give the same result:

<?php
$iso8859_1_string = "\x5A\x6F\xEB"; // 'Zoë' in ISO-8859-1

$utf8_string = utf8_encode($iso8859_1_string);
echo bin2hex($utf8_string), "\n";

$utf8_string = mb_convert_encoding($iso8859_1_string, 'UTF-8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$utf8_string = UConverter::transcode($iso8859_1_string, 'UTF8', 'ISO-8859-1');
echo bin2hex($utf8_string), "\n";

$utf8_string = iconv('ISO-8859-1', 'UTF-8', $iso8859_1_string);
echo bin2hex($utf8_string), "\n";
?>

The above example will output:

5a6fc3ab
5a6fc3ab
5a6fc3ab
5a6fc3ab

See Also

  • utf8_decode() - Converts a string from UTF-8 to ISO-8859-1, replacing characters that cannot be represented
  • mb_convert_encoding() - Converts a string from one character encoding to another
  • UConverter::transcode() - Converts a string from one character encoding to another
  • iconv() - Converts a string from one character encoding to another

User Contributed Notes 24 notes

up
140
deceze at gmail dot com
13 years ago
Please note that utf8_encode only converts a string encoded in ISO-8859-1 to UTF-8. A more appropriate name for it would be "iso88591_to_utf8". If your text is not encoded in ISO-8859-1, you do not need this function. If your text is already in UTF-8, you do not need this function. In fact, applying this function to text that is not encoded in ISO-8859-1 will most likely simply garble that text.

If you need to convert text from any encoding to any other encoding, look at iconv() instead.
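The garbling described above is easy to reproduce, and a validity check before converting avoids the most common case of double encoding. The sketch below assumes the mbstring extension is loaded; `to_utf8_once()` is a hypothetical helper name, and ISO-8859-1 as the fallback source encoding is an assumption the caller has to confirm for their own data.

```php
<?php
// Hypothetical helper: converts only when the input is not already valid UTF-8.
// Assumes the mbstring extension is available.
function to_utf8_once(string $text, string $assumedSource = 'ISO-8859-1'): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', $assumedSource);
}

$alreadyUtf8 = "Zo\xC3\xAB"; // 'Zoë' in UTF-8

// Treating UTF-8 input as ISO-8859-1 encodes each byte of 'ë' separately.
echo bin2hex(mb_convert_encoding($alreadyUtf8, 'UTF-8', 'ISO-8859-1')), "\n"; // 5a6fc383c2ab

// The guarded version leaves UTF-8 alone and converts ISO-8859-1 input.
echo bin2hex(to_utf8_once($alreadyUtf8)), "\n"; // 5a6fc3ab
echo bin2hex(to_utf8_once("Zo\xEB")), "\n";     // 5a6fc3ab
```

The check is a heuristic: an ISO-8859-1 string whose bytes happen to form valid UTF-8 would be left unconverted, so knowing the real source encoding is still preferable.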
up
9
Aidan Kehoe <php-manual at parhasard dot net>
20 years ago
Here's some code that addresses the issue that Steven describes in the previous comment;

<?php

/* This structure encodes the difference between ISO-8859-1 and Windows-1252,
as a map from the UTF-8 encoding of some ISO-8859-1 control characters to
the UTF-8 encoding of the non-control characters that Windows-1252 places
at the equivalent code points. */

$cp1252_map = array(
"\xc2\x80" => "\xe2\x82\xac", /* EURO SIGN */
"\xc2\x82" => "\xe2\x80\x9a", /* SINGLE LOW-9 QUOTATION MARK */
"\xc2\x83" => "\xc6\x92", /* LATIN SMALL LETTER F WITH HOOK */
"\xc2\x84" => "\xe2\x80\x9e", /* DOUBLE LOW-9 QUOTATION MARK */
"\xc2\x85" => "\xe2\x80\xa6", /* HORIZONTAL ELLIPSIS */
"\xc2\x86" => "\xe2\x80\xa0", /* DAGGER */
"\xc2\x87" => "\xe2\x80\xa1", /* DOUBLE DAGGER */
"\xc2\x88" => "\xcb\x86", /* MODIFIER LETTER CIRCUMFLEX ACCENT */
"\xc2\x89" => "\xe2\x80\xb0", /* PER MILLE SIGN */
"\xc2\x8a" => "\xc5\xa0", /* LATIN CAPITAL LETTER S WITH CARON */
"\xc2\x8b" => "\xe2\x80\xb9", /* SINGLE LEFT-POINTING ANGLE QUOTATION */
"\xc2\x8c" => "\xc5\x92", /* LATIN CAPITAL LIGATURE OE */
"\xc2\x8e" => "\xc5\xbd", /* LATIN CAPITAL LETTER Z WITH CARON */
"\xc2\x91" => "\xe2\x80\x98", /* LEFT SINGLE QUOTATION MARK */
"\xc2\x92" => "\xe2\x80\x99", /* RIGHT SINGLE QUOTATION MARK */
"\xc2\x93" => "\xe2\x80\x9c", /* LEFT DOUBLE QUOTATION MARK */
"\xc2\x94" => "\xe2\x80\x9d", /* RIGHT DOUBLE QUOTATION MARK */
"\xc2\x95" => "\xe2\x80\xa2", /* BULLET */
"\xc2\x96" => "\xe2\x80\x93", /* EN DASH */
"\xc2\x97" => "\xe2\x80\x94", /* EM DASH */

"\xc2\x98" => "\xcb\x9c", /* SMALL TILDE */
"\xc2\x99" => "\xe2\x84\xa2", /* TRADE MARK SIGN */
"\xc2\x9a" => "\xc5\xa1", /* LATIN SMALL LETTER S WITH CARON */
"\xc2\x9b" => "\xe2\x80\xba", /* SINGLE RIGHT-POINTING ANGLE QUOTATION*/
"\xc2\x9c" => "\xc5\x93", /* LATIN SMALL LIGATURE OE */
"\xc2\x9e" => "\xc5\xbe", /* LATIN SMALL LETTER Z WITH CARON */
"\xc2\x9f" => "\xc5\xb8" /* LATIN CAPITAL LETTER Y WITH DIAERESIS*/
);

function cp1252_to_utf8($str) {
    global $cp1252_map;
    return strtr(utf8_encode($str), $cp1252_map);
}

?>
up
4
Pini
9 years ago
My version of utf8_encode_deep, in case you need one that returns a value without changing the original.

/**
 * Convert Anything To UTF-8
 * @param mixed $var The variable you want to convert.
 * @param boolean $deep Deep conversion? (*Default: TRUE).
 * @return mixed
 */
function anything_to_utf8($var, $deep = TRUE) {
    if (is_array($var)) {
        foreach ($var as $key => $value) {
            if ($deep) {
                $var[$key] = anything_to_utf8($value, $deep);
            } elseif (is_string($value) && !mb_detect_encoding($value, 'utf-8', true)) {
                // Encode the element itself, not the enclosing array.
                $var[$key] = utf8_encode($value);
            }
        }
        return $var;
    } elseif (is_object($var)) {
        // Work on a copy so that the caller's object is left unchanged.
        $var = clone $var;
        foreach ($var as $key => $value) {
            if ($deep) {
                $var->$key = anything_to_utf8($value, $deep);
            } elseif (is_string($value) && !mb_detect_encoding($value, 'utf-8', true)) {
                // Encode the property value itself, not the enclosing object.
                $var->$key = utf8_encode($value);
            }
        }
        return $var;
    } else {
        // Only strings can need conversion; other scalars are returned as-is.
        return (is_string($var) && !mb_detect_encoding($var, 'utf-8', true)) ? utf8_encode($var) : $var;
    }
}
up
7
a dot rueedlinger at gmail dot com
11 years ago
If you need a function which converts a string array into a utf8 encoded string array then this function might be useful for you:

<?php
function utf8_string_array_encode(&$array){
    $encoded = array();
    foreach ($array as $key => $value) {
        // array_walk() cannot rename keys, so the array is rebuilt instead.
        if (is_string($key)) {
            $key = utf8_encode($key);
        }
        if (is_string($value)) {
            $value = utf8_encode($value);
        } elseif (is_array($value)) {
            utf8_string_array_encode($value);
        }
        $encoded[$key] = $value;
    }
    $array = $encoded;
    return $array;
}
?>
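Because utf8_encode() is deprecated as of PHP 8.2.0, the same idea can also be sketched with mb_convert_encoding(). This is a minimal illustration, assuming the mbstring extension is loaded and that every string in the array really is ISO-8859-1; `latin1_array_to_utf8()` is a hypothetical name. It returns a new array instead of modifying its argument, and it re-encodes string keys as well as values.

```php
<?php
// Hypothetical helper: recursively converts string keys and values
// from ISO-8859-1 to UTF-8. Assumes the mbstring extension is available.
function latin1_array_to_utf8(array $input): array
{
    $output = [];
    foreach ($input as $key => $value) {
        if (is_string($key)) {
            $key = mb_convert_encoding($key, 'UTF-8', 'ISO-8859-1');
        }
        if (is_array($value)) {
            $value = latin1_array_to_utf8($value);
        } elseif (is_string($value)) {
            $value = mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
        }
        // Non-string scalars, null and objects are copied unchanged.
        $output[$key] = $value;
    }
    return $output;
}

$latin1 = ["nom\xE9" => "Zo\xEB", 'nested' => ["\xFC", 42]];
$utf8 = latin1_array_to_utf8($latin1);

echo bin2hex(array_keys($utf8)[0]), "\n";  // 6e6f6dc3a9
echo bin2hex($utf8['nested'][0]), "\n";    // c3bc
```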
up
5
bisqwit at iki dot fi
19 years ago
For reference, it may be insightful to point out that:
utf8_encode($s)
is actually identical to:
recode_string('latin1..utf8', $s)
and:
iconv('iso-8859-1', 'utf-8', $s)
That is, utf8_encode is a specialized case of character set conversions.

If your string to be converted to utf-8 is something other than iso-8859-1 (such as iso-8859-2 (Polish/Croatian)), you should use recode_string() or iconv() rather than trying to devise complex str_replace statements.
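As a small illustration of why the source encoding matters, the same byte can be converted with iconv() under two different assumptions. This sketch assumes the iconv extension is available and that the underlying iconv library knows ISO-8859-2.

```php
<?php
// The byte 0xB3 is '³' (superscript three) in ISO-8859-1
// but 'ł' (l with stroke) in ISO-8859-2.
$byte = "\xB3";

echo bin2hex(iconv('ISO-8859-1', 'UTF-8', $byte)), "\n"; // c2b3
echo bin2hex(iconv('ISO-8859-2', 'UTF-8', $byte)), "\n"; // c582
```

Only the second call yields the Polish letter; a conversion that assumes ISO-8859-1 would silently produce the wrong character.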
up
6
Oscar Broman
12 years ago
Walk through nested arrays/objects and utf8 encode all strings.

<?php
// Usage
class Foo {
    public $somevar = 'whoop whoop';
}

$structure = array(
    'object' => (object) array(
        'entry' => 'hello wörld',
        'another_array' => array(
            'string',
            1234,
            'another string'
        )
    ),
    'string' => 'foo',
    'foo_object' => new Foo
);

utf8_encode_deep($structure);

// $structure is now utf8 encoded
print_r($structure);

// The function
function utf8_encode_deep(&$input) {
    if (is_string($input)) {
        $input = utf8_encode($input);
    } else if (is_array($input)) {
        foreach ($input as &$value) {
            utf8_encode_deep($value);
        }

        unset($value);
    } else if (is_object($input)) {
        $vars = array_keys(get_object_vars($input));

        foreach ($vars as $var) {
            utf8_encode_deep($input->$var);
        }
    }
}
?>
up
2
rocketman
18 years ago
If you are looking for a function to replace special characters with the hex UTF-8 value (e.g. for Webservice-Security/WSS4J compliance) you might use this:

$textstart = "Gr\xF6\xDFe"; // 'Größe' encoded in ISO-8859-1
$txt = $textstart;
$utf8 = '';
$max = strlen($txt);

for ($i = 0; $i < $max; $i++) {

    if ($txt[$i] == "&") {
        $neu = "&#x26;";
    }
    elseif ((ord($txt[$i]) < 32) or (ord($txt[$i]) > 127)) {
        // Encode the byte as UTF-8, then rewrite each %XX as &#xXX;
        $neu = urlencode(utf8_encode($txt[$i]));
        $neu = preg_replace('#\%(..)\%(..)\%(..)#', '&#x\1;&#x\2;&#x\3;', $neu);
        $neu = preg_replace('#\%(..)\%(..)#', '&#x\1;&#x\2;', $neu);
        $neu = preg_replace('#\%(..)#', '&#x\1;', $neu);
    }
    else {
        $neu = $txt[$i];
    }

    $utf8 .= $neu;
} // for $i

$textnew = $utf8;

In this example $textnew will be "Gr&#xC3;&#xB6;&#xC3;&#x9F;e"
up
1
Janci
19 years ago
I was searching for a function similar to Javascript's unescape(). In most cases it is OK to use the urldecode() function, but not if you've got Unicode characters in the strings. They are converted into %uXXXX escapes that urldecode() cannot handle.
I googled the net and found a function which actually converts these escapes into HTML entities (&#xxx;) that your browser can show correctly. If you're OK with that, the function can be found here: http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps

But it was not OK with me because I needed a string in my charset to make some comparisons and other stuff. So I have modified the above function and, in conjunction with the code2utf() function mentioned in some other note here, I have managed to achieve my goal:

<?php
/**
 * Converts a Javascript-escaped string back into a string in the specified charset (default is UTF-8).
 * Modified function from http://pure-essence.net/stuff/code/utf8RawUrlDecode.phps
 *
 * @param string $source string escaped with Javascript's escape() function
 * @param string $iconv_to destination character set, used as the second parameter of the iconv function. Default is UTF-8.
 * @return string
 */
function unescape($source, $iconv_to = 'UTF-8') {
    $decodedStr = '';
    $pos = 0;
    $len = strlen($source);
    while ($pos < $len) {
        $charAt = substr($source, $pos, 1);
        if ($charAt == '%') {
            $pos++;
            $charAt = substr($source, $pos, 1);
            if ($charAt == 'u') {
                // we got a unicode character (%uXXXX)
                $pos++;
                $unicodeHexVal = substr($source, $pos, 4);
                $unicode = hexdec($unicodeHexVal);
                $decodedStr .= code2utf($unicode);
                $pos += 4;
            }
            else {
                // we have an escaped character below 256 (%XX); escape() emits
                // the code point, so encode it as UTF-8 rather than as a raw byte
                $hexVal = substr($source, $pos, 2);
                $decodedStr .= code2utf(hexdec($hexVal));
                $pos += 2;
            }
        }
        else {
            $decodedStr .= $charAt;
            $pos++;
        }
    }

    if ($iconv_to != "UTF-8") {
        $decodedStr = iconv("UTF-8", $iconv_to, $decodedStr);
    }

    return $decodedStr;
}

/**
 * Converts a Unicode code point number into the corresponding UTF-8 character.
 * Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
 *
 * @param int $num
 * @return string UTF-8 encoded character
 */
function code2utf($num){
    if ($num < 128) return chr($num);
    if ($num < 2048) return chr(($num>>6)+192).chr(($num&63)+128);
    if ($num < 65536) return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
    if ($num < 2097152) return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128).chr(($num&63)+128);
    return '';
}
?>
up
1
rogeriogirodo at gmail dot com
15 years ago
This function may be useful to encode array keys and values [and checks first to see if it's already in UTF format]:

<?php
// Declared as a plain function so that the recursive to_utf8() calls resolve.
function to_utf8($in)
{
    if (is_array($in)) {
        $out = array(); // ensures an empty array is returned for empty input
        foreach ($in as $key => $value) {
            $out[to_utf8($key)] = to_utf8($value);
        }
        return $out;
    } elseif (is_string($in)) {
        if (mb_detect_encoding($in) != "UTF-8")
            return utf8_encode($in);
        else
            return $in;
    } else {
        return $in;
    }
}
?>

Hope this may help.

[NOTE BY danbrown AT php DOT net: Original function written by (cmyk777 AT gmail DOT com) on 28-JAN-09.]
up
0
powtac 4t gmx d0t de
13 years ago
I tried a lot of things, but this seems to be the final fail-safe method to convert any string to proper UTF-8.

<?php
function _convert($content) {
    if (!mb_check_encoding($content, 'UTF-8')
        OR !($content === mb_convert_encoding(mb_convert_encoding($content, 'UTF-32', 'UTF-8'), 'UTF-8', 'UTF-32'))) {

        $content = mb_convert_encoding($content, 'UTF-8');

        if (mb_check_encoding($content, 'UTF-8')) {
            // log('Converted to UTF-8');
        } else {
            // log('Could not convert to UTF-8');
        }
    }
    return $content;
}
?>
up
0
Anonymous
19 years ago
// Reads a file story.txt ascii (as typed on keyboard)
// converts it to Georgian character using utf8 encoding
// if I am correct(?) just as it should be when typed on Georgian computer
// it outputs it as an html file
//
// http://www.comweb.nl/keys_to_georgian.html
// http://www.comweb.nl/keys_to_georgian.php
// http://www.comweb.nl/story.txt

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

<HTML>
<HEAD>
<TITLE>keys to unicode code</TITLE>

<!-- this meta tag is needed -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >

<!-- note: the Sylfaen font seems to be installed by default on Windows XP; it supports Georgian -->

<style TYPE="text/css">
<!--
body {font-family:sylfaen; }
-->
</style>
</HEAD>

<BODY>

<?php
$eng=array(97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,
112,113,114,115,116,117,118,119,120,121,122,87,82,84,83,
67,74,90);
$geo=array(4304,4305,4330,4307,4308,4324,4306,4336,4312,4335,4313,
4314,4315,4316,4317,4318,4325,4320,4321,4322,4323,4309,
4332,4334,4327,4310,4333,4326,4311,4328,4329,4319,4331,
91,93,59,39,44,46,96);

$fc=file("story.txt");
foreach($fc as $line)
{
$spacestart=1;
for ($i=0; $i<strlen($line); $i+=1)
{
$character=ord(substr($line,$i,1));
$found=0;
for ($k=0; $k<count($eng); $k+=1)
{
if ($eng[$k]==$character)
{
print code2utf( $geo[$k] );
$found=1;
}
}
if ($found==0)
{
if ($character==126 || $character==32 || $character==10 || $character==9)
{
if ($character==9) { print '&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;'; }
if ($character==10) { print "<BR>\n"; }
if ($character==32)
{
if ($spacestart==1) {print '&nbsp;'; } else { print " "; }
}
if ($character==126){ print "~"; }
} else
{
print substr($line,$i,1);
}
}
if ($character!=32) { $spacestart=0; }
}
}

/**
* Converts a Unicode code point number into the corresponding UTF-8 character.
* Function taken from: http://sk2.php.net/manual/en/function.utf8-encode.php#49336
*
* @param int $num
* @return utf8char
*/
function code2utf($num)
{
if($num<128)return chr($num);
if($num<2048)return chr(($num>>6)+192).chr(($num&63)+128);
if($num<65536)return chr(($num>>12)+224).chr((($num>>6)&63)+128).chr(($num&63)+128);
if($num<2097152)return chr(($num>>18)+240).chr((($num>>12)&63)+128).chr((($num>>6)&63)+128) .chr(($num&63)+128);
return '';
}
?>

</BODY>
</HTML>
up
0
hrpeters (at) gmx (dot) net
20 years ago
// Validate Unicode UTF-8 Version 4
// This function takes as reference the table 3.6 found at http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf
// It also flags overlong bytes as error

function is_validUTF8($str)
{
// values of -1 represent disallowed values for the first bytes in current UTF-8
static $trailing_bytes = array (
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1, -1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,
-1,-1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1
);

$ups = unpack('C*', $str);
if (!($aCnt = count($ups))) return true; // Empty string *is* valid UTF-8
for ($i = 1; $i <= $aCnt;)
{
if (!($tbytes = $trailing_bytes[($b1 = $ups[$i++])])) continue;
if ($tbytes == -1) return false;

$first = true;
while ($tbytes > 0 && $i <= $aCnt)
{
$cbyte = $ups[$i++];
if (($cbyte & 0xC0) != 0x80) return false;

if ($first)
{
switch ($b1)
{
case 0xE0:
if ($cbyte < 0xA0) return false;
break;
case 0xED:
if ($cbyte > 0x9F) return false;
break;
case 0xF0:
if ($cbyte < 0x90) return false;
break;
case 0xF4:
if ($cbyte > 0x8F) return false;
break;
default:
break;
}
$first = false;
}
$tbytes--;
}
if ($tbytes) return false; // incomplete sequence at EOS
}
return true;
}
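For comparison, the same kind of check can be delegated to PCRE, which validates the subject string when the `u` modifier is used. This is only a sketch of an alternative, not a replacement the note's author proposed; `is_valid_utf8_pcre()` is a hypothetical name.

```php
<?php
// Hypothetical helper: an empty pattern with the /u modifier matches any
// subject that is valid UTF-8; preg_match() returns false for invalid input.
function is_valid_utf8_pcre(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8_pcre("Zo\xC3\xAB")); // bool(true)
var_dump(is_valid_utf8_pcre("Zo\xEB"));     // bool(false): truncated sequence
var_dump(is_valid_utf8_pcre("\xC0\x80"));   // bool(false): overlong encoding
```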
up
0
Mark AT modernbill DOT com
20 years ago
If you haven't guessed already: when converting in the other direction with utf8_decode(), a UTF-8 character that has no representation in the ISO-8859-1 codepage is returned as a ?. You might want to wrap a function around this to make sure you aren't saving a bunch of ???? into your database.
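One way to build such a wrapper is to test, before converting, whether every character fits in ISO-8859-1. The sketch below is an assumption about how that check could look, using a PCRE character class limited to code points 0 through 255; `fits_in_latin1()` is a hypothetical name.

```php
<?php
// Hypothetical helper: true only if the input is valid UTF-8 and every
// code point is in the range U+0000..U+00FF, i.e. representable in ISO-8859-1.
function fits_in_latin1(string $utf8): bool
{
    return preg_match('/\A[\x{00}-\x{FF}]*\z/u', $utf8) === 1;
}

var_dump(fits_in_latin1("Zo\xC3\xAB"));   // bool(true):  'Zoë'
var_dump(fits_in_latin1("\xE2\x82\xAC")); // bool(false): '€' is U+20AC
```

A caller could refuse the conversion, or log the input, whenever this returns false instead of storing question marks.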