strtok
(PHP 4, PHP 5, PHP 7, PHP 8)
strtok — Tokenize string
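Description
strtok(string $string, string $token): string|false
Alternative signature (named arguments are not supported):
strtok(string $token): string|false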
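strtok() splits a string (string) into smaller strings (tokens), with each token delimited by any character from token. For example, given a string like "This is an example string", you can split it into its individual words by using the space character as the token.
Note that only the first call to strtok uses the string argument. Every subsequent call needs only the token argument, because strtok keeps track of where it is in the current string. To start over, or to tokenize a new string, call strtok with the string argument again to initialize it.
The string is split wherever any one of the characters in token is found.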
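The following is a minimal sketch of the behavior described above. The sample strings and variable names are made up for illustration. It shows that strtok remembers its position between calls, that passing a string again restarts tokenization, and that every character in token acts as a delimiter.

```php
<?php
// First call passes the string; later calls pass only the delimiters.
$a = strtok("one two three", " ");   // "one"
$b = strtok(" ");                    // "two"

// Passing a string again discards the remaining "three" and starts over.
// Both "-" and ":" act as delimiters here.
$c = strtok("x-y:z", "-:");          // "x"
$d = strtok("-:");                   // "y"
$e = strtok("-:");                   // "z"
$f = strtok("-:");                   // false: no tokens left

var_dump($a, $b, $c, $d, $e, $f);
```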
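Note:
This function behaves slightly differently from what one might expect when familiar with explode(). First, a sequence of two or more contiguous token characters in the parsed string is treated as a single delimiter. Also, a token character at the start or end of the string is ignored. For example, if the string is ";aaa;;bbb;", successive calls to strtok() with ";" as the token return "aaa" and "bbb", and then false. As a result, the string is split into only two elements, whereas explode(";", $string) returns an array of 5 elements.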
Parameters
string
-
The string being split up into smaller strings (tokens).
token
-
The delimiter used when splitting up string.
Return Values
Returns a string token, or false if no more tokens are available.
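Because a token such as "0" evaluates to false in a boolean context, the return value is best compared with !== false. The sketch below uses made-up input to contrast a loose loop condition with a strict one.

```php
<?php
$input = "10 0 20";

// Loose test: stops early, because the token "0" evaluates to false.
$loose = [];
$tok = strtok($input, " ");
while ($tok) {
    $loose[] = $tok;
    $tok = strtok(" ");
}

// Strict test: only the boolean false ends the loop.
$strict = [];
$tok = strtok($input, " ");
while ($tok !== false) {
    $strict[] = $tok;
    $tok = strtok(" ");
}

var_dump($loose);  // only "10"
var_dump($strict); // "10", "0", "20"
```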
Examples
Example #1 strtok() example
<?php
$string = "This is\tan example\nstring";
/* Use space, tab and newline as tokenizing characters */
$tok = strtok($string, " \n\t");
while ($tok !== false) {
echo "Word=$tok<br />";
$tok = strtok(" \n\t");
}
?>
Example #2 strtok() behavior when an empty part is found
<?php
$first_token = strtok('/something', '/');
$second_token = strtok('/');
var_dump($first_token, $second_token);
?>
The above example will output:
string(9) "something"
bool(false)
Example #3 The difference between strtok() and explode()
<?php
$string = ";aaa;;bbb;";
$parts = [];
$tok = strtok($string, ";");
while ($tok !== false) {
$parts[] = $tok;
$tok = strtok(";");
}
echo json_encode($parts),"\n";
$parts = explode(";", $string);
echo json_encode($parts),"\n";
?>
The above example will output:
["aaa","bbb"]
["","aaa","","bbb",""]
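If the strtok-style result (no empty parts) is wanted as an array in one call, preg_split() with the PREG_SPLIT_NO_EMPTY flag is one possible alternative. This is a sketch for comparison, not something the manual page prescribes.

```php
<?php
$string = ";aaa;;bbb;";

// PREG_SPLIT_NO_EMPTY drops the empty pieces that explode() would keep,
// which matches what a strtok() loop collects for this input.
$parts = preg_split('/;/', $string, -1, PREG_SPLIT_NO_EMPTY);

echo json_encode($parts), "\n"; // ["aaa","bbb"]
```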
Notes
User Contributed Notes 20 notes
<?php
// strtok example
$str = 'Hello to all of Ukraine';
echo strtok($str, ' ').' '.strtok(' ').' '.strtok(' ');
?>
Result:
Hello to all
If memory usage is critical for your solution, keep in mind that strtok holds the input string (or a reference to it?) in memory after use.
<?php
function tokenize($str, $token_symbols) {
$word = strtok($str, $token_symbols);
while (false !== $word) {
// do something here...
$word = strtok($token_symbols);
}
}
?>
Test cases handling a ~10 MB plain-text file:
Case #1 - unset $str variable
<?php
$token_symbols = " \t\n";
$str = file_get_contents('10MB.txt'); // mem usage 9.75383758545 MB (memory_get_usage() / 1024 / 1024)
tokenize($str, $token_symbols); // mem usage 9.75400161743 MB
unset($str); // 9.75395584106 MB
?>
Case #1 result: memory is still used
Case #2 - call strtok again
<?php
$token_symbols = " \t\n";
$str = file_get_contents('10MB.txt'); // 9.75401306152 MB
tokenize($str, $token_symbols); // 9.75417709351
strtok('', ''); // 9.75421524048
?>
Case #2 result: memory is still used
Case #3 - call strtok again AND unset $str variable
<?php
$token_symbols = " \t\n";
$str = file_get_contents('10MB.txt'); // 9.75410079956 MB
tokenize($str, $token_symbols); // 9.75426483154 MB
unset($str);
strtok('', ''); // 0.0543975830078 MB
?>
Case #3 result: memory is freed
So, a better version of the tokenize function:
<?php
function tokenize($str, $token_symbols, $token_reset = true) {
    $word = strtok($str, $token_symbols);
    while (false !== $word) {
        // do something here...
        $word = strtok($token_symbols);
    }
    if ($token_reset) {
        // re-initialize strtok with an empty string so it stops holding $str
        strtok('', '');
    }
}
?>
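The same idea can be wrapped in a generator so that the reset happens even when the caller stops iterating early. This is only a sketch: token_stream() is a hypothetical helper, and the memory effect of the reset is the observation reported in this note, not something verified here. The caller must not use strtok for anything else while iterating, since the state is shared.

```php
<?php
// token_stream() is a hypothetical helper, not part of PHP.
function token_stream(string $str, string $delims): Generator
{
    try {
        for ($tok = strtok($str, $delims); $tok !== false; $tok = strtok($delims)) {
            yield $tok;
        }
    } finally {
        // Re-initialize strtok with an empty string so it no longer holds $str.
        strtok('', '');
    }
}

$words = [];
foreach (token_stream("alpha beta\tgamma", " \t") as $word) {
    $words[] = $word;
}
print_r($words);
```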
<pre><?php
/** get leading, trailing, and embedded separator tokens that were 'skipped'
if for some ungodly reason you are using php to implement a simple parser that
needs to detect nested clauses as it builds a parse tree */
$str = "(((alpha(beta))(gamma))";
$seps = '()';
$tok = strtok( $str,$seps ); // return false on empty string or null
$cur = 0;
$dumbDone = FALSE;
$done = (FALSE===$tok);
while (!$done) {
// process skipped tokens (if any at first iteration) (special for last)
$posTok = $dumbDone ? strlen($str) : strpos($str, $tok, $cur );
$skippedMany = substr( $str, $cur, $posTok-$cur ); // "" when 0 width (false before PHP 8.0)
$lenSkipped = strlen($skippedMany); // 0 in either case
if (0!==$lenSkipped) {
$last = strlen($skippedMany) -1;
for($i=0; $i<=$last; $i++){
$skipped = $skippedMany[$i];
$cur += strlen($skipped);
echo "skipped: $skipped\n";
}
}
if ($dumbDone) break; // this is the only place the loop is terminated
// process current tok
echo "curr tok: ".$tok."\n";
// update cursor
$cur += strlen($tok);
// get any next tok
if (!$dumbDone){
$tok = strtok($seps);
$dumbDone = (FALSE===$tok);
// you're not really done till you check for trailing skipped
}
};
?></pre>
Remove GET variables from the URL
<?php
echo strtok('http://example.com/index.php?foo=1&bar=2', '?');
?>
Result:
http://example.com/index.php
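One caveat follows from the way strtok skips leading delimiters: a URL that starts with "?" yields the query string itself rather than an empty path. The sketch below compares strtok with explode() using a limit of 2. strip_query() is a hypothetical helper written for this comparison.

```php
<?php
// strip_query() is a hypothetical helper shown for comparison.
function strip_query(string $url): string
{
    // explode() with a limit of 2 keeps everything before the first "?",
    // including an empty string when the URL starts with "?".
    return explode('?', $url, 2)[0];
}

$full     = 'http://example.com/index.php?foo=1&bar=2';
$relative = '?foo=1&bar=2';

$full_strtok      = strtok($full, '?');       // "http://example.com/index.php"
$full_explode     = strip_query($full);       // "http://example.com/index.php"
$relative_strtok  = strtok($relative, '?');   // "foo=1&bar=2" (leading "?" is skipped)
$relative_explode = strip_query($relative);   // ""

var_dump($full_strtok, $full_explode, $relative_strtok, $relative_explode);
```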
A simple way to tokenize search keywords, including double- or single-quoted phrases. If only one quote is found, the rest of the string is assumed to be part of that token.
<?php
// $keywords holds the raw search string, e.g. from user input.
$tokens = [];
$token = strtok($keywords, ' ');
// Compare with !== false so that a token such as "0" does not end the loop.
while ($token !== false) {
    // find double quoted tokens
    if ($token[0] == '"') { $token .= ' ' . strtok('"') . '"'; }
    // find single quoted tokens
    if ($token[0] == "'") { $token .= ' ' . strtok("'") . "'"; }
    $tokens[] = $token;
    $token = strtok(' ');
}
?>
Use substr($token, 1) and remove the part that adds the trailing quote if you want your output without quotes.
This might be pointing out the obvious, but if you'd prefer a for loop to a while loop (to keep the token strings on the same line for readability, for example), it can be done. As an added bonus, it doesn't put a $tok assignment outside the loop either.
The downside, however, is that you can't manually free the memory using the technique mentioned by elarlang.
<?php
for ($tok = strtok($str, ' _-.'); $tok !== false; $tok = strtok(' _-.'))
{
    echo "$tok <br />";
}
?>
If you want to tokenize on a single character, explode() is much faster than strtok().
<?php
$str = str_repeat('foo ', 10000);
// explode()
$time = microtime(true);
$arr = explode(' ', $str); // separator first, then the string
$time = microtime(true) - $time;
echo "explode():$time sec." . PHP_EOL;
// strtok()
$time = microtime(true);
$ret = strtok($str, ' '); // string first, then the delimiters
while ($ret !== false) {
    $ret = strtok(' ');
}
$time = microtime(true) - $time;
echo "strtok():$time sec." . PHP_EOL;
?>
Timings reported by the original poster (PHP 5.3.3 on CentOS):
explode():0.001317024230957 sec.
strtok():0.0058917999267578 sec.
explode() is about five times faster on short strings, too.
This looks very simple, but it took me a long time to figure out, so I thought I'd share it in case someone else wants the same thing.
This should work similarly to substr(), but with tokens instead!
<?php
/* subtok(string,chr,pos,len)
*
* chr = character used to separate tokens
* pos = starting position
* len = length, if negative count back from right
*
* subtok('a.b.c.d.e','.',0) = 'a.b.c.d.e'
* subtok('a.b.c.d.e','.',0,2) = 'a.b'
* subtok('a.b.c.d.e','.',2,1) = 'c'
* subtok('a.b.c.d.e','.',2,-1) = 'c.d'
* subtok('a.b.c.d.e','.',-4) = 'b.c.d.e'
* subtok('a.b.c.d.e','.',-4,2) = 'b.c'
* subtok('a.b.c.d.e','.',-4,-1) = 'b.c.d'
*/
function subtok($string,$chr,$pos,$len = NULL) {
return implode($chr,array_slice(explode($chr,$string),$pos,$len));
}
?>
explode() breaks the tokens up into an array, array_slice() allows you to pick the tokens you want, and implode() converts the result back to a string.
Although it's far from a clone, this was inspired by mIRC's gettok() function.
Note that strtok may receive different tokens each time. Therefore, if, for example, you wish to extract several words and then the rest of the sentence:
<?php
$text = "13 202 5 This is a long message explaining the error codes.";
$error1 = strtok($text, " "); //13
$error2 = strtok(" "); //202
$error3 = strtok(" "); //5
$error_message = strtok(""); //Notice the different token parameter
echo $error_message; //This is a long message explaining the error codes.
?>
As of the change in strtok()'s handling of empty strings, it is now useless for scripts that rely on empty data to function.
Take, for instance, a standard HTTP header (with UNIX newlines):
HTTP/1.0 200 OK\n
Content-Type: text/html\n
\n
--HTML BODY HERE---
When parsing this with strtok, one would wait until it found an empty string to signal the end of the header. However, because strtok now skips empty segments, it is impossible to know when the header has ended.
This should not be called `correct' behavior, it certainly is not. It has rendered strtok incapable of (properly) processing a very simple standard.
This new functionality, however, does not affect Windows-style headers: there you would search for a line that contains only "\r".
This, however, is not a justification for the change.
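To illustrate the complaint, the sketch below tokenizes the header from this note with strtok and shows that the blank line never appears as a token. As one possible workaround (an assumption on our part, not something the note proposes), explode() with a limit of 2 splits once on the blank line and keeps the boundary.

```php
<?php
$raw = "HTTP/1.0 200 OK\nContent-Type: text/html\n\n--HTML BODY HERE---";

// strtok() never returns the empty line, so the header/body boundary is lost.
$lines = [];
for ($line = strtok($raw, "\n"); $line !== false; $line = strtok("\n")) {
    $lines[] = $line;
}

// Splitting once on the blank line keeps the boundary.
[$head, $body] = explode("\n\n", $raw, 2);
$headers = explode("\n", $head);

print_r($lines);   // three entries: both header lines and the body line
print_r($headers); // two entries: the header lines only
echo $body, "\n";
```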
Here is a Java-like StringTokenizer class using the strtok function:
<?php
/**
* The string tokenizer class allows an application to break a string into tokens.
*
* @example The following is one example of the use of the tokenizer. The code:
* <code>
* <?php
* $str = "this is:@\t\n a test!";
* $delim = " !@:\t\n"; // remove these chars
* $st = new StringTokenizer($str, $delim);
* while ($st->hasMoreTokens()) {
*     echo $st->nextToken() . "\n";
* }
* ?>
* </code>
* prints the following output:
* this
* is
* a
* test
*/
class StringTokenizer {
/**
* @var string
*/
private $token;
/**
* @var string
*/
private $delim;
/**
* Constructs a string tokenizer for the specified string
* @param string $str String to tokenize
* @param string $delim The set of delimiters (the characters that separate tokens)
* specified at creation time, default to ' '
*/
public function __construct(/*string*/ $str, /*string*/ $delim = ' ') {
$this->token = strtok($str, $delim);
$this->delim = $delim;
}
// No destructor is needed; unset($this) is a compile error since PHP 7.1.
/**
* Tests if there are more tokens available from this tokenizer's string. It
* does not move the internal pointer in any way. To move the internal pointer
* to the next element call nextToken()
* @return boolean - true if has more tokens, false otherwise
*/
public function hasMoreTokens() {
return ($this->token !== false);
}
/**
* Returns the next token from this string tokenizer and advances the internal
* pointer by one.
* @return string - next element in the tokenized string
*/
public function nextToken() {
$current = $this->token;
$this->token = strtok($this->delim);
return $current;
}
}
?>
The Portuguese documentation of strtok is wrong: the output shown for its example 2 is inverted.
Example #2 Old strtok() behavior
<?php
$first_token = strtok('/something', '/');
$second_token = strtok('/');
var_dump ($first_token, $second_token);
?>
The above example will output:
string(0) ""
string(9) "something"
(The output above should be inverted, like this:)
Correct:
string(9) "something"
string(0) ""
(Example 3 is correct.)
Example #3 New strtok() behavior
<?php
$first_token = strtok('/something', '/');
$second_token = strtok('/');
var_dump ($first_token, $second_token);
?>
The above example will output:
string(9) "something"
bool(false)
Here's a simple class that allows you to iterate through string tokens using a foreach loop.
<?php
/**
* The TokenIterator class allows you to iterate through string tokens using
* the familiar foreach control structure.
*
* Example:
* <code>
* <?php
* $string = 'This is a test.';
* $delimiters = ' ';
* $ti = new TokenIterator($string, $delimiters);
*
* foreach ($ti as $count => $token) {
*     printf("%d. %s\n", $count, $token);
* }
*
* // Prints the following output:
* // 0. This
* // 1. is
* // 2. a
* // 3. test.
* </code>
*/
class TokenIterator implements Iterator
{
/**
* The string to tokenize.
* @var string
*/
protected $_string;
/**
* The token delimiters.
* @var string
*/
protected $_delims;
/**
* Stores the current token.
* @var mixed
*/
protected $_token;
/**
* Internal token counter.
* @var int
*/
protected $_counter = 0;
/**
* Constructor.
*
* @param string $string The string to tokenize.
* @param string $delims The token delimiters.
*/
public function __construct($string, $delims)
{
$this->_string = $string;
$this->_delims = $delims;
$this->_token = strtok($string, $delims);
}
/**
* @see Iterator::current()
*/
public function current()
{
return $this->_token;
}
/**
* @see Iterator::key()
*/
public function key()
{
return $this->_counter;
}
/**
* @see Iterator::next()
*/
public function next()
{
$this->_token = strtok($this->_delims);
if ($this->valid()) {
++$this->_counter;
}
}
/**
* @see Iterator::rewind()
*/
public function rewind()
{
$this->_counter = 0;
$this->_token = strtok($this->_string, $this->_delims);
}
/**
* @see Iterator::valid()
*/
public function valid()
{
return $this->_token !== FALSE;
}
}
?>
Please note that strtok's internal state is shared by all PHP code executed in the same request, including included files. This can bite you in unexpected ways if you are not careful.
For example:
<?php
$path = 'dir/file.ext';
$dir_name = strtok($path, '/');
if ($dir_name !== (new Module)->getAllowedDirName()) {
throw new \Exception('Invalid directory name');
}
$file_name = strtok('');
?>
Seems easy enough, but if your Module class is not loaded, this triggers the autoloader. The autoloader *MAY* use strtok inside its loading code.
Or your Module class *MAY* use strtok inside its constructor.
This means you will never get your $file_name correctly.
So: you should *always* group strtok calls, without any external code between two strtok calls.
This would be OK:
<?php
$path = 'dir/file.ext';
$dir_name = strtok($path, '/');
$file_name = strtok('');
if ($dir_name !== (new Module)->getAllowedDirName()) {
throw new \Exception('Invalid directory name');
}
?>
This might cause issues:
<?php
$path = 'one/two#three';
$a = strtok($path, '/');
$b = strtok(Module::NAME_SEPARATOR);
$c = strtok('');
?>
Because your autoloader might be using strtok.
This would be avoided by fetching all parameters used in strtok *before* the calls:
<?php
$path = 'one/two#three';
$separator = Module::NAME_SEPARATOR;
$a = strtok($path, '/');
$b = strtok($separator);
$c = strtok('');
?>
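The interference described in this note can be reproduced without an autoloader. In the sketch below, first_segment() is a hypothetical stand-in for any external code that happens to call strtok. For comparison, explode() with a limit keeps no hidden state and is unaffected.

```php
<?php
// Stand-in for "external code" (an autoloader, a constructor, ...) that
// uses strtok() internally. first_segment() is hypothetical.
function first_segment(string $s): string
{
    return (string) strtok($s, '.');
}

$path = 'dir/file.ext';
$dir  = strtok($path, '/');          // "dir"
first_segment('www.example.com');    // overwrites strtok's internal state
$rest = strtok('');                  // remainder of the OTHER string

// explode() with a limit keeps no hidden state, so nothing can interfere.
[$dir2, $file2] = explode('/', $path, 2);

var_dump($dir, $rest);   // "dir", "example.com" (not "file.ext")
var_dump($dir2, $file2); // "dir", "file.ext"
```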