mirror of
https://github.com/zoe-may/TDoG-Skin.git
synced 2025-01-19 23:17:22 +08:00
614 lines
26 KiB
Plaintext
614 lines
26 KiB
Plaintext
|
<?xml version="1.0" encoding="utf-8"?>
|
|||
|
|
|||
|
<overlay xmlns="http://hoa-project.net/xyl/xylophone">
|
|||
|
<yield id="chapter">
|
|||
|
|
|||
|
<p>Strings can sometimes be <strong>complex</strong>, especially when they use
|
|||
|
the <code>Unicode</code> encoding format. The <code>Hoa\Ustring</code> library
|
|||
|
provides several operations on UTF-8 strings.</p>
|
|||
|
|
|||
|
<h2 id="Table_of_contents">Table of contents</h2>
|
|||
|
|
|||
|
<tableofcontents id="main-toc" />
|
|||
|
|
|||
|
<h2 id="Introduction" for="main-toc">Introduction</h2>
|
|||
|
|
|||
|
<p>When we manipulate strings, the <a href="http://unicode.org/">Unicode</a>
|
|||
|
format establishes itself because of its <strong>compatibility</strong> with
|
|||
|
historical formats (like ASCII) and its capacity to understand a
|
|||
|
<strong>large</strong> range of characters and symbols for all cultures and
|
|||
|
all regions in the world. PHP provides several tools to manipulate such
|
|||
|
strings, like the following extensions:
|
|||
|
<a href="http://php.net/mbstring"><code>mbstring</code></a>,
|
|||
|
<a href="http://php.net/iconv"><code>iconv</code></a> or also the excellent
|
|||
|
<a href="http://php.net/intl"><code>intl</code></a> which is based on
|
|||
|
<a href="http://icu-project.org/">ICU</a>, the reference implementation of
|
|||
|
Unicode. Unfortunately, sometimes we have to mix these extensions to achieve
|
|||
|
our aims and at the cost of a certain <strong>complexity</strong> along with
|
|||
|
a regrettable <strong>verbosity</strong>.</p>
|
|||
|
<p>The <code>Hoa\Ustring</code> library answers to these issues by providing a
|
|||
|
<strong>simple</strong> way to manipulate strings with
|
|||
|
<strong>performance</strong> and <strong>efficiency</strong> in minds. It
|
|||
|
also provides some evoluated algorithms to perform <strong>search</strong>
|
|||
|
operations on strings.</p>
|
|||
|
|
|||
|
<h2 id="Unicode_strings" for="main-toc">Unicode strings</h2>
|
|||
|
|
|||
|
<p>The <code>Hoa\Ustring\Ustring</code> class represents a
|
|||
|
<strong>UTF-8</strong> Unicode strings and allows to manipulate it easily.
|
|||
|
This class implements the
|
|||
|
<a href="http://php.net/arrayaccess"><code>ArrayAccess</code></a>,
|
|||
|
<a href="http://php.net/countable"><code>Countable</code></a> and
|
|||
|
<a href="http://php.net/iteratoraggregate"><code>IteratorAggregate</code></a>
|
|||
|
interfaces. We are going to use three examples in three different languages:
|
|||
|
French, Arab and Japanese. Thus:</p>
|
|||
|
<pre><code class="language-php">$french = new Hoa\Ustring\Ustring('Je t\'aime');
|
|||
|
$arabic = new Hoa\Ustring\Ustring('أحبك');
|
|||
|
$japanese = new Hoa\Ustring\Ustring('私はあなたを愛して');</code></pre>
|
|||
|
<p>Now, let's see what we can do on these three strings.</p>
|
|||
|
|
|||
|
<h3 id="String_manipulation" for="main-toc">String manipulation</h3>
|
|||
|
|
|||
|
<p>Let's start with <strong>elementary</strong> operations. If we would like
|
|||
|
to <strong>count</strong> the number of characters (not bytes), we will use
|
|||
|
the <a href="http://php.net/count"><code>count</code> function</a>. Thus:</p>
|
|||
|
<pre><code class="language-php">var_dump(
|
|||
|
count($french),
|
|||
|
count($arabic),
|
|||
|
count($japanese)
|
|||
|
);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* int(9)
|
|||
|
* int(4)
|
|||
|
* int(9)
|
|||
|
*/</code></pre>
|
|||
|
<p>When we speak about text position, it is not suitable to speak about the
|
|||
|
right or the left, but rather about a <strong>beginning</strong> or an
|
|||
|
<strong>end</strong>, and based on the <strong>direction</strong> of writing.
|
|||
|
We can know this direction thanks to the
|
|||
|
<code>Hoa\Ustring\Ustring::getDirection</code> method. It returns the value of
|
|||
|
one of the following constants:</p>
|
|||
|
<ul>
|
|||
|
<li><code>Hoa\Ustring\Ustring::LTR</code>, for left-to-right, if the text is
|
|||
|
written from the left to the right,</li>
|
|||
|
<li><code>Hoa\Ustring\Ustring::RTL</code>, for right-to-left, if the text is
|
|||
|
written from the right to the left.</li>
|
|||
|
</ul>
|
|||
|
<p>Let's observe the result with our examples:</p>
|
|||
|
<pre><code class="language-php">var_dump(
|
|||
|
$french->getDirection() === Hoa\Ustring\Ustring::LTR, // is left-to-right?
|
|||
|
$arabic->getDirection() === Hoa\Ustring\Ustring::RTL, // is right-to-left?
|
|||
|
$japanese->getDirection() === Hoa\Ustring\Ustring::LTR // is left-to-right?
|
|||
|
);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* bool(true)
|
|||
|
* bool(true)
|
|||
|
* bool(true)
|
|||
|
*/</code></pre>
|
|||
|
<p>The result of this method is computed thanks to the
|
|||
|
<code>Hoa\Ustring\Ustring::getCharDirection</code> static method which computes
|
|||
|
the direction of only one character.</p>
|
|||
|
<p>If we would like to <strong>concatenate</strong> another string to the end
|
|||
|
or to the beginning, we will respectively use the
|
|||
|
<code>Hoa\Ustring\Ustring::append</code> and
|
|||
|
<code>Hoa\Ustring\Ustring::prepend</code> methods. These methods, like most of
|
|||
|
the ones which modifies the string, return the object itself, in order to
|
|||
|
chain the calls. For instance:</p>
|
|||
|
<pre><code class="language-php">echo $french->append('… et toi, m\'aimes-tu ?')->prepend('Mam\'zelle ! ');
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* Mam'zelle ! Je t'aime… et toi, m'aimes-tu ?
|
|||
|
*/</code></pre>
|
|||
|
<p>We also have the <code>Hoa\Ustring\Ustring::toLowerCase</code> and
|
|||
|
<code>Hoa\Ustring\Ustring::toUpperCase</code> methods to, respectively, set
|
|||
|
the case of the string to lower or upper. For instance:</p>
|
|||
|
<pre><code class="language-php">echo $french->toUpperCase();
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* MAM'ZELLE ! JE T'AIME… ET TOI, M'AIMES-TU ?
|
|||
|
*/</code></pre>
|
|||
|
<p>We can also add characters to the beginning or to the end of the string to
|
|||
|
reach a <strong>minimum</strong> length. This operation is frequently called
|
|||
|
the <em>padding</em> (for historical reasons dating back to typewriters).
|
|||
|
That's why we have the <code>Hoa\Ustring\Ustring::pad</code> method which
|
|||
|
takes three arguments: the minimum length, characters to add and a constant
|
|||
|
indicating whether we have to add at the end or at the beginning of the string
|
|||
|
(respectively <code>Hoa\Ustring\Ustring::END</code>, by default, and
|
|||
|
<code>Hoa\Ustring\Ustring::BEGINNING</code>).</p>
|
|||
|
<pre><code class="language-php">echo $arabic->pad(20, ' ');
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* أحبك
|
|||
|
*/</code></pre>
|
|||
|
<p>A similar operation allows to remove, by default, <strong>spaces</strong>
|
|||
|
at the beginning and at the end of the string thanks to the
|
|||
|
<code>Hoa\Ustring\Ustring::trim</code> method. For example, to retreive our
|
|||
|
original Arabic string:</p>
|
|||
|
<pre><code class="language-php">echo $arabic->trim();
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* أحبك
|
|||
|
*/</code></pre>
|
|||
|
<p>If we would like to remove other characters, we can use its first argument
|
|||
|
which must be a regular expression. Finally, its second argument allows to
|
|||
|
specify from what side we would like to remove character: at the beginning, at
|
|||
|
the end or both, still by using the
|
|||
|
<code>Hoa\Ustring\Ustring::BEGINNING</code> and
|
|||
|
<code>Hoa\Ustring\Ustring::END</code> constants.</p>
|
|||
|
<p>If we would like to remove other characters, we can use its first argument
|
|||
|
which must be a regular expression. Finally, its second argument allows to
|
|||
|
specify the side where to remove characters: at the beginning, at the end or
|
|||
|
both, still by using the <code>Hoa\Ustring\Ustring::BEGINNING</code> and
|
|||
|
<code>Hoa\Ustring\Ustring::END</code> constants. We can combine these
|
|||
|
constants to express “both sides”, which is the default value:
|
|||
|
<code class="language-php">Hoa\Ustring\Ustring::BEGINNING |
|
|||
|
Hoa\Ustring\Ustring::END</code>. For example, to remove all the numbers and
|
|||
|
the spaces only at the end, we will write:</p>
|
|||
|
<pre><code class="language-php">$arabic->trim('\s|\d', Hoa\Ustring\Ustring::END);</code></pre>
|
|||
|
<p>We can also <strong>reduce</strong> the string to a
|
|||
|
<strong>sub-string</strong> by specifying the position of the first character
|
|||
|
followed by the length of the sub-string to the
|
|||
|
<code>Hoa\Ustring\Ustring::reduce</code> method:</p>
|
|||
|
<pre><code class="language-php">echo $french->reduce(3, 6)->reduce(2, 4);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* aime
|
|||
|
*/</code></pre>
|
|||
|
<p>If we would like to get a specific character, we can rely on the
|
|||
|
<code>ArrayAccess</code> interface. For instance, to get the first character
|
|||
|
of each of our examples (from their original definitions):</p>
|
|||
|
<pre><code class="language-php">var_dump(
|
|||
|
$french[0],
|
|||
|
$arabic[0],
|
|||
|
$japanese[0]
|
|||
|
);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* string(1) "J"
|
|||
|
* string(2) "أ"
|
|||
|
* string(3) "私"
|
|||
|
*/</code></pre>
|
|||
|
<p>If we would like the last character, we will use the -1 index. The index is
|
|||
|
not bounded to the length of the string. If the index exceeds this length,
|
|||
|
then a <em>modulo</em> will be applied.</p>
|
|||
|
<p>We can also modify or remove a specific character with this method. For
|
|||
|
example:</p>
|
|||
|
<pre><code class="language-php">$french->append(' ?');
|
|||
|
$french[-1] = '!';
|
|||
|
echo $french;
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* Je t'aime !
|
|||
|
*/</code></pre>
|
|||
|
<p>Another very useful method is the <strong>ASCII</strong> transformation.
|
|||
|
Be careful, this is not always possible, according to your settings. For
|
|||
|
example:</p>
|
|||
|
<pre><code class="language-php">$title = new Hoa\Ustring\Ustring('Un été brûlant sur la côte');
|
|||
|
echo $title->toAscii();
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* Un ete brulant sur la cote
|
|||
|
*/</code></pre>
|
|||
|
<p>We can also transform from Arabic or Japanese to ASCII. Symbols, like
|
|||
|
Mathemeticals symbols or emojis, are also transformed:</p>
|
|||
|
<pre><code class="language-php">$emoji = new Hoa\Ustring\Ustring('I ❤ Unicode');
|
|||
|
$maths = new Hoa\Ustring\Ustring('∀ i ∈ ℕ');
|
|||
|
|
|||
|
echo
|
|||
|
$arabic->toAscii(), "\n",
|
|||
|
$japanese->toAscii(), "\n",
|
|||
|
$emoji->toAscii(), "\n",
|
|||
|
$maths->toAscii(), "\n";
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* ahbk
|
|||
|
* sihaanatawo aishite
|
|||
|
* I (heavy black heart)️ Unicode
|
|||
|
* (for all) i (element of) N
|
|||
|
*/</code></pre>
|
|||
|
<p>In order this method to work correctly, the
|
|||
|
<a href="http://php.net/intl"><code>intl</code></a> extension needs to be
|
|||
|
present, so that the
|
|||
|
<a href="http://php.net/transliterator"><code>Transliterator</code></a> class
|
|||
|
is present. If it does not exist, the
|
|||
|
<a href="http://php.net/normalizer"><code>Normalizer</code></a> class must
|
|||
|
exist. If this class does not exist neither, the
|
|||
|
<code>Hoa\Ustring\Ustring::toAscii</code> method can still try a
|
|||
|
transformation, but it is less efficient. To activate this last solution,
|
|||
|
<code>true</code> must be passed as a single argument. This <em lang="fr">tour
|
|||
|
de force</em> is not recommended in most cases.</p>
|
|||
|
<p>We also find the <code>getTransliterator</code> method which returns a
|
|||
|
<code>Transliterator</code> object, or <code>null</code> if this class does
|
|||
|
not exist. This method takes a transliteration identifier as argument. We
|
|||
|
suggest to <a href="http://userguide.icu-project.org/transforms/general">read
|
|||
|
the documentation about the transliterator of ICU</a> to understand this
|
|||
|
identifier. The <code>transliterate</code> method allows to transliterate the
|
|||
|
current string based on an identifier and a beginning index and an end
|
|||
|
one. This method works the same way than the
|
|||
|
<a href="http://php.net/transliterator.transliterate"><code>Transliterator::transliterate</code></a>
|
|||
|
method.</p>
|
|||
|
<p>More generally, to change the <strong>encoding</strong> format, we can use
|
|||
|
the <code>Hoa\Ustring\Ustring::transcode</code> static method, with a string
|
|||
|
as first argument, the original encoding format as second argument and the
|
|||
|
expected encoding format as third argument (UTF-8 by default). The get the
|
|||
|
list of encoding formats, we have to refer to the
|
|||
|
<a href="http://php.net/iconv"><code>iconv</code></a> extension or to use the
|
|||
|
following command line in a terminal:</p>
|
|||
|
<pre><code class="language-php">$ iconv --list</code></pre>
|
|||
|
<p>To know if a string is encoded in UTF-8, we can use the
|
|||
|
<code>Hoa\Ustring\Ustring::isUtf8</code> static method; for instance:</p>
|
|||
|
<pre><code class="language-php">var_dump(
|
|||
|
Hoa\Ustring\Ustring::isUtf8('a'),
|
|||
|
Hoa\Ustring\Ustring::isUtf8(Hoa\Ustring\Ustring::transcode('a', 'UTF-8', 'UTF-16'))
|
|||
|
);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* bool(true)
|
|||
|
* bool(false)
|
|||
|
*/</code></pre>
|
|||
|
<p>We can <strong>split</strong> the string into several sub-strings by using
|
|||
|
the <code>Hoa\Ustring\Ustring::split</code> method. As first argument, we have
|
|||
|
a regular expression (of kind <a href="http://pcre.org/">PCRE</a>), then an
|
|||
|
integer representing the maximum number of elements to return and finally a
|
|||
|
combination of constants. These constants are the same as the ones of
|
|||
|
<a href="http://php.net/preg_split"><code>preg_split</code></a>.</p>
|
|||
|
<p>By default, the second argument is set to -1, which means infinity, and the
|
|||
|
last argument is set to <code>PREG_SPLIT_NO_EMPTY</code>. Thus, if we would
|
|||
|
like to get all the words of a string, we will write:</p>
|
|||
|
<pre><code class="language-php">print_r($title->split('#\b|\s#'));
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* Array
|
|||
|
* (
|
|||
|
* [0] => Un
|
|||
|
* [1] => ete
|
|||
|
* [2] => brulant
|
|||
|
* [3] => sur
|
|||
|
* [4] => la
|
|||
|
* [5] => cote
|
|||
|
* )
|
|||
|
*/</code></pre>
|
|||
|
<p>If we would like to <strong>iterate</strong> over all the
|
|||
|
<strong>characters</strong>, it is recommended to use the
|
|||
|
<code>IteratorAggregate</code> method, being the
|
|||
|
<code>Hoa\Ustring\Ustring::getIterator</code> method. Let's see on the Arabic
|
|||
|
example:</p>
|
|||
|
<pre><code class="language-php">foreach ($arabic as $letter) {
|
|||
|
echo $letter, "\n";
|
|||
|
}
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* أ
|
|||
|
* ح
|
|||
|
* ب
|
|||
|
* ك
|
|||
|
*/</code></pre>
|
|||
|
<p>We notice that the iteration is based on the text direction, it means that
|
|||
|
the first element of the iteration is the first letter of the string starting
|
|||
|
from the beginning.</p>
|
|||
|
<p>Of course, if we would like to get an array of characters, we can use the
|
|||
|
<a href="http://php.net/iterator_to_array"><code>iterator_to_array</code></a>
|
|||
|
PHP function:</p>
|
|||
|
<pre><code class="language-php">print_r(iterator_to_array($arabic));
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* Array
|
|||
|
* (
|
|||
|
* [0] => أ
|
|||
|
* [1] => ح
|
|||
|
* [2] => ب
|
|||
|
* [3] => ك
|
|||
|
* )
|
|||
|
*/</code></pre>
|
|||
|
|
|||
|
<h3 id="Comparison_and_search" for="main-toc">Comparison and search</h3>
|
|||
|
|
|||
|
<p>Strings can also be <strong>compared</strong> thanks to the
|
|||
|
<code>Hoa\Ustring\Ustring::compare</code> method:</p>
|
|||
|
<pre><code class="language-php">$string = new Hoa\Ustring\Ustring('abc');
|
|||
|
var_dump(
|
|||
|
$string->compare('wxyz')
|
|||
|
);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* string(-1)
|
|||
|
*/</code></pre>
|
|||
|
<p>This methods returns -1 if the initial string comes before (in the
|
|||
|
alphabetical order), 0 if it is identical and 1 if it comes after. If we
|
|||
|
would like to use all the power of the underlying mechanism, we can call the
|
|||
|
<code>Hoa\Ustring\Ustring::getCollator</code> static method (if the
|
|||
|
<a href="http://php.net/Collator"><code>Collator</code></a> class exists, else
|
|||
|
<code>Hoa\Ustring\Ustring::compare</code> will use a simple byte to bytes
|
|||
|
comparison without taking care of the other parameters). Thus, if we would
|
|||
|
like to sort an array of strings, we will write:</p>
|
|||
|
<pre><code class="language-php">$strings = array('c', 'Σ', 'd', 'x', 'α', 'a');
|
|||
|
Hoa\Ustring\Ustring::getCollator()->sort($strings);
|
|||
|
print_r($strings);
|
|||
|
|
|||
|
/**
|
|||
|
* Could output:
|
|||
|
* Array
|
|||
|
* (
|
|||
|
* [0] => a
|
|||
|
* [1] => c
|
|||
|
* [2] => d
|
|||
|
* [3] => x
|
|||
|
* [4] => α
|
|||
|
* [5] => Σ
|
|||
|
* )
|
|||
|
*/</code></pre>
|
|||
|
<p>Comparison between two strings depends on the <strong>locale</strong>, it
|
|||
|
means of the localization of the system, like the language, the country, the
|
|||
|
region etc. We can use the
|
|||
|
<a href="@hack:chapter=Locale"><code>Hoa\Locale</code> library</a> to modify
|
|||
|
these data, but it's not a dependence of <code>Hoa\Ustring</code>.</p>
|
|||
|
<p>We can also know if a string <strong>matches</strong> a certain pattern,
|
|||
|
still expressed with a regular expression. To achieve that, we will use the
|
|||
|
<code>Hoa\Ustring\Ustring::match</code> method. This method relies on the
|
|||
|
<a href="http://php.net/preg_match"><code>preg_match</code></a> and
|
|||
|
<a href="http://php.net/preg_match_all"><code>preg_match_all</code></a> PHP
|
|||
|
functions, but by modifying the pattern's options to ensure the Unicode
|
|||
|
support. We have the following parameters: the pattern, a variable passed by
|
|||
|
reference to collect the matches, flags, an offset and finally a boolean
|
|||
|
indicating whether the search is global or not (respectively if we have to use
|
|||
|
<code>preg_match_all</code> or <code>preg_match</code>). By default, the
|
|||
|
search is not global.</p>
|
|||
|
<p>Thus, we will check that our French example contains <code>aime</code> with
|
|||
|
a direct object complement:</p>
|
|||
|
<pre><code class="language-php">$french->match('#(?:(?&lt;direct_object>\w)[\'\b])aime#', $matches);
|
|||
|
var_dump($matches['direct_object']);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* string(1) "t"
|
|||
|
*/</code></pre>
|
|||
|
<p>This method returns <code>false</code> if an error is raised (for example
|
|||
|
if the pattern is not correct), 0 if no match has been found, the number of
|
|||
|
matches else.</p>
|
|||
|
<p>Similarly, we can <strong>search</strong> and <strong>replace</strong>
|
|||
|
sub-strings by other sub-strings based on a pattern, still expressed with a
|
|||
|
regular expression. To achieve that, we will use the
|
|||
|
<code>Hoa\Ustring\Ustring::replace</code> method. This method uses the
|
|||
|
<a href="http://php.net/preg_replace"><code>preg_replace</code></a> and
|
|||
|
<a href="http://php.net/preg_replace_callback"><code>preg_replace_callback</code></a>
|
|||
|
PHP functions, but still by modifying the pattern's options to ensure the
|
|||
|
Unicode support. As first argument, we find one or more patterns, as second
|
|||
|
argument, one or more replacements and as last argument the limit of
|
|||
|
replacements to apply. If the replacement is a callable, then the
|
|||
|
<code>preg_replace_callback</code> function will be used.</p>
|
|||
|
<p>Thus, we will modify our French example to be more polite:</p>
|
|||
|
<pre><code class="language-php">$french->replace('#(?:\w[\'\b])(?&lt;verb>aime)#', function ($matches) {
|
|||
|
return 'vous ' . $matches['verb'];
|
|||
|
});
|
|||
|
|
|||
|
echo $french;
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* Je vous aime
|
|||
|
*/</code></pre>
|
|||
|
<p>The <code>Hoa\Ustring\Ustring</code> class provides constants which are
|
|||
|
aliases of existing PHP constants and ensure a better readability of the
|
|||
|
code:</p>
|
|||
|
<ul>
|
|||
|
<li><code>Hoa\Ustring\Ustring::WITHOUT_EMPTY</code>, alias of
|
|||
|
<code>PREG_SPLIT_NO_EMPTY</code>,</li>
|
|||
|
<li><code>Hoa\Ustring\Ustring::WITH_DELIMITERS</code>, alias of
|
|||
|
<code>PREG_SPLIT_DELIM_CAPTURE</code>,</li>
|
|||
|
<li><code>Hoa\Ustring\Ustring::WITH_OFFSET</code>, alias of
|
|||
|
<code>PREG_OFFSET_CAPTURE</code> and
|
|||
|
<code>PREG_SPLIT_OFFSET_CAPTURE</code>,</li>
|
|||
|
<li><code>Hoa\Ustring\Ustring::GROUP_BY_PATTERN</code>, alias of
|
|||
|
<code>PREG_PATTERN_ORDER</code>,</li>
|
|||
|
<li><code>Hoa\Ustring\Ustring::GROUP_BY_TUPLE</code>, alias of
|
|||
|
<code>PREG_SET_ORDER</code>.</li>
|
|||
|
</ul>
|
|||
|
<p>Because they are strict aliases, we can write:</p>
|
|||
|
<pre><code class="language-php">$string = new Hoa\Ustring\Ustring('abc1 defg2 hikl3 xyz4');
|
|||
|
$string->match(
|
|||
|
'#(\w+)(\d)#',
|
|||
|
$matches,
|
|||
|
Hoa\Ustring\Ustring::WITH_OFFSET
|
|||
|
| Hoa\Ustring\Ustring::GROUP_BY_TUPLE,
|
|||
|
0,
|
|||
|
true
|
|||
|
);</code></pre>
|
|||
|
|
|||
|
<h3 id="Characters" for="main-toc">Characters</h3>
|
|||
|
|
|||
|
<p>The <code>Hoa\Ustring\Ustring</code> class offers static methods working on
|
|||
|
a single Unicode character. We have already mentionned the
|
|||
|
<code>getCharDirection</code> method which allows to know the
|
|||
|
<strong>direction</strong> of a character. We also have the
|
|||
|
<code>getCharWidth</code> which counts the <strong>number of columns</strong>
|
|||
|
necessary to print a single character. Thus:</p>
|
|||
|
<pre><code class="language-php">var_dump(
|
|||
|
Hoa\Ustring\Ustring::getCharWidth(Hoa\Ustring\Ustring::fromCode(0x7f)),
|
|||
|
Hoa\Ustring\Ustring::getCharWidth('a'),
|
|||
|
Hoa\Ustring\Ustring::getCharWidth('㽠')
|
|||
|
);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* int(-1)
|
|||
|
* int(1)
|
|||
|
* int(2)
|
|||
|
*/</code></pre>
|
|||
|
<p>This method returns -1 or 0 if the character is not
|
|||
|
<strong>printable</strong> (for instance, if this is a control character, like
|
|||
|
<code>0x7f</code> which corresponds to <code>DELETE</code>), 1 or more if this
|
|||
|
is a character that can be printed. In our example, <code>㽠</code> requires
|
|||
|
2 columns to be printed.</p>
|
|||
|
<p>To get more semantics, we have the
|
|||
|
<code>Hoa\Ustring\Ustring::isCharPrintable</code> method which allows to know
|
|||
|
whether a character is printable or not.</p>
|
|||
|
<p>If we would like to count the number of columns necessary for a whole
|
|||
|
string, we have to use the <code>Hoa\Ustring\Ustring::getWidth</code> method.
|
|||
|
Thus:</p>
|
|||
|
<pre><code class="language-php">var_dump(
|
|||
|
$french->getWidth(),
|
|||
|
$arabic->getWidth(),
|
|||
|
$japanese->getWidth()
|
|||
|
);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* int(9)
|
|||
|
* int(4)
|
|||
|
* int(18)
|
|||
|
*/</code></pre>
|
|||
|
<p>Try this in your terminal with a <strong>monospaced</strong> font. You will
|
|||
|
observe that Japanese requires 18 columns to be printed. This measure is very
|
|||
|
useful if we would like to know the length of a string to position it
|
|||
|
efficiently.</p>
|
|||
|
<p>The <code>getCharWidth</code> method is different of <code>getWidth</code>
|
|||
|
because it includes control characters. This method is intended to be used,
|
|||
|
for example, with terminals (please, see the
|
|||
|
<a href="@hack:chapter=Console"><code>Hoa\Console</code> library</a>).</p>
|
|||
|
<p>Finally, if this time we are not interested by Unicode characters but
|
|||
|
rather by <strong>machine</strong> characters <code>char</code> (being
|
|||
|
1 byte), we have an extra operation. The
|
|||
|
<code>Hoa\Ustring\Ustring::getBytesLength</code> method will count the
|
|||
|
<strong>length</strong> of the string in bytes:</p>
|
|||
|
<pre><code class="language-php">var_dump(
|
|||
|
$arabic->getBytesLength(),
|
|||
|
$japanese->getBytesLength()
|
|||
|
);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* int(8)
|
|||
|
* int(27)
|
|||
|
*/</code></pre>
|
|||
|
<p>If we compare these results with the ones of the
|
|||
|
<code>Hoa\Ustring\Ustring::count</code> method, we understand that the Arabic
|
|||
|
characters are encoded with 2 bytes whereas Japanese characteres are encoded
|
|||
|
with 3 bytes. We can also get a specific byte thanks to the
|
|||
|
<code>Hoa\Ustring\Ustring::getByteAt</code> method. Once again, the index is
|
|||
|
not bounded.</p>
|
|||
|
|
|||
|
<h3 id="Code-point" for="main-toc">Code-point</h3>
|
|||
|
|
|||
|
<p>Each character is represented by an integer, called a
|
|||
|
<strong>code-point</strong>. To get the code-point of a character, we can
|
|||
|
use the <code>Hoa\Ustring\Ustring::toCode</code> static method, and to get a
|
|||
|
character based on its code-point, we can use the
|
|||
|
<code>Hoa\Ustring\Ustring::fromCode</code> static method. We also have the
|
|||
|
<code>Hoa\Ustring\Ustring::toBinaryCode</code> method which returns the binary
|
|||
|
representation of a character. Let's take an example:</p>
|
|||
|
<pre><code class="language-php">var_dump(
|
|||
|
Hoa\Ustring\Ustring::toCode('Σ'),
|
|||
|
Hoa\Ustring\Ustring::toBinaryCode('Σ'),
|
|||
|
Hoa\Ustring\Ustring::fromCode(0x1a9)
|
|||
|
);
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* int(931)
|
|||
|
* string(32) "1100111010100011"
|
|||
|
* string(2) "Σ"
|
|||
|
*/</code></pre>
|
|||
|
|
|||
|
<h2 id="Search_algorithms" for="main-toc">Search algorithms</h2>
|
|||
|
|
|||
|
<p>The <code>Hoa\Ustring</code> library provides sophisticated
|
|||
|
<strong>search</strong> algorithms on strings through the
|
|||
|
<code>Hoa\Ustring\Search</code> class.</p>
|
|||
|
<p>We will study the <code>Hoa\Ustring\Search::approximated</code> algorithm
|
|||
|
which searches a sub-string in a string up to <strong><em>k</em>
|
|||
|
differences</strong> (a difference is an addition, a deletion or a
|
|||
|
modification). Let's take the classical example of a DNA representation: We
|
|||
|
will search all the sub-strings approximating <code>GATAA</code> with
|
|||
|
1 difference (maximum) in <code>CAGATAAGAGAA</code>. So, we will write:</p>
|
|||
|
<pre><code class="language-php">$x = 'GATAA';
|
|||
|
$y = 'CAGATAAGAGAA';
|
|||
|
$k = 1;
|
|||
|
$search = Hoa\Ustring\Search::approximated($y, $x, $k);
|
|||
|
$n = count($search);
|
|||
|
|
|||
|
echo 'Try to match ', $x, ' in ', $y, ' with at most ', $k, ' difference(s):', "\n";
|
|||
|
echo $n, ' match(es) found:', "\n";
|
|||
|
|
|||
|
foreach ($search as $position) {
|
|||
|
echo ' • ', substr($y, $position['i'], $position['l'), "\n";
|
|||
|
}
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* Try to match GATAA in CAGATAAGAGAA with at most 1 difference(s):
|
|||
|
* 4 match(es) found:
|
|||
|
* • AGATA
|
|||
|
* • GATAA
|
|||
|
* • ATAAG
|
|||
|
* • GAGAA
|
|||
|
*/</code></pre>
|
|||
|
<p>This methods returns an array of arrays. Each sub-array represents a result
|
|||
|
and contains three indexes: <code>i</code> for the position of the first
|
|||
|
character (byte) of the result, <code>j</code> for the position of the last
|
|||
|
character and <code>l</code> for the length of the result (simply
|
|||
|
<code>j</code> - <code>i</code>). Thus, we can compute the results by using
|
|||
|
our initial string (here <code class="language-php">$y</code>) and its
|
|||
|
indexes.</p>
|
|||
|
<p>With our example, we have four results. The first is <code>AGATA</code>,
|
|||
|
being <code>GATA<em>A</em></code> with one moved character, and
|
|||
|
<code>AGATA</code> exists in <code>C<em>AGATA</em>AGAGAA</code>. The second
|
|||
|
result is <code>GATAA</code>, our sub-string, which well and truly exists in
|
|||
|
<code>CA<em>GATAA</em>GAGAA</code>. The third result is <code>ATAAG</code>,
|
|||
|
being <code><em>G</em>ATAA</code> with one moved character, and
|
|||
|
<code>ATAAG</code> exists in <code>CAG<em>ATAAG</em>AGAA</code>. Finally, the
|
|||
|
last result is <code>GAGAA</code>, being <code>GA<em>T</em>AA</code> with one
|
|||
|
modified character, and <code>GAGAA</code> exists in
|
|||
|
<code>CAGATAA<em>GAGAA</em></code>.</p>
|
|||
|
<p>Another example, more concrete this time. We will consider the
|
|||
|
<code>--testIt --foobar --testThat --testAt</code> string (which represents
|
|||
|
possible options of a command line), and we will search <code>--testot</code>,
|
|||
|
an option that should have been given by the user. This option does not exist
|
|||
|
as it is. We will then use our search algorithm with at most 1 difference.
|
|||
|
Let's see:</p>
|
|||
|
<pre><code class="language-php">$x = 'testot';
|
|||
|
$y = '--testIt --foobar --testThat --testAt';
|
|||
|
$k = 1;
|
|||
|
$search = Hoa\Ustring\Search::approximated($y, $x, $k);
|
|||
|
$n = count($search);
|
|||
|
|
|||
|
// …
|
|||
|
|
|||
|
/**
|
|||
|
* Will output:
|
|||
|
* Try to match testot in --testIt --foobar --testThat --testAt with at most 1 difference(s)
|
|||
|
* 2 match(es) found:
|
|||
|
* • testIt
|
|||
|
* • testAt
|
|||
|
*/</code></pre>
|
|||
|
<p>The <code>testIt</code> and <code>testAt</code> results are true options,
|
|||
|
so we can suggest them to the user. This is a mechanism user by
|
|||
|
<code>Hoa\Console</code> to suggest corrections to the user in case of a
|
|||
|
mistyping.</p>
|
|||
|
|
|||
|
<h2 id="Conclusion" for="main-toc">Conclusion</h2>
|
|||
|
|
|||
|
<p>The <code>Hoa\Ustring</code> library provides facilities to manipulate
|
|||
|
strings encoded with the Unicode format, but also to make sophisticated search
|
|||
|
on strings.</p>
|
|||
|
|
|||
|
</yield>
|
|||
|
</overlay>
|