GT.M Version 5.2-000 Technical Bulletin

GT.M Support for the Unicode™ Standard

Legal Notice

GT.M™ is a trademark of Fidelity Information Services, Inc.

Unicode™ - "Unicode is a trademark of Unicode, Inc."

GT.M and its documentation are provided pursuant to license agreements containing restrictions on their use. They are the copyrighted intellectual property of Fidelity National Information Services, Inc. and Sanchez Computer Associates, LLC (collectively "Fidelity") and are protected by U.S. copyright law. They may not be copied or distributed in any form or medium, disclosed to third parties, or used in any manner not authorized in said license agreement except with prior written authorization from Fidelity.

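This document contains a description of GT.M and the operating instructions pertaining to the various functions that comprise the system. It should not be construed as a commitment of Fidelity. FIS believes the information in this document is accurate as of its publication date; such information is subject to change without notice. Fidelity is not responsible for any inadvertent errors.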

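August 16, 2010

Revision History
Revision 1.3	August 20, 2010
In Limitations, "UTF-16 is not supported for the $PRINCIPAL device" was replaced with a new section called "The encoding of the $PRINCIPAL device is determined at process startup".
Revision 1.2	March 22, 2010
Fixed broken links to the GT.M Programmer's Guide.
Revision 1.1	January 19, 2007
First published version.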

Contact Information

GT.M Group
Fidelity National Information Services, Inc.
2 West Liberty Boulevard, Suite 300
Malvern, PA 19355,
United States of America

GT.M Support: +1 (610) 578-4226
Switchboard: +1 (610) 296-8877
Fax: +1 (484) 595-5101
Website: http://fis-gtm.com
Email: gtmsupport@fnis.com

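Introduction

Starting with the V5.2-000 release, GT.M provides support for Unicode™ version 5.0.0. Releases of the Unicode™ Standard and releases of ISO/IEC-10646 track each other (see http://www.unicode.org/faq/unicode_iso.html for details).

The purpose of this technical bulletin is to describe, with practical examples, the extensions to GT.M language features and utility programs in V5.2-000, and to summarize discussions and best practices.

An understanding of Unicode™ is a prerequisite for using GT.M's Unicode™-related functionality. For more information on Unicode™, see:

The Unicode® Consortium (http://www.unicode.org) develops standards in the area of internationalization that define the behavior of, and relationships between, Unicode characters.
The Wikipedia entry on Unicode™ (http://en.wikipedia.org/wiki/Unicode) is an excellent resource on encodings, glyphs, coded character sets, code-points, surrogate characters, collation, UTF-8, and related topics.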

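This technical bulletin has the following parts:

Theory of Operation: This section describes the philosophy behind GT.M's support for Unicode™ and gives an overview of the extensions that provide it. In particular, the GT.M database engine is unchanged; conceptually, GT.M's Unicode™-related functionality is simply another way of interpreting the bytes stored in database files as characters.
M Language: This section covers the extensions to M language commands and string processing functions, and explains how GT.M operates with the UTF-8 character set, including Unicode™ strings and I/O. Together with the Theory of Operation and Utility Programs sections, it provides the information application developers need to develop applications that use Unicode™.
Utility Programs: This section covers changes to MUPIP, DSE, and LKE.
Discussion and Best Practices: This section describes best practices for exchanging data between the M character set and UTF-8, the limitations and maximums of V5.2-000, and ten rules for designing and developing Unicode™-based applications for deployment on GT.M.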

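Theory of Operation

Philosophy

When designing new GT.M functionality, the GT.M team is committed to upward compatibility. Unaltered existing applications deployed on previous production releases of GT.M exhibit unaltered behavior on V5.2-000.

There is no change to the GT.M database engine or to the way the engine stores and manipulates data. GT.M has always allowed the indexes and values of M global and local variables to be either canonical numbers or arbitrary sequences of bytes, and support for Unicode™ does not change this.

There is also no change to the character set used by M source programs. M source programs have always been in ASCII (standard ASCII, $C(0) through $C(127), is a proper subset of the UTF-8 encoding specified by the Unicode standard). GT.M accepts some non-ASCII characters in comments and string literals.

GT.M's Unicode™-related functionality consists of optional alternative ways of performing input and output and of interpreting, as character strings, the arbitrary sequences of bytes in the indexes and values of global and local variables. The changes within GT.M to support Unicode™ are primarily extensions to the M language specification. Although simple in principle, they fundamentally alter certain deep-rooted, long-standing assumptions. For example:

The length of a string in characters is not the same as its length in bytes. The length of a Unicode™ string in characters is always less than or equal to its length in bytes.
The display width of a string on a terminal can differ from the length of the string in characters. For example, with Unicode™, a complex glyph may actually be composed of a series of glyphs or component symbols, each of which is a UTF-8 encoded character in the Unicode™ string.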
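The first point above can be checked outside GT.M. The following is a minimal Python sketch, not GT.M code; the helper name `lengths` is hypothetical and only illustrates that the character count of a UTF-8 string never exceeds its byte count.

```python
# Illustrative Python, not GT.M code.
def lengths(text):
    """Return (length in characters, length in UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

for sample in ["GT.M", "新年好"]:
    chars, nbytes = lengths(sample)
    # The character count is always <= the UTF-8 byte count.
    assert chars <= nbytes
    print(f"{sample}: {chars} characters, {nbytes} bytes")
```

ASCII text has equal character and byte lengths; each of the three Chinese characters occupies three bytes in UTF-8.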

When a glyph is composed of multiple characters, a string in Unicode™ can have canonical and non-canonical forms. The forms may be conceptually the same, but they are different strings of characters in Unicode™.

GT.M treats canonical and non-canonical versions of the same string as different and unequal. FIS recommends that applications be written to ensure that, for core processing, strings always have a canonical form. Where conformance to a canonical representation of input strings cannot be assured, application logic linguistically and culturally correct for each language must convert non-canonical strings to canonical strings used as indices (global subscripts) to ensure appropriate collation.
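As an illustration of why two visually identical strings can compare unequal, here is a small Python sketch (not GT.M code). It assumes that Unicode normalization form NFC serves as the application's canonical form; the bulletin does not prescribe a particular form, and the helper name `canonical` is hypothetical.

```python
# Illustrative Python, not GT.M code.
import unicodedata

composed = "\u00e9"      # "é" as a single code-point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def canonical(text):
    # Assumption: NFC is the canonical form chosen by the application.
    return unicodedata.normalize("NFC", text)

# The raw strings render identically but are different code-point sequences.
print(composed == decomposed)                        # False
# After conversion to the chosen canonical form they compare equal.
print(canonical(composed) == canonical(decomposed))  # True
```

An application would apply such a conversion to input before using the string as a subscript, so that both forms land on the same node.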

Applications may operate on some binary data - for example, some strings in the database may be digitized images of signatures, others may include escape sequences for laboratory instruments. Furthermore, since M applications have traditionally overloaded strings by storing different data items as pieces of the same string, the same string may contain both Unicode™ and binary data. GT.M now has functionality to allow a process to manipulate Unicode™ strings as well as binary data including strings containing both Unicode™ and binary data.

When strings are interpreted as Unicode™, GT.M uses the UTF-8 representation internally. GT.M input / output operations can optionally automatically convert to and from UTF-16, UTF-16LE and UTF-16BE.

GT.M's design philosophy is to keep things simple, but no simpler than they need to be. There are areas of processing where the use of Unicode™ adds complexity. These typically arise where the interpretation of lengths and the interpretation of characters interact. For example:

A sequence of bytes is never illegal when considered as binary data, but can be illegal when treated as a Unicode™ string. The detection and handling of illegal Unicode™ strings adds complexity, especially when binary and Unicode™ data reside in different pieces of the same string.
Since binary data may not map to graphic characters in Unicode™, the ZWRite format must represent such characters differently. A sequence of bytes that is output by a process interpreting it as Unicode™ may require processing to be correctly input to a process that is interpreting that sequence as binary, and vice versa. Therefore, when performing I/O operations, including MUPIP EXTRACT and MUPIP LOAD operations in ZWR format, ensure that processes have compatible environment variables and/or logic to generate the desired output and correctly read and process the input.
Application logic managing input / output that interacts with human beings or non-GT.M applications requires even closer scrutiny. For example, fixed length records in files are always defined in terms of bytes. In Unicode™-related operation, an application may output data such that a character would cross a record boundary (for example, a record may have two bytes of space left, and the next UTF-8 character may be three bytes long), in which case GT.M fills the record with one or more pad characters. When a padded record is read as UTF-8, GT.M strips the trailing pad characters and does not provide them to the application code.

For some languages (such as Chinese), the ordering of strings according to Unicode™ code-points (character values) may not be the linguistically or culturally correct ordering. Applications supporting such languages require the development of collation modules: GT.M natively supports M collation, but does not include pre-built collation modules for any specific natural language.

What Is a Character: Glyph or Unicode™ Code-point?

Glyphs are the visual representation of text elements in writing systems and Unicode™ code-points are the underlying data. Internally, GT.M stores UTF-8 encoded strings as sequences of Unicode™ code-points. A Unicode™ compatible output device - terminal, printer or application - renders the characters as sequences of glyphs that depict the sequence of code-points, but frequently there is not a one-to-one correspondence between characters and glyphs.

For example, consider the following word from the Devanagari writing system (a syllabic script used to write Sanskrit and Hindi).

अच्छी

On a screen or printer, it is displayed in 4 columns. Internally, GT.M stores it as a sequence of 5 Unicode™ code-points:

#	Character	Unicode™ code-point	Name
1	अ	U+0905	DEVANAGARI LETTER A
2	च	U+091A	DEVANAGARI LETTER CA
3	्	U+094D	DEVANAGARI SIGN VIRAMA
4	छ	U+091B	DEVANAGARI LETTER CHA
5	ी	U+0940	DEVANAGARI VOWEL SIGN II

The Devanagari writing system (U+0900 to U+097F) is based on the representation of syllables, in contrast to the use of an alphabet in English. Therefore, it uses the half-form of a consonant to represent certain syllables. The example above uses the half-form of the consonant CA (U+091A).

Although the half-form consonant is a valid text element in the context of the Devanagari writing system, it does not map directly to a character in the Unicode™ Standard. It is obtained by combining DEVANAGARI LETTER CA, DEVANAGARI SIGN VIRAMA, and DEVANAGARI LETTER CHA.

च

्

छ

च्छ

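On a screen or printer, the terminal font detects the glyph image of the half-form consonant and displays it at the next display position. Internally, GT.M uses ICU's glyph-related rules for the Devanagari writing system to calculate the number of columns required to display it. As a result, GT.M advances $X by 1 when it encounters the combination of the 3 Unicode™ code-points that represent the half-form consonant.

To view this example at the GT.M prompt, enter the following command sequence: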

GTM>W $ZCHSET
UTF-8
GTM>SET DS=$CHAR($$FUNC^%HD("0905"))_$CHAR($$FUNC^%HD("091A"))_$CHAR($$FUNC^%HD("094D"))_$CHAR($$FUNC^%HD("091B"))_$CHAR($$FUNC^%HD("0940"))
GTM>WRITE $ZWIDTH(DS) ; 4 columns are required to display local variable DS on the screen.
4
GTM>WRITE $LENGTH(DS) ; DS contains 5 characters or Unicode code-points.
5
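The code-point table above can be reproduced outside GT.M. This is an illustrative Python sketch, not GT.M code; the helper name `describe` is hypothetical. It covers only the code-point and byte counts: the display width that $ZWIDTH reports depends on ICU's rules and is not computed here.

```python
# Illustrative Python, not GT.M code.
import unicodedata

# The same five code-points that the GT.M session assembles with $CHAR().
ds = "\u0905\u091a\u094d\u091b\u0940"

def describe(text):
    """List each code-point with its Unicode character name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code_point, name in describe(ds):
    print(code_point, name)

print(len(ds))                   # number of code-points
print(len(ds.encode("utf-8")))   # number of bytes in UTF-8
```

Each Devanagari code-point occupies three bytes in UTF-8, so the 5-character string is 15 bytes long.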

Therefore, for all writing systems supported by Unicode™, a character is a code-point for string processing, network transmission, storage, and retrieval of Unicode™ data whereas a character is a glyph for displaying on the screen or printer. This is also true of many other popular programming languages. Users must keep this distinction in mind throughout the application development life-cycle.

ICU

While GT.M provides a framework for handling characters in Unicode™, it relies on the ICU (International Components for Unicode) library for language specific information.

ICU is a widely used, de facto standard package (see http://icu.sourceforge.net and http://www.ibm.com/software/globalization/icu/ for more information) that GT.M relies on for most operations that require knowledge of the Unicode™ character sets, such as text boundary detection, character string conversion between UTF-8 and UTF-16, and calculating glyph display widths.


	Unless Unicode™ support is sought for a process (that is, unless the environment variable gtm_chset is "UTF-8"), GT.M processes do not need ICU. In other words, existing, non-Unicode™, applications continue to work on supported platforms without ICU.

ICU version numbers are of the form major.minor.milli.micro, where major, minor, milli, and micro are integers. Two versions with different major and/or minor version numbers may differ in functionality, and API compatibility between them is not guaranteed. Versions that differ only in milli or micro numbers are maintenance releases that preserve functionality and API compatibility. ICU reference releases are defined by major and minor version numbers, where the minor version number is even. For example, as of this writing (January, 2007), the latest ICU reference release is version 3.6. When ICU is packaged and distributed with an operating system, the operating system distribution may add its own version information. For example, as of this writing, the Debian GNU/Linux Testing version of package libicu36, which provides ICU 3.6, is 3.6-2.

An operating system's distribution generally includes an ICU library tailored to the OS and hardware, therefore a GT.M distribution does not provide any ICU. However, in order to support Unicode™ functionality, GT.M requires an appropriate version of ICU to be installed on the system. Each version of GT.M requires a specific reference release version of ICU. GT.M V5.2-000 requires ICU 3.6. The release notes for each GT.M release identify the required reference release version number as well as the milli and micro version numbers used to test GT.M before release. In general, it should be safe to use any version of ICU with the required reference release version number and milli and micro version numbers greater than those identified in the release notes for that GT.M version.
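The version rule described above can be expressed as a small check. The following Python sketch is only an illustration of that rule; the function names are hypothetical, GT.M provides no such helper, and the release notes remain the authority on which ICU versions are supported. The sketch also treats the tested version itself as acceptable.

```python
# Illustrative Python, not part of GT.M or ICU.
def parse_icu_version(version):
    """Split 'major.minor[.milli[.micro]]' into a 4-tuple of integers."""
    parts = [int(p) for p in version.split(".")]
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"unexpected ICU version string: {version!r}")
    parts += [0] * (4 - len(parts))   # missing milli/micro default to 0
    return tuple(parts)

def is_probably_compatible(installed, tested):
    """Same reference release (major.minor) and milli.micro not older
    than the version named in the release notes."""
    inst = parse_icu_version(installed)
    ref = parse_icu_version(tested)
    return inst[:2] == ref[:2] and inst[2:] >= ref[2:]

print(is_probably_compatible("3.6.1.0", "3.6"))  # True: later maintenance release
print(is_probably_compatible("3.8", "3.6"))      # False: different reference release
```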

ICU supports multiple threads within a process, and the ICU binary library can be compiled from source code either with or without support for multi-threading. In contrast, GT.M does not support multiple threads within a GT.M process. On some platforms, such as the Debian GNU/Linux Testing (Etch) release, the stock ICU library, which is usually compiled to support multiple threads, may work unaltered with GT.M. On other platforms, it may be required to rebuild ICU from its source files with support for multiple threads turned off. Refer to the release notes for each GT.M release for details about the specific configuration tested and hence formally supported. In general, the GT.M Group's preferences for the ICU binaries used with each GT.M version are, in decreasing order of preference:

The stock ICU binary provided with the operating system distribution.
A binary distribution of ICU from the download section of the ICU project page (http://icu.sourceforge.net/download/3.6.html#ICU4C).
A version of ICU compiled locally, with multi-threading disabled, from the source code provided with the operating system distribution.
A version of ICU compiled locally, with multi-threading disabled, from the source code on the ICU project page.

GT.M uses the POSIX function dlopen() to dynamically link to ICU. In the event you have other applications that require ICU compiled with threads, place the different builds of ICU in different locations, and use the dlopen() search path feature (e.g., the LD_LIBRARY_PATH environment variable on Linux) to enable each application to link with its appropriate ICU.

Compiling ICU

Below are sample instructions to download ICU, configure it not to use multi-threading, and compile it for various platforms. Note that download sites, versions of compilers, and milli and micro releases of ICU may well change subsequent to the writing of these instructions, and make these instructions obsolete. Therefore, these procedures must be considered examples, not gospel.

Compiling ICU version 3.6 on x86 Linux

As of this writing (January, 2007), ICU version 3.6 can be compiled on x86 Linux with the following configuration:

Operating System	Version	Compilers
Linux	Red Hat Enterprise Linux 4 Update 2	gcc 3.4.4, GNU make (3.77+), ANSI C compiler

Steps:

Ensure that the system environment variable $PATH includes the location of all the compilers mentioned above.
Download the source code of ICU version 3.6 for C from http://icu.sourceforge.net/download/3.6.html#ICU4C

At the shell prompt, execute the following commands:

gunzip -d < icu4c-3_6-src.tgz | tar -xf - 
cd icu/source/
chmod +x runConfigureICU configure install-sh       
runConfigureICU Linux --disable-64bit-libs --disable-threads 
gmake
gmake check
gmake install

Set the environment variable LD_LIBRARY_PATH (LIBPATH on AIX) to point to the location of ICU. GT.M uses the environment variable LD_LIBRARY_PATH to search for dynamically linked libraries to be loaded.

ICU is now installed in /usr/local.


	By default, ICU is installed in the /usr/local directory. If you need to install ICU in a different directory, type:
	runConfigureICU Linux --prefix=<install_path> --disable-64bit-libs --disable-threads
	Then execute the gmake commands, and set the environment variable LD_LIBRARY_PATH to point to the appropriate location.

Compiling ICU version 3.6 on HP PA-RISC HP-UX

As of this writing (January, 2007), ICU version 3.6 can be compiled on PA-RISC HP-UX with the following configuration:

Operating System	Version	Compilers
HP-UX	HP-UX 11.11	aCC A.03.50, cc B.11.11.08, GNU make (3.77+)

Steps:

Ensure that the system environment variable $PATH includes the location of all the compilers mentioned above.
Download the source code of ICU version 3.6 for C from http://icu.sourceforge.net/download/3.6.html#ICU4C
Add the following line in the configuration file source/config/mh-hpux-acc to include the appropriate C++ runtime libraries:
```
DEFAULT_LIBS = -lstd_v2 -lCsup_v2 -lcl
```

At the shell prompt, execute the following commands:

gunzip -d < icu4c-3_6-src.tgz | tar -xf - 
cd icu/source/
chmod +x runConfigureICU configure install-sh       
runConfigureICU HP-UX/ACC --disable-64bit-libs --disable-threads
gmake
gmake check
gmake install

Set the environment variable LD_LIBRARY_PATH to point to the location of ICU. HP-UX uses the environment variable LD_LIBRARY_PATH to search for dynamically linked libraries to be loaded.

ICU is now installed in /usr/local.


	By default, ICU is installed in the /usr/local directory. If you need to install ICU in a different directory, type:
	runConfigureICU HP-UX/ACC --prefix=<install_path> --disable-64bit-libs --disable-threads
	Then execute the gmake commands, and set the environment variable LD_LIBRARY_PATH to point to the appropriate location.

Compiling ICU version 3.6 on Solaris

As of this writing (January, 2007), ICU version 3.6 can be compiled on Solaris with the following configuration:

Operating System	Version	Compilers
Solaris	Solaris 9 (SunOS 5.9)	Sun Studio 8 (Sun C++ 5.5), GNU make (3.77+), ANSI C compiler

Steps:

Ensure that the system environment variable $PATH includes the location of all the compilers mentioned above.
Download the source code of ICU version 3.6 for C from http://icu.sourceforge.net/download/3.6.html#ICU4C
Add the following line in the configuration file source/config/mh-solaris to include the appropriate C++ runtime libraries:
```
DEFAULT_LIBS = -lCstd -lCrun -lm -lc
```

At the shell prompt, execute the following commands:

gunzip -d < icu4c-3_6-src.tgz | tar -xf -
cd icu/source/
chmod +x runConfigureICU configure install-sh       
runConfigureICU Solaris --disable-64bit-libs --disable-threads
gmake
gmake check
gmake install

Set the environment variable LD_LIBRARY_PATH to point to the location of ICU. Solaris uses the environment variable LD_LIBRARY_PATH to search for dynamically linked libraries to be loaded.

ICU is now installed in /usr/local.


	By default, ICU is installed in the /usr/local directory. If you need to install ICU in a different directory, type:
	runConfigureICU Solaris --prefix=<install_path> --disable-64bit-libs --disable-threads
	Then execute the gmake commands, and set the environment variable LD_LIBRARY_PATH to point to the appropriate location.

Compiling ICU version 3.6 on AIX

As of this writing (January, 2007), ICU version 3.6 can be compiled on AIX with the following configuration:

Operating System	Version	Compilers
AIX	AIX 5.2 (PowerPC 64-bit)	VisualAge 6, GNU make (3.77+), ANSI C compiler

Steps:

Ensure that the system environment variable $PATH includes the location of all the compilers mentioned above.
Download the source code of ICU version 3.6 for C from http://icu.sourceforge.net/download/3.6.html#ICU4C

At the shell prompt, execute the following commands:

gunzip -d < icu4c-3_6-src.tgz | tar -xf - 
cd icu/source/
chmod +x runConfigureICU configure install-sh       
runConfigureICU AIX --disable-64bit-libs --disable-threads
gmake
gmake check
gmake install

Set the environment variable LIBPATH to point to the location of ICU. AIX uses the environment variable LIBPATH to search for dynamically linked libraries to be loaded.

ICU is now installed in /usr/local.


	By default, ICU is installed in the /usr/local directory. If you need to install ICU in a different directory, type:
	runConfigureICU AIX --prefix=<install_path> --disable-64bit-libs --disable-threads
	Then execute the gmake commands, and set the environment variable LIBPATH to point to the appropriate location.


	AIX includes the release (or minor version) number in the name of the ICU library which can change based on updates from IBM. Users must set the environment variable gtm_icu_minorver to the release number so that GT.M uses that number to activate and access the ICU. If gtm_icu_minorver is not defined, GT.M assumes the release number to be 0.

M Language

The string processing functions now manipulate Unicode™ strings, binary data, or both together in the same process. Input / Output operations now can perform conversion to and from the following character encodings:

UTF-8
UTF-16
UTF-16LE
UTF-16BE


	Any aspect of GT.M not described in this technical bulletin is unchanged from preceding releases.

M Mode and UTF-8 Mode

A GT.M process can start in one of two modes: M mode and UTF-8 mode. Any process with the environment variable gtm_chset set to "M" at process entry operates in M mode and exhibits the same behavior as pre-Unicode™ versions. As noted in the Theory of Operation above, unaltered existing applications deployed on the previous production release of GT.M default to "M" mode and exhibit unaltered behavior from earlier releases. The GT.M database engine functions identically in M mode and UTF-8 mode.

The changes to GT.M for the support of Unicode™ pertain to the interpretation of strings. A process starts in UTF-8 mode and interprets strings encoded in UTF-8, if at process startup:

the environment variable gtm_chset has a value of "UTF-8", and
the environment variable LC_CTYPE is set to a locale with UTF-8 support, for example, "zh_CN.utf8"

Note that support for Unicode™ is enabled for the process, not for the database. The indexes and values in the database are simply sequences of bytes and therefore it is possible for one process to interpret a global node as encoded in UTF-8 and for another to interpret the same data as a binary stream. ASCII (codes 0-127) is a subset of Unicode™, and so is available in both modes.

Pattern Match Operator (?)

GT.M allows pattern string literals to contain characters in Unicode™. Additionally, GT.M extends the M standard pattern codes (patcodes) A, C, N, U, L, P and E to the Unicode™ character set. For characters in Unicode™, these patcodes are:

A: All alphabetic characters including upper case, lower case and caseless alphabetic characters.
C: All control characters
E: All characters
L: All lower case characters
U: All upper case characters
N: All digits as specified by the intrinsic special variable $ZPATNUMERIC. If $ZPATNUMERIC is "UTF-8", N recognizes all numeric characters as defined by Unicode™. If $ZPATNUMERIC is "M", N recognizes only ASCII digits (ASCII 48-57) as numeric characters. The default value of the intrinsic special variable $ZPATNUMERIC is "M". For a process started in UTF-8 mode, $ZPATNUMERIC takes its value from the environment variable gtm_patnumeric (see the Environment Variables section for more details on configuring this variable).
P: All punctuation characters

For characters in Unicode™ , GT.M assigns patcodes based on the default classification of the Unicode™ character set by the ICU library. Note that the above patcodes do not cover all types of characters in the Unicode™ character set. There are several special Unicode™ character classes (such as title case characters) that do not satisfy any of the patcodes above except “E”. The patcode E can be used to match any character in Unicode™ including the characters not covered by the patcodes above as well as malformed characters (if VIEW “NOBADCHAR” setting is enabled). If VIEW “BADCHAR” is enabled, the pattern match operator triggers the BADCHAR error if it encounters an illegal UTF-8 byte sequence in the string.

Commands

Job

The Job command spawns a background process with the same environment as the M process that executes the Job command. Therefore, if the parent process is operating in UTF-8 mode, the Job'd process also operates in UTF-8 mode. In the event that a background process must have a different mode from the parent, create a shell script to alter the environment as needed, and spawn it with a ZSYstem command, e.g., ZSYstem "/path/to/shell/script &".

View "[NO]BADCHAR"

In pre-Unicode™ releases, and in M mode, the concept of an illegal character does not exist - all 256 combinations of the 8 bits in a byte are legal characters. In UTF-8 mode, there are certain sequences of bytes that are illegal characters. For example, $ZCHAR(192) is an illegal character because it is a sequence of 2-bytes whose second byte is missing (U+0000).
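To make the idea of an illegal byte sequence concrete, here is a minimal Python sketch (not GT.M code). It uses Python's strict UTF-8 decoder as a stand-in validator; the helper name is hypothetical, and GT.M's own checks may differ from Python's in edge cases.

```python
# Illustrative Python, not GT.M code.
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

encoded = "新".encode("utf-8")             # a complete three-byte character

print(is_well_formed_utf8(bytes([192])))   # False: the byte produced by $ZCHAR(192)
print(is_well_formed_utf8(encoded))        # True
print(is_well_formed_utf8(encoded[:2]))    # False: the last byte is missing
print(is_well_formed_utf8(bytes(range(128))))  # True: ASCII is always legal
```

As binary data, every one of these byte strings is acceptable; only the interpretation as UTF-8 makes some of them illegal.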

The [NO]BADCHAR keyword argument for the VIEW command enables or disables the triggering of an error when character-oriented functions encounter malformed byte sequences (illegal characters).

At process startup, GT.M initializes BADCHAR from the environment variable gtm_badchar. Set the environment variable gtm_badchar to a non-zero number or "YES" (or "Y") to enable VIEW "BADCHAR". Set the environment variable gtm_badchar to 0 or "NO" or "FALSE" (or "N" or "F") to enable VIEW "NOBADCHAR". By default, GT.M enables VIEW "BADCHAR".

If VIEW "BADCHAR" is enabled, functions generate the BADCHAR error when they encounter malformed byte sequences. With this setting, GT.M detects and clearly reports potential application program logic errors as soon as they appear. As an illegal UTF-8 character in the argument of a character-oriented function likely indicates a logic issue, FIS recommends the use of VIEW "BADCHAR" in production environments.

When all strings consist of well-formed characters, the value of VIEW [NO]BADCHAR has no effect whatsoever. If VIEW "NOBADCHAR" is enabled, the same functions treat malformed byte sequences as valid characters. During the migration of an application to add support for Unicode™, illegal character errors are likely to be frequent and indicative of application code that is yet to be modified. VIEW "NOBADCHAR" suppresses these errors at times when their presence impedes development.

ZSHow

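The ZSHOW command displays information about the current GT.M environment. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

In UTF-8 mode, the ZSHOW command exhibits byte-oriented and display-oriented behavior as follows:

ZSHOW targeted to a device (ZSHOW "*") aligns the output according to the number of display columns specified by the WIDTH deviceparameter.
ZSHOW targeted to a local (ZSHOW "*":lcl) truncates data exceeding 2048KB at the last character that fully fits within the 2048KB limit.
ZSHOW targeted to a global (ZSHOW "*":^CC) truncates data exceeding the maximum record size of the target global at the last character that fully fits within that record size.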
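Truncating "at the last character that fully fits" within a byte limit can be sketched as follows. This is illustrative Python, not GT.M code, and the function name is hypothetical; it only shows the general technique of cutting at a byte limit without splitting a multi-byte character.

```python
# Illustrative Python, not GT.M code.
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Keep the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting valid UTF-8 at a byte offset can leave at most one incomplete
    # trailing sequence; errors="ignore" drops exactly those bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

# Each character below is 3 bytes, so a 7-byte limit keeps two characters.
print(truncate_utf8("新年好", 7))
```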

I/O Commands

As with other areas of functionality, when the environment variable gtm_chset is not set, or is set to "M", there is no change to GT.M I/O behavior. When gtm_chset is set to "UTF-8", GT.M supports Unicode™ I/O.

Even when a process internally stores and manipulates strings encoded in UTF-8, it may nevertheless need to perform I/O on a series of individual bytes, that is, 8-bit octets; a series of bytes that encode characters in UTF-16 with an explicit little endian encoding (UTF-16LE); or a series of bytes that encode characters in UTF-16 with an explicit big endian encoding (UTF-16BE). GT.M allows a process to explicitly specify the encoding by deviceparameters in the OPEN and USE commands. This encoding determines the mode (M mode or UTF-8 mode) of the device.


	GT.M determines the encoding for $PRINCIPAL from gtm_chset and does not allow the process to change it.

While it is not possible for a byte to be an illegal character when performing I/O on 8-bit octets, when performing I/O on characters in Unicode™, it is certainly possible for a sequence of bytes to be an illegal character in Unicode™. GT.M READ and WRITE commands check for legal characters and raise the BADCHAR error if they detect a sequence of bytes not corresponding to a legal character. Application code must avoid illegal characters in I/O streams or specify M mode, as VIEW "NOBADCHAR" does not suppress BADCHAR error reporting in I/O.

In M mode, except when FILTER= is in use, a character always has a width of 1. Characters encoded with Unicode™, however, can have different widths according to the current device to which they are applied. For example, the CJK ideograph 新 occupies 2 display columns on the screen or printer whereas the width of the same character is 1 code-point when it is transmitted through sockets. GT.M handles these differences by defining the measurement characteristics of all deviceparameters when they are applied to certain devices.

The RECORDSIZE of a fixed length record for a GT.M sequential disk device is always specified in bytes, rather than characters. In M mode, GT.M only pads a fixed length record when the file is closed and the last record is less than the RECORDSIZE; when READing a padded fixed length record, GT.M returns the full record including any PAD characters.

In UTF-8 mode, there are three cases that cause GT.M to insert PAD characters when WRITEing. When READing, GT.M attempts to strip any PAD characters. This stripping only works properly if the RECORDSIZE and PAD are the same for the READ as when the WRITEs occurred. WRITE inserts PAD characters when:

The file is closed and the last record is less than the RECORDSIZE. Records are padded (for FIXED) by WRITE ! as well as when the file is closed.
$X exceeds WIDTH before the RECORDSIZE is full.
The next character won't fit in the remaining RECORDSIZE.

The additional functionality described below supports Unicode™-related operation.

Open

O[PEN][:tvexpr] expr[:[(keyword[=expr][:...])] [:numexpr]][,...]

The OPEN command creates a connection between a GT.M process and a device. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

In UTF-8 mode, the OPEN command recognizes ICHSET, OCHSET, and CHSET as three additional deviceparameters to determine the encoding of the input / output devices. The following sections describe these deviceparameters.

In M mode, the OPEN command ignores ICHSET, OCHSET, CHSET, and PAD device parameters.

If an I/O device uses a multi-byte character encoding, every READ and WRITE operation on that device checks for well-formed characters according to the character encoding specified with ICHSET or OCHSET. If the I/O commands encounter an illegal sequence of bytes, they always trigger a run-time error; a VIEW "NOBADCHAR" does not prevent such errors. Strings created by $ZCHAR() and other Z equivalent functions may contain illegal sequences. The only way to input or output such illegal sequences is to specify character set "M" with one of these deviceparameters.

Open Deviceparameters

OCHSET=expr Applies to: All devices

Establishes the character encoding of the output device. The value of the expression can be M, UTF-8, UTF-16, UTF-16LE, or UTF-16BE.

If the value for OCHSET is not specified, GT.M assumes the value of the intrinsic variable $ZCHSET as the default character set for all the input / output devices and "M" if $ZCHSET is not specified.

If expr is set to a value other than "M", "UTF-8", "UTF-16", "UTF-16LE" or "UTF-16BE", GT.M triggers a run-time error.


	UTF-16, UTF-16LE, and UTF-16BE are not supported for $Principal and Terminal devices. Please refer to the limitations section for more details.

Example:

GTM>SET file1="mydata.out"
GTM>SET expr="UTF-16LE"
GTM>OPEN file1:(chset=expr)
GTM>USE file1 WRITE "新年好",!
GTM>CLOSE file1

This example opens a new file called mydata.out and writes the Chinese characters "新年好" in the UTF-16LE encoding.
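For readers who want to see how the choice of encoding changes the bytes for these three characters, here is a small Python sketch (not GT.M code; the helper name is hypothetical). It covers the character data only and makes no claim about line terminators or byte order marks that GT.M may write.

```python
# Illustrative Python, not GT.M code.
def to_hex(text, encoding):
    """Encode text and return the bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in text.encode(encoding))

text = "新年好"   # U+65B0 U+5E74 U+597D

print(to_hex(text, "utf-16-le"))  # b0 65 74 5e 7d 59
print(to_hex(text, "utf-16-be"))  # 65 b0 5e 74 59 7d
print(to_hex(text, "utf-8"))      # e6 96 b0 e5 b9 b4 e5 a5 bd
```

The little endian and big endian forms contain the same 16-bit units with the bytes of each unit swapped, while UTF-8 uses three bytes per character here.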

ICHSET=expr Applies to: All devices

Establishes the character encoding of the input device. The value of the expression can be M, UTF-8, UTF-16, UTF-16LE, or UTF-16BE.

If the value for ICHSET is not specified, GT.M assumes the value of the intrinsic variable $ZCHSET as the default character set for all the input / output devices and "M" if $ZCHSET is not specified.

If expr is set to a value other than "M", "UTF-8", "UTF-16", "UTF-16LE" or "UTF-16BE", GT.M triggers a run-time error.


	UTF-16, UTF-16LE, and UTF-16BE are not supported for $Principal and Terminal devices. Please refer to the limitations section for more details.

CHSET=expr Applies to: SD FIFO TRM and SOC

Establishes a common encoding for both input and output devices. The value of the expression can be M, UTF-8, UTF-16, UTF-16LE, or UTF-16BE. For more information, refer to ICHSET and OCHSET.

RECORDSIZE=expr Applies to: SD FIFO

RECORDSIZE overrides the default record size for a disk and specifies the maximum record size in bytes. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

For SD in UTF-8 mode, GT.M treats RECORDSIZE as a byte limit at which to wrap or truncate output depending on [NO]WRAP. For any character set other than "M", GT.M ignores RECORDSIZE for a device which is already open if any I/O has been done.

If the character set is M or UTF-8, the default RECORDSIZE is 32K-1 bytes.

If the character set is UTF-16, UTF-16LE or UTF-16BE, the RECORDSIZE must always be a multiple of 2. For these character sets, the default RECORDSIZE is 32K-4 bytes.

[NO]FIXED Applies to: SD FIFO

Selects a fixed-length record format for sequential disk files. FIXED does not specify the actual length of a record. Use RECORDSIZE to specify the record length. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

In UTF-8 mode with FIXED format, GT.M I/O enforces a more record-oriented view of the file, treating each record as RECORDSIZE bytes long. A READ ignores any PAD bytes found at the end of a record and does not return them to the application.

A READ X gets the remainder of the current record if any characters remain, otherwise it reads an entire new record.

A READ #len returns up to len characters from the current record if any characters remain; otherwise it reads up to len characters from a new record. All characters returned are from a single record.

A READ *X returns the code-point for a single character. If there is a character in the current record, READ * returns it, otherwise it fetches a new record and returns a single character from it.

WRITE when WRAP is not enabled writes up to WIDTH - $X display columns. WRITE uses PAD bytes at the end of the record to produce an output record of RECORDSIZE bytes. Note that a Unicode™ code-point never splits across records. A combining character may end up in the subsequent record if it does not fit in the current record.

WRITE when WRAP is enabled starts new records as required with no more than WIDTH characters per record. WRITE uses PAD bytes at the end of the record to produce an output record of RECORDSIZE bytes, without writing any partial Unicode™ characters.

In both of the above WRITE cases where the command has multiple arguments, WRITE handles each argument individually except in the case of a sequence of literals, which it combines into a single argument.

WRITE ! writes WIDTH - $X spaces followed by PAD bytes as required to pad the record to RECORDSIZE bytes.

PAD=expr Applies to: SD FIFO

For FIXED format sequential files and when the character set is not "M", if a multi-byte character (when CHSET is UTF-8) or a surrogate pair (when CHSET is UTF-16) does not fit into the record (either logical as given by WIDTH or physical as given by RECORDSIZE) the WRITE command pads the bytes specified by the PAD deviceparameter to fill out the physical record. READ ignores the pad bytes when found at the end of the record. The value for PAD is given as an integer in the range 0-127 (the ASCII characters). The default PAD byte value is $ZCHAR(32) or <SPACE>.

Example:

GTM>Set a="准祝新年在上海"
GTM>Set encoding="UTF-8"
GTM>Set filename="bom"_encoding_".txt" 
GTM>Open filename:(newversion:FIXED:RECORDSIZE=8:PAD=66:chset=encoding)
GTM>Use filename
GTM>Write a
GTM>Close filename
GTM>Halt
$ cat bomUTF-8.txt 
准祝BB新年BB在上BB海

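In the above example, the local variable a is set to a string of three-byte characters. PAD=66 sets the padding byte value to $CHAR(66), which is "B". Because RECORDSIZE=8 leaves room for only two three-byte characters, each full record ends with two PAD bytes.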
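
The record layout in this example can be reproduced with a small Python model. This is only a sketch of the padding rule described above, not GT.M code; the function name write_fixed_records is hypothetical, and the final partial record is left unpadded simply to mirror the output shown (what GT.M does to the last record on CLOSE is not modeled).

```python
def write_fixed_records(text: str, recordsize: int, pad: int) -> list[bytes]:
    """Pack UTF-8 characters into records of at most `recordsize` bytes."""
    records = []
    current = b""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(current) + len(encoded) > recordsize:
            # The character does not fit: fill the record with PAD bytes
            # and start a new record, so no character is ever split.
            records.append(current + bytes([pad]) * (recordsize - len(current)))
            current = b""
        current += encoded
    if current:
        # Assumption: the last, partial record is kept as written.
        records.append(current)
    return records


records = write_fixed_records("准祝新年在上海", recordsize=8, pad=66)
print(b"".join(records).decode("utf-8"))
```

Each completed record is exactly eight bytes: six bytes of character data followed by two "B" bytes.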

Read

R[EAD][:tvexpr] (glvn|*glvn|glvn#intexpr)[:numexpr]|strlit|fcc[,...]

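The READ command transfers input from the current device to a global or local variable specified as a READ argument. For convenience, READ also accepts arguments that perform limited output to the current device. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.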

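In UTF-8 mode, the READ command uses the character set value specified on the device OPEN as the character encoding of the input device. If the character set is "M" or "UTF-8", the data is read with no transformation. If the character set is "UTF-16", "UTF-16LE", or "UTF-16BE", the data is read in the specified encoding and converted to UTF-8. If the READ command encounters an illegal character or a character outside the selected representation, it triggers a run-time error. The READ command recognizes all Unicode™ line terminators for non-FIXED devices. See the "Line Terminators" section for more details.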


	In M mode, characters and bytes have a one-to-one relationship and therefore READ can be used to read bit-streams of non-character data.

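Read # Command

When a number sign (#) and a non-zero integer expression immediately follow the variable name, the integer expression specifies the maximum number of characters accepted as input to the READ command. In UTF-8 mode, this limit can be reached in the middle of a sequence of combining code-points (some of which are typically non-spacing). When this happens, any display on the input device may not represent the characters returned by the fixed-length READ (READ #).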

Read * Command

In UTF-8 mode, the READ * command accepts one Unicode™ character of input and puts the numeric code-point value for that character into the variable.

In M mode, the READ * command reads a single byte and returns the numeric byte value. If character set UTF-8 is specified, the READ * command reads one to four bytes, depending on the encoding and returns the numeric code-point value of the character. If ICHSET specifies "UTF-16", "UTF-16LE" or "UTF-16BE", the READ * command reads a byte pair or two byte pairs (if it is a surrogate pair) and returns the numeric code-point value.

Example:

GTM>Set filename="mydata.out" ; assume that mydata.out contains "新年好" encoded in UTF-16LE
GTM>Open filename:(readonly:ichset="UTF-16LE")
GTM>Use filename
GTM>Read *x
GTM>Close filename
GTM>Write $char(x)
新

In the above example, the READ * command reads the first character of the file mydata.out according to the encoding specified by ICHSET and stores its code-point value in x.
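
The "byte pair or two byte pairs" rule for UTF-16 input can be sketched in Python as follows. This models only the decoding arithmetic, not GT.M's device handling; read_star_utf16 is a hypothetical name, and the sketch assumes well-formed input (it does not check that a low surrogate actually follows a high surrogate).

```python
def read_star_utf16(data: bytes, little_endian: bool = True) -> tuple[int, int]:
    """Return (code_point, bytes_consumed) for the first character in data."""
    order = "little" if little_endian else "big"
    unit = int.from_bytes(data[0:2], order)
    if 0xD800 <= unit <= 0xDBFF:
        # High surrogate: the character occupies two byte pairs.
        low = int.from_bytes(data[2:4], order)
        code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        return code_point, 4
    # Any other code unit is a complete character in one byte pair.
    return unit, 2


code_point, consumed = read_star_utf16("新年好".encode("utf-16-le"))
print(code_point, consumed)
```

For the file contents assumed in the example, the first call consumes one byte pair and yields the code-point of "新".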

Write

W[RITE][:tvexpr] expr|*intexpr|fcc[,...]

The WRITE command transfers a character stream specified by its arguments to the current device. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

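In UTF-8 mode, the WRITE command uses the character set specified on the device OPEN as the character encoding of the output device. If the character set specifies "M" or "UTF-8", GT.M writes the data with no transformation. If the character set specifies "UTF-16", "UTF-16LE", or "UTF-16BE", the data is assumed to be encoded in UTF-8 and WRITE transforms it to the character encoding specified by the character set deviceparameter.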

Example:

GTM>Set filename="mydata.out"
GTM>Set T16LE="准备庆祝新年在上海"
GTM>Open filename:(chset="UTF-16LE")
GTM>Use filename
GTM>Write T16LE
GTM>Close filename

The above example creates a file mydata.out encoded in the UTF-16LE character set.

If a WRITE command encounters an illegal character, it triggers a run-time error irrespective of the setting of VIEW "BADCHAR".

In M mode, the WRITE command ignores the OCHSET deviceparameter.

Write * Command

When the argument of a WRITE command consists of a leading asterisk (*) followed by an integer expression, the WRITE command outputs the character represented by the code-point value of that integer expression.

With character set M specified at device OPEN, the WRITE * command transfers the character (byte) associated with the numeric value of the integer expression. With character set UTF-8 specified at device OPEN, the WRITE * command outputs the character associated with the numeric code-point value. If character set "UTF-16", "UTF-16LE" or "UTF-16BE" is specified, WRITE * transforms the character code to the mapping specified by that character set.
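
As a rough illustration of how the same integer produces different byte sequences per character set, the following Python sketch encodes one code-point the way the paragraph above describes. The function write_star and the CODECS table are hypothetical; plain "UTF-16" is left out because its byte order and BOM handling are covered separately, and the M branch assumes a value from 0 through 255.

```python
# Map the character set names used in this document to Python codec names.
CODECS = {"UTF-8": "utf-8", "UTF-16LE": "utf-16-le", "UTF-16BE": "utf-16-be"}


def write_star(code: int, chset: str) -> bytes:
    """Return the bytes a WRITE *code would emit for the given character set."""
    if chset == "M":
        # M mode: the integer is a single byte value (assumed 0 through 255).
        return bytes([code])
    # Other character sets: the integer is a Unicode code-point.
    return chr(code).encode(CODECS[chset])


for chset in ("UTF-8", "UTF-16LE", "UTF-16BE"):
    print(chset, write_star(26032, chset).hex())
```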

Cursor Position Variables

$X

$X is a special intrinsic variable that contains the current column position of the cursor for the current device. $X gives the horizontal position of a virtual cursor in the current output record and holds an integer value ranging from 0 to 65,535. $X=0 represents the left-most position of a record or row. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

For UTF-8 mode and TRM and SD output, $X increases by the display-columns of a given string that is written to the current device.

Example:

GTM> Write $ZCHSET
UTF-8
GTM>Set a="准祝"
GTM>Use $Principal:WIDTH=40
GTM>Write a,$X
准祝4
GTM>

In the above example, the Use command sets the width of the $Principal device to 40 display columns. $X returns 4 because each character in local variable a occupies 2 display columns.
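
The difference between character count and display-columns can be approximated in Python with the standard unicodedata module. This is an approximation for illustration only: GT.M's own width calculation is not shown in this document, and the rule used here (wide and fullwidth characters take two columns, combining marks take none) is an assumption. The name display_columns is hypothetical.

```python
import unicodedata


def display_columns(text: str) -> int:
    """Approximate the number of terminal columns a string occupies."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are treated as zero-width in this sketch.
            continue
        # East Asian Wide (W) and Fullwidth (F) characters take two columns.
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


print(len("准祝"), display_columns("准祝"))
```

Under this rule the two-character string from the example occupies four columns, which matches the value of $X shown.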

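Use Deviceparameters

WIDTH=intexpr Applies to: TRM SOC NULL SD FIFO

Sets the size of the logical record for the device and enables WRAP. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

For UTF-8 mode and TRM and SD output, the WIDTH deviceparameter is specified in display-columns and is used together with $X to control truncation and WRAPping of the visual representation of the stream.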

In M mode, if WIDTH is set to 0, GT.M uses the default WIDTH of the TRM and SOC devices. USE x:WIDTH=0 is equivalent to USE x:(WIDTH=<device-default>:NOWRAP). For SD and FIFO devices in M mode, setting WIDTH to 0 is not allowed.

In UTF-8 mode, WIDTH=0 disables formatting control based on comparison of WIDTH and $X but does not affect the control of WIDTH over the behavior when the output exceeds RECORDSIZE.

Example:

GTM>Set a="准备庆祝新年在上海"
GTM>Set encoding="UTF-8"
GTM>Set filename="my"_encoding_".txt" 
GTM>Open filename:(newversion:chset=encoding)
GTM>Use filename:WIDTH=4
GTM>Write a
GTM>Close filename
GTM>Halt
$ cat myUTF-8.txt 
准备
庆祝
新年
在上
海

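In the above example, WIDTH=4 allows two double-width characters per record, so the nine-character string wraps into five records. GT.M format control characters, FILTER, and the device WIDTH and WRAP also have an effect on $X.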
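
The wrapping in the WIDTH=4 example can be modeled in Python. As before, this is a sketch under stated assumptions rather than GT.M's implementation: the column widths come from the standard unicodedata module, and the names column_width and wrap_by_width are hypothetical.

```python
import unicodedata


def column_width(ch: str) -> int:
    """Columns used by one character (assumed rule: wide = 2, combining = 0)."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def wrap_by_width(text: str, width: int) -> list[str]:
    """Start a new record whenever the next character would exceed width."""
    lines = []
    current = ""
    x = 0  # plays the role of $X: columns used in the current record
    for ch in text:
        w = column_width(ch)
        if current and x + w > width:
            lines.append(current)
            current, x = "", 0
        current += ch
        x += w
    if current:
        lines.append(current)
    return lines


print("\n".join(wrap_by_width("准备庆祝新年在上海", 4)))
```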

For UTF-8 mode and SOC output, the WIDTH deviceparameter specifies the number of Unicode™ characters (code-points).

[NO]WRAP Applies to: TRM SOC NULL SD FIFO

Enables or disables automatic record termination. When the current record size ($X) reaches the maximum WIDTH and the device has WRAP enabled, GT.M starts a new record, as if the routine had issued a WRITE ! command. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

For UTF-8 mode and SD output, WRAP or truncation occurs when WRITEs exceed either WIDTH (display-columns) or RECORDSIZE (bytes).

Line Terminators

For non-FIXED format sequential files and terminal devices for which the character set is not M, all the standard Unicode™ line terminators terminate the logical record. These are U+000A (LF), U+000D (CR), U+000D followed by U+000A (CRLF), U+0085 (NEL), U+000C (FF), U+2028 (LS) and U+2029 (PS). For these devices, LF is used to terminate a record on output, though if FILTER=CHARACTER is enabled, all of the terminators are recognized to maintain the values of $X and $Y.
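
To make the terminator list concrete, the following Python sketch splits a string into logical records on exactly the terminators named above. It is an illustration of the rule only; split_records is a hypothetical name, and a trailing terminator yields a final empty record in this model.

```python
import re

# LF, CR, CRLF, NEL, FF, LS and PS. CRLF is listed first so that the
# two-character sequence counts as a single terminator.
TERMINATORS = re.compile("\r\n|[\n\r\x0c\x85\u2028\u2029]")


def split_records(text: str) -> list[str]:
    """Split text into logical records on the Unicode line terminators."""
    return TERMINATORS.split(text)


print(split_records("first\r\nsecond\u2028third\x85fourth"))
```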

Unicode™ Byte Order Marker (BOM)

When the ICHSET for a device is not "M" and a BOM (U+FEFF) is at the beginning of the initial input for a file or data stream, GT.M uses it to determine the endianness if the ICHSET is UTF-16, and checks that it agrees with the ICHSET if that is UTF-16BE or UTF-16LE.

If the character set for a device is UTF-16, GT.M uses the BOM (U+FEFF) to determine the endianness. For this to happen, the BOM must be at the beginning of the initial input for a file or data stream. If there is no BOM present, GT.M assumes big-endian byte order.

If the character set of a device is UTF-8, GT.M checks for and ignores a BOM on input.

If the BOM does not match the character set specified at device OPEN, GT.M triggers an error. READ does not return the BOM to the application and the BOM is not counted as part of the first record.

If the output character set for a device is UTF-16 (but not UTF-16BE or UTF-16LE), GT.M writes a BOM before the initial output. The application code does not need to explicitly write the BOM.

Deviceparameter Summary

The measurement characteristics of some deviceparameters change when they are applied to certain devices. For example, terminal WIDTH is measured in display-columns whereas socket WIDTH is measured in code-points. The following table lists the units of measurement (byte, code-point, or display-column) for $X and deviceparameters for TRM, SD, SOC, and FIFO. All deviceparameters that are not described in this section remain unchanged from preceding releases. "-" denotes that the deviceparameter has no effect for that device.

Terminal (TRM) Device
Device	$X	RECORDSIZE	WIDTH	PAD
TRM	Display-column	Byte	Display-column	-
SD	Display-column	Byte	Display-column	Code-point
SOC and FIFO	Code-point	Byte	Code-point	-


	In M mode, display-columns, characters and bytes are all equivalent. In all UTF-16 I/O modes, RECORDSIZE must be even and PAD characters are two bytes. GT.M implements SD output in a fashion that supports copying files to display devices such as printers and terminals.

String Processing Functions

A multi-byte character can be made up of a base character, composite character, or a pre-composed character of various letter/diacritic combinations. In UTF-8 mode, all string processing functions identify each character as a distinctive unit of writing in the context of a particular writing method. However, in M mode, GT.M unconditionally treats characters as strings of octets (8-bit bytes).

To provide additional flexibility for performing byte-oriented operations in a process started in UTF-8 mode, GT.M provides "Z equivalents" of the traditional string processing functions. These functions are:

$ZA[SCII](expr[,intexpr])
$ZC[HAR](intexpr[,…])
$ZE[XTRACT](expr[,intexpr1[,intexpr2]])
$ZF[IND](expr1,expr2[,intexpr])
$ZJ[USTIFY](expr,intexpr1[,intexpr2])
$ZL[ENGTH](expr1[,expr2])
$ZP[IECE](expr1,expr2[,intexpr1[,intexpr2]])
$ZTR[ANSLATE](expr1[,expr2[,expr3]])

These Z equivalent functions exhibit the same behavior as their traditional M counterparts operating in M mode. For example, in UTF-8 mode, the length of a string in characters is less than or equal to its length in bytes. In this mode, the $LENGTH() function considers sequences of bytes to be strings of characters encoded in UTF-8 and returns the number of characters. The new Z equivalent function $ZLENGTH() considers sequences of bytes to be simply strings of octets (8-bit bytes) and returns the number of bytes just as $LENGTH() does when operating in M mode. All Z equivalent functions are independent of the value of $ZCHSET or VIEW [NO]BADCHAR.

The Z equivalent functions come in handy when applications need to process binary data including blobs, binary byte streams, bit-masks, and so on while simultaneously operating in UTF-8 mode.
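
The character view and the byte view of the same data can be contrasted with a short Python analogy. This is not GT.M code: the helpers length and zlength are hypothetical stand-ins for $LENGTH() and $ZLENGTH() in UTF-8 mode, built on Python's str (characters) and bytes (octets) types.

```python
def length(text: str) -> int:
    """Character-oriented length, analogous to $LENGTH() in UTF-8 mode."""
    return len(text)


def zlength(text: str) -> int:
    """Byte-oriented length, analogous to $ZLENGTH()."""
    return len(text.encode("utf-8"))


sample = "新年好"
print(length(sample), zlength(sample))
```

For pure 7-bit ASCII data the two values coincide, which is why the distinction only matters once multi-byte characters or binary data are involved.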

In addition to the Z equivalent functions, GT.M now provides the following Z functions:

$ZCONVERT() function
- Two argument form: $ZCO[NVERT](expr1,expr2)
- Three argument form: $ZCO[NVERT](expr1,expr2,expr3)
$ZSUB[STR](expr,intexpr1[,intexpr2])
$ZW[IDTH](expr)


	Unlike the Z equivalent functions, the new Z functions do not have traditional M counterparts. They provide new functionality related to Unicode™.

The following sections describe the behavior of all the string-processing functions in UTF-8 mode and M mode.

$ASCII()

The $ASCII() function returns the integer code for a character in a string. Refer to the GT.M Programmer's Guide for a complete description and M-mode examples.

With character set UTF-8 specified, the $ASCII() function returns a decimal representation of the integer Unicode™ code-point value of a character in the given string.

In the Unicode™ Standard, the code-point is the hexadecimal integer that appears after the "U+" in the definition of each character. It may be as large as 1114109, corresponding to U+10FFFD.


	Although it seems counter-intuitive for a function called $ASCII() to return Unicode™ code-point values, the M standard has retained the name of $ASCII() for all character sets, which minimizes the application code changes needed to add the support for Unicode™.

Examples of $ASCII() in UTF-8 mode

GTM>W $ZCHSET
UTF-8
GTM>W $ASCII("新") 
26032 
GTM> W $$FUNC^%DH("26032")
000065B0

In the above example, 26032 is the integer equivalent of the hexadecimal value 65B0. U+65B0 is a character in the CJK Ideograph block of the Unicode™ Standard.

$Char()

The $CHAR() function returns a string of one or more characters corresponding to integer codes specified in its argument(s). Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $CHAR() function returns a string composed of characters represented by the integer equivalents of the Unicode™ code-points specified in its argument(s).

With VIEW "NOBADCHAR" enabled, the $CHAR() function ignores all expressions that do not correspond to valid Unicode™ code-points; consequently, it never returns a string with illegal or invalid characters.

With VIEW "BADCHAR" enabled, the $CHAR() function triggers a run-time error if any expression evaluates to a code-point value that is not a character in Unicode™. According to the Unicode™ Standard version 5.0, invalid code-points include the following sets:

The "too big" code-points (those greater than the maximum U+10FFFF).
The "surrogate" code-points (in the range [U+D800, U+DFFF]) which are reserved for UTF-16 encoding.
The "non-character" code-points that are always guaranteed to be not assigned to any valid characters. This set consists of [U+FDD0, U+FDEF] and all U+nFFFE and U+nFFFF (for each n from 0x0 to 0x10).
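
The three sets above translate directly into range checks. The Python sketch below is an illustration of those rules only; is_valid_code_point and char_nobadchar are hypothetical names, and char_nobadchar models the NOBADCHAR behavior of silently skipping invalid values.

```python
def is_valid_code_point(cp: int) -> bool:
    """Apply the three exclusion rules listed above."""
    if cp < 0 or cp > 0x10FFFF:
        return False  # "too big" (or negative)
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogates, reserved for UTF-16
    if 0xFDD0 <= cp <= 0xFDEF:
        return False  # non-characters U+FDD0 through U+FDEF
    if (cp & 0xFFFE) == 0xFFFE:
        return False  # non-characters U+nFFFE and U+nFFFF in every plane
    return True


def char_nobadchar(*codes: int) -> str:
    """Build a string, ignoring any value that is not a valid code-point."""
    return "".join(chr(cp) for cp in codes if is_valid_code_point(cp))


print(char_nobadchar(26032, 0xD800, 65))
```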

Example of $CHAR() in UTF-8 mode

GTM>W $ZCHSET 
UTF-8
GTM> W $CHAR(26032)
新
GTM> W $CHAR(65)
A

In the above example, the integer value 26032 corresponds to the Unicode™ character "新" in the CJK Ideograph block of Unicode™.

The output of the $CHAR() function for values of integer expression(s) from 0 through 127 does not vary with choice of the character encoding scheme. This is because 7-bit ASCII is a proper subset of the UTF-8 character encoding scheme. The representation of characters returned by the $CHAR() function for values 128 through 255 differs for each character encoding scheme.

When a program is compiled with VIEW "BADCHAR" and a literal argument for the $CHAR() function specifies an illegal character, the GT.M compiler triggers a BADCHAR error and embeds that error in the object in case the object is ever used. When a program is compiled with VIEW "NOBADCHAR" and a literal argument for $CHAR() specifies an illegal character, the GT.M compiler does not trigger the BADCHAR error, nor can the GT.M run-time system detect the error. Therefore, application developers must ensure a routine is compiled and executed with appropriately chosen (usually matching) settings of VIEW "BADCHAR".

$Extract()

The $EXTRACT() function returns a substring of a given string. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $EXTRACT() function interprets the string arguments as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $EXTRACT() function triggers a run-time error when it encounters a character in the reserved range of the Unicode™ Standard, but it does not process the characters that fall after the span specified by the arguments.


	For byte-oriented operations, use $ZEXTRACT(), as $EXTRACT() in NOBADCHAR mode interprets its string arguments as character, rather than byte-oriented and only returns byte-oriented results when all characters in its arguments are encoded in a single byte.

Examples of $EXTRACT() in UTF-8 mode

Example:

GTM>FOR i=0:1:4 WRITE !,$EXTRACT("新年好",i),"<" 
<
新<
年<
好<
<
GTM>

This loop displays the result of $EXTRACT(), specifying no ending character position and a beginning character position "before" the string, at the first, second and third positions, and "after" the string.

Example:

GTM>FOR i=0:1:4 WRITE !,$E("新年好",1,i),"<" 
<
新<
新年<
新年好<
新年好<
GTM>

This loop displays the result of $EXTRACT(), specifying a beginning character position of 1 and an ending character position "before" the string, at the first, second and third positions, and "after" the string.

Example:

TRIM(x) 
	NEW i,j,nx,fx SET fx=""
	FOR j=$L(x):-1:0 S nx=$E(x,1,j) Q:$EXTRACT(x,j)'=" " 
	FOR i=1:1:j S fx=$E(nx,i,$L(x)) Q:$EXTRACT(x,i)'=" " 
	QUIT fx 
GTM>SET str=" 新年好 "
GTM>WRITE $LENGTH(str)
5 
GTM>WRITE $LENGTH($$TRIM^trim(str))
3

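This extrinsic function uses $EXTRACT() to remove extra leading and trailing spaces from its argument.

$Find()

The $FIND() function returns an integer character position that locates the occurrence of a substring within a string. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.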

With character set UTF-8 specified, the $FIND() function interprets the string arguments as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $FIND() function triggers a run-time error when it encounters a malformed character, but it does not process the characters that fall after the span specified by the arguments.


	The $FIND() function must never be used for byte-oriented operations.

Examples of $FIND() in UTF-8 mode

Example:

GTM> WRITE $FIND("新年好","年") 
3 
GTM>

This example uses the $FIND() function to WRITE the position of the first occurrence of the character "年". The return of 3 gives the position after the "found" substring.

Example:

GTM> WRITE $FIND("准备庆祝新年在上海:准备庆祝新年在上海","上",9) 
19 
GTM>

This example uses $FIND() to WRITE the position after the next occurrence of the character "上" starting in character position nine.

Example:

GTM> SET t=1 FOR  SET t=$FIND("准备庆祝新年在上海:准备庆祝新年在上海","祝新",t) Q:'t  W !,t 
6 
16 
GTM>

This example uses a loop with $FIND() to locate all occurrences of "祝新" in "准备庆祝新年在上海:准备庆祝新年在上海". The $FIND() returns 6 and 16 giving the positions after the two occurrences of "祝新".
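
The "position after the match" convention is easy to get wrong by one, so here is a small Python model of the character-oriented search used in the three examples. The name m_find is hypothetical, positions are 1-based as in M, and the sketch does not model BADCHAR handling or an empty search string.

```python
def m_find(text: str, sub: str, start: int = 1) -> int:
    """Return the 1-based position after the first match at or after start, else 0."""
    start = max(start, 1)  # zero or negative starts search from position 1
    index = text.find(sub, start - 1)
    if index < 0:
        return 0
    return index + len(sub) + 1


text = "准备庆祝新年在上海:准备庆祝新年在上海"
positions = []
t = 1
while True:
    t = m_find(text, "祝新", t)
    if not t:
        break
    positions.append(t)
print(positions)
```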

$Justify()

The $JUSTIFY() function returns a formatted string. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $JUSTIFY() function interprets the string argument as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $JUSTIFY() function triggers a run-time error when it encounters a malformed character.

Examples of $JUSTIFY() in UTF-8 mode

Example:

GTM> WRITE $JUSTIFY("新年好",10),!,$JUSTIFY("准备庆祝新年在上海",5) 
       新年好
准备庆祝新年在上海
GTM>

The above example uses the $JUSTIFY() to display "新年好" in a field of 10 spaces and "准备庆祝新年在上海" in a field of 5 spaces. Because the length of "准备庆祝新年在上海" exceeds five spaces, the result overflows the specification.

Example:

GTM> WRITE "1234567890",!,$JUSTIFY(10.545,10,2) 
1234567890 
     10.55 
GTM>

This uses $JUSTIFY() to WRITE a rounded value right-justified in a field of 10 spaces. Notice that the result has been rounded up.

Example:

GTM> WRITE "1234567890",!,$JUSTIFY(10.544,10,2) 
1234567890 
     10.54 
GTM>

Again, this uses $JUSTIFY() to WRITE a rounded value right-justified in a field of 10 spaces. Notice that the result has been rounded down.

Example:

GTM> WRITE "1234567890",!,$JUSTIFY(10.5,10,2) 
1234567890 
     10.50 
GTM>

Once again, this uses $JUSTIFY() to WRITE a rounded value right-justified in a field of 10 spaces. Notice that the result has been zero-filled to 2 decimal places.

Example:

GTM> WRITE $JUSTIFY(.34,0,2)     
0.34 
GTM>

This example uses $JUSTIFY() to ensure the fraction has a leading zero. Note the use of a second argument of zero in the case where rounding is the only function that $JUSTIFY() is to perform.

$Length()

The $LENGTH() function returns the length of a string measured in characters, or in "pieces" separated by a delimiter specified by one of its arguments. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $LENGTH() function interprets the string argument(s) as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $LENGTH() function triggers a run-time error when it encounters a malformed character.

Examples of $LENGTH() in UTF-8 mode

Example:

GTM> WRITE $LENGTH("新年好") 
3 
GTM>

This uses $LENGTH() to WRITE the length in characters of the string "新年好".

Example:

GTM> SET x="新年好/准备庆祝新年在上海/准备庆" 
GTM> WRITE $LENGTH(x,"/") 
3 
GTM>

This uses $LENGTH() to WRITE the number of pieces in a string, as delimited by "/".

Example:

GTM> WRITE $LENGTH("/新/年好/","/") 
4 
GTM>

This also uses $LENGTH() to WRITE the number of pieces in a string, as delimited by "/". Notice that GT.M counts both the empty beginning and ending pieces in the string because they are both delimited.

$Piece()

The $PIECE() function returns a substring delimited by a specified string delimiter made up of one or more characters. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.

With character set UTF-8 specified, the $PIECE() function interprets the string arguments as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $PIECE() function triggers a run-time error when it encounters a malformed character, but it does not process the characters that fall after the span specified by the arguments.

Examples of $PIECE() in UTF-8 mode

Example:

GTM> FOR i=0:1:4 WRITE !,$PIECE("新 年 好"," ",i),"<" 

<
新<
年<
好<
<
GTM>

This loop displays the result of $PIECE(), specifying a space as a delimiter and a piece position "before" the string, at the first, second and third pieces, and "after" the string.

Example:

GTM> FOR i=-1:1:4 WRITE !,$PIECE("新 年 好"," ",i,i+1),"<" 
<
新<
新 年<
年 好<
好<
<
GTM>

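This example is similar to the previous example except that it requests two pieces on each iteration. Notice the delimiter (a space) in the middle of the output for the third and fourth iterations, each of which displays two pieces.

Example: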

F p=1:1:$L(x,"/") W ?p-1*10,$piece(x,"/",p)

This loop uses $LENGTH() and $PIECE() to display all the pieces of x in columnar format.
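
The piece-selection rules exercised by the examples above can be modeled in Python as follows. This is a sketch for illustration, with hypothetical names m_piece and m_length_pieces; it follows the behavior shown in the example output (a piece range entirely before the first piece, or beyond the last one, yields an empty string) and does not model BADCHAR handling.

```python
def m_piece(text: str, delim: str, first: int = 1, last=None) -> str:
    """Return pieces first..last of text (1-based), joined by delim."""
    if last is None:
        last = first  # one integer argument selects a single piece
    if delim == "" or last < first or last < 1:
        return ""
    pieces = text.split(delim)
    first = max(first, 1)  # a range starting before piece 1 starts at piece 1
    return delim.join(pieces[first - 1:last])


def m_length_pieces(text: str, delim: str) -> int:
    """Number of pieces: one more than the number of delimiters."""
    return len(text.split(delim))


for i in range(-1, 5):
    print(m_piece("新 年 好", " ", i, i + 1) + "<")
```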

Example:

GTM> s $P(x,".",25)="" W x

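This SETs the 25th piece of the variable x to null, using a period as the delimiter. This produces a string of 24 periods preceding the null.

$TRanslate()

The $TRANSLATE() function returns a string that results from replacing or dropping characters in the first of its arguments as specified by the patterns of its other arguments. Refer to the GT.M Programmer's Guide for a complete description and M mode examples.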

With character set UTF-8 specified, the algorithm of the $TRANSLATE() function interprets the string arguments as UTF-8 encoded. With VIEW "BADCHAR" enabled, the $TRANSLATE() function triggers a run-time error when it encounters a malformed character.

Examples of $TRANSLATE() in UTF-8 mode

Example:

GTM> WRITE $TR("新年好","年好","1") 
新1 
GTM>

As "新" (the first character in the first expression) does not exist in the second expression ("年好"), it appears unchanged in the result.
As "年" (the second character in the first expression) holds the first position in the second expression ("年好"), $TRANSLATE() replaces occurrences of "年" with 1, which is in the first, and corresponding, position of the third expression.
As "好" (the third character in the first expression) holds the second position in the second expression ("年好"), and there is no second character in the third expression, $TRANSLATE() replaces occurrences of "好" with a null, effectively deleting it from the result. The translated result is "新1".

Example:

GTM>WRITE $TR("新","X新","年好") 
好

This $TRANSLATE() example finds the position of first occurrence of the first expression in the second expression. Because the character "新" is in the second position, the output of the $TRANSLATE() function displays the character in the second position of the third expression.

Example:

GTM> WRITE $TR("新年好","好新") 
年
GTM>

As $TRANSLATE() has only two arguments in this example, it finds the characters in the first expression that also exist in the second expression and deletes them from the result.
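
The position-matching rule behind all three examples can be written as a short Python model. The name m_translate is hypothetical, and the sketch covers only the character-oriented rule described here, not BADCHAR handling.

```python
def m_translate(text: str, old: str = "", new: str = "") -> str:
    """Replace or delete characters of text according to old and new."""
    result = []
    for ch in text:
        position = old.find(ch)
        if position < 0:
            result.append(ch)  # not listed in old: keep unchanged
        elif position < len(new):
            result.append(new[position])  # replace with the matching position in new
        # otherwise: listed in old with no counterpart in new, so it is deleted
    return "".join(result)


print(m_translate("新年好", "年好", "1"))
print(m_translate("新", "X新", "年好"))
print(m_translate("新年好", "好新"))
```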

$Z Equivalent Functions

GT.M provides a number of functions that are analogous to the standard functions except that they support byte-oriented operations. In M mode, these functions are exactly equivalent to the standard function. In UTF-8, these functions provide a means to operate on arbitrary strings containing bytes that do not necessarily represent valid code-points. For code to operate properly in both modes, the $Z equivalent functions must always be used for operations that are byte-oriented rather than character-oriented.

$ZASCII()

The $ZASCII() function returns the numeric byte value (0 through 255) of a given sequence of octets (8-bit bytes).

The format for the $ZASCII() function is:

$ZA[SCII](expr[,intexpr])

The expression acts as the sequence of octets (8-bit bytes) from which $ZASCII() extracts the byte it decodes.
The optional integer expression (intexpr) contains the position within the expression of the byte that $ZASCII() decodes. If this argument is missing, $ZASCII() returns a result based on the first byte position. $ZASCII() starts numbering byte positions at one (1); that is, the first byte of a string is at position one (1).
If the explicit or implicit position is before the beginning or after the end of the expression, $ZASCII() returns a value of negative one (-1).

$ZASCII() provides a means of examining bytes in a byte sequence. Used with $ZCHAR(), $ZASCII() also provides a means to perform arithmetic operations on the byte values associated with a sequence of octets.

Example of $ZASCII()

Example:

GTM>FOR i=0:1:4 WRITE !,$ZA("新",i) 

-1
230
150
176
-1

This loop displays the result of $ZASCII(), specifying a byte position before, at the first, second and third positions, and after the sequence of octets (8-bit bytes) represented by 新. In the above example, 230, 150, and 176 represent the numeric byte values of the three bytes in the sequence of octets (8-bit bytes) represented by 新.

$ZChar()

The $ZCHAR() function returns a byte sequence of one or more bytes corresponding to the numeric byte values (0 through 255) specified in its argument(s).

The format for the $ZCHAR() function is:

$ZC[HAR](intexpr[,...])

The integer expression(s) specify the numeric byte value of the byte(s) $ZCHAR() returns.

GT.M limits the number of arguments to a maximum of 254. $ZCHAR() provides a means of producing byte sequences. Used with $ZASCII(), $ZCHAR() can also perform arithmetic operations on the byte values associated with a sequence of octets (8-bit bytes).

Example of $ZCHAR()

GTM> WRITE $ZCHAR(230,150,176,7) 
新
GTM>

This example uses $ZCHAR() to WRITE the byte sequence represented by 新 and signal the terminal "bell."

$ZExtract()

The $ZEXTRACT() function returns a byte sequence of a given sequence of octets (8-bit bytes).

The format for the $ZEXTRACT() function is:

$ZE[XTRACT](expr[,intexpr1[,intexpr2]])

The expression (expr) specifies the sequence of octets (8-bit bytes) from which $ZEXTRACT() derives a byte sequence.
The first optional integer expression (second argument) specifies the starting byte position in the byte string expr of the substring result. If the starting position is beyond the end of the expression, $ZEXTRACT() returns the null string. If the starting position is zero (0) or negative, $ZEXTRACT() starts at the first byte position in the expression; if this argument is omitted, $ZEXTRACT() returns the first byte of the expression. $ZEXTRACT() numbers byte positions starting at one (1); that is, the first byte of a sequence of octets (8-bit bytes) is at position one (1).
The second optional integer expression (intexpr2, the third argument) specifies the ending byte position for the result. If the ending position is beyond the end of the expression, $ZEXTRACT() stops with the last byte of the expression. If the ending position precedes the starting position, $ZEXTRACT() returns null. If this argument is omitted, $ZEXTRACT() returns one byte.
$ZEXTRACT() provides a tool for manipulating strings based on byte positions.
Because $ZEXTRACT() operates on bytes, it can produce a string that is not well-formed according to the UTF-8 character set.

Example of $ZEXTRACT()

Example:

GTM>FOR i=0:1:9 WRITE !,$ZASCII($ZEXTRACT("新年好",i)),"<"
-1<
230<
150<
176<
229<
185<
180<
229<
165<
189<

This loop displays the numeric value of each byte in the sequence of octets ("新年好"). Position 0 lies before the sequence, so $ZEXTRACT() returns null there and the result is -1.
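
The byte-position rules for $ZASCII() and $ZEXTRACT() can be modeled in Python on a bytes object. This is an illustrative sketch with hypothetical names zascii and zextract; it follows the rules and the example output above (1-based positions, -1 outside the sequence, a single byte when the ending position is omitted).

```python
def zascii(data: bytes, position: int = 1) -> int:
    """Numeric value of the byte at a 1-based position, or -1 if out of range."""
    if 1 <= position <= len(data):
        return data[position - 1]
    return -1


def zextract(data: bytes, first: int = 1, last=None) -> bytes:
    """Bytes first..last (1-based, inclusive); one byte if last is omitted."""
    if last is None:
        last = first
    first = max(first, 1)  # a zero or negative start begins at byte 1
    if last < first:
        return b""
    return data[first - 1:last]


octets = "新年好".encode("utf-8")
print([zascii(zextract(octets, i)) for i in range(10)])
```

Note that zextract(octets, 1, 2) returns the first two bytes of a three-byte character, which is not well-formed UTF-8; this is the reason byte-oriented results should be kept apart from character-oriented processing.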

$ZFind()

The $ZFIND() function returns an integer byte position that locates the occurrence of a byte sequence within a sequence of octets (8-bit bytes).

The format of the $ZFIND() function is:

$ZF[IND](expr1,expr2[,intexpr])

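The first expression (expr1) specifies the sequence of octets (8-bit bytes) in which $ZFIND() searches for the byte sequence.
The second expression (expr2) specifies the byte sequence for which $ZFIND() searches.
The optional integer expression (intexpr) identifies the starting byte position for the $ZFIND() search. If this argument is missing, zero (0), or negative, $ZFIND() begins the search from the first position of the sequence of octets (8-bit bytes).
If $ZFIND() locates the byte sequence, it returns the position after its last byte. If the end of the byte sequence coincides with the end of the sequence of octets (expr1), it returns an integer equal to the byte length of expr1 plus one ($ZL(expr1)+1).
If $ZFIND() does not locate the byte sequence, it returns zero (0).

$ZFIND() provides a tool to locate byte sequences. The contains operator ([) and the two-argument $ZLENGTH() are other tools that provide related functionality.

Examples of $ZFIND()

Example: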

GTM> WRITE $ZFIND("新年好",$ZCHAR(150)) 
3 
GTM>

This example uses $ZFIND() to WRITE the position of the first occurrence of the numeric byte code 150. The return of 3 gives the position after the "found" byte.

Example:

GTM> WRITE $ZFIND("新年好",$ZCHAR(229),5) 
8 
GTM>

This example uses $ZFIND() to WRITE the position after the next occurrence of the byte code 229, starting in byte position 5.

Example:

GTM>Set t=1 For  Set t=$ZFind("新年好",$ZChar(230,150,176),t) Quit:'t  Write !,t

4
GTM>

This example uses a loop with $ZFIND() to locate all the occurrences of the byte sequence $ZCHAR(230,150,176) in the sequence of octets ("新年好"). $ZFIND() returns 4, giving the position after the occurrence of the byte sequence $ZCHAR(230,150,176).

$ZJustify()

The $ZJUSTIFY() function returns a formatted and fixed-length byte sequence.

The format for the $ZJUSTIFY() function is:

$ZJ[USTIFY](expr,intexpr1[,intexpr2])

The expression (expr) specifies the sequence of octets formatted by $ZJUSTIFY().

The first integer expression, intexpr1 (second argument), specifies the minimum size of the resulting byte sequence. If the first integer expression is larger than the length of the expression, $ZJUSTIFY() right-justifies the expression to a byte sequence of the specified length by adding leading spaces. Otherwise, $ZJUSTIFY() returns the expression unmodified unless specified by the second integer argument.

The optional second integer expression (third argument) specifies the number of digits to follow the decimal point in the result, and forces $ZJUSTIFY() to evaluate the expression as numeric. If the numeric expression has more digits than this argument specifies, $ZJUSTIFY() rounds to obtain the result. If the expression has fewer digits than this argument specifies, $ZJUSTIFY() zero-fills to obtain the result.
When the second integer expression is specified and the expression evaluates to a fraction between -1 and 1, $ZJUSTIFY() returns a number with a leading zero (0) before the decimal point (.).

$ZJUSTIFY() fills a sequence of octets to create a fixed-length byte sequence. However, if the length of the specified expression exceeds the specified byte size, $ZJUSTIFY() does not truncate the result (although it may still round based on the third argument). When required, $ZEXTRACT() performs truncation.

$ZJUSTIFY() optionally rounds the portion of the result after the decimal point. In the absence of the third argument, $ZJUSTIFY() does not restrict the evaluation of the expression. In the presence of the third (rounding) argument, $ZJUSTIFY() evaluates the expression as a numeric value. The rounding algorithm can be understood as follows:

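If necessary, the rounding algorithm extends the expression to the right with zeros (0) so that it has at least one more digit than specified by the rounding argument.
Then, it adds five (5) to the digit position after the digit specified by the rounding argument.
Finally, it truncates the result to the specified number of digits. The algorithm rounds up when the excess digits specify a half or more of the last retained digit and rounds down when they specify less than a half.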

Example of $ZJUSTIFY()

Example:

GTM> WRITE "123456789012345",! WRITE $ZJUSTIFY("新年好",15),!,$ZJUSTIFY("新年好",5) 
123456789012345
      新年好
新年好
GTM>

This uses $ZJUSTIFY() to display the sequence of octets represented by "新年好" in fields of 15 space octets and 5 space octets. Because the byte length of "新年好" is nine, which exceeds 5, the second result overflows the specification.

$ZLength()

The $ZLENGTH() function returns the length of a sequence of octets measured in bytes, or in "pieces" separated by a delimiter specified by one of its arguments.

The format for the $ZLENGTH() function is:

$ZL[ENGTH](expr1[,expr2])

The first expression (expr1) specifies the sequence of octets that $ZLENGTH() "measures".

The optional second expression (expr2) specifies the delimiter byte sequence that defines the measure; if this argument is missing, $ZLENGTH() returns the number of bytes in the sequence of octets.

If the second argument is present and not null, $ZLENGTH() returns one more than the count of the number of occurrences of the second byte sequence in the first byte sequence; if the second argument is null, the M standard specifies that $ZLENGTH() returns a zero (0).

$ZLENGTH() provides a tool for determining the length of a sequence of octets in two ways: bytes and pieces. The two-argument $ZLENGTH() returns the number of existing pieces, while the one-argument form returns the number of bytes.

Examples of $ZLENGTH()

Example:

GTM> WRITE $ZLENGTH("新年好") 
9 
GTM>

This uses $ZLENGTH() to WRITE the length in bytes of the sequence of octets "新年好".

Example:

GTM> SET x="新"_$ZCHAR(63)_"年"_$ZCHAR(63)_"好" 
GTM> WRITE $ZLENGTH(x,$ZCHAR(63))
3 
GTM>

This uses $ZLENGTH() to WRITE the number of pieces in a sequence of octets, as delimited by the byte code $ZCHAR(63). Two delimiters separate three pieces.

Example:

GTM>SET x=$ZCHAR(63)_"新"_$ZCHAR(63)_"年"_$ZCHAR(63)_"好"_$ZCHAR(63)
GTM>WRITE $ZLENGTH(x,$ZCHAR(63))
5
GTM>

This also uses $ZLENGTH() to WRITE the number of pieces in a sequence of octets, as delimited by the byte code $ZCHAR(63). Notice that GT.M counts both the empty beginning and ending pieces in the string because they are both delimited.

$ZPiece()

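The $ZPIECE() function returns a sequence of bytes delimited by a specified byte sequence made up of one or more bytes. In M, $ZPIECE() returns a logical field from a logical record.

The format for the $ZPIECE() function is:

$ZP[IECE](expr1,expr2[,intexpr1[,intexpr2]])

The first expression (expr1) specifies the sequence of octets from which $ZPIECE() computes its result.

The second expression (expr2) specifies the delimiting byte sequence that determines the piece "boundaries"; if this argument is a null string, $ZPIECE() returns a null string.

If the second expression (expr2) does not appear anywhere in the first expression, $ZPIECE() returns the entire first expression (unless forced to return a null string by the second integer expression).
The optional first integer expression (third argument) specifies the beginning piece to return; if this argument is missing, $ZPIECE() returns the first piece.
The optional second integer expression (fourth argument) specifies the last piece to return. If this argument is missing, $ZPIECE() returns only one piece, unless the first integer expression is zero (0) or negative, in which case it returns a null string. If this argument is less than the first integer expression, $ZPIECE() returns null.
If the second integer expression exceeds the actual number of pieces in the first expression, $ZPIECE() returns all of the expression after the delimiter selected by the first integer expression.
The $ZPIECE() result never includes the "outside" delimiters; however, when the second integer argument specifies multiple pieces, the result contains the "inside" occurrences of the delimiter.

$ZPIECE() provides a tool for efficiently using values that contain multiple elements or fields, each of which may be variable in length.

Applications typically use a single byte for a $ZPIECE() delimiter (second argument) to minimize storage overhead and increase run-time efficiency. The delimiter must be chosen so that the data values never contain it. Failure to enforce this convention with edit checks may result in unanticipated changes in the position of pieces within the data value. The caret symbol (^), backslash (\), and asterisk (*) characters are examples of popular visible delimiters. Multiple-byte delimiters may reduce the likelihood of conflict with field contents. However, they decrease storage efficiency and are processed less efficiently than single-byte delimiters. Some applications use control characters, which reduce the chance of the delimiter appearing in the data but sacrifice the readability provided by visible delimiters.

A SET command argument can have something that has the format of a $ZPIECE() on the left-hand side of its equal sign (=). This construct permits easy maintenance of individual pieces within a sequence of octets. It can also be used to generate a byte sequence of delimiters. For more information on SET $ZPIECE(), refer to "Set" in the Commands chapter.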

Examples of $ZPIECE()

Example:

GTM>FOR i=0:1:3 WRITE !,$ZPIECE("新"_$ZCHAR(64)_"年",$ZCHAR(64),i),"<" 

<
新<
年<
<
GTM>

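This loop displays the result of $ZPIECE(), specifying $ZCHAR(64) as the delimiter and piece positions "before" the sequence of octets, the first and second pieces, and "after" the sequence of octets.

Example:

GTM>FOR i=-1:1:3 WRITE !,$ZPIECE("新"_$ZCHAR(64)_"年",$ZCHAR(64),i,i+1),"<"

<
新<
新@年<
年<
<
GTM>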
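The piece-selection rules can be checked with a small model. This is a Python sketch under the rules stated in this section only, not GT.M code; zpiece is a hypothetical name, and positions are clamped the way the argument descriptions above imply.

```python
from typing import Optional


def zpiece(octets: bytes, delimiter: bytes, first: int = 1,
           last: Optional[int] = None) -> bytes:
    """Model of $ZPIECE(): return pieces first..last of a byte sequence."""
    if delimiter == b"":
        return b""                         # null delimiter: null result
    if last is None:
        last = first                       # default: exactly one piece
    pieces = octets.split(delimiter)
    lo = max(first, 1)                     # positions before piece 1 select nothing
    hi = min(last, len(pieces))            # positions past the end are clamped
    if hi < lo:
        return b""
    return delimiter.join(pieces[lo - 1:hi])   # keeps only "inside" delimiters


at = bytes([64])                           # the single byte produced by $ZCHAR(64)
value = "新".encode("utf-8") + at + "年".encode("utf-8")
for i in range(-1, 4):
    print(zpiece(value, at, i, i + 1).decode("utf-8") + "<")
```

Run as written, the loop prints the same five lines as the second example above.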

Example:

F p=1:1:$ZL(x,"/") W ?p-1*10,$zpiece(x,"/",p)

This uses $ZLENGTH() and $ZPIECE() to display all the pieces of x in columnar format.

Example:

GTM> s $P(x,$ZCHAR(64),25)="" W x 
新年好@@@@@@@@@@@@@@@@@@@@@@@@

This SETs the 25th piece of the variable x to null, using $ZCHAR(64) as the delimiter. Note that this produces a byte sequence of 24 at-signs (@) preceding the null piece.

$ZTRanslate()

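The $ZTRANSLATE() function returns a byte sequence that results from replacing or dropping bytes in its first argument as specified by the patterns of its other arguments.

The format for the $ZTRANSLATE() function is:

$ZTR[ANSLATE](expr1[,expr2[,expr3]])

The first expression (expr1) specifies the string on which $ZTRANSLATE() operates. If the other arguments are omitted, $ZTRANSLATE() returns this expression.

The optional second expression (expr2) specifies the bytes for $ZTRANSLATE() to replace. If a byte occurs more than once in the second expression, the first occurrence controls the translation, and $ZTRANSLATE() ignores subsequent occurrences. If this argument is omitted, $ZTRANSLATE() returns the first expression without modification.

The optional third expression (expr3) specifies the replacement bytes for the second expression, corresponding by position. If this argument is omitted or is shorter than the second expression, $ZTRANSLATE() drops all occurrences of bytes in the second expression that have no replacement in the corresponding position of the third expression.

$ZTRANSLATE() provides a tool for tasks such as encryption.

The $ZTRANSLATE() algorithm can be understood as follows:

$ZTRANSLATE() evaluates each byte in the first expression, comparing it byte by byte to the second expression to look for a match. If there is no match in the second expression, the resulting expression contains the byte without modification.

When it locates a matching byte, $ZTRANSLATE() uses the position of the match in the second expression to identify the appropriate replacement in the third expression. If the second expression has more bytes than the third expression, $ZTRANSLATE() replaces the original byte with a null, thereby deleting it from the result. By extension of this principle, if the third expression is missing, $ZTRANSLATE() deletes from the first expression all bytes that occur in the second expression.

Examples of $ZTRANSLATE()

Example: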

GTM>set hiraganaA="あ" ; # $ZCHAR(227,129,130)
GTM>set temp1=$ZCHAR(130)
GTM>set temp2=$ZCHAR(140)
GTM>W hiraganaA
あ
GTM>W $ZTRANSLATE(hiraganaA,temp1,temp2)
が
GTM>

In the above example, $ZTRANSLATE() replaces the byte $ZCHAR(130) in the first expression (あ), which matches the first (and only) byte in the second expression, with the byte $ZCHAR(140), the corresponding byte in the third expression. The result of the translation is "が".
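The algorithm described above can be sketched as a byte-by-byte loop. The following Python model is illustrative only, not GT.M code; the name ztranslate and its defaults are assumptions made for this sketch.

```python
def ztranslate(octets: bytes, search: bytes = b"", replace: bytes = b"") -> bytes:
    """Model of $ZTRANSLATE(): replace or drop bytes by position in `search`."""
    out = bytearray()
    for byte in octets:
        pos = search.find(bytes([byte]))   # the first occurrence controls the translation
        if pos < 0:
            out.append(byte)               # no match: byte passes through unchanged
        elif pos < len(replace):
            out.append(replace[pos])       # replacement at the same position
        # otherwise the byte has no replacement and is dropped
    return bytes(out)


hiragana_a = "あ".encode("utf-8")           # bytes 227, 129, 130
result = ztranslate(hiragana_a, bytes([130]), bytes([140]))
print(result.decode("utf-8"))              # が
```

Because the function works on bytes, changing one byte of a multi-byte character yields a different character, or an illegal sequence if the resulting bytes are not valid UTF-8.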

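New $Z Functions

$ZCOnvert()

The $ZCONVERT() function returns its first argument as a string converted to a different encoding. The two-argument form changes the case of the characters within the character set. The three-argument form changes the encoding scheme.

The format for the $ZCONVERT() function is:

$ZCO[NVERT](expr1,expr2[,expr3])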

The first expression is the string to convert. If the expression contains a code-point value that is not in the character set, $ZCONVERT() generates a run-time error.

In the two-argument form, the second expression specifies a code that determines the form of the result. In the three-argument form, the second expression specifies a code that controls the character set interpretation of the first argument. If the expression does not evaluate to one of the defined codes that is valid for the number of arguments supplied, $ZCONVERT() generates a run-time error.
The optional third expression specifies a code that determines the character set of the result. If the expression does not evaluate to one of the defined codes, $ZCONVERT() generates a run-time error. The three-argument form is not supported in M mode.

The valid (case insensitive) character codes for expr2 in the two-argument form are:

U converts the string to UPPER-CASE. "UPPER-CASE" refers to words where all the characters are converted to their "capital letter" equivalents. Characters that are already in UPPER-CASE "capital letter" are retained unchanged.

L converts the string to lower-case. "lower-case" refers to words where all the letters are converted to their “small letter” equivalents. Characters that are already in lower-case or have no lower-case equivalent are retained unchanged.
T converts the string to title case. "Title case" refers to a string where the first character of each word is in the upper-case and the remaining ones in the lower-case. Characters that are already in the “Title case” are retained unchanged. “T” (title case) is not supported in M mode.

The valid (case insensitive) codes for character set encoding for expr2 and expr3 in the three-argument form are:

"UTF-8"-- a multi-byte variable length encoding form of Unicode™.
"UTF-16LE"-- a multi-byte 16-bit encoding form of Unicode™ in little-endian.
"UTF-16BE"-- a multi-byte 16-bit encoding form of Unicode™ in big-endian.

When UTF-8 mode is enabled, GT.M uses the ICU Library to perform case conversion. As mentioned in the Theory of Operation section, the case conversion of the strings occurs according to Unicode™ code-point values. This may not be the linguistically or culturally correct case conversion, for example, of the names in the telephone directories. Therefore, application developers must ensure that the actual case conversion is linguistically and culturally correct for their specific needs. The two-argument form of the $ZCONVERT() function in M mode does not use the ICU Library to perform operation related to the case conversion of the strings.

Examples of $ZCONVERT()

Example:

GTM>W $ZCONVERT("Happy New Year","U")
HAPPY NEW YEAR

Example:

GTM>W $ZCHSET 
M
GTM> W $ZCONVERT("HAPPY NEW YEAR","T")
%GTM-E-BADCASECODE, T is not a valid case conversion code

Example:

GTM>S T8="准备庆祝新年在上海"
GTM>W $L(T8)
9
GTM>S T16=$ZCONVERT(T8,"UTF-8","UTF-16LE")
GTM>W $L(T16) 
%GTM-E-BADCHAR, $ZCHAR(198) is not a valid character in the UTF-8 encoding form
GTM>S T16=$ZCONVERT(T16,"UTF-16LE","UTF-8")
GTM>W $L(T16)
9

In the above example, the $LENGTH() function triggers an error because it accepts only UTF-8 encoded strings as its argument.
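The reason the UTF-16LE result cannot be measured as UTF-8 can be seen by looking at the bytes. The following Python sketch models only the re-encoding step of the three-argument form; it is not GT.M code and uses Python's own codecs rather than ICU.

```python
text = "准备庆祝新年在上海"

t8 = text.encode("utf-8")                        # the UTF-8 form held by T8
t16 = t8.decode("utf-8").encode("utf-16-le")     # models $ZCONVERT(T8,"UTF-8","UTF-16LE")

print(len(text), len(t8), len(t16))              # 9 27 18
print(t16[0])                                    # 198, the byte named in the BADCHAR message

try:
    t16.decode("utf-8")                          # interpreting UTF-16LE bytes as UTF-8
    readable_as_utf8 = True
except UnicodeDecodeError:
    readable_as_utf8 = False                     # the bytes are not legal UTF-8

back = t16.decode("utf-16-le").encode("utf-8")   # models the conversion back to UTF-8
print(readable_as_utf8, back == t8)              # False True
```

The first UTF-16LE byte (198) starts a two-byte UTF-8 sequence, but the byte that follows is not a valid continuation byte, so a character-oriented function cannot interpret the string.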

$ZSUBstr()

The $ZSUBSTR() function returns a properly encoded string from a sequence of bytes.

The format for the $ZSUBSTR() function is:

$ZSUB[STR](expr,intexpr1[,intexpr2])

The first expression is the byte string from which the $ZSUBSTR() function derives the character sequence.

The second expression is the starting byte position (counting from 1 for the first position) in the first expression from where $ZSUBSTR() begins to derive the character sequence.
The optional third expression specifies the number of bytes from the starting byte position specified by the second expression that contribute to the result. If the third expression is not specified, the $ZSUBSTR() function returns the sequence of characters starting from the byte position specified by the second expression up to the end of the byte string.
The $ZSUBSTR() function never returns a string with illegal or invalid characters. With VIEW NOBADCHAR enabled, the $ZSUBSTR() function ignores all byte sequences within the specified range that do not correspond to valid Unicode™ code-points. With VIEW BADCHAR enabled, the $ZSUBSTR() function triggers a run-time error if the specified byte sequence contains a code-point value that is not in the character set.

The $ZSUBSTR() function is a new function introduced in conjunction with Unicode™ support. Like the $ZCONVERT() function and the $ZWIDTH() function, it does not have a traditional M equivalent.

Examples of $ZSUBSTR()

Example:

GTM>W $ZCHSET
M
GTM>set char1="a" ; one-byte character
GTM>set char2="ç" ; two-byte character
GTM>set char3="新" ; three-byte character
GTM>set y=char1_char2_char3
GTM>W $ZSUBSTR(y,1,3)=$ZSUBSTR(y,1,5)
0

With character set M specified, the expression $ZSUBSTR(y,1,3)=$ZSUBSTR(y,1,5) evaluates to 0 or "false" because the expression $ZSUBSTR(y,1,5) returns more characters than $ZSUBSTR(y,1,3).

Example:

GTM>W $ZCHSET
UTF-8
GTM>set char1="a" ; one-byte character
GTM>set char2="ç" ; two-byte character
GTM>set char3="新" ; three-byte character
GTM>set y=char1_char2_char3
GTM>W $ZSUBSTR(y,1,3)=$ZSUBSTR(y,1,5)
1

With character set UTF-8 specified, the expression $ZSUBSTR(y,1,3)=$ZSUBSTR(y,1,5) evaluates to 1 or "true" because the expression $ZSUBSTR(y,1,5) returns a string made up of char1 and char2 excluding the three-byte char3 because it was not completely included in the specified byte-length.

In many ways, the $ZSUBSTR() function is similar to the $ZEXTRACT() function. For example, $ZSUBSTR(expr,intexpr1) is equivalent to $ZEXTRACT(expr,intexpr1,$L(expr)). Note that this means when using the M character set, $ZSUBSTR() behaves identically to $EXTRACT() and $ZEXTRACT().

The differences are as follows:

$ZSUBSTR() cannot appear on the left of the equal sign in the SET command, whereas $ZEXTRACT() can.

In both modes, the third expression of $ZSUBSTR() is measured in bytes rather than characters.
$EXTRACT() operates on characters, irrespective of byte length.
$ZEXTRACT() operates on bytes, irrespective of multi-byte character boundaries.
$ZSUBSTR() is the only way to extract valid UTF-8 encoded characters from a given byte string. It operates on characters in Unicode™ so that its result does not exceed the given byte length.
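The UTF-8 mode behavior with VIEW NOBADCHAR can be approximated as "take a byte window, then keep only complete, valid characters". The Python sketch below is a model under that assumption, not GT.M code; zsubstr is a hypothetical name, and Python's errors="ignore" decoding stands in for GT.M ignoring invalid byte sequences.

```python
from typing import Optional


def zsubstr(octets: bytes, start: int, length: Optional[int] = None) -> bytes:
    """Model of $ZSUBSTR() in UTF-8 mode with NOBADCHAR."""
    begin = max(start, 1) - 1                     # positions count from 1
    if length is None:
        window = octets[begin:]                   # up to the end of the byte string
    else:
        window = octets[begin:begin + max(length, 0)]
    # Drop any bytes that do not form a complete, valid UTF-8 character.
    return window.decode("utf-8", errors="ignore").encode("utf-8")


y = "aç新".encode("utf-8")                         # 1 + 2 + 3 = 6 bytes
print(zsubstr(y, 1, 3).decode("utf-8"))           # aç
print(zsubstr(y, 1, 5).decode("utf-8"))           # aç
print(zsubstr(y, 1, 3) == zsubstr(y, 1, 5))       # True
```

A five-byte window ends in the middle of the three-byte character, so that character is excluded and the result matches the three-byte window.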

$ZWidth()

The $ZWIDTH() function returns the number of columns required to display a string on the screen or printer.

The format for the $ZWIDTH() function is:

$ZW[IDTH](expr)

The expression is the string that $ZWIDTH() evaluates for display length. If the expression contains a code-point value that is not a valid character in Unicode™, $ZWIDTH() generates a run-time error.

If the expression contains any non-graphic characters, the $ZWIDTH() function does not count those characters.
If the string contains any escape sequences containing graphical characters (which they typically do), $ZWIDTH() includes those characters in calculating its result, as it does not do escape processing. In such a case, the result may be larger than the actual display width.


	The $ZWIDTH() function triggers a run-time error if it encounters a malformed byte sequence, irrespective of the setting of "BADCHAR".

With character set UTF-8 specified, the $ZWIDTH() function uses the ICU's glyph-related conventions to calculate the number of columns required to represent the expression.

Examples of $ZWIDTH()

Example:

GTM>S NG=$CHAR($$FUNC^%HD("200B"))
GTM>S STR=$CHAR(26032)_NG_$CHAR(26376)
GTM>W STR
新月
GTM>W $ZWIDTH(STR)
4
GTM>

In the above example, the local variable NG contains a non-graphic character which does not display between two double-width characters in Unicode™.

Example:

GTM> W $ZWIDTH("Get ready to celebrate the new year in Shanghai")
47
GTM>S A="新年好"
GTM>W "123456",!,A
123456
新年好
GTM>W $ZWIDTH(A)
6

In the above example, the $ZWIDTH() function returns 6 because each character in A occupies 2 columns when they are displayed on the screen or printer.
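The column counts in these examples can be reproduced approximately with the Unicode East Asian Width property. The Python sketch below is an approximation and not GT.M code: GT.M uses ICU, whereas this model uses Python's unicodedata module, and treating control, format, and combining characters as zero columns is an assumption of the sketch.

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate the number of display columns a string occupies."""
    width = 0
    for ch in text:
        # Assumption: control/format characters and combining marks take no column.
        if unicodedata.category(ch).startswith("C") or unicodedata.combining(ch):
            continue
        # Wide (W) and fullwidth (F) characters take two columns, others one.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


ng = chr(0x200B)                                  # zero width space, as in NG above
print(display_width(chr(26032) + ng + chr(26376)))   # 4
print(display_width("新年好"))                     # 6
print(display_width("Get ready to celebrate the new year in Shanghai"))   # 47
```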

Intrinsic Special Variables

$X

For complete description and UTF-8 mode examples, refer to the Cursor Position Variable section earlier in this document.

$ZPATN[umeric]

$ZPATN[UMERIC] is a read-only special intrinsic variable that determines how GT.M interprets the patcode “N” used in the pattern match operator. With $ZPATNUMERIC="UTF-8", the patcode “N” matches any numeric character as defined by Unicode™. With $ZPATNUMERIC="M", GT.M restricts the patcode “N” to match only ASCII digits 0-9 (that is, ASCII 48-57). When a process starts in UTF-8 mode, the special intrinsic variable $ZPATNUMERIC takes its value from the environment variable gtm_patnumeric. GT.M initializes $ZPATNUMERIC to "UTF-8" if gtm_patnumeric is defined as "UTF-8". If gtm_patnumeric is not defined, or is set to a value other than "UTF-8", GT.M initializes $ZPATNUMERIC to "M".

$ZPATNUMERIC cannot appear on the left of an equal sign in a SET command. That is: GT.M populates it at process initialization from gtm_patnumeric and does not allow the process to change the value.

$ZCH[set]

The read-only special intrinsic variable $ZCHSET takes its value from the environment variable gtm_chset. An application can obtain the character set used by a GT.M process from the value of $ZCHSET. $ZCHSET can have only two values, "M" or "UTF-8", and it cannot appear on the left of an equal sign in the SET command.

Note that behavior for 7-bit ASCII characters is the same in both "M" and "UTF-8" modes. Customers operating in M mode are expected to use various ISO-Latin character sets.

GT.M only supports Unicode™ on Unix platforms. In OpenVMS, GT.M always gives special intrinsic variable $ZCHSET the value "M" and ignores the value of the environment variable gtm_chset even if it is defined.

$ZPROMpt

$ZPROM[PT] contains a string value specifying the current Direct Mode prompt. By default, GTM> is the Direct Mode prompt. M routines can modify $ZPROMPT by means of a SET command. $ZPROMPT cannot exceed 31 bytes. If an attempt is made to assign $ZPROMPT a longer string, GT.M takes only the first 31 bytes and truncates the rest. With character set UTF-8 specified, if the 31st byte is not the end of a valid UTF-8 character, GT.M truncates the $ZPROMPT value at the end of the last character that fits completely within the 31-byte limit.

User-Defined Collation

As noted in the Theory of Operation section, applications that use characters in Unicode™ may need to implement their own collation functions. For instructions on defining a collation system, please refer to Chapter 10, "Internationalization", of the GT.M Programmer's Guide.

By default, GT.M sorts string subscripts in the default order of the Unicode™ numeric code-point ($ASCII()) values. Since this implied ordering may or may not be linguistically or culturally correct for a specific application, an implementation of an algorithm such as the Unicode™ Collation Algorithm (UCA) may be required. Note that implementation of collation in GT.M requires the implementation of two functions, f(x) and g(y). f(x) transforms each input sequence of bytes into an alternative sequence of bytes for storage. Within the GT.M database engine, M nodes are retrieved according to the byte order in which they are stored. For each y that can be generated by f(x), g(y) is an inverse function that provides the original sequence of bytes; in other words, g(f(x)) must be equal to x for all x that the application processes. For example, for the People's Republic of China, it may be appropriate to convert from UTF-8 to Guojia Biaozhun (国家标准), the GB18030 standard, for example, using the libiconv library. The following requirements are important:

Unambiguous transformation routines: The transform and its inverse must convert each input string to a unique sequence of bytes for storage, and convert each sequence of bytes stored back to the original string.
Collation sequence for all expected character sequences in subscripts: GT.M does not validate the subscript strings passed to/from the collation routines. If the application design allows illegal UTF-8 character sequences to be stored in the database, the collation functions must appropriately transform, and inverse transform, these as well.
Handle different string lengths for before and after transformation: If the lengths of the input string and transformed string differ, and, for local variables, if the output buffer passed by GT.M is not sufficient, follow the procedure described below:
Global Collation Routines: The transformed key must not exceed 255 bytes, the maximum key size. GT.M allocates a temporary buffer of size 255 bytes in the output string descriptor (of type DSC_K_DTYPE_T) and passes it to the collation routine to return the transformed key.
Local Collation Routines: GT.M allocates a temporary buffer in the output string descriptor based on the size of the input string. Both transformation and inverse transformation must check the buffer size, and if it is not sufficient, the transformation must allocate sufficient memory, set the output descriptor value (val field of the descriptor) to point to the new memory, and return the transformed key successfully. Since GT.M copies the key from the output descriptor into its internal structures, it is important that the allocated memory remain available even after the collation routines return. Collation routines are typically called throughout the process lifetime; therefore, GT.M expects the collation libraries to define a large static buffer sufficient to hold all key sizes in the application. Alternatively, the collation transform can use a large heap buffer (allocated by the system malloc() or GT.M gtm_malloc()). Application developers must choose the method best suited to their needs.
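The relationship between the transform f(x) and its inverse g(y) can be illustrated with the GB18030 conversion mentioned above. This is only a Python model of the two functions: actual GT.M collation routines are implemented as described in the Programmer's Guide, and the names xform, xback, and MAX_KEY_BYTES are hypothetical. The sketch also assumes the input is legal UTF-8, so it does not cover the requirement to handle illegal sequences.

```python
MAX_KEY_BYTES = 255      # maximum key size stated for global collation routines


def xform(key: bytes) -> bytes:
    """f(x): transform a UTF-8 key into the byte sequence used for storage."""
    stored = key.decode("utf-8").encode("gb18030")
    if len(stored) > MAX_KEY_BYTES:
        raise ValueError("transformed key exceeds the maximum key size")
    return stored


def xback(stored: bytes) -> bytes:
    """g(y): inverse transform, recovering the original UTF-8 key."""
    return stored.decode("gb18030").encode("utf-8")


keys = [s.encode("utf-8") for s in ("新年", "上海", "abc")]

# g(f(x)) must equal x for every key the application processes.
round_trip_ok = all(xback(xform(k)) == k for k in keys)
print(round_trip_ok)                              # True

# Nodes are retrieved in the byte order of the stored (transformed) keys.
ordered = sorted(keys, key=xform)
```

The lengths before and after transformation differ (three bytes per character in UTF-8, two in GB18030 for these characters), which is why the buffer-size rules above matter.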

Compiling and Linking

To properly handle embedded literals for the same source code, depending on whether $ZCHset is "M" or "UTF-8", GT.M generates different object code. GT.M uses $ZROutines to match object code to source code. If there is no object code, GT.M automatically generates an object in the mode of the current process. If the object code exists and does not match the mode of the current process, GT.M issues an error. This means, when both M and UTF-8 processes are using the same source code, the objects must be stored in separate directories or libraries and have differing $ZROutines values that locate the appropriate object code.

Environment Variables

The following table summarizes the Unicode™ related environment variables.

Unix Environment Variables

Variable Name

Description

gtm_chset

Use this environment variable to initialize the value of the special intrinsic variable $ZCHSET. To enable a process to start in UTF-8 mode, the environment variable gtm_chset must be set to "UTF-8".

If the environment variable gtm_chset is not defined, or is defined to a value other than "UTF-8", the GT.M process starts in M mode and assumes each character is encoded in a single byte. This is the default behavior. The default value of "M" for gtm_chset minimizes the changes to applications coded before Unicode™ support.

gtm_badchar

Use this environment variable to initialize the value of VIEW “BADCHAR”. If gtm_badchar is defined and evaluates to “TRUE” (or “T”), “YES” (or “Y”), or a non-zero integer, VIEW “BADCHAR” is enabled. Otherwise, VIEW “NOBADCHAR” is enabled. By default, VIEW “BADCHAR” is enabled. For more details please refer to the VIEW command section.

gtm_patnumeric

Use this environment variable to initialize the value of the special intrinsic variable $ZPATNUMERIC in UTF-8 mode.

If the value of special intrinsic variable $ZCHSET is M, GT.M ignores the value of the environment variable gtm_patnumeric and initializes $ZPATNUMERIC to "M".

LC_CTYPE

ICU uses the environment variable LC_CTYPE to determine the locale behavior. In an installation using multiple Unicode™ encoded languages, all processes may have gtm_chset as UTF-8, but might have different LC_CTYPE settings. Using an LC_CTYPE setting that does not match the application assumptions, particularly previously stored data, may cause undesirable results. The process of “setting” LC_CTYPE depends on the shell in use (for example, setenv LC_CTYPE in tcsh). The action associated with the NONUTF8CHSET error explains how to start a user down a path to a successful recovery.


	If LC_CTYPE specifies a character set other than UTF-8, GT.M fails to start up and reports the NONUTF8CHSET error. Note that the LC_ALL environment variable overrides all the LC_* (locale) variables. GT.M only requires an appropriate setting for LC_CTYPE; other applications or work in the system may dictate whether LC_ALL is appropriate.
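The initialization rules in this table can be summarized as a small decision function. The Python sketch below is a model of the rules as stated here, not of GT.M's actual start-up code. The function name, the exact-match comparison with "UTF-8", the case-insensitive reading of gtm_badchar, and the way the locale codeset is parsed from a name such as en_US.UTF-8 are all assumptions.

```python
def startup_settings(env: dict) -> dict:
    """Model how the environment variables above determine process settings."""
    chset = "UTF-8" if env.get("gtm_chset") == "UTF-8" else "M"

    # gtm_patnumeric is only consulted when the process starts in UTF-8 mode.
    utf8_pat = chset == "UTF-8" and env.get("gtm_patnumeric") == "UTF-8"
    patnumeric = "UTF-8" if utf8_pat else "M"

    flag = env.get("gtm_badchar")
    if flag is None:
        badchar = True                           # default: VIEW "BADCHAR" enabled
    else:
        word = flag.strip().upper()
        if word in ("TRUE", "T", "YES", "Y"):
            badchar = True
        else:
            try:
                badchar = int(word) != 0         # any non-zero integer enables it
            except ValueError:
                badchar = False

    # LC_ALL, when set, overrides LC_CTYPE.
    locale_name = env.get("LC_ALL") or env.get("LC_CTYPE") or ""
    codeset = locale_name.partition(".")[2].partition("@")[0]
    locale_is_utf8 = codeset.replace("-", "").upper() == "UTF8"

    return {"chset": chset, "patnumeric": patnumeric,
            "badchar": badchar, "locale_is_utf8": locale_is_utf8}


print(startup_settings({"gtm_chset": "UTF-8", "LC_CTYPE": "en_US.UTF-8"}))
```

In this model, a UTF-8 mode process with locale_is_utf8 equal to False corresponds to the situation in which GT.M reports NONUTF8CHSET.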

Utility Programs

Global Directory Editor (GDE)

As noted in the previous sections, a process operating in M mode exhibits unaltered behavior. There is no change in the GDE utility in M mode. In the UTF-8 mode, the changes to the GDE objects are as follows:

GDE Object	Allowed format	Description
File name	Unicode™	GDE allows the name of a file to include characters in Unicode™
Global variables and Region and Segment	ASCII	As there are no changes to the GT.M database format, GDE takes ASCII names for global variables, Regions, and Segments.
GDE commands/qualifier	ASCII	As there are no changes to the GT.M database engine, GDE takes only ASCII names for all the GDE commands and Qualifiers.
GDE Logs and the output generated by the LOG command	Unicode™	GDE considers a text file to be encoded in UTF-8 when it is executed via the “@” command.
The global directory file (.gld) format	ASCII	File names in a global directory containing non-ASCII characters may not be displayed properly in a non-Unicode™ environment.

MUPIP

The MUPIP utility now handles Unicode™ data. Both the ZWR and GO formats of EXTRACT use the ZWRITE format specified in the Data Interchange section. In UTF-8 mode, MUPIP EXTRACT, MUPIP JOURNAL -EXTRACT and MUPIP JOURNAL -LOSTTRANS write sequential output files in the UTF-8 character encoding form. For example, in UTF-8 mode, if ^A has the value 准备好庆祝新年在纽约。, the sequential output file of the MUPIP EXTRACT command is:

09-OCT-2006  04:27:53 ZWR
GT.M MUPIP EXTRACT UTF-8
^A="准备好庆祝新年在纽约。"

Similarly, the MUPIP LOAD command considers a sequential file as encoded in UTF-8 if the environment variable gtm_chset is set to UTF-8.


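	Ensure that a MUPIP EXTRACT command and the corresponding MUPIP LOAD command execute with the same setting of the environment variable gtm_chset. The M utility programs %GO and %GI have the same requirement for mode matching.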

MUPIP EXTRact

The MUPIP EXTRACT command adds the label "UTF-8" in the header label of the file extracted in the UTF-8 mode as follows:

MUPIP Command	UTF-8 Mode	M Mode
MUPIP EXTRACT (both ZWR and GO)	GT.M MUPIP EXTRACT UTF-8	GT.M MUPIP EXTRACT
MUPIP EXTRACT (BINARY)	GDS BINARY EXTRACT LEVEL 42006082413413901024002560006400000UTF-8 GT.M MUPIP EXTRACT	GDS BINARY EXTRACT LEVEL 42006082413413901024002560006400000GT.M MUPIP EXTRACT
MUPIP JOURNAL EXTRACT	GDSJEX03 UTF-8	GDSJEX03
LOST TRANSACTION EXTRACT	GDSJEX03 ROLLBACK PRIMARY INSTANCE1 UTF-8	GDSJEX03 ROLLBACK PRIMARY INSTANCE1

In UTF-8 mode, MUPIP LOAD triggers the LOADINVCHSET error if the header label of an extract file does not contain " UTF-8" as a suffix.

All MUPIP command qualifiers that require file names, keys, or data (for example, MUPIP SET -FILE, MUPIP INTEG -SUBSCRIPT, MUPIP REORG -SELECT qualifiers) accept characters in Unicode™ in UTF-8 mode. Database replication instance names must be ASCII. Although GT.M does not trigger an error if the name of a database replication instance is in Unicode™, FIS recommends the use of ASCII characters for naming all the database replication instances.

If the environment variable gtm_chset is not defined or is set to M, the MUPIP utility writes the byte-equivalent values of the globals containing characters in Unicode™ in the sequential output file. For example, if ^A has the value 准备好庆祝新年在纽约。, the sequential output file of the MUPIP EXTRACT command is:

09-OCT-2006  04:25:52 ZWR
GT.M MUPIP EXTRACT
^A=""_$C(135,134)_""_$C(135)_"好"_$C(134)_""_$C(157)_""_$C(150)_"?年"_$C(156)_"?纽约"_$C(128,130)

In both modes, if EXTRACT encounters an illegal character, it places $ZCH representation in the sequential output file.


	MUPIP EXTRACT and MUPIP LOAD respectively produce and accept only the abbreviated forms of $CHAR() and $ZCHAR(), that is, $C() and $ZCH().

DSE & LKE

If the environment variable gtm_chset is set to UTF-8, the DSE DUMP command prints graphic characters for visualization. DSE does not write non-graphic characters and malformed characters to the interpreted output, but instead represents such characters by a dot character.

Example:

dse dump -block=9
File    /home/V52/mumps.dat
Region  DEFAULT
Block 9   Size 24   Level 0   TN 9 V5
Rec:1  Blk 9  Off 10  Size 14  Cmpc 0  Key ^DD
      10 : | 14  0  0  9 44 44  0  0 E5 A4 AA E9 98 B3 E7 9A 84 E5 B9 B4|
           |  .  .  .  .  D  D  .  .       太       阳       的       年|

However, in M mode, DSE DUMP prints dot characters for all non-ASCII characters and malformed characters.

In UTF-8 mode, DSE and LKE accept characters in Unicode™ in all their command qualifiers that require file names, keys, or data (such as DSE -KEY, DSE -DATA and LKE -LOCK qualifiers).

LKE SHOW now represents canonical numeric subscripts without quotes.

Example:

GTM>l ^A(1)
GTM>zsy
$ lke
LKE> show -all
DEFAULT
^A(1) Owned by PID= 8102 which is an existing process
LKE>

M Utility Routines

The %UTF2HEX and %HEX2UTF M utility routines provide conversions between UTF-8 and hexadecimal code-point representations. Both utilities run only in UTF-8 mode; in M mode, they both trigger a run-time error.

%UTF2HEX

The GT.M %UTF2HEX utility returns the hexadecimal notation of the internal byte encoding of a UTF-8 encoded GT.M character string. This routine has entry points for both interactive and non-interactive use.

DO ^%UTF2HEX converts the string stored in %S to the hexadecimal byte notation and stores the result in %U.

DO INT^%UTF2HEX converts the interactively entered string to the hexadecimal byte notation and stores the result in %U.

$$FUNC^%UTF2HEX(s) returns the hexadecimal byte representation of the character string s.

Example:

GTM> SET %S="AÄB"
GTM> DO ^%UTF2HEX
GTM> ZWRITE %U
%U="41C38442"
GTM> W $$FUNC^%UTF2HEX("ABC")
414243
GTM>

Note that %UTF2HEX provides functionality similar to that of the UNIX binary dump utility (od -x).

%HEX2UTF

The GT.M %HEX2UTF utility converts a byte stream specified in hexadecimal notation to a GT.M UTF-8 encoded character string. This routine has entry points for both interactive and non-interactive use.

DO ^%HEX2UTF converts the hexadecimal byte stream stored in %U into a GT.M character string and stores the result in %S.

DO INT^%HEX2UTF converts the interactively entered hexadecimal byte stream into a GT.M character string and stores the result in %S.

$$FUNC^%HEX2UTF (s) returns the GT.M character string given the hexadecimal byte stream representation in s.

Example:

GTM> SET %U="41C3A441"
GTM> DO ^%HEX2UTF
GTM> ZWRITE %S
%S="AäA"
GTM> W $$FUNC^%HEX2UTF("414243")
ABC
GTM>
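The two conversions are inverses of each other at the byte level. The following Python sketch reproduces the values in the examples above; it is a model, not the %UTF2HEX or %HEX2UTF routines themselves, and the function names are hypothetical.

```python
def utf2hex(text: str) -> str:
    """Hexadecimal notation of the UTF-8 bytes of a character string."""
    return text.encode("utf-8").hex().upper()


def hex2utf(hex_bytes: str) -> str:
    """Character string for a UTF-8 byte stream given in hexadecimal notation."""
    return bytes.fromhex(hex_bytes).decode("utf-8")


print(utf2hex("A\u00c4B"))        # 41C38442  (A, Ä, B)
print(utf2hex("ABC"))             # 414243
print(hex2utf("41C3A441"))        # AäA
print(hex2utf("414243"))          # ABC
```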

Discussion and Best Practices

Data Interchange

Because the support for Unicode™ in GT.M only affects the interpretation of data in databases, and not the databases themselves, a simple way to convert a ZWR format extract in one mode to an extract in the other is to load it into the database using a process in the mode in which it was generated, and then to extract it from the database again using a process in the other mode.

If a sequence of 8-bit octets contains bytes other than those in the ASCII range (0 through 127), an extract in ZWR format for the same sequence of bytes is different in "M" and "UTF-8" modes. In "M" mode, the $C() values in a ZWR format extract are always equal to or less than 255. In "UTF-8" mode, they can have larger values - the code-points of legal characters in Unicode™ can be far greater than 255.

Note that the characters written to the output device are subject to the OCHSET transformation of the controlling output device. If OCHSET is "M", the multi-byte characters are written in raw bytes without any transformation.

Each multi-byte graphic character (as classified by $ZCHSET) is written directly to the device converted to the encoding form specified by the OCHSET of the output device.
Each multi-byte non-graphic character (as classified by $ZCHSET) is written in $CHAR(nnnn) notation, where nnnn is the decimal character code (that is, code-point up to 1114111 if $ZCHSET="UTF-8" or up to 255 if $ZCHSET="M").
If $ZCHSET="UTF-8" and a subscript or data contains a malformed UTF-8 byte sequence, ZWRITE treats each byte in the sequence as a separate malformed character. Each such byte is written in $ZCHAR(nn[,…]) notation, where each nn is the corresponding byte in the illegal UTF-8 byte sequence.
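The three rules above (graphic characters written directly, non-graphic characters as code-points, malformed bytes individually) can be modelled as follows. This Python sketch is a rough illustration for UTF-8 mode, not the ZWRITE implementation: the name zwr_format is hypothetical, Python's str.isprintable() stands in for GT.M's graphic classification, and the abbreviated $C() and $ZCH() forms are used for output.

```python
def zwr_format(data: bytes) -> str:
    """Render a byte sequence roughly as the ZWR rules above describe."""
    # surrogateescape maps each malformed byte 0xNN to the lone surrogate U+DCNN.
    text = data.decode("utf-8", errors="surrogateescape")
    runs = []                                        # [kind, values] pairs
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            kind, value = "$ZCH", str(cp - 0xDC00)   # malformed byte, written per byte
        elif ch.isprintable():
            kind, value = "lit", ch.replace('"', '""')   # graphic: written directly
        else:
            kind, value = "$C", str(cp)              # non-graphic: decimal code-point
        if runs and runs[-1][0] == kind:
            runs[-1][1].append(value)
        else:
            runs.append([kind, [value]])
    parts = []
    for kind, values in runs:
        if kind == "lit":
            parts.append('"' + "".join(values) + '"')
        else:
            parts.append(kind + "(" + ",".join(values) + ")")
    return "_".join(parts) if parts else '""'


print(zwr_format("A\tB".encode("utf-8")))    # "A"_$C(9)_"B"
print(zwr_format(b"A\xffB"))                 # "A"_$ZCH(255)_"B"
```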

Note that attempts to use ZWRITE output from one system as input to another system using a different character set may result in errors or may not yield the same state as existed on the source system. Application developers can deal with this by defining and using one or more pattern tables that declare all non-ASCII characters (or any useful subset thereof) to be non-graphic. For more details on defining pattern tables, please refer to the "Pattern Code Definition" section of the "Internationalization" chapter in the GT.M Programmer's Guide.

Limitations

User-defined pattern codes are not supported

Although the M standard patcodes (A,C,L,U,N,P,E) are extended to work with Unicode™, application developers can neither change their default classification nor define the non-standard patcodes (B,D,F-K,M,O,Q-T,V-X) beyond the ASCII subset. This means that pattern tables cannot contain characters with codes greater than the maximum ASCII code of 127.

String Normalization

In GT.M, strings are not implicitly normalized. Unicode™ normalization is a method of computing canonical representation of the character strings. Normalization is required if the strings contain combination characters (such as accented characters consisting of a base character followed by an accent character) as well as precomposed characters. The Unicode™ standard assigned code-points to such precomposed characters for backward compatibility with legacy code sets. For the applications containing both versions of the same character (or combining characters), Unicode™ recommends one of the normal forms. Because GT.M does not normalize strings, the application developers must develop the functionality of normalizing the strings, as needed, in order for string matching and string collation to behave in a conventional and wholesome fashion. In such a case, edit checks can be used that only accept a single representation when multiple representations are possible.

The encoding of the $PRINCIPAL device is determined at process start-up

At process start-up, GT.M implicitly OPENs $PRINCIPAL before any application code is executed, using the encoding specified by $gtm_chset. $PRINCIPAL is never OPENed by any application code. ichset, ochset and chset device parameters are characteristics of the OPEN command rather than the USE command, since an IO device cannot conveniently switch encoding in mid-stream. Therefore, the character set of $PRINCIPAL is determined for the process, and cannot be changed.

One implication of this restriction on $PRINCIPAL (including Terminal, Sequential File and Socket devices) is that UTF-16, UTF-16LE and UTF-16BE encodings are never supported for $PRINCIPAL.

UTF-16 is not supported for terminal devices

Due to the uncommon usage and lack of support for UTF-16 by UNIX terminals and terminal emulators, GT.M does not support UTF-16, UTF-16LE and UTF-16BE encodings for Terminal I/O devices. Note that UNIX platforms use UTF-8 as the de facto character encoding for Unicode™. The terminal connections from remote hosts (such as Windows) must communicate with GT.M in UTF-8 encoding.

Error messages are in [American] English

GT.M has no facility for translating product error messages or on-line help into languages other than [American] English. All error message text (except message arguments that could include Unicode™ data) is in the [American] English language.

Performance and Capacity

With the use of "UTF-8" as GT.M’s internal character encoding, the additional requirements for CPU cycles, excluding collation algorithms, should not increase significantly compared with the identical application using the "M" character set. Additional memory requirements for "UTF-8" vary depending on the application as well as the actual character set used. For example, applications based on Latin-1 (2-byte encoded) characters may require up to twice the memory and those based on Chinese/Japanese (3-byte encoded) characters may require up to three times the memory compared to an identical application using "M" characters. The additional disk-space and I/O performance trade-offs for "UTF-8" also vary based on the application and the characters used.

Characters in arguments exchanged with external routines must be validated by the external routines

GT.M does not check for illegal characters in a string before passing it to an external routine or in a returned value before assigning it to a GT.M variable. This is because such checks add parameter-processing overhead. The application must ensure that the strings are in the encoding form expected by the respective routines. More robustly, external routines must interpret passed strings based on the value of the intrinsic variable $ZCHSET or the environment variable gtm_chset. The external routines can perform validation if needed.

Maximum limits

In prior versions of GT.M, the restrictions on certain objects assumed that a character is represented by a single byte. With support for Unicode™ enabled in GT.M, the following restrictions are expressed in bytes, not characters.

M name length

The maximum length of an M identifier is restricted to 31 bytes. Since identifier names are restricted to be in ASCII, programmers can define M names up to 31 characters long.

M string length

The maximum length of an M string is restricted to 1,048,576 bytes. Therefore, depending on the characters used, the maximum number of characters could be reduced from 1,048,576 (1M) characters to as few as 262,144 (256K) characters.
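The arithmetic behind these figures can be checked with a short sketch (Python, used only for illustration). It assumes a string in which every character needs the same number of UTF-8 bytes; the names `MAX_STRING_BYTES` and `max_characters` are hypothetical.

```python
MAX_STRING_BYTES = 1_048_576  # M string limit stated above


def max_characters(bytes_per_char: int) -> int:
    """Characters that fit if each one needs bytes_per_char bytes in UTF-8."""
    if not 1 <= bytes_per_char <= 4:
        raise ValueError("UTF-8 uses 1 to 4 bytes per character")
    return MAX_STRING_BYTES // bytes_per_char


for n in range(1, 5):
    print(f"{n} byte(s) per character: {max_characters(n)} characters")
```

Real strings usually mix character widths, so the actual capacity falls somewhere between the 1-byte and 4-byte cases.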

M source line length

The maximum length of a program or indirect source line is restricted to 2,048 bytes. Application developers must be aware of this byte limit if they consider using multi-byte source comments or string literals in a source line.

Database key and record size

The maximum allowed size for database keys (both global and nref keys) is 255 bytes, and for database records is 32K bytes. Application developers must be aware that keys or data containing multi-byte Unicode™ characters are limited to fewer characters than the number of available bytes.
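One way an application might keep a value within a byte budget without splitting a multi-byte character is sketched below in Python, for illustration only. The helper `truncate_utf8` is hypothetical, and the budget passed to it is the caller's choice. This document does not describe how a key is laid out in bytes, so the sketch does not try to compute how much of the 255-byte limit is available to a single subscript.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut at the byte limit, then drop any partial trailing character.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("\u65e5\u672c\u8a9e", 7))  # 3-byte characters, 7-byte budget
```

Truncating keys silently can make distinct values collide, so an application may prefer to reject over-long input instead. The sketch only shows where the character boundary falls.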

Ten rules of thumb

Adhere to the following rules of thumb to design and develop Unicode™-based applications for deployment on GT.M.

GT.M functionality related to Unicode™ becomes available only in UTF-8 mode.
[At least] in UTF-8 mode, byte manipulation must use Z* equivalent functions.
In M mode, standard functions are always identical to their Z equivalents.
Use the same character set for all global names and subscripts in an instance.
Define a collation system according to the linguistic and cultural tenets of the language used.
Create the application logic to ensure strings used as keys are canonical.
Specify CHSET="M" or otherwise handle illegal characters during I/O operations.
Communicate with any external routines using a compatible character encoding form.
Compile and run programs in the same setting of $ZCHSET and "BADCHAR".
Read this technical bulletin and the GT.M Programmer's Guide carefully. When in doubt, consult GT.M Support (gtm.support@fnf.com).
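The rule above about canonical keys can be illustrated with a short sketch. It assumes that "canonical" refers to Unicode normalization and that NFC is the form the application has chosen; both are assumptions, and the helper name `canonical_key` is hypothetical. Python is used only for illustration.

```python
import unicodedata


def canonical_key(text: str) -> str:
    """Normalize text to NFC so equivalent spellings map to one key."""
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two spellings look the same on screen but differ byte for byte,
# so without normalization they would be stored as two different keys.
print(composed.encode("utf-8") == decomposed.encode("utf-8"))
print(canonical_key(composed) == canonical_key(decomposed))
```

Applying the same normalization at every point where a key is built keeps lookups consistent regardless of how the input was typed or transmitted.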