eci: Add support for all ECIs (Big5, Korean, UCS-2BE)

2024-11-16 20:57:25 +13:00 · 2021-01-11 18:11:41 +00:00
parent 9795049322
commit 7fe930b4dc
53 changed files with 51324 additions and 907 deletions
--- a/docs/manual.txt
+++ b/docs/manual.txt
@@ -196,7 +196,7 @@ output file will be out.gif.

 The data input to Zint is assumed to be encoded in Unicode (UTF-8) format. If
 you are encoding characters beyond the 7-bit ASCII set using a scheme other than
-Unicode then you will need to set the appropriate input options as shown in
+UTF-8 then you will need to set the appropriate input options as shown in
 section 4.11 below.

 Non-printing characters can be entered on the command line using the backslash
@@ -449,11 +449,11 @@ example for PNG images a scale of 5 will increase the X-dimension to 10 pixels.
 4.10 Input modes
 ----------------
 By default all input data is assumed to be encoded in Unicode (UTF-8) format.
-Many barcode symbologies encode data using Latin-1 (ISO-8859-1) character
-encoding, so input is converted from Unicode to Latin-1 before being put in the
+Many barcode symbologies encode data using Latin-1 (ISO/IEC 8859-1) character
+encoding, so input is converted from UTF-8 to Latin-1 before being put in the
 symbol. In addition QR Code, Micro QR Code, Rectangular Micro QR Code, Han Xin
 Code and Grid Matrix can encode Japanese or Chinese characters which are also
-converted from Unicode. If Zint encounters characters which can not be encoded
+converted from UTF-8. If Zint encounters characters which cannot be encoded
 using the default character encoding then it will take advantage of the ECI
 (Extended Channel Interpretations) mechanism to encode the data. Be aware that
 not all barcode readers support ECI mode, so this can sometimes lead to
@@ -476,8 +476,8 @@ Identification Code (HIBC LIC). For HIBC Provider Applications Standard
 (HIBC PAS), preface the data with a slash "/".

 The --binary option encodes the input data as given. Automatic code page
-translations to ECI pages is disabled. This may be used for raw binary or binary
-encrypted data.
+translation to ECI pages is disabled, and no validation of the data's encoding
+takes place. This may be used for raw binary or binary encrypted data.
 This switch plays together with the built-in ECI logic and examples may
 be found in that section.

@@ -497,7 +497,7 @@ The ECI information is added to your code symbol as prefix data.
 The ECI value may be specified with the --eci switch, followed by the value in
 the column "ECI Code".
 The ECI value of 0 does not encode any ECI information in the code symbol. In
-this case, the default encoding applies for the data which is "ISO-8859-1 -
+this case, the default encoding applies for the data which is "ISO/IEC 8859-1 -
 Latin alphabet No. 1".

 The first row of the table (ECI code 3) is the default value and does not lead
@@ -505,65 +505,59 @@ to any ECI information being included in the symbol.

 The input data should be UTF-8 formatted. Zint automatically translates the
 data into the target encoding.
-The rows marked with a star (*) do not do this transformation. The data must be
-specified as binary data (--binary switch) with the data in the encoding given
-by the "Character Encoding Scheme" column.
-The row marked with a double star (**) only does this transformation for QR
-Code, Micro QR Code and Rectangular Micro QR Code.
-The row marked with a triple star (***) only does this transformation for Han
-Xin Code and Grid Matrix. Han Xin Code can encode GB 18030. Grid Matrix can
-encode the subset GB 2312.
+
+The row marked with a star (*) translates GB 2312 codepoints, except when using
+Han Xin Code, which translates GB 18030 codepoints, a superset of GB 2312.

 Note: the "--eci 3" specification should only be used for special purposes.
 Using this parameter, the ECI information is explicitly added to the code
 symbol. Nevertheless, for ECI Code 3, this is not required, as this is the
 default encoding, which is also active without any ECI information.

--------------------------------------------------------
+------------------------------------------------------------
 ECI Code  |  Character Encoding Scheme
--------------------------------------------------------
-3         |  ISO-8859-1 - Latin alphabet No. 1
-4         |  ISO-8859-2 - Latin alphabet No. 2
-5         |  ISO-8859-3 - Latin alphabet No. 3
-6         |  ISO-8859-4 - Latin alphabet No. 4
-7         |  ISO-8859-5 - Latin/Cyrillic alphabet
-8         |  ISO-8859-6 - Latin/Arabic alphabet
-9         |  ISO-8859-7 - Latin/Greek alphabet
-10        |  ISO-8859-8 - Latin/Hebrew alphabet
-11        |  ISO-8859-9 - Latin alphabet No. 5
-12        |  ISO-8859-10 - Latin alphabet No. 6
-13        |  ISO-8859-11 - Latin/Thai alphabet
-15        |  ISO-8859-13 - Latin alphabet No. 7
-16        |  ISO-8859-14 - Latin alphabet No. 8 (Celtic)
-17        |  ISO-8859-15 - Latin alphabet No. 9
-18        |  ISO-8859-16 - Latin alphabet No. 10
-20 **     |  Shift-JIS (JISX 0208 amd JISX 0201)
+------------------------------------------------------------
+3         |  ISO/IEC 8859-1 - Latin alphabet No. 1
+4         |  ISO/IEC 8859-2 - Latin alphabet No. 2
+5         |  ISO/IEC 8859-3 - Latin alphabet No. 3
+6         |  ISO/IEC 8859-4 - Latin alphabet No. 4
+7         |  ISO/IEC 8859-5 - Latin/Cyrillic alphabet
+8         |  ISO/IEC 8859-6 - Latin/Arabic alphabet
+9         |  ISO/IEC 8859-7 - Latin/Greek alphabet
+10        |  ISO/IEC 8859-8 - Latin/Hebrew alphabet
+11        |  ISO/IEC 8859-9 - Latin alphabet No. 5 (Turkish)
+12        |  ISO/IEC 8859-10 - Latin alphabet No. 6 (Nordic)
+13        |  ISO/IEC 8859-11 - Latin/Thai alphabet
+15        |  ISO/IEC 8859-13 - Latin alphabet No. 7 (Baltic)
+16        |  ISO/IEC 8859-14 - Latin alphabet No. 8 (Celtic)
+17        |  ISO/IEC 8859-15 - Latin alphabet No. 9
+18        |  ISO/IEC 8859-16 - Latin alphabet No. 10
+20        |  Shift JIS (JIS X 0208 and JIS X 0201)
 21        |  Windows-1250 - Latin 2 (Central Europe)
 22        |  Windows-1251 - Cyrillic
 23        |  Windows-1252 - Latin 1
 24        |  Windows-1256 - Arabic
-25 *      |  UCS-2 Unicode (High order byte first)
-26        |  Unicode (UTF-8)
-27        |  ISO-646:1991 7-bit character set
-28 *      |  Big5 (Taiwan) Chinese Character Set
-29 ***    |  GB (PRC) Chinese Character Set
-30 *      |  Korean Character Set (KSX1001:1998)
--------------------------------------------------------
+25        |  UCS-2BE (High order byte first) (Unicode BMP)
+26        |  UTF-8 (Unicode)
+27        |  ISO/IEC 646:1991 7-bit character set (ASCII)
+28        |  Big5 (Taiwan) Chinese Character Set
+29 *      |  GB (PRC) Chinese Character Set
+30        |  Korean Character Set (KS X 1001:2002)
+899       |  8-bit binary data
+------------------------------------------------------------

 Three examples:
-Ex1: The Euro sign can be encoded in ISO-8859-15.
-The Euro sign has the ISO-8859-15 codepoint hex A4.
+Ex1: The Euro sign U+20AC can be encoded in ISO/IEC 8859-15.
+The Euro sign has the ISO/IEC 8859-15 codepoint hex A4.
 It is encoded in UTF-8 as the hex sequence: e2 82 ac
 Those 3 bytes are contained in the file "utf8euro.txt"
 This command will generate the corresponding code:

 zint.exe -b 71 --square --scale 10 --eci 17 -i utf8euro.txt

-Ex2: The Chinese character with Unicode codepoint hex 5E38 can be encoded in
-Big5 encoding. The Big5 ECI is marked in the upper table to require input data
-in Big5 instead of UTF-8. The Big5 representation of this character is the two
-hex bytes: 9C 75 (contained in the file big5char.txt).
-The generation command for Data Matrix is:
+Ex2: The Chinese character with Unicode codepoint U+5E38 can be encoded in Big5
+encoding. The Big5 representation of this character is the two hex bytes: B1 60
+(contained in the file big5char.txt). The generation command for Data Matrix is:

 zint -b 71 --square --scale 10 --eci 28 --binary -i big5char.txt
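
The byte values quoted in Ex1 can be checked independently of Zint. The
following is a minimal Python sketch, not part of Zint: the helper name
encode_for_eci is hypothetical, and it assumes only the codecs shipped with the
Python standard library.

```python
# Illustrative only: uses Python's standard codecs, not Zint itself.

def encode_for_eci(text: str, codec: str) -> bytes:
    """Return the bytes that `text` occupies in the given Python codec."""
    return text.encode(codec)

EURO = "\u20ac"  # Euro sign, U+20AC

utf8_bytes = encode_for_eci(EURO, "utf-8")         # input form (utf8euro.txt)
latin9_bytes = encode_for_eci(EURO, "iso8859-15")  # charset named by ECI 17

print(utf8_bytes.hex(" "))    # e2 82 ac
print(latin9_bytes.hex(" "))  # a4
```

Writing utf8_bytes to a file should give input equivalent to the utf8euro.txt
file described in Ex1.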

@@ -2062,8 +2056,8 @@ When using automatic symbol sizes you can force Zint to use square symbols
 (versions 1-24) at the command line by using the option --square and when
 using the API by setting the value option_3 = DM_SQUARE.

-Data Matrix Rectangular Extension (ISO/IEC21471) codes may be generated with the
-following values as before:
+Data Matrix Rectangular Extension (ISO/IEC 21471) codes may be generated with
+the following values as before:

 ---------------------
 Input  |  Symbol Size
@@ -2162,10 +2156,10 @@ Input  |  Symbol Size
 The maximum capacity of a (version 40) QR Code symbol is 7089 numeric digits,
 4296 alphanumeric characters or 2953 bytes of data. QR Code symbols can also be
 used to encode GS1 data. QR Code symbols can by default encode characters in
-the Latin-1 set and Kanji characters which are members of the Shift-JIS
+the Latin-1 set and Kanji characters which are members of the Shift JIS
 encoding scheme. In addition QR Code supports using other character sets using
 the ECI mechanism. Input should usually be entered as Unicode (UTF-8) with
-conversion to Shift-JIS being carried out by Zint. A separate symbology ID can
+conversion to Shift JIS being carried out by Zint. A separate symbology ID can
 be used to encode Health Industry Barcode (HIBC) data which adds a leading '+'
 character and a modulo-49 check digit to the encoded data.

@@ -2183,8 +2177,8 @@ ZINT_FULL_MULTIBYTE | (N + 1) << 8.
 -------------------------------
 A miniature version of the QR Code symbol for short messages. ECC levels can be
 selected as for QR Code (above). QR Code symbols can encode characters in the
-Latin-1 set and Kanji characters which are members of the Shift-JIS encoding
-scheme. Input should be entered as a UTF-8 stream with conversion to Shift-JIS
+Latin-1 set and Kanji characters which are members of the Shift JIS encoding
+scheme. Input should be entered as a UTF-8 stream with conversion to Shift JIS
 being carried out automatically by Zint. A preferred symbol size can be
 selected by using the --vers= option or by setting option_2 although the actual
 version used by Zint may be different if required by the input data. The table
@@ -2211,11 +2205,12 @@ ZINT_FULL_MULTIBYTE | (N + 1) << 8.
 6.6.4 Rectangular Micro QR Code (rMQR)
 --------------------------------------
 A rectangular version of QR Code. Like QR Code, rMQR supports encoding of GS1
-data, Latin-1 and Kanji characters in the Shift-JIS encoding scheme.
-It does not support other ISO 8859 character sets or Unicode. As with other
-symbologies data should be entered as UTF-8 with the conversion to Shift-JIS
-being handled by Zint. The amount of ECC codewords can be adjusted using
--secure=, however only ECC levels M and H are valid for this type of symbol.
+data, Latin-1 and Kanji characters in the Shift JIS encoding scheme. It does not
+support other ISO/IEC 8859 character sets or encodings. As with other
+symbologies data should be entered as UTF-8 with the conversion to Shift JIS
+being handled by Zint. The number of ECC codewords can be adjusted using the
+--secure= option (API option_1), however only ECC levels M and H are valid for
+this type of symbol.

 -------------------------------------------------------------------------
 Input  |  ECC Level    |  Error Correction Capacity  |  Recovery Capacity
@@ -2224,9 +2219,9 @@ Input  |  ECC Level    |  Error Correction Capacity  |  Recovery Capacity
 4      |  H            |  Approx 65% of symbol       |  Approx 30%
 -------------------------------------------------------------------------

-The preferred symbol sizes can be selected using the --vers= option as shown
-in the table below. Input values between 33 and 38 fix the height of the
-symbol while allowing Zint to determine the minimum symbol width.
+The preferred symbol sizes can be selected using the --vers= option (API
+option_2) as shown in the table below. Input values between 33 and 38 fix the
+height of the symbol while allowing Zint to determine the minimum symbol width.

 ---------------------------------
 Input  |  Version  |  Symbol Size
@@ -2279,12 +2274,13 @@ using the --fullmultibyte switch or by setting option_3 to ZINT_FULL_MULTIBYTE.
 ------------------------------------------------
 A variation of QR Code used by Združenje Bank Slovenije (Bank Association of
 Slovenia). The size, error correction level and ECI are set by Zint and do not
-need to be specified. UPNQR is unusual in that it uses ISO-8859-2 formatted
-data. Zint will accept UTF-8 data and convert it to ISO-8859-2, or if your data
-is already ISO-8859-2 formatted use the --binary switch or if using the API set
-symbol->input_mode = DATA MODE;
+need to be specified. UPNQR is unusual in that it uses ISO/IEC 8859-2 formatted
+data. Zint will accept UTF-8 data and convert it to ISO/IEC 8859-2. If your data
+is already ISO/IEC 8859-2 formatted, use the --binary switch or, if using the
+API, set symbol->input_mode = DATA_MODE;

-The following example creates a symbol from data saved as an ISO-8859-2 file:
+The following example creates a symbol from data saved as an ISO/IEC 8859-2
+file:

 zint -o upnqr.png -b 143 --border=5 --scale=3 --binary -i ./upn.txt
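
One way to prepare ISO/IEC 8859-2 input for the --binary switch is sketched
below in Python. This is an assumption-laden illustration, not part of Zint:
the helper name to_latin2 is hypothetical, and the sample string is made up
rather than real UPN payment data.

```python
# Illustrative only: prepares ISO/IEC 8859-2 bytes with Python's codecs.

def to_latin2(text: str) -> bytes:
    """Encode text as ISO/IEC 8859-2.

    Raises UnicodeEncodeError if a character has no ISO/IEC 8859-2 mapping.
    """
    return text.encode("iso8859-2")

sample = "\u010ca\u010dak"  # "Čačak", an arbitrary sample string
latin2 = to_latin2(sample)

print(latin2.hex(" "))  # c8 61 e8 61 6b
```

Writing the resulting bytes to a file such as upn.txt should give data that can
be passed with --binary as in the command above.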

@@ -2719,7 +2715,7 @@ are ignored.
 ================================
 7.1 License
 -----------
-Zint, libzint and Zint Barcode Studio are Copyright © 2020 Robin Stuart. All
+Zint, libzint and Zint Barcode Studio are Copyright © 2021 Robin Stuart. All
 historical versions are distributed under the GNU General Public License
 version 3 or later. Version 2.5 is released under a dual license: the encoding
 library is released under the BSD license whereas the GUI, Zint Barcode Studio,
@@ -3085,11 +3081,11 @@ E   |  SO    |  RS   |  .      |  >  |  N  |  ^  |  n  |  ~
 F   |  SI    |  US   |  /      |  ?  |  O  |  _  |  o  |  DEL
 -------------------------------------------------------------

-A.2 Latin Alphabet No 1 (ISO 8859-1)
------------------------------------
+A.2 Latin Alphabet No 1 (ISO/IEC 8859-1)
+----------------------------------------
 A common extension to the ASCII standard, Latin-1 is used to expand the range
 of Code 128, PDF417 and other symbols. Input strings should be in Unicode
-format
+(UTF-8) format.

 ------------------------------------------------------
 Hex |  8  |  9  |  A      |  B  |  C  |  D  |  E  |  F
@@ -3109,6 +3105,6 @@ B   |     |     |  «      |  »  |  Ë  |  Û  |  ë  |  û
 C   |     |     |  ¬      |  ¼  |  Ì  |  Ü  |  ì  |  ü
 D   |     |     |  SHY    |  ½  |  Í  |  Ý  |  í  |  ý
 E   |     |     |  ®      |  ¾  |  Î  |  Þ  |  î  |  þ
-F   |     |     |  ¯      |  ¿  |  Ï  |  ß  |  î  |  ÿ
+F   |     |     |  ¯      |  ¿  |  Ï  |  ß  |  ï  |  ÿ
 ------------------------------------------------------
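
The corrected cell in row F can be cross-checked with a short Python sketch.
It relies on the fact that ISO/IEC 8859-1 byte values coincide with Unicode
code points U+0000 to U+00FF. The helper name latin1_cell is hypothetical and
is not part of Zint.

```python
# Illustrative only: looks up a cell of the Latin-1 table by row and column.

def latin1_cell(row: int, col: int) -> str:
    """Return the character whose byte has high nibble `col`, low nibble `row`."""
    return bytes([(col << 4) | row]).decode("iso8859-1")

print(latin1_cell(0xF, 0xE))  # ï (byte EF)
print(latin1_cell(0xE, 0xE))  # î (byte EE)
```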