Command-line Unicode character info tool
pantonshire c274ba6f01 🐛 only show consumed bad bytes for invalid characters
Previously, the bytes displayed for invalid characters included bytes
from the byte stream that were peeked rather than consumed. This
resulted in certain bytes being displayed multiple times, since the
peeked byte could appear in the following character.

For example, `printf '\xce\x61' | utfdump_bin` would result in the byte
0x61 being displayed twice, once at the end of the invalid character
(which begins with 0xce) and once as the valid character `a`.

This patch modifies `utfdump::utf8::Utf8Error` so it also stores the
number of consumed bad bytes, enabling the binary to output only the
consumed bad bytes.
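
The idea behind the patch can be illustrated with a minimal sketch. This is not the code in `utfdump::utf8`; the names `DecodeError` and `decode_one` are hypothetical, and the real `Utf8Error` may be laid out differently. It only shows how a decoder can peek at the next byte without consuming it, and report how many bad bytes it actually consumed.

```rust
/// Hypothetical error type (not the crate's `Utf8Error`): records how many
/// bytes were consumed before the sequence turned out to be invalid.
#[derive(Debug, PartialEq, Eq)]
pub struct DecodeError {
    pub consumed: usize,
}

/// Decode one character from the front of `input`.
/// Returns `None` on empty input, `Ok((char, encoded_len))` on success, or
/// `Err` carrying the number of bad bytes that were consumed.
pub fn decode_one(input: &[u8]) -> Option<Result<(char, usize), DecodeError>> {
    let first = *input.first()?;
    // The lead byte determines the expected length and the initial payload bits.
    let (len, mut code) = match first {
        0x00..=0x7f => return Some(Ok((first as char, 1))),
        0xc0..=0xdf => (2, (first & 0x1f) as u32),
        0xe0..=0xef => (3, (first & 0x0f) as u32),
        0xf0..=0xf7 => (4, (first & 0x07) as u32),
        // Stray continuation byte or invalid lead byte: one bad byte.
        _ => return Some(Err(DecodeError { consumed: 1 })),
    };
    let mut consumed = 1;
    while consumed < len {
        match input.get(consumed) {
            // Peek at the next byte; consume it only if it is a continuation byte.
            Some(&b) if b & 0xc0 == 0x80 => {
                code = (code << 6) | (b & 0x3f) as u32;
                consumed += 1;
            }
            // The peeked byte is left in the stream for the next character.
            _ => return Some(Err(DecodeError { consumed })),
        }
    }
    // Reject overlong encodings, surrogates, and out-of-range values.
    let min = match len {
        2 => 0x80,
        3 => 0x800,
        _ => 0x10000,
    };
    match char::from_u32(code) {
        Some(c) if code >= min => Some(Ok((c, len))),
        _ => Some(Err(DecodeError { consumed })),
    }
}
```

With the input `[0xce, 0x61]`, this sketch reports one consumed bad byte (0xce), so a caller that advances by `consumed` then decodes 0x61 as `a` exactly once.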
3 years ago
bin                      🐛 only show consumed bad bytes for invalid characters   3 years ago
lib                      🐛 only show consumed bad bytes for invalid characters   3 years ago
.gitignore               🎉 start new encoder for unicode data                    3 years ago
Cargo.lock               update Cargo.lock                                        3 years ago
Cargo.toml               remove core                                              3 years ago
LICENSE                  add LICENSE file                                         3 years ago
README.md                update README to include link to releases page           3 years ago
data.py                  retrieve data from unicode.org in data.py                3 years ago
unicode_data_encoded.gz  retrieve data from unicode.org in data.py                3 years ago

README.md

utfdump(1)

Display information about UTF-8 characters read from stdin.

Examples

$ printf '¿Cómo estás?' | utfdump
┌───┬────────┬───────────┬────────────────────────┬──────────┬───────────┐
│   │ Code   │ UTF-8     │ Name                   │ Category │ Combining │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ ¿ │ U+00bf │ 0xc2 0xbf │ INVERTED QUESTION MARK │ Po       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ C │ U+0043 │ 0x43      │ LATIN CAPITAL LETTER C │ Lu       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ o │ U+006f │ 0x6f      │ LATIN SMALL LETTER O   │ Ll       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ ◌́ │ U+0301 │ 0xcc 0x81 │ COMBINING ACUTE ACCENT │ Mn       │ 230       │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ m │ U+006d │ 0x6d      │ LATIN SMALL LETTER M   │ Ll       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ o │ U+006f │ 0x6f      │ LATIN SMALL LETTER O   │ Ll       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│   │ U+0020 │ 0x20      │ SPACE                  │ Zs       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ e │ U+0065 │ 0x65      │ LATIN SMALL LETTER E   │ Ll       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ s │ U+0073 │ 0x73      │ LATIN SMALL LETTER S   │ Ll       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ t │ U+0074 │ 0x74      │ LATIN SMALL LETTER T   │ Ll       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ a │ U+0061 │ 0x61      │ LATIN SMALL LETTER A   │ Ll       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ ◌́ │ U+0301 │ 0xcc 0x81 │ COMBINING ACUTE ACCENT │ Mn       │ 230       │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ s │ U+0073 │ 0x73      │ LATIN SMALL LETTER S   │ Ll       │ 0         │
├───┼────────┼───────────┼────────────────────────┼──────────┼───────────┤
│ ? │ U+003f │ 0x3f      │ QUESTION MARK          │ Po       │ 0         │
└───┴────────┴───────────┴────────────────────────┴──────────┴───────────┘
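
As a rough illustration of how the Code and UTF-8 columns relate to each character, the following sketch formats both using only the Rust standard library. The function names `code_column` and `utf8_column` are hypothetical and are not taken from utfdump's source; the output format is simply copied from the table above.

```rust
/// Format a code point the way the "Code" column shows it, e.g. "U+00bf".
/// (Hypothetical helper; lowercase hex padded to at least four digits.)
pub fn code_column(c: char) -> String {
    format!("U+{:04x}", c as u32)
}

/// Format the UTF-8 encoding of a character the way the "UTF-8" column
/// shows it, e.g. "0xc2 0xbf". (Hypothetical helper.)
pub fn utf8_column(c: char) -> String {
    // A char needs at most four bytes in UTF-8.
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf)
        .bytes()
        .map(|b| format!("0x{:02x}", b))
        .collect::<Vec<_>>()
        .join(" ")
}
```

For instance, `utf8_column('\u{301}')` gives `0xcc 0x81`, matching the COMBINING ACUTE ACCENT rows.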

Usage

utfdump reads its input from stdin and writes its output to stdout. The input is assumed to be UTF-8 encoded.

Arguments:

Short  Long                    Effect
-f     --full-category-names   Display category names in plain English, rather than using their abbreviated names
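
To give a sense of what the abbreviations in the Category column stand for, here is a small lookup sketch covering only the categories that appear in the example above. The long names follow the Unicode general category names; the exact strings utfdump prints with `-f` are not shown in this README, so treat the wording and the function name `full_category_name` as assumptions.

```rust
/// Map a two-letter Unicode general category abbreviation to a long name.
/// Only the categories from the example table are covered; the wording is
/// an assumption and may differ from utfdump's actual `-f` output.
pub fn full_category_name(abbr: &str) -> Option<&'static str> {
    match abbr {
        "Lu" => Some("Uppercase Letter"),
        "Ll" => Some("Lowercase Letter"),
        "Mn" => Some("Nonspacing Mark"),
        "Po" => Some("Other Punctuation"),
        "Zs" => Some("Space Separator"),
        // Any other abbreviation is outside this sketch's table.
        _ => None,
    }
}
```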

Download

Pre-built binaries are available on the GitHub releases page.