unidecode executable should report the unexpected characters on STDERR #112

@mmokrejs

Hi,
thank you for the handy utility. Unfortunately, the default assumption that the input is UTF-8 does not work for a file I received, which presumably contains mixed encodings. I have two suggestions for improvement:

  1. Please update the --encoding help text to explain it is about INPUT ENCODING:
$ unidecode  --help
usage: unidecode [-h] [-e ENCODING] [-c TEXT] [FILE]

Transliterate Unicode text into ASCII. FILE is path to file to transliterate. Standard input is used if FILE is omitted and -c is not specified.

positional arguments:
  FILE

options:
  -h, --help            show this help message and exit
  -e, --encoding ENCODING
                        Specify an encoding (default is utf-8)
  -c TEXT               Transliterate TEXT instead of FILE
$
  2. Report the problematic characters on STDERR:
$ unidecode -e utf-16 spikenuc1207.fasta > spikenuc1207.unidecode.fasta   
Unable to decode input line 0: truncated data, start: 150, end: 151
$
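
To illustrate the kind of report I have in mind, here is a minimal sketch that uses only the Python standard library. It is not how unidecode works internally; the function name `decode_reporting` and the error-handler name `report_to_stderr` are made up for this example. It decodes the input, prints the offending bytes with their offsets to STDERR, and substitutes U+FFFD so that decoding can continue past the first error.

```python
import codecs
import sys


def decode_reporting(data, encoding="utf-8", stream=sys.stderr):
    """Decode bytes, reporting every undecodable span on `stream`.

    Returns (text, problems), where problems is a list of
    (start_offset, offending_bytes, reason) tuples.
    """
    problems = []

    def handler(exc):
        # Only decoding errors are expected here; re-raise anything else.
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        problems.append((exc.start, bad, exc.reason))
        print(
            f"Unable to decode bytes {bad!r} at offset {exc.start}: {exc.reason}",
            file=stream,
        )
        # Substitute U+FFFD and resume decoding after the bad span.
        return ("\ufffd", exc.end)

    # "report_to_stderr" is a hypothetical handler name for this sketch.
    codecs.register_error("report_to_stderr", handler)
    text = data.decode(encoding, errors="report_to_stderr")
    return text, problems


if __name__ == "__main__":
    raw = sys.stdin.buffer.read()
    decoded, _ = decode_reporting(raw, "utf-8")
    sys.stdout.write(decoded)
```

With a report like this I could see which bytes are wrong instead of only the start and end offsets.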

There are some Polish and French characters in there. Using iso-8859-2 at least did not die:

$ unidecode -e iso-8859-2 spikenuc1207.fasta > spikenuc1207.unidecode.fasta

Some other junk is still left in, such as \u0003, \u00ed, and \u00e9.
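
For the leftover junk, a report along the following lines would help me locate it. This is only a sketch under my own assumptions: the function name `find_leftovers` is hypothetical, and "junk" is taken to mean any character that is neither printable ASCII nor a tab or carriage return. It lists each such character with its line and column on STDERR.

```python
import sys


def find_leftovers(text):
    """Return (line_no, col_no, char) for every suspicious character.

    A character counts as suspicious here if it is not printable ASCII
    (space through tilde) and not a tab or carriage return.
    """
    hits = []
    # Split on "\n" only, so that other control characters stay visible.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in "\t\r" or " " <= ch <= "~":
                continue
            hits.append((line_no, col_no, ch))
    return hits


def report_leftovers(text, stream=sys.stderr):
    """Print one line per suspicious character to `stream`."""
    for line_no, col_no, ch in find_leftovers(text):
        print(
            f"line {line_no}, column {col_no}: unexpected U+{ord(ch):04X}",
            file=stream,
        )


if __name__ == "__main__":
    report_leftovers(sys.stdin.read())
```

Run on the transliterated output, this would point at a control character such as U+0003, which is itself ASCII and therefore not something a transliteration to ASCII would be expected to remove.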
