Hi,
thank you for the handy utility. Unfortunately, the default assumption that the input is UTF-8 does not work for a file I received, which presumably contains mixed encodings. I have two suggestions for improvement:
- Please update the --encoding help text to explain that it specifies the INPUT encoding:
$ unidecode --help
usage: unidecode [-h] [-e ENCODING] [-c TEXT] [FILE]
Transliterate Unicode text into ASCII. FILE is path to file to transliterate. Standard input is used if FILE is omitted and -c is not specified.
positional arguments:
FILE
options:
-h, --help show this help message and exit
-e, --encoding ENCODING
Specify an encoding (default is utf-8)
-c TEXT Transliterate TEXT instead of FILE
$
- Report the problematic characters when decoding fails. Currently the message gives only offsets:
$ unidecode -e utf-16 spikenuc1207.fasta > spikenuc1207.unidecode.fasta
Unable to decode input line 0: truncated data, start: 150, end: 151
$
The file contains some Polish and French characters, so I tried iso-8859-2, which at least did not fail:
$ unidecode -e iso-8859-2 spikenuc1207.fasta > spikenuc1207.unidecode.fasta
Some other junk is still left in, such as \u0003, \u00ed, and \u00e9.
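
For the second suggestion, here is a rough sketch of the kind of report I have in mind. It is not based on unidecode's actual code; the function name `describe_decode_error` and the `context` parameter are made up for illustration. It relies only on the `reason`, `start`, and `end` attributes that Python's `UnicodeDecodeError` already carries, and adds the offending bytes in hex plus a few bytes of preceding context:

```python
def describe_decode_error(data: bytes, encoding: str, context: int = 8) -> str:
    """Return a message naming the offending bytes, or '' if data decodes cleanly.

    Hypothetical helper for illustration only; not part of unidecode.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end are byte offsets into the input.
        bad = data[exc.start:exc.end]
        before = data[max(0, exc.start - context):exc.start]
        return (
            f"Unable to decode input as {encoding}: {exc.reason}, "
            f"offset {exc.start}-{exc.end}, bytes {bad.hex(' ')}, "
            f"preceded by {before!r}"
        )
    return ""


if __name__ == "__main__":
    # A lone Latin-1 byte (0xE9) embedded in otherwise plain ASCII text.
    print(describe_decode_error(b"abc\xe9def", "utf-8"))
```

Seeing the actual byte values would make it much easier to guess which encoding a stray character came from.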