Removing characters with sed
I am working on AIX and trying to remove non-printable characters from a file. The data looks like this when I view the file in Notepad++ using UTF-8 encoding:
Caucasian male lives in Arizona w/ fiancÃÂÃÂÃÂÃÂÃÂ
When I view the file on Unix, I get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ instead of the special characters.
I want to replace all those special characters with a space.
I tried
sed 's/[^[:print:]]/ /g' file
but it does not remove those characters. These are the locales listed when I run locale -a:
C
POSIX
en_US.8859-15
en_US.ISO8859-1
en_US
I also tried
sed -e 's/[^ -~]/ /g' file
and it did not remove the characters either.
I see that other Stack Overflow answers used a UTF-8 locale with GNU sed and that worked, but I do not have that locale.
Also, I am using ksh.
Possible duplicate: unix.stackexchange.com/questions/201751/…
– Goro
4 hours ago
@Goro Yes, at this point it is possibly a duplicate, now that I understand that I should use the C locale.
– Auguster
4 hours ago
To show what the characters actually are, it is useful to show their hex values. Something like:
echo "fiancÃÂÃÂÃÂÃÂÃÂ" | od -tx1
or, if your sed supports it:
echo "fiancÃÂÃÂÃÂÃÂÃÂ" | sed -n l
– Isaac
3 hours ago
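As a rough illustration of the hex-dump suggestion, here is a minimal sketch. The sample bytes are an assumption (the word "fianc" followed by the two bytes c3 83, which is how UTF-8 encodes Ã); the real file may well contain something else, which is exactly what the dump would reveal.

```shell
# Hypothetical sample, not the asker's real data:
# "fianc" followed by bytes 0xc3 0x83 (octal 303 203) and a newline.
sample() {
    printf 'fianc\303\203\n'
}

# Dump every byte in hex; -An drops the offset column.
sample | od -An -tx1
```

Running the same od command on the real file (od -An -tx1 < file) would show which bytes actually need to be removed.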
Possible duplicate of Match language range in shell, sed or awk
– Isaac
3 hours ago
2 Answers
You can use the command tr as follows:
tr -cd '[:print:]\t\r\n'
Explanation:
[:print:] -- any printable character: every character in the [:graph:] class plus the space character
\r -- carriage return
\t -- horizontal tab
\n -- newline
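To make the effect of that character set concrete, here is a small sketch using only ASCII input, so the locale should not matter. The helper name keep_printable and the sample string are made up for illustration.

```shell
# Hypothetical helper: delete (-d) everything in the complement (-c) of the
# set "printable characters, tab, carriage return, newline".
keep_printable() {
    tr -cd '[:print:]\t\r\n'
}

# Sample: "a", tab, "b", BEL (octal 007), "c", CR, LF.
# Only the BEL control character should be dropped.
printf 'a\tb\007c\r\n' | keep_printable | od -An -c
```

The tab, carriage return and newline survive because they are listed explicitly; without them, tr -cd '[:print:]' would also strip the line endings.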
Examples (based on CentOS 7, where tr is GNU tr and the encoding is UTF-8):
$ echo "fiancÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n'
fianc
$ echo "get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ " | tr -cd '[:print:]\t\r\n'
get ^^^^^^
$ echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒" | tr -cd '[:print:]\t\r\n'
Caucasian male lives in Arizona w/ fianc^^^^^^^^^^^^
That did not work for me. I tried
echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒" | tr -d '[:print:]'
and got some unreadable text as output.
– Auguster
5 hours ago
LC_ALL=C tr ...
– Jeff Schaller
5 hours ago
LC_ALL=C tr -cd '[:print:]' < input
works here.
– Jeff Schaller
5 hours ago
echo "fiancÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n'
should return fiancÃÂÃÂÃÂÃÂÃÂ, as Â is a printable character. GNU tr doesn't do that in a UTF-8 locale, as it doesn't support multi-byte characters yet, but it does in iso8859-1. In the C locale, on systems where the C locale charset is ASCII, that does remove Â (or whatever bytes those are made of), as ASCII has no such character in the first place.
– Stéphane Chazelas
2 hours ago
Because CentOS tr is GNU tr, and you probably tried it in a UTF-8 locale, where Ã is made of 2 bytes, and GNU tr doesn't support multibyte characters. If you use LC_ALL=C as suggested by Auguster, it will work (at removing those Ã, however they're encoded) regardless of whether tr supports multibyte characters or not. In the C locale, all characters are single bytes, and on most systems including AIX, the C locale charset is ASCII, which has no character with the 8th bit set (and every byte of the UTF-8 encoding of Ã has that bit set, as does its single-byte iso8859-1 encoding).
– Stéphane Chazelas
2 hours ago
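The C-locale approach described in these comments could be sketched as follows. This assumes, as the comment says is the case on most systems, that the C locale charset is ASCII; the helper name strip_nonprint and the sample bytes are hypothetical.

```shell
# Hypothetical helper: in the C locale every byte is one character, and
# bytes with the 8th bit set are not printable, so tr deletes them.
# Tab, carriage return and newline are kept explicitly.
strip_nonprint() {
    LC_ALL=C tr -cd '[:print:]\t\r\n'
}

# Sample built from octal escapes: "fianc" + bytes c3 83 c3 82 + newline.
printf 'fianc\303\203\303\202\n' | strip_nonprint
```

On a real file this would be used as strip_nonprint < file > file.clean, where file.clean is just a placeholder name for the output.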
If the current locale already uses UTF-8 as the charset (and the file is written using that charset):
<file LC_ALL=C sed 's/[^ -~]//g'
Or, to also keep the tab and carriage return control characters with AIX sed:
<file LC_ALL=C sed "$(printf "s/[^[:print:]\t\r]//g")"
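The commands above delete the offending bytes, whereas the question asked for them to be replaced with a space. A minimal variation along the same lines might look like this; the helper name blank_nonprint and the sample input are assumptions for illustration only.

```shell
# Hypothetical helper: in the C locale, replace every byte outside the
# printable ASCII range (space through tilde) with a single space.
blank_nonprint() {
    LC_ALL=C sed 's/[^ -~]/ /g'
}

# Sample: "fianc" + bytes c3 83 + " x"; each of the two bytes becomes a space.
printf 'fianc\303\203 x\n' | blank_nonprint
```

Note that each byte is replaced separately, so a two-byte sequence turns into two spaces, and tabs are turned into spaces as well because they fall outside the space-to-tilde range.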
Ã and ▒ look pretty printable to me. A UTF-8 Ã is encoded as 0xc3 0x83. 0xc3 in iso8859-1 or 15 is also Ã, as it happens, which is printable; 0x83 would be a control character in both, though.
– Stéphane Chazelas
5 hours ago