Removing characters with sed




I am working on AIX and trying to remove non-printable characters from a file. When I view the file in Notepad++ using UTF-8 encoding, the data looks like Caucasian male lives in Arizona w/ fiancÃÂÃÂÃÂÃÂÃÂ. When I view the file on the Unix host, I get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ instead of the special characters.





I want to replace each of those special characters with a space.



I tried sed 's/[^[:print:]]/ /g' file, but it does not remove those characters. The locales available on the system, as listed by locale -a, are:




C
POSIX
en_US.8859-15
en_US.ISO8859-1
en_US



I even tried sed -e 's/[^ -~]/ /g' file and it did not remove the characters.





I see that other Stack Overflow answers used a UTF-8 locale with GNU sed and that this worked, but I do not have such a locale.





Also I am using ksh.







Ã and Â look pretty printable to me. A UTF-8 Ã is encoded as 0xc3 0x83. As it happens, 0xc3 in iso8859-1 or iso8859-15 is also Ã, which is printable; 0x83 would be a control character in both, though.
– Stéphane Chazelas
5 hours ago







Possible duplicate: unix.stackexchange.com/questions/201751/…
– Goro
4 hours ago






@Goro Yes, at this point it is possibly a duplicate, now that I understand that I should use the C locale.
– Auguster
4 hours ago





To actually show what the characters are, it is useful to show their hex values. Something like: echo "fiancÃÂÃÂÃÂÃÂÃÂ" | od -tx1 or, if your sed supports it, echo "fiancÃÂÃÂÃÂÃÂÃÂ" | sed -n l.
– Isaac
3 hours ago
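
A minimal sketch of the hex-dump check suggested above. The sample bytes are an assumption: they are the UTF-8 encoding of Ã (0xc3 0x83) followed by Â (0xc3 0x82), written as octal escapes so that the example does not depend on the terminal's encoding. The real file may well contain different bytes.

```shell
# Assumed sample: "fianc" + UTF-8 for Ã (c3 83) + UTF-8 for Â (c3 82) + newline.
# -An drops the offset column; -tx1 prints each byte as two hex digits.
printf 'fianc\303\203\303\202\n' | od -An -tx1
```

The dump should list the bytes 66 69 61 6e 63 c3 83 c3 82 0a. Every byte at 0x80 or above falls outside ASCII.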







Possible duplicate of Match language range in shell, sed or awk
– Isaac
3 hours ago




2 Answers



You can use the command tr as follows:




tr -cd '[:print:]\t\r\n'



Explanation:


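-c -- complement the set, so the listed characters are the ones that are kept
-d -- delete the selected characters (here, everything outside the set)
[:print:] -- printable characters: everything in the [:graph:] class plus the space character
\t -- horizontal tab
\r -- carriage return
\n -- newline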



Examples on CentOS 7, where tr is GNU tr and the locale uses UTF-8 encoding:


$ echo "fiancÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n'
fianc

$ echo "get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ " | tr -cd '[:print:]\t\r\n'
get ^^^^^^

$ echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒" | tr -cd '[:print:]\t\r\n'
Caucasian male lives in Arizona w/ fianc^^^^^^^^^^^^





That did not work for me. I tried echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒" | tr -d '[:print:]' and got some unreadable text as output.
– Auguster
5 hours ago







LC_ALL=C tr ...
– Jeff Schaller
5 hours ago







LC_ALL=C tr -cd '[:print:]' < input works here
– Jeff Schaller
5 hours ago







echo "fiancÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n' should return fiancÃÂÃÂÃÂÃÂÃÂ, as Â is a printable character. GNU tr doesn't do that in UTF-8, as it doesn't support multi-byte characters yet, but it does in iso8859-1. In the C locale, on systems where the C locale charset is ASCII, that does remove Â (or whatever bytes those are made of), as ASCII has no such character in the first place.
– Stéphane Chazelas
2 hours ago








Because CentOS tr is GNU tr, and you probably tried it in a UTF-8 locale, where Ã is made of 2 bytes and GNU tr doesn't support multibyte characters. If you use LC_ALL=C as suggested by Auguster, it will work (at removing those Ã, however they're encoded) regardless of whether tr supports multibyte characters or not. In the C locale, all characters are single bytes, and on most systems including AIX, the C locale charset is ASCII, which has no character with the 8th bit set (which each byte of the UTF-8 encoding of Ã has, as does its single-byte iso8859-1 encoding).
– Stéphane Chazelas
2 hours ago
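
The point about the C locale can be sketched as follows. The function name keep_printable_ascii and the sample bytes (the UTF-8 encoding of Ã followed by Â) are assumptions for illustration. The sketch also assumes that the C locale's charset is ASCII, which the comment above says is the case on most systems including AIX.

```shell
# Hypothetical helper: delete every byte that is not printable ASCII,
# a tab, a carriage return, or a newline.
# LC_ALL=C makes tr treat each byte as one character.
keep_printable_ascii() {
    LC_ALL=C tr -cd '[:print:]\t\r\n'
}

# Assumed sample: "fianc" followed by the bytes c3 83 c3 82 and a newline.
printf 'fianc\303\203\303\202\n' | keep_printable_ascii
```

With an ASCII C locale this prints fianc, because all four trailing bytes have the 8th bit set and so are not in [:print:].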






If the current locale already uses UTF-8 as the charset (and the file is written using that charset):


<file LC_ALL=C sed 's/[^ -~]//g'



Or, to include control characters in AIX sed:


<file LC_ALL=C sed "$(printf "s/[^[:print:]\t\r]//g")"
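
The question asks for a space in place of each unwanted character rather than deletion, so the first command can be adapted as in the sketch below. The function name blank_non_ascii and the sample bytes are assumptions for illustration. Note that the range from space to ~ also excludes tabs, so tabs are turned into spaces as well.

```shell
# Hypothetical helper: replace each byte outside printable ASCII
# (0x20 to 0x7e) with a single space. In the C locale every byte is
# one character, so each byte of a multibyte sequence gets its own space.
blank_non_ascii() {
    LC_ALL=C sed 's/[^ -~]/ /g'
}

# Assumed sample: "fianc", then the bytes c3 83 c3 82, then " end".
printf 'fianc\303\203\303\202 end\n' | blank_non_ascii
```

Four bytes are replaced here, so the output has five spaces between fianc and end. To process a file, redirect it into the function: blank_non_ascii <file.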





