How to handle weirdly combined websocket messages?







I'm connecting to an external websocket api using the node ws library (node 10.8.0 on Ubuntu 16.04). I've got a listener which simply parses the JSON and passes it to the callback:


this.ws.on('message', (rawdata) => {
  let data = null;
  try {
    data = JSON.parse(rawdata);
  } catch (e) {
    console.log('Failed parsing the following string as json: ' + rawdata);
    return;
  }
  mycallback(data);
});



I now receive errors in which the rawdata looks as follows (I formatted it and removed irrelevant content):



�~A
{
"id": 1,
etc..
�~�
{
"id": 2,
etc..


I then wondered: what are these characters? Seeing the structure, I initially thought that the first weird sign must be the opening bracket of an array ([) and the second one a comma (,), so that together they create an array of objects.



I then investigated the problem further by writing the rawdata to a file whenever a JSON parsing error is encountered. In an hour or so it saved about 1500 of these error files, which means this happens a lot. I cat'ed a couple of these files in the terminal, and uploaded an example below:
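For reference, a minimal sketch of that dump-to-file step (the file naming here is just an assumption):


const fs = require('fs');

this.ws.on('message', (rawdata) => {
  try {
    mycallback(JSON.parse(rawdata));
  } catch (e) {
    // Dump the offending payload to a timestamped file for later inspection.
    fs.writeFileSync('rawdata-error-' + Date.now() + '.txt', rawdata);
  }
});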




(screenshot: cat output of one of the saved rawdata error files)



A few things are interesting here:



I'm not very experienced with websockets, but could it be that my websocket somehow receives a stream of messages that it concatenates together, with these weird signs as separators, and then randomly cuts off the last message? Maybe because I'm getting a constant, very fast stream of messages?



Or could it be because of an error (or functionality) server side in that it combines those individual messages?



Does anybody know what's going on here? All tips are welcome!



[EDIT]



@bendataclear suggested interpreting it as utf8. So I did, and I pasted a screenshot of the results below. The first print is as-is, and the second one is interpreted as utf8. To me this doesn't look like anything. I could of course convert to utf8 and then split by those characters. Although the last message is always cut off, this would at least make some of the messages readable. Other ideas are still welcome though.



(screenshot: the raw output and the same output interpreted as utf8)
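For what it's worth, a rough sketch of that convert-then-split idea (assuming the separators decode to the U+FFFD replacement character):


this.ws.on('message', (rawdata) => {
  // Force a utf8 interpretation of whatever arrives (string or Buffer).
  const text = Buffer.isBuffer(rawdata) ? rawdata.toString('utf8') : rawdata;

  // Split on the replacement character and try to parse each piece on its own.
  for (const piece of text.split('\uFFFD')) {
    try {
      mycallback(JSON.parse(piece));
    } catch (e) {
      // The cut-off last message (and empty pieces) end up here.
    }
  }
});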





Since JSON is supposed to be encoded with one of the UTFs (8, 16, or 32), it's probably a good idea to decode the input properly. However, the characters expected at these positions all belong to the ASCII subset of UTF-8, so I doubt decoding would help you with this particular problem.
– lenz
Aug 6 at 10:42





The �~ character is "Replacement character" so if you're seeing this it's already too late to fix it. Can you try and convert to utf8 with a utf8 module (npm install utf8) then convert the string (utf8.encode(string))?
– bendataclear
Aug 8 at 13:45







@bendataclear - I tried and added the results to the question above. Does this give you any hints?
– kramer65
Aug 8 at 14:32





@kramer65 - It looks like this is coming through as some other encoding (binary?), are you using the standard node websocket client (require('websocket').client)?
– bendataclear
Aug 8 at 14:41







@bendataclear - No I'm using the ws library: github.com/websockets/ws . I was also thinking binary, but why? And what to do with it? I tried a split after a utf8 conversion based on that weird string, but to my surprise that doesn't seem to work. Any other ideas?
– kramer65
Aug 8 at 15:25




6 Answers



I read your question many times and tested and emulated your problem. I suggest you first set up a Docker image, because with a Docker image both of us, you and I, can have a consistent environment, so that when the websocket sends a message we both get the same rawdata.





Here, on my computer, I tried many times to make my system act like yours, but it failed every time.



Actually, by the time you access rawdata in the this.ws.on callback, it is already too late to fix anything. You should configure your server to send correctly encoded characters; every manipulation of rawdata causes your application to lose data. In fact, your server should send correct data in the first place, and it is strange to me, because the node ws library uses utf8 by default. Maybe the characters built on your server side are in another text encoding. Seriously, I suggest you read this question and this Medium article; these links can help you write your server with a config that sends utf8-encoded string data.





You should add an if condition so that only utf8 string data is passed through.
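A minimal sketch of such a filter, assuming the ws library delivers text frames as strings and binary frames as Buffers:


this.ws.on('message', (rawdata) => {
  // With ws, binary frames arrive as Buffers; only accept utf8 text frames here.
  if (typeof rawdata !== 'string') {
    console.log('Ignoring a non-text frame of ' + rawdata.length + ' bytes');
    return;
  }

  let data = null;
  try {
    data = JSON.parse(rawdata);
  } catch (e) {
    return;
  }
  mycallback(data);
});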





Hope my answer helps you.





Simultaneously with the issue there was also a 100% spike in CPU usage. In the end I fixed the code that spiked the CPU, and ever since, the problem hasn't occurred anymore. So I guess it had something to do with the CPU spiking, making ws combine and malform the messages. Thanks for your suggestions though!
– kramer65
Aug 19 at 10:54





Which CPU? Your client or your server?
– AmerllicA
Aug 19 at 12:02





The CPU of the Client. I'm not in control of the server. For that reason I can't really set up a testing Docker image either.
– kramer65
Aug 19 at 12:07





@kramer65 - OK, so I don't deserve the bounty reputation. My answer wasn't right; how can I give it back to you?
– AmerllicA
Aug 19 at 12:18



My assumption is that you're working only with English/ASCII characters and something probably messed up the stream. (NOTE: I am assuming there are no special characters.) If that's so, then I suggest you pass the entire JSON string into this function:




function cleanString(input) {
  var output = "";
  for (var i = 0; i < input.length; i++) {
    if (input.charCodeAt(i) <= 127) {
      output += input.charAt(i);
    }
  }
  console.log(output);
  return output;
}

// example
cleanString("�~�");



You can refer to How to remove invalid UTF-8 characters from a JavaScript string?



EDIT



From an article by the Internet Engineering Task Force (IETF):



A common class of security problems arises when sending text data using the wrong encoding. This protocol specifies that messages with a Text data type (as opposed to Binary or other types) contain UTF-8-encoded data. Although the length is still indicated and applications implementing this protocol should use the length to determine where the frame actually ends, sending data in an improper encoding [...]



The "Payload data" is text data encoded as UTF-8. Note that a particular text frame might include a partial UTF-8 sequence; however, the whole message MUST contain valid UTF-8. Invalid UTF-8 in reassembled messages is handled as described in Handling Errors in UTF-8-Encoded Data, which states that When an endpoint is to interpret a byte stream as UTF-8 but finds that the byte stream is not, in fact, a valid UTF-8 stream, that endpoint MUST _Fail the WebSocket Connection_. This rule applies both during the opening handshake and during subsequent data exchange.



I really believe that your error (or functionality) is coming from the server side, which combines your individual messages. So I suggest coming up with logic that ensures all your characters are properly UTF-8 encoded. You might also want to install npm install --save-optional utf-8-validate to efficiently check whether a message contains valid UTF-8, as required by the spec.
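A small sketch of how utf-8-validate could be plugged in here (assuming the message arrives as a string or a Buffer; the callback name is taken from the question):


const isValidUTF8 = require('utf-8-validate');

this.ws.on('message', (rawdata) => {
  const buf = Buffer.isBuffer(rawdata) ? rawdata : Buffer.from(rawdata);

  // Skip frames whose bytes are not valid UTF-8 instead of trying to parse garbage.
  if (!isValidUTF8(buf)) {
    console.log('Received a frame with invalid UTF-8, skipping it');
    return;
  }

  try {
    mycallback(JSON.parse(buf.toString('utf8')));
  } catch (e) {
    console.log('Valid UTF-8 but still not valid JSON: ' + buf.toString('utf8'));
  }
});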





You might also want to add an if condition to help you do some checks:




this.ws.on('message', (message) => {
  if (message.type === 'utf8') { // accept only text
    // handle the text message here
  }
});



I hope this helps.





removing some random bytes won't make the string valid JSON
– Reith
Aug 14 at 20:49





Definitely. The answers about changing the JSON are not correct, because removing or adding some characters in the callback function is too late. You should fix it on the server side.
– AmerllicA
Aug 15 at 11:51







The answer has been edited
– antzshrek
Aug 16 at 13:20



Based on what is visible in your screenshots, you are getting data from the BitMex Websocket API. On this assumption, I've tried to reproduce the problem by subscribing to some topics that don't require authentication (not the "position" topic that you're probably using), without success, so I don't have an explanation for the described behavior, even if I suspect it may be a bug in the websocket library, which effectively builds the data message by joining an array of "fragments".

It would be interesting to see the actual byte codes of the "strange" characters, rather than the "replacement characters" shown by the terminal console, so maybe running an "xxd" command on the output file and showing us the relevant portion could be useful (a small Node alternative is sketched after the snippet below).

That said, my first suggestion is to try disabling the "perMessageDeflate" option of the websocket library (which is enabled by default), in the following way:


const ws = new WebSocket('wss://www.bitmex.com/realtime?subscribe=orderBookL2', {
  perMessageDeflate: false
});
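As for inspecting the byte values mentioned above, a quick Node alternative to xxd could look like this (the file name is just a placeholder):


const fs = require('fs');

// Hex-dump one of the saved error files, 16 bytes per line (placeholder file name).
const buf = fs.readFileSync('rawdata-error-sample.txt');
for (let i = 0; i < buf.length; i += 16) {
  console.log(buf.slice(i, i + 16).toString('hex').match(/../g).join(' '));
}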



My second piece of advice is to consider using a different websocket library; for example, consider using the official Bitmex Websocket Adapters.





This is the only answer that actually relates to what the OP asks and suggests something (using the Bitmex websocket adapter) that could possibly be useful for the OP. For reproducing: could it be that the OP might be consuming a multiplexed stream without proper decoding?
– Reith
Aug 15 at 19:54




It seems your output contains some unexpected characters. If you have any spaces or find any special characters, please use Unicode to fill them in.



Here is the list of Unicode characters



This might help, I think.



Those characters are known as "REPLACEMENT CHARACTER" - used to replace an unknown, unrecognized or unrepresentable character.



From: https://en.wikipedia.org/wiki/Specials_(Unicode_block)



The replacement character � (often a black diamond with a white question mark or an empty square box) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character
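A quick way to confirm that this replacement has already happened (meaning the original bytes are gone) is to look for U+FFFD in the decoded string, for example:


this.ws.on('message', (rawdata) => {
  const text = rawdata.toString();

  // U+FFFD means the decoder already hit invalid UTF-8; the original bytes are lost.
  if (text.includes('\uFFFD')) {
    console.log('Message contains replacement characters (invalid UTF-8 on the wire)');
  }
});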



Checking section 8 of the WebSocket protocol, Error Handling:



8.1. Handling Errors in UTF-8 from the Server



When a client is to interpret a byte stream as UTF-8 but finds that the byte stream is not in fact a valid UTF-8 stream, then any bytes or sequences of bytes that are not valid UTF-8 sequences MUST be interpreted as a U+FFFD REPLACEMENT CHARACTER.



8.2. Handling Errors in UTF-8 from the Client



When a server is to interpret a byte stream as UTF-8 but finds that the byte stream is not in fact a valid UTF-8 stream, behavior is undefined. A server could close the connection, convert invalid byte sequences to U+FFFD REPLACEMENT CHARACTERs, store the data verbatim, or perform application-specific processing. Subprotocols layered on the WebSocket protocol might define specific behavior for servers.



How to deal with this depends on the implementation or library in use; for example, from this post Implementing Web Socket servers with Node.js:


socket.ondata = function (d, start, end) {
  // var data = d.toString('utf8', start, end);
  var original_data = d.toString('utf8', start, end);
  var data = original_data.split('\ufffd')[0].slice(1);
  if (data == "kill") {
    socket.end();
  } else {
    sys.puts(data);
    socket.write("\u0000", "binary");
    socket.write(data, "utf8");
    socket.write("\uffff", "binary");
  }
};



In this case, if a � (U+FFFD) is found it will do:



var data = original_data.split('\ufffd')[0].slice(1);
if (data == "kill") {
  socket.end();
}



Another thing that you could do is update Node to the latest stable version; from this post, OpenSSL and Breaking UTF-8 Change (fixed in Node v0.8.27 and v0.10.29):



As of these releases, if you try and pass a string with an unmatched surrogate pair, Node will replace that character with the unknown unicode character (U+FFFD). To preserve the old behavior set the environment variable NODE_INVALID_UTF8 to anything (even nothing). If the environment variable is present at all it will revert to the old behavior.



The problem you have is that one side sends the JSON in a different encoding than the one the other side uses to interpret it.



Try to solve this problem with the following code:


const { StringDecoder } = require('string_decoder');

this.ws.on('message', (rawdata) => {
  const decoder = new StringDecoder('utf8');
  const buffer = Buffer.from(rawdata);
  console.log(decoder.write(buffer));
});



Or with utf16le:




const { StringDecoder } = require('string_decoder');

this.ws.on('message', (rawdata) => {
  const decoder = new StringDecoder('utf16le');
  const buffer = Buffer.from(rawdata);
  console.log(decoder.write(buffer));
});



Please read: String Decoder Documentation





