How to tokenize html tags with spacy?
I need to tokenize HTML text with spaCy, or merge the tags after tokenization. They can be any HTML tags, e.g.:
<br> <br/> <br > <n class="ggg">
There is an example of merging one particular tag in the documentation, but it can't work with all types of tags. If I write a rule like:
[{'ORTH': '<'}, ..., {'ORTH': '>'}]
It will join some tags:
<br><p>
Or it will split others up, like:
<
n
class="ggg
"
>
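For reference, a rule of that shape corresponds roughly to a Matcher pattern with a wildcard between the two brackets, merged afterwards with the retokenizer. A minimal sketch that reproduces the joining behaviour described above (assuming spaCy v3; the "HTML_TAG" label and the sample text are just illustrative):

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.blank("en")                     # tokenizer only, no trained model needed
matcher = Matcher(nlp.vocab)
# one or more arbitrary tokens between '<' and '>'
matcher.add("HTML_TAG", [[{"ORTH": "<"}, {"OP": "+"}, {"ORTH": ">"}]])

doc = nlp('<br><p> some text')
spans = [doc[start:end] for _, start, end in matcher(doc)]
with doc.retokenize() as retokenizer:
    for span in filter_spans(spans):        # keep the longest non-overlapping matches
        retokenizer.merge(span)

print([t.text for t in doc])
# adjacent tags such as '<br><p>' end up merged into a single token here,
# which is exactly the joining problem described above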
I have also tried to write a custom tokenizer, but I had problems with whitespace.
I want every HTML tag to be a separate token, e.g.:
<br>
<br >
<n class="ggg">
I want to parse the HTML without breaking its structure, then mark the HTML tags as stop words and work with the text only. Also, if I delete the HTML tags with an HTML parser like BeautifulSoup, I get other problems, such as joined words: <span>word</span><span>word</span>. So I need a tokenizer.
– Роман Коптев
Aug 12 at 14:33
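For what it's worth, the joined-word effect described in the comment is easy to reproduce, and BeautifulSoup's get_text() accepts a separator argument that avoids it; a small sketch, assuming bs4 is installed:

from bs4 import BeautifulSoup

html = '<span>word</span><span>word</span>'
soup = BeautifulSoup(html, "html.parser")

print(soup.get_text())      # 'wordword'  <- the joined-word problem
print(soup.get_text(" "))   # 'word word' <- a separator keeps the words apart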
1 Answer
IMHO, removing the HTML tags and converting to plain text is the correct way to go, rather than making HTML tags 'stop words', because some of those tags are actually valid words that can appear in text and should NOT be ignored (e.g., <body> vs body).
If you have a construct like
<span>word</span><span>word</span>
it renders as wordword in a user agent and should in fact be interpreted as a single word. For example, one might give you an HTML page containing something like:
<p><strong>S</strong>oup .... </p>
This obviously renders as 'Soup' and should be taken as the word soup, and not as the words s and oup.
Now, if for whatever reason you must assume that any HTML tag boundary is a word separator (wrong, in most cases), you should do the following: use an HTML stream tokenizer, e.g., libxml2, and write handlers for startElement and characters only. The former should output a single space and the latter should output the characters as it gets them. This will convert your HTML input to plain text (just like an HTML tag remover would do), but also add a space after each element tag, so <span>word</span><span>word</span> would get converted to "(space)word(space)word". This might add multiple spaces when nested tags are present, but you can easily deal with that when you split the cleaned-up text into words for further processing.
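A minimal sketch of that idea, using Python's built-in html.parser in place of libxml2 (the class name is just illustrative): handle_starttag plays the role of startElement and handle_data the role of characters.

from html.parser import HTMLParser

class SpacingTextExtractor(HTMLParser):
    """Convert HTML to plain text, emitting a space at every start tag."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")       # one space per opening tag

    def handle_data(self, data):
        self.parts.append(data)      # text content is passed through unchanged

    def get_text(self):
        return "".join(self.parts)

extractor = SpacingTextExtractor()
extractor.feed('<span>word</span><span>word</span>')
print(extractor.get_text())          # ' word word'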
That is specific to my texts. They don't have things like <p><strong>S</strong>oup .... </p>, but they do have many tags that are not separated by whitespace. If I break the document structure, it will be very difficult to render the results back as HTML. The spaCy documentation says its processing is non-destructive, so it can preserve the HTML tags. – Роман Коптев
Aug 12 at 15:08
I am still somewhat at a loss as to what you're trying to do. Do you need to just look through the text (and extract tokens, such as words, from it), or do you need to modify it, e.g., change some of the text but leave the HTML tags intact? If it is the former, 'breaking' the document isn't a problem (you convert a copy of it, not the original). If you need the latter, you should still use a proper HTML parser (like libxml2), but make it output tags untouched and feed each text chunk separately through your filter before it is output. – Leo K
Aug 12 at 15:16
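A sketch of the latter approach, again with Python's built-in html.parser rather than libxml2; the upper-casing filter is just a hypothetical stand-in for whatever per-text-chunk processing is needed, and the markup itself is written out untouched.

from html.parser import HTMLParser

def shout(text):
    # hypothetical filter: replace with the real text processing
    return text.upper()

class FilteringRewriter(HTMLParser):
    """Echo the markup untouched; run only the text chunks through a filter."""
    def __init__(self, text_filter):
        super().__init__(convert_charrefs=False)
        self.text_filter = text_filter
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())   # original start tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())   # self-closing tags, e.g. <br/>

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)               # keep entities as written

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_data(self, data):
        self.out.append(self.text_filter(data))      # only text nodes are modified

    def result(self):
        return "".join(self.out)

rw = FilteringRewriter(shout)
rw.feed('<span>word</span><span>word</span>')
print(rw.result())   # <span>WORD</span><span>WORD</span>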
Please try to clarify what it is that you are trying to achieve. "Tokenize" is too generic; how you tokenize depends on what a "token" is in the context of your specific problem. BTW, note that spaCy is meant for natural language processing and is likely not a good match for parsing a structured formal language like HTML. You may want to look at an actual HTML document parser instead.
– Leo K
Aug 12 at 14:29