How to count n-grams from a column

I am using n-grams for the first time. I took a DataFrame with multiple rows and columns, removed the stop words, and tokenized the text.
Here is my code:

    from collections import Counter
    from itertools import tee, islice

    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer

    stop = stopwords.words('english')

    # Exclude stopwords with Python's list comprehension and pandas.DataFrame
    testdf['issues_without_stopwords'] = testdf['issue'].apply(
        lambda x: ' '.join([word for word in x.split() if word not in (stop) if x[0]]))
    testdf['questions_without_stopwords'] = testdf['question'].apply(
        lambda x: ' '.join([word for word in x.split() if word not in (stop)]))

    # Remove punctuation and tokenize
    # (note: this tokenizes the original columns, not the stopword-free ones)
    tokenizer = RegexpTokenizer(r'\w+')
    testdf['questions_tokenized'] = testdf['question'].apply(lambda x: tokenizer.tokenize(x))
    testdf['issue_tokenized'] = testdf['issue'].apply(lambda x: tokenizer.tokenize(x))
    testdf["Concate"] = testdf['issue_tokenized'] + testdf['questions_tokenized']

    # Create your n-grams (1st method)
    def find_ngrams(input_list, n):
        return list(zip(*[input_list[i:] for i in range(n)]))

    df1 = testdf["Concate"].apply(lambda x: find_ngrams(x, 4))

    # Create your n-grams and count them in cell (2nd method)
    def ngrams(lst, n):
        tlst = lst
        while True:
            a, b = tee(tlst)
            l = tuple(islice(a, n))
            if len(l) == n:
                yield l
                next(b)
                tlst = b
            else:
                break

    df2 = Counter(ngrams(df2["value"], 4))

I was then able to convert them into 4-grams.

This is my raw sample data:

       issue             question
    0  Menstrual health  How to get my period back
    1  stomach pain      any advise
    2  Vaping            I am having a tonsillectomy tomorrow
    3  Mental health     Ive been feeling sad most of the time
    4  Kidney stone      I was diagnosed with one Saturday at Er

What I want is a column with all the n-grams and another column with their frequency, something like this:

    N-grams                        Freq
    [(n, gram, talha)]             2
    [(talha, software, python)]    1

I also need to merge duplicate n-grams: for example, [(n, gram, talha)] and [(talha, gram, n)] should be counted as 2 but shown once (just to be clear, since I know I said freq before lol).

EDIT: To avoid confusion, this is what I get right now:

       Concate
    0  [('Menstrual', 'health', 'How', 'to'), ('health', 'How', 'to', 'get'), ('How', 'to', 'get', 'my')]
    1  [('stomach', 'pain', 'any', 'advise')]
    2  [('Vaping', 'with', 'nicotine', 'before'), ('with', 'nicotine', 'before', 'tonsillectomy')]
    3  [('Mental', 'health', 'Ive', 'been'), ('health', 'Ive', 'been', 'feeling'), ('Ive', 'been', 'feeling', 'sad'), ('been', 'feeling', 'sad', 'most'), ('feeling', 'sad', 'most', 'of'), ('sad', 'most', 'of', 'the'), ('most', 'of', 'the', 'time'), ('of', 'the', 'time', 'and')]
    4  [('Kidney', 'stone', 'I', 'was'), ('stone', 'I', 'was', 'diagnosed'), ('I', 'was', 'diagnosed', 'with'), ('was', 'diagnosed', 'with', 'one')]
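One possible way to go from per-row n-gram lists to a single frequency table is to feed every row's n-grams into one `collections.Counter`. The following is only a sketch: `count_ngrams` and the sample `rows` are hypothetical names and data introduced for illustration, standing in for the `Concate` column of token lists. Order is still significant here.

```python
from collections import Counter


def find_ngrams(tokens, n):
    """Return all n-grams (as tuples) from a list of tokens."""
    return list(zip(*[tokens[i:] for i in range(n)]))


def count_ngrams(rows, n):
    """Count every n-gram across an iterable of token lists."""
    counts = Counter()
    for tokens in rows:
        # Counter.update() adds 1 for each n-gram in the list.
        counts.update(find_ngrams(tokens, n))
    return counts


# Hypothetical stand-in for testdf["Concate"] (one token list per row).
rows = [
    ["stomach", "pain", "any", "advise"],
    ["bad", "stomach", "pain", "any", "advise"],
]

counts = count_ngrams(rows, 4)
for ngram, freq in counts.most_common():
    print(ngram, freq)
```

Because iterating over a pandas Series yields its values, `count_ngrams(testdf["Concate"], 4)` should work unchanged, and `pd.DataFrame(counts.most_common(), columns=["ngram", "freq"])` is one way to get the two-column layout.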

I just did, hope this helps
– Talha Qadeer
Aug 8 at 19:09

Is that code complete? I don't see testdf being defined anywhere.
– Erik
Aug 8 at 19:16

Are you sure you want to consider [(n, gram, talha)] and [(talha, gram, n)] as equal? N-grams are usually defined as sequences of words, so order is significant.
– Erik
Aug 8 at 19:27

In your output example, shouldn't [talha, software, python] be [(talha, software, python)]?
– Erik
Aug 8 at 19:28

testdf is just the first 5 rows of my data (head), nothing else. Yes, I want to treat the two as equal because the order isn't very significant right now.
– Talha Qadeer
Aug 8 at 19:31
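If order really should be ignored, one option is to count each n-gram under a sorted copy of itself, so that permutations of the same words share a key, while remembering the first form seen for display. This is a minimal sketch under that assumption; `count_unordered_ngrams` and the sample `rows` are hypothetical and not taken from the question's data.

```python
from collections import Counter


def count_unordered_ngrams(rows, n):
    """Count n-grams across token lists, ignoring word order within an n-gram.

    Returns (first-seen form, frequency) pairs, most frequent first.
    """
    counts = Counter()
    first_seen = {}
    for tokens in rows:
        for ngram in zip(*[tokens[i:] for i in range(n)]):
            # Sorting makes ('n', 'gram', 'talha') and ('talha', 'gram', 'n')
            # map to the same key; repeated words are preserved.
            key = tuple(sorted(ngram))
            counts[key] += 1
            first_seen.setdefault(key, ngram)
    return [(first_seen[key], freq) for key, freq in counts.most_common()]


# Hypothetical token lists mirroring the example in the question.
rows = [
    ["n", "gram", "talha"],
    ["talha", "gram", "n"],
    ["talha", "software", "python"],
]

result = count_unordered_ngrams(rows, 3)
for ngram, freq in result:
    print(ngram, freq)
```

A sorted tuple is used as the key rather than a `frozenset` so that an n-gram containing the same word twice is not collapsed into a shorter one.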
