How to count n-grams from a column




I am using n-grams for the first time. I took a DataFrame with multiple rows and columns, removed the stop words, and tokenized the text.
My code is this:


from nltk.corpus import stopwords
stop = stopwords.words('english')

# Exclude stopwords with a list comprehension applied to each row

testdf['issues_without_stopwords'] = testdf['issue'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
testdf['questions_without_stopwords'] = testdf['question'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))



# Remove punctuation and tokenize
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # \w+ keeps runs of word characters and drops punctuation
testdf['questions_tokenized'] = testdf['question'].apply(lambda x: tokenizer.tokenize(x))
testdf['issue_tokenized'] = testdf['issue'].apply(lambda x: tokenizer.tokenize(x))
testdf["Concate"] = testdf['issue_tokenized'] + testdf['questions_tokenized']


# Create your n-grams (1st method)

def find_ngrams(input_list, n):
    # Zip n staggered copies of the list to get consecutive n-token tuples
    return list(zip(*[input_list[i:] for i in range(n)]))


df1 = testdf["Concate"].apply(lambda x: find_ngrams(x, 4))

from itertools import tee, islice
from collections import Counter

# Create your n-grams and count them in cell (2nd method)
def ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break

df2 = Counter(ngrams(df2["value"], 4))



I was then able to convert the tokens into 4-grams.



This is my raw sample data:


   issue             question
0  Menstrual health  How to get my period back
1  stomach pain      any advise
2  Vaping            I am having a tonsillectomy tomorrow
3  Mental health     Ive been feeling sad most of the time
4  Kidney stone      I was diagnosed with one Saturday at Er



What I want is one column with all the n-grams and another column with each n-gram's frequency, something like this:


N-grams                        Freq
[(n, gram, talha)]             2
[(talha, software, python)]    1



I also need to merge duplicate n-grams that differ only in word order. For example, [(n, gram, talha)] and [(talha, gram, n)] should be shown once with a count of 2 (I know I already said freq above, I just wanted to be clear).
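To make that requirement concrete, here is a minimal sketch of order-insensitive counting. It assumes the n-grams are already available as tuples of strings; the helper name `order_insensitive_key` and the sample list are made up for illustration and are not part of my actual code.

```python
from collections import Counter

def order_insensitive_key(ngram):
    """Return a canonical form so reorderings of the same words compare equal."""
    # Sorting keeps repeated words, which a set-based key would collapse.
    return tuple(sorted(ngram))

# Hypothetical sample n-grams, taken from the example in the text.
sample_ngrams = [
    ("n", "gram", "talha"),
    ("talha", "gram", "n"),
    ("talha", "software", "python"),
]

counts = Counter(order_insensitive_key(g) for g in sample_ngrams)
for key, freq in counts.most_common():
    print(key, freq)
```

One consequence of this choice is that each group is displayed in sorted word order rather than in the order it first appeared, so it only fits if the original order really does not matter.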



EDIT: To avoid confusion, this is what I get right now:


Concate
0 [('Menstrual', 'health', 'How', 'to'), ('health', 'How', 'to', 'get'), ('How', 'to', 'get', 'my')]
1 [('stomach', 'pain', 'any', 'advise')]
2 [('Vaping', 'with', 'nicotine', 'before'), ('with', 'nicotine', 'before', 'tonsillectomy')]
3 [('Mental', 'health', 'Ive', 'been'), ('health', 'Ive', 'been', 'feeling'), ('Ive', 'been', 'feeling', 'sad'), ('been', 'feeling', 'sad', 'most'), ('feeling', 'sad', 'most', 'of'), ('sad', 'most', 'of', 'the'), ('most', 'of', 'the', 'time'), ('of', 'the', 'time', 'and')]
4 [('Kidney', 'stone', 'I', 'was'), ('stone', 'I', 'was', 'diagnosed'), ('I', 'was', 'diagnosed', 'with'), ('was', 'diagnosed', 'with', 'one')]
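Starting from a column shaped like the one above (one list of n-gram tuples per row), a possible way to get a single frequency table is to flatten all the lists and feed them to a `Counter`. This is only a sketch: `rows_of_ngrams` is a trimmed, hypothetical stand-in for the real column, with one n-gram repeated on purpose so that a count above 1 shows up, and `count_ngrams` is an illustrative name.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for the per-row n-gram lists (trimmed for brevity).
rows_of_ngrams = [
    [("Menstrual", "health", "How", "to"), ("health", "How", "to", "get")],
    [("stomach", "pain", "any", "advise")],
    [("Menstrual", "health", "How", "to")],  # repeated on purpose for the demo
]

def count_ngrams(rows):
    """Flatten an iterable of n-gram lists and count each distinct n-gram."""
    return Counter(chain.from_iterable(rows))

counts = count_ngrams(rows_of_ngrams)
table = counts.most_common()  # list of (ngram, freq) pairs, most frequent first
for ngram, freq in table:
    print(ngram, freq)
```

Since iterating over a pandas Series yields its values, passing the Series of lists itself as `rows` should behave the same way, and `pd.DataFrame(table, columns=["N-grams", "Freq"])` would then give the two-column layout.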





I just did, hope this helps
– Talha Qadeer
Aug 8 at 19:09





Is that code complete? I don't see testdf being defined anywhere.
– Erik
Aug 8 at 19:16







Are you sure you want to consider [(n, gram, talha)] and [(talha, gram, n)] as equal? N-grams are usually defined as sequences of words, so order is significant.
– Erik
Aug 8 at 19:27








In your output example, shouldn't [talha, software, python] be [(talha, software, python)]?
– Erik
Aug 8 at 19:28







testdf is just the first 5 rows (head) of my data, nothing else. Yes, I want to treat the two as equal because the order isn't very significant right now.
– Talha Qadeer
Aug 8 at 19:31








