How to count n-grams from a column
I am using n-grams for the first time. I took a DataFrame with multiple rows and columns, removed the stop words, and tokenized the text.
My code is this:
from nltk.corpus import stopwords
stop = stopwords.words('english')
# Exclude stopwords with Python's list comprehension and pandas.DataFrame
testdf['issues_without_stopwords'] = testdf['issue'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
testdf['questions_without_stopwords'] = testdf['question'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
# Remove Punctuations and Tokenize
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')
testdf['questions_tokenized'] = testdf['question'].apply(lambda x: tokenizer.tokenize(x))
testdf['issue_tokenized'] = testdf['issue'].apply(lambda x: tokenizer.tokenize(x))
testdf["Concate"] = testdf['issue_tokenized']+ testdf['questions_tokenized']
# Create your n-grams (1st method)
def find_ngrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))
df1 = testdf["Concate"].apply(lambda x: find_ngrams(x, 4))
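To make the first method concrete, here is a minimal sketch of what find_ngrams returns for a single row. The token list is row 0 of the sample data (issue plus question), typed in by hand rather than read from testdf, so it runs without pandas or nltk.

```python
def find_ngrams(input_list, n):
    # Zip n shifted views of the list; zip stops at the shortest view,
    # so only complete n-grams are produced.
    return list(zip(*[input_list[i:] for i in range(n)]))

# Row 0 of the sample data, tokenized by hand for illustration.
tokens = ["Menstrual", "health", "How", "to", "get", "my", "period", "back"]

four_grams = find_ngrams(tokens, 4)
for gram in four_grams:
    print(gram)

# A list shorter than n yields no n-grams at all.
print(find_ngrams(["too", "short"], 4))
```

Each n-gram is a tuple, which matters later: tuples are hashable, so they can be used directly as Counter or dict keys.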
from itertools import tee, islice
from collections import Counter
# Create your n-grams and count them in cell (2nd method)
def ngrams(lst, n):
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)
            tlst = b
        else:
            break
df2 = Counter(ngrams(df2["value"], 4))
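As a rough illustration of the counting idea in the second method, the sketch below counts the n-grams of one flat token list with Counter. It swaps the tee/islice generator for a plain slicing generator, which is simpler but assumes the input is a list rather than an arbitrary iterator. The name ngrams_by_slicing and the toy tokens are made up for the example.

```python
from collections import Counter

def ngrams_by_slicing(tokens, n):
    """Yield consecutive n-grams (as tuples) from a list of tokens."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

# Toy input (hypothetical), chosen so that some bigrams repeat.
tokens = ["a", "b", "a", "b", "a"]

counts = Counter(ngrams_by_slicing(tokens, 2))
for gram, freq in counts.most_common():
    print(gram, freq)
```

Counter accepts any iterable of hashable items, so the generator can be passed in directly without building a list first.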
With this I was able to convert the tokens into 4-grams.
This is my raw sample data:
   issue             question
0  Menstrual health  How to get my period back
1  stomach pain      any advise
2  Vaping            I am having a tonsillectomy tomorrow
3  Mental health     Ive been feeling sad most of the time
4  Kidney stone      I was diagnosed with one Saturday at Er
What I want is one column with all the n-grams and another column with each n-gram's frequency, something like this:
N-grams                        Freq
[(n, gram, talha)]             2
[(talha, software, python)]    1
I also need to merge duplicate n-grams regardless of word order. For example, [(n, gram, talha)] and [(talha, gram, n)] should be shown once with a count of 2 (to be clear, that count is the Freq column I mentioned above).
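One possible way to get that table, sketched under some assumptions: each row is already a list of tokens (like the Concate column), and two n-grams count as the same when they contain the same words in any order. The sketch uses the sorted tuple of words as the grouping key and shows the first spelling it encountered. The function name count_unordered_ngrams and the three toy rows are made up for illustration.

```python
from collections import Counter

def find_ngrams(input_list, n):
    return list(zip(*[input_list[i:] for i in range(n)]))

def count_unordered_ngrams(rows, n):
    """Count n-grams over all rows, ignoring word order inside each n-gram."""
    counts = Counter()
    first_seen = {}  # order-insensitive key -> first n-gram seen with that key
    for tokens in rows:
        for gram in find_ngrams(tokens, n):
            # Sorting makes (n, gram, talha) and (talha, gram, n) share a key.
            # A sorted tuple (unlike a set) keeps repeated words distinct.
            key = tuple(sorted(gram))
            counts[key] += 1
            first_seen.setdefault(key, gram)
    return [(first_seen[key], freq) for key, freq in counts.most_common()]

# Toy rows (hypothetical) mirroring the desired output in the question.
rows = [
    ["n", "gram", "talha"],
    ["talha", "gram", "n"],
    ["talha", "software", "python"],
]

table = count_unordered_ngrams(rows, 3)
for gram, freq in table:
    print(gram, freq)
```

With pandas, the same function could presumably be fed testdf["Concate"] as rows, and the result wrapped with pd.DataFrame(table, columns=["N-grams", "Freq"]) to get the two columns. Note that the comparison is case-sensitive, so lowercasing the tokens first may be worth considering.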
EDIT: To avoid confusion, this is what I get right now:
Concate
0 [('Menstrual', 'health', 'How', 'to'), ('health', 'How', 'to', 'get'), ('How', 'to', 'get', 'my')]
1 [('stomach', 'pain', 'any', 'advise')]
2 [('Vaping', 'with', 'nicotine', 'before'), ('with', 'nicotine', 'before', 'tonsillectomy')]
3 [('Mental', 'health', 'Ive', 'been'), ('health', 'Ive', 'been', 'feeling'), ('Ive', 'been', 'feeling', 'sad'), ('been', 'feeling', 'sad', 'most'), ('feeling', 'sad', 'most', 'of'), ('sad', 'most', 'of', 'the'), ('most', 'of', 'the', 'time'), ('of', 'the', 'time', 'and')]
4 [('Kidney', 'stone', 'I', 'was'), ('stone', 'I', 'was', 'diagnosed'), ('I', 'was', 'diagnosed', 'with'), ('was', 'diagnosed', 'with', 'one')]
Is that code complete? I don't see testdf being defined anywhere.
– Erik
Aug 8 at 19:16
Are you sure you want to consider [(n, gram, talha)] and [(talha, gram, n)] as equal? N-grams are usually defined as sequences of words, so order is significant.
– Erik
Aug 8 at 19:27
In your output example, shouldn't [talha, software, python] be [(talha, software, python)]?
– Erik
Aug 8 at 19:28
testdf was just me loading the head of 5 from my data, nothing else. Yes, I want to treat the two as equal because the order isn't very significant right now.
– Talha Qadeer
Aug 8 at 19:31
I just did, hope this helps
– Talha Qadeer
Aug 8 at 19:09