tm Bigrams workaround still producing unigrams







I am trying to use tm's DocumentTermMatrix function to produce a matrix with bigrams instead of unigrams. I have tried to use the examples outlined here and here in my function (here are three examples):


make_dtm = function(main_df, stem=F) {
  # tokenizer built on tau::textcnt()
  tokenize_ngrams = function(x, n=2) return(rownames(as.data.frame(unclass(textcnt(x, method="string", n=n)))))
  decisions = Corpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenize=tokenize_ngrams,
                                                               stopwords=T,
                                                               tolower=T,
                                                               removeNumbers=T,
                                                               removePunctuation=T,
                                                               stemming = stem))
  return(decisions.dtm)
}


make_dtm = function(main_df, stem=F) {
  # tokenizer built on RWeka::NGramTokenizer()
  BigramTokenizer = function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  decisions = Corpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenize=BigramTokenizer,
                                                               stopwords=T,
                                                               tolower=T,
                                                               removeNumbers=T,
                                                               removePunctuation=T,
                                                               stemming = stem))
  return(decisions.dtm)
}


make_dtm = function(main_df, stem=F) {
  # tokenizer built on NLP::ngrams() and NLP::words()
  BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  decisions = Corpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenize=BigramTokenizer,
                                                               stopwords=T,
                                                               tolower=T,
                                                               removeNumbers=T,
                                                               removePunctuation=T,
                                                               stemming = stem))
  return(decisions.dtm)
}



Rather unfortunately, however, each of these three versions of the function produces exactly the same output: a DTM with unigrams rather than bigrams (screenshot below):



[screenshot: the resulting DocumentTermMatrix contains only single-word terms]



For your convenience, here is a subset of the data that I am working with:


x = data.frame("CaseName" = c("Attorney General's Reference (No.23 of 2011)", "Attorney General's Reference (No.31 of 2016)", "Joseph Hill & Co Solicitors, Re"),
               "CaseID" = c("[2011]EWCACrim1496", "[2016]EWCACrim1386", "[2013]EWCACrim775"),
               "CaseTranscriptText" = c("sanchez 2011 02187 6 appeal criminal division 8 2011 2011 ewca crim 14962011 wl 844075 wales wednesday 8 2011 attorney general reference 23 2011 36 criminal act 1988 representation qc general qc appeared behalf attorney general",
                                        "attorney general reference 31 2016 201601021 2 appeal criminal division 20 2016 2016 ewca crim 13862016 wl 05335394 dbe honour qc sitting cacd wednesday 20 th 2016 reference attorney general 36 criminal act 1988 representation",
                                        "matter wasted costs against company solicitors 201205544 5 appeal criminal division 21 2013 2013 ewca crim 7752013 wl 2110641 date 21 05 2013 appeal honour pawlak 20111354 hearing date 13 th 2013 representation toole respondent qc appellants"))




1 Answer



There are a few issues with your code. I'm focusing only on the last function you created, as I don't use the tau or RWeka packages.



1. To use the tokenizer you need to specify tokenizer = ..., not tokenize = ...



2. Instead of Corpus you need VCorpus.
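

A quick way to see the difference with the sample data (a sketch, based on my understanding that recent versions of tm build a SimpleCorpus when Corpus() is given a VectorSource, and that the SimpleCorpus route ignores custom tokenizers):


src = VectorSource(x$CaseTranscriptText)
class(Corpus(src))   # "SimpleCorpus" "Corpus"  -> custom tokenizer is ignored
class(VCorpus(src))  # "VCorpus" "Corpus"       -> tokenizer from the control list is used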



3. After adjusting this in your function make_dtm, I was not happy with the results: not everything specified in the control options was being processed correctly. I created a second function, make_dtm_adjusted, so you can see the differences between the two.


# OP's function adjusted to make it work
make_dtm = function(main_df, stem=F) {
  BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  decisions = VCorpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenizer=BigramTokenizer,
                                                               stopwords=T,
                                                               tolower=T,
                                                               removeNumbers=T,
                                                               removePunctuation=T,
                                                               stemming = stem))
  return(decisions.dtm)
}


# improved function
make_dtm_adjusted = function(main_df, stem=F) {
  BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  decisions = VCorpus(VectorSource(main_df$CaseTranscriptText))

  # do the cleaning on the corpus itself instead of inside the control list
  decisions <- tm_map(decisions, content_transformer(tolower))
  decisions <- tm_map(decisions, removeNumbers)
  decisions <- tm_map(decisions, removePunctuation)
  # specifying your own stopword list is better as you can use stopwords("smart")
  # or your own list
  decisions <- tm_map(decisions, removeWords, stopwords("english"))
  decisions <- tm_map(decisions, stripWhitespace)

  decisions.dtm = DocumentTermMatrix(decisions, control = list(stemming = stem,
                                                               tokenizer=BigramTokenizer))
  return(decisions.dtm)
}
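

A usage sketch with the sample data frame x from the question (assuming tm is attached; the NLP package, which provides ngrams() and words(), is loaded along with tm):


bigram.dtm = make_dtm_adjusted(x)
head(Terms(bigram.dtm))     # terms should now be two-word phrases
inspect(bigram.dtm[, 1:5])  # counts for the first few bigrams per document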





Could you elaborate on the differences between VCorpus and Corpus? I've been using Corpus for quite a while now and haven't had any issues
– mgrogger
Aug 13 at 21:47





Sorry for the late reply, but the bottom answer on this question explains it quite well.
– phiver
Aug 19 at 7:40






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
