Text Analysis Using LDA and tm in R


Hey guys, I'm having a little bit of trouble conducting LDA because, for some reason, once I'm ready to conduct the analysis I get errors. I'll do my best to go through what I'm doing; unfortunately I'm not able to provide the data because the data I'm using is proprietary.

dataset <- read.csv("proprietarydata.csv")

First I do a little bit of cleaning so that dataset$text and post are class character:

dataset$text <- as.character(dataset$text)
post <- gsub("[^[:print:]]", " ", dataset$post.content)
post <- gsub("[^[:alnum:]]", " ", post)

post ends up looking like this:

[1] "here string"
[2] "here string"
etc....
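To make the effect of the two gsub() calls concrete, here is a minimal sketch on made-up strings (the sample text is hypothetical, since the real data can't be shared). It only uses base R. Note that a post consisting solely of punctuation ends up as nothing but spaces, so it would later contribute no terms at all.

```r
# Hypothetical sample posts; these stand in for the proprietary data.
sample_post <- c("Error: can't connect\tto server-01!", "???")

# Step 1: replace non-printable characters (here, the tab) with a space.
step1 <- gsub("[^[:print:]]", " ", sample_post)

# Step 2: replace everything that is not a letter or digit with a space.
# Existing spaces are also "replaced", which leaves them unchanged.
step2 <- gsub("[^[:alnum:]]", " ", step1)

print(step2)

# Flag posts that contain no letters or digits after cleaning.
is_blank <- !grepl("[[:alnum:]]", step2)
print(is_blank)
```

The second element comes out as three spaces, which is worth checking for before building a document-term matrix.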

Then I created the following function, which does some more cleaning:

createdtm <- function(x){
  mycorpus <- Corpus(VectorSource(x))
  mycorpus <- tm_map(mycorpus, PlainTextDocument)
  docs <- tm_map(mycorpus, tolower)
  docs <- tm_map(docs, removeWords, stopwords(kind="SMART"))
  docs <- tm_map(docs, removeWords, c("the"," the","will","can","regards","need","thanks","please","http"))
  docs <- tm_map(docs, stripWhitespace)
  docs <- tm_map(docs, PlainTextDocument)
  return(docs)
}

predtm <- createdtm(post)

This ends up returning a corpus that gives me this for every document:

[[1]]
<<PlainTextDocument (metadata: 7)>>
here text string

[[2]]
<<PlainTextDocument (metadata: 7)>>
here string

Then I set myself up to get ready for LDA by creating a DocumentTermMatrix:

dtm <- DocumentTermMatrix(predtm)
inspect(dtm)

<<DocumentTermMatrix (documents: 14640, terms: 39972)>>
Non-/sparse entries: 381476/584808604
Sparsity           : 100%
Maximal term length: 86
Weighting          : term frequency (tf)

               Terms
Docs           truclientrre truddy trudi trudy true truebegin truecontrol
               Terms
Docs           truecrypt truecryptas trueimage truely truethis trulibraryref
               Terms
Docs           trumored truncate truncated truncatememory truncates
               Terms
Docs           truncatetableinautonomoustrx truncating trunk trunkhyper
               Terms
Docs           trunking trunkread trunks trunkswitch truss trust trustashtml
               Terms
Docs           trusted trustedbat trustedclient trustedclients
               Terms
Docs           trustedclientsjks trustedclientspwd trustedpublisher
               Terms
Docs           trustedreviews trustedsignon trusting trustiv trustlearn
               Terms
Docs           trustmanager trustpoint trusts truststorefile truststorepass
               Terms
Docs           trusty truth truthfully truths tryd tryed tryig tryin tryng

This looks odd to me, but this is how I have done this. I end up moving forward and running the following:

run.lda <- LDA(dtm, 4)

This returns my first error:

  Error in LDA(dtm, 4) :
    Each row of the input matrix needs to contain at least one non-zero entry

After researching the error I checked out the post "Remove empty documents from DocumentTermMatrix in R topicmodels?". I assumed I had everything under control and, excited, followed the steps in the link, but then:

This runs:

rowtotals <- apply(dtm , 1, sum) 

This doesn't:

dtm.new   <- dtm[rowtotals> 0] 

It returns:

  Error in `[.simple_triplet_matrix`(dtm, rowtotals > 0) :
    Logical vector subscripting disabled for this object.

I know I might get some heat because this isn't a reproducible example. Please feel free to ask anything about this problem. It's the best I can do.

It shouldn't be that hard to create a minimal reproducible example. For example:

library(tm)
library(topicmodels)
raw <- c("hello", "", "goodbye")
tm <- Corpus(VectorSource(raw))
dtm <- DocumentTermMatrix(tm)
LDA(dtm, 4)
# Error in LDA(dtm, 4) :
#   Each row of the input matrix needs to contain at least one non-zero entry

Nor should it be difficult to remember how to correctly subset a matrix (by specifying [row, col], not just [index]).

rowtotals <- apply(dtm, 1, sum)
dtm <- dtm[rowtotals > 0, ]
LDA(dtm, 4)
# A LDA_VEM topic model with 4 topics.
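As a variation on the row filter above, here is a hedged sketch that computes the row totals with slam::row_sums() (slam is the sparse-matrix package tm builds on) and keeps the original texts aligned with the rows that survive. The names raw, keep, dtm_nonempty and raw_nonempty are illustrative only, and the toy input is made up.

```r
library(tm)
library(slam)

# Toy input with one empty document (hypothetical data).
raw <- c("hello world", "", "goodbye world")
dtm <- DocumentTermMatrix(Corpus(VectorSource(raw)))

# Row totals computed on the sparse representation.
row_totals <- slam::row_sums(dtm)

# Logical mask of documents that contain at least one term.
keep <- row_totals > 0

# Subset rows with [row, col] indexing; the trailing comma matters.
dtm_nonempty <- dtm[keep, ]

# Keep the source texts in step with the filtered matrix.
raw_nonempty <- raw[keep]

print(dim(dtm_nonempty))
print(raw_nonempty)
```

Keeping the mask around makes it possible to map topic assignments back to the original posts after the empty ones are dropped.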

Please take the time to create reproducible examples. Often, in doing so, you discover your own error and can fix it yourself. At the very least, others can see the problem more clearly, and you eliminate unnecessary information.

