Text Analysis Using LDA and tm in R

January 15, 2012

Hey guys, I'm having a little bit of trouble conducting LDA because, for some reason, once I'm ready to conduct the analysis I get errors. I'll do my best to go through what I'm doing. Unfortunately I'm not able to provide data, because the data I'm using is proprietary.

dataset <- read.csv("proprietarydata.csv")

First I do a little bit of cleaning of data$text, so that post is of class character:

dataset$text <- as.character(dataset$text)
post <- gsub("[^[:print:]]"," ",data$post.content)
post <- gsub("[^[:alnum:]]", " ",post)

post ends up looking like this:

`[1] "here string"  [2] "here string"  etc....`

Then I created the following function, which does more cleaning:

createdtm <- function(x){
  mycorpus <- Corpus(VectorSource(x))
  mycorpus <- tm_map(mycorpus, PlainTextDocument)
  docs <- tm_map(mycorpus, tolower)
  docs <- tm_map(docs, removeWords, stopwords(kind="SMART"))
  docs <- tm_map(docs, removeWords, c("the"," the","will","can","regards","need","thanks","please","http"))
  docs <- tm_map(docs, stripWhitespace)
  docs <- tm_map(docs, PlainTextDocument)
  return(docs)
}

predtm <- createdtm(post)

This ends up returning a corpus that gives me this for every document:

[[1]]
<<PlainTextDocument (metadata: 7)>>
here text string

[[2]]
<<PlainTextDocument (metadata: 7)>>
here string

Then I set myself up to get ready for LDA by creating a DocumentTermMatrix:

dtm <- DocumentTermMatrix(predtm)
inspect(dtm)

<<DocumentTermMatrix (documents: 14640, terms: 39972)>>
Non-/sparse entries: 381476/584808604
Sparsity           : 100%
Maximal term length: 86
Weighting          : term frequency (tf)

Docs           truclientrre truddy trudi trudy true truebegin truecontrol
              Terms
Docs           truecrypt truecryptas trueimage truely truethis trulibraryref
              Terms
Docs           trumored truncate truncated truncatememory truncates
              Terms
Docs           truncatetableinautonomoustrx truncating trunk trunkhyper
              Terms
Docs           trunking trunkread trunks trunkswitch truss trust trustashtml
              Terms
Docs           trusted trustedbat trustedclient trustedclients
              Terms
Docs           trustedclientsjks trustedclientspwd trustedpublisher
              Terms
Docs           trustedreviews trustedsignon trusting trustiv trustlearn
              Terms
Docs           trustmanager trustpoint trusts truststorefile truststorepass
              Terms
Docs           trusty truth truthfully truths tryd tryed tryig tryin tryng

This looks odd to me, but this is how I have done it. So I end up moving forward and doing the following:

run.lda <- LDA(dtm,4)

This returns my first error:

  Error in LDA(dtm, 4) :
    Each row of the input matrix needs to contain at least one non-zero entry

After researching this error I checked out the post "Remove empty documents from DocumentTermMatrix in R topicmodels?". I assumed I had everything under control and got excited, so I followed the steps in the link, but then:

This runs:

rowtotals <- apply(dtm , 1, sum)

This doesn't:

dtm.new   <- dtm[rowtotals> 0]

It returns:

  Error in `[.simple_triplet_matrix`(dtm, rowtotals > 0) :
    Logical vector subscripting disabled for this object.

I know I might get heat for this because it isn't a reproducible example. Please feel free to ask me anything about this problem. It's the best I can do.

It shouldn't be hard to create a minimal reproducible example. For example:

library(tm)
library(topicmodels)

raw <- c("hello","","goodbye")
tm <- Corpus(VectorSource(raw))
dtm <- DocumentTermMatrix(tm)
LDA(dtm,4)
# Error in LDA(dtm, 4) :
#   Each row of the input matrix needs to contain at least one non-zero entry
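The error means at least one document contributes no terms at all. Here is a minimal sketch for finding which ones. It assumes the slam package is installed (tm depends on it). Using row_sums() is a suggestion and not part of the original post, and the names raw_docs, toy_dtm and empty_docs are made up for illustration:

```r
library(tm)
library(slam)  # assumed installed; provides row_sums() for sparse matrices

# Toy input: the second document is empty, as in the example above.
raw_docs <- c("hello", "", "goodbye")
toy_dtm  <- DocumentTermMatrix(Corpus(VectorSource(raw_docs)))

# row_sums() totals the term counts per document while keeping the
# matrix in its sparse form.
totals <- row_sums(toy_dtm)

# Positions of the documents that ended up with no terms.
empty_docs <- which(totals == 0)
print(empty_docs)
```

On a matrix the size of the one in the question (14640 documents by 39972 terms), this should also be lighter than apply(dtm, 1, sum), since apply() converts its input to a dense matrix first.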

Nor should it be difficult to remember how to correctly subset a matrix (by specifying [row, col], not [index]).

rowtotals <- apply(dtm , 1, sum)
dtm <- dtm[rowtotals>0,]
LDA(dtm, 4)
# A LDA_VEM topic model with 4 topics.
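To see why the trailing comma matters, here is a small sketch on an ordinary base R matrix. It is only a stand-in with invented counts: a real DocumentTermMatrix is a sparse object and refuses the single-subscript form outright, as the error in the question shows, whereas a dense matrix silently returns something else.

```r
# A small dense matrix standing in for a document-term matrix:
# rows are documents, columns are terms (toy counts, not real data).
m <- matrix(c(1, 0, 2,
              0, 0, 0,
              3, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(Docs = c("d1", "d2", "d3"),
                            Terms = c("a", "b", "c")))

keep <- rowSums(m) > 0   # TRUE for documents with at least one term

# Single subscript: the logical vector is recycled over individual
# cells, and the result is a plain vector, not a matrix.
by_index <- m[keep]

# Row subscript plus an empty column subscript: whole rows are kept
# and the result is still a matrix.
by_row <- m[keep, ]

print(dim(by_index))  # NULL, the matrix structure is gone
print(dim(by_row))
```

The second form is the one the LDA call needs, because it drops the empty document while keeping every term column.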

Please take the time to create reproducible examples. In doing so you will often discover your own error and can fix it yourself. At the very least, others will be able to see the problem more clearly and eliminate unnecessary info.
