Text Analysis Using LDA and tm in R
Hey guys, I am having a little bit of trouble conducting LDA because, for some reason, once I am ready to conduct the analysis I get errors. I'll do my best to go through what I am doing; unfortunately I am not able to provide the data, because the data I am using is proprietary.
    dataset <- read.csv("proprietarydata.csv")

First I do a little bit of cleaning; dataset$text and post are both of class character:

    dataset$text <- as.character(dataset$text)
    post <- gsub("[^[:print:]]", " ", dataset$post.content)
    post <- gsub("[^[:alnum:]]", " ", post)
post ends up looking like this:

    [1] "here string"
    [2] "here string"
    etc....
Then I created the following function to do some more cleaning:

    createdtm <- function(x){
      mycorpus <- Corpus(VectorSource(x))
      mycorpus <- tm_map(mycorpus, PlainTextDocument)
      docs <- tm_map(mycorpus, tolower)
      docs <- tm_map(docs, removeWords, stopwords(kind="SMART"))
      docs <- tm_map(docs, removeWords, c("the"," the","will","can","regards","need","thanks","please","http"))
      docs <- tm_map(docs, stripWhitespace)
      docs <- tm_map(docs, PlainTextDocument)
      return(docs)
    }

    predtm <- createdtm(post)
In the end this returns a corpus that gives me something like this for every document:

    [[1]]
    <<PlainTextDocument (metadata: 7)>>
    here text string

    [[2]]
    <<PlainTextDocument (metadata: 7)>>
    here string
Then I set myself up to get ready for LDA by creating a DocumentTermMatrix:

    dtm <- DocumentTermMatrix(predtm)
    inspect(dtm)

    <<DocumentTermMatrix (documents: 14640, terms: 39972)>>
    Non-/sparse entries: 381476/584808604
    Sparsity           : 100%
    Maximal term length: 86
    Weighting          : term frequency (tf)

    Docs truclientrre truddy trudi trudy true truebegin truecontrol
         Terms
    Docs truecrypt truecryptas trueimage truely truethis trulibraryref
         Terms
    Docs trumored truncate truncated truncatememory truncates
         Terms
    Docs truncatetableinautonomoustrx truncating trunk trunkhyper
         Terms
    Docs trunking trunkread trunks trunkswitch truss trust trustashtml
         Terms
    Docs trusted trustedbat trustedclient trustedclients
         Terms
    Docs trustedclientsjks trustedclientspwd trustedpublisher
         Terms
    Docs trustedreviews trustedsignon trusting trustiv trustlearn
         Terms
    Docs trustmanager trustpoint trusts truststorefile truststorepass
         Terms
    Docs trusty truth truthfully truths tryd tryed tryig tryin tryng
This looks odd to me, and I am not sure this is how I should have done it. I end up moving forward anyway and run the following:

    run.lda <- LDA(dtm, 4)

This returns my first error:

    Error in LDA(dtm, 4) :
      Each row of the input matrix needs to contain at least one non-zero entry
After researching this error I checked out the post "Remove empty documents from DocumentTermMatrix in R topicmodels?". I assumed I had everything under control and was excited to follow the steps in the link, but then:

This runs:

    rowtotals <- apply(dtm , 1, sum)

This doesn't:

    dtm.new <- dtm[rowtotals> 0]

It returns:

    Error in `[.simple_triplet_matrix`(dtm, rowtotals > 0) :
      Logical vector subscripting disabled for this object.
I know I might get some heat for this because it isn't a reproducible example. Please feel free to ask me anything about this problem. It's the best I can do.
It shouldn't be hard to create a minimal reproducible example. For example:

    library(tm)
    library(topicmodels)

    raw <- c("hello","","goodbye")
    tm <- Corpus(VectorSource(raw))
    dtm <- DocumentTermMatrix(tm)
    LDA(dtm,4)
    # Error in LDA(dtm, 4) :
    #   Each row of the input matrix needs to contain at least one non-zero entry
Nor should it be difficult to remember how to correctly subset a matrix (by specifying [row,col], not [index]).
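As an illustration of the difference between the two subscript forms, here is a small sketch on a plain base-R matrix. The names `m`, `keep`, `rows_kept` and `flat` are made up for this example and do not come from the question.

```r
# A 3 x 2 integer matrix, filled column by column: rows are (1,4), (2,5), (3,6).
m <- matrix(1:6, nrow = 3)

# One logical flag per row.
keep <- c(TRUE, FALSE, TRUE)

# [row, col] form: keeps rows 1 and 3, and the result is still a matrix.
rows_kept <- m[keep, ]

# [index] form: m is treated as a flat vector and keep is recycled over
# all six elements, so the matrix structure is lost.
flat <- m[keep]
```

A document-term matrix follows the same [row, col] convention, except that the single-index logical form is rejected outright instead of returning a flattened vector.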
    rowtotals <- apply(dtm , 1, sum)
    dtm <- dtm[rowtotals>0,]
    LDA(dtm, 4)
    # A LDA_VEM topic model with 4 topics.
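On a matrix as large as the one in the question (14640 documents by 39972 terms), `apply(dtm, 1, sum)` may be expensive, because `apply` coerces its argument to a dense matrix first. One possible alternative is `row_sums` from the slam package, which works on the sparse representation that tm uses. This is a sketch, not part of the fix above. The toy corpus and the names `row_totals` and `dtm_nonempty` are assumptions for illustration only.

```r
library(tm)

# Toy corpus with one empty document, standing in for the proprietary data.
raw <- c("hello world", "", "goodbye world")
dtm <- DocumentTermMatrix(Corpus(VectorSource(raw)))

# Sum each row on the sparse structure, without building a dense matrix.
row_totals <- slam::row_sums(dtm)

# Keep only documents with at least one term; note the comma for [row, col].
dtm_nonempty <- dtm[row_totals > 0, ]
```

The resulting `dtm_nonempty` can then be passed to `LDA` in the same way as in the fix above.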
Please take the time to create reproducible examples. Often, in doing so, you will discover your own error and can fix it yourself. At the very least, it will help others see the problem more clearly and eliminate unnecessary info.