dictionary - R function to correct words by the frequency of the closest word
I have a table with misspelled words. I need to correct each of them with the most similar word, choosing the one that has the higher frequency.
For example, after running

    aggregate(customerid ~ province, ventas2, length)

I get:

       province      customerid
    1                         2
    2  amba                  29
    3  baires                 1
    4  benos aires            1
    12 buenas aires           1
    17 buenos aires           4
    18 buenos aires           7
    19 buenos aires           3
    20 buenos aires       11337
    35 cordoba             2297
    36 cordoba                1
    38 cordobesa              1
    39 corrientes           424

So I need to replace "benos aires", "buenas aires", "baires" and the other "buenos aires" variants with "buenos aires", but "amba" shouldn't be replaced. "cordobesa" and the "cordoba" variant should be replaced with "cordoba", not with "corrientes".
How can I do this in R?
Thanks!
Here's a possible solution.
Disclaimer:
This code seems to work fine with your current example. I can't assure you that the current parameters (e.g. cut height, cluster agglomeration method, distance method etc.) will be valid for your real (complete) data.
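To get a feel for how much the cut height matters before relying on it, one option is to count the groups produced at several heights. The sketch below is only an illustration: it uses the unique spellings from the example (upper-cased, spaces removed) with the base R functions adist, hclust and cutree, and the candidate heights are arbitrary assumptions, not recommended values.

```r
# Unique city spellings from the example (spaces removed, upper-cased)
cities <- c("AMBA", "BAIRES", "BENOSAIRES", "BUENASAIRES", "BUENOSAIRES",
            "CORDOBA", "CORDOBESA", "CORRIENTES")

# Pairwise edit distances, then single-linkage clustering
d  <- adist(cities)
hc <- hclust(as.dist(d), method = "single")

# Candidate cut heights (assumed values, chosen only for illustration)
heights <- c(1.9, 2.9, 3.9, 4.9)

# Number of distinct clusters obtained at each height
n_clusters <- sapply(heights, function(h) length(unique(cutree(hc, h = h))))
names(n_clusters) <- heights
print(n_clusters)
```

A height at which the number of groups stays stable over a range is usually a safer choice than one sitting right at a jump, but this still needs checking against the complete data.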
    # recreating your data
    data <- read.csv(text=
    'city,occurr
    amba,29
    baires,1
    benos aires,1
    buenas aires,1
    buenos aires,4
    buenos aires,7
    buenos aires,3
    buenos aires,11337
    cordoba,2297
    cordoba,1
    cordobesa,1
    corrientes,424',stringsAsFactors=FALSE)

    # simple pre-processing of the city strings:
    # - removing spaces
    # - turning strings to uppercase
    cities <- gsub('\\s+','',toupper(data$city))

    # string distance computation
    # N.B. here you can play with the single components of the distance costs
    d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
    # assign the original city names to the distance matrix
    rownames(d) <- data$city

    # clustering the cities
    hc <- hclust(as.dist(d),method='single')

    # plot the cluster dendrogram
    plot(hc)
    # add the cluster rectangles (just to see the clusters)
    # N.B. I decided to cut at distance height < 5
    # (read as: "I consider equal 2 strings needing
    # less than 5 modifications to pass from one to the other")
    # You can use another value.
    rect.hclust(hc,h=4.9)

    # get the cluster ids
    clusters <- cutree(hc,h=4.9)

    # attach the cluster ids to the frequencies;
    # cutree() keeps the row order of data, so this is done by position
    # (a merge on city would duplicate rows when a spelling is repeated)
    merged <- data
    merged$clusterid <- unname(clusters)

    # add the citycorrected column to the merged data.frame
    ret <- by(merged, merged$clusterid, FUN=function(grp){
      idx <- which.max(grp$occurr)
      grp$citycorrected <- grp[idx,'city']
      return(grp)
    })

    fixed <- do.call(rbind,ret)

Result:

    > fixed
                 city occurr clusterid citycorrected
    1            amba     29         1          amba
    2.2        baires      1         2  buenos aires
    2.3   benos aires      1         2  buenos aires
    2.4  buenas aires      1         2  buenos aires
    2.5  buenos aires      4         2  buenos aires
    2.6  buenos aires      7         2  buenos aires
    2.7  buenos aires      3         2  buenos aires
    2.8  buenos aires  11337         2  buenos aires
    3.9       cordoba      1         3       cordoba
    3.10      cordoba   2297         3       cordoba
    3.11    cordobesa      1         3       cordoba
    4      corrientes    424         4    corrientes

Cluster plot: [dendrogram produced by plot(hc), with the rect.hclust rectangles marking the clusters]
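Once every spelling has a corrected name, the counts can be collapsed into one row per city. This is a minimal sketch rather than part of the original answer: it rebuilds the relevant columns of the result table by hand (the totals data frame is a name introduced here for illustration) and sums the occurrences with aggregate.

```r
# Rows copied from the result table (city, occurr and citycorrected columns)
fixed <- data.frame(
  city = c("amba", "baires", "benos aires", "buenas aires",
           "buenos aires", "buenos aires", "buenos aires", "buenos aires",
           "cordoba", "cordoba", "cordobesa", "corrientes"),
  occurr = c(29, 1, 1, 1, 4, 7, 3, 11337, 1, 2297, 1, 424),
  citycorrected = c("amba", rep("buenos aires", 7),
                    rep("cordoba", 3), "corrientes"),
  stringsAsFactors = FALSE
)

# Sum the occurrences of every spelling under its corrected name
totals <- aggregate(occurr ~ citycorrected, data = fixed, FUN = sum)
print(totals)
```

With the example data this gives 11354 occurrences for buenos aires and 2299 for cordoba, while amba and corrientes keep their original counts.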
