dictionary - R function to correct words by frequency of more proximate word -

January 15, 2014

I have a table with misspelled words. I need to correct each word using the most similar one, choosing the one that has the higher frequency.

For example, after I run

aggregate(customerid ~ province, ventas2, length)

        province customerid
1
2           amba         29
3         baires          1
4    benos aires          1
12  buenas aires          1
17 buenos  aires          4
18  buenos aires          7
19  buenos aires          3
20  buenos aires      11337
35       cordoba       2297
36       cordoba          1
38     cordobesa          1
39    corrientes        424

So I need to replace baires and the other misspelled variants of buenos aires with buenos aires, but amba shouldn't be replaced. Likewise, cordobesa and the cordoba variants should be replaced with cordoba, not corrientes.

How can I do this in R?

Thanks!

Here's a possible solution.

Disclaimer:
This code seems to work fine with the current example. I can't guarantee that the current parameters (e.g. cut height, cluster agglomeration method, distance method, etc.) will be valid for your real (complete) data.
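To make the disclaimer a bit more concrete, here is a minimal sketch (not part of the original answer) of how the cut height changes the grouping. It uses the spellings from the question. The names key, heights and n_clusters are only illustrative, and the heights tried are arbitrary picks.

```r
# City spellings from the question (frequencies are not needed for this check)
city <- c('amba', 'baires', 'benos aires', 'buenas aires', 'buenos  aires',
          'buenos aires', 'cordoba', 'cordobesa', 'corrientes')

# Same pre-processing as in the answer: uppercase, whitespace removed
key <- gsub('\\s+', '', toupper(city))

# Levenshtein distances (adist defaults to unit costs)
d <- adist(key)
dimnames(d) <- list(city, city)

# A few pairwise distances
d['baires', 'benos aires']    # 4
d['baires', 'buenos aires']   # 5
d['cordoba', 'cordobesa']     # 2

# Number of clusters obtained at several (arbitrary) cut heights
hc <- hclust(as.dist(d), method = 'single')
heights <- c(0.5, 1.5, 2.5, 4.9)
n_clusters <- sapply(heights, function(h) length(unique(cutree(hc, h = h))))
print(data.frame(height = heights, clusters = n_clusters))
```

With single linkage, baires only joins the buenos aires group through benos aires (distance 4), so a lower cut height would leave it on its own. It is probably worth running a check like this on the real data before trusting any particular threshold.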

# recreating your data
data <- read.csv(text=
'city,occurr
amba,29
baires,1
benos aires,1
buenas aires,1
buenos  aires,4
buenos aires,7
buenos aires,3
buenos aires,11337
cordoba,2297
cordoba,1
cordobesa,1
corrientes,424', stringsAsFactors=FALSE)

# simple pre-processing of the city strings:
# - removing spaces
# - turning the strings to uppercase
cities <- gsub('\\s+', '', toupper(data$city))

# string distance computation
# N.B. here you can play with the single components of the distance costs
d <- adist(cities, costs=list(insertions=1, deletions=1, substitutions=1))
# assign the original city names to the distance matrix
rownames(d) <- data$city
# clustering the cities
hc <- hclust(as.dist(d), method='single')

# plot the cluster dendrogram
plot(hc)
# add the cluster rectangles (just to see the clusters)
# N.B. I decided to cut at distance height < 5
#      (read as: "I consider equal 2 strings needing
#       less than 5 modifications to pass from one to the other")
#      Of course you can use another value.
rect.hclust(hc, h=4.9)

# get the cluster ids (cutree returns them in the same order as the rows of data)
clusters <- cutree(hc, h=4.9)

# attach the cluster ids to the frequencies by position
# (a merge by 'city' would duplicate rows here, because some
#  spellings appear more than once in this copy of the data)
merged <- data
merged$clusterid <- unname(clusters)

# add the citycorrected column to the merged data.frame
ret <- by(merged,
          merged$clusterid,
          FUN=function(grp){
                idx <- which.max(grp$occurr)
                grp$citycorrected <- grp[idx,'city']
                return(grp)
              })

fixed <- do.call(rbind, ret)

Result:

> fixed
              city occurr clusterid citycorrected
1             amba     29         1          amba
2.2         baires      1         2  buenos aires
2.3    benos aires      1         2  buenos aires
2.4   buenas aires      1         2  buenos aires
2.5  buenos  aires      4         2  buenos aires
2.6   buenos aires      7         2  buenos aires
2.7   buenos aires      3         2  buenos aires
2.8   buenos aires  11337         2  buenos aires
3.9        cordoba   2297         3       cordoba
3.10       cordoba      1         3       cordoba
3.11     cordobesa      1         3       cordoba
4       corrientes    424         4    corrientes
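If the correction has to be applied more than once, the same steps could be wrapped in a small function. This is only a sketch under the same assumptions as the example: correct_by_frequency is a hypothetical helper name, and its defaults (h = 4.9, single linkage) are simply the values used in this example, so they may need tuning on real data.

```r
# Hypothetical helper: map every spelling to the most frequent spelling
# found in its cluster (clusters built from Levenshtein distances)
correct_by_frequency <- function(words, freq, h = 4.9, method = 'single') {
  key <- gsub('\\s+', '', toupper(words))           # normalise the strings
  hc <- hclust(as.dist(adist(key)), method = method)
  cluster <- cutree(hc, h = h)                      # one id per input word
  # for each cluster, keep the word with the highest frequency
  best <- vapply(split(seq_along(words), cluster),
                 function(i) words[i][which.max(freq[i])],
                 character(1))
  unname(best[as.character(cluster)])
}

# The data from the example
data <- data.frame(
  city = c('amba', 'baires', 'benos aires', 'buenas aires', 'buenos  aires',
           'buenos aires', 'buenos aires', 'buenos aires',
           'cordoba', 'cordoba', 'cordobesa', 'corrientes'),
  occurr = c(29, 1, 1, 1, 4, 7, 3, 11337, 2297, 1, 1, 424),
  stringsAsFactors = FALSE
)

data$citycorrected <- correct_by_frequency(data$city, data$occurr)

# Frequencies summed per corrected name
totals <- aggregate(occurr ~ citycorrected, data, sum)
print(totals)
```

Summing the counts after the correction gives one row per corrected name, which is usually what you want from the original aggregate call.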

Cluster plot:

[Figure: dendrogram of the city clusters, with the rectangles drawn by rect.hclust at height 4.9]
