java - Runtime of this implementation to find k most frequently recurring words in a string -
here's algorithm:
(a) make wordoccurrence class contains string sentence , int numberofoccurrences. let's assume sentence has n words. make class implement comparable, , make comparator numberofoccurrences first, , (arbitrarily,) alphabetical (natural) ordering of word second.
(b) iterate through sentence .split(‘ ‘) (or in place iteration save space). create new wordoccurrence object each unique word, updating occurrence , placing wordoccurrence objects treemap<wordoccurrence>.
(c) create new wordoccurrence object each unique word, placing wordoccurrence objects treemap<wordoccurrence> (and updating words' occurrences along way).
(d) call highestkey() on treemap, , place returned word resultant list. call lowerkey returned word (k - 1) times, placing words same resultant list.
(e) return resultant list.
my question: runtime of this? understanding:
step (b) takes o(n) time.
step (c) takes o(n*log(n)), each of n words, insertion o(log n).
step (d) takes o(k*log(n)), each call of highestkey() or lowerkey() takes o(log n) time.
so total runtime be: o(n + n*log(n) + k*log(n), o(n*log n).
is there tighter bound algorithm, or way o(n)?
thanks.
yours poor approach several reasons:
- swaddling in object going make constant factor here unreasonably high. you're creating @ least 2 objects per word in string; given average english word 6 letters, can figure out average blowup.
- you can use
treemap<string, integer>instead of usingwordoccurrencethingamajig. still horrible first reason, reduces amount of code have write. - your reasoning not take consideration fact keys strings , hence variable-length. not think affects anything, mark down if teaching assistant.
- for data structure, use hash table instead. can use hashmap in java instead , better. might @ trove or hashmap values
ints instead ofintegers.
also, there data structure called suffix tree. can build suffix tree on string in linear time , explore figure out distinct words , count of each. can linear-time selection find top k. loses hash table in practise, in theory avoids constant-time-lookup hash tables.
Comments
Post a Comment