Count word frequency in Java
Upasana | November 19, 2020
In this article we calculate the frequency of each word in a given sentence using several approaches: plain Java, the Java 8 Map methods, Java 8 streams and parallel streams.
1. Using HashMap and a loop
This is the simplest and most verbose approach: we track the count of each word in a HashMap.
- Split the sentence into a list of words
- Loop over the word list
- If the HashMap already contains the word, increment its frequency count
- Otherwise, put the word into the HashMap with a frequency of 1
import static java.util.Arrays.asList;
import java.util.HashMap;
import java.util.Map;

public static void wordFreqV1() {
    String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
    Map<String, Integer> freqMap = new HashMap<>();
    // Split on single spaces and update the count word by word
    asList(text.split(" ")).forEach(s -> {
        if (freqMap.containsKey(s)) {
            // Word seen before: increment its count
            Integer count = freqMap.get(s);
            freqMap.put(s, count + 1);
        } else {
            // First occurrence of the word
            freqMap.put(s, 1);
        }
    });
    System.out.println(freqMap);
}
2. Using Java 8 Map & compute
Java 8 added the compute method to the Map interface. It takes a remapping function that receives the key and its current value (null if the key is absent) and returns the new value. This removes the explicit containsKey check we wrote in the previous example.
public static void wordFreqV2() {
    String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
    Map<String, Integer> freqMap = new HashMap<>();
    // Split on whitespace and periods
    asList(text.split("[\\s.]")).forEach(s -> {
        // count is null the first time a word is seen
        freqMap.compute(s, (s1, count) -> count == null ? 1 : count + 1);
    });
    System.out.println(freqMap);
}
Using merge instead of compute is an even cleaner and more concise approach.
public static void wordFreqV3() {
    String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
    Map<String, Integer> freqMap = new HashMap<>();
    asList(text.split("[\\s.]")).forEach(s -> {
        freqMap.merge(s, 1, Integer::sum); // (1)
    });
    System.out.println(freqMap);
}
(1) Upon every occurrence of a given word, add 1 to the previous value. For a word that is not in the map yet, merge simply stores 1.
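To check that the three update styles shown so far are interchangeable, here is a minimal sketch that counts the same words with containsKey/put, compute and merge, then compares the resulting maps. The class name UpdateStyles and its method names are hypothetical, chosen only for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

class UpdateStyles {
    // Style 1: explicit containsKey check followed by put.
    static Map<String, Integer> withContainsKey(String[] words) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            if (freq.containsKey(w)) {
                freq.put(w, freq.get(w) + 1);
            } else {
                freq.put(w, 1);
            }
        }
        return freq;
    }

    // Style 2: compute() passes null as the count for a word not in the map yet.
    static Map<String, Integer> withCompute(String[] words) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.compute(w, (key, count) -> count == null ? 1 : count + 1);
        }
        return freq;
    }

    // Style 3: merge() stores 1 for a new word, otherwise sums the old count and 1.
    static Map<String, Integer> withMerge(String[] words) {
        Map<String, Integer> freq = new HashMap<>();
        for (String w : words) {
            freq.merge(w, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        String[] words = "on on the teacher".split(" ");
        Map<String, Integer> expected = withContainsKey(words);
        System.out.println(expected.equals(withCompute(words))); // true
        System.out.println(expected.equals(withMerge(words)));   // true
    }
}
```

All three produce equal maps, so the choice between them is mostly about readability.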
3. Using Java 8 parallel stream
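We can leverage multiple cores by creating a parallel stream to compute the word frequency. Collectors.toConcurrentMap lets the worker threads accumulate into a single ConcurrentMap and merges the counts of duplicate words with Integer::sum. Note that this version also lowercases each word and drops the empty tokens produced by the split.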
public static void textWordFreq() {
    String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
    ConcurrentMap<String, Integer> freqMap =
        asList(text.split("[\\s.]"))
            .parallelStream()
            // Skip empty tokens produced by consecutive separators
            .filter(s -> !s.isEmpty())
            // Key: lowercased word, value: 1, duplicates merged by summing
            .collect(Collectors.toConcurrentMap(w -> w.toLowerCase(), w -> 1, Integer::sum));
    System.out.println(freqMap);
}
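For comparison, the same count can be sketched with a sequential stream. This variant is an assumption on our part rather than code from the article: it uses Collectors.groupingBy with Collectors.counting, so the values are Long instead of Integer, and the class name WordFreqStream and method wordFreq are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class WordFreqStream {
    // Counts each lowercased token; tokens are separated by whitespace or periods.
    static Map<String, Long> wordFreq(String text) {
        return Arrays.stream(text.split("[\\s.]"))
                .filter(s -> !s.isEmpty())        // drop empty tokens
                .map(String::toLowerCase)         // same normalization as the parallel version
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
        System.out.println(wordFreq(text));
    }
}
```

For a single short sentence the sequential version is usually the better fit, since splitting work across threads has its own overhead; the parallel stream is more likely to pay off on large inputs.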
Showing Top 3 frequent words
We can keep track of the top X most frequently used words with a PriorityQueue whose comparator orders words by their frequency.
PriorityQueue is a min-heap implementation in Java. We create a comparator that orders the heap elements by their frequency, so the word with the lowest frequency is always at the head of the queue. Whenever the heap grows beyond X elements, we remove the lowest-frequency word (in O(log n) time), which leaves only the X most frequent words once every word has been processed.
public static void textWordFreq() {
    String text = "Ann while Bob had had had had had had had had had had had a better effect on on the teacher";
    ConcurrentMap<String, Integer> freqMap =
        asList(text.split("[\\s.]"))
            .parallelStream()
            .filter(s -> !s.isEmpty())
            .collect(Collectors.toConcurrentMap(w -> w.toLowerCase(), w -> 1, Integer::sum));
    System.out.println(freqMap);
    // Priority queue that compares words by frequency; we cap its size at 3
    PriorityQueue<String> pq = new PriorityQueue<>(Comparator.comparingInt(freqMap::get)); // (1)
    for (String key : freqMap.keySet()) {
        pq.add(key); // (2)
        if (pq.size() > 3) {
            pq.poll(); // (3)
        }
    }
    // The queue prints in heap order, which is not necessarily sorted by frequency
    System.out.println("Top 3 words by occurrences : " + pq);
}
(1) A min-heap that orders its elements by the frequency of the given key in the frequency map, i.e. the word with the lowest frequency is at the head.
(2) Add a new element to the min-heap.
(3) If the min-heap has more than 3 elements, remove the one with the lowest frequency by calling the poll() method.
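Printing the queue directly shows the words in heap order. If the result should be listed from most to least frequent, one option is to drain the heap and reverse the result. The sketch below is only an illustration of that idea: the class TopWords, the method topK and the parameter k are hypothetical names, and the order of words with equal counts is left unspecified.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopWords {
    // Returns up to k words with the highest counts, most frequent first.
    static List<String> topK(Map<String, Integer> freqMap, int k) {
        // Min-heap ordered by frequency: the least frequent word sits at the head.
        Comparator<String> byCount = Comparator.comparingInt(freqMap::get);
        PriorityQueue<String> pq = new PriorityQueue<>(byCount);
        for (String word : freqMap.keySet()) {
            pq.add(word);
            if (pq.size() > k) {
                pq.poll(); // evict the current least frequent word
            }
        }
        // poll() yields ascending frequency, so reverse to get descending order.
        List<String> result = new ArrayList<>();
        while (!pq.isEmpty()) {
            result.add(pq.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Because the heap never holds more than k + 1 elements, each add or poll costs O(log k), so the whole pass over n distinct words takes O(n log k).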
Features of PriorityQueue
- The elements of the queue are ordered according to their natural ordering or by a comparator provided to the constructor.
- The head of the queue is the least element with respect to the specified ordering.
- PriorityQueue does not permit null elements.
- PriorityQueue is not thread safe. If multiple threads can modify the queue concurrently, use the PriorityBlockingQueue class instead.
- The iterator does not guarantee any particular traversal order. If you need ordered traversal of the elements, consider using Arrays.sort(pq.toArray()), passing the queue's comparator as well if it was built with one.
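The last point can be illustrated with a small sketch, assuming a queue of Integer values with natural ordering. The class OrderedTraversal and the helper sortedCopy are hypothetical names used only here.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class OrderedTraversal {
    // Returns the queue's elements in ascending order without removing them.
    static Integer[] sortedCopy(PriorityQueue<Integer> pq) {
        Integer[] copy = pq.toArray(new Integer[0]);
        Arrays.sort(copy); // natural ordering, matching a queue built without a comparator
        return copy;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>(Arrays.asList(5, 1, 4, 2, 3));
        System.out.println(pq);                              // heap order, not guaranteed to be sorted
        System.out.println(Arrays.toString(sortedCopy(pq))); // [1, 2, 3, 4, 5]
        System.out.println(pq.size());                       // 5, the queue is untouched
    }
}
```

Sorting a copy keeps the queue intact, whereas repeatedly calling poll() also yields the elements in order but empties the queue.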
Time Complexity
The min-heap approach has the following time complexity in Big O notation:
- O(log n) time for the enqueuing and dequeuing methods: offer(), poll(), remove() and add()
- O(1) constant time for the retrieval methods: peek(), element() and size()
- O(n) linear time for remove(Object) and contains(Object)
That’s all for this article.