1. Overview
According to Wikipedia, an anagram is a word or phrase formed by rearranging the letters of a different word or phrase.
We can generalize this in string processing by saying that an anagram of a string is another string with exactly the same quantity of each character in it, in any order.
In this tutorial, we’re going to look at detecting whole string anagrams where the quantity of each character must be equal, including non-alpha characters such as spaces and digits. For example, “!low-salt!” and “owls-lat!!” would be considered anagrams as they contain exactly the same characters.
2. Solution
Let’s compare a few solutions that can decide if two strings are anagrams. Each solution will check at the start whether the two strings have the same number of characters. This is a quick way to exit early since inputs with different lengths cannot be anagrams.
For each possible solution, let’s look at the implementation complexity for us as developers. We’ll also look at the time complexity for the CPU, using big O notation.
3. Check by Sorting
We can rearrange the characters of each string by sorting their characters, which will produce two normalized arrays of characters.
If two strings are anagrams, their normalized forms should be the same.
In Java, we can first convert the two strings into char[] arrays. Then we can sort these two arrays and check for equality:
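// Requires: import java.util.Arrays;
boolean isAnagramSort(String string1, String string2) {
    // Strings of different lengths can never be anagrams.
    if (string1.length() != string2.length()) {
        return false;
    }
    // Copy each string into an array so it can be sorted in place.
    char[] a1 = string1.toCharArray();
    char[] a2 = string2.toCharArray();
    Arrays.sort(a1);
    Arrays.sort(a2);
    // The sorted forms are identical exactly when the strings are anagrams.
    return Arrays.equals(a1, a2);
}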
This solution is easy to understand and implement. However, the overall running time of this algorithm is O(n log n) because sorting an array of n characters takes O(n log n) time.
The algorithm also has to copy both input strings into character arrays before sorting them, so it uses O(n) extra memory.
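To make the idea of a normalized form concrete, here is a minimal sketch that sorts the characters of a string and returns them as a new String. The class name SortedFormDemo and the helper sortedForm are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

class SortedFormDemo {

    // Hypothetical helper: returns the characters of the input in sorted order.
    static String sortedForm(String source) {
        char[] chars = source.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        // Both example inputs from the overview normalize to the same string.
        System.out.println(sortedForm("!low-salt!")); // !!-allostw
        System.out.println(sortedForm("owls-lat!!")); // !!-allostw
    }
}
```

Two strings are anagrams exactly when their sorted forms are equal, which is what the Arrays.equals call in isAnagramSort checks.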
4. Check by Counting
An alternative strategy is to count the number of occurrences of each character in our inputs. If these histograms are equal between the inputs, then the strings are anagrams.
To save a little memory, let’s build only one histogram. We’ll increment the counts for each character in the first string, and decrement the count for each character in the second. If the two strings are anagrams, then the result will be that everything balances to 0.
The histogram needs a fixed-size table of counts with a size defined by the character set size. For example, if we only use one byte to store each character, then we can use a counting array size of 256 to count the occurrence of each character:
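private static final int CHARACTER_RANGE = 256;

public boolean isAnagramCounting(String string1, String string2) {
    if (string1.length() != string2.length()) {
        return false;
    }
    // One counter per character value. This assumes every char is below
    // CHARACTER_RANGE; otherwise the array access throws ArrayIndexOutOfBoundsException.
    int[] count = new int[CHARACTER_RANGE];
    for (int i = 0; i < string1.length(); i++) {
        count[string1.charAt(i)]++; // seen in the first string
        count[string2.charAt(i)]--; // seen in the second string
    }
    // Any non-zero counter means the character counts differ.
    for (int i = 0; i < CHARACTER_RANGE; i++) {
        if (count[i] != 0) {
            return false;
        }
    }
    return true;
}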
This solution is faster, with a time complexity of O(n). However, it needs extra space for the counting array. With 256 integers, enough for ASCII and other single-byte character sets, that’s not too bad.
However, if we need to increase CHARACTER_RANGE to support larger character sets, this would become much more memory hungry. A Java char is a 16-bit UTF-16 code unit, so covering every possible char value takes 65,536 counters, and the full Unicode range is larger still. Therefore, it’s only really practical when the number of possible characters is in a small range.
From a development point of view, this solution contains more code to maintain and makes less use of Java library functions.
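To illustrate the memory trade-off, the following sketch sizes the table for every possible Java char value instead of 256. The class name FullRangeCountingDemo and the method name isAnagramCountingFullRange are hypothetical, and the sketch assumes we are willing to allocate 65,536 int counters (about 256 KB) per call. Like the code above, it counts UTF-16 code units rather than full Unicode code points.

```java
class FullRangeCountingDemo {

    // Assumption: one counter for each possible char value (0 to 65535).
    private static final int CHARACTER_RANGE = Character.MAX_VALUE + 1;

    static boolean isAnagramCountingFullRange(String string1, String string2) {
        if (string1.length() != string2.length()) {
            return false;
        }
        int[] count = new int[CHARACTER_RANGE];
        for (int i = 0; i < string1.length(); i++) {
            count[string1.charAt(i)]++;
            count[string2.charAt(i)]--;
        }
        // The strings are anagrams only if every counter balanced back to zero.
        for (int c : count) {
            if (c != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // '\u00e9' (e with acute accent) has the value 233, which fits in either table size.
        System.out.println(isAnagramCountingFullRange("caf\u00e9", "\u00e9fac"));
        // '\u65e5' and '\u672c' lie far beyond 255 and would overflow a 256-entry table.
        System.out.println(isAnagramCountingFullRange("\u65e5\u672c", "\u672c\u65e5"));
    }
}
```

The final loop now scans 65,536 entries regardless of input length, so for short strings the fixed cost of the table can dominate the running time.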
5. Check with Multiset
We can simplify the counting and comparing process by using Multiset. A Multiset is a collection that allows duplicate elements and whose equality doesn’t depend on element order. For example, the multisets {a, a, b} and {a, b, a} are equal.
To use Multiset, we first need to add the Guava dependency to our project’s pom.xml file:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>31.0.1-jre</version>
</dependency>
We will convert each of our input strings into a Multiset of characters. Then we’ll check if they’re equal:
// Requires: import com.google.common.collect.HashMultiset;
//           import com.google.common.collect.Multiset;
boolean isAnagramMultiset(String string1, String string2) {
    if (string1.length() != string2.length()) {
        return false;
    }
    Multiset<Character> multiset1 = HashMultiset.create();
    Multiset<Character> multiset2 = HashMultiset.create();
    for (int i = 0; i < string1.length(); i++) {
        multiset1.add(string1.charAt(i));
        multiset2.add(string2.charAt(i));
    }
    return multiset1.equals(multiset2);
}
This algorithm solves the problem in O(n) expected time without having to declare a big counting array.
It’s similar to the previous counting solution. However, rather than using a fixed-size table to count, we take advantage of the Multiset class to simulate a variable-sized table, with a count for each character.
The code for this solution makes more use of high-level library capabilities than our counting solution.
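For projects that can’t add Guava, the same variable-sized table idea can be sketched with a plain java.util.HashMap. This is an illustrative alternative rather than part of the Multiset solution. The class name MapCountingDemo and the method name isAnagramMapCounting are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class MapCountingDemo {

    static boolean isAnagramMapCounting(String string1, String string2) {
        if (string1.length() != string2.length()) {
            return false;
        }
        // The map holds an entry only for characters that actually occur.
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < string1.length(); i++) {
            counts.merge(string1.charAt(i), 1, Integer::sum);  // increment for the first string
            counts.merge(string2.charAt(i), -1, Integer::sum); // decrement for the second string
        }
        // The strings are anagrams only if every count balanced back to zero.
        for (int value : counts.values()) {
            if (value != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAnagramMapCounting("!low-salt!", "owls-lat!!")); // true
        System.out.println(isAnagramMapCounting("abc", "abd"));               // false
    }
}
```

Like the single-histogram approach, this keeps one table and balances it to zero, but its size grows with the number of distinct characters in the input rather than with the size of the character set.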
6. Letter-based Anagram
The examples so far do not strictly adhere to the linguistic definition of an anagram. This is because they consider punctuation characters part of the anagram, and they are case sensitive.
Let’s adapt the algorithms to enable a letter-based anagram. Let’s only consider the rearrangement of case-insensitive letters, irrespective of other characters such as whitespace and punctuation. For example, “A decimal point” and “I’m a dot in place.” would be anagrams of each other.
To solve this problem, we can first preprocess the two input strings to filter out unwanted characters and convert the letters to lowercase. Then we can use one of the above solutions (say, the Multiset solution) to check anagrams on the processed strings:
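// Requires: import java.util.Locale;
String preprocess(String source) {
    // Drop everything except ASCII letters, then lowercase them.
    // Locale.ROOT keeps the lowercasing independent of the default locale.
    return source.replaceAll("[^a-zA-Z]", "").toLowerCase(Locale.ROOT);
}

boolean isLetterBasedAnagramMultiset(String string1, String string2) {
    return isAnagramMultiset(preprocess(string1), preprocess(string2));
}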
This approach is a general way to handle many variants of the anagram problem. For example, if we also want to include digits, we just need to adjust the preprocessing filter.
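As a sketch of that adjustment, the filter below keeps ASCII digits as well as letters. The names AlphanumericAnagramDemo, preprocessAlphanumeric and isAlphanumericAnagram are hypothetical, and the check uses the sorting approach only so that the example runs without Guava.

```java
import java.util.Arrays;
import java.util.Locale;

class AlphanumericAnagramDemo {

    // Keeps ASCII letters and digits, drops everything else, and lowercases the result.
    static String preprocessAlphanumeric(String source) {
        return source.replaceAll("[^a-zA-Z0-9]", "").toLowerCase(Locale.ROOT);
    }

    static boolean isAlphanumericAnagram(String string1, String string2) {
        char[] a1 = preprocessAlphanumeric(string1).toCharArray();
        char[] a2 = preprocessAlphanumeric(string2).toCharArray();
        // Arrays of different lengths are never equal, so no separate length check is needed.
        Arrays.sort(a1);
        Arrays.sort(a2);
        return Arrays.equals(a1, a2);
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumericAnagram("A decimal point", "I'm a dot in place.")); // true
        System.out.println(isAlphanumericAnagram("Route 66", "66 outer"));                   // true
        System.out.println(isAlphanumericAnagram("Route 66", "Route 67"));                   // false
    }
}
```

Only the regular expression changed relative to the letter-only filter, so the anagram check itself stays untouched.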
7. Conclusion
In this article, we looked at three algorithms for checking whether a given string is an anagram of another, character for character. For each solution, we discussed the trade-offs between the speed, readability, and size of memory required.
We also looked at how to adapt the algorithms to check for anagrams in the more traditional linguistic sense. We achieved this by preprocessing the inputs into lowercase letters.
