VOOZH about

URL: https://towardsdatascience.com/side-by-side-comparison-of-strings-in-python-b9491ac858/

⇱ Side-by-side comparison of strings in Python | Towards Data Science


Side-by-side comparison of strings in Python

Implementation of a tool to compare texts side-by-side with Python enabling a better overview of differences

7 min read
👁 Photo by Vanessa Giaconi (source: Unsplash)
Photo by Vanessa Giaconi (source: Unsplash)

Currently I am working on a privacy filter for text in Python. During development I ran into a problem I face often; how to quickly compare two strings and evaluate the difference easy. This time, it is time to solve this problem.

The difflib library

Some internet research quickly revealed the existence of the difflib module. This default module contains several helpers for comparing sequences, like arrays and strings. All algorithms for comparing sequences is available with a few lines of code. For a brief moment, the idea of implementing my own algorithm crosses my mind, but fades away quickly.

The base of the difflib module is the class SequenceMatcher. This class implements the algorithm that compares sequences. It starts by finding the largest common sequences in both input sequences and keeps performing this task recursively on the other parts until no sequences are left. Due to its recursive nature, execution time can become quadratic, depending on the amount of differences in the sequences. For our goal, this performance is acceptable. It will not be used to compare books of texts, but only a few lines at a time.

Finding matching sequences

In order to match sequences we first have to create an instance of the SequenceMatcher class. The constructor takes the two sequences to match as parameters a and b. The parameters for junk detection are not used. When passing strings as arguments, a string is seen as a sequence of characters. The matcher will look for matching sequences of characters.

The SequenceMatcher class has several ways of returning the matching results, like finding the longest match, returning operations to transform a into b and to return all found matches. We will be using the latter:

The get_matching_blocks() method returns an iterable list of all matches, ordered by their appearance from the start of the sequence. Each match consists of the location of the match in sequence a, the location of the match in sequence b and the length of the match. In the example three matches are found; "It is ", " warm. It is ", "er.". The fourth returned match is the end of the sequences.

So we have found all matching sequences in the strings. But for comparing texts, the last match "er." only matches a part of a word. I prefer to match only whole words. Since the matcher takes sequences (strings are sequences of characters in Python) we should be adding words to the matcher. So let’s see what happens if we tokenize the string before executing the matcher:

The tokenize method is a very simple, rude implementation by splitting the string at the spaces (‘ ‘). The results are promising, only the sequences "It is" and "warm. It is" are returned. The partial word match is no longer found. Testing a few lines of text quickly pointed out the limited tokenization by only splitting at spaces. It fails at newlines and tabs so to step up the game a little the split() is replaced by a regular expression that splits the text at all whitespaces:

This gives the same results as the previous implementation but is more robust with other whitesplaces than the space itself.

Equalizing the strings

The matcher is telling is us which sequences match and thus also which do not match. With this information it is possible to make both sequences equal, adding the missing parts from the first sequence to the second sequence and vice versa.

Looking at the example above, we see that the first matching sequence starts at position 0 and has al length of 2. The second matching sequence starts at position 2 for sequence a and at position 3 for sequence b. So sequence b contains a part that is not in sequence a. We need to add a element to the sequence of a with the same length as the element in sequence b.

To determine if an element is missing in the matched sequences, we look at the starting position of a match, add the length of the matched sequence and determine if the next sequence starts at the sum of both. In the example below, the second match starts at position 0 + 3 for both sequences, the second match starts at position 4 for the first sequence and at position 3 for the second sequence. Since 0 + 3 equals 3, the matching sequence ‘touch’ each other in the second sequence and the first sequence contains an additional element between both matched sequences. This means, an additional element much be added to the second sequeuce.

👁 Image by author
Image by author

There are three possibilities: The first sequence contains one or more additional elements, the second string contaoins additional elements or both string contains additional elements. In the latter case, there is amismatch in sequences, e.g. the words ‘summer’ and ‘winter’ in our first example.

We are going to equalize the sequence by adding additional elements in places where the other sequence has unmatched elements. These additional elements will be given the same length as the unmatched element and will conist of underscores:

The method iterates over the found matches (line 15) and for each match checks if it is adjacent to the previous match. If this is not the case, elements are added to the other sequence to create a match (line 16–23). After adding missing elements, both sequences are extended with the matched sequence. Before returning the values, they are untokenized.

The word ‘not’ that is not in the first string, is added as underscores to the first string. The different words ‘summer.’ and ‘winter.’ are both added to the other string. Seeing the result above eachother already gives a nice overview of the differences between both strings.

Visualizing the differences

So now we have the equalised strings and we need a convenient way to visualize the comparison. For single sentences printing them above one another is an excellent solution. For longer texts, a side-by-side comparison is handier.

For a side-by-side comparison, we need to break the strings in chunks with a limited with, e.g. 40 characters. We want to break in the middle of words, so an acceptable window is added of e.g. 10 characters, meaning we will start searching for a space after 40 characters (line 5) and go back a maximum of 10 to find the first space smaller than and closest to the 40 characters (line 7–8):

The function returns a list of strings, none longer than 40. Since equalizing the strings has made them equal in length, we can use this method to break both string in junks and these junks will always contain the same words on the same line. They are easy to compare. We can print the strings including the underscores are with removed underscores. By removing the underscores after breaking then text up, we guarantee that the same words in both texts end up in the same line:

The comparison() function takes the two texts to compare as parameter, followed by the optional width for the side-by-side display, the window for finding a space from this width and wether to display the texts side-by-side or above each other.

The complete code:

Running an example with all possible visualizations:

I hope this side-by-side comparison method can help you as much as it helps me. Finding differences in texts is much easier with this comparison where it is truly helpful that the same text left and right appear on the same line. For my privacy filter, length differences become quite large making comparisons awkward when equal text parts are shown with multiple lines in between.

I hope you enjoyed this article. For inspiration check some of my other articles:

Disclaimer: The views and opinions included in this article belong only to the author.


Written By

Leo van der Meulen

Towards Data Science is a community publication. Submit your insights to reach our global audience and earn through the TDS Author Payment Program.

Write for TDS

Related Articles