Finding Similar Images with LS-Hash


For our semester-long project in Algorithms (CSCI 3383) with Prof. Bento Ayres Periera, we were given the following task:

  • You have to develop an algorithm that given a query image finds the “closest” entries to it on a dataset of images
  • Your solution will be evaluated based on how its run-time scales with the size of the input dataset, the number of input queries and the value of K. It will also be evaluated on how its memory usage scale with the input.
  • Finally, your solution will also be evaluated based on its accuracy to solve a classification. Your solution will also be evaluated based on its accuracy to solve a classification task. More specifically, we will run your final code on a folder with a dataset of images and another folder with a several query images. The images will be 2D arrays of 1s and 0s. We will run your code on these input folders for a value of K = 9. These images fall into several category. We know the categories in which each image falls but you do not. We will read the output file produced by your program where you stored the K files closest to each query. We will read it line by line. If the category that most frequently appears on these K files is the same category as the input file, then you get one point. Otherwise you get zero points. This will produce a total score. We will then run your code, just like described above, also for K = 1, 3, 5, 7. Your final accuracy score will the highest score you get among the values of K tested.

To solve this, we took a hybrid approach incorporating elements of Machine Learning and Computer Vision in which we extracted features from the images, and 'clustered' them using Locality-Sensitive Hashing. On GitHub here.

Deciding on an Approach

While getting more familiar with different aspects of this project, we debated many different solutions, before landing on our final implementation. The first of which included comparing every query image to every database image, and maintaining a list of each comparison along with the score associated with that comparison. We planned to do a number of operations to the database image, such as rotations and horizontal and vertical flips, in each comparison to account for potential changes to the image data.

After abandoning this idea because it would require too many passes on the data and would not scale well for multiple query images, we moved more towards a computer vision approach, which relied on extracting important features from the image. After deciding on feature extraction, we debated how to plot each image in the feature space and look for nearby neighbors. We looked at methods from Machine Learning to best understand how to do this, and a naive implementation of the k-Nearest Neighbors algorithm immediately come to mind. k-NN works very well in lower dimensional spaces and when time and scare are not the most immediate concern. Unfortunately for us, time was of the essence and we needed to find an algorithm that could scale and perform quicker than O(n*d). Even though we were getting good results from just a few simple features when we tried k-NN, we decided to search for a faster alternative that would scale efficiently.

Through more research, we found two new potential solutions to our problem: a k-d tree and a locality-sensitive hash (LSH). We began to work with both and found that although neither would be retrieving results as accurately as the naive k-NN, we could still perform quite well and much quicker. The ultimate decision then came down to a matter of speed, and even though the performance of the k-d tree and LSH seemed about the same (empirically) on simple features, we went with the LSH method– capable of indexing and querying in higher dimensions with O(1) retrieval– because we were not sure how large our feature vector would become.

Preprocessing the Images

We took a number steps to preprocess the images before generating a feature vector so that the features extracted would be relevant and normalized. First, we de-noised the image to remove small groups of pixels that were surrounded by the opposite value: For each pixel p in the image, we find a 4x4 matrix of pixels from the original image data such that this 4x4's upper left-hand pixel is p. When near an edge or corner, the data used to generate this mini-matrix would wrap around the edges of the original image. Once each 4x4 was extracted, we would check the outside 12 pixels, verify that they were all the same color, and if so, force that the inner four pixels too became this color. Our algorithm also counted the number of instances where the outside 12 pixels were either all 0, or all 1 and after going through the entire image, the most prominent of the two counts would tell us with high likelihood what the background color of the image was. Below is the code for the denoise operation.

Following the denoise function we rotated the image around a major axis.  The major axis was found by placing random lines through the image, and seeing which line hit the most points in the shape. After finding this line, we calculated the angle between this line and the X axis, using the inverse tangent. As long as there was a significant rotation that needed to be made (more than 5 degrees and less and 85 degrees) then the array was rotated using scipy.ndimage.interpolation. We had implemented a deskew function, but since this is only really important in MNIST images, we decided to just use the rotation function because it would work on all different types of shapes, and not just hand-drawn digits.

The final preprocessing step we took was the creation of the bounding box, which would start from the top-left and bottom-right corners of the image and work its way diagonally checking each column and row of image for anything other than the background color. If it finds this, then it stops searching in that direction and return the appropriate dimension from that side. Once the searchers return, the image is then resized to the new dimensions plus padding to make sure we have a little room to work with. The bounding box helped give a better idea of the dimensions of the shape, which was used later. This meant that the dimensions of a longer skinnier shape, like a 1 would be very different from a wider more square shape, like the image of a 5.

Feature Extraction

Now that we were working with clean and normalized images, we moved to feature extraction to generate the feature vector for each image. The first features we calculate deal with pixel counts: foreground pixels, horizontal symmetry, vertical symmetry, number of shape pixels in the right half, and number of shape pixels in the left half of the image. All of these features were extracted on two sweeps which went through half of the pixels in the image, which equates to one sweep through the entire image. Each of these feature values were saved as values in range [0.0, 1.0], in order to normalize the entire feature vector, but also to normalize the values across images that might be different sizes after the bounding box function. We decided on these features because they can tell a lot about the area that a shape encompasses.

Next we also used the FAST corner detection to find the number of corners and the position of the corners in the shape. We then used the position of the corners to find the centroid - since we could not approximate the number of corners each image had, we needed a generalized way to utilize corners and make sure our feature vector was the same standard length. We had implemented a Harris Corner Detection algorithm to find the number of corners, but decided to actually use the FAST implementation because it was much faster and more accurate than our own implementation. Below are a number of methods we used to generate a feature vector for each image.

Locality-Sensitive Hashing (LSH)

After feature extraction, we then used a pure python implementation of LSH to hash on the feature vector for each image. Where as a typical hash function will for instance generate very different values to the two function calls naiveHashMap.put([.90, .82, .12]) and naiveHashMap.put([.92, .82, .11]), a locality-sensitve hash would place these two vectors into very nearby buckets which makes it a valuable structure for Machine Learning-type problems. We tried many different combination of features, as well as different scoring techniques, but decided that normalizing the features to all be a percentage (less than 1) gave us the most accurate results. After hashing all of the database images, and getting the feature vector for the query image, we used LSH.query([query image feature vector], k) method which returns an array of feature vectors the size k (if there are at least k neighbors).

While parsing the database images, we had saved both the images file name, and the images feature vector into two separate lists. After getting all of the feature vectors for the k nearest neighbors, we used this map to find the index of this feature vector, and therefore used this index to find the actual image file name.

In some extreme cases, the LSH query function returned less than k neighbors. In order to get around this and find exactly k neighbors, we then used whatever neighbors it did find, and found their closest neighbors. In the extremely rare case that a query image returned no neighbors, we decided to just print this query image k times. We decided this was the best course of action because there were no other neighbors to look at, and we really did not want to implement a slower algorithm which could give us the correct number of neighbors.

Generalization & Runtimes

While we were working with the MNIST data set for some time, we really tried to focus our approach on a scalable implementation, that could work for other data sets as well. That is why we took so much time to rotate the image, and find the bounding box, and use percentages throughout the entire feature vector. It means that our data was much more normalized and could find similar images that were simple scalers of one another, or that were rotations of one another. That is also why we decided to use the LSH function, because we decided that it was more important to scale time-wise, than it was to get the exact nearest neighbors.

Empirical Runtime:

  • 14.46s - Average Preprocessing time for 1000 db images
  • 16.5676s - Average NN search time for 1000 db images and 1000 query images

Before finding the NN for each query image, we preprocesses the image then run neighbors. This means that, in effect, the NN search time also includes the time it takes to preprocess each of the query images. Going by the average preprocessing time, this means that query time only took approximately 2.1076s for each of the query images. This means that our LSH will efficiently scale for even larger data sets.

This project was done by Danielle Nash, Ryan Reede and Drew Hoo.

Code Contributions