Jun 16

In this document I introduce and analyze Google's new algorithm for indexing and ranking images. I believe the topic is crucial, given the growing number of search engines, and because, coming from Google, the algorithm is a real revolution in the field of image search.
These two points are, in a way, the problem that Image Rank (IR) intends to address and resolve. The existing algorithms on which, for example, the Google Image search system is based do not involve reading the image in any way. Currently, image indexing is based exclusively on factors (filename, alt tags, adjacent text, etc.) that are not strictly related to the image itself.
This type of approach, obviously not exhaustive, sometimes hits the mark thanks to advances in web page indexing technology and, above all, because it benefits from a large investment of resources (economic and otherwise).
After this clarification of the solutions currently used by search engines, a possible approach to the problem becomes obvious: to deliver increasingly relevant results in an image search, search engines must “get their hands” into the images themselves, determine what links the images to each other (a visual link), and create an iterative process that assigns a “numerical weight” to each image in order to determine its ranking.

This concept raises a series of far from trivial issues, which the Google team seems to have resolved. I will attempt to analyse and explain how by studying the official document published at the end of April 2008.

Objectives
1) Introduce an image ranking algorithm based on visual similarity.
2) Introduce a system that is able to re-catalogue and re-order the current Google Images index, demonstrating that a reliable ranking system can be derived by comparing certain parameters of each image.
3) Improve image search results, evaluating the project on the main searches carried out on the largest existing collection of images.

Approach & Algorithm
The principle underlying the image PageRank algorithm is eigenvector centrality. According to this theory, the importance of all the nodes in a network can be measured by giving each one a numerical weight, where the weight assigned to each node is directly proportional to the sum of the weights of the nodes connected to it.
Although this principle is very interesting, I will not get into a lecture on linear algebra; instead I will try, from an SEO viewpoint, to correlate Visual PageRank with PageRank, since the latter is also based on eigenvector centrality.
For more curious readers, I recommend the following reading material on eigenvector theory:
Eigenvector centrality (EN)
Eigenvalue, eigenvector and eigenspace (EN)

Eigenvalue, eigenvector and eigenspace (IT)
Hierarchical Analysis and comparison matrix
Moving on with the analysis, the brief definition above can be applied to the PageRank algorithm as follows:
Each web page (node) is assigned a numerical weight (PR) as a function of the other web pages that link to it (hyperlinks) and of their numerical weights.
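To make the definition concrete, here is a minimal sketch of PageRank computed by repeated redistribution of weights (power iteration). It is an illustration only, not Google's implementation: the function name `pagerank`, the damping factor of 0.85, the iteration count and the three-page example are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Toy PageRank. `links` maps each page to the list of pages it links to.

    Assumption: pages without outgoing links spread their weight evenly
    over all pages, which is one common convention.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline weight.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its weight on to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical mini web: A and B both link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, C ends up with the highest weight because two pages link to it, A comes second because it is linked from the heavily weighted C, and B, which nobody links to, keeps only the baseline weight.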
Understanding the definition and application of Image Rank is less immediate. The definition of PageRank is based on hyperlinks, elements introduced by humans that represent a conceptual link between web pages and whose purpose is to help people easily find other related pages. In the case of images, what can the linking element be? Given a picture, which images are related to it?
ImageRank is based, as mentioned earlier, on the concept of visual similarity between images. Two images are connected (and therefore linked) to each other if they share points of interest, where those points may differ in shade or shape, or simply be taken from different perspectives, while still representing the same thing.
A reliable system for assessing image similarity is crucial to the way the algorithm is defined. There is a series of theories on image comparison, covering the analysis of shapes, colors and perspectives. For the sake of brevity, however, on this topic too I will limit myself to suggesting more detailed readings on the main algorithms used by ImageRank:

SIFT, Scale Invariant Feature Transform (EN)
A performance evaluation of local descriptors (EN)
Difference of Gaussians (EN)

The theories mentioned above are employed by ImageRank to produce descriptor vectors of 128 dimensions, one for each point of interest, whose comparison determines the similarity between images according to the following definition:
Given two images and their related descriptor vectors, the similarity between the images is the number of points of interest they have in common divided by the average number of points of interest in the two images.
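The definition above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the descriptors are tiny 2-dimensional tuples rather than real 128-dimensional vectors, and the matching rule (a greedy nearest-neighbour search with a hypothetical `max_distance` threshold) is my own stand-in, not necessarily the matching procedure used by Google.

```python
import math


def count_matches(desc_a, desc_b, max_distance=0.5):
    """Count points of interest in common between two descriptor lists.

    Assumption: two descriptors match when their Euclidean distance is at
    most `max_distance`; each descriptor in desc_b is used at most once.
    """
    used = set()
    matches = 0
    for a in desc_a:
        best_index, best_distance = None, max_distance
        for index, b in enumerate(desc_b):
            if index in used:
                continue
            distance = math.dist(a, b)
            if distance <= best_distance:
                best_index, best_distance = index, distance
        if best_index is not None:
            used.add(best_index)
            matches += 1
    return matches


def similarity(desc_a, desc_b, max_distance=0.5):
    """Shared points of interest divided by the average number of points."""
    if not desc_a or not desc_b:
        return 0.0
    shared = count_matches(desc_a, desc_b, max_distance)
    average = (len(desc_a) + len(desc_b)) / 2
    return shared / average


# Hypothetical descriptors: image_a has 3 points of interest, image_b has 2.
image_a = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
image_b = [(0.1, 0.0), (1.0, 1.2)]
score = similarity(image_a, image_b)
print(score)
```

Here two points of interest match, and the two images have 2.5 points on average, so the similarity is 2 / 2.5 = 0.8. An image compared with itself scores 1.0.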
Once similarity has been defined, it should be noted that applying it to every image is computationally impractical, given the number of images currently managed by image search engines (in the billions). The algorithm is therefore applied with a “query dependent” approach: existing metadata (name, tags, page text, etc.) is still used to reduce the number of images to be analyzed, the similarity graph is built within this subset, and only then is ImageRank applied.
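The three steps of the query-dependent approach (filter by metadata, build the similarity graph, rank) can be sketched as follows. Everything here is hypothetical and simplified: the catalogue, the file names, the `features` sets of integer IDs standing in for matched points of interest, and the damping factor of 0.85 are assumptions made for illustration, not details taken from Google's document.

```python
def candidates(catalogue, query):
    """Step 1: use existing metadata to shrink the set of images."""
    needle = query.lower()
    return [name for name, info in catalogue.items() if needle in info["text"].lower()]


def similarity(features_a, features_b):
    """Shared points of interest divided by the average number of points."""
    if not features_a or not features_b:
        return 0.0
    return len(features_a & features_b) / ((len(features_a) + len(features_b)) / 2)


def rank_images(catalogue, query, damping=0.85, iterations=50):
    names = candidates(catalogue, query)
    n = len(names)
    if n == 0:
        return []
    # Step 2: build the similarity graph on the candidate subset only.
    weights = {
        a: {b: similarity(catalogue[a]["features"], catalogue[b]["features"])
            for b in names if b != a}
        for a in names
    }
    # Step 3: iterate, letting each image pass weight to the images it resembles.
    score = {name: 1.0 / n for name in names}
    for _ in range(iterations):
        new_score = {name: (1.0 - damping) / n for name in names}
        for a in names:
            total = sum(weights[a].values())
            if total == 0:
                # An image similar to nothing spreads its weight evenly.
                for b in names:
                    new_score[b] += damping * score[a] / n
            else:
                for b, weight in weights[a].items():
                    new_score[b] += damping * score[a] * weight / total
        score = new_score
    return sorted(names, key=lambda name: score[name], reverse=True)


# Hypothetical catalogue: metadata text plus feature IDs per image.
catalogue = {
    "mona1.jpg": {"text": "Mona Lisa painting", "features": {1, 2, 3, 4}},
    "mona2.jpg": {"text": "mona lisa at the Louvre", "features": {1, 2, 3, 5}},
    "mona3.jpg": {"text": "Mona Lisa poster", "features": {2, 3, 4, 6}},
    "mona_cafe.jpg": {"text": "Mona Lisa cafe sign", "features": {7, 8}},
    "cat.jpg": {"text": "a cat", "features": {1, 2, 3, 4}},
}
ranking = rank_images(catalogue, "mona lisa")
print(ranking)
```

In this example the cat image never enters the graph because its metadata does not match the query, while the cafe sign matches the text but shares no visual features with the other candidates and therefore falls to the bottom of the ranking.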
A Complete Search System
Current search systems often return irrelevant images in the results and, above all, since they do not analyze the contents of the images, they do not provide a good variety of results when the query is posed differently. ImageRank addresses both shortcomings. For searches with homogeneous visual concepts, a single similarity graph returns the images with the highest number of points of interest in common within the result set. For searches with heterogeneous visual concepts, an undefined number of concepts, and thus of similarity graphs, is involved, and the most relevant images are then retrieved within each graph.
 
Experimental results and conclusions

To verify the reliability of the algorithm, tests were carried out on a sample of 2000 queries, the most popular queries of Google Product; for each one, the first 1000 images were extracted and ImageRank was applied.
The results were submitted to 150 volunteers for evaluation, using an approach designed to avoid any influence of personal tastes or experiences, in order to assess whether the algorithm actually improves the relevance of the results.
The test results, from which the improvement brought by ImageRank is obvious, are reported in the following table:

  

The analysis of the test results also revealed the current limits of the algorithm, and two cases in particular where ImageRank fails. For some searches, for example “of computers”, the Dell logo appears among the most relevant images even though it is unrelated to the search, as it does not represent any Dell product or even a computer. The algorithm proposes the logo as a relevant image because it appears in many images related to Dell products, and because the logo offers a wide range of distinctive traits that are particularly well suited to local descriptor extraction. The other case where ImageRank returns irrelevant images is when web page screenshots are present. Here too, the imprecision is caused by points of interest common to many images.

In conclusion, Image Rank hits the target both by significantly reducing the number of irrelevant images in search results and by applying the successful pattern of Page Rank to image retrieval, defining a new concept of hyperlink and introducing the operational procedures for evaluating it. Image Rank also opens the way for future research into how the system behaves under adverse conditions (for example, duplicate images introduced into the results in order to manipulate the relevance of images).
Starting precisely from this point, in the coming posts I will try to follow the developments of the algorithm and to study the foundations of the positioning strategies that follow from it.

Diego Borrelli
SEO Strategist
