PageRank for Product Image Search: Google introduces ImageRank
2008 at 13,40
published by ciaopeople
In this document I introduce and analyze Google's new algorithm for indexing and ranking images. I believe the topic is crucial, considering the increasing number of search engines, and because, coming from Google, it is a real revolution in the field of image search.
These two points are, in a way, the problem that ImageRank (IR) intends to address and resolve. The existing algorithms on which, for example, the Google Images search system is based do not read the image itself in any way. Currently, the indexing of images is based exclusively on factors (filename, alt tags, adjacent text, etc.) that are not strictly related to the content of the image.
This type of approach, obviously not exhaustive, sometimes hits the mark thanks to technological advances in web page indexing and, above all, thanks to the resources (economic and otherwise) invested in it.
After this clarification of the solutions currently used by search engines, a possible approach to the problem becomes obvious: to deliver increasingly relevant results in an image search, search engines must “get their hands” into the images themselves, determine what links the images to each other (the visual link), and create an iterative process that assigns a “numerical weight” to each image in order to determine its ranking.
This concept involves a series of far from trivial issues, which the Google team seems to have resolved. I will attempt to analyse and explain how by studying the official document published at the end of April 2008.
Objectives
1) Introduce an image ranking algorithm based on visual similarity.
2) Introduce a system that is able to re-catalogue and re-order the current Google Images index, demonstrating that, by comparing some parameters of each image, a reliable system of ranking can be derived.
3) Improve the image search results, by evaluating the project on the main searches carried out in the largest collection of existing images.
Approach & Algorithm
The principle underlying the ImageRank algorithm is eigenvector centrality. According to this theory, it is possible to measure the importance of every node in a network by giving each one a numerical weight, where the weight assigned to a node is directly proportional to the sum of the weights of the nodes connected to it.
Although this principle is very interesting, I will not get into a lecture on linear algebra. Instead I will try, from an SEO viewpoint, to correlate ImageRank with PageRank, since the latter is also based on eigenvector centrality.
For our more curious readers, I recommend the following reading material on eigenvector theory:
Eigenvector centrality (EN)
Eigenvalue, eigenvector and eigenspace (EN)
Eigenvalue, eigenvector and eigenspace (IT)
Hierarchical Analysis and comparison matrix
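For readers who prefer code to linear algebra, the definition of eigenvector centrality given above can be sketched in a few lines of Python. This is only a minimal illustration: the function name, the tiny four-node graph and the stopping tolerance are my own assumptions, not something taken from Google's document.

```python
def eigenvector_centrality(adjacency, iterations=100, tol=1e-9):
    """Approximate eigenvector centrality by power iteration.

    `adjacency` is a symmetric matrix (list of lists) where
    adjacency[i][j] is 1 if nodes i and j are connected, else 0.
    """
    n = len(adjacency)
    weights = [1.0 / n] * n  # start with equal weights
    for _ in range(iterations):
        # A node's new weight is the sum of its neighbours' weights.
        updated = [
            sum(adjacency[i][j] * weights[j] for j in range(n))
            for i in range(n)
        ]
        total = sum(updated)
        updated = [w / total for w in updated]  # keep the weights summing to 1
        if max(abs(a - b) for a, b in zip(updated, weights)) < tol:
            return updated
        weights = updated
    return weights


# Hypothetical network: nodes 0, 1, 2 form a triangle, node 3 hangs off node 0.
example_graph = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

if __name__ == "__main__":
    for node, weight in enumerate(eigenvector_centrality(example_graph)):
        print(node, round(weight, 3))
```

Node 0, which is connected to every other node, ends up with the largest weight, while node 3, whose only neighbour is node 0, ends up with the smallest.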
Moving right along with the analysis, the brief definition above can be applied to the PageRank algorithm as follows:
Each web page (node) is assigned a numerical weight (PR) as a function of the other web pages that link to it (hyperlinks) and of their numerical weights.
Understanding the definition and application of ImageRank is not as immediate. The definition of PageRank is based on hyperlinks: elements introduced by humans that represent a conceptual link between web pages and whose purpose is to let people easily find other related pages. In the case of images, what can the linking element be? Given a picture, which images are related to it?
ImageRank is based, as mentioned earlier, on the concept of visual similarity between images. Two images are connected (and therefore linked) to each other if they share points of interest, where the shared points may differ in shade or shape, or simply be taken from different perspectives, while still representing the same thing.
A reliable system for assessing image similarity is crucial to the way the algorithm is defined. There is, in this regard, a series of theories on the comparison of images in terms of the analysis of shapes, colors and perspectives. For the sake of brevity, however, on this topic as well I will limit myself to suggesting more detailed readings on the main algorithms used by ImageRank.
The above-mentioned theories are employed by ImageRank to produce a single descriptor vector, of 128 dimensions, the comparison of which determines the similarity between images according to the following definition:
Given two images and their related descriptor vectors, the similarity between the images is given by the number of points of interest they have in common divided by the average number of points of interest in the two images.
With the similarity between images defined, it should be noted that applying it to every pair of images is computationally impractical, considering the number of images currently managed by image search engines (in the billions). The algorithm is therefore applied with a “query dependent” approach: the existing metadata (name, tags, page text, etc.) is still used to reduce the number of images to be analyzed, the similarity graph is built within this subset, and only then is ImageRank applied.
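To make this definition concrete, here is a minimal sketch of the classic PageRank iteration. The damping factor of 0.85 and the treatment of pages without outgoing links are standard choices that I am assuming here; the page names and the link structure are invented purely for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute a PageRank-style weight for each page.

    `links` maps each page to the list of pages it links to.
    Every linked page must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline weight.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its weight evenly among the pages it links to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no links spreads its weight over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical mini web: a, b and d all link to c; c links back to a.
example_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}

if __name__ == "__main__":
    for page, weight in sorted(pagerank(example_links).items()):
        print(page, round(weight, 3))
```

Page c collects the most weight because three pages link to it, and page a outranks b and d because its single incoming link comes from the heavily weighted c.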
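The similarity definition above can be sketched as follows. The sketch assumes that each point of interest carries its own descriptor, and that two points "match" when their descriptors lie within a chosen Euclidean distance of each other. The matching rule, the threshold and the two-dimensional toy descriptors (instead of 128 dimensions) are my own simplifications for illustration, not the method in Google's document.

```python
import math


def count_matches(descriptors_a, descriptors_b, max_distance=0.5):
    """Count points of interest in A that have a close counterpart in B.

    Each descriptor in B can be matched at most once.
    """
    matched = 0
    used = set()
    for descriptor in descriptors_a:
        best_index, best_distance = None, max_distance
        for index, candidate in enumerate(descriptors_b):
            if index in used:
                continue
            distance = math.dist(descriptor, candidate)
            if distance <= best_distance:
                best_index, best_distance = index, distance
        if best_index is not None:
            used.add(best_index)
            matched += 1
    return matched


def similarity(descriptors_a, descriptors_b, max_distance=0.5):
    """Shared points of interest divided by the average number of points."""
    if not descriptors_a or not descriptors_b:
        return 0.0
    average_points = (len(descriptors_a) + len(descriptors_b)) / 2
    return count_matches(descriptors_a, descriptors_b, max_distance) / average_points


# Toy data: image A has three points of interest, image B has two.
image_a = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
image_b = [(0.0, 0.1), (1.0, 1.1)]

if __name__ == "__main__":
    print(similarity(image_a, image_b))
```

In this toy case two points match and the images have 2.5 points on average, so the similarity is 2 / 2.5. Dividing by the average keeps an image with a very large number of points of interest from looking similar to everything.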
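The query-dependent pipeline just described (filter by metadata, build the similarity graph on the subset, then rank) could look roughly like the sketch below. Everything in it is hypothetical: the image names, the metadata text, the precomputed similarity values, the damping factor and the column normalisation are assumptions I make in order to show the flow, not details confirmed by the source.

```python
def filter_by_metadata(images, query):
    """Keep only the names of images whose metadata text mentions the query."""
    needle = query.lower()
    return [image["name"] for image in images if needle in image["text"].lower()]


def image_rank(names, pair_similarity, damping=0.85, iterations=50):
    """Rank images by iterating over a column-normalised similarity graph."""
    n = len(names)
    if n == 0:
        return {}

    def sim(a, b):
        return pair_similarity.get(frozenset((a, b)), 0.0)

    # Total similarity "emitted" by each image, used to normalise its links.
    column_sums = {b: sum(sim(a, b) for a in names if a != b) for b in names}
    rank = {name: 1.0 / n for name in names}
    for _ in range(iterations):
        updated = {}
        for a in names:
            incoming = sum(
                sim(a, b) / column_sums[b] * rank[b]
                for b in names
                if b != a and column_sums[b] > 0
            )
            updated[a] = (1.0 - damping) / n + damping * incoming
        rank = updated
    return rank


def search(images, pair_similarity, query):
    """Filter by metadata first, then order the subset by visual rank."""
    candidates = filter_by_metadata(images, query)
    rank = image_rank(candidates, pair_similarity)
    return sorted(candidates, key=rank.get, reverse=True)


# Hypothetical index: three tower pictures and one unrelated picture.
example_images = [
    {"name": "e1", "text": "Eiffel Tower at night"},
    {"name": "e2", "text": "eiffel tower from below"},
    {"name": "e3", "text": "Paris skyline with Eiffel tower"},
    {"name": "cat1", "text": "a sleeping cat"},
]
# Hypothetical precomputed visual similarities between pairs of images.
example_similarity = {
    frozenset(("e1", "e2")): 0.8,
    frozenset(("e1", "e3")): 0.6,
    frozenset(("e2", "e3")): 0.1,
}

if __name__ == "__main__":
    print(search(example_images, example_similarity, "eiffel"))
```

The expensive visual comparison only ever touches the images that survive the metadata filter, which is the point of the query-dependent approach. Within that subset, e1 comes first because it is the image most similar to the others.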
A Complete Search System
Current search systems often return irrelevant images and, above all, since they do not analyze the contents of the images, they do not provide a good variety of results when the query is posed differently. ImageRank addresses both problems. In the case of searches with homogeneous visual concepts, a single similarity graph returns the image with the highest number of points of interest in common within the result set. In the case of heterogeneous visual concepts, where the number of concepts (and thus of similarity graphs) is not known a priori, a graph is built for each concept and the most relevant images are then retrieved from within each one.
Experimental results and conclusions
To verify the reliability of the algorithm, tests were carried out on a sample of the 2000 most popular Google Product Search queries; for each one, the first 1000 images were extracted and ImageRank was applied.
The results were submitted to 150 volunteers for evaluation, with an approach that avoids any influence of personal tastes or experiences in order to assess whether the algorithm actually improves the degree of relevance of the results.
The test results, from which the improvement brought by ImageRank is obvious, are reported in the following table:
The analysis of the test results also revealed the current limits of the algorithm, and two cases in particular where ImageRank fails. For some searches, for example searches for computers, the Dell logo appears among the more relevant images even though it is not related to the search, since it does not represent any Dell product or even a computer. The algorithm proposes the logo as a relevant image because it appears in many images of Dell products, and because a logo offers a wide range of distinctive traits that are particularly well suited to local descriptor extraction. The other case where ImageRank returns irrelevant images is when web page screenshots are present. Here too, the imprecision is caused by points of interest common to many images.
In conclusion, ImageRank hits the target in two ways: it significantly reduces the number of irrelevant images in search results, and it applies the successful PageRank pattern to image retrieval by defining a new concept of hyperlink and introducing the operational procedures for evaluating it. ImageRank also opens the way for future research into how the system behaves under adverse conditions (for example, duplicate images introduced into the results in order to manipulate the relevance of images).
Starting precisely from this point, I will try to follow the developments of the algorithm in the coming posts and to study the foundations of subsequent positioning strategies.
Diego Borrelli
SEO Strategist







































