Google Gets an Upgrade

According to Google’s Grzegorz Czajkowski, many of the things people are involved in, including professional activities and personal relationships, can be represented as graphs. The company was the first to use advanced graph analysis methods like PageRank to get more from the Web.

Analysing graphs in a dynamic environment like society places new demands on computational power and scalability. Google applies the Bulk Synchronous Parallel model to scalable graph analysis through a framework known as Pregel. Pregel simplifies the calculation of PageRank and scales across clusters automatically, without requiring programmers to intervene manually. As a result, software engineers have more time to concentrate on the algorithm itself.

The Basics of PageRank

PageRank uses the random surfer model, which assumes that Web surfers follow links one after another until they lose interest or stop browsing. Each click away from the source document reduces the PageRank that is passed on. Of course, the actual process is more complex: the damping factor is typically set to 0.85, which leaves a probability of 0.15 that the surfer jumps to a random page instead of following a link.
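The random surfer model can be illustrated with a small power-iteration sketch. This is a minimal illustration and not Google's implementation: the `pagerank` function, the `toy_web` graph, and the choice to spread the rank of pages without out-links evenly over all pages are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Estimate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page first receives the "random jump" share (0.15 / n).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank equally among its out-links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page Web: A links to B and C, B to C, C back to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

In this toy graph, page C collects links from both A and B, so it ends up with the highest score, while B, which only receives half of A's rank, ends up with the lowest.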

The entire Web can be treated as a graph, where all the pages and indexable files are regarded as ‘vertices’ and the links as ‘edges.’ The vertices are usually initialised with starting values that, interestingly, have no major influence on the end result. After initialisation, Pregel runs through a series of super-steps, in each of which vertices update their values and send messages to other vertices.
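The super-step pattern described above can be sketched as a single-process simulation. This is not the Pregel API: the function `pregel_pagerank`, its parameters, and the `graph` data are hypothetical, and the sketch assumes every vertex has at least one outgoing edge.

```python
def pregel_pagerank(out_edges, supersteps=100, damping=0.85, initial=None):
    """Simulate vertex-centric PageRank in synchronous super-steps.

    out_edges maps each vertex to the vertices it links to.
    Assumption: every vertex has at least one outgoing edge.
    """
    n = len(out_edges)
    start = 1.0 / n if initial is None else initial
    value = {v: start for v in out_edges}
    inbox = {v: [] for v in out_edges}

    for step in range(supersteps):
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            # Compute phase: from the second super-step on, a vertex updates
            # its value from the messages delivered in the previous step.
            if step > 0:
                value[v] = (1.0 - damping) / n + damping * sum(inbox[v])
            # Send phase: the vertex splits its value among its neighbours.
            for target in targets:
                outbox[target].append(value[v] / len(targets))
        # Barrier: messages only become visible in the next super-step.
        inbox = outbox
    return value


# Hypothetical three-vertex graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
default_start = pregel_pagerank(graph)
other_start = pregel_pagerank(graph, initial=5.0)
print(default_start)
print(other_start)
```

Running it with two different starting values gives nearly identical results after enough super-steps, which illustrates why the initial values have little influence on the outcome.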

Related Frameworks and Methodologies

According to Bill Slawski of SEO by the Sea, there is more to Google’s infrastructure than Pregel: the company also uses other technologies such as FlumeJava and Dremel. Google uses Pregel because it is ‘expressive’ and easy to program. Software engineers have designed their own frameworks and toolkits, especially when dealing with multi-step graph operations.

Characteristics and Benefits of Dremel

Nested data

Interactive speed

Trillion-record, multi-terabyte datasets

Columnar processing and storage

Aggregation tree architecture

Spam analysis

Analysing crawled Web documents

In situ data access

Crash reporting for Google products

OCR results from Google Books

Tracking installs of Android Market apps

Resource monitoring for work run in Google’s data centres

Debugging map tiles on Google Maps
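Two of the characteristics listed above, columnar storage and the aggregation tree, can be illustrated with a toy sketch. It is not Dremel's actual design or query interface: the helper names (`to_columns`, `leaf_aggregate`, `combine`), the flat records, and the two-leaf split are assumptions for illustration, and nested data is left out entirely.

```python
def to_columns(records):
    """Pivot row-oriented records into one list of values per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def leaf_aggregate(column_chunk):
    """A leaf scans its chunk of a single column and returns (sum, count)."""
    return sum(column_chunk), len(column_chunk)


def combine(partials):
    """A parent node merges the partial (sum, count) pairs of its children."""
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total, count


# Hypothetical crash-report records.
records = [
    {"app": "maps", "crashes": 2},
    {"app": "books", "crashes": 0},
    {"app": "maps", "crashes": 5},
    {"app": "market", "crashes": 1},
]

columns = to_columns(records)
crashes = columns["crashes"]  # only this column is read; "app" is never touched

# Two leaves each scan half of the column; the root combines their results.
leaf_results = [leaf_aggregate(crashes[:2]), leaf_aggregate(crashes[2:])]
total, count = combine(leaf_results)
print(total, count)
```

The point of the layout is that an aggregate over one field reads only that field's values, and each level of the tree passes small partial results upward rather than raw records.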

FlumeJava

Engineers at Google started using FlumeJava in May 2009. It is simpler to use than MapReduce and lets developers control the optimizer and executor when necessary. Every month, hundreds of people run pipelines that process anywhere from gigabytes to petabytes of data. Google employs interchangeable tools and systems that multiple groups can use.

 

References:

http://arxiv.org/abs/1201.2261
http://arxiv.org/ftp/arxiv/papers/1201/1201.2261.pdf
http://arxiv.org/a/petrovic_d_1