According to Google’s Grzegorz Czajkowski, many of the things people are involved in, including professional activities and personal relationships, can be represented in the form of a graph. The company was the first to use advanced graph analysis methods like PageRank to get more from the Web.
Using graphs in a dynamic environment like society raises new demands on computational power and scalability. Google applies the Bulk Synchronous Parallel model to scalable graph analysis through a framework known as Pregel. Pregel simplifies the calculation of PageRank and scales across clusters automatically, without requiring programmers to intervene manually. As a result, software engineers have more time to concentrate on the algorithm itself.
The Basics of PageRank
PageRank uses the random surfer model, which assumes that a Web surfer keeps following links from one page to the next until they lose interest or stop browsing. Each click away from the source document reduces the PageRank that is passed along. Of course, the actual process is more complex; the typical value of the PageRank damping term is 0.15, which corresponds to the surfer jumping to a random page about 15% of the time and following a link the other 85%.
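One common way to write the random surfer model as a formula is sketched below. It assumes the 0.15 figure is the probability of a random jump, so that 0.85 is the weight given to followed links. The symbols are introduced here for illustration only: N is the number of pages, In(v) is the set of pages linking to v, and L(u) is the number of outgoing links on page u.

```latex
PR(v) = \frac{0.15}{N} + 0.85 \sum_{u \in \mathrm{In}(v)} \frac{PR(u)}{L(u)}
```

Read this way, each page splits its current rank evenly among its outgoing links, and every page also receives a small fixed share that stands for surfers arriving at random.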
The entire Web can be treated as a graph, where all the pages and indexable files are regarded as ‘vertices’ and the links as ‘edges.’
The vertices are usually initialised with starting values that, interestingly, have no major influence on the end result. After initialisation, Pregel runs through a series of super-steps, in each of which vertices update their values and send messages to other vertices.
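The super-step idea can be sketched in a few lines of Python. This is a single-process simulation for illustration, not Pregel's actual API: the function name pregel_pagerank, its parameters, the fixed number of super-steps and the three-page toy graph are all assumptions made for this example, and it assumes every page has at least one outgoing link.

```python
from collections import defaultdict


def pregel_pagerank(graph, supersteps=30, damping=0.85, initial=None):
    """Simulate vertex-centric PageRank in synchronous super-steps.

    graph maps each vertex to the list of vertices it links to.
    initial optionally overrides the uniform starting values.
    """
    n = len(graph)
    values = dict(initial) if initial else {v: 1.0 / n for v in graph}
    inbox = defaultdict(list)  # messages delivered at the start of a super-step

    for step in range(supersteps + 1):
        outbox = defaultdict(list)  # messages for the next super-step
        for vertex, out_edges in graph.items():
            if step >= 1:
                # Update the vertex value from the messages it received.
                values[vertex] = (1 - damping) / n + damping * sum(inbox[vertex])
            if step < supersteps and out_edges:
                # Send an equal share of the current value along each edge.
                share = values[vertex] / len(out_edges)
                for target in out_edges:
                    outbox[target].append(share)
        inbox = outbox  # barrier: messages become visible only now
    return values


# Hypothetical three-page Web: a links to b and c, b links to c, c links to a.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
for page, rank in sorted(pregel_pagerank(toy_web).items()):
    print(page, round(rank, 4))
```

Passing a deliberately skewed initial mapping, such as all of the rank on one page, and running enough super-steps gives nearly the same ranks as the uniform start, which is the sense in which the starting values have little influence on the result.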
Related Frameworks and Methodologies
According to Bill Slawski of SEO by the Sea, there is more to Google’s toolkit than Pregel: the company also uses other technologies such as FlumeJava and Dremel. It uses Pregel because it is ‘expressive’ and easy to program.
Software engineers have designed their own frameworks and toolkits, especially when dealing with multi-step graph operations.
Characteristics and Benefits of Dremel
Trillion-record, multi-terabyte datasets
Columnar processing and storage
Aggregation tree architecture
Analysing crawled Web documents
In situ data access
Crash reporting for Google Products
OCR results from Google Books
Tracking installs of Android Market apps
Resource monitoring for work run in Google’s data centres
Debugging map tiles on Google Maps
FlumeJava has been in use at Google since May 2009. It is simpler to use than MapReduce and lets users control the executor and optimizer if necessary. Every month, hundreds of people run pipelines whose processing volumes range from gigabytes to petabytes.
Google employs interchangeable tools and systems that multiple groups can use.