Random Walks, Learning, and This Blog

by mike scarpati

This blog is named after the way I see my personal learning process. I was initially thinking of something like “Breadth First Learning,” but I realized that isn’t quite how it happens for me, and that someone has already written a nice article about it!

Random Walks

Wikipedia says that a random walk is “a mathematical formalization of a path that consists of a succession of random steps.” I’ve seen the concept come up primarily in the domain of graphs, where PageRank is a nice example. The math behind it is quite elegant, but I’m going to focus on the intuition.

A Real Graph to Motivate Us

It is easy to find graph datasets, but I wanted one that was large enough to be interesting, yet small enough that computation time and memory requirements wouldn't be an issue.

A good example is a network of political blogs from 2005, which appeared in: L. A. Adamic and N. Glance, “The political blogosphere and the 2004 US Election”, in Proceedings of the WWW-2005 Workshop on the Weblogging Ecosystem (2005). If you want to follow along, you can download it here.

Graphs can be visualized in a lot of different ways, but here’s one option (from the graph-tool documentation):

The nodes are sized and colored by their PageRank: big, warm-colored dots are linked to by a lot of important sites, and the small, dark dots on the outside aren’t as important.

We can see that there are two big groups. These are pretty easy to explain: one represents the conservative bloggers, and the other represents the liberal bloggers. It isn't possible to tell from this picture, but the one on the right is the left-leaning group.

To load this into R, we can use

LoadingInR.r
require(igraph)
polblogs <- read.graph("polblogs.gml", format="gml")
blognames <- get.vertex.attribute(polblogs, "label")
pageranks <- page.rank.old(polblogs, old=TRUE)

As a note, I used the page.rank.old function because I'm on Windows, and the fix for an issue with the newer implementation hasn't made it into the prebuilt binary releases yet.

If you look at the PageRanks, you’ll see that Daily Kos has the largest PageRank, and that is a liberal blog. The biggest, warmest node in the image above is therefore in the liberal cluster, which is how we know that those are the ones on the right.

What is PageRank?

The most intuitive description of PageRank revolves around the “random surfer” named Bob. Let’s say that Bob starts out by reading a randomly chosen blog. After a while, he has read everything of interest on that page, and moves on to another one. How does he decide where to go? Around 85% of the time, he clicks on one of the links on the current blog. The other 15% of the time, he randomly picks from any of the other political blogs out there.

If Bob has no life and does this for a really long time, the percent of time he spends on a given page will be its PageRank (if he spends the same amount of time on each site visit).
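
That description translates almost directly into code. Below is a quick simulation sketch (mine, not from the original post) of Bob on a tiny made-up link graph; the file name, blog names, and links are purely illustrative. The fraction of visits it prints approximates each blog's PageRank.

bobSimulation.py
import random

# Tiny made-up link graph: each blog maps to the blogs it links to
links = {
    "dailykos": ["atrios", "talkingpointsmemo"],
    "atrios": ["dailykos"],
    "talkingpointsmemo": ["dailykos", "atrios"],
    "instapundit": ["atrios", "talkingpointsmemo"],
}
blogs = list(links)

def surf(steps=100000, damping=0.85):
    # Simulate Bob: follow a link with probability 0.85, otherwise jump anywhere
    visits = dict((b, 0) for b in blogs)
    current = random.choice(blogs)
    for _ in range(steps):
        visits[current] += 1
        if links[current] and random.random() < damping:
            current = random.choice(links[current])  # click a link on the current blog
        else:
            current = random.choice(blogs)           # jump to a random political blog
    return dict((b, v / float(steps)) for b, v in visits.items())

print(surf())  # the fraction of visits to each blog approximates its PageRank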

Random Walk Similarity

PageRank is a global measure of importance, but a related algorithm can tell us about similarity between blogs. Remember that 15% of the time we navigate to a random blog. To calculate Random Walk similarity, we just change that to “navigate to a specific blog”.

Instead of following Bob the “random surfer”, we now look at Liberal Lucy, who loves Daily Kos. She always starts her morning by reading it, then goes to another blog that it links to after she’s had her fill. 15% of the time, she decides that the other blogs aren’t as good and wants to check back on Daily Kos. The other 85% of the time, she keeps clicking on the links of whatever blog she’s currently reading.

Lucy does this for a really long time, and keeps track of the number of times she visits each page. The percent of visits to each page is now the random walk similarity. Hopefully there’s an intuitive understanding of why it measures similarity: when there is a link between two pages, you’ll spend some time on the linked page. There might not be a direct link, but if a lot of the pages linked to by Daily Kos link to the same site, that site will get visited a lot, and will be designated as quite similar.
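
To make that concrete, here is the Bob sketch from above with only the teleport step changed: instead of jumping to a random blog, Lucy always goes back to her favorite. Again, the file name and toy link graph are made up for illustration.

lucySimulation.py
import random

# Same made-up link graph as the Bob sketch above
links = {
    "dailykos": ["atrios", "talkingpointsmemo"],
    "atrios": ["dailykos"],
    "talkingpointsmemo": ["dailykos", "atrios"],
    "instapundit": ["atrios", "talkingpointsmemo"],
}

def lucy(favorite="dailykos", steps=100000, damping=0.85):
    # Like Bob, except every teleport goes back to `favorite`
    visits = dict((b, 0) for b in links)
    current = favorite                                # Lucy starts her morning on Daily Kos
    for _ in range(steps):
        visits[current] += 1
        if links[current] and random.random() < damping:
            current = random.choice(links[current])   # keep clicking links
        else:
            current = favorite                        # the other 15%: back to the favorite
    return dict((b, v / float(steps)) for b, v in visits.items())

print(lucy())  # the fraction of visits to each blog is its random walk similarity to Daily Kos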

The main appeal of this notion of similarity is that it is robust. Shortest path is another common method of assessing graph similarity, but it can be “short-circuited.” Consider two blogs that have a shortest path length of 50 in some graph. Now we add a single edge between the two, and the shortest path length goes down to 1. Do we really think that bloggers choose their links so carefully that a single link can change the distance so dramatically?
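
You can poke at this short-circuit effect yourself with a few lines of networkx; the toy chain graph and the helper function here are my own illustration, not part of the post.

shortCircuit.py
import networkx as nx

# A 51-node chain: the two endpoints start out 50 hops apart
g = nx.path_graph(51)
print(nx.shortest_path_length(g, 0, 50))   # 50

def similarity_to_node_0(graph, target=50):
    # Random walk similarity via personalized PageRank, restarting at node 0
    personalization = dict((n, 1 if n == 0 else 0) for n in graph)
    return nx.pagerank(graph, personalization=personalization)[target]

before = similarity_to_node_0(g)
g.add_edge(0, 50)                          # one added edge "short-circuits" the chain
print(nx.shortest_path_length(g, 0, 50))   # 1
after = similarity_to_node_0(g)
print(before, after)                       # compare how the random walk measure reacts to the same edge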

Calculating These Measures

PageRank is usually calculated using either the power iteration method or the algebraic solution. There’s plenty of information about these on Wikipedia. The notation differs, but to calculate the similarity instead, you basically just change the teleportation (personalization) vector to a one-hot vector: all 0’s except for a 1 at the index of the blog that you want to focus on. Here’s a Python snippet with functions that do this for one blog, or for a full matrix.

randomWalks.py
from networkx import *
import numpy as np

def getPersonalizedPR(nodeID, g):
    # Teleportation vector: all zeros except a 1 at the blog we care about
    personalization = {}
    for node in g.node:
        personalization[node] = 1 if node == nodeID else 0
    return pagerank(g, personalization=personalization)

def getAllPPRs(g):
    # One row per blog: its random walk similarity to every other blog
    personalizedPageRanks = np.zeros((len(g.node), len(g.node)))
    for i in xrange(len(g.node)):
        try:
            tmp = getPersonalizedPR(i+1, g)  # node ids in the data are 1-based
            for k,v in tmp.iteritems():
                personalizedPageRanks[i,k-1] = v
        except NetworkXError:
            print "Didn't converge on node ", i+1
    return personalizedPageRanks

If you pass getAllPPRs your graph, it will return a matrix where each row is the random walk similarity for that blog. The weird ‘+1’ and ‘-1’ pieces are there because the node ids in the data are 1-based, while Python arrays are 0-based.
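
If you'd rather see the mechanics than lean on networkx, here is a rough power-iteration sketch on a small dense adjacency matrix. The function, the toy matrix, and the dangling-node handling are my own illustrative choices, not the networkx implementation.

powerIteration.py
import numpy as np

def personalized_pagerank(A, restart, damping=0.85, iters=100):
    # A[i, j] = 1 if blog i links to blog j; `restart` is the one-hot teleportation vector
    n = A.shape[0]
    T = np.zeros((n, n))
    for i in range(n):
        out_degree = A[i].sum()
        if out_degree > 0:
            T[i] = A[i] / out_degree   # spread i's probability evenly over its links
        else:
            T[i] = restart             # dangling node: send the walker to the restart vector
    r = np.ones(n) / n
    for _ in range(iters):
        r = damping * r.dot(T) + (1 - damping) * restart
    return r

# Toy example: index 0 plays the role of Daily Kos
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 0, 0]], dtype=float)
restart = np.array([1.0, 0.0, 0.0, 0.0])
print(personalized_pagerank(A, restart))   # random walk similarity of each blog to blog 0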

To see what this looks like, I removed the diagonal entries and used matplotlib's imshow to generate the following image.

Bright reds are the highest scoring, and black means a similarity of zero. It isn’t quite as clear as in the network diagram above, but you can still see the two communities. The lower-left portion of the matrix is liberal blogs being similar to other liberal blogs, and the upper-right is conservative blogs being similar to other conservative blogs. You can also see that there are a few big blogs that get a high weight from bloggers on both sides of the spectrum.
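
The post doesn't show the plotting code, but a matplotlib sketch along these lines reproduces the idea; the random matrix is just a stand-in for the output of getAllPPRs, and the 'hot' colormap is a guess at the color scheme.

plotSimilarity.py
import numpy as np
import matplotlib.pyplot as plt

sims = np.random.rand(50, 50)   # stand-in: swap in the matrix returned by getAllPPRs(g)
np.fill_diagonal(sims, 0)       # remove the diagonal so self-similarity doesn't dominate the color scale
plt.imshow(sims, cmap="hot", interpolation="nearest")
plt.colorbar(label="random walk similarity")
plt.show()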

About the Logo

We discussed the common ways of calculating these measures, but it’s also possible to explicitly conduct the random walks. In this case, the computational procedure closely resembles the explanation given above.

I decided to make some inefficient but hopefully legible R code that does something like this, then plot the path taken by Lucy from earlier.

explicitWalks.r
oneRandomWalk <- function(edges, startNode){
  # Returns a vector with the ids of visited nodes, in order.
  # `edges` is a data.table with columns src (linking blog) and dest (linked blog).
  ret <- integer(50) # preallocate 50 slots; R grows the vector if the walk runs longer
  sourceNode <- startNode
  i <- 1
  while(TRUE){
    ret[i] <- sourceNode
    potentialNodes <- edges[src==sourceNode]$dest
    if(length(potentialNodes) < 1)
      break # dead end: no outgoing links
    else if(length(potentialNodes) == 1){
      # with a single number n, sample(n, 1) would draw from 1:n instead of returning n
      sourceNode <- potentialNodes
    } else{
      sourceNode <- sample(potentialNodes, 1)
    }
    i <- i + 1
  }
  ret[ret > 0] # drop the unused preallocated slots
}

allWalks <- replicate(100, oneRandomWalk(edges, 155))

This code will run 100 random walks, starting at node 155 (Daily Kos), each returning a vector of the nodes visited, in order. I put that in a data frame, matched each visit with its previous location so that each point is (sourceNode, targetNode), and kept track of the “step number” (the number of pages already visited on this walk) and the walk number (1 to 100 above). I plotted it with ggplot2, letting the color be determined by the step number and follow a rainbow color scheme. It got too crowded to look even a little nice when I had a ton of walks on the plot, so I just included 10.

I think that it ended up being at least a little bit visually appealing, but I like it because there’s a data-related story behind it.

How does this fit in with learning?

The way I learn is a cross between the way Bob and Lucy read political blogs: a lot of the time, I’ll be working on something that will prompt me to learn about a closely related topic. I might be working on ranking for document classification under class imbalance, and start learning about information retrieval, since that’s where most of the learning-to-rank literature has been done. Other times, I’ll get an impetus to switch to something completely different. Most recently, this has been wine. I was reading something online and realized that my experience with GIS data might help figure out good places to locate a vineyard. It tends to keep me from developing super deep knowledge anywhere, but it keeps me curious and gives me a pretty broad set of skills.
