Dear humanists, scientists, music-makers, and dreamers of dreams,
Stop. Please, please, please stop telling me about the power law / Pareto distribution / Zipf’s law you found in your data. I know it’s exciting, and I know power laws are sexy these days, but really the only excuse to show off your power law is if it’s followed by:
- How it helps you predict/explain something you couldn’t before.
- Why a power law is so very strange as opposed to some predicted distribution (e.g., normal, uniform, totally bodacious, etc.).
- A comparison against other power law distributions, and how parameters differ across them.
- Anything else that a large chunk of the scholarly community hasn’t already said about power laws.
Alright, I know not all of you have broken this rule, and many of you may not be familiar enough with what I’m talking about to care, so I’m going to give a quick primer here on power law and scale-free distributions (they’re the same thing). If you want to know more, read Newman’s (2005) seminal paper.
The take-home message of this rant will be that the universe counts in powers rather than linear progressions, and thus in most cases a power law is not so much surprising as it is overwhelmingly expected. Reporting power laws in your data is a bit like reporting furry ears on your puppy; often true, but not terribly useful. Going further to describe the color, shape, and floppiness of those ears might be just enough to recognize the breed.
The impetus for this rant is my having had to repeat it at nearly every conference I’ve attended over the last year, and it’s getting exhausting.
Even though the content here looks kind of mathy, anybody should be able to approach this with very minimal math experience. Even so, the content is kind of abstract, and it’s aimed mostly at those people who will eventually look for and find power laws in their data. It’s an early intervention sort of rant.
I will warn that I conflate some terms below relating to probability distributions and power laws 1 in order to ease the reader into the concept. In later posts I’ll tease apart the differences between log-normal and power laws, but for the moment let’s just consider them similar beasts.
This is actually a fairly basic and necessary concept for network analysis, which is why I’m lumping this into Networks Demystified; you need to know about this before learning about a lot of recent network analysis research.
Introducing Power Laws
One of the things I’ve been thanked for with regards to this blog is keeping math and programming out of it; I’ll try to stick to that in this rant, but you’ll (I hope) forgive the occasional small formula to help us along the way. The first one is this:
f(x) = xn
The exponent, n, is what’s called a parameter in this function. It’ll be held constant, so let’s arbitrarily set n = -2 to get:
f(x) = x-2
When n = -2, if we make x the set of integers from 1 to 10, we wind up getting a table of values that looks like this:
x | f(x) = x^(-2) -------------------- 1 | 1 2 | 0.25 3 | 0.1111 4 | 0.0625 5 | 0.04 6 | 0.0277 7 | 0.0204 8 | 0.0156 9 | 0.0123 10 | 0.01
When plotted with x values along the horizontal axis and f(x) values along the vertical axis, the result looks like this:
Follow so far? This is actually all the math you need, so if you’re not familiar with it, take another second to make sure you have it down before going forward. That curve the points make in Figure 1, starting up and to the left and dipping down quickly before it shoots to the right, is the shape that people see right before they shout, joyously and from the rooftops, “We have a power law!” It can be an exciting moment, but hold your enthusiasm for now.
There’s actually a more convenient way of looking at these data, and that’s on a log-log plot. What I’ve made above is a linear plot, so-called because the numbers on the horizontal and vertical axes increase linearly. For example, on the horizontal x axis, 1 is just as far away from 2 as 2 is from 3, and 3 is just as far away from 4 as 9 is from 10. We can transform the graph by stretching the the low numbers on the axes really wide and pressing the higher numbers closer together, such that 1 is further from 2 than 2 is from 3, 3 is further from 4 than 4 is from 5, and so forth.
If you’re not familiar with logs, that’s okay, the take-home message is that what we’re doing is not changing the data, we’re just changing the graph. We stretch and squish certain parts of the graph to make the data easier to read, and the result looks like this:
This is called a log-log plot because the horizontal and vertical axes grow logarithmically rather than linearly; the lower numbers are further apart than the higher ones. Note that with this new graph shows the same data as before but, instead of having that steep curve, the data appear on a perfectly straight diagonal line. It turns out that data which changes exponentially appears straight on a log-log plot. This is really helpful, because it allows us to eyeball any dataset thrown on a log-log plot, and if the points look like they’re on a straight line, it’s pretty likely that the best-fit line for the data is some function involving an exponent. That is, a graph that looks like Figure 2 usually has one of those equations I showed above behind it; a “power law.” We generally care about the best-fit line because it either tells us something about the underlying dataset, or it allows us to predict new values that we haven’t observed yet. Why precisely you’d want to fit a line to data is beyond the scope of this post.
To sum up: a function – f(x) = x-2 – produces a set of points along a curved line that falls rapidly (exponentially quickly, actually). That set of points is, by definition, described by a power law; a power function produced it, so it must be. When plotted on a log-log scale, power-law fitting data looks like a straight diagonal line.
When people talk about power laws, they’re usually not actually talking about simple functions like the one I showed above. In the above function ( f(x) = x-2 ), every x has a single corresponding value. For example, as we saw in the table, when x = 1, f(x) = 1. When x = 2, f(x) = 0.25. Rinse, repeat. Each x has one and only one f(x). When people invoke power laws, however, they usually say the distribution of data follows a power law.
Stats people will get angry with me for fudging some vocabulary in the interest of clarity; you should probably listen to them instead of me. Pushing forward, though, it’s worth dwelling over the subject of probability distributions. If you don’t think you’re familiar with probability distributions, you’re wrong. Let’s start with an example most of my readers will probably be familiar with:
The graph is fairly straightforward. Out of a hundred students, it looks like one student has a grade between 50 and 55, three students have a grade between 55 and 60, four students have a grade between 60 and 65, about eight students have a grade between 65 and 70, etc. The astute reader will note that a low B average is most common, with other grades decreasingly common on either side of the peak. This is indicative of a normal distribution, something I’ll touch on again momentarily.
Instead of saying there are a hundred students represented in this graph, let’s say instead that the numbers on the left represent a percentage. That is, 20% of students have a grade between 80 and 85. If we stack up the bars on top of each other, then, they would just reach 100%. And, if we were to pick a student at random from a hat (advice: always keep your students in a hat), the height of each bar corresponds to the probability that the student you picked at random would have the corresponding grade.
More concretely, when I reach into my student hat, 20% of the uncomfortably-packed students have a low B average, so there’s a 20% chance that the student I pick has a low B. There’s about a 5% chance that the student has a high A. That’s why this is a graph of probability distribution, because the height of each bar represents the probability that a particular value will be seen when data is chosen at random. Notice particularly how different the above graph is from the first one we discussed, Figure 1. For example, every x value (the grade group) corresponds to multiple other data points; that is, about 20% of the students correspond to a low B. That’s because the height of the data points, the value on the vertical axis, is a measure of the frequency of a particular value rather than just two bits of information about one data point.
The grade distribution above can be visualized much like the function from before (remember, that f(x) = x-2 bit?), take a look:
This graph and Figure 3 show the same data, except this one is broken up into individual data points. The horizontal axis is just a list of 100 students, in order of their grades, pretending that each number on that axis is their student ID number. They’re in order of their rank, which is generally standard practice. The vertical axis measures each student’s grade.
We can see near the bottom left that only four students scored below sixty, which is the same thing we saw in the degree distribution graph above. The entire point of this is to say that, when we talk about power laws, we’re talking graphs of distributions rather than graphs of data points. While figures 3 and 4 show the same data, figure 3 is the interesting one here, and we say that figure 3 is described by a normal distribution. A normal distribution looks like a bell curve, with a central value with very high frequency (say, students receiving a low B), and decreasing frequencies on either side (few high As and low Fs).
There are other datasets which present power law probability distributions. For example, we know that city population sizes tend to be distributed along a power law 2. That is, city populations are not normally distributed, with most cities having a around the same population and a few cities having some more or some less than the average. Instead, most cities have quite a small population, and very few cities have quite high populations. There are only a few New Yorks, but there are plenty of Bloomington, Indianas.
If we were to plot the probability distribution of cities – that is, essentially, the percent of some national population living in various size cities, we’d get a graph like Figure 5.
This graph should be seen as analogous to Figure 3, the probability distribution of student grades. We see that nearly 20% of cities have a population of about a thousand, another 10% of cities have populations around two thousand, about 1% of cities have a population of twenty thousand, and the tiniest fraction of cities have more than a hundred thousand residents.
If I were to reach into my giant hat filled with cities in the fictional nation graphed here (let’s call it Narnia, because it seems to be able to fit into small places), there’d be a 20% chance that the city I picked up had only a thousand residents, and a negligible chance that the city would be the hugely populated one on the far right. Figure 5 shows that cities, unlike grades, are not normally distributed but instead are distributed along some sort of power function. This can be seen more clearly when the distribution is graphed on a log-log plot like Figure 2, stretching the axes exponentially.
A straight line! A power law! Time to sing from the heavens, right? I mean hey, that’s really cool. We expect things to be normally distributed, with a lot of cities being medium-sized and fewer cities being smaller or larger on either side, and yet city sizes are distributed exponentially, where a few cities have huge populations and increasing numbers of cities hold increasingly smaller portions of the population.
Not so fast. Exponentially shrinking things appear pretty much everywhere, and it’s often far more surprising that something is normally or uniformly distributed. Power laws are normal, in that they appear everywhere. When dealing with differences of size and volume, like city populations or human wealth, the universe tends to enjoy numbers increasing exponentially rather than linearly.
Orders of Magnitude and The Universe
I hope everyone in the whole world has seen the phenomenal 1977 video, Powers of Ten. It’s a great perspective on our place in the universe, zooming out from a picnic to the known universe in about five minutes before zooming back down to DNA and protons. Every ten seconds the video features a zooming level another of order of magnitude smaller or greater, going from ten to a hundred to a thousand meters. This is by no means the same as a power law (it’s the difference of 10x rather than x10), but the point in either case is that understanding the scale of the universe is a lot easier in terms of exponents than in terms of addition or multiplication.
Zipf’s law is pretty cool, guys. It says that most languages are dominated by the use of a few words over and over again, with more rare words exponentially less frequently used. The top-most ranking word in English, “the,” is used twice as much second-most-frequent word, “of,” which itself is used nearly twice as much as “and.” In fact, in one particularly large corpus, only 135 words comprise half of all words used. That is, if we took a favorite book and removed all but the top hundred words used, half the book would still exist. This law holds true for most languages. Power law!
The Long Tail
A few years ago, Wired editor Chris Anderson wrote an article (then a blog, then a book) called “The Long Tail,” which basically talked about the business model of internet giants like Amazon. Because Amazon serves such a wide audience, it can afford to carry books that almost nobody reads, those niche market books that appeal solely to underwater basket weavers or space taxidermists or Twilight readers (that’s a niche market, right? People aren’t really reading those books?).
Local booksellers could never stock those books, because the cost of getting them and storing them would overwhelm the number of people in the general vicinity who would actually buy them. However, according to Anderson, because Amazon’s storage space and reach is nigh-infinite, having these niche market books for sale actually pays off tremendously.
Take a look at Figure 7. It’s not actually visualizing a distribution like in Figure 5; it’s more like Figure 4 with the students and the grades. Pretend each tick on the horizontal axis is a different book, and their heights corresponds to how many times each book is bought. They’re ordered by rank, so the bestselling books are over at the left, and as you go further to the right of the graph, the books clearly don’t do as well.
One feature worth noting is that, in the graph above, the area of green is equal to the area of yellow. That means a few best-sellers comprise 50% of the books bought on Amazon. A handful of books dominate the market, and they’re the green side of Figure 7.
However, that means the other 50% of the market is books that are rarely purchased. Because Amazon is able to store and sell books from that yellow section, it can double its sales. Anderson popularized the term “long tail” as that yellow part of the graph. To recap, we now have cities, languages, and markets following power laws.
The Pareto Principle, also called the 80/20 rule, is brought up in an absurd amount of contexts (often improperly). It says that 80% of land in Italy is held by 20% of the population; 20% of pea pods contain 80% of the peas; the richest 20% control 80% of the income (as of 1989…); 80% of complaints in business come from 20% of the clients; the list goes on. Looking at Figure 7, it’s easy to see how such a principle plays out.
Yay! We finally get to the networks part of this Networks Demystified post. Unfortunately it’ll be rather small, but parts 4 and 5 of Networks Demystified will be about Small World and Scale Free networks more extensively, and this post sort of has to come first.
If you’ve read Parts I and II of Networks Demystified, then you already know the basics. Here’s what you need in brief: Networks are stuff and relationships between them. I like to call the stuff nodes and the relationships edges. A node’s degree is how many times it’s connected to an edge. Read the beginning of Demystified Part II for more detail, but that should be all you need to know for the next example. It turns out that the node degree distribution of many networks follows a power law (surprise!), and scholarly citation networks are no exception.
If you look at scholarly articles, you can pretend each paper is a node and each citation is an edge going from the citing to the cited article. In a turn of events that will shock no-one, some papers are cited very heavily while most are, if they’re lucky, cited once or twice. A few superstar papers attract the majority of citations. In network terms, a very few nodes have a huge degree – are attached to many edges – and the amount of edges per node decreases exponentially as you get to less popular papers. Think of it like Figure 5, where cities with exponentially higher population (horizontal axis) are exponentially more rare (vertical axis). It turns out this law holds true for almost all citation networks.
The concept of preferential attachment is very old and was independently discovered by many, although the term itself is recent. I’ll get into the history of it and relevant citations in my post on scale-free networks, what matters here is the effect itself. In a network, the idea goes, nodes that are already linked to many others will be more likely to collect even more links. The rich get richer. In terms of citation networks, the mechanism by which nodes preferentially attach to other nodes is fairly simple to discern.
A few papers, shortly after publication, happen to get a lot of people very excited; the rest of the papers published at the same time are largely ignored. Those few initial papers are cited within months of their publication and then researchers come across the new papers in which the original exciting papers were cited. That is, papers that are heavily cited have a greater chance of being noticed, which in turn increases their chances of being even more heavily cited as time goes on. The rich get richer.
This is also an example of a positive feedback loop, a point which I’ll touch on in some greater detail in the next section. This preferential attachment appears in all sorts of evolving systems, from social networks to the world wide web. And, in each of these cases, power laws appear present in the networks.
Wikipedia defines positive feedback as “a process in which the effects of a small disturbance on a system include an increase in the magnitude of the perturbation.” We’ve all heard what happens when somebody takes a microphone too close to the speaker it’s projecting on. The speaker picks up some random signal from the mic and projects it back into the mic, which is then re-amplified through the speaker, picked up again by the mic, and so on until everyone in the room has gone deaf or the speaker explodes.
Wikipedia seemingly-innocuously adds “[w]hen there is more positive feedback than there are stabilizing tendencies, there will usually be exponential growth of any oscillations or divergences from equilibrium.” In short, feedback tends to lead to exponential changes. And systems that are not sufficiently stabilized (hint: most of them aren’t) will almost inevitably fall into positive feedback loops, which will manifest as (you guessed it) power laws.
Once more with the chorus: power laws.
By this point I’m sure you all believe me about the ubiquity (and thus impotence) of power laws. Given that, what I’m about to say shouldn’t surprise you but, if you haven’t heard of it before, I promise you it will still come as a shock. It’s called Benford’s Law, and what it says is equal parts phenomenal and absurd.
Consult your local post office records (it’s okay, I’ll wait) and find a list of every street address in your county. Got it? Good. Now, get rid of everything but the street numbers; that is, if one address reads “1600 Pennsylvania Avenue, Northwest Washington, DC 20500,” get rid of everything but the “1600.” We’re actually going to go even further, get rid of everything but the first digit of the street address.
Now, you’re left with a long list of single-digits that used to be the first digits of street addresses. If you were to count all of those digits, you would find that digit “1” is used significantly more frequently than digit “2.” Further, digit “2” appears significantly more frequently than digit “3.” This trend continues down to digit “9.”
That’s odd, but not terribly shocking. It gets more shocking when you find out that the same process (looking at the first digits of numbers and seeing that 1 is used more than 2, 2 more than 3, 3 more than 4, etc.) holds true for the area of rivers, physical constants, socio-economic data, the heights of skyscrapers, death rates, numbers picked at random from issues of Readers’ Digest, etc., etc., etc. In an absurdly wide range of cases, the probability distribution of leading digits in lists of numbers shows lower digits appear exponentially more frequently than higher digits. In fact, when looking at the heights of the world’s tallest buildings, it holds true no matter the scale; that is, if heights are measured in yards, miles, kilometers, or furlongs, Benford’s law still holds.
So what’s going on here? It turns out that if you take the logarithm of all the numbers in these sets, the numbers turn out to be uniformly distributed 3. I won’t unpack logs here (I’ve already gone on too long, haven’t I?), but this basically means that if you take into account powers and exponents, all of these numbers are actually uniformly distributed. The universe tends to filter random numbers into exponential equations before we have a chance to count them, but once we do count them, all we see are power laws.
About four thousand words ago, I wrote that reporting power laws in your data is about as useful as reporting furry ears on your dog. That neither means that power laws are useless nor that they should never be reported. It’s just that they’re not particularly exciting by themselves. If you’ve found a power law in your data, that’s great! Now comes the time when you have to actually describe the line that you’ve found. Finding the parameters for the power law and comparing them against other similar datasets can be exceptionally enlightening, as would using the trends found for prediction or comparison.
Too often (at least once in every conference I’ve attended in the last year), I see people mentioning that their data followed a power law and leaving it at that. Oy vey. Alright, now that that’s out of the way, I promise the next Networks Demystified post will actually be about networks again.
- I opt not to use words like “mass,” “log-normal,” or “exponential,” but don’t worry, I have a rant about people confusing those as well. It’ll come out soon enough. ↩
- Actually it’s log-normal, but it’s close enough for our purposes here. ↩
- The log-uniform distribution is actually a special case of a power law with a parameter of -1. Nifty, huh? ↩