Statistically Accurate Ratings

In a previous post on ratings I noted some issues with using the mean vs the media for the rating. A few days ago Jeff Atwood posted on user ratings and specifically how to sort a set of items that are rated by users. Jeff’s post included material from Evan Miller’s article, How Not To Sort By Average Rating, on the same subject.

Below is the relevant portion of Evan Miller’s article:

CORRECT SOLUTION: Score = Lower bound of Wilson score confidence interval for a Bernoulli parameter

Say what: We need to balance the proportion of positive ratings with the uncertainty of a small number of observations. Fortunately, the math for this was worked out in 1927 by Edwin B. Wilson. What we want to ask is: Given the ratings I have, there is a 95% chance that the “real” fraction of positive ratings is at least what? Wilson gives the answer. Considering only positive and negative ratings (i.e. not a 5-star scale), the lower bound on the proportion of positive ratings is given by:

wilson confidence interval formula

(For a lower bound use minus where it says plus/minus.) Here p is the observed fraction of positive ratings, zα/2 is the (1-α/2) quantile of the standard normal distribution, and n is the total number of ratings. The same formula implemented in Ruby:

require ‘statistics2’

def ci_lower_bound(pos, n, power)
if n == 0
return 0
z = Statistics2.pnormaldist(1-power/2)
phat = 1.0*pos/n
(phat + z*z/(2*n) – z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)

pos is the number of positive rating, n is the total number of ratings, and power refers to the statistical power: pick 0.10 to have a 95% chance that your lower bound is correct, 0.05 to have a 97.5% chance, etc.

Now for any item that has a bunch of positive and negative ratings, use that function to arrive at a score appropriate for sorting on, and be confident that you are using a good algorithm for doing so.

Sadly everybody has simply quoted the mathematical forumla and not even given links to the material on how to derive the formula. I was able to find an article by Keith Dunnigan here that gave an outline of how it is derived along with several other confidence intervals. Hopefully later I can take a look at my textbooks and do a full derivation.



I’m going to demonstrate a simple example using forward agreements for stocks of what we mean when we say there is no arbitrage.

First assume that there is a risk free investment that yields continuous interest at rate r, typically treasury bonds are used as the risk free investment. Now assume we have a stock S that has price S_0 at time 0. Suppose you enter into a forward agreement to pay F_{0,T} at time T for the stock S.

Now using the risk free investment we can say that F_{0,T} is worth F_{0,T}e^{-rT} at time 0 (this is because if we invested F_{0,T}e^{-rT} in the risk free investment at time 0 we would have F_{0,T} money at time T).

The assumption of no arbitrage means that F_{0,T}e^{-rT} = S_0 since by investing F_{0,T}e^{-rT} in treasury bonds we will own the stock at time T and by buying the stock we also own it at time T.

Thus by assuming no arbitrage the forward price is F_{0,T} = S_0e^{rT}.

This is one example of using the assumption of no arbitrage. The Wikipedia article on arbitrage shows the other types of arbitrage.

Audio Analysis, Tools of the Trade

I’ve been reading a little bit about audio analysis and doing a lot of searching for audio libraries. There are two audio analysis methods I want to compute.

1. Compute the loudness of an audio file. That is add up the PCM output and divide by the length in seconds of the audio file. This outputs the average loudness of the file. It’s is an easy measure to compute and might be useful for categorizing music (I haven’t yet tried sorting my music by this “loudness” measure).

2. Do a spectral analysis of the audio file. That is compute the discrete (fast) fourier transform of the file and go from there. As the wikipedia article shows the discrete fourier transform requires significant analysis to get good spectral components for an audio file (see the Spectral Analysis section). For audio analysis there is also the Discrete Time Fourier Transform which is just a variant of the discrete fourier transform.

As of right now I’ve written the code to do 1. but I don’t have anything for 2.. Before I get too excited about doing the spectral analysis of audio files I need to read Example Applications DFT. And also brush up on my fourier analysis.

For implementing the spectral analysis I’m going to use the FFTW library developed at MIT.

I’ve listed below the packages and libraries for audio analysis and fourier analysis that I came across in my search.
SciPy (main python scientific platform (fourier analysis uses the FFTW library), CLAM (large audio analysis framework, lots of dependencies), Baudline Spectrum Analyzer, Matlab (no explanation needed), Octave (Matlab clone), FFTW.

I’m going to use the FFTW library for my program because the other libraries such as scipy etc. require 10’s of dependencies. And I want my program to rely on only a few dependencies. If writing C code gets too onerous I’ll switch to SciPy.

Musical Distance

As of most of you, I play my music through the computer. One of the popular settings for music players is the “shuffle” or “random” mode. That is the next song is randomly (or semi-randomly) selected from the playlist. This was a great innovation for listening to overlooked song in your music collection.

The only problem is that the “shuffle” mode mixes clashing genres that is different genres are played next to each other. This is avoidable by filtering your playlist to include only a specific genre (say rock) and then use the “shuffle” mode to play only rock songs. But for me that’s only a partial solution. Because when the music player lists only rock songs it includes songs by “Rammstein” (heavy metal) and songs by “Jonathon Coulton” (folk rock). In particular I consider Rammstein “hard” music and “Jonathon Coulton” soft music.

I think “Rammstein” and “Jonathon Coulton” are musically far apart. Not as far apart as Classical Music and Rap. But not as close as Classical Music and Jazz. The idea of music genres quantifies the idea of musical closeness to some degree. But I want a quantitative measurement for how close two songs are musically. Similar to how we say that a song is “soft” or “loud/hard”.

The idea of musical distance is that if two songs are musically close together then their musical distance is a small number. Conversely if two songs are musically far apart then their musical distance is large. That is I want a metric (in the mathematical sense) for songs.

The only trouble is I know nothing about audio analysis. I know what a spectrum is but that’s about it. Julius O. Smith III at Stanford has been kind enough to post online his textbooks on the topic of music/audio processing and analysis at Mathematics of the Discrete Fourier Transform with Audio Applications, Introduction to Digital Filter with Audio Applications, and Physical Audio Signal Processing
for Virtual Musical Instruments and Audio Effects
. When I have a couple of weeks to spare (hah) I’ll try giving the books a read to see what I can come up with.

Packing Calculations

Todays problem is from Pugh’s great book “Real Mathematical Analysis“, it’s problem 23 in chapter 1.

Problem Statement:

Let b(R) and s(R) be the number of integer unit cubes in \textbf{R}^M that intersect the ball and the sphere of radius R respectively, centered at the origin. By integer unit cube I mean a unit cube which has its vertices on the integer lattice.

(a) Let m = 2 and calculate the limits

\textrm{lim}_{R \rightarrow \infty} \frac{s(R)}{b(R)} and \textrm{lim}_{R \rightarrow \infty} \frac{s(R)^2}{b(R)}.

(b) Take m \ge 3. What exponent k makes the limit

\textrm{lim}_{R \rightarrow \infty} \frac{s(R)^k}{b(R)}


(c) Let c(R) be the number of integer unit cubes that are contained in the ball of radius R, centered at the origin. Calculate

\textrm{lim}_{R \rightarrow \infty} \frac{c(R)}{b(R)}.

(d) Shift the ball to a new, arbitrary center (not on the integer lattice) and re-calculate the limits.

End Problem Statement

This problem doesn’t involve finding a packing but instead involves counting the number of unit cubes after the packing has been chosen (in this case we’re packing a ball with integer unit cubes). So I find it to be a very interesting problem.

The first simplification I make is to only consider the first quadrant of the ball with radius R. By symmetry s(R) = 4 s_{\textrm{quad 1}}(R) and similarly for b(R) = 4 b_{\textrm{quad 1}}(R), where s_{\textrm{quad 1}} and b_{\textrm{quad 1}} are the functions s and b restricted to the first quadrant. For the rest of this solution I only work with the first quadrant. Let B_R = \{(x, y) \in \textbf{R}^2 : x^2 + y^2 \le R^2, x \ge 0, y \ge 0\}. The area of B_R is \pi R^2/4 so as a first approximation b(R) \approx \pi R^2/4.

After some thought the best way to count the number of unit cubes in s(R) is to imagine tracing out the curve y = R\sin(\theta), \textrm{ } x = R\cos(\theta) as \theta goes from 0 to \pi/2. Let c be the number of unit cubes the curve has been traced through. Initially c = 0. As we trace out the curve, every time y \in \textbf{N} we add 1 to c since the curve has entered a new unit square, similarly if x \in \textbf{N}, 1 is added to c. If y and x are both integers then only 1 should be added to c (this happens very rarely so I mostly ignore this case).

For a small example let R = 2. Then x goes from 2 to 0, so x \in \textbf{N} 3 times, similarly y goes from 0 to 2 so y \in \textbf{N} 3 times.

This gives us an initial count of 6 for s(2) but we need to handle the corner cases.

The corner cases are (x = 0, y = 2) and (x = 2, y = 0). Note that x = 0, y = 2 means both x and y are integers so that square was counted twice instead of once, also don’t forget that the square was counted when the curve entered it. So we need to subtract two from 3 + 3 giving 3 + 3 – 2. Also when x = 2, y = 0 this square is again counted twice so we need to subtract 1 from 3 + 3 – 2 giving 3 + 3 – 3 = 3 Thus s(2) = 3.

Example 2: R = 4, x goes from 4 to 0 and y goes from 0 to 4 so we have 5 + 5 – 3 (remember the corner cases) = 7.

From the above derivations s(R) = 2R - 3. To a first approximation the formula for s(R) in m dimensions is mR.

To calculate b(R) first note that if the packing of B_R was perfect than b(R) would be \pi R^2/4. It’s not a perfect packing so b(R) is greater than \pi R^2/4. To calculate how much more we first note that all the unit cubes used in counting s(R) have on average half their area in B_R and half their area outside B_R (remember we’re taking the limit to large R, so the result follows from symmetry).

There are 2R - 3 of these unit cubes that are only half in B_R. So they cover an area of R – 3/2 inside B_R so b(R) must be \pi R^2/4 - R + 3/2 (the part of B_R covered by unit cubes that are completely contained in B_R) + 2R – 3 (the unit cubes that are part-in and part-out) = \pi R^2/4 + R - 3/2.

To summarize:

s(R) = 2R - 3, \textrm{ } b(R) = \pi R^2/4 + R - 3/2,

c(R) = b(R) - s(R) = \pi R^2/4 - R + 3/2.

Now that we’ve worked out b(R), s(R), and c(R) the answers are:

(a) 0 and 8/\pi.

(b) m, higher dimensions follow the same methods of calculations.

(c) 1.

(d) Hehe, I haven’t tried this part maybe in a future post. My instinct is that the limits are identical.

Quadratic Equation

The quadratic equation of the form x^2 + ax + b = 0 is one of the simplest nonlinear problems in one variable. The method I use for solving it can also be applied to the cubic and quartic equations but of course not the quintic.

Let e_1 \textrm{ and } e_2 be two (not necessarily different) solutions to x^2 + ax + b = 0 . By the fundamental theorem of algebra (x - e_1)(x - e_2) = x^2 + ax + b . Multiplying out x^2 -(e_1 + e_2) x + e_1e_2=x^2 + ax + b . So e_1 + e_2 = -a \textrm{ and } e_1e_2 = b. This is a non-linear system of equations in two variables.

Note that {(e_1 - e_2)}^2 = e_1^2 + e_2^2 -2e_1e_2 which is equal to {(e_1 + e_2)}^2 - 4e_1e_2 . Thus e_1 - e_2 = {(a^2 - 4b)}^{1/2} Which easily yields the solution e_1 = -a/2 +-  {(a^2 - 4b)}^{1/2}.

It’s a fun exercise. The trick is writing the square of e_1 - e_2 and also assuming you already have two solutions.

Listing All Subsets

A common problem in discrete math is listing all subsets for a given set say S. There’s a trivial recursive algorithm for listing all subsets but I want to give an algorithm for listing subsets when S is small.

Lets try out the algorithm for the set S=\{a,b\}.

The subsets are \{\}, \{b\}, \{a\}, \{a,b\}.

Now look at the numbers 0, 1, 2, 3 in binary: 00, 01, 10, 11.

We can say that 00 represents the set \{\}, 01 the set \{a\}, 10 the set \{b\} and 11 the set \{a,b\}. Remember to read the binary numbers from right to left. Given a binary number the set it represents corresponds to the one’s in the binary number, so if the binary number has a one in the 10th bit (where the 10th bit is the 10 character from the right end of the number) than the 10th element in the set S is included in the subset.

This does mean we need an ordering for the set S. But the ordering is only used when listing out the subsets so it’s doesn’t violate the condition that sets aren’t ordered (if we wanted to list the subsets of an infinite set this could be a problem).

Now it’s obvious how to list the subsets of a set S.

1. Create an array, A, of all the numbers from 0 to 2^{|S|}  - 1.

2. Iterate over the array A and for each element, n, of A list out the corresponding subset of S. If n = 1 then the corresponding subset is \{a\}, where S = \{a,b\}.

Use bitwise arithmetic to retrieve the individual bits from a number. For getting the nth bit from a number, x, in java, c, etc. do “x & pow(2,n)”. If x has a 1 in the nth then “x & pow(2,n) == 2n” otherwise it’s == 0.

Thats it. Happy coding.