I am sharing here my analysis of Restricted Boltzmann Machines programming assignment in Dr. Hinton’s Neural Networks for Machine Learning course. I am referring to the last question of Week 13 assignment which is presented as an optional exercise for those who are interested in taking on a challenge.
The exercise is about the computation of the partition function (a.k.a. normalization constant that you see in the formula for the Boltzmann distribution). It took me three days to figure this out, and though I won’t disclose the answer (so I don’t break the honor code of the course), I think that I can offer some clarity about how to solve this interesting problem and expand on some of the comments offered by other users in the forums.
The problem is to compute the value of the log of the partition function for the proposed RBM which has 256 visible units (v) and 10 hidden units (h). My approach consisted of three steps:
- Figuring out possible configurations
- Replacing summations with products
- Figuring out parallelization of operations with matrices
Figuring out possible configurations
I tested this idea on a small network (# visible units = 3, # hidden units = 2). As shown below in the attached analysis you can take advantage of the fact that many possible (v, h) configurations evaluate to zero since both v and h need to be active (v=1, h=1) for the connection to have any energy, according to the definition of energy.
Replacing summations with products
The second step in this analysis deals with figuring out how to replace a summation of energies with products of exponentials of the form (1+e^w), where w is an expression of connection weights. For this, I used the factorization mentioned in the forums:
1+a+b+c+ab+bc+abc = (1+a)(1+b)(1+c)
If you look closely to the products of weights that you obtain in step 1, all of those can be factored like this. In the attached analysis you can see how this happens.
Parallelization of operations with matrices
Finally, I figured out how to obtain the weight expressions obtained in the products by taking advantage of the quadratic structure:
in this construct you can use h to sum over rows and v to sum over columns of W. As shown in the analysis, h will correspond to all possible binary combinations for the hidden layer and v will be an identity matrix of the size of the input layer.
I feel that knowing these algebraic tricks are secondary to developing an understanding and appreciation for RBMs and thus, I hope my analysis help people to not feel discourage or lose interest in this topic.