Thanks to Feng Yan, who sent me his newly published work on parallel inference for LDA on GPUs.
The basic motivation is that on a GPU, the graphics card's memory is too small to hold a separate copy of the nwk matrix for each core. So the basic requirement is to keep a single global nwk matrix shared by all cores. This in turn means that when multiple cores sample in parallel, they must never update the same element of nwk at the same time. Feng's solution is to partition the training data not only by documents but also by words. This is viable due to the following observation:
- for word w1 in document j1 and word w2 in document j2, if w1 != w2 and j1 != j2, simultaneous updates of their topic assignments cause no read/write conflicts on either the document-topic matrix njk or the word-topic matrix nwk.
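One way to turn this observation into a schedule is sketched below. It is only an illustration under assumptions, not code from Feng's paper: documents and words are each assumed to be split into P disjoint blocks, and the names diagonal_schedule and is_conflict_free are hypothetical. In each epoch, worker m takes document block m and a word block shifted by the epoch index, so no two workers share a document block or a word block.

```python
def diagonal_schedule(num_workers):
    """Return num_workers epochs; each epoch lists one (doc_block, word_block)
    pair per worker. Assumes documents and words are each already split into
    num_workers disjoint blocks (an assumption of this sketch)."""
    P = num_workers
    return [[(m, (m + epoch) % P) for m in range(P)] for epoch in range(P)]


def is_conflict_free(epoch_pairs):
    """True if no two workers in the epoch share a document block or a word block,
    i.e. they touch disjoint rows of njk and disjoint rows of nwk."""
    doc_blocks = {d for d, _ in epoch_pairs}
    word_blocks = {w for _, w in epoch_pairs}
    return len(doc_blocks) == len(epoch_pairs) and len(word_blocks) == len(epoch_pairs)


schedule = diagonal_schedule(3)
for epoch, pairs in enumerate(schedule):
    print(epoch, pairs, is_conflict_free(pairs))
```

Over the P epochs every (document block, word block) pair is visited exactly once, so the whole corpus is sampled while each epoch stays free of conflicts on njk and nwk.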
Feng also presents a preprocessing algorithm that computes an optimal data partition with the goal of load balancing.
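To see why load balancing matters here, the sketch below scores a given partition. It does not reproduce Feng's preprocessing algorithm; the cost definition, the function names block_loads and schedule_cost, and the toy data are all assumptions made for illustration. The assumed setup is that P workers process one document block and one distinct word block each per epoch, and that an epoch takes as long as its most heavily loaded block.

```python
from bisect import bisect_right


def block_loads(tokens, doc_cuts, word_cuts):
    """Count tokens per (doc_block, word_block).
    tokens: iterable of (doc_id, word_id, count) triples.
    doc_cuts / word_cuts: sorted ids at which blocks 1..P-1 start (hypothetical format)."""
    P = len(doc_cuts) + 1
    loads = [[0] * P for _ in range(P)]
    for doc_id, word_id, count in tokens:
        loads[bisect_right(doc_cuts, doc_id)][bisect_right(word_cuts, word_id)] += count
    return loads


def schedule_cost(loads):
    """Assumed cost: in epoch l, worker m handles block (m, (m + l) % P),
    and the epoch lasts as long as its largest block."""
    P = len(loads)
    return sum(max(loads[m][(m + l) % P] for m in range(P)) for l in range(P))


# Toy corpus: 2 documents, 2 words, split into 2 x 2 blocks.
tokens = [(0, 0, 4), (0, 1, 1), (1, 0, 2), (1, 1, 3)]
loads = block_loads(tokens, doc_cuts=[1], word_cuts=[1])
print(loads)                 # [[4, 1], [2, 3]]
print(schedule_cost(loads))  # 6
```

With 10 tokens and 2 workers, a perfectly balanced partition would cost 5 under this measure, so the toy partition wastes one unit of time; a preprocessing step would search for cut points that bring the cost closer to that bound.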