Hierarchical softmax is an operator
that approximates the softmax operator
while giving significant training
speed gains and reasonably comparable
performance. Instead of calculating
the probabilities of all the classes,
it calculates the probability of each
step in the path from the root to the
target word in the hierarchy; the
probability of the target class is the
product of these per-step probabilities.
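As a minimal sketch of that idea, the snippet below uses a tiny hypothetical hierarchy (the node names, shapes, and per-node parameters are illustrative assumptions, not part of the operator's API): each internal node scores its children with its own weights and bias, and the probability of a leaf is the product of the softmax probabilities taken along its path.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1D score vector
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical hierarchy: the root has children {node A, leaf "cat"};
# node A has children {leaf "dog", leaf "bird"}.
rng = np.random.default_rng(0)
d = 4                                   # input feature size
x = rng.normal(size=d)                  # one input sample

# per-node parameters: one weight row and one bias per child
W_root, b_root = rng.normal(size=(2, d)), rng.normal(size=2)
W_A,    b_A    = rng.normal(size=(2, d)), rng.normal(size=2)

# path to "bird": take child 0 at the root (node A), then child 1 at A
p_root = softmax(W_root @ x + b_root)[0]
p_A    = softmax(W_A @ x + b_A)[1]
p_bird = p_root * p_A                   # product of the step probabilities
log_p  = np.log(p_root) + np.log(p_A)   # log probability of the target
```

Only the nodes on one root-to-leaf path are evaluated, which is where the speed gain over a full softmax comes from.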

The operator takes a 2D tensor (Tensor)
containing a batch of input data, a set
of parameters represented by the weight
matrix and bias terms, and a 1D tensor
(Tensor) holding the labels, i.e. the
indices of the target classes. The
hierarchy has to be specified as an
argument to the operator.

The operator returns a 1D tensor holding
the computed log probability of the
target class for each sample, and a 2D
tensor of intermediate outputs (the
linear outputs of the weight matrix and
the softmax values from each step in the
path from root to target class), which
the gradient operator uses to compute
gradients for all samples in the batch.
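The forward pass described above can be sketched as follows. This is an illustrative reimplementation, not the operator's actual code: the function name `hsoftmax_forward`, the dict-based `node_params` and `paths` encodings, and the assumption that all paths have equal length (so the intermediate rows stack into a 2D array) are all assumptions made for the sketch.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a 1D score vector
    e = np.exp(z - z.max())
    return e / e.sum()

def hsoftmax_forward(X, node_params, paths, labels):
    # X: (batch, d) inputs.
    # node_params: node -> (W, b) scoring that node's children.
    # paths: label -> list of (node, child_index) steps from the
    #        root to the label's leaf (assumed equal length here).
    # Returns a 1D array of per-sample target log probabilities and
    # a 2D array of intermediate outputs (linear scores and softmax
    # of every step), which a gradient op would reuse for backprop.
    log_probs, intermediates = [], []
    for x, y in zip(X, labels):
        logp, row = 0.0, []
        for node, child in paths[y]:
            W, b = node_params[node]
            scores = W @ x + b           # linear output at this node
            probs = softmax(scores)      # softmax over its children
            logp += np.log(probs[child])
            row.extend([scores, probs])  # keep both for the gradient
        log_probs.append(logp)
        intermediates.append(np.concatenate(row))
    return np.array(log_probs), np.vstack(intermediates)
```

For a batch of size B and paths of two steps over binary nodes, the first output has shape (B,) and the second (B, 8): two scores plus two softmax values per step.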
