Neural Networks in Machine Learning [unfinished]

(Last updated: 05/12/2015)

1. Why do we need to study neural computation?
- To understand how the brain actually works.
- To understand a style of parallel computation inspired by neurons and their adaptive connections.
- To solve practical problems by using novel learning algorithms inspired by the brain.

2. What are neural networks?
A typical cortical neuron has a gross physical structure consisting of
   1) a cell body
   2) an axon where it sends messages to other neurons
   3) a dendritic tree where it receives messages from other neurons

- Axons contact dendritic trees at synapses.

- Spike generation: when enough charge has flowed into the dendritic tree to depolarize the axon hillock (part of the cell body), the neuron sends a spike out along its axon. (Spike: a wave of depolarization travelling along the axon.)

Synapses
   - contain little vesicles of transmitter chemical (some kinds implement positive weights, others implement negative weights)
   - when a spike travels along the axon and arrives at a synapse, it causes these vesicles to migrate to the surface and be released into the synaptic cleft
   - transmitter molecules 1) diffuse across the synaptic cleft and 2) bind to receptor molecules in the membrane of the post-synaptic neuron -> the receptors change shape -> holes open in the membrane -> specific ions flow in or out of the post-synaptic neuron -> its state of depolarization changes

How do synapses adapt? (the most important property)
   1) by varying the number of vesicles that get released when a spike arrives
   2) by varying the number of receptor molecules that are sensitive to the released transmitter molecules
- Synapses use locally available signals to change their strengths. How do they do that?

3. Overview of main types of neural network architecture
The architecture of a neural network is the way its neurons are connected together.
For example:
- feed-forward neural network: information comes into the input units and flows in one direction through the hidden layers until it reaches the output units.
- recurrent neural network: information can flow round in cycles (more powerful than a feed-forward network?)
- symmetrically connected network: the weights are the same in both directions between two units.

3.1 Feed-forward neural networks
- First layer: input; last layer: output
- More than one hidden layer: deep neural network
- They compute a series of transformations that change the similarities between cases
   + In each layer we get a new representation of the input, in which things that were similar in the previous layer may have become less similar, and vice versa.
   + The activities of the neurons in each layer are a non-linear function of the activities in the layer below.
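
The layer-by-layer computation described above can be sketched in a few lines. This is only an illustration: the 2-3-1 layout, the weight values and the choice of the logistic function as the non-linearity are assumptions, not something fixed by these notes.

```python
import math

def logistic(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Compute one layer's activities from the activities of the layer below.

    weights[j][i] connects unit i in the layer below to unit j in this layer.
    """
    return [
        logistic(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the input through every layer in order; return all activities."""
    activities = [inputs]
    for weights, biases in layers:
        activities.append(layer_forward(activities[-1], weights, biases))
    return activities

# Hypothetical 2-3-1 network with hand-picked weights (illustration only).
layers = [
    ([[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[1.0, -1.0, 0.5]], [0.2]),                                  # output layer
]
activities = forward([1.0, 2.0], layers)
print(activities[-1])  # activity of the single output unit
```

Each entry of `activities` is a new representation of the same input, which is the sense in which a feed-forward net computes a series of transformations.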

3.2 Recurrent networks
- Have directed cycles in their connection graph
   + following the arrows, we can get back to where we started
- Very difficult to train
- More biologically realistic, and interesting in structure
- Recurrent nets with multiple hidden layers are just a special case in which some of the hidden -> hidden connections are missing.
- A natural way to model sequential data
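
To make the idea of a cycle concrete, here is a minimal sketch of a recurrent hidden layer processing a sequence. The function names, the tanh non-linearity and the weight values are all assumptions chosen for illustration; the notes do not prescribe a particular form.

```python
import math

def rnn_step(x, h_prev, w_in, w_rec, bias):
    """One time step: the new hidden state depends on the input AND the old state."""
    h_new = []
    for j in range(len(h_prev)):
        total = bias[j]
        total += sum(w_in[j][i] * x[i] for i in range(len(x)))
        # The hidden -> hidden connections form the directed cycle.
        total += sum(w_rec[j][k] * h_prev[k] for k in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new

def run_sequence(xs, h0, w_in, w_rec, bias):
    """Feed a sequence of input vectors through the net, one per time step."""
    h = h0
    states = []
    for x in xs:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states

# Hypothetical weights: 1 input unit, 2 hidden units.
w_in = [[0.5], [-0.3]]
w_rec = [[0.1, 0.4], [-0.2, 0.3]]
bias = [0.0, 0.0]

# The input is non-zero only at the first time step.
states = run_sequence([[1.0], [0.0], [0.0]], [0.0, 0.0], w_in, w_rec, bias)
for t, h in enumerate(states):
    print(t, h)
```

The input is zero at the second and third steps, yet the hidden state keeps changing, because information from the first step is still flowing round the cycle. This is why the architecture suits sequential data.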

3.3 Symmetrically connected networks
- Like recurrent nets, but the connections between units are symmetrical (same weight in both directions)
- Easier to analyze than recurrent nets
- More restricted in what they can do (e.g. they cannot model cycles) because they have to obey an energy function.
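
The notes do not write the energy function down. As an assumption, the form commonly used for Hopfield-style networks with unit states \(s_i\), symmetric weights \(w_{ij}\) and biases \(b_i\) (symbols introduced here for illustration) would be:

```latex
E(\mathbf{s}) = -\sum_{i<j} w_{ij}\, s_i s_j \;-\; \sum_i b_i s_i,
\qquad w_{ij} = w_{ji}
```

In such a network, updating one unit at a time to the state favoured by its total input never increases \(E\). The state therefore settles rather than cycling, which is consistent with the restriction mentioned above.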

4. Perceptron
- The first generation of neural networks
- Limitations:
    + Performance depends heavily on the hand-chosen features.
    + Even with well-chosen features, a perceptron may not learn well.
    + The whole point of pattern recognition is to recognize patterns despite transformations like translation. Minsky and Papert's "Group Invariance Theorem" says that the part of a perceptron that learns cannot learn to do this if the transformations form a group.

- Neural nets are really powerful if we can learn the feature detectors, not only the weights.
- Learning with hidden units
    + Networks without hidden units are very limited in the input-output mappings they can learn to model (for the non-linearly separable case, more layers of linear units do not help, and fixed output non-linearities are not enough).
    + We need multiple layers of adaptive, non-linear hidden units --> how do we train them?
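
The limitation of networks without hidden units can be seen on a tiny example. The sketch below trains a single binary threshold unit with the perceptron learning procedure on AND (linearly separable) and on XOR (not linearly separable). The function names, the zero initialization and the epoch limit are assumptions made for this illustration.

```python
def predict(weights, bias, x):
    """Binary threshold unit: output 1 if the total input is at least 0."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total >= 0 else 0

def train_perceptron(cases, max_epochs=100):
    """Perceptron learning procedure; returns (weights, bias, converged)."""
    weights = [0.0] * len(cases[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, target in cases:
            if predict(weights, bias, x) != target:
                errors += 1
                # Add the input vector if the target is 1, subtract it if 0.
                sign = 1 if target == 1 else -1
                weights = [w + sign * xi for w, xi in zip(weights, x)]
                bias += sign
        if errors == 0:
            return weights, bias, True
    return weights, bias, False

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND converged:", train_perceptron(AND)[2])
print("XOR converged:", train_perceptron(XOR)[2])
```

No setting of two weights and a bias separates the XOR cases, so the procedure keeps making errors however long it runs. Hidden units that compute suitable features would make the problem solvable.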

5. Linear NNs
- Perceptron vs. linear neuron:
   + In a perceptron, the weights get closer to a good set of weights, but the outputs do not necessarily get closer to the targets (each update only adapts the weights to satisfy one training case).
   + In a linear neuron, the outputs get closer to the target outputs.

- Multi-layer neural networks do not use the perceptron learning procedure (in a non-convex problem, averaging two good sets of weights may give a bad set of weights) --> how can we show that a learning procedure makes progress?
   + Show that the output values get closer to the target values (squared error measure).

- Linear neurons (linear filters) have a real-valued output that is simply the weighted sum of their inputs.
   + Aim of learning: minimize the error summed over all the training cases

- Use an iterative learning procedure together with the squared error measure to train linear nets (by computing the derivative of the error with respect to each weight).
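
The derivative mentioned above can be worked out in a few lines. The symbols are assumptions introduced here: \(x_i^n\) is input \(i\) on training case \(n\), \(y^n\) the output, \(t^n\) the target and \(\varepsilon\) the learning rate.

```latex
y^n = \sum_i w_i x_i^n,
\qquad
E = \tfrac{1}{2} \sum_n \left(t^n - y^n\right)^2

\frac{\partial E}{\partial w_i}
  = \sum_n \frac{\partial y^n}{\partial w_i}\,\frac{\partial E}{\partial y^n}
  = -\sum_n x_i^n \left(t^n - y^n\right)

\Delta w_i = -\varepsilon\,\frac{\partial E}{\partial w_i}
  = \sum_n \varepsilon\, x_i^n \left(t^n - y^n\right)
```

Summing over all cases gives the batch update. Using a single term of the sum per update gives the online version.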

- Behavior of the iterative learning procedure
   + Does the learning procedure eventually get the right answer? 1) There may be no perfect answer. 2) By making the learning rate small enough, we can get as close as we desire to the best answer.
   + How quickly do the weights converge to their correct values? Learning can be quite slow if two input dimensions are highly correlated.

- Online delta rule vs. the learning rule for perceptrons
   + Perceptron: increment or decrement the weight vector by the input vector, but only when we make an error
   + Delta rule: increment or decrement the weight vector by the input vector scaled by the residual error and the learning rate (how should we set the learning rate? too big --> unstable; too small --> learning takes a long time)
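
A minimal sketch of the online delta rule for a linear neuron follows. The training cases, the "true" weights used to generate noise-free targets, and the learning rate of 0.02 are all hypothetical values picked for this example.

```python
def delta_rule_epoch(weights, cases, learning_rate):
    """Apply the online delta rule once to every training case, in order."""
    for x, target in cases:
        output = sum(w * xi for w, xi in zip(weights, x))
        residual = target - output
        # Change each weight by its input scaled by the residual error
        # and the learning rate.
        weights = [w + learning_rate * xi * residual for w, xi in zip(weights, x)]
    return weights

# Hypothetical problem: targets are generated by a known linear neuron.
true_weights = [1.5, 0.5, 1.0]
inputs = [(2, 5, 3), (1, 0, 2), (3, 1, 0), (0, 2, 1)]
cases = [(x, sum(w * xi for w, xi in zip(true_weights, x))) for x in inputs]

weights = [0.0, 0.0, 0.0]
for _ in range(1000):
    weights = delta_rule_epoch(weights, cases, learning_rate=0.02)

print([round(w, 3) for w in weights])
```

Unlike the perceptron rule, the size of every update shrinks as the residual shrinks, so the outputs approach the targets smoothly. With a much larger learning rate the updates can overshoot and become unstable.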

- The error surface of a linear neuron
   + Why we need to understand the error surface: to understand what happens as a linear neuron is learning
   + Online vs. batch learning (batch learning moves perpendicular to the contour lines; online learning moves perpendicular to the constraint plane of the current training case)
   + Why can learning be slow? If the ellipse is very elongated (which happens when the lines corresponding to training cases are almost parallel), learning is very slow.

- From linear nets to multi-layer nets of non-linear neurons
   + Step 1: extend the learning rule to a single non-linear neuron by passing the linear weighted sum through a non-linear function, e.g. the logistic function.
   + Step 2: the

- Logistic neurons: modify the update rule to learn the weights of logistic units (for details on the derivative, please refer to Lecture 3, part 7)
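
As a rough sketch of what the modified rule looks like: for a logistic output y = 1 / (1 + exp(-z)), the derivative dy/dz equals y(1 - y), so the delta rule picks up an extra factor y(1 - y). The code below assumes a squared error measure and a single logistic unit; the function names and the example numbers are made up for illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def squared_error(weights, bias, cases):
    """E = 1/2 * sum over cases of (t - y)^2 for a single logistic unit."""
    total = 0.0
    for x, t in cases:
        y = logistic(bias + sum(w * xi for w, xi in zip(weights, x)))
        total += 0.5 * (t - y) ** 2
    return total

def error_gradient(weights, bias, cases):
    """dE/dw_i = -sum over cases of x_i * y * (1 - y) * (t - y).

    The bias behaves like a weight on an input that is always 1.
    """
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, t in cases:
        y = logistic(bias + sum(w * xi for w, xi in zip(weights, x)))
        dE_dz = -(t - y) * y * (1.0 - y)  # extra y * (1 - y) from the logistic
        grad_w = [g + dE_dz * xi for g, xi in zip(grad_w, x)]
        grad_b += dE_dz
    return grad_w, grad_b

# Hypothetical training cases and starting weights.
cases = [((1.0, 2.0), 1.0), ((0.5, -1.0), 0.0)]
weights, bias = [0.3, -0.2], 0.1

grad_w, grad_b = error_gradient(weights, bias, cases)
step = 0.5  # assumed learning rate
new_weights = [w - step * g for w, g in zip(weights, grad_w)]
new_bias = bias - step * grad_b
print(squared_error(weights, bias, cases), squared_error(new_weights, new_bias, cases))
```

Without the y(1 - y) factor this is exactly the delta rule for a linear neuron.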

- Multi-layer neural networks:
   + Learning with hidden units:
      - Networks without hidden units are very limited in the input-output mappings they can model
      - Using hand-coded features can also make the network powerful --> but this relies heavily on designing the features. Is there a way to find good features without requiring insight into the task, or repeated trial and error where we guess some features and see how well they work?
      - Goal: automate the loop of designing features for a particular task and seeing how well they work
      - A straightforward but very inefficient way of learning: perturb the weights randomly and keep any change that improves performance (a form of reinforcement learning). It is inefficient because 1) in order to decide whether to change one weight, we need to do multiple forward passes on a representative set of training cases (back-propagation is much more efficient, by a factor of the number of weights in the network); 2) large weight perturbations will nearly always make things worse, because the weights need to have the right relative values.
      - A variant of learning by weight perturbation: perturb all the weights in parallel and correlate the performance gain with the weight changes --> not really better than the original, since it still requires a lot of trials.
      - A better idea: randomly perturb the activities of the hidden units rather than the weights (how?)
   + Back-propagation
      - Compute how fast the error changes as we change a hidden activity on a particular training case:
           + Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities
           + Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined --> compute error derivatives for all the hidden units at the same time --> convert those error derivatives into error derivatives for the weights coming into a hidden unit.
      - BP algorithm on a single case
           + Define the error: the discrepancy between each output and its target value --> error derivative
           + Back-propagate the errors: compute the error derivatives in each hidden layer from the error derivatives in the layer above. If one unit connects to many units above, we combine their error derivatives using the same weights as in the forward pass to get the error derivatives in the layer below. The error derivative w.r.t. the weight \(w_{ij}\) (unit i in the layer below connects to unit j in the layer above) is the output of unit i times the error derivative w.r.t. the total input of unit j.
      - From error derivatives to a learning procedure: the back-propagation algorithm is an efficient way of computing the error derivative dE/dw for every weight on a single training case. There are still things we need to decide to get a fully specified learning procedure (how often do we update the weights, how do we prevent a large network from over-fitting, ...)
           + Optimization issues: how do we use the error derivatives on individual cases to discover a good set of weights? (how often to update: online, full-batch or mini-batch? how much to update?)
           + Generalization issues: how do we ensure that the learned weights work well for cases we did not see during training?
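
The single-case back-propagation step described above can be sketched for a net with one hidden layer. The assumptions here are logistic units throughout, a squared error measure, no biases (to keep the example short) and made-up weights; `w[j][i]` denotes the weight from unit i in the layer below to unit j in the layer above.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Return the hidden activities and the output activities."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [logistic(sum(w * h for w, h in zip(row, hidden))) for row in w_out]
    return hidden, output

def error(x, target, w_hidden, w_out):
    """E = 1/2 * sum over output units of (t - y)^2 on one training case."""
    _, output = forward(x, w_hidden, w_out)
    return 0.5 * sum((t - y) ** 2 for t, y in zip(target, output))

def backprop(x, target, w_hidden, w_out):
    """Error derivatives dE/dw for every weight on one training case."""
    hidden, output = forward(x, w_hidden, w_out)
    # dE/dz for each output unit: dE/dy times dy/dz.
    dz_out = [-(t - y) * y * (1.0 - y) for t, y in zip(target, output)]
    # dE/dh for each hidden unit: combine its effects on all output units,
    # using the same weights as in the forward pass.
    dh = [sum(w_out[j][i] * dz_out[j] for j in range(len(w_out)))
          for i in range(len(hidden))]
    dz_hidden = [d * h * (1.0 - h) for d, h in zip(dh, hidden)]
    # dE/dw_ij = (output of unit i below) * (dE/dz of unit j above).
    grad_out = [[dz * h for h in hidden] for dz in dz_out]
    grad_hidden = [[dz * xi for xi in x] for dz in dz_hidden]
    return grad_hidden, grad_out

# Hypothetical 2-2-1 network and one training case.
x, target = [1.0, 0.5], [1.0]
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [[0.5, -0.6]]
grad_hidden, grad_out = backprop(x, target, w_hidden, w_out)
print(grad_hidden)
print(grad_out)
```

One backward pass yields the derivative for every weight at once, whereas weight perturbation would need separate forward passes per weight. Turning these derivatives into weight updates still requires the optimization and generalization choices listed above.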
         



Sources:
1. https://class.coursera.org/neuralnets-2012-001/

2. http://cs224d.stanford.edu/syllabus.html
