Posts

Showing posts from 2015

MM Algorithm [unfinished]

1. Overview of the MM algorithm
The MM algorithm is not a single algorithm, but a prescription for constructing optimization algorithms. The EM algorithm from statistics is a special case. An MM algorithm operates by creating a surrogate function that minorizes or majorizes the objective function. When the surrogate function is optimized, the objective function is driven uphill or downhill as needed. In minimization, MM stands for majorize/minimize; in maximization, MM stands for minorize/maximize.

2. Rationale for the MM principle
- It can generate an algorithm that avoids matrix inversion.
- It can separate the parameters of a problem.
- It can linearize an optimization problem.
- It can deal gracefully with equality and inequality constraints.
- It can turn a non-differentiable problem into a smooth problem.

3. Majorization and definition of the algorithm
A function $g(\theta \mid \theta^n)$ is said to majorize the function $f(\theta)$ at $\theta^n$ provided: $g(\theta \mid \theta^n) \ge f(\theta)$ for all $\theta$, with equality $g(\theta^n \mid \theta^n) = f(\theta^n)$.
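To make the majorize/minimize idea concrete, here is a minimal Python sketch (my own illustration, not code from this post): it minimizes the least-absolute-deviations objective $f(\theta) = \sum_i |x_i - \theta|$ by repeatedly minimizing a quadratic surrogate that majorizes each $|x_i - \theta|$ at the current iterate, which reduces to a simple reweighting update. The small `eps` is just a safeguard against division by zero.

```python
import numpy as np

def mm_median(x, n_iter=50, eps=1e-8):
    """Minimize f(theta) = sum_i |x_i - theta| with an MM algorithm.

    At the current iterate theta_n, |x_i - theta| is majorized by the quadratic
    (x_i - theta)^2 / (2|x_i - theta_n|) + |x_i - theta_n| / 2, so minimizing
    the surrogate is a weighted least-squares (reweighting) step.
    """
    theta = float(np.mean(x))                     # starting point
    for _ in range(n_iter):
        w = 1.0 / (np.abs(x - theta) + eps)       # weights from the majorizer
        theta = float(np.sum(w * x) / np.sum(w))  # minimize the surrogate
    return theta

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0])
print(mm_median(x))   # converges toward the sample median (3.0)
```

Each update decreases the surrogate, and hence drives the original objective downhill, which is exactly the MM descent property described above.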

Conditional Random Fields (CRF) -- unfinished

References:
1. John Lafferty, Andrew McCallum, Fernando Pereira. "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", ICML 2001.
2. Fei Sha, Fernando Pereira. "Shallow Parsing with Conditional Random Fields", HLT-NAACL 2003.

Parameter estimation in text modeling [unfinished]

In this post, we would like to present some parameter estimation methods that are common with discrete probability distributions, which are very popular in text modeling. Then we explain the model of Latent Dirichlet Allocation (LDA) in detail.

I. Introduction
There are two inference problems in parameter estimation: (1) how to estimate values for a set of distribution parameters $\theta$ that can best explain a set of observed data; (2) how to calculate the probability of a new observation given previous observations. We introduce Bayes' rule to solve these problems. Bayes' rule is defined as:
$p(\theta \mid X) = \dfrac{p(X \mid \theta)\, p(\theta)}{p(X)}$
and its terms may be called: posterior = (likelihood × prior) / evidence. We will introduce maximum likelihood, maximum a posteriori, and Bayesian estimation, along with central concepts like conjugate distributions and Bayesian networks, to tackle these two problems.

II. Parameter estimation methods
1. Maximum likelihood estimation
Maximum likelihood (ML) tries to find the parameters that maximize the likelihood. The common way to obtain the parameter es
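As a minimal sketch (my own illustration, not code from this post) of ML estimation for a discrete (categorical) distribution over words, which is the typical setting in text modeling, the ML estimate of each word probability is simply its relative frequency:

```python
import numpy as np

# ML estimation for a categorical distribution over words:
# maximizing sum_w n_w * log(theta_w) subject to sum_w theta_w = 1
# gives theta_w = n_w / N, i.e. the relative frequency of word w.
corpus = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]
vocab, counts = np.unique(corpus, return_counts=True)

theta_ml = counts / counts.sum()
for word, p in zip(vocab, theta_ml):
    print(f"p({word}) = {p:.3f}")
```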

Neural Networks in Machine Learning [unfinished]

(Last updated: 05/12/2015)

1. Why do we need to study neural computation?
- To understand how the brain actually works.
- To understand a style of parallel computation inspired by neurons and their adaptive connections.
- To solve practical problems using novel learning algorithms inspired by the brain.

2. What are neural networks?
A typical cortical neuron has a gross physical structure consisting of:
   1) a cell body
   2) an axon, where it sends messages to other neurons
   3) a dendritic tree, where it receives messages from other neurons
- The axon contacts dendritic trees at synapses.
- Spike generation: when there is enough charge in its dendritic tree to depolarize the axon hillock (part of the cell body), the neuron sends a spike out along its axon. (A spike is a wave of depolarization travelling along the axon.)
Synapses
   - contain little vesicles of transmitter chemical (some implementing positive weights and some implementing negative weights)
   - spikes --> axon: these vesicles to migra
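As a hypothetical sketch (not part of this post), the idealized artificial analogue of the neuron described above can be written in a few lines: inputs arrive through weighted "synapses" (positive or negative weights), the "dendritic tree" accumulates charge as a weighted sum, and the neuron emits a spike once a threshold is exceeded. The function name and numbers below are made up for illustration.

```python
import numpy as np

def binary_threshold_neuron(inputs, weights, bias, threshold=0.0):
    # Accumulate charge as a weighted sum of the incoming signals,
    # then fire a spike (output 1) if the charge reaches the threshold.
    charge = np.dot(weights, inputs) + bias
    return 1 if charge >= threshold else 0

x = np.array([1.0, 0.0, 1.0])     # incoming spikes from other neurons
w = np.array([0.7, -0.4, 0.9])    # synaptic weights (excitatory / inhibitory)
print(binary_threshold_neuron(x, w, bias=-1.0))   # -> 1 (the neuron fires)
```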

Feature scaling

1. What is feature scaling?
Feature scaling is a method used to normalize the independent variables of our data. It is also called data normalization and is generally performed before running machine learning algorithms.

2. Why do we need to use feature scaling?
In practice, the range of raw data can be very wide, and without normalization the objective functions may not work properly (for example, optimization can get stuck at a local optimum) or may be time-consuming. For example, K-Means might give you totally different solutions depending on the preprocessing you used, because an affine transformation implies a change in the metric space: the Euclidean distance between two samples will be different after that transformation. When we apply gradient descent, feature scaling also helps it converge much faster than it would without normalized data.
(Figure: gradient descent with and without feature scaling)

3. Methods of feature scaling
Rescaling: the simplest method is resc
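As a small illustration (a sketch, not code from this post), min-max rescaling and z-score standardization of a toy two-feature matrix can be written as follows; the numbers are made up to show features on very different scales.

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])   # toy data: two features on very different scales

# Min-max rescaling: map each feature to the [0, 1] range.
X_rescaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (z-score): zero mean and unit variance per feature.
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_rescaled)
print(X_standardized)
```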

Notes on Statistics

Topic #1. Distributions of functions of random variables
Given random variables drawn from some known distribution and a function of these random variables, we are interested in finding the distribution of this function's value. We will go through three techniques for finding the probability distributions of functions of random variables: the distribution function technique, the change-of-variable technique, and the moment-generating function technique. First, we consider a function of one random variable with two techniques: the distribution function technique and the change-of-variable technique. Then, we extend the same techniques to transformations of two random variables. Next, we consider cases where the random variables are independent, and sometimes identically distributed. After that, we spend some time with the third technique, the moment-generating function technique, to find the distribution of functions of random variables. A small worked example of the first technique is sketched below.
- Source
   1) https://onlinecourses.science.psu.edu/stat414/node/127
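As a worked example of the distribution function technique (my own illustration, not taken from the course notes): let $X \sim \mathrm{Uniform}(0,1)$ and $Y = X^2$. Working directly with the CDF and then differentiating gives the density of $Y$:

```latex
% Distribution function technique: X ~ Uniform(0,1), Y = X^2.
\begin{align*}
F_Y(y) &= P(Y \le y) = P(X^2 \le y) = P\!\left(X \le \sqrt{y}\right) = \sqrt{y},
          && 0 < y < 1, \\
f_Y(y) &= \frac{d}{dy} F_Y(y) = \frac{1}{2\sqrt{y}},
          && 0 < y < 1.
\end{align*}
```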

Manifold Learning [unfinished]

I. What is manifold learning?

Dimensionality Reduction
The accuracy of training algorithms generally improves with the amount of data we have. However, managing a large number of features is usually a burden on our algorithms. Some of these features may be irrelevant, so it is important to make sure that the final model is not affected by them.

What is Dimensionality?
Dimensionality refers to the minimum number of coordinates needed to specify any point within a space or an object.

Why do we need Dimensionality Reduction?
Keeping the dimensionality high keeps the representation unique, but it may not be easy to analyse because of the complexity involved. Apart from simplifying data, visualization is also an interesting and challenging application.

Linear Dimensionality Reduction
Fig 1: PCA illustration (source: http://evelinag.com/blog )
Principal Component Analysis
Given a data set, PCA finds the directions along which the data has max
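As a minimal PCA sketch (not code from this post), scikit-learn can project made-up data onto the two directions of largest variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(200, 5)                        # toy data: 200 samples, 5 features
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]      # make two features strongly correlated

pca = PCA(n_components=2)                    # keep the 2 largest-variance directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (200, 2)
print(pca.explained_variance_ratio_)         # fraction of variance per component
```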

Find all pairs in an array of integers whose sum is equal to a given number?

Problem: Write a program to find all pairs of integers whose sum is equal to a given number. For example, if the input integer array is {1, 2, 3, 4, 5} and the given sum is 5, the code should return the two pairs {1, 4} and {2, 3}.

This is a common interview question related to the array data structure. A quick search on Google returns several pages discussing the problem, to list but a few [1, 2, 3, 4]. Those pages only discuss the solutions at the conceptual level, without supporting numbers to compare them, although they do provide code. In this note I re-implement all these approaches and compare their efficiency in practice.

Common approaches
There are four approaches that are often mentioned together: brute force, HashSet, sorting with search, and sorting with binary search. Each has its own pros and cons. Interested readers can find the details in the list of sources below. Here is a summary of the storage complexity and computational complexity.
Approach Storage co
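As a minimal illustration (not the note's own implementation), the HashSet approach can be sketched in Python: scan the array once and, for each value, check whether its complement (the target minus the value) has already been seen.

```python
def find_pairs(nums, target):
    """Return all pairs of values in nums whose sum equals target (HashSet approach)."""
    seen = set()
    pairs = set()
    for x in nums:
        complement = target - x
        if complement in seen:
            # store pairs in sorted order so duplicates collapse
            pairs.add((min(x, complement), max(x, complement)))
        seen.add(x)
    return sorted(pairs)

print(find_pairs([1, 2, 3, 4, 5], 5))   # -> [(1, 4), (2, 3)]
```

This runs in O(n) time with O(n) extra storage, which is the trade-off usually cited for this approach compared with brute force or the sorting-based variants.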

Feature selection [TODO]

I. Introduction
Feature selection is the process of selecting a feature or a subset of features for building a model in machine learning. A good feature selection method leads to a better-performing model and hence helps improve our results. In particular, there are two reasons why feature selection is important:
- If we are able to reduce the number of features, we reduce over-fitting and improve the generalization of models. It also helps to reduce the measurement and storage requirements, reducing training and utilization times.
- We gain a better understanding of the underlying structure and characteristics of the data, which leads to better intuition about the algorithm.
In this blog, we briefly present some popular methods in feature selection. We use Python and scikit-learn to run all the experiments. The data set, which includes 506 samples and 13 features, is available at (http://scikit-learn.org/stable/datasets/ ).
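As a hedged illustration of one simple method (univariate selection with SelectKBest), the sketch below is not the post's own code: it uses a synthetic regression data set with the same shape as the one mentioned above (506 samples, 13 features), since `load_boston` has been removed from recent scikit-learn releases.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in with the same shape as the data set mentioned in the post.
X, y = make_regression(n_samples=506, n_features=13, n_informative=5, random_state=0)

# Univariate feature selection: keep the 5 features with the strongest
# linear relationship (F-test score) to the target.
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                        # (506, 5)
print(selector.get_support(indices=True))      # indices of the selected features
```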