One of the major drawbacks of neural networks (and of machine learning models in general) is their difficulty in handling very high-dimensional data effectively. In this blog post, I explore neural embedding, a variation on using auto-encoders for dimensionality reduction.
High-dimensional data poses a distinctive problem for statistical analysis, because the volume of the vector space grows exponentially with the number of dimensions. To keep the same sampling density, and hence the same statistical reliability, the number of data points would also have to grow exponentially. Otherwise the data becomes increasingly sparse, which hinders the statistical analysis that most machine learning algorithms rely on. This problem is usually referred to as the Curse of Dimensionality.
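To put rough numbers on that exponential growth, here is a small sketch. It assumes a hypothetical setup that is not from any particular dataset: every axis is split into 10 equal bins and we have one million data points. The function names are mine, chosen only for illustration.

```python
def grid_cells(bins_per_axis: int, dims: int) -> int:
    """Number of equal-sized cells when every axis is split into bins_per_axis bins."""
    return bins_per_axis ** dims


def max_occupied_fraction(n_points: int, bins_per_axis: int, dims: int) -> float:
    """Upper bound on the fraction of cells that n_points can possibly occupy."""
    return min(1.0, n_points / grid_cells(bins_per_axis, dims))


# Hypothetical setting: one million points, 10 bins per axis.
for dims in (1, 2, 3, 6, 10):
    cells = grid_cells(10, dims)
    frac = max_occupied_fraction(1_000_000, 10, dims)
    print(f"dims={dims:2d}  cells={cells:>14,d}  occupied <= {frac:.6f}")
```

With the number of points held fixed, the share of the space that the data can even touch collapses as dimensions are added, which is the sparsity described above.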
There are numerous approaches to dimensionality reduction, with Principal Component Analysis (PCA) and auto-encoders among the most widely used methods for linear and non-linear reduction, respectively.
Auto-encoders are feed-forward neural networks used to reduce the dimensionality of an input vector, converting it from a sparse representation into a dense one. The simplest auto-encoder has a single hidden layer whose number of neurons equals the desired output dimensionality. What distinguishes an auto-encoder from an ordinary feed-forward network is that the target vector is the same as the input vector. The input-to-hidden weights are forced to learn dense representations of the sparse input vectors, while the hidden-to-output weights are tuned so that the dense representation can be used to reasonably recreate the input vector. Hence, the network is forced to reduce dimensions without losing information (much information, anyway). So, instead of using the high-dimensional input vector directly, feed it to the trained auto-encoder and use the hidden activation as its replacement in your machine learning applications. Auto-encoders can be as deep as you wish. Sometimes a single hidden layer is not enough to encode the input vector; in that case, we can add as many hidden layers as needed and take the activation of the appropriate hidden layer (typically the narrowest one) as the dense representation of the input vector.
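The single-hidden-layer case can be sketched in a few lines of NumPy. This is only a minimal illustration under my own assumptions, not a reference implementation: the tanh hidden activation, linear output, mean squared error loss, plain gradient descent, and all hyperparameters and function names (`train_autoencoder`, `encode`) are choices made for the example. The input is a set of 8 one-hot (sparse) vectors compressed to 3 dense dimensions.

```python
import numpy as np


def train_autoencoder(X, hidden_dim, epochs=2000, lr=0.5, seed=0):
    """Train a one-hidden-layer auto-encoder; the target is the input itself."""
    rng = np.random.default_rng(seed)
    input_dim = X.shape[1]
    W_enc = rng.normal(0.0, 0.1, size=(input_dim, hidden_dim))  # input-to-hidden
    b_enc = np.zeros(hidden_dim)
    W_dec = rng.normal(0.0, 0.1, size=(hidden_dim, input_dim))  # hidden-to-output
    b_dec = np.zeros(input_dim)
    losses = []
    for _ in range(epochs):
        # Forward pass: dense code, then linear reconstruction of the input.
        H = np.tanh(X @ W_enc + b_enc)
        X_hat = H @ W_dec + b_dec
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backward pass for the mean squared error.
        grad_out = 2.0 * err / err.size
        grad_W_dec = H.T @ grad_out
        grad_b_dec = grad_out.sum(axis=0)
        grad_pre = (grad_out @ W_dec.T) * (1.0 - H ** 2)  # tanh derivative
        grad_W_enc = X.T @ grad_pre
        grad_b_enc = grad_pre.sum(axis=0)
        # Gradient descent step.
        W_enc -= lr * grad_W_enc
        b_enc -= lr * grad_b_enc
        W_dec -= lr * grad_W_dec
        b_dec -= lr * grad_b_dec
    return (W_enc, b_enc, W_dec, b_dec), losses


def encode(X, params):
    """Return the hidden activation, i.e. the dense representation."""
    W_enc, b_enc, _, _ = params
    return np.tanh(X @ W_enc + b_enc)


X = np.eye(8)                                   # 8 sparse one-hot vectors
params, losses = train_autoencoder(X, hidden_dim=3)
codes = encode(X, params)                       # 8 dense 3-dimensional vectors
print(codes.shape, losses[0], losses[-1])
```

Only `encode` is needed once training is done; the decoder weights exist purely to force the hidden layer to keep enough information to rebuild the input.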
A variation of this technique is employed in Word2Vec, a widely used word embedding technique (Exploring Word2Vec). Since the target vector in an auto-encoder is the same as the input vector, each data point is treated independently of the others, i.e. it is assumed that there is no semantic relation between data points. Words, however, are semantically linked to each other, and especially to their neighbouring words. Hence, in Word2Vec, if the input vector is a vector representation of the current word, the target vector is the vector representation of a neighbouring word. This ensures that, along with reducing the dimensionality of the vector representation, the neural net also preserves the semantic relation between neighbouring words. A classic case of two birds, one stone.
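The change from auto-encoder to Word2Vec is mostly a change in how the training pairs are built. The sketch below shows only that step, for the "current word in, neighbouring word out" setup described above (usually called skip-gram). The toy sentence, the window size of 1, and the helper name `skipgram_pairs` are assumptions made for illustration; a real Word2Vec implementation adds much more, such as subsampling and negative sampling.

```python
import numpy as np


def skipgram_pairs(tokens, window=1):
    """Build (current word, neighbouring word) training pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox".split()
pairs = skipgram_pairs(sentence, window=1)

# One-hot encode both sides: the input is the current word,
# the target is the neighbour (not the input itself, as in an auto-encoder).
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}
one_hot = np.eye(len(vocab))
inputs = one_hot[[vocab[center] for center, _ in pairs]]
targets = one_hot[[vocab[neighbour] for _, neighbour in pairs]]

for center, neighbour in pairs:
    print(center, "->", neighbour)
```

Feeding `inputs` and `targets` to a network with a narrow hidden layer gives the same architecture as the auto-encoder, except that the hidden layer now has to encode which words appear near each other.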
We can further tweak auto-encoders along the lines of Word2Vec to make the representation work for us, depending on the use case. We can replace the target vector with any suitable target variable (it is the input vector itself in the case of auto-encoders, and the neighbouring word in the case of Word2Vec). This serves the dual purpose of reducing dimensionality and establishing some semantics between data points. At a high level, we can view this as a clustering activity in which dense vectors with similar target values are clustered together in the reduced vector space. Since these vectors already embed some semantic information, it becomes easier for other machine learning models to leverage this information and make better predictions.
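As a rough sketch of this idea, the network below keeps the narrow hidden layer but swaps the target for a class label. Everything specific here is an assumption for illustration: the six one-hot inputs, the two made-up labels, the tanh hidden layer, the softmax output with cross-entropy loss, and the function name `train_supervised_embedding`.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_supervised_embedding(X, y, n_classes, embed_dim=2,
                               epochs=1000, lr=0.5, seed=0):
    """Learn a dense hidden layer by predicting a label instead of the input."""
    rng = np.random.default_rng(seed)
    n_samples, input_dim = X.shape
    Y = np.eye(n_classes)[y]  # one-hot targets
    W_enc = rng.normal(0.0, 0.1, size=(input_dim, embed_dim))
    b_enc = np.zeros(embed_dim)
    W_out = rng.normal(0.0, 0.1, size=(embed_dim, n_classes))
    b_out = np.zeros(n_classes)
    for _ in range(epochs):
        H = np.tanh(X @ W_enc + b_enc)           # dense embedding
        P = softmax(H @ W_out + b_out)           # predicted label distribution
        grad_logits = (P - Y) / n_samples        # cross-entropy gradient
        grad_pre = (grad_logits @ W_out.T) * (1.0 - H ** 2)
        W_out -= lr * (H.T @ grad_logits)
        b_out -= lr * grad_logits.sum(axis=0)
        W_enc -= lr * (X.T @ grad_pre)
        b_enc -= lr * grad_pre.sum(axis=0)
    return W_enc, b_enc, W_out, b_out


X = np.eye(6)                                    # six sparse one-hot items
y = np.array([0, 0, 0, 1, 1, 1])                 # hypothetical target labels
W_enc, b_enc, W_out, b_out = train_supervised_embedding(X, y, n_classes=2)

embeddings = np.tanh(X @ W_enc + b_enc)          # reuse these in other models
predictions = softmax(embeddings @ W_out + b_out).argmax(axis=1)
print(embeddings.round(2))
print(predictions)
```

Printing `embeddings` lets you check whether items that share a label end up near each other in the 2-dimensional space, which is the clustering view described above.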
More information on non-linear dimensionality reduction can be found on the following wiki page, which I think is very detailed: Nonlinear Dimensionality Reduction.