2013年6月28日 星期五

Nonlinear dimensionality reduction by locally linear embedding

Locally linear embedding (LLE) is an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs.
Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single
global coordinate system of lower dimensionality, and its optimizations do not involve local minima.

The problem of nonlinear dimensionality reduction

Black outlines in (B) and (C) show the neighborhood of a single point. 
PCA Map faraway data points to nearby points in the plane, failing to identify the underlying structure of the manifold. 



 Steps of locally linear embedding:
(1) Assign neighbors to each data point Xi  (for example byusing the K nearest neighbors).
(2) Compute the weights Wij that best linearly reconstruct Xi from its neighbors, solving the
constrained least-squares problem



(3) Compute the low-dimensional embedding vectors Yi


For implementations of LLE, the algorithm has only one free parameter: the number of neighbors, K.



Images of faces  mapped into the embedding space described by the first two coordinates of LLE.

Large-scale machine learning at twitter

This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform.

This paper means that machine learning is just another Pig script, which allows seamless integration
with existing infrastructure for data management, scheduling, and monitoring in a production environment,

This paper has having three contributions.

  1. Provide an overview of Twitter's analytics stack
  2. Describe Pig extensions that allow seamless integration of machine learning capabilities into this production platform.
  3. Identify stochastic gradient descent and ensemble methods as being particularly amenable to large-scale machine learning.

Training Models

Illustration of how learners are integrated into Pig storage functions. 

By controlling the number of reducers in the nal MapReduce job, we can control the number of models constructed: on the left, a single classi er, and on the right, a two-classi er ensemble.


Their machine learning algorithms can be divided into two classes: batch learners and online learners.

Batch learners require all data to be held in memory, and therefore the Pig storage functions wrapping such learners must first internally buffer all training instances before training. 

Online learners have no such restriction: the Pig storage function simply streams through incoming instances,feeds them to the learner, and then discards the example. 



Conclusion:
Pig allows us to seamlessly scale down onto individual servers or even laptops by running in "local mode".
And the paper describe the Stochastic Gradient Descent and Ensemble Methods application.

Online dictionary learning for sparse coding

Sparse coding—that is, modelling data vectors as sparse linear combinations of basis elements.

This paper focuses on learning the basis set, also called dictionary, to adapt it to specific data, and propose an online optimization algorithm for dictionary learning, based on stochastic approximations.


Most recent algorithms for dictionary learning are second-order iterative batch procedures, accessing the
whole training set at each iteration in order to minimize a cost function under some constraints.
They cannot effectively handle very large training sets, or dynamic training data changing over time,

This paper makes three main contributions.

  1. Cast  the dictionary learning problem as the optimization of a smooth nonconvex objective function over a convex set,
  2. Propose in  an iterative online algorithm that solves this problem by efficiently minimizing at each step a quadratic surrogate function of the empirical cost over the set of constraints. 
  3. The algorithm is significantly faster than previous approaches to dictionary learning on both small and large datasets of natural images.




 Optimizing the Algorithm

Handling Fixed-Size Datasets.

In practice, although it may be very large, the size of the training set is often finite

Mini-Batch Extension.

Improve the convergence speed of our algorithm by drawing η > 1 signals at each iteration instead of a single one, which is a classical heuristic in stochastic gradient descent algorithms.

Purging the Dictionary from Unused Atoms. 

Every dictionary learning technique sometimes encounters situations
where some of the dictionary atoms are never (or very seldom) used, which happens typically with a very bad intialization.



Hive - A Warehousing Solution Over a Map-Reduce Framework

Hadoop is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware.
However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.

Hive, an open-source data warehousing solution built on top of Hadoop.
Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into map-reduce jobs executed on Hadoop.

Data Model

Data in Hive is organized into:
  1. Tables - These are analogous to tables in relational databases. Each table has a corresponding HDFS directory.
    • The data in a table is serialized and stored in files within that directory.
  2. Partitions - Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory.
  3. Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table.

Query Language

Hive provides a SQL-like query language called HiveQL which supports select, project, join, aggregate, union all and sub-queries in the from clause.
HiveQL is also very extensible. It supports user de ned column transformation (UDF) and aggregation (UDAF) functions implemented in Java.
Users can embed custom map-reduce scripts written in any language using a simple row-based streaming interface,

HIVE ARCHITECTURE

Hive Architecture

  • External Interfaces - Hive provides both user interfaces like command line (CLI) and web UI, and application programming interfaces (API)
  • The Metastore is the system catalog
  • The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.
  • The Compiler is invoked by the driver upon receivin a HiveQL statement.

Metastore

  • Database - is a namespace for tables.
  • Table - Metadata for table contains list of columns and their types, owner, storage and SerDe information.
  • Partition - Each partition can have its own columns and SerDe and storage information.

Compiler

  • The Parser transforms a query string to a parse tree representation.
  • The Logical Plan Generator converts the internal query representation to a logical plan, which consists of a tree of logical operators.
  • The Optimizer performs multiple passes over the logical plan 

A Survey on Transfer Learning

This paper is a survey paper, it summary the spirit of most existed transfer learning algorithm.


What is Transfer Learning on Machine Learning


Machine learning methods work well only under a common assumption:
The training and test data are drawn from the same feature space and the same distribution. When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data.
It is expensive or impossible to recollect the needed training data and rebuild the models.
Knowledge transfer or transfer learning between task domains would be desirable.
Transfer the classification knowledge into the new domain.
Different learning processes between (a) traditional machine learning and (b) transfer learning.


OVERVIEW

This paper focus on transfer learning for classificationregression, and clustering problems that are related more closely to data mining tasks.

In transfer learning, we have the following three main research issues: 
1) what to transfer, 
2) how to transfer, 
3) when to transfer.

“What to transfer” asks which part of knowledge can be transferred across domains or tasks. 
“When to transfer” asks in which situations, transferring skills should be done.

Most current work on transfer learning focuses on 
“What to transfer” and “How to transfer”.



“What to transfer” has four cases


1. 
Context can be referred to as 
instance-based transfer learning

2.
Feature-representation-
transfer approach

3. 
Parameter-transfer 
approach

4. 
The relational knowledge- 
transfer problem

Below are table and overview of Transfer Learning setting.
The detail are in the reference paper and it is too much .


Different Settings of Transfer Learning


An overview of different settings of transfer.

Semi-Supervised Hashing for Scalable Image Retrieval

Unsupervised hashing methods show good performance with metric distances but, in image search, semantic similarity is usually given in terms of labeled pairs of images.
There exist supervised hashing methods that can handle such semantic similarity but they are prone to overfitting when labeled data is small or noisy.Moreover, these methods are usually very slow to train.


Hashing methods can be divided into two main categories:unsupervised methods and supervised methods.

1. Unsupervised methods use just the unlabeled data X to generate binary codes for the given points.
Locality Sensitive Hashing (LSH), Spectral Hashing (SH).

2. Supervised hashing methods use with labeled data.
A Boosting Similarity Sensitive Coding (BoostSSC), Restricted Boltzmann Machines (RBMs)

In this paper propose a Semi-Supervised Hashing (SSH) technique that can leverage semantic similarity using labeled data while remaining robust to overfitting.
A supervised term tries to minimize the empirical error on the labeled data while an unsupervised term provides effective regularization by maximizing desirable properties

Semi-Supervised Hashing

Fraction of pairs are associated with two categories of label information, M and C.
M is denoted as a neighbor-pair in which xi and xj are either neighbors in a metric space or share common class labels.
C is called a nonneighbor-pair if two samples are far away in metric space or have different class labels.

Formulation

We want to learn a W that gives the same bits for (xi, xj) belong M and different bits for (xi, xj) belong C.

Define the following objective function measuring the empirical accuracy on the labeled data for a
family of hash functions H:

Defining a matrix S incorporating the pairwise labeled information from Xl as

 The objective function J(H) can be represented as

Learn optimal hash functions H by maximizing the modified objective function with constraints
as:

Relaxing Objective Function

Relax the balancing constraint Sigma hk(xi) = 0 by first showing that this constraint is equivalent to maximizing the variance for the kth bit


Relaxing Orthogonality Constraints

It is well known that for most real-world datasets, most of the variance is contained in top few projections. The orthogonality constraints force one to progressively pick those directions that have very low variance, substantially reducing the quality of lower bits.
Hence, instead of imposing hard orthogonality constraints, authors convert them into a penalty term added to the objective function.


Semi-supervised paradigm is to learn efficient hash codes which can handle semantic similarity/dissimilarity among the data points.
The proposed method leads to a very simple eign-decomposition based solution which is extremely efficient.


The paper is concerned with learning to rank, which is to construct a model or a function for ranking objects.

There are advantages with taking the pairwise approach.
1. Existing methodologies on classification can be directly applied.
2. The training instances of document pairs can be easily obtained in certain scenarios.

There are also problems with the approach
1. The objective of learning is formalized as minimizing errors in classification of document pairs, rather than minimizing errors in ranking of documents
2. The training process is computationally costly, as the number of document pairs is very large.
3. The assumption of that the document pairs are generated.
4. The number of generated document pairs varies largely from query to query, which will result in training a model biased toward queries with more document pairs

This paper propose employing what we call the listwise approach, in which document lists instead of document pairs are used as instances in learning.

The major contributions of this paper include
 (1) proposal of the listwise approach
 (2) formulation of the listwise loss function on the basis of probability models
 (3) development of the ListNet method
 (4) empirical verification of the effectiveness of the approach.


Listwise Approach

The objective of learning is formalized as minimization of the total losses with respect to the training data.















Probability Models

Permutation Probability

Define the permutation probability, so that it has desirable properties for representing the likelihood of a permutation (ranking list), given the ranking function.
Given two lists of scores, we can first calculate two permutation probability distributions from them, and then calculate the distance between the two distributions as the listwise loss function. Since the number of permutations is n!, however, the calculation might be intractable.

Top One Probability

The top one probability of an object represents the probability of its being ranked on the top, given the scores of all the objects.
Theorem 6 shows that we can calculate top one probability, which is efficient.

With the use of top one probability, given two lists of scores we can use any metric to represent the distance
(listwise loss function) between the two score lists. For example, when we use Cross Entropy as metric.
listwise loss function becomes:

Learning Method: ListNet

Authors employ a new learning method for optimizing the listwise loss function based on top one probability, with Neural Network as model and Gradient Descent as optimization algorithm.



The key issue for the listwise approach is to define a listwise loss function.
Authors proposed employing a probabilistic method to solve it. Specifically, we make use of probability models: permutation probability and top one probability to transform ranking scores into probability distributions.
We can then view any metric between probability distributions (e.g., Cross Entropy) as the listwise loss function.

Support vector learning for ordinal regression

This task is referred to as ordinal regression and is complementary to the standard machine learning tasks of classification and metric regression.

In ordinal regression, authors consider a problem which shares properties of both classification  and metric regression.and then present a distribution independent model for ordinal regression, which is based on a loss function that acts on pairs of ranks.


A Risk Formulation for Ordinal Regression



























Thus, the problem of ordinal regression can be reduced to a classification problem on pairs of

objects.





Support Vector Machines for Ordinal Regression

1.Consider a linear function





2. U(\bold{x}) incurs no error if





3. Finite margin 
between the n-dimensional feature vectors Xi(1)-Xi(2)of classes Zi=+1and Zi=-1

4. define parallel hyperplanes passing through each pair  (Xi(1), Xi(2)) by

5.minimizing the squared norm
 under constraint 4.



This paper showed that every ordinal regression problem corresponds to a unique preference learning problem on pairs of objects.
This result builds the link between ordinal regression and classification methods on pairs of objects and allows for a theoretical treatment in the framework of classification.

Latent Dirichlet allocation

The goal is to find short descriptions of the members of a collection that enable efficient processing of large collections while preserving the essential statistical relationships.


Probabilistic LSI (pLSI) model, also known as the aspect model.
There are several problems:
(1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting.
 (2) it is not clear how to assign probability to a document outside of the training set.

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is
that documents are represented as random mixtures over latent topics, where each topic is characterized
by a distribution over words.
Consider mixture models that capture the exchangeability of both words and documents.

Mathematical equation:

probability density on the simplex:

Given the parameters a and b, the joint distribution of a topic mixture q, a set of N topics z, and
a set of N words w is given by:

Graphical model representation of LDA




Relationship with other latent variable models
LDA to simpler latent variable models for text—the unigram model, a

mixture of unigrams, and the pLSI model.

Graphical model representation of different models of discrete data.

The topic simplex for three topics embedded in the word simplex for three words.
The pLSI model induces an empirical distribution on the topic simplex denoted by x. 
LDA places a smooth distribution on the topic simplex denoted by the contour lines.


Actually I am not well-understanding to this paper. There are too much mathematical equations.


Probabilistic latent semantic indexing

This paper present an approach to automated document indexing which is based on a statistical
latent class model for factor analysis of count data.


Many retrieval methods are based on simple word matching strategies to determine the rank of relevance of a document with respect to a query.
They has drawbacks mainly due to the ambivalence of words and their unavoidable lack of precision as well as due to personal style and individual difference in word usage.

Latent Semantic Analysis (LSA) is an approach to automatic indexing and information retrieval that attempts
to overcome these problems by mapping documents as well as terms to a representation in the so called latent semantic space.
However it has a number of de ficits, mainly due to its unsatisfactory statistical foundation.

Probabilistic Latent Semantic Analysis (PLSA) - that has a solid statistical foundation, since it is based on the likelihood principle and defines a proper generative model of the data.
PLSA allows to deal with polysemous words and to explicitly distinguish between different meanings
and different types of word usage.


The Model 

Hidden topic: z \in Z = \{z_1, ..., z_K\}
Word (Term): w \in W = \{w_1, ..., w_M\}
Document: d \in D = \{d_1, ..., d_N\}


Expectation Maximization (EM) algorithm

E- step

 M-step re-estimation 

And more author propose a generalization of maximum likelihood for mixture models -
called tempered EM (TEM) - which is based on entropic regularization and is closely related to a method known as deterministic annealing.

Modify the E-step

The main advantage of TEM in context is to avoid over fitting.


Geometry of the Aspect Model

A continuous latent space is obtained within the space of all multinomial distributions.
This can also be thought of in terms of dimensionality reduction and the sub-simplex can be identified with a probabilistic latent semantic space.
 Sketch of the probability sub-simplex spanned by the aspect model.



This is a method for automated indexing based on a statistical latent class model. This approach has important theoretical advantages over standard LSI, since it is based on the likelihood principle.
It can also take advantage of statistical standard methods for model fitting, over fitting control, and model combination.




2013年6月27日 星期四

Efficient visual search of videos cast as text retrieval


This paper describe object retrieval which searches for and localizes all the occurrences of an object in
a video, given a query image of the object. The object is represented by a set of viewpoint invariant region descriptors.
Efficient retrieval is achieved by employing methods from statistical text retrieval.




Steps:
VIEWPOINT INVARIANT DESCRIPTION

1. Employ the technology of viewpoint invariant segmentation developed for
wide baseline matching

Regions are detected in a viewpoint invariant manner and regions are detected in a viewpoint invariant manner.

2. The second type of region is constructed by selecting areas from an intensity watershed image segmentation.

The SA regions tend to be centred on corner like features, and the MS regions correspond to blobs of high contrast with respect to their surroundings such as a dark window on a grey wall.


3.Each elliptical affine covariant region is represented by a 128- dimensional vector using the SIFT descriptor

 Combining the SIFT descriptor with affine covariant regions gives region description

vectors which are invariant to affine transformations of the image.

To reduce noise and reject unstable regions, information is aggregated over a sequence of frames. The regions detected in each frame of the video are tracked using a simple constant velocity dynamical model and correlation.

BUILDING A VISUAL VOCABULARY

The objective here is to vector quantize the descriptors into clusters which will be the visual ‘words’ for text retrieval.
The vector quantization is carried out by K-means clustering.

Samples of normalized affine covariant regions from clusters
corresponding to a single visual word: (a–c) Shape Adapted regions; (d–f)
Maximally Stable regions.


A. Term frequency–inverse document frequency weighting

where nid is the number of occurrences of word i in document
d, nd is the total number of words in the document d, Ni is the
number of documents containing term i, and N is the number of
documents in the whole database.

The weighting is a product of two terms: the word frequency, nid/nd, and the inverse document
frequency, log N/Ni.

At the retrieval stage documents are ranked by the normalized scalar product (cosine of angle)

B. Stop list
Using a stop list analogy the most frequent visual words that occur in almost all images are suppressed.

C. Spatial consistency
Spatial consistency can be measured quite loosely by requiring that neighbouring matches in the query region lie in a surrounding area in the retrieved frame.

Illustration of spatial consistency voting. To verify a pair of matching
regions (A,B) 

The final score of the frame is determined by summing the spatial consistency votes, and adding
the frequency score sim(vq, vd)
Including the frequency score (which ranges between 0 and 1) disambiguates ranking amongst frames which receive the same number of spatial consistency votes.



The similarity between document retrieval using a bag-of-words, and frame retrieval using a bag-of-visual words:

1. Visual features overlap in the image, some spatial information is implicitly preserved
2. An image query typically contains many more visual words than a text query
3. Internet search engines exploit cues. This query independent rank provides a general indicator of a quality of a web-page and enables more efficient and in some cases more accurate retrieval.


2013年3月13日 星期三

Aggregating local descriptors into a compact image representation



Image search on a very large scale, where three constraints have to be considered jointly:
the accuracy of the search, its efficiency, and the memory usage of the representation


Author propose aggregating local image descriptors into a vector of limited dimension, which can be viewed as a simplification of the Fisher kernel representation.

Obtains a significantly higher accuracy with, typically, a 20-byte representation. This is obtained by optimizing:

  1. the representation, i.e., how to aggregate local image descriptors into a vector representation;
  2. the dimensionality reduction of these vectors;
  3. the indexing algorithm.

Two contribution of this work:


  1. proposing a representation that provides excellent search accuracy with a reasonable vector dimensionality
  2. show the advantage of jointly optimizing the trade-off between the dimensionality reduction and the indexation algorithm

VLAD: vector of locally aggregated descriptors


  1. learn a codebook C = {c1, ...ck} of k visual words with k-means. Each local descriptor x is associated to its nearest visual word
  2. The idea of the VLAD descriptor is to accumulate, for each visual word c , the differences    x−ci of the vectors x assigned to ci, This characterizes the distribution of the vectors with respect to the center.                                                                                                                         
  3. A Component of v is obtained as a sum over all the image descriptors:


Images and corresponding VLAD descriptors, fork=16 centroids (D=16×128). The components of the descriptor are represented like SIFT, with negative components (see Equation 1) in red




VLAD performance and dimensionality reduction 



Compare VLAD descriptors with BoF: INRIA Holidays Dataset (mAP,%)

Dimension is reduced to from D to D’ dimensions with PCA


Observations:

  • VLAD better than BoF for a given descriptor size 
    • comparable to Fisher kernels for these operating points 
  • Choose a small D if output dimension D’ is small 


Indexing algorithm: searching with quantization 


  • Search/Indexing = distance approximation problem 
  • The distance between a query vector x and a database vector y is estimated by 
 where q(.) is a quantizer 
    • vector-to-code distance 
  • The choice of the quantizer is critical 
    • needs many centroids 
    • regular k-means and approximate k-means can not be used 
      • we typically want k=2^
        64 
        for 64-bit codes 


Product quantization for nearest neighbor search

  • Vector split into m subvectors: 
  • Subvectors are quantized separately by quantizers 
where each qi is learned by k-means with a limited number of centroids 
  • Example: y = 128-dim vector split in 8 subvectors of dimension 16