Friday, June 28, 2013

Nonlinear dimensionality reduction by locally linear embedding

Locally linear embedding (LLE) is an unsupervised learning algorithm that computes low-dimensional, neighborhood-preserving embeddings of high-dimensional inputs.
Unlike clustering methods for local dimensionality reduction, LLE maps its inputs into a single
global coordinate system of lower dimensionality, and its optimizations do not involve local minima.

The problem of nonlinear dimensionality reduction

Black outlines in (B) and (C) show the neighborhood of a single point.
PCA maps faraway data points to nearby points in the plane, failing to identify the underlying structure of the manifold.



Steps of locally linear embedding:
(1) Assign neighbors to each data point Xi (for example, by using the K nearest neighbors).
(2) Compute the weights Wij that best linearly reconstruct Xi from its neighbors, solving the
constrained least-squares problem

    E(W) = Sigma_i | Xi - Sigma_j Wij Xj |^2,  subject to Sigma_j Wij = 1,
    with Wij = 0 whenever Xj is not a neighbor of Xi.

(3) Compute the low-dimensional embedding vectors Yi by minimizing the embedding cost

    Phi(Y) = Sigma_i | Yi - Sigma_j Wij Yj |^2,

    which amounts to finding the bottom (non-constant) eigenvectors of the sparse matrix (I - W)^T (I - W).

In implementations of LLE, the algorithm has only one free parameter: the number of neighbors, K.
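A minimal NumPy sketch of these three steps (the function name lle and the regularization parameter reg are my own; in practice a library routine such as scikit-learn's LocallyLinearEmbedding does the same job):

import numpy as np

def lle(X, K=10, d=2, reg=1e-3):
    """Minimal LLE sketch. X has shape (N, D); returns an (N, d) embedding."""
    N = X.shape[0]

    # Step 1: K nearest neighbors under Euclidean distance.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbors = np.argsort(dists, axis=1)[:, :K]

    # Step 2: reconstruction weights with the sum-to-one constraint.
    W = np.zeros((N, N))
    for i in range(N):
        Z = X[neighbors[i]] - X[i]              # neighbors centered on Xi
        C = Z @ Z.T                             # local Gram matrix (K x K)
        C += reg * np.trace(C) * np.eye(K)      # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(K))
        W[i, neighbors[i]] = w / w.sum()        # enforce Sigma_j Wij = 1

    # Step 3: bottom eigenvectors of (I - W)^T (I - W), discarding the constant
    # eigenvector that belongs to the (near-)zero eigenvalue.
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    _, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:d + 1]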



Images of faces mapped into the embedding space described by the first two coordinates of LLE.

Large-scale machine learning at Twitter

This paper presents a case study of Twitter's integration of machine learning tools into its existing Hadoop-based, Pig-centric analytics platform.

The key point is that machine learning becomes just another Pig script, which allows seamless integration
with existing infrastructure for data management, scheduling, and monitoring in a production environment.

This paper makes three contributions.

  1. Provide an overview of Twitter's analytics stack
  2. Describe Pig extensions that allow seamless integration of machine learning capabilities into this production platform.
  3. Identify stochastic gradient descent and ensemble methods as being particularly amenable to large-scale machine learning.

Training Models

Illustration of how learners are integrated into Pig storage functions. 

By controlling the number of reducers in the final MapReduce job, we can control the number of models constructed: on the left, a single classifier, and on the right, a two-classifier ensemble.


Their machine learning algorithms can be divided into two classes: batch learners and online learners.

Batch learners require all data to be held in memory, and therefore the Pig storage functions wrapping such learners must first internally buffer all training instances before training. 

Online learners have no such restriction: the Pig storage function simply streams through incoming instances, feeds each one to the learner, and then discards it.
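A hedged Python sketch (not Twitter's actual code) of the contrast above: the batch-style wrapper must buffer every instance before it can fit anything, while the online wrapper applies one SGD-style update per instance and keeps nothing.

import numpy as np

class BatchLearnerWrapper:
    """Buffers all (features, label) pairs; trains only once storage finishes."""
    def __init__(self):
        self.buffer = []

    def put_next(self, x, y):
        self.buffer.append((np.asarray(x, dtype=float), y))   # everything must fit in memory

    def finish(self):
        X = np.vstack([x for x, _ in self.buffer])
        y = np.array([y for _, y in self.buffer], dtype=float)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)              # stand-in batch fit
        return w

class OnlineLearnerWrapper:
    """Streams each instance into a logistic-regression SGD learner, then discards it."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def put_next(self, x, y):
        x = np.asarray(x, dtype=float)
        p = 1.0 / (1.0 + np.exp(-self.w @ x))                  # one SGD step; x is not stored
        self.w += self.lr * (y - p) * x

    def finish(self):
        return self.w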



Conclusion:
Pig allows us to seamlessly scale down onto individual servers or even laptops by running in "local mode".
The paper also describes applications of stochastic gradient descent and ensemble methods.

Online dictionary learning for sparse coding

Sparse coding models data vectors as sparse linear combinations of basis elements.

This paper focuses on learning the basis set, also called the dictionary, to adapt it to specific data, and proposes an online optimization algorithm for dictionary learning based on stochastic approximations.


Most recent algorithms for dictionary learning are second-order iterative batch procedures, accessing the
whole training set at each iteration in order to minimize a cost function under some constraints.
They cannot effectively handle very large training sets or dynamic training data that changes over time.

This paper makes three main contributions.

  1. Cast the dictionary learning problem as the optimization of a smooth nonconvex objective function over a convex set.
  2. Propose an iterative online algorithm that solves this problem by efficiently minimizing at each step a quadratic surrogate function of the empirical cost over the set of constraints (see the sketch after this list).
  3. The algorithm is significantly faster than previous approaches to dictionary learning on both small and large datasets of natural images.
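A hedged NumPy sketch of that online algorithm under the unit-norm atom constraint (the function names, the ISTA sparse-coding step, and all parameter values are my own stand-ins for the paper's LARS-Lasso solver and exact bookkeeping):

import numpy as np

def sparse_code(D, x, lam, n_iter=100):
    # ISTA for min_a 0.5*||x - D a||^2 + lam*||a||_1 (a simple stand-in for
    # the LARS-Lasso step used in the paper).
    L = np.linalg.norm(D, 2) ** 2 + 1e-8          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - D.T @ (D @ a - x) / L
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return a

def online_dictionary_learning(X, k, lam=0.1, n_passes=1, seed=0):
    # One signal at a time: accumulate the sufficient statistics A and B, then
    # minimize the quadratic surrogate by block coordinate descent on the atoms.
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    D = rng.standard_normal((m, k))
    D /= np.linalg.norm(D, axis=0)                # unit-norm atoms (the convex constraint set)
    A = np.zeros((k, k))
    B = np.zeros((m, k))
    for _ in range(n_passes):
        for x in X[rng.permutation(len(X))]:
            a = sparse_code(D, x, lam)
            A += np.outer(a, a)
            B += np.outer(x, a)
            for j in range(k):                    # dictionary update, one atom at a time
                if A[j, j] < 1e-12:
                    continue                      # atom unused so far; leave it unchanged
                u = D[:, j] + (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] = u / max(np.linalg.norm(u), 1.0)
    return D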




 Optimizing the Algorithm

Handling Fixed-Size Datasets.

In practice, although it may be very large, the training set is often of finite size.

Mini-Batch Extension.

The convergence speed of the algorithm can be improved by drawing η > 1 signals at each iteration instead of a single one, a classical heuristic in stochastic gradient descent algorithms.
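For reference, scikit-learn ships a mini-batch variant of this online dictionary learning algorithm; a short usage sketch (the random patches are placeholder input, and batch_size plays the role of η):

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

X = np.random.randn(1000, 64)                     # e.g. 8x8 image patches, flattened
dico = MiniBatchDictionaryLearning(n_components=100, alpha=0.1, batch_size=16)
D = dico.fit(X).components_                       # learned dictionary, one atom per row
codes = dico.transform(X[:5])                     # sparse codes for a few patches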

Purging the Dictionary from Unused Atoms. 

Every dictionary learning technique sometimes encounters situations
where some of the dictionary atoms are never (or very seldom) used, which typically happens with a very bad initialization.



Hive - A Warehousing Solution Over a Map-Reduce Framework

Hadoop is a popular open-source map-reduce implementation which is being used as an alternative to store and process extremely large data sets on commodity hardware.
However, the map-reduce programming model is very low level and requires developers to write custom programs which are hard to maintain and reuse.

Hive is an open-source data warehousing solution built on top of Hadoop.
Hive supports queries expressed in a SQL-like declarative language - HiveQL, which are compiled into map-reduce jobs executed on Hadoop.

Data Model

Data in Hive is organized into:
  1. Tables - These are analogous to tables in relational databases. Each table has a corresponding HDFS directory.
    • The data in a table is serialized and stored in files within that directory.
  2. Partitions - Each table can have one or more partitions which determine the distribution of data within sub-directories of the table directory.
  3. Buckets - Data in each partition may in turn be divided into buckets based on the hash of a column in the table; a toy sketch of the resulting directory layout follows below.
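A toy Python sketch of how this layout maps a row to an HDFS location (the warehouse root, the example table, and the exact path scheme are simplified assumptions, not Hive's actual implementation):

def hdfs_location(table, partition_col, partition_val, bucket_col_val, n_buckets,
                  warehouse="/user/hive/warehouse"):
    # Table directory -> partition sub-directory -> one file per bucket,
    # where the bucket is chosen by hashing a column value.
    bucket = hash(bucket_col_val) % n_buckets
    return (warehouse + "/" + table
            + "/" + partition_col + "=" + str(partition_val)
            + "/part-" + format(bucket, "05d"))

print(hdfs_location("status_updates", "ds", "2013-06-28", 42, n_buckets=32))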

Query Language

Hive provides a SQL-like query language called HiveQL which supports select, project, join, aggregate, union all and sub-queries in the from clause.
HiveQL is also very extensible. It supports user defined column transformation (UDF) and aggregation (UDAF) functions implemented in Java.
Users can embed custom map-reduce scripts written in any language using a simple row-based streaming interface.
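A hedged sketch of the Python side of such a streaming script (the column names are made up); Hive invokes it through the TRANSFORM ... USING clause, passing tab-separated rows on stdin and reading tab-separated rows back from stdout:

#!/usr/bin/env python
import sys

for line in sys.stdin:
    user_id, url = line.rstrip("\n").split("\t")   # assumed input columns
    domain = url.split("/")[2] if "://" in url else url
    print(user_id + "\t" + domain)                 # emit one transformed row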

HIVE ARCHITECTURE


  • External Interfaces - Hive provides both user interfaces like command line (CLI) and web UI, and application programming interfaces (API)
  • The Metastore is the system catalog
  • The Driver manages the life cycle of a HiveQL statement during compilation, optimization and execution.
  • The Compiler is invoked by the driver upon receiving a HiveQL statement.

Metastore

  • Database - is a namespace for tables.
  • Table - Metadata for table contains list of columns and their types, owner, storage and SerDe information.
  • Partition - Each partition can have its own columns and SerDe and storage information.

Compiler

  • The Parser transforms a query string to a parse tree representation.
  • The Logical Plan Generator converts the internal query representation to a logical plan, which consists of a tree of logical operators.
  • The Optimizer performs multiple passes over the logical plan and rewrites it.

A Survey on Transfer Learning

This is a survey paper; it summarizes the spirit of most existing transfer learning algorithms.


What is Transfer Learning in Machine Learning


Machine learning methods work well only under a common assumption:
The training and test data are drawn from the same feature space and the same distribution. When the distribution changes, most statistical models need to be rebuilt from scratch using newly collected training data.
It is often expensive or impossible to recollect the needed training data and rebuild the models.
In such cases, knowledge transfer or transfer learning between task domains is desirable:
transfer the classification knowledge learned in one domain into the new domain.
Different learning processes between (a) traditional machine learning and (b) transfer learning.


OVERVIEW

This paper focuses on transfer learning for classification, regression, and clustering problems, which are more closely related to data mining tasks.

In transfer learning, we have the following three main research issues: 
1) what to transfer, 
2) how to transfer, 
3) when to transfer.

“What to transfer” asks which part of knowledge can be transferred across domains or tasks.
“How to transfer” asks which learning algorithms need to be developed to transfer the knowledge.
“When to transfer” asks in which situations transferring skills should be done.

Most current work on transfer learning focuses on 
“What to transfer” and “How to transfer”.



“What to transfer” has four cases:

  1. The instance-transfer approach (instance-based transfer learning): re-weight some labeled data from the source domain for use in the target domain.
  2. The feature-representation-transfer approach: find a good feature representation that reduces the difference between the source and target domains.
  3. The parameter-transfer approach: discover shared parameters or priors between the source and target domain models.
  4. The relational-knowledge-transfer approach: build a mapping of relational knowledge between the source and target domains.

Below are a table and an overview of the transfer learning settings.
The details are in the reference paper and are too extensive to cover here.


Different Settings of Transfer Learning


An overview of the different settings of transfer learning.

Semi-Supervised Hashing for Scalable Image Retrieval

Unsupervised hashing methods show good performance with metric distances but, in image search, semantic similarity is usually given in terms of labeled pairs of images.
There exist supervised hashing methods that can handle such semantic similarity, but they are prone to overfitting when the labeled data is small or noisy. Moreover, these methods are usually very slow to train.


Hashing methods can be divided into two main categories: unsupervised methods and supervised methods.

1. Unsupervised methods use just the unlabeled data X to generate binary codes for the given points.
Examples include Locality Sensitive Hashing (LSH) and Spectral Hashing (SH).

2. Supervised hashing methods make use of labeled data.
Examples include Boosting Similarity Sensitive Coding (BoostSSC) and Restricted Boltzmann Machines (RBMs).

This paper proposes a Semi-Supervised Hashing (SSH) technique that can leverage semantic similarity using labeled data while remaining robust to overfitting.
A supervised term tries to minimize the empirical error on the labeled data, while an unsupervised term provides effective regularization by maximizing desirable properties such as variance and independence of the individual bits.

Semi-Supervised Hashing

A fraction of the pairs are associated with two categories of label information, M and C.
A pair (xi, xj) in M is a neighbor-pair: xi and xj are either neighbors in a metric space or share common class labels.
A pair (xi, xj) in C is a nonneighbor-pair: the two samples are far apart in metric space or have different class labels.

Formulation

We want to learn a W that gives the same bits for pairs (xi, xj) in M and different bits for pairs (xi, xj) in C.

Define the following objective function measuring the empirical accuracy on the labeled data for a
family of hash functions H:

    J(H) = Sigma_k { Sigma_{(xi,xj) in M} hk(xi) hk(xj) - Sigma_{(xi,xj) in C} hk(xi) hk(xj) }

Defining a matrix S incorporating the pairwise labeled information from Xl as

    Sij = 1 if (xi, xj) in M,   Sij = -1 if (xi, xj) in C,   Sij = 0 otherwise,

the objective function J(H) can be represented as

    J(H) = (1/2) tr( H(Xl) S H(Xl)^T ),   with H(Xl) = sign(W^T Xl).

Learn the optimal hash functions H by maximizing this objective subject to the constraints that each bit is balanced, Sigma_i hk(xi) = 0, and that the projection directions in W are orthogonal.

Relaxing Objective Function

Relax the balancing constraint Sigma hk(xi) = 0 by first showing that this constraint is equivalent to maximizing the variance of the kth bit.


Relaxing Orthogonality Constraints

It is well known that for most real-world datasets, most of the variance is contained in top few projections. The orthogonality constraints force one to progressively pick those directions that have very low variance, substantially reducing the quality of lower bits.
Hence, instead of imposing hard orthogonality constraints, authors convert them into a penalty term added to the objective function.


The semi-supervised paradigm learns efficient hash codes that can handle semantic similarity/dissimilarity among the data points.
The proposed method leads to a very simple eigen-decomposition based solution, which is extremely efficient.
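A hedged NumPy sketch of that eigen-decomposition solution as described above (the function names, the eta weighting of the unsupervised term, and the centering step are my own reading of the relaxed, orthogonal variant):

import numpy as np

def ssh_projections(X, Xl, S, n_bits, eta=1.0):
    # Relaxed SSH: W = top eigenvectors of the supervised term Xl S Xl^T plus
    # the unsupervised variance term eta * X X^T.
    # X: (d, n) all data as columns; Xl: (d, l) labeled data; S: (l, l) pairwise label matrix.
    mu = X.mean(axis=1, keepdims=True)
    Xc, Xlc = X - mu, Xl - mu                 # centering approximates the bit-balance constraint
    M = Xlc @ S @ Xlc.T + eta * (Xc @ Xc.T)
    _, eigvecs = np.linalg.eigh(M)            # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :n_bits]       # top n_bits eigenvectors as the columns of W

def hash_codes(W, X):
    # Binary codes: sign of the projections (0/1 per bit); X should be centered
    # with the same mean used when learning W.
    return (W.T @ X > 0).astype(np.uint8)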