BIG DATA ANALYSIS AND MINING
CH1
What is BIG DATA?
Big data is a term used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques.
4V model
- Volume: data at rest; terabytes to exabytes of data to process
- Velocity: data in motion; streaming data, milliseconds to seconds to respond
- Variety: data in many forms; structured, unstructured and multimedia data
- Veracity: data in doubt; uncertainty due to data inconsistency & incompleteness, ambiguities, latency...
What is DATA MINING(DM)?
DM consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data
Main tasks in DM?
- Association rule mining
- Cluster analysis
- Classification/prediction
- Outlier detection
CH2 Subspace learning
curse of dimensionality
The Curse of Dimensionality refers to a series of problems that arise as the dimensionality of data increases in high-dimensional spaces:
- the more features we have, the harder the data is to process
- distances between points lose their discriminative power (they tend to concentrate)
- relevant and irrelevant attributes coexist
- there are correlations among subsets of attributes
dimension reduction
Linear method
- Principal component analysis (PCA):
- idea: find projections that capture the largest amount of variation in the data, i.e., find the eigenvectors of the covariance matrix; these eigenvectors define the new space.
- Definition: given a set of data X, find the principal axes, i.e., the orthonormal axes onto which the variance retained under projection is maximal.
- objective: the retained variance is maximal
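A minimal sketch of this idea (assuming NumPy; the function name `pca` is illustrative): center the data, eigendecompose the covariance matrix, and project onto the top-k eigenvectors.
```python
import numpy as np

def pca(X, k):
    """X: (n_samples, n_features) array; returns the k-dimensional projection."""
    X_centered = X - X.mean(axis=0)              # center the data
    cov = np.cov(X_centered, rowvar=False)       # covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    components = eigvecs[:, order[:k]]           # principal axes (d x k)
    return X_centered @ components               # projected data (n x k)

# usage: reduce 5-dimensional toy data to 2 dimensions
X = np.random.randn(100, 5)
Z = pca(X, k=2)
```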
- Multidimensional scaling (MDS)
- objective: Attempt to preserve pairwise distances
Nonlinear method
Goal: to unfold, rather than to project linearly.
- Locally linear embedding (LLE): attempts to maintain the distance relationships between each point and its neighboring points
- identify the neighbors of each data point
- compute the weights that best linearly reconstruct each point from its neighbors (in the high-dimensional space)
- find the low-dimensional embedding vectors that are best reconstructed by the weights determined in step 2
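A hedged sketch of these three steps using scikit-learn's LocallyLinearEmbedding (assumed to be available), which performs the neighbor search, weight computation, and final embedding internally:
```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.randn(200, 10)                     # toy high-dimensional data
lle = LocallyLinearEmbedding(n_neighbors=10,     # step 1: identify neighbors
                             n_components=2)     # target embedding dimension
Z = lle.fit_transform(X)                         # steps 2-3: weights + embedding
print(Z.shape)                                   # (200, 2)
```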
- Laplacian eigenmaps (EigenMaps): similar to LLE, but different in the weight setting and the objective function
- Isometric feature mapping (ISOMAP): ISOMAP is a special type of MDS method. The difference is that ISOMAP replaces the Euclidean distance in MDS with the shortest-path distance between two points in the neighbor graph, and thus better fits manifold data.
- construct the neighbor graph
- compute the shortest path length (geodesic distance) between each pair of data points
- recover the low-dimensional embedding of the data by MDS, preserving those geodesic distances/path lengths
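A minimal sketch of these steps (assuming scikit-learn and SciPy, and assuming the neighbor graph is connected): build the neighbor graph, compute shortest-path (geodesic) distances, then apply classical MDS to them.
```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, k=2):
    # 1. construct the neighbor graph with Euclidean edge weights
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # 2. geodesic distances = shortest path lengths in the graph
    D = shortest_path(G, directed=False)         # assumes a connected graph
    # 3. classical MDS: double-center squared distances, take top eigenvectors
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])

Z = isomap(np.random.randn(100, 5))
```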
- Stochastic neighbor embedding (SNE): uses probabilities to characterize the neighborhood relationships in the original and embedded spaces, and achieves nonlinear dimensionality reduction by matching the two distributions via the Kullback-Leibler divergence.
- in the original space, the probability that point i picks j as its neighbor: p(j|i) = exp(-||xi - xj||^2 / 2σi^2) / Σ_{k≠i} exp(-||xi - xk||^2 / 2σi^2)
- in the embedding space, the neighbor distribution: q(j|i) = exp(-||yi - yj||^2) / Σ_{k≠i} exp(-||yi - yk||^2)
- objective function: C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p(j|i) log(p(j|i) / q(j|i)); by minimizing the KL divergence, the consistency of the distributions P and Q is maintained.
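A toy sketch of these quantities (assuming NumPy; σ is fixed to 1 and no gradient optimization is performed), just to make the P, Q, and KL terms concrete:
```python
import numpy as np

def neighbor_probs(Z):
    """p(j|i) proportional to exp(-||zi - zj||^2); sigma fixed to 1."""
    sq_d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    P = np.exp(-sq_d)
    np.fill_diagonal(P, 0.0)                     # a point is not its own neighbor
    return P / P.sum(axis=1, keepdims=True)      # normalize each row

X = np.random.randn(50, 10)                      # original space
Y = np.random.randn(50, 2)                       # (randomly initialized) embedding
P, Q = neighbor_probs(X), neighbor_probs(Y)
mask = ~np.eye(len(P), dtype=bool)               # exclude the diagonal
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask])) # the objective SNE minimizes over Y
print(kl)
```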
mindmap
- distance based
  - global distance: MDS, ISOMAP
  - local distance: LLE, EigenMaps
Subspace clustering
CH3 HASHING
why do we need hashing?
- handle “curse of dimensionality”
- reduce storage cost
- Accelerate the query speed
case study:
K-shingle:
- what is it? A sequence of k characters that appears in the document; these sequences are used to represent the document
- Assumption: documents that have lots of shingles in common have similar text, even if the text appears in a different order <=> the proportion of shared shingles reflects the similarity of the text
- carefully choose k: for short documents k=5 is ok; k=10 is better for long documents
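A minimal sketch of character-level shingling plus the Jaccard similarity of the resulting shingle sets (function names are illustrative):
```python
def shingles(text, k=5):
    """All k-character substrings of the document, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox jumped over a lazy dog")
print(jaccard(a, b))   # high value: many shared shingles despite small edits
```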
MinHash:
Preliminary:
- Jaccard Similarity of sets
Sim(C1,C2) = |intersection(C1,C2)|/|Union(C1,C2)|
- For Boolean Matrices
- rows -> elements of Universal set
- columns -> sets
example:
(0,0) -> this element (row) appears in neither C1 nor C2
definition:
MinHashing represents documents as smaller signature matrices, and approximates the similarity between documents by computing the similarity between their signature columns.
outline
- compute signatures of columns (a signature is a small summary of a column)
- examine pairs of signatures to find similar signatures
- (optional) check that columns with similar signatures are really similar
signatures
- Key idea: hash each column C to a small signature Sig(C), such that
- Sig(C) is small enough that we can fit a signature in main memory for each column.
- Sim(C1,C2) ≈ Sim(Sig(C1),Sig(C2))
- Compute:
- imagine the rows permuted randomly
- hash function h(C) = the index of the first row (in the permuted order) in which column C has a 1
- use several independent hash functions (permutations) to create a signature
- Property of the signature:
- the probability that h(C1) = h(C2) is the same as Sim(C1, C2)
- both equal a/(a+b+c): let a be the number of rows of type (1,1), b of type (1,0) and c of type (0,1); rows of type (0,0) are irrelevant, and under a random permutation the first non-(0,0) row is of type (1,1) with probability a/(a+b+c), which is exactly the Jaccard similarity
implementation
- Problem: it is hard to generate a random permutation over a large number of rows (it takes time and memory, and leads to thrashing), so we use a hash function to approximate a random permutation.
- Details: h(r) gives the position of row r in the permutation (r -> original row, h(r) -> permuted row; one hash function acts as one permutation)
- Example:
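One way the elided example could look in code (a hedged sketch; helper names like `make_hashes` are illustrative): each hash function of the form (a*r + b) mod p plays the role of one random row permutation, and the signature entry is the minimum hash value over the rows in which the column has a 1.
```python
import random

def make_hashes(num_hashes, prime=2**31 - 1, seed=0):
    """Random (a, b, p) parameters; each triple approximates one permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime), prime)
            for _ in range(num_hashes)]

def minhash_signature(shingle_ids, hash_params):
    """shingle_ids: set of integer row ids with a 1 in this column."""
    return [min(((a * r + b) % p) for r in shingle_ids)
            for (a, b, p) in hash_params]

hashes = make_hashes(100)
sig1 = minhash_signature({1, 3, 7, 42}, hashes)
sig2 = minhash_signature({1, 3, 8, 42}, hashes)
est_sim = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)  # ≈ Jaccard similarity
print(est_sim)
```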
Locality-Sensitive Hashing
- Background and problem: we now have lots of data representing lots of objects in main memory; this data could be the MinHash signatures or the objects themselves. We want to compare pairs and find those that are sufficiently similar, but comparing all pairs is too expensive.
- idea: hash columns to many buckets, and make elements of the same bucket candidate pairs.
- algorithm:
- hash columns of signature matrix M several times.
- arrange that similar columns are likely to hash to the same bucket
- candidate pairs are those that hash at least once to the same bucket
- details:
- Divide matrix M into b bands of r rows
- for each band, hash its portion of each column to a hash table with k buckets
- candidate pairs are those that hash at least once to the same bucket
- tune b and r to catch most similar pairs, but few dissimilar pairs
- calculation:
there are 100,000 columns; signatures of 100 integers; we want all 80%-similar pairs; choose 20 bands of 5 integers (rows) each.
1. case of C1, C2 with 80% similarity: the probability that C1 and C2 are identical in one particular band is (0.8)^5 = 0.328; the probability that C1 and C2 are not identical in any of the 20 bands is (1-0.328)^20 ≈ 0.00035 (false negatives)
2. case of C1, C2 with 30% similarity: the probability that C1 and C2 are identical in one particular band is (0.3)^5 = 0.00243; the probability that C1 and C2 are identical in at least one band is smaller than 20*0.00243 = 0.0486 (false positives)
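The banding calculation above can be checked with a few lines of code: with b bands of r rows, a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^b (the function name `candidate_prob` is illustrative).
```python
def candidate_prob(s, b=20, r=5):
    """Probability that a pair with similarity s hashes to the same bucket in >=1 band."""
    return 1 - (1 - s ** r) ** b

print(candidate_prob(0.8))   # ~0.99965 -> false-negative rate ~0.00035
print(candidate_prob(0.3))   # ~0.047   -> false-positive rate, bounded by 20 * 0.3**5
```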
- Application:
LSH-based data compression: key idea: improve the compression ratio by putting similar data blocks together. Theory: after clustering similar blocks, encodings can be shared during compression to improve compression efficiency.
- Learn to Hash:
CH4 Sampling for Big Data
inverse transform sampling
- key idea: sample based on the inverse of the cumulative distribution function; first, generate a sample Y from the uniform distribution, then map Y through the inverse of the CDF, F^-1, to a sample X of the target distribution. Drawback: it is often hard to obtain the inverse function
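A minimal sketch (assuming NumPy) for the Exponential(λ) distribution, whose inverse CDF is known in closed form, F^-1(u) = -ln(1 - u)/λ:
```python
import numpy as np

def sample_exponential(lam, n, rng=np.random.default_rng(0)):
    u = rng.uniform(0.0, 1.0, size=n)      # step 1: uniform samples Y
    return -np.log(1.0 - u) / lam          # step 2: map through the inverse CDF

x = sample_exponential(lam=2.0, n=10_000)
print(x.mean())                            # ≈ 1/lambda = 0.5
```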
Rejection Sampling
- key idea: accept the samples that fall in the region under the graph of the target density function and reject the others
- details:
- the scaled proposal distribution q(x) should always cover the target distribution p(x): use a large positive constant M to magnify q(x) so that M*q(x) >= p(x)
- choice of q(x) matters
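A minimal rejection-sampling sketch (assuming NumPy): the target p(x) is a standard normal, the proposal q(x) is Uniform(-5, 5), and M is chosen so that M*q(x) >= p(x) on the support:
```python
import numpy as np

def rejection_sample(n, rng=np.random.default_rng(0)):
    p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # target density
    q_height = 1 / 10.0                                    # Uniform(-5, 5) density
    M = 0.4 / q_height                                     # M*q(x) = 0.4 >= max p(x)
    samples = []
    while len(samples) < n:
        x = rng.uniform(-5, 5)                             # draw from proposal q
        u = rng.uniform(0, M * q_height)                   # point under M*q(x)
        if u <= p(x):                                      # accept if under p(x)
            samples.append(x)
    return np.array(samples)

x = rejection_sample(5000)
print(x.mean(), x.std())   # ≈ 0, ≈ 1
```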
importance sampling
- key idea: do not reject samples; instead assign a weight to each instance so that the correct distribution is targeted
IS vs. RS
- instances from RS share the same weight, and only some of the instances are kept
- instances from IS have different weights, and all instances are kept
- IS is less sensitive to the choice of the proposal distribution q(x)
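A minimal importance-sampling sketch (assuming NumPy and SciPy): estimate E_p[x^2] = 1 for a standard normal target p from samples of a wider normal proposal q, weighting each sample by p(x)/q(x):
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0, 2, size=50_000)                # draw from proposal q = N(0, 2)
w = norm.pdf(x, 0, 1) / norm.pdf(x, 0, 2)        # importance weights p(x)/q(x)
estimate = np.mean(w * x**2)                     # ≈ E_p[x^2] = 1
print(estimate)
```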
CH5 Data stream mining
Challenges of Data stream
what is data stream?
A data stream is a massive sequence of data objects with some unique features: 1. it arrives one object at a time 2. it is potentially unbounded 3. it exhibits concept drift
challenges:
Data stream: (a) infinite length (b) evolving nature
- single pass handling
- memory limitation
- low time complexity
- concept drift
Concept drift
- concept drift: in predictive analytics and machine learning, concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways.
- in a word, the probability distribution changes
- change in P(C)
- change in P(C|X)
- change in P(X)->virtual concept drift
Concept Drift detection
Distribution-based detector:
- basic idea:
Monitoring the change of data distributions between two fixed or adaptive windows of data(e.g. ADWIN)
- drawback:
- it is hard to determine the window size
- it learns new concepts more slowly
- it may react to virtual concept drift
- example: Adaptive Windowing (ADWIN): the idea is that whenever two "large enough" subwindows of W exhibit "distinct enough" averages, one can conclude that the corresponding expected values are different, and the older portion of the window is dropped.
Error-rate based detector
- basic idea:
Capture concept drift based on the change of the classification performance.(e.g. page-hinkley test, DDM)
- example: a significant increase in the error rate of the algorithm suggests a change in the class distribution; whether an increase is significant is decided by modeling the error rate as a normal distribution:
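A hedged sketch in the spirit of DDM (the 2σ/3σ warning and drift thresholds are the commonly used levels; the class name `SimpleDDM` is illustrative):
```python
import math

class SimpleDDM:
    def __init__(self):
        self.n = 0
        self.p = 1.0            # running error rate
        self.s = 0.0            # its standard deviation
        self.p_min, self.s_min = float('inf'), float('inf')

    def update(self, error):    # error: 1 if the prediction was wrong, else 0
        self.n += 1
        self.p += (error - self.p) / self.n
        self.s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.p + self.s < self.p_min + self.s_min:      # track the best level seen
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s > self.p_min + 3 * self.s_min:  # significant increase
            return "drift"
        if self.p + self.s > self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"
```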
Data stream classification
algorithm request:
- process an example at a time and inspect it only once
- be ready to predict at any time
- use a limited amount of memory
- work in limited amount of time
DT & challenges
- classic DT learners assume all training data can be simultaneously stored in main memory
- disk-based DT learners repeatedly read training data from disk sequentially
- goal: design DT learners that read each example at most once and use a small constant time to process it
VFDT
- A decision tree learning system based on the Hoeffding tree algorithm
- in order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node
- given a stream of examples, use the first ones to choose the root attribute
- once the root attribute is chosen, the succeeding examples are passed down to the corresponding leaves and used to choose the attributes there, and so on recursively.
- use Hoeffding bound to decide how many examples are enough at each node
- G(Xi) is a heuristic measure used to choose the test attribute (e.g. information gain (ratio) / Gini index)
- Xa is the attribute with the highest evaluation value after seeing n examples
- Xb is the attribute with the second-highest value
- Given a confidence parameter δ, if ΔG = G(Xa) - G(Xb) > ε, the Hoeffding bound guarantees that the true ΔG > 0 with probability 1-δ; the node can then be split using Xa, and the succeeding examples are passed down to the new leaves
- Hoeffding bound: ε = sqrt(R^2 ln(1/δ) / (2n)), where R is the range of the evaluation measure and n is the number of examples seen at the node
- Hoeffding's inequality gives an upper bound on the probability that the mean of random variables deviates from its expected value; based on the Hoeffding bound, a small set of samples is enough to decide the split.
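A small sketch of this split decision, assuming R = 1 (information gain with two classes); the helper names are illustrative:
```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def should_split(g_best, g_second, n, delta=1e-7, value_range=1.0):
    # value_range = 1 for information gain with a binary class (log2 of #classes)
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)

print(should_split(g_best=0.25, g_second=0.12, n=2000))   # True once n is large enough
```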
- Strengths and weaknesses:
- strengths: Scales better than traditional methods
- incremental
- weaknesses:
- spends lots of time and needs more examples when there are ties (i.e., when multiple attributes have close G values)
- memory usage increases as the tree expands
- the number of candidate attributes affects the efficiency
CVFDT
- CVFDT BASIC:
- extend VFDT
- maintain VFDT's speed and accuracy
- detect and respond to changes in the example-generating process
- observation & motivation:
- with time-changing concept, the current splitting attribute of some nodes may not be the best any more
- but an outdated subtree may still be better than the best single leaf, particularly if it is near the root node -> so, when the old attribute seems out of date, grow an alternative subtree with the new best attribute at its root instead of replacing the old one at once
- periodically use a batch of samples to evaluate the quality of the trees -> replace the old, out-of-date subtree with the alternative one when the alternative becomes more accurate
SyncStream
- basic idea: prototype-based learning: an intuitive way is to dynamically select short-term and/or long-term representative examples to capture the trend of time-changing concepts.
- online data maintaining: P-Tree
- prototype selection: error-driven representativeness learning and synchronization-inspired constrained clustering
- sudden concept drift: PCA and statistics
- Lazy learning: knn
- Online data maintaining
- the prototype level: models the current concept by selecting the most representative data (prototypes)
- the concept level: stores a list of prototypes -> represents the historical and current concepts
- how to measure the representativeness of data? -> error-driven
- if a data point's label agrees with the prediction of its K nearest neighbors: representativeness +1, else -1
- if the number of prototype data points exceeds the maximum boundary, remove the data with low representativeness; for data whose representativeness is unchanged -> summarization (i.e., such clusters of data can be represented by a smaller set of prototypes)
- Abrupt concept drift detection:
- detect based on PCA: use the angle (between principal directions) to measure the drift
- detect by statistical analysis
Learning on open-set data
open-set detection
- scenario: incomplete knowledge of the world exists at training time, and unknown classes can appear in the test set
- Task: Not only accurately classify the seen classes, but also effectively deal with unseen ones
incremental learning
- Task: maintain old knowledge and balance old/new classes, extract representative examples
Data stream clustering
Framework
- Online phase:
summarize the data into memory-efficient data structures
- Offline phase:
use a clustering algorithm to find the data partition
Micro-cluster:
a micro-cluster is a set of individual data points that are close to each other and will be treated as a single unit in the subsequent offline macro-clustering phase.
Cluster feature
some algorithms use the cluster feature as the micro-cluster's data structure; cluster feature: CF = (N, LS, SS)
- components: N is the number of data points, LS = Σ x_i (linear sum of the points), SS = Σ x_i^2 (squared sum of the points)
- properties:
- Property 1: let C1 and C2 be two disjoint sets of points (micro-clusters); then the cluster feature vector CF(C1 ∪ C2) = CF(C1) + CF(C2)
- Property 2: let C1 and C2 be two sets of points (micro-clusters) with C1 ⊇ C2; then CF(C1 - C2) = CF(C1) - CF(C2)
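A minimal sketch of the CF structure and its additivity property (assuming NumPy; SS is kept as a per-dimension squared sum here):
```python
import numpy as np

class ClusterFeature:
    def __init__(self, dim):
        self.N = 0
        self.LS = np.zeros(dim)   # linear sum of points
        self.SS = np.zeros(dim)   # squared sum of points

    def add_point(self, x):
        self.N += 1
        self.LS += x
        self.SS += x ** 2

    def merge(self, other):       # Property 1: component-wise addition
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

cf1, cf2 = ClusterFeature(2), ClusterFeature(2)
cf1.add_point(np.array([1.0, 2.0]))
cf2.add_point(np.array([3.0, 4.0]))
cf1.merge(cf2)
print(cf1.N, cf1.LS, cf1.centroid())   # 2 [4. 6.] [2. 3.]
```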
CH6 Hadoop/Spark
what is hadoop (design principles) / spark?
what is hadoop
- hadoop is a software framework for distributed processing of large datasets across large clusters of computers
- open-source implementation for MapReduce
- based on a simple programming model called MapReduce
- based on a simple data model: any data will fit
design principles of hadoop
- need to process big data
- need to parallelize computation across thousands of nodes
- commodity hardware (large numbers of low-end, cheap machines working in parallel to solve a computing problem)
- this is in contrast to parallel DBs (small numbers of high-end, expensive machines)
- automatic parallelization & distribution (hidden from end-users)
- fault tolerance and automatic recovery (nodes/tasks sometimes fail and recover automatically)
- clean and simple programming abstraction (only two functions, 'map' and 'reduce', are provided to users)
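To illustrate the map/reduce abstraction mentioned in the last point, here is a toy word count written in plain Python (not the Hadoop API; `map_fn` and `reduce_fn` are illustrative names):
```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for each word in the input line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: sum all counts for one key."""
    return (word, sum(counts))

lines = ["big data analysis", "big data mining"]
grouped = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):      # map phase
        grouped[word].append(count)        # shuffle: group values by key
results = [reduce_fn(w, c) for w, c in grouped.items()]   # reduce phase
print(results)   # [('big', 2), ('data', 2), ('analysis', 1), ('mining', 1)]
```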
spark
Spark's goal was to generalize MapReduce to support new applications within the same engine.
Spark is a fast and general-purpose cluster computing system.
HDFS
- components: 1. a centralized NameNode 2. many DataNodes
- namenode: maintains metadata about files (it manages the FsImage and EditLog files to keep this metadata: FsImage -> (namespace, system properties, block mapping information); EditLog -> (records every change to the file system metadata) -> (the EditLog is replayed to update the FsImage))
- datanode: stores the actual data; files are divided into blocks, and each block is replicated N times (default N = 3)
- HeartBeat & BlockReport: each DataNode periodically sends heartbeats and block reports to the NameNode. The former tells the NameNode that the DataNode is alive; the latter describes the data blocks stored on that DataNode. The NameNode aggregates all block reports and checks whether any file blocks are lost and whether the replication count of each block meets the requirement; if not, it enters safe mode. If the NameNode does not receive a heartbeat from a DataNode for a certain period of time, it marks that node as dead, replicates the blocks that were stored on it to other nodes, and stops sending new computing tasks to it. Note that the NameNode itself may also fail.
- main properties of HDFS
- large: A HDFS instance may consist of thousands of server machines, each storing part of the file system's data
- replication: each data block is replicated many times
- failure: failures are the norm rather than the exception
- fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS (the NameNode constantly checks the DataNodes)
MapReduce
Spark
basic concept
the main abstraction in Spark is the resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost (in fact, it can be understood as a distributed collection, and using it is as simple as operating on a local collection)
RDD
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset
- Transformation: creates a new dataset from an existing one. All transformations in Spark are lazy: they don't compute their results right away; instead, they remember the transformations applied to some base dataset
- why lazy?: 1. optimize the required calculations 2. recover from lost data partition
(An RDD doesn't actually store the data to be computed. Instead, it records where the data is located and the transformation relationships: which methods were called and which functions were passed in.)
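A hedged PySpark sketch (assuming a local pyspark installation) showing lazy transformations versus an action that triggers computation:
```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")
rdd = sc.parallelize(range(10))               # distributed collection (RDD)
doubled = rdd.map(lambda x: x * 2)            # transformation: recorded, not executed
evens = doubled.filter(lambda x: x % 4 == 0)  # another lazy transformation
print(evens.collect())                        # action: triggers the actual computation
print(evens.count())                          # another action reusing the same lineage
sc.stop()
```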
Spark vs Hadoop
- MapReduce:
- great at one-pass computation, but inefficient for multi-pass algorithms.
- No efficient primitives for data sharing
- MapReduce requires separate engines and reads from and writes to DFS at each step.
- Spark:
- extends a programming language with a resilient distributed collection data structure(RDD)
- clean APIs in Java, Python...
- Spark uses the same engine to perform data extraction, model training, and interactive queries