BIG DATA ANALYSIS AND MINING
CH1
What is BIG DATA?
Big data is a term used to describe a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques.
4V model
- Volume: data at rest; terabytes to exabytes of data to process
- Velocity: data in motion; streaming data, milliseconds to seconds to respond
- Variety: data in many forms; structured, unstructured and multimedia data
- Veracity: data in doubt; uncertainty due to data inconsistency & incompleteness, ambiguities, latency...
What is DATA MINING(DM)?
DM consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data
Main tasks in DM?
- Association rule mining
- Cluster analysis
- Classification/prediction
- Outlier detection
CH2 Subspace learning
curse of dimensionality
The Curse of Dimensionality refers to a series of problems that arise as the dimensionality of data increases in high-dimensional spaces:
- the more features we have, the harder the data is to process
- distances between points lose their discriminative power (they tend to concentrate)
- relevant and irrelevant attributes coexist
- there are correlations among subsets of attributes
dimension reduction
Linear method
- Principal component analysis (PCA):
- idea: find projections that capture the largest amount of variation in the data, i.e., find the eigenvectors of the covariance matrix; these eigenvectors define the new space.
- Definition: given a set of data X, find the principal axes, i.e., the orthonormal axes onto which the variance retained under projection is maximal.
- objective: the retained variance is maximal
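A minimal sketch of this idea (assuming NumPy; the function name `pca` is illustrative): center the data, eigendecompose the covariance matrix, and project onto the top-k eigenvectors.
```python
import numpy as np

def pca(X, k):
    """X: (n_samples, n_features) array; returns the k-dimensional projection."""
    X_centered = X - X.mean(axis=0)              # center the data
    cov = np.cov(X_centered, rowvar=False)       # covariance matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
    components = eigvecs[:, order[:k]]           # principal axes (d x k)
    return X_centered @ components               # projected data (n x k)

# usage: reduce 5-dimensional toy data to 2 dimensions
X = np.random.randn(100, 5)
Z = pca(X, k=2)
```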
- Multidimensional scaling (MDS)
- objective: Attempt to preserve pairwise distances
Nonlinear method
Goal: to unfold, rather than to project linearly.
- Locally linear embedding (LLE): attempts to maintain the distance relationships between each point and its neighboring points
- identify the neighbors of each data point
- compute the weights that best linearly reconstruct each point from its neighbors (in the high-dimensional space)
- find the low-dimensional embedding vectors that are best reconstructed by the weights determined in step 2
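A hedged sketch of these three steps using scikit-learn's LocallyLinearEmbedding (assumed to be available), which performs the neighbor search, weight computation, and final embedding internally:
```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

X = np.random.randn(200, 10)                     # toy high-dimensional data
lle = LocallyLinearEmbedding(n_neighbors=10,     # step 1: identify neighbors
                             n_components=2)     # target embedding dimension
Z = lle.fit_transform(X)                         # steps 2-3: weights + embedding
print(Z.shape)                                   # (200, 2)
```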
- Laplacian eigenmaps (EigenMaps): similar to LLE, but different in the weight setting and the objective function
- Isometric feature mapping (ISOMAP): ISOMAP is a special type of MDS method. The difference is that ISOMAP replaces the Euclidean distance in MDS with the shortest-path distance between two points in the neighbor graph, and thus better fits manifold data.
- construct the neighbor graph
- compute the shortest path length (geodesic distance) between each pair of data points
- recover the low-dimensional embedding of the data by MDS, preserving those geodesic distances/path lengths
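A minimal sketch of these steps (assuming scikit-learn and SciPy, and assuming the neighbor graph is connected): build the neighbor graph, compute shortest-path (geodesic) distances, then apply classical MDS to them.
```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, k=2):
    # 1. construct the neighbor graph with Euclidean edge weights
    G = kneighbors_graph(X, n_neighbors, mode='distance')
    # 2. geodesic distances = shortest path lengths in the graph
    D = shortest_path(G, directed=False)         # assumes a connected graph
    # 3. classical MDS: double-center squared distances, take top eigenvectors
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])

Z = isomap(np.random.randn(100, 5))
```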
- Stochastic neighbor embedding (SNE): uses probabilities to characterize the neighborhood relationships in the original and embedded spaces, and achieves nonlinear dimensionality reduction by matching the two distributions via the Kullback-Leibler divergence.
- in the original space, the probability that point i picks j as its neighbor: p(j|i) = exp(-||xi - xj||^2 / 2σi^2) / Σ_{k≠i} exp(-||xi - xk||^2 / 2σi^2)
- in the embedding space, the neighbor distribution: q(j|i) = exp(-||yi - yj||^2) / Σ_{k≠i} exp(-||yi - yk||^2)
- objective function: C = Σ_i KL(P_i || Q_i) = Σ_i Σ_j p(j|i) log(p(j|i) / q(j|i)); by minimizing the KL divergence, the consistency of the distributions P and Q is maintained.
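A toy sketch of these quantities (assuming NumPy; σ is fixed to 1 and no gradient optimization is performed), just to make the P, Q, and KL terms concrete:
```python
import numpy as np

def neighbor_probs(Z):
    """p(j|i) proportional to exp(-||zi - zj||^2); sigma fixed to 1."""
    sq_d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    P = np.exp(-sq_d)
    np.fill_diagonal(P, 0.0)                     # a point is not its own neighbor
    return P / P.sum(axis=1, keepdims=True)      # normalize each row

X = np.random.randn(50, 10)                      # original space
Y = np.random.randn(50, 2)                       # (randomly initialized) embedding
P, Q = neighbor_probs(X), neighbor_probs(Y)
mask = ~np.eye(len(P), dtype=bool)               # exclude the diagonal
kl = np.sum(P[mask] * np.log(P[mask] / Q[mask])) # the objective SNE minimizes over Y
print(kl)
```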
mindmap
- distance based
  - global distance: MDS, ISOMAP
  - local distance: LLE, EigenMaps
Subspace clustering
CH3 HASHING
why do we need hashing?
- handle “curse of dimensionality”
- reduce storage cost
- Accelerate the query speed
case study:
K-shingle:
- what is it? A sequence of k characters that appears in the document; these sequences are used to represent the document
- Assumption: documents that have lots of shingles in common have similar text, even if the text appears in a different order <=> the proportion of shared shingles reflects the similarity of the text
- carefully choose k: for short documents k=5 is ok; k=10 is better for long documents
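A minimal sketch of character-level shingling plus the Jaccard similarity of the resulting shingle sets (function names are illustrative):
```python
def shingles(text, k=5):
    """All k-character substrings of the document, as a set."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

a = shingles("the quick brown fox jumps over the lazy dog")
b = shingles("the quick brown fox jumped over a lazy dog")
print(jaccard(a, b))   # high value: many shared shingles despite small edits
```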
MinHash:
Preliminary:
- Jaccard Similarity of sets
Sim(C1,C2) = |intersection(C1,C2)|/|Union(C1,C2)|
- For Boolean Matrices
- rows -> elements of Universal set
- columns -> sets
example:
(0,0) -> this element (row) appears in neither C1 nor C2
definition:
MinHashing represents documents as smaller signature matrices, and approximates the similarity between documents by computing the similarity between their signature columns.
outline
- compute signatures of columns (a signature is a small summary of a column)
- examine pairs of signatures to find similar signatures
- (optional) check that columns with similar signatures are really similar
signatures
- Key idea: hash each column C to a small signature Sig(C), such that
- Sig(C) is small enough that we can fit a signature in main memory for each column.
- Sim(C1,C2) ≈ Sim(Sig(C1),Sig(C2))
- Compute:
- imagine the rows permuted randomly
- hash function h(C) = the index of the first row (in the permuted order) in which column C has a 1
- use several independent hash functions (permutations) to create a signature
- Property of the signature:
- the probability that h(C1) = h(C2) is the same as Sim(C1, C2)
- both equal a/(a+b+c): let a be the number of rows of type (1,1), b of type (1,0) and c of type (0,1); rows of type (0,0) are irrelevant, and under a random permutation the first non-(0,0) row is of type (1,1) with probability a/(a+b+c), which is exactly the Jaccard similarity
implementation
- Problem: it is hard to generate a random permutation over a large number of rows (it takes time and memory, and leads to thrashing), so we use a hash function to approximate a random permutation.
- Details: h(r) gives the position of row r in the permutation (r -> original row, h(r) -> permuted row; one hash function acts as one permutation)
- Example:
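One way the elided example could look in code (a hedged sketch; helper names like `make_hashes` are illustrative): each hash function of the form (a*r + b) mod p plays the role of one random row permutation, and the signature entry is the minimum hash value over the rows in which the column has a 1.
```python
import random

def make_hashes(num_hashes, prime=2**31 - 1, seed=0):
    """Random (a, b, p) parameters; each triple approximates one permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime), prime)
            for _ in range(num_hashes)]

def minhash_signature(shingle_ids, hash_params):
    """shingle_ids: set of integer row ids with a 1 in this column."""
    return [min(((a * r + b) % p) for r in shingle_ids)
            for (a, b, p) in hash_params]

hashes = make_hashes(100)
sig1 = minhash_signature({1, 3, 7, 42}, hashes)
sig2 = minhash_signature({1, 3, 8, 42}, hashes)
est_sim = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)  # ≈ Jaccard similarity
print(est_sim)
```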
Locality-Sensitive Hashing
- Background and problem: we now have lots of data representing lots of objects in main memory; this data could be the MinHash signatures or the objects themselves. We want to compare pairs and find those that are sufficiently similar, but comparing all pairs is too expensive.
- idea: hash columns to many buckets, and make elements of the same bucket candidate pairs.
- algorithm:
- hash columns of signature matrix M several times.
- arrange that similar columns are likely to hash to the same bucket
- candidate pairs are those that hash at least once to the same bucket
- details:
- Divide matrix M into b bands of r rows
- for each band, hash its portion of each column to a hash table with k buckets
- candidate pairs are those that hash at least once to the same bucket
- tune b and r to catch most similar pairs, but few dissimilar pairs
- calculation:
there are 100,000 columns; signatures of 100 integers; we want all 80%-similar pairs; choose 20 bands of 5 integers (rows) each.
1. case of C1, C2 with 80% similarity: the probability that C1 and C2 are identical in one particular band is (0.8)^5 = 0.328; the probability that C1 and C2 are not identical in any of the 20 bands is (1-0.328)^20 ≈ 0.00035 (false negatives)
2. case of C1, C2 with 30% similarity: the probability that C1 and C2 are identical in one particular band is (0.3)^5 = 0.00243; the probability that C1 and C2 are identical in at least one band is smaller than 20*0.00243 = 0.0486 (false positives)
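The banding calculation above can be checked with a few lines of code: with b bands of r rows, a pair with similarity s becomes a candidate with probability 1 - (1 - s^r)^b (the function name `candidate_prob` is illustrative).
```python
def candidate_prob(s, b=20, r=5):
    """Probability that a pair with similarity s hashes to the same bucket in >=1 band."""
    return 1 - (1 - s ** r) ** b

print(candidate_prob(0.8))   # ~0.99965 -> false-negative rate ~0.00035
print(candidate_prob(0.3))   # ~0.047   -> false-positive rate, bounded by 20 * 0.3**5
```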
- Application:
LSH-based data compression: key idea: improve the compression ratio by putting similar data blocks together. Theory: after clustering similar blocks, encodings can be shared during compression to improve compression efficiency.
- Learn to Hash:
CH4 Sampling for Big Data
inverse transform sampling
- key idea: sample based on the inverse of the cumulative distribution function; first, generate a sample Y from the uniform distribution, then map Y through the inverse of the CDF, F^-1, to a sample X of the target distribution. Drawback: it is often hard to obtain the inverse function
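A minimal sketch (assuming NumPy) for the Exponential(λ) distribution, whose inverse CDF is known in closed form, F^-1(u) = -ln(1 - u)/λ:
```python
import numpy as np

def sample_exponential(lam, n, rng=np.random.default_rng(0)):
    u = rng.uniform(0.0, 1.0, size=n)      # step 1: uniform samples Y
    return -np.log(1.0 - u) / lam          # step 2: map through the inverse CDF

x = sample_exponential(lam=2.0, n=10_000)
print(x.mean())                            # ≈ 1/lambda = 0.5
```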
Rejection Sampling
- key idea: accept the samples that fall in the region under the graph of the target density function and reject the others
- details:
- the scaled proposal distribution q(x) should always cover the target distribution p(x): use a large positive constant M to magnify q(x) so that M*q(x) >= p(x)
- choice of q(x) matters
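A minimal rejection-sampling sketch (assuming NumPy): the target p(x) is a standard normal, the proposal q(x) is Uniform(-5, 5), and M is chosen so that M*q(x) >= p(x) on the support:
```python
import numpy as np

def rejection_sample(n, rng=np.random.default_rng(0)):
    p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # target density
    q_height = 1 / 10.0                                    # Uniform(-5, 5) density
    M = 0.4 / q_height                                     # M*q(x) = 0.4 >= max p(x)
    samples = []
    while len(samples) < n:
        x = rng.uniform(-5, 5)                             # draw from proposal q
        u = rng.uniform(0, M * q_height)                   # point under M*q(x)
        if u <= p(x):                                      # accept if under p(x)
            samples.append(x)
    return np.array(samples)

x = rejection_sample(5000)
print(x.mean(), x.std())   # ≈ 0, ≈ 1
```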
importance sampling
- key idea: do not reject samples; instead assign a weight to each instance so that the correct distribution is targeted
IS vs. RS
- instances from RS share the same weight, and only some of the instances are kept
- instances from IS have different weights, and all instances are kept
- IS is less sensitive to the choice of the proposal distribution q(x)
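A minimal importance-sampling sketch (assuming NumPy and SciPy): estimate E_p[x^2] = 1 for a standard normal target p from samples of a wider normal proposal q, weighting each sample by p(x)/q(x):
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0, 2, size=50_000)                # draw from proposal q = N(0, 2)
w = norm.pdf(x, 0, 1) / norm.pdf(x, 0, 2)        # importance weights p(x)/q(x)
estimate = np.mean(w * x**2)                     # ≈ E_p[x^2] = 1
print(estimate)
```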
CH5 Data stream mining
Challenges of Data stream
what is data stream?
A data stream is a massive sequence of data objects with some unique features: 1. it arrives one object at a time 2. it is potentially unbounded 3. it exhibits concept drift
challenges:
Data stream: (a) infinite length (b) evolving nature
- single pass handling
- memory limitation
- low time complexity
- concept drift
Concept drift
- concept drift: in predictive analytics and machine learning, concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways.
- in a word, the probability distribution changes
- change in P(C)
- change in P(C|X)
- change in P(X)->virtual concept drift
Concept Drift detection
Distribution-based detector:
- basic idea:
Monitoring the change of data distributions between two fixed or adaptive windows of data(e.g. ADWIN)
- drawback:
- it is hard to determine the window size
- it learns new concepts more slowly
- it may react to virtual concept drift
- example: Adaptive Windowing (ADWIN): the idea is that whenever two "large enough" subwindows of W exhibit "distinct enough" averages, one can conclude that the corresponding expected values are different, and the older portion of the window is dropped.
Error-rate based detector
- basic idea:
Capture concept drift based on the change of the classification performance.(e.g. page-hinkley test, DDM)
- example: a significant increase in the error rate of the algorithm suggests a change in the class distribution; whether an increase is significant is decided by modeling the error rate as a normal distribution:
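A hedged sketch in the spirit of DDM (the 2σ/3σ warning and drift thresholds are the commonly used levels; the class name `SimpleDDM` is illustrative):
```python
import math

class SimpleDDM:
    def __init__(self):
        self.n = 0
        self.p = 1.0            # running error rate
        self.s = 0.0            # its standard deviation
        self.p_min, self.s_min = float('inf'), float('inf')

    def update(self, error):    # error: 1 if the prediction was wrong, else 0
        self.n += 1
        self.p += (error - self.p) / self.n
        self.s = math.sqrt(self.p * (1 - self.p) / self.n)
        if self.p + self.s < self.p_min + self.s_min:      # track the best level seen
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s > self.p_min + 3 * self.s_min:  # significant increase
            return "drift"
        if self.p + self.s > self.p_min + 2 * self.s_min:
            return "warning"
        return "stable"
```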
Data stream classification
algorithm request:
- process an example at a time and inspect it only once
- be ready to predict at any time
- use a limited amount of memory
- work in limited amount of time
DT & challenges
- classic DT learners assume all training data can be simultaneously stored in main memory
- disk-based DT learners repeatedly read training data from disk sequentially
- goal: design DT learners that read each example at most once and use a small constant time to process it
VFDT
- A decision tree learning system based on the Hoeffding tree algorithm
- in order to find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node
- given a stream of examples, use the first ones to choose the root attribute
- once the root attribute is chosen, the succeeding examples are passed down to the corresponding leaves and used to choose the attributes there, and so on recursively.
- use Hoeffding bound to decide how many examples are enough at each node
- G(Xi) is a heuristic measure used to choose the test attribute (e.g. information gain (ratio) / Gini index)
- Xa is the attribute with the highest evaluation value after seeing n examples
- Xb is the attribute with the second-highest value
- Given a confidence parameter δ, if ΔG = G(Xa) - G(Xb) > ε, the Hoeffding bound guarantees that the true ΔG > 0 with probability 1-δ; the node can then be split using Xa, and the succeeding examples are passed down to the new leaves
- Hoeffding bound: ε = sqrt(R^2 ln(1/δ) / (2n)), where R is the range of the evaluation measure and n is the number of examples seen at the node
- Hoeffding's inequality gives an upper bound on the probability that the mean of random variables deviates from its expected value; based on the Hoeffding bound, a small set of samples is enough to decide the split.
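A small sketch of this split decision, assuming R = 1 (information gain with two classes); the helper names are illustrative:
```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def should_split(g_best, g_second, n, delta=1e-7, value_range=1.0):
    # value_range = 1 for information gain with a binary class (log2 of #classes)
    return (g_best - g_second) > hoeffding_bound(value_range, delta, n)

print(should_split(g_best=0.25, g_second=0.12, n=2000))   # True once n is large enough
```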
- Strengths and weaknesses:
- strengths: Scales better than traditional methods
- incremental
- weaknesses:
- spends lots of time and needs more examples when there are ties (i.e., when multiple attributes have close G values)
- memory usage increases as the tree expands
- the number of candidate attributes affects the efficiency
CVFDT
- CVFDT BASIC:
- extend VFDT
- maintain VFDT's speed and accuracy
- detect and respond to changes in the example-generating process
- observation & motivation:
- with time-changing concept, the current splitting attribute of some nodes may not be the best any more
- but an outdated subtree may still be better than the best single leaf, particularly if it is near the root node -> so, when the old attribute seems out of date, grow an alternative subtree with the new best attribute at its root instead of replacing the old one at once
- periodically use a batch of samples to evaluate the quality of the trees -> replace the old, out-of-date subtree with the alternative one when the alternative becomes more accurate
SyncStream
- basic idea: prototype-based learning: an intuitive way is to dynamically select short-term and/or long-term representative examples to capture the trend of time-changing concepts.
- online data maintaining: P-Tree
- prototype selection: error-driven representativeness learning and synchronization-inspired constrained clustering
- sudden concept drift: PCA and statistics
- Lazy learning: knn
- Online data maintaining
- the prototype level: models the current concept by selecting the most representative data (prototypes)
- the concept level: stores a list of prototypes -> represents the historical and current concepts
- how to measure the representativeness of data? -> error-driven
- if a data point's label agrees with the prediction of its K nearest neighbors: representativeness +1, else -1
- if the number of prototype data points exceeds the maximum boundary, remove the data with low representativeness; for data whose representativeness is unchanged -> summarization (i.e., such clusters of data can be represented by a smaller set of prototypes)
- Abrupt concept drift detection:
- detect based on PCA: use the angle (between principal directions) to measure the drift
- detect by statistical analysis
Learning on open-set data
open-set detection
- scenario: incomplete knowledge of the world exists at training time, and unknown classes can appear in the test set
- Task: Not only accurately classify the seen classes, but also effectively deal with unseen ones
incremental learning
- Task: maintain old knowledge and balance old/new classes, extract representative examples
Data stream clustering
Framework
- Online phase:
summarize the data into memory-efficient data structures
- Offline phase:
use a clustering algorithm to find the data partition
Micro-cluster:
a micro-cluster is a set of individual data points that are close to each other and will be treated as a single unit in the subsequent offline macro-clustering phase.
Cluster feature
some algorithms use the cluster feature as the micro-cluster's data structure; cluster feature: CF = (N, LS, SS)
- components: N is the number of data points, LS = Σ x_i (linear sum of the points), SS = Σ x_i^2 (squared sum of the points)
- properties:
- Property 1: let C1 and C2 be two disjoint sets of points (micro-clusters); then the cluster feature vector CF(C1 ∪ C2) = CF(C1) + CF(C2)
- Property 2: let C1 and C2 be two sets of points (micro-clusters) with C1 ⊇ C2; then CF(C1 - C2) = CF(C1) - CF(C2)
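A minimal sketch of the CF structure and its additivity property (assuming NumPy; SS is kept as a per-dimension squared sum here):
```python
import numpy as np

class ClusterFeature:
    def __init__(self, dim):
        self.N = 0
        self.LS = np.zeros(dim)   # linear sum of points
        self.SS = np.zeros(dim)   # squared sum of points

    def add_point(self, x):
        self.N += 1
        self.LS += x
        self.SS += x ** 2

    def merge(self, other):       # Property 1: component-wise addition
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def centroid(self):
        return self.LS / self.N

cf1, cf2 = ClusterFeature(2), ClusterFeature(2)
cf1.add_point(np.array([1.0, 2.0]))
cf2.add_point(np.array([3.0, 4.0]))
cf1.merge(cf2)
print(cf1.N, cf1.LS, cf1.centroid())   # 2 [4. 6.] [2. 3.]
```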
CH6 Hadoop/Spark
what is hadoop (design principles) / spark?
what is hadoop
- hadoop is a software framework for distributed processing of large datasets across large clusters of computers
- open-source implementation for MapReduce
- based on a simple programming model called MapReduce
- based on a simple data model: any data will fit
design principles of hadoop
- need to process big data
- need to parallelize computation across thousands of nodes
- commodity hardware (large numbers of low-end, cheap machines working in parallel to solve a computing problem)
- this is in contrast to parallel DBs (small numbers of high-end, expensive machines)
- automatic parallelization & distribution (hidden from end-users)
- fault tolerance and automatic recovery (nodes/tasks sometimes fail and recover automatically)
- clean and simple programming abstraction (only two functions, 'map' and 'reduce', are provided to users)
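To illustrate the map/reduce abstraction mentioned in the last point, here is a toy word count written in plain Python (not the Hadoop API; `map_fn` and `reduce_fn` are illustrative names):
```python
from collections import defaultdict

def map_fn(line):
    """Map: emit (word, 1) for each word in the input line."""
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: sum all counts for one key."""
    return (word, sum(counts))

lines = ["big data analysis", "big data mining"]
grouped = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):      # map phase
        grouped[word].append(count)        # shuffle: group values by key
results = [reduce_fn(w, c) for w, c in grouped.items()]   # reduce phase
print(results)   # [('big', 2), ('data', 2), ('analysis', 1), ('mining', 1)]
```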
spark
Spark's goal was to generalize MapReduce to support new applications within the same engine.
Spark is a fast and general-purpose cluster computing system.
HDFS
- components: 1. a centralized NameNode 2. many DataNodes
- namenode: maintains metadata about files (it manages the FsImage and EditLog files to keep this metadata: FsImage -> (namespace, system properties, block mapping information); EditLog -> (records every change to the file system metadata) -> (the EditLog is replayed to update the FsImage))
- datanode: stores the actual data; files are divided into blocks, and each block is replicated N times (default N = 3)
- HeartBeat & BlockReport: each DataNode periodically sends heartbeats and block reports to the NameNode. The former tells the NameNode that the DataNode is alive; the latter describes the data blocks stored on that DataNode. The NameNode aggregates all block reports and checks whether any file blocks are lost and whether the replication count of each block meets the requirement; if not, it enters safe mode. If the NameNode does not receive a heartbeat from a DataNode for a certain period of time, it marks that node as dead, replicates the blocks that were stored on it to other nodes, and stops sending new computing tasks to it. Note that the NameNode itself may also fail.
- main properties of HDFS
- large: A HDFS instance may consist of thousands of server machines, each storing part of the file system's data
- replication: each data block is replicated many times
- failure: failures are the norm rather than the exception
- fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS (the NameNode constantly checks the DataNodes)
MapReduce
Spark
basic concept
the main abstraction in Spark is the resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost (in fact, it can be understood as a distributed collection, and using it is as simple as operating on a local collection)
RDD
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset
- Transformation: creates a new dataset from an existing one. All transformations in Spark are lazy: they don't compute their results right away; instead, they remember the transformations applied to some base dataset
- why lazy?: 1. optimize the required calculations 2. recover from lost data partition
(An RDD doesn't actually store the data to be computed. Instead, it records where the data is located and the transformation relationships: which methods were called and which functions were passed in.)
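A hedged PySpark sketch (assuming a local pyspark installation) showing lazy transformations versus an action that triggers computation:
```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")
rdd = sc.parallelize(range(10))               # distributed collection (RDD)
doubled = rdd.map(lambda x: x * 2)            # transformation: recorded, not executed
evens = doubled.filter(lambda x: x % 4 == 0)  # another lazy transformation
print(evens.collect())                        # action: triggers the actual computation
print(evens.count())                          # another action reusing the same lineage
sc.stop()
```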
Spark vs Hadoop
- MapReduce:
- great at one-pass computation, but inefficient for multi-pass algorithms.
- No efficient primitives for data sharing
- MapReduce requires separate engines and reads from and writes to DFS at each step.
- Spark:
- extends a programming language with a resilient distributed collection data structure(RDD)
- clean APIs in Java, Python...
- Spark uses the same engine to perform data extraction, model training, and interactive queries