stanford-cs221.github.io/autumn2021/…
modeling-inference-learning paradigm
stanford-cs221.github.io/autumn2021-…
Reflex-based models
- Most common models in machine learning
- include models such as linear classifiers and neural networks
- Fully feed-forward (no backtracking)
Binary classification
y ∈ {+1, −1} label
- Fraud detection
- Toxic comments
Regression
y ∈ R response
- Housing: information about house → price
- Arrival times: destination, weather, time → time of arrival
Structured prediction
y is a complex object
- Machine translation
- Dialogue
- Image captioning
- Image segmentation
State-based models
Applications:
- Games: Chess, Go, Pac-Man, Starcraft, etc.
- Robotics: motion planning
- Natural language generation: machine translation, image captioning
Early ideas from outside AI
- 1801: linear regression (Gauss, Legendre)
- 1936: linear classification (Fisher)
- 1956: Uniform cost search for shortest paths (Dijkstra)
- 1957: Markov decision processes (Bellman)
Statistical machine learning
- 1985: Bayesian networks (Pearl)
- 1995: Support vector machines (Cortes/Vapnik)
stochastic gradient descent (SGD)
Gradient descent is slow
each iteration requires going over all training examples, which is expensive when we have lots of data!
Rather than looping through all the training examples to compute a single gradient and making one step, SGD loops through the examples (x, y) and updates the weights w based on each example
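A minimal sketch of this update for least-squares regression (the toy dataset, step size, and dimensions below are made up for illustration):

```python
import numpy as np

# Toy data: each example is (feature vector x, target y); values are made up.
train_examples = [(np.array([1.0, 0.0]), 2.0),
                  (np.array([0.0, 1.0]), -1.0),
                  (np.array([1.0, 1.0]), 1.0)]

w = np.zeros(2)   # weights
eta = 0.1         # step size

for epoch in range(100):
    for x, y in train_examples:
        # Squared loss on a single example: (w.x - y)^2
        gradient = 2 * (w.dot(x) - y) * x
        w = w - eta * gradient   # update on each example, not the whole dataset

print(w)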
feature templates
A feature template is a group of features all computed in a similar way.
Define types of patterns to look for, not particular patterns
Arrays (good for dense features):
Dictionaries (json) (good for sparse features):
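A hedged sketch of one possible feature template ("last three characters equals ___"; the template and names are illustrative, not from the notes). A dictionary works well here because only a few of the many possible features are non-zero:

```python
def extract_features(x):
    """Feature template: 'last three characters equals ___'.
    Returns a sparse feature vector as a dictionary."""
    return {"endsWith=" + x[-3:]: 1}

def sparse_dot(features, weights):
    """Sparse dot product between a feature dict and a weight dict."""
    return sum(weights.get(f, 0) * v for f, v in features.items())

print(extract_features("aragorn"))   # {'endsWith=orn': 1}
```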
neural networks
a way to construct non-linear predictors via problem decomposition.
Avoid zero gradients
Solution: replace the zero-gradient threshold function with an activation function σ that has non-zero gradients
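A minimal sketch of a two-layer neural network predictor with a sigmoid activation (the weights and dimensions are made-up placeholders):

```python
import numpy as np

def sigmoid(z):
    # Non-zero gradient everywhere, unlike a hard threshold.
    return 1 / (1 + np.exp(-z))

def predict(V, w, x):
    """Two-layer network: hidden units h = sigmoid(V x), score = w . h."""
    h = sigmoid(V.dot(x))   # problem decomposition: learn intermediate features
    return w.dot(h)

# Placeholder parameters: 2 hidden units over 2 input features.
V = np.array([[1.0, -1.0], [-1.0, 1.0]])
w = np.array([1.0, 1.0])
print(predict(V, w, np.array([2.0, 3.0])))
```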
backpropagation
Computation graphs
A directed acyclic graph whose root node represents the final mathematical expression and whose other nodes represent intermediate subexpressions
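A small worked sketch of backpropagation on the expression (w*x - y)^2, applying the chain rule node by node (the variable names are mine, just for illustration):

```python
# Forward pass: build intermediate subexpressions of loss = (w*x - y)^2
w, x, y = 2.0, 3.0, 1.0
a = w * x          # node: product
b = a - y          # node: residual
loss = b ** 2      # root node: final expression

# Backward pass: propagate gradients from the root toward the leaves.
dloss_db = 2 * b           # d(b^2)/db
dloss_da = dloss_db * 1    # d(a - y)/da = 1
dloss_dw = dloss_da * x    # d(w*x)/dw = x

print(loss, dloss_dw)      # 25.0, 30.0
```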
differentiable programming
A SequenceModel takes a sequence of vectors and produces a corresponding sequence of vectors, where each vector has been "contextualized" with respect to the other vectors. We will see two implementations: recurrent neural networks and Transformers.
First, we apply self-attention to x to contextualize the vectors, then AddNorm (layer normalization with residual connections) to stabilize training. Second, we apply a feedforward network to further process each vector independently, followed by another AddNorm, and that's it.
We can also generate new sequences. The basic building block for generation is something that takes a vector and outputs a token. This process is the reverse of EmbedToken, and uses it as follows: we compute a score x · EmbedToken(y) for each candidate token y. Then we apply softmax (exponentiate and normalize) to get a distribution over tokens y. From this distribution, we can either take the token with the highest probability or simply sample a token.
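A hedged sketch of that generation step: score each candidate token with x · EmbedToken(y), softmax, then argmax or sample (the vocabulary and embedding table below are made up):

```python
import numpy as np

# Made-up token embeddings (EmbedToken) for a tiny vocabulary.
vocab = ["the", "cat", "sat"]
embed = {"the": np.array([1.0, 0.0]),
         "cat": np.array([0.0, 1.0]),
         "sat": np.array([0.5, 0.5])}

def generate(x, sample=False):
    """Map a contextualized vector x to a token."""
    scores = np.array([x.dot(embed[y]) for y in vocab])   # x . EmbedToken(y)
    probs = np.exp(scores) / np.exp(scores).sum()          # softmax
    if sample:
        return np.random.choice(vocab, p=probs)            # sample a token
    return vocab[int(np.argmax(probs))]                    # highest-probability token

print(generate(np.array([0.2, 0.9])))   # 'cat'
```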
generalization
Reduce the dimensionality d (number of features):
Manual feature (template) selection:
- Add feature templates if they help
- Remove feature templates if they don't help
Automatic feature selection (beyond the scope of this class):
- Forward selection
- Boosting
- L1 regularization
norm
Reduce the norm (length) ||w||:
Regularized objective:
- min_w [TrainLoss(w) + (λ/2)||w||²]
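A small sketch of a gradient step on this regularized objective (the data and λ below are placeholders, not from the notes):

```python
import numpy as np

# Placeholder training set and hyperparameters.
train_examples = [(np.array([1.0, 2.0]), 4.0), (np.array([2.0, 1.0]), 3.0)]
w = np.zeros(2)
eta, lam = 0.1, 0.5   # step size and regularization strength

for t in range(100):
    # Gradient of average squared loss, plus gradient of (lam/2) * ||w||^2.
    grad = sum(2 * (w.dot(x) - y) * x for x, y in train_examples) / len(train_examples)
    grad = grad + lam * w
    w = w - eta * grad   # regularization shrinks w toward 0 each step

print(w, np.linalg.norm(w))
```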
early stopping
K-means
a simple algorithm for clustering, a form of unsupervised learning.
Unlike classification (supervised learning), clustering requires no labels, which matters because labeled data is expensive to obtain.
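A minimal K-means sketch, alternating the assignment and centroid-update steps (the points and K below are made up):

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.5, 4.5]])
K = 2
centroids = points[:K].copy()   # naive initialization

for step in range(10):
    # Step 1: assign each point to its closest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of its assigned points.
    for k in range(K):
        if (assignments == k).any():
            centroids[k] = points[assignments == k].mean(axis=0)

print(assignments, centroids)
```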
search
DFS, BFS
Dijkstra
courses.cs.washington.edu/courses/cse…
finding single-source shortest paths in a graph with non-negative edge weights
Dynamic Programming
Uniform Cost Search
Handles cyclic graphs; enumerates states in order of increasing past cost. Assumption: all action costs are non-negative.
Useful for infinite graphs and graphs too large to represent in memory; uniform cost search is widely used in AI search problems.
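A minimal uniform cost search (Dijkstra) sketch over an explicit graph with non-negative action costs (the graph itself is made up):

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """graph: dict mapping state -> list of (successor, cost); costs must be >= 0."""
    frontier = [(0, start)]          # priority queue ordered by past cost
    explored = set()
    while frontier:
        past_cost, state = heapq.heappop(frontier)
        if state in explored:
            continue
        explored.add(state)          # states pop in order of increasing past cost
        if state == goal:
            return past_cost
        for succ, cost in graph.get(state, []):
            if succ not in explored:
                heapq.heappush(frontier, (past_cost + cost, succ))
    return None

# Made-up graph with a cycle; UCS handles it because explored states are skipped.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("A", 1)], "C": []}
print(uniform_cost_search(graph, "A", "C"))   # 2
```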
structured perceptron
A*, A* Relaxations
Idea: bias UCS towards exploring states which are closer to the end state; that is exactly what A* does.
Intuition: add a penalty for how much action a takes us away from the end state
Intuition: ideally, use h(s) = FutureCost(s), but that’s as hard as solving the original problem.
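A sketch of A* as UCS with modified costs cost'(s, a) = cost(s, a) + h(successor) − h(s) (the graph and heuristic values below are hypothetical; a consistent h keeps the modified costs non-negative):

```python
def astar_costs(graph, h):
    """Turn A* into UCS by modifying edge costs:
    cost'(s, a) = cost(s, a) + h(successor) - h(s)."""
    return {s: [(succ, cost + h[succ] - h[s]) for succ, cost in edges]
            for s, edges in graph.items()}

# Made-up graph and heuristic h (an estimate of FutureCost, e.g. Manhattan distance).
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": []}
h = {"A": 2, "B": 1, "C": 0}
print(astar_costs(graph, h))
# Run uniform cost search on the modified graph; add h(start) back to recover the true path cost.
```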
relaxation
- Closed form solution (e.g., ManhattanDistance)
- Easier search
- Independent subproblems: relax the original problem into independent subproblems
Markov decision processes
A policy π is a mapping from each state s ∈ States to an action a ∈ Actions(s).
Following a policy yields a random path. The utility of a policy is the (discounted) sum of the rewards on the path (this is a random variable).
We maximize the expected utility, which we refer to as the value (of the policy).
policy evaluation
value iteration
The optimal value Vopt(s) is the maximum value attained by any policy.
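A minimal value iteration sketch for a tiny MDP (the states, transition probabilities, rewards, and discount below are made up; transitions map (state, action) to a list of (probability, successor, reward)):

```python
# Made-up MDP: transitions[(s, a)] = list of (prob, successor, reward).
transitions = {
    ("in", "stay"): [(0.9, "in", 4), (0.1, "end", 4)],
    ("in", "quit"): [(1.0, "end", 10)],
}
actions = {"in": ["stay", "quit"], "end": []}
gamma = 1.0   # discount factor

V = {s: 0.0 for s in actions}   # estimates of Vopt, initialized to 0
for t in range(100):
    newV = {}
    for s in actions:
        if not actions[s]:
            newV[s] = 0.0       # terminal state
        else:
            # Qopt(s, a) = sum over successors of prob * (reward + gamma * Vopt(s'))
            newV[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])
                          for a in actions[s])
    V = newV

print(V)   # optimal value of each state
```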
conclusion
- Search DP computes FutureCost(s)
- Policy evaluation computes the policy value Vπ(s)
- Value iteration computes the optimal value Vopt(s)
reinforcement learning
Model-Based Value Iteration
Model-free Monte Carlo
Q-learning
epsilon-greedy
function approximation
RL Applications
- Managing datacenters; actions: bring up and shut down machines to minimize time/cost
- Routing autonomous cars: reduce the total latency of vehicles on the road
constraint satisfaction problems
variable-based models.
factor graph
Objective: find the best assignment of values to the variables
- All variable-based models have an underlying factor graph.
- A factor graph contains a set of variables (circle nodes), which represent unknown values that we seek to ascertain, and a set of factors (square nodes), which determine how the variables are related to one another.
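A small sketch of how a factor graph scores an assignment: the weight of a complete assignment is the product of all factor values (the variables and factors below are made up, a tiny map-coloring-style example):

```python
# Made-up factor graph: two variables X1, X2 with domain {R, B},
# one unary factor on X1 and one binary constraint factor on (X1, X2).
domain = ["R", "B"]
factors = [
    lambda a: 2.0 if a["X1"] == "R" else 1.0,        # preference for X1 = R
    lambda a: 1.0 if a["X1"] != a["X2"] else 0.0,    # constraint: X1 != X2
]

def weight(assignment):
    """Weight of a complete assignment = product of all factor values."""
    w = 1.0
    for f in factors:
        w *= f(assignment)
    return w

# Find the best assignment by brute force over the (tiny) domain.
best = max(({"X1": v1, "X2": v2} for v1 in domain for v2 in domain), key=weight)
print(best, weight(best))   # {'X1': 'R', 'X2': 'B'} 2.0
```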
beam search
keep track of more than just the single best partial assignment at each level of the search tree. This is exactly beam search, which keeps track of (at most) K candidates (K is called the beam size). It’s important to remember that these candidates are not guaranteed to be the K best at each level (otherwise greedy would be optimal).
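A hedged beam search sketch over variable assignments, keeping at most K partial assignments per level (the scoring function, variable names, and domain are placeholders):

```python
def beam_search(variables, domain, score, K):
    """Keep the K highest-scoring partial assignments at each level.
    score(assignment) returns a number; higher is better."""
    beam = [{}]                               # start with the empty assignment
    for var in variables:
        candidates = [dict(b, **{var: val}) for b in beam for val in domain]
        candidates.sort(key=score, reverse=True)
        beam = candidates[:K]                 # prune to the K best (not guaranteed optimal)
    return beam[0]

# Placeholder scoring: prefer assignments where consecutive variables differ.
variables = ["X1", "X2", "X3"]
domain = ["R", "B"]
score = lambda a: sum(1 for i in range(1, len(a)) if a.get(f"X{i}") != a.get(f"X{i+1}"))
print(beam_search(variables, domain, score, K=2))
```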
local search
Start with an assignment and improve each variable greedily.
work with a complete assignment and make repairs by changing one variable at a time
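A minimal local search sketch in that spirit (iterated conditional modes style): start from a complete assignment and repeatedly re-optimize one variable at a time with the others fixed (the weight function below is made up):

```python
# Made-up weight function over complete assignments (higher is better).
domain = ["R", "B"]
variables = ["X1", "X2", "X3"]

def weight(a):
    w = 2.0 if a["X1"] == "R" else 1.0          # unary preference
    w *= 1.0 if a["X1"] != a["X2"] else 0.1     # soft constraints between neighbors
    w *= 1.0 if a["X2"] != a["X3"] else 0.1
    return w

# Start with an arbitrary complete assignment and make greedy single-variable repairs.
assignment = {v: "R" for v in variables}
for sweep in range(5):
    for v in variables:
        assignment[v] = max(domain, key=lambda val: weight(dict(assignment, **{v: val})))

print(assignment, weight(assignment))   # e.g. {'X1': 'R', 'X2': 'B', 'X3': 'R'} 2.0
```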
Markov networks
Connect factor graphs with probability.
Marginal probabilities
Gibbs sampling
a simple algorithm for approximately computing marginal probabilities.
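A hedged Gibbs sampling sketch for estimating a marginal in a small factor graph (the weight function is made up; a real implementation would only recompute the factors that touch the resampled variable):

```python
import random

domain = ["R", "B"]
variables = ["X1", "X2"]

def weight(a):
    """Made-up unnormalized weight of a complete assignment."""
    w = 2.0 if a["X1"] == "R" else 1.0
    w *= 3.0 if a["X1"] != a["X2"] else 1.0
    return w

random.seed(0)
assignment = {v: random.choice(domain) for v in variables}
counts = {val: 0 for val in domain}

for t in range(10000):
    for v in variables:
        # Resample v from its conditional distribution given all other variables.
        weights = [weight(dict(assignment, **{v: val})) for val in domain]
        assignment[v] = random.choices(domain, weights=weights)[0]
    counts[assignment["X1"]] += 1

# Estimated marginal P(X1 = R) vs. P(X1 = B); exact answer here is 8/12 vs. 4/12.
print({val: counts[val] / 10000 for val in domain})
```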
conditional independence
Markov blanket
Bayesian networks
probabilistic programming
Application: social network analysis
Naive Bayes
HMMs
Maximum likelihood
logic
Modeling paradigms
Ingredients of a logic
- Syntax: what are valid expressions in the language?
- Semantics: what do these expressions mean?