stanford-cs221.github.io/autumn2021/…
modeling-inference-learning paradigm
stanford-cs221.github.io/autumn2021-…
Reflex-based models
- Most common models in machine learning
- include models such as linear classifiers and neural networks
- Fully feed-forward (no backtracking)
Binary classification
y ∈ {+1, −1} label
- Fraud detection
- Toxic comments
Regression
y ∈ R response
- Housing: information about house → price
- Arrival times: destination, weather, time → time of arrival
Structured prediction
y is a complex object
- Machine translation
- Dialogue
- Image captioning
- Image segmentation
State-based models
Applications:
- Games: Chess, Go, Pac-Man, Starcraft, etc.
- Robotics: motion planning
- Natural language generation: machine translation, image captioning
Early ideas from outside AI
- 1801: linear regression (Gauss, Legendre)
- 1936: linear classification (Fisher)
- 1956: Uniform cost search for shortest paths (Dijkstra)
- 1957: Markov decision processes (Bellman)
Statistical machine learning
- 1985: Bayesian networks (Pearl)
- 1995: Support vector machines (Cortes/Vapnik)
stochastic gradient descent (SGD)
Gradient descent is slow
each iteration requires going over all training examples, which is expensive when we have lots of data!
Rather than looping through all the training examples to compute a single gradient and making one step, SGD loops through the examples (x, y) and updates the weights w based on each example
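A minimal sketch of this update for least-squares regression (the toy dataset, step size, and dimensions below are made up for illustration):

```python
import numpy as np

# Toy data: each example is (feature vector x, target y); values are made up.
train_examples = [(np.array([1.0, 0.0]), 2.0),
                  (np.array([0.0, 1.0]), -1.0),
                  (np.array([1.0, 1.0]), 1.0)]

w = np.zeros(2)   # weights
eta = 0.1         # step size

for epoch in range(100):
    for x, y in train_examples:
        # Squared loss on a single example: (w.x - y)^2
        gradient = 2 * (w.dot(x) - y) * x
        w = w - eta * gradient   # update on each example, not the whole dataset

print(w)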
feature templates
A feature template is a group of features all computed in a similar way.
Define types of patterns to look for, not particular patterns
Arrays (good for dense features):
Dictionaries (json) (good for sparse features):
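A hedged sketch of one possible feature template ("last three characters equals ___"; the template and names are illustrative, not from the notes). A dictionary works well here because only a few of the many possible features are non-zero:

```python
def extract_features(x):
    """Feature template: 'last three characters equals ___'.
    Returns a sparse feature vector as a dictionary."""
    return {"endsWith=" + x[-3:]: 1}

def sparse_dot(features, weights):
    """Sparse dot product between a feature dict and a weight dict."""
    return sum(weights.get(f, 0) * v for f, v in features.items())

print(extract_features("aragorn"))   # {'endsWith=orn': 1}
```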
neural networks
a way to construct non-linear predictors via problem decomposition.
Avoid zero gradients
Solution: replace the zero-gradient threshold function with an activation function σ that has non-zero gradients
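A minimal sketch of a two-layer neural network predictor with a sigmoid activation (the weights and dimensions are made-up placeholders):

```python
import numpy as np

def sigmoid(z):
    # Non-zero gradient everywhere, unlike a hard threshold.
    return 1 / (1 + np.exp(-z))

def predict(V, w, x):
    """Two-layer network: hidden units h = sigmoid(V x), score = w . h."""
    h = sigmoid(V.dot(x))   # problem decomposition: learn intermediate features
    return w.dot(h)

# Placeholder parameters: 2 hidden units over 2 input features.
V = np.array([[1.0, -1.0], [-1.0, 1.0]])
w = np.array([1.0, 1.0])
print(predict(V, w, np.array([2.0, 3.0])))
```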
backpropagation
Computation graphs
A directed acyclic graph whose root node represents the final mathematical expression and whose other nodes represent intermediate subexpressions
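A small worked sketch of backpropagation on the expression (w*x - y)^2, applying the chain rule node by node (the variable names are mine, just for illustration):

```python
# Forward pass: build intermediate subexpressions of loss = (w*x - y)^2
w, x, y = 2.0, 3.0, 1.0
a = w * x          # node: product
b = a - y          # node: residual
loss = b ** 2      # root node: final expression

# Backward pass: propagate gradients from the root toward the leaves.
dloss_db = 2 * b           # d(b^2)/db
dloss_da = dloss_db * 1    # d(a - y)/da = 1
dloss_dw = dloss_da * x    # d(w*x)/dw = x

print(loss, dloss_dw)      # 25.0, 30.0
```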
differentiable programming
A SequenceModel takes a sequence of vectors and produces a corresponding sequence of vectors, where each vector has been "contextualized" with respect to the other vectors. We will see two implementations: recurrent neural networks and Transformers.
First, we apply self-attention to x to contextualize the vectors, then AddNorm (layer normalization with residual connections) to stabilize training. Second, we apply a feedforward network to further process each vector independently, followed by another AddNorm, and that's it.
We can also generate new sequences. The basic building block for generation is something that takes a vector and outputs a token. This process is the reverse of EmbedToken, and uses it as follows: we compute a score x · EmbedToken(y) for each candidate token y. Then we apply softmax (exponentiate and normalize) to get a distribution over tokens y. From this distribution, we can either take the token with the highest probability or simply sample a token.
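A hedged sketch of that generation step: score each candidate token with x · EmbedToken(y), softmax, then argmax or sample (the vocabulary and embedding table below are made up):

```python
import numpy as np

# Made-up token embeddings (EmbedToken) for a tiny vocabulary.
vocab = ["the", "cat", "sat"]
embed = {"the": np.array([1.0, 0.0]),
         "cat": np.array([0.0, 1.0]),
         "sat": np.array([0.5, 0.5])}

def generate(x, sample=False):
    """Map a contextualized vector x to a token."""
    scores = np.array([x.dot(embed[y]) for y in vocab])   # x . EmbedToken(y)
    probs = np.exp(scores) / np.exp(scores).sum()          # softmax
    if sample:
        return np.random.choice(vocab, p=probs)            # sample a token
    return vocab[int(np.argmax(probs))]                    # highest-probability token

print(generate(np.array([0.2, 0.9])))   # 'cat'
```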
generalization
Reduce the dimensionality d (number of features):
Manual feature (template) selection:
- Add feature templates if they help
- Remove feature templates if they don't help
Automatic feature selection (beyond the scope of this class):
- Forward selection
- Boosting
- L1 regularization
norm
Reduce the norm (length) ||w||:
Regularized objective:
- min_w [TrainLoss(w) + (λ/2)||w||²]
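A small sketch of a gradient step on this regularized objective (the data and λ below are placeholders, not from the notes):

```python
import numpy as np

# Placeholder training set and hyperparameters.
train_examples = [(np.array([1.0, 2.0]), 4.0), (np.array([2.0, 1.0]), 3.0)]
w = np.zeros(2)
eta, lam = 0.1, 0.5   # step size and regularization strength

for t in range(100):
    # Gradient of average squared loss, plus gradient of (lam/2) * ||w||^2.
    grad = sum(2 * (w.dot(x) - y) * x for x, y in train_examples) / len(train_examples)
    grad = grad + lam * w
    w = w - eta * grad   # regularization shrinks w toward 0 each step

print(w, np.linalg.norm(w))
```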
early stopping
K-means
a simple algorithm for clustering, a form of unsupervised learning.
Unlike classification (supervised learning), clustering requires no labels, which matters because labeled data is expensive to obtain.
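A minimal K-means sketch, alternating the assignment and centroid-update steps (the points and K below are made up):

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0], [5.5, 4.5]])
K = 2
centroids = points[:K].copy()   # naive initialization

for step in range(10):
    # Step 1: assign each point to its closest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assignments = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of its assigned points.
    for k in range(K):
        if (assignments == k).any():
            centroids[k] = points[assignments == k].mean(axis=0)

print(assignments, centroids)
```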
search
DFS, BFS
Dijkstra
courses.cs.washington.edu/courses/cse…
finding single-source shortest paths in a graph with non-negative edge weights
Dynamic Programming
Uniform Cost Search
Handles cyclic graphs; enumerates states in order of increasing past cost. Assumption: all action costs are non-negative.
Useful for infinite graphs and graphs too large to represent in memory; uniform cost search is widely used in AI search problems.
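A minimal uniform cost search (Dijkstra) sketch over an explicit graph with non-negative action costs (the graph itself is made up):

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """graph: dict mapping state -> list of (successor, cost); costs must be >= 0."""
    frontier = [(0, start)]          # priority queue ordered by past cost
    explored = set()
    while frontier:
        past_cost, state = heapq.heappop(frontier)
        if state in explored:
            continue
        explored.add(state)          # states pop in order of increasing past cost
        if state == goal:
            return past_cost
        for succ, cost in graph.get(state, []):
            if succ not in explored:
                heapq.heappush(frontier, (past_cost + cost, succ))
    return None

# Made-up graph with a cycle; UCS handles it because explored states are skipped.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("A", 1)], "C": []}
print(uniform_cost_search(graph, "A", "C"))   # 2
```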
structured perceptron
A*, A* Relaxations
Idea: bias UCS towards exploring states which are closer to the end state; that is exactly what A* does.
Intuition: add a penalty for how much action a takes us away from the end state
Intuition: ideally, use h(s) = FutureCost(s), but that’s as hard as solving the original problem.
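A sketch of A* as UCS with modified costs cost'(s, a) = cost(s, a) + h(successor) − h(s) (the graph and heuristic values below are hypothetical; a consistent h keeps the modified costs non-negative):

```python
def astar_costs(graph, h):
    """Turn A* into UCS by modifying edge costs:
    cost'(s, a) = cost(s, a) + h(successor) - h(s)."""
    return {s: [(succ, cost + h[succ] - h[s]) for succ, cost in edges]
            for s, edges in graph.items()}

# Made-up graph and heuristic h (an estimate of FutureCost, e.g. Manhattan distance).
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": []}
h = {"A": 2, "B": 1, "C": 0}
print(astar_costs(graph, h))
# Run uniform cost search on the modified graph; add h(start) back to recover the true path cost.
```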
relaxation
- Closed form solution (e.g., ManhattanDistance)
- Easier search
- Independent subproblems: relax the original problem into independent subproblems
Markov decision processes
A policy π is a mapping from each state s ∈ States to an action a ∈ Actions(s).
Following a policy yields a random path. The utility of a policy is the (discounted) sum of the rewards on the path (this is a random variable).
We maximize the expected utility, which we refer to as the value (of the policy).
policy evaluation
value iteration
The optimal value Vopt(s) is the maximum value attained by any policy.
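A minimal value iteration sketch for a tiny MDP (the states, transition probabilities, rewards, and discount below are made up; transitions map (state, action) to a list of (probability, successor, reward)):

```python
# Made-up MDP: transitions[(s, a)] = list of (prob, successor, reward).
transitions = {
    ("in", "stay"): [(0.9, "in", 4), (0.1, "end", 4)],
    ("in", "quit"): [(1.0, "end", 10)],
}
actions = {"in": ["stay", "quit"], "end": []}
gamma = 1.0   # discount factor

V = {s: 0.0 for s in actions}   # estimates of Vopt, initialized to 0
for t in range(100):
    newV = {}
    for s in actions:
        if not actions[s]:
            newV[s] = 0.0       # terminal state
        else:
            # Qopt(s, a) = sum over successors of prob * (reward + gamma * Vopt(s'))
            newV[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[(s, a)])
                          for a in actions[s])
    V = newV

print(V)   # optimal value of each state
```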
conclusion
- Search DP computes FutureCost(s)
- Policy evaluation computes the policy value Vπ(s)
- Value iteration computes the optimal value Vopt(s)
reinforcement learning
Model-Based Value Iteration
Model-free Monte Carlo
Q-learning
epsilon-greedy
function approximation
RL Applications
- Managing datacenters; actions: bring up and shut down machines to minimize time/cost
- Routing autonomous cars: reduce the total latency of vehicles on the road
constraint satisfaction problems
variable-based models.
factor graph
Objective: find the best assignment of values to the variables
- All variable-based models have an underlying factor graph.
- A factor graph contains a set of variables (circle nodes), which represent unknown values that we seek to ascertain, and a set of factors (square nodes), which determine how the variables are related to one another.
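A small sketch of how a factor graph scores an assignment: the weight of a complete assignment is the product of all factor values (the variables and factors below are made up, a tiny map-coloring-style example):

```python
# Made-up factor graph: two variables X1, X2 with domain {R, B},
# one unary factor on X1 and one binary constraint factor on (X1, X2).
domain = ["R", "B"]
factors = [
    lambda a: 2.0 if a["X1"] == "R" else 1.0,        # preference for X1 = R
    lambda a: 1.0 if a["X1"] != a["X2"] else 0.0,    # constraint: X1 != X2
]

def weight(assignment):
    """Weight of a complete assignment = product of all factor values."""
    w = 1.0
    for f in factors:
        w *= f(assignment)
    return w

# Find the best assignment by brute force over the (tiny) domain.
best = max(({"X1": v1, "X2": v2} for v1 in domain for v2 in domain), key=weight)
print(best, weight(best))   # {'X1': 'R', 'X2': 'B'} 2.0
```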
beam search
keep track of more than just the single best partial assignment at each level of the search tree. This is exactly beam search, which keeps track of (at most) K candidates (K is called the beam size). It’s important to remember that these candidates are not guaranteed to be the K best at each level (otherwise greedy would be optimal).
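A hedged beam search sketch over variable assignments, keeping at most K partial assignments per level (the scoring function, variable names, and domain are placeholders):

```python
def beam_search(variables, domain, score, K):
    """Keep the K highest-scoring partial assignments at each level.
    score(assignment) returns a number; higher is better."""
    beam = [{}]                               # start with the empty assignment
    for var in variables:
        candidates = [dict(b, **{var: val}) for b in beam for val in domain]
        candidates.sort(key=score, reverse=True)
        beam = candidates[:K]                 # prune to the K best (not guaranteed optimal)
    return beam[0]

# Placeholder scoring: prefer assignments where consecutive variables differ.
variables = ["X1", "X2", "X3"]
domain = ["R", "B"]
score = lambda a: sum(1 for i in range(1, len(a)) if a.get(f"X{i}") != a.get(f"X{i+1}"))
print(beam_search(variables, domain, score, K=2))
```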
local search
Start with an assignment and improve each variable greedily.
work with a complete assignment and make repairs by changing one variable at a time
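A minimal local search sketch in that spirit (iterated conditional modes style): start from a complete assignment and repeatedly re-optimize one variable at a time with the others fixed (the weight function below is made up):

```python
# Made-up weight function over complete assignments (higher is better).
domain = ["R", "B"]
variables = ["X1", "X2", "X3"]

def weight(a):
    w = 2.0 if a["X1"] == "R" else 1.0          # unary preference
    w *= 1.0 if a["X1"] != a["X2"] else 0.1     # soft constraints between neighbors
    w *= 1.0 if a["X2"] != a["X3"] else 0.1
    return w

# Start with an arbitrary complete assignment and make greedy single-variable repairs.
assignment = {v: "R" for v in variables}
for sweep in range(5):
    for v in variables:
        assignment[v] = max(domain, key=lambda val: weight(dict(assignment, **{v: val})))

print(assignment, weight(assignment))   # e.g. {'X1': 'R', 'X2': 'B', 'X3': 'R'} 2.0
```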
Markov networks
Connect factor graphs with probability.
Marginal probabilities
Gibbs sampling
a simple algorithm for approximately computing marginal probabilities.
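A hedged Gibbs sampling sketch for estimating a marginal in a small factor graph (the weight function is made up; a real implementation would only recompute the factors that touch the resampled variable):

```python
import random

domain = ["R", "B"]
variables = ["X1", "X2"]

def weight(a):
    """Made-up unnormalized weight of a complete assignment."""
    w = 2.0 if a["X1"] == "R" else 1.0
    w *= 3.0 if a["X1"] != a["X2"] else 1.0
    return w

random.seed(0)
assignment = {v: random.choice(domain) for v in variables}
counts = {val: 0 for val in domain}

for t in range(10000):
    for v in variables:
        # Resample v from its conditional distribution given all other variables.
        weights = [weight(dict(assignment, **{v: val})) for val in domain]
        assignment[v] = random.choices(domain, weights=weights)[0]
    counts[assignment["X1"]] += 1

# Estimated marginal P(X1 = R) vs. P(X1 = B); exact answer here is 8/12 vs. 4/12.
print({val: counts[val] / 10000 for val in domain})
```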
conditional independence
Markov blanket
Bayesian networks
probabilistic programming
Application: social network analysis
Naive Bayes
HMMs
Maximum likelihood
logic
Modeling paradigms
Ingredients of a logic
- Syntax: what are valid expressions in the language?
- Semantics: what do these expressions mean?