Artificial Intelligence: Principles and Techniques

stanford-cs221.github.io/autumn2021/…

modeling-inference-learning paradigm

stanford-cs221.github.io/autumn2021-…

Reflex-based models

  • Most common models in machine learning
  • Include linear classifiers and neural networks
  • Fully feed-forward (no backtracking)

Binary classification

y ∈ {+1, −1} label

  • Fraud detection
  • Toxic comments

Regression

y ∈ R response

  • Housing: information about house → price
  • Arrival times: destination, weather, time → time of arrival

Structured prediction

y is a complex object

  • Machine translation
  • Dialogue
  • Image captioning
  • Image segmentation

State-based models

Applications:

  • Games: Chess, Go, Pac-Man, Starcraft, etc.
  • Robotics: motion planning
  • Natural language generation: machine translation, image captioning

Early ideas from outside AI

  • 1801: linear regression (Gauss, Legendre)
  • 1936: linear classification (Fisher)
  • 1956: Uniform cost search for shortest paths (Dijkstra)
  • 1957: Markov decision processes (Bellman)

Statistical machine learning

  • 1985: Bayesian networks (Pearl)
  • 1995: Support vector machines (Cortes/Vapnik)

stochastic gradient descent

Gradient descent is slow

Each iteration requires a pass over all the training examples, which is expensive when we have lots of data!

Rather than looping through all the training examples to compute a single gradient and making one step, SGD loops through the examples (x, y) one at a time and updates the weights w after each example.
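
A minimal sketch of this loop, assuming a least-squares linear predictor:

```python
import numpy as np

def sgd(examples, d, eta=0.1, epochs=10):
    """Stochastic gradient descent for least-squares regression.

    examples: list of (x, y) pairs, x a length-d numpy array, y a float.
    """
    w = np.zeros(d)
    for _ in range(epochs):
        for x, y in examples:
            # Gradient of the per-example loss (w.x - y)^2 with respect to w.
            gradient = 2 * (w.dot(x) - y) * x
            w -= eta * gradient  # one update per example, not per full pass
    return w
```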

feature templates

A feature template is a group of features all computed in a similar way.

Define the types of patterns to look for, not the particular patterns themselves.

Arrays (good for dense features):

Dictionaries (like JSON objects) (good for sparse features):
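
A hedged illustration of the two representations (the feature names here are hypothetical):

```python
# Dense representation: fixed positions in an array.
phi_dense = [0.0, 1.0, 0.0, 2.0]

# Sparse representation: only the nonzero features, keyed by name.
# Names come from templates like "ends with ___" or "count of digits".
phi_sparse = {
    "endsWith_com": 1.0,
    "countOfDigits": 2.0,
}

# Dot product with sparse features: iterate over the nonzero entries only.
def sparse_dot(w, phi):
    return sum(w.get(name, 0.0) * value for name, value in phi.items())
```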

neural networks

a way to construct non-linear predictors via problem decomposition.

Avoid zero gradients

Solution: replace the zero-gradient threshold function with an activation function σ that has non-zero gradients.
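
A minimal sketch of a one-hidden-layer predictor with the logistic function as σ:

```python
import numpy as np

def logistic(z):
    # Smooth activation with non-zero gradient everywhere,
    # unlike the zero-one threshold function.
    return 1 / (1 + np.exp(-z))

def neural_net_score(V, w, x):
    """One hidden layer: h = sigma(V x), score = w . h."""
    h = logistic(V @ x)   # k hidden units, each a learned subproblem
    return w @ h
```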

backpropagation

Computation graphs

A directed acyclic graph whose root node represents the final mathematical expression and whose other nodes represent intermediate subexpressions.
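
A tiny worked example (a sketch, not the course's code): forward and backward passes on the computation graph of the squared loss (w · x − y)²:

```python
import numpy as np

# Forward pass: build up the expression from leaves to root.
w, x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0]), 0.5
score = w.dot(x)            # intermediate node
residual = score - y        # intermediate node
loss = residual ** 2        # root node

# Backward pass: chain rule from the root back to the leaves.
d_loss = 1.0
d_residual = d_loss * 2 * residual   # d loss / d residual
d_score = d_residual * 1.0           # d residual / d score = 1
d_w = d_score * x                    # d score / d w = x
print(loss, d_w)  # gradient of the loss with respect to w
```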

differentiable programming

A SequenceModel takes a sequence of vectors and produces a corresponding sequence of vectors, where each vector has been "contextualized" with respect to the other vectors. We will see two implementations: recurrent neural networks and Transformers.

  • First, we apply self-attention to x to contextualize the vectors. Then we apply AddNorm (layer normalization with residual connections) to make things safe.
  • Second, we apply a feedforward network to further process each vector independently. Then we do another AddNorm, and that's it.
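
A minimal single-head sketch of this layer in numpy (the weight matrices Wq, Wk, Wv, W1, W2 are assumed learned parameters; real Transformers use multiple heads and learned LayerNorm gains/biases):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def transformer_layer(x, Wq, Wk, Wv, W1, W2):
    """x: (n, d) sequence of vectors; returns a contextualized (n, d) sequence."""
    # Self-attention: each position attends to every other position.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn)           # AddNorm 1
    ff = np.maximum(0, x @ W1) @ W2    # position-wise feedforward (ReLU)
    return layer_norm(x + ff)          # AddNorm 2
```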

  • We can also generate new sequences.
  • The basic building block for generation is something that takes a vector and outputs a token.
  • This process is the reverse of EmbedToken, but uses it as follows: we compute a score x · EmbedToken(y) for each candidate token.
  • Then we apply softmax (exponentiate and normalize) to get a distribution over tokens y. From this distribution, we can either take the token with the highest probability or simply sample from it.
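
A sketch of that step, where EmbedToken is modeled as an embedding matrix E with one row per token:

```python
import numpy as np

def generate_token(x, E, vocab, rng, greedy=False):
    """x: context vector; E: (|vocab|, d) embedding matrix (EmbedToken)."""
    scores = E @ x                      # score x . EmbedToken(y) per token
    p = np.exp(scores - scores.max())   # exponentiate (stably) ...
    p /= p.sum()                        # ... and normalize: softmax
    if greedy:
        return vocab[int(np.argmax(p))]     # highest-probability token
    return vocab[rng.choice(len(vocab), p=p)]  # or sample from p
```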

generalization

Reduce the dimensionality d (number of features):

Manual feature (template) selection:

  • Add feature templates if they help
  • Remove feature templates if they don't help

Automatic feature selection (beyond the scope of this class):

  • Forward selection
  • Boosting
  • L1 regularization

norm

Reduce the norm (length) ||w||:

  • Regularized objective: minimize TrainLoss(w) + (λ/2)‖w‖² (see the sketch after this list)

  • Early stopping: stop gradient descent before ‖w‖ grows too large
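
In gradient-based training, the (λ/2)‖w‖² term just adds λw to every gradient (a one-line sketch):

```python
# One step on the regularized objective TrainLoss(w) + (lam/2) ||w||^2.
def regularized_step(w, gradient, eta=0.1, lam=0.01):
    return w - eta * (gradient + lam * w)  # the lam * w term shrinks w toward 0
```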

K-means

a simple algorithm for clustering, a form of unsupervised learning.

Unlike classification (supervised learning), clustering requires no labels; labeled data is expensive to obtain.
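
A minimal sketch of the two alternating steps (assign each point to its nearest centroid, then recompute each centroid as a mean):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """points: (n, d) float array. Returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    # Initialize centroids to k distinct random points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids
```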

search

DFS, BFS

Dijkstra

courses.cs.washington.edu/courses/cse…

Finds single-source shortest paths in a graph with non-negative edge weights.

Dynamic Programming

Uniform Cost Search

Handles cyclic graphs: enumerates states in order of increasing past cost. Assumption: all action costs are non-negative.

Useful for infinite graphs and for graphs too large to represent in memory.
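
A minimal sketch with a binary-heap priority queue (the successors interface is illustrative):

```python
import heapq

def uniform_cost_search(successors, start, is_end):
    """successors(state) -> iterable of (cost, next_state); costs >= 0."""
    frontier = [(0, start)]           # priority queue ordered by past cost
    explored = set()
    while frontier:
        past_cost, state = heapq.heappop(frontier)
        if state in explored:
            continue                  # already popped with a smaller cost
        explored.add(state)
        if is_end(state):
            return past_cost          # minimum cost to reach an end state
        for cost, nxt in successors(state):
            if nxt not in explored:
                heapq.heappush(frontier, (past_cost + cost, nxt))
    return None                       # no path to an end state
```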

structured perceptron

A*, A* Relaxations

A* biases UCS toward exploring states that are closer to the end state.

Intuition: add a penalty for how much an action takes us away from the end state: Cost′(s, a) = Cost(s, a) + [h(Succ(s, a)) − h(s)].

Intuition: ideally, use h(s) = FutureCost(s), but that’s as hard as solving the original problem.
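
Concretely, A* is just UCS run on the modified costs; a sketch reusing uniform_cost_search above (assumes a consistent h with h(end) = 0, so the modified costs stay non-negative):

```python
def astar(successors, start, is_end, h):
    # Cost'(s, a) = Cost(s, a) + h(Succ(s, a)) - h(s): actions that move
    # away from the end state (h increases) are penalized.
    def modified_successors(state):
        for cost, nxt in successors(state):
            yield cost + h(nxt) - h(state), nxt
    modified_cost = uniform_cost_search(modified_successors, start, is_end)
    # The h terms telescope: modified cost = true cost + h(end) - h(start).
    # With h(end) = 0, recover the true cost by adding back h(start).
    return None if modified_cost is None else modified_cost + h(start)
```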

relaxation

  • closed form solution
    • ManhattanDistance
  • easier search
  • independent subproblems
    • Relax original problem into independent subproblems.

Markov decision processes

A policy π is a mapping from each state s ∈ States to an action a ∈ Actions(s).

Following a policy yields a random path. The utility of a policy is the (discounted) sum of the rewards on the path (this is a random variable).

we maximize the expected utility, which we will refer to as value (of a policy).

policy evaluation

Vπ(s) = 0 if IsEnd(s); otherwise Vπ(s) = Qπ(s, π(s)), where Qπ(s, a) = Σs′ T(s, a, s′)[Reward(s, a, s′) + γ Vπ(s′)].

value iteration

The optimal value Vopt(s) is the maximum value attained by any policy.

Vopt(s) = 0 if IsEnd(s); otherwise Vopt(s) = max over a ∈ Actions(s) of Σs′ T(s, a, s′)[Reward(s, a, s′) + γ Vopt(s′)].
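
A sketch over a tabular MDP (the transitions/actions dictionary interface is assumed for illustration):

```python
def value_iteration(states, actions, transitions, gamma=1.0, iters=100):
    """transitions[(s, a)] -> list of (prob, reward, s2); actions(s) -> list."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                sum(p * (r + gamma * V[s2]) for p, r, s2 in transitions[(s, a)])
                for a in actions(s)
            ) if actions(s) else 0.0   # end states have no actions: value 0
            for s in states
        }
    return V
```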

conclusion

  • Search DP computes FutureCost(s)
  • Policy evaluation computes the policy value Vπ(s)
  • Value iteration computes the optimal value Vopt(s)

reinforcement learning

Model-Based Value Iteration

Estimate the transitions T̂(s, a, s′) and rewards from the observed data, then run value iteration on the estimated MDP.

Model-free Monte Carlo

Estimate Qπ(s, a) directly as the average of the observed utilities following each (s, a), with no transition model needed.

Q-learning

epsilon-greedy

With probability ε, act randomly (explore); with probability 1 − ε, take the action maximizing Q̂opt(s, a) (exploit).
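
A minimal tabular sketch combining the Q-learning update with epsilon-greedy exploration (the env interface here is hypothetical):

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes, eta=0.1, gamma=0.95, epsilon=0.1):
    """env exposes reset() -> s, actions(s) -> list, step(a) -> (r, s2, done).

    (This environment interface is assumed for illustration.)
    """
    Q = defaultdict(float)  # Q[(s, a)], zero-initialized
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else exploit.
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            r, s2, done = env.step(a)
            # Q-learning target: one-step lookahead on the observed transition.
            target = r if done else r + gamma * max(
                Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] += eta * (target - Q[(s, a)])
            s = s2
    return Q
```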

function approximation

Represent Q̂opt(s, a; w) = w · φ(s, a) with features so that estimates generalize across similar states and actions.

RL Applications

  • Managing datacenters; actions: bring machines up and down to minimize time/cost
  • Routing autonomous cars; objective: reduce the total latency of vehicles on the road

constraint satisfaction problems

variable-based models.

factor graph

Objective: find the best assignment of values to the variables

  • All variable-based models have an underlying factor graph.
  • A factor graph contains a set of variables (circle nodes), which represent unknown values that we seek to ascertain, and a set of factors (square nodes), which determine how the variables are related to one another (see the toy sketch below).
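
A toy sketch with hypothetical variables and factors:

```python
# Variables X1, X2 with domain {"R", "B"}; two factors.
domain = ["R", "B"]

def f1(x):          # unary factor: prefer X1 = "R"
    return 2.0 if x["X1"] == "R" else 1.0

def f2(x):          # binary factor: X1 and X2 must differ
    return 1.0 if x["X1"] != x["X2"] else 0.0

factors = [f1, f2]

def weight(x):
    # The weight of an assignment is the product of all factor values.
    result = 1.0
    for f in factors:
        result *= f(x)
    return result

print(weight({"X1": "R", "X2": "B"}))  # 2.0: the best assignment
```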

beam search

keep track of more than just the single best partial assignment at each level of the search tree. This is exactly beam search, which keeps track of (at most) K candidates (K is called the beam size). It’s important to remember that these candidates are not guaranteed to be the K best at each level (otherwise greedy would be optimal).
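
A minimal sketch (assuming weight can score partial assignments, e.g., the product of the factors whose variables are all assigned):

```python
def beam_search(variables, domain, weight, K):
    """Extend partial assignments one variable at a time, keeping the top K."""
    beam = [{}]                           # start with the empty assignment
    for var in variables:
        candidates = [dict(b, **{var: v}) for b in beam for v in domain]
        # Keep the K highest-weight extensions (not guaranteed to be the
        # K best partial assignments overall).
        candidates.sort(key=weight, reverse=True)
        beam = candidates[:K]
    return beam[0]                        # best complete assignment found
```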

local search

Start with an assignment and improve each variable greedily.

Work with a complete assignment and make repairs by changing one variable at a time.
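
A sketch of this (sometimes called iterated conditional modes), reusing the factor-graph weight function:

```python
def iterated_conditional_modes(variables, domain, weight, x, iters=10):
    """x: a complete assignment (dict); improve one variable at a time."""
    for _ in range(iters):
        for var in variables:
            # Re-assign var to the value that maximizes the overall weight,
            # holding all other variables fixed.
            x[var] = max(domain, key=lambda v: weight(dict(x, **{var: v})))
    return x
```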

Markov networks

Connect factor graphs with probability: the probability of an assignment is its weight divided by the normalization constant Z.

Marginal probabilities

The marginal probability P(Xi = v) sums the probabilities of all assignments consistent with Xi = v.

Gibbs sampling

a simple algorithm for approximately computing marginal probabilities.
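
A minimal sketch, reusing the weight-of-assignment idea from the factor-graph example (assumes every value has positive weight):

```python
import numpy as np

def gibbs_sampling(variables, domain, weight, x, iters=1000, seed=0):
    """x: a complete assignment (dict). Resample one variable at a time."""
    rng = np.random.default_rng(seed)
    counts = {var: {v: 0 for v in domain} for var in variables}
    for _ in range(iters):
        for var in variables:
            # Sample var from its conditional distribution given the rest,
            # which is proportional to the assignment weight.
            scores = np.array([weight(dict(x, **{var: v})) for v in domain])
            probs = scores / scores.sum()
            x[var] = domain[rng.choice(len(domain), p=probs)]
        for var in variables:
            counts[var][x[var]] += 1   # tally for marginal estimates
    # Estimated marginals: P(var = v) is approximately counts / iters.
    return {var: {v: c / iters for v, c in cv.items()}
            for var, cv in counts.items()}
```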

conditional independence

Two variables A and B are conditionally independent given C if conditioning on C disconnects A from B in the factor graph.

Markov blanket

The Markov blanket of a variable is its set of neighbors in the factor graph; conditioned on its Markov blanket, a variable is independent of all other variables.

Bayesian networks

A Bayesian network defines a joint distribution as a product of local conditional distributions, one per variable: P(X1, …, Xn) = ∏i P(Xi | Parents(Xi)).

probabilistic programming

Application: social network analysis

Naive Bayes

P(Y, X1, …, Xn) = P(Y) ∏i P(Xi | Y): the features are conditionally independent given the label.

HMMs

P(H1, …, Hn, E1, …, En) = P(H1) ∏t P(Ht | Ht−1) ∏t P(Et | Ht): hidden states Ht evolve as a Markov chain and emit the observed evidence Et.

Maximum likelihood

Choose the parameters that maximize the probability of the observed data; for these count-based models, the closed-form solution is normalized counts.

logic

Modeling paradigms

Ingredients of a logic

  • Syntax: what are valid expressions in the language?
  • Semantics: what do these expressions mean?

Syntax of propositional logic

Formulas are built from propositional symbols (atomic formulas) combined with the connectives ¬, ∧, ∨, →, ↔.

Interpretation function

A model w assigns a truth value to every propositional symbol; the interpretation function I(f, w) recursively determines whether a formula f is true under w.

Knowledge base

A knowledge base is a set of formulas representing what we know; its models are the interpretations that satisfy every formula in it.

Syntax of first-order logic

Adds terms (constants, variables, functions) and predicates over terms, plus the quantifiers ∀ and ∃, on top of the propositional connectives.