Part II Reasoning and Planning with Certainty
3. Searching for Solutions
graph search
state space search
4. Reasoning with Constraints
Constraint Satisfaction Problems
A constraint satisfaction problem (CSP) consists of:
• a set of variables
• a domain for each variable
• a set of constraints.
A solution is a total assignment that satisfies all of the constraints.
Solving CSPs by Searching
The generate-and-test algorithm to find one solution is as follows: check each total assignment in turn; if an assignment is found that satisfies all of the constraints, return that assignment. An improvement is depth-first search: assign the variables one at a time, checking each constraint as soon as all of the variables in its scope are assigned, so that a partial assignment that violates a constraint is pruned without generating any of its extensions.
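Both approaches can be sketched in Python on a toy CSP. The (scope, predicate) representation of constraints and the example problem are illustrative choices made here, not the book's code:

```python
from itertools import product

# A constraint is a (scope, predicate) pair; the predicate reads a dict of values.
def satisfies(assignment, constraints):
    """True if no constraint whose scope is fully assigned is violated."""
    return all(pred(assignment) for scope, pred in constraints
               if all(v in assignment for v in scope))

def generate_and_test(variables, domains, constraints):
    """Check each total assignment in turn; return one that satisfies all constraints."""
    for values in product(*(domains[v] for v in variables)):
        assignment = dict(zip(variables, values))
        if satisfies(assignment, constraints):
            return assignment
    return None

def dfs_solve(variables, domains, constraints, assignment=None):
    """Depth-first search: assign one variable at a time, pruning a partial
    assignment as soon as it violates some fully assigned constraint."""
    assignment = assignment or {}
    if len(assignment) == len(variables):
        return assignment
    var = variables[len(assignment)]
    for value in domains[var]:
        extended = {**assignment, var: value}
        if satisfies(extended, constraints):
            result = dfs_solve(variables, domains, constraints, extended)
            if result is not None:
                return result
    return None

# Toy CSP: A < B and B != 2, with domains {1, 2, 3}.
variables = ["A", "B"]
domains = {"A": [1, 2, 3], "B": [1, 2, 3]}
constraints = [(("A", "B"), lambda t: t["A"] < t["B"]),
               (("B",), lambda t: t["B"] != 2)]
```

With more variables, the advantage of depth-first search grows: every extension of a violated partial assignment is pruned in one step, while generate-and-test enumerates them all.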
Consistency Algorithms
The consistency algorithms are best thought of as operating over a constraint network defined as:
• There is a node (drawn as a circle or an oval) for each variable.
• There is a node (drawn as a rectangle) for each constraint.
• For every constraint c, and for every variable X in the scope of c, there is an arc ⟨X, c⟩.
• There is also a dictionary dom with the variables as keys, where dom[X] is a set of possible values for variable X. dom[X] is initially the domain of X.
The constraint network is thus a bipartite graph, with the two parts consisting of the variable nodes and the constraint nodes; each arc goes from a variable node to a constraint node.
The generalized arc consistency (GAC) algorithm is given in Figure 4.4 (page 138). It takes in a CSP with variables Vs, constraints Cs, (possibly reduced) domains specified by the dictionary dom, and a set to_do of potentially inconsistent arcs. The set to_do initially consists of all arcs in the graph, {⟨X, c⟩ | c ∈ Cs and X ∈ scope(c)}. It modifies dom to make the network arc consistent. While to_do is not empty, an arc ⟨X, c⟩ is removed from the set and considered. If the arc is not arc consistent, it is made arc consistent by pruning the domain of variable X. All of the previously consistent arcs that could, as a result of pruning X, have become inconsistent are added to the set to_do if they are not already there. These are the arcs ⟨Z, c′⟩, where c′ is a constraint different from c that involves X, and Z is a variable involved in c′ other than X. When to_do is empty, the constraint graph is arc consistent.
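A compact sketch of GAC, again using the illustrative (scope, predicate) representation of constraints rather than the book's Figure 4.4 code:

```python
from itertools import product

def gac(constraints, dom):
    """Make the network arc consistent by pruning dom in place.
    constraints is a list of (scope, predicate) pairs; dom maps each
    variable to its set of possible values."""
    to_do = {(x, i) for i, (scope, _) in enumerate(constraints) for x in scope}
    while to_do:
        x, i = to_do.pop()                   # consider arc <X, c>
        scope, pred = constraints[i]
        others = [v for v in scope if v != x]
        # Keep only values of X that participate in some satisfying tuple of c.
        new_dom = {val for val in dom[x]
                   if any(pred({x: val, **dict(zip(others, vals))})
                          for vals in product(*(dom[v] for v in others)))}
        if new_dom != dom[x]:                # X was pruned: re-add affected arcs
            dom[x] = new_dom
            for j, (scope2, _) in enumerate(constraints):
                if j != i and x in scope2:
                    to_do.update((z, j) for z in scope2 if z != x)
    return dom

# A < B and B < C with all domains {1, 2, 3, 4}.
dom = {v: {1, 2, 3, 4} for v in "ABC"}
gac([(("A", "B"), lambda t: t["A"] < t["B"]),
     (("B", "C"), lambda t: t["B"] < t["C"])], dom)
```

Whatever order arcs are popped, GAC converges to the same maximal arc-consistent domains; here A is pruned to {1, 2}, B to {2, 3}, and C to {3, 4}.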
Domain Splitting
variable elimination
Local Search
Local search methods start with a total assignment of a value to each variable and try to improve this assignment iteratively by taking improving steps, by taking random steps, or by restarting with another total assignment
Simulated annealing is a stochastic local search algorithm where the temperature is reduced slowly, starting from approximately a random walk at high temperature, eventually becoming pure greedy descent as it approaches zero temperature.
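A minimal sketch of this temperature schedule on a toy one-dimensional problem; the neighbor function, cooling rate, and cost function are all illustrative assumptions:

```python
import math
import random

def simulated_annealing(neighbors, cost, start, temp=10.0, cooling=0.99,
                        steps=2000, rng=None):
    """Stochastic local search: always accept improving moves; accept a
    worsening move with probability exp(-delta / temperature)."""
    rng = rng or random.Random(0)
    current = start
    for _ in range(steps):
        candidate = rng.choice(neighbors(current))
        delta = cost(candidate) - cost(current)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
        temp *= cooling  # high temp ~ random walk; temp -> 0 ~ greedy descent
    return current

# Toy problem: minimize (x - 7)^2 over the integers, moving +/- 1 each step.
best = simulated_annealing(lambda x: [x - 1, x + 1],
                           lambda x: (x - 7) ** 2, start=0)
```

Early on the high temperature makes worsening moves likely (approximately a random walk); as temp shrinks, the acceptance probability for worsening moves vanishes and only greedy steps remain.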
Population-Based Methods
The preceding local search algorithms maintain a single total assignment. This section considers algorithms that maintain multiple total assignments. The first method, beam search, maintains the best k assignments. The next algorithm, stochastic beam search, selects which assignments to maintain stochastically. In genetic algorithms, which are inspired by biological evolution, the k assignments forming a population interact to produce the new population. In these algorithms, a total assignment of a value to each variable is called an individual and the set of current individuals is a population.
4.8.1 Systematic Methods for Discrete Optimization
4.8.2 Local Search for Optimization
4.8.3 Gradient Descent for Continuous Functions
5. Propositions and Inference //todo
6. Deterministic Planning
Deterministic planning is the process of finding a sequence of actions to achieve a goal.
6.1 Representing States, Actions, and Goals
6.2 Forward Planning
A forward planner treats the planning problem as a path planning problem in the state-space graph
A forward planner searches the state-space graph from the initial state looking for a state that satisfies a goal description. It can use any of the search strategies described in Chapter 3.
6.3 Regression Planning
Regression planning involves searching backwards from a goal, asking the question: what does the agent need to do to achieve this goal, and what needs to hold to enable this action to solve the goal?
6.4 Planning as a CSP
6.5 Partial-Order Planning
Part III Learning and Reasoning with Uncertainty
7. Supervised Machine Learning
In a supervised learning task, the learner is given:
• a set of input features, X1, ..., Xm
• a set of target features, Y1, ..., Yk
• a bag (page 262) of training examples, where each example e is a pair (xe, ye), where xe = (X1(e), ..., Xm(e)) is a tuple of a value for each input feature and ye = (Y1(e), ..., Yk(e)) is a tuple of a value for each target feature.
The output is a predictor, a function that predicts Ys from Xs. The aim is to predict the values of the target features for examples that the learner has not seen. For this book, consider only a single target feature, except where explicitly noted.
Supervised learning is called regression when the domain of the target is (a subset of) the real numbers. It is called classification when the domain of the target is a fixed finite set, for example, the domain could be Boolean {false, true}, clothing sizes {XS, S, M, L, XL, . . . }, or countries of birth (countries, or None for those people who were born outside of any country). Other forms of supervised learning include relational learning, such as predicting a person’s birth mother when test examples might be from a different population of people from training examples, and structured prediction, such as predicting the shape of a molecule.
7.3 Basic Models for Supervised Learning
7.3.1 Learning Decision Trees
A decision tree is a tree in which:
• each internal (non-leaf) node is labeled with a condition, a Boolean function of examples
• each internal node has two branches, one labeled true and the other false
• each leaf of the tree is labeled with a point estimate (page 269)
A greedy optimal split is a condition that results in the lowest error if the learner were allowed only one split and it splits on that condition
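A sketch of picking a greedy optimal split, measuring error as misclassification count with a majority-vote point estimate in each branch; the example data and condition names are illustrative:

```python
def split_error(examples, condition):
    """Misclassifications if we split once on `condition` and predict the
    majority target value in each branch. Each example is (features, target)."""
    error = 0
    for branch in (True, False):
        targets = [y for x, y in examples if condition(x) == branch]
        if targets:
            majority = max(set(targets), key=targets.count)
            error += sum(1 for y in targets if y != majority)
    return error

def greedy_optimal_split(examples, conditions):
    """The condition with the lowest error, given only one split is allowed."""
    return min(conditions, key=lambda c: split_error(examples, c))

def cond_rain(x):   # illustrative Boolean conditions on the features
    return x["rain"]

def cond_windy(x):
    return x["windy"]

examples = [({"rain": True, "windy": False}, "stay"),
            ({"rain": True, "windy": True}, "stay"),
            ({"rain": False, "windy": True}, "go"),
            ({"rain": False, "windy": False}, "go")]
```

Here splitting on rain separates the targets perfectly (error 0), while splitting on windy leaves a mixed branch on each side (error 2), so the greedy optimal split is on rain.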
7.3.2 Linear Regression and Classification
Linear regression is the problem of fitting a linear function to a set of training examples, in which the input and target features are real numbers
A kernel function is a function that is applied to the input features to create new features.
7.5.1 Boosting
In boosting, there is a sequence of learners in which each one learns from the errors of the previous ones.
7.5.2 Gradient-Boosted Trees
Gradient-boosted trees are a mix of some of the techniques presented previously. They are a linear model (page 288) where the features are decision trees (page 281) with binary splits, learned using boosting (page 309).
8. Neural Networks and Deep Learning
8.1 Feedforward Neural Networks
A feedforward neural network implements a prediction function given inputs x as f(x) = fn(fn−1(... f2(f1(x)))).
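The composition f(x) = fn(fn−1(... f2(f1(x)))) can be sketched directly, with each fi a dense layer (linear function plus activation); the weights below are hand-picked for illustration:

```python
def dense(weights, bias, activation):
    """A layer f_i(x) = activation(W x + b)."""
    def layer(x):
        z = [sum(w * xi for w, xi in zip(row, x)) + b_i
             for row, b_i in zip(weights, bias)]
        return [activation(v) for v in z]
    return layer

def relu(v):
    return max(0.0, v)

def identity(v):
    return v

# A tiny 2-2-1 network: f(x) = f2(f1(x)).
f1 = dense([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu)
f2 = dense([[1.0, 2.0]], [0.5], identity)

def f(x):
    return f2(f1(x))
```

For input [2.0, 1.0], f1 produces [1.0, 1.5] and f2 maps that to [4.5]; deeper networks just compose more such layers.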
8.1.1 Parameter Learning
8.2.1 Momentum
The momentum for each parameter acts like a velocity of the update for each parameter, with the standard stochastic gradient descent update acting like an acceleration. The momentum acts as the step size for the parameter. It is increased if the acceleration is the same sign as the momentum and is decreased if the acceleration is the opposite sign
8.2.2 RMS-Prop RMS-Prop (root mean squared propagation) is the default optimizer for the Keras deep learning library. The idea is that the magnitude of the change in each weight depends on how (the square of) the gradient for that weight compares to its historic value, rather than depending on the absolute value of the gradient.
8.2.3 Adam Adam, for “adaptive moments”, is an optimizer that uses both momentum and the square of the gradient, similar to RMS-Prop. It also uses corrections for the parameters to account for the fact that they are initialized at 0, which is not a good estimate to average with.
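The three optimizer ideas above (momentum, RMS-Prop's squared-gradient scaling, and Adam's bias correction) combine in the Adam update for a single parameter. This is a sketch; the hyperparameter values are common defaults, assumed here rather than taken from the text:

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (first moment) plus RMS-Prop-style
    squared-gradient scaling (second moment), with corrections for the
    zero initialization of m and v."""
    m = beta1 * m + (1 - beta1) * grad        # momentum: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias corrections: m, v start at 0,
    v_hat = v / (1 - beta2 ** t)              # which is a poor estimate to average with
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize (theta - 3)^2 starting from theta = 0.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (theta - 3)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```

Because the step is scaled by the root of the mean squared gradient, its magnitude depends on how the current gradient compares to its history rather than on the gradient's absolute size.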
8.4 Convolutional Neural Networks
Convolutional neural networks tackle these problems by using filters that act on small patches of an image, and by sharing the parameters so they learn useful features no matter where in an image they occur.
8.5 Neural Models for Sequences
8.5.1 Word Embeddings
A word embedding is a vector representing some mix of syntax and semantics that is useful for predicting words that appear with the word.
The input layer uses indicator variables (page 286), forming a one-hot encoding for words. That is, there is an input unit for each word in a dictionary. For a given word, the corresponding unit has value 1, and the rest of the units have value 0. This input layer can feed into a hidden layer using a dense linear function, as at the bottom of Figure 8.10. This dense linear layer is called an encoder, as it encodes each word into a vector. Suppose u defines the weights for the encoder, so u[i, j] is the weight for the ith word for the jth unit in the hidden layer. The bias term for the linear function can be used for unknown words – words in a text that were not in the dictionary, so all of the input units are 0 – but let’s ignore them for now and set the bias to 0.
The one-hot encoding has an interesting interaction with a dense linear layer that may follow it. The one-hot encoding essentially selects one weight for each hidden unit. The vector of values in the hidden layer for the input word i is [u[i, 0], u[i, 1], u[i, 2],... ], which is called a word embedding for that word.
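The row-selection effect can be seen in a few lines; the three-word vocabulary and the weights u are illustrative:

```python
vocab = ["the", "cat", "sat"]
# Encoder weights u: u[i][j] is the weight from word i to hidden unit j.
u = [[0.1, 0.4],
     [0.7, 0.2],
     [0.3, 0.9]]

def one_hot(word):
    return [1.0 if w == word else 0.0 for w in vocab]

def encode(word):
    """Multiply the one-hot vector by u (bias 0): since only one input is 1,
    this just selects row i of u, the word embedding for word i."""
    x = one_hot(word)
    return [sum(x[i] * u[i][j] for i in range(len(vocab)))
            for j in range(len(u[0]))]
```

So the full matrix multiplication never needs to be computed: looking up a word's embedding is just indexing its row of u.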
To predict another word from this embedding, another dense linear function can be used to map the embedding into predictions for the words, with one unit per word, as shown in Figure 8.10. This function from the embeddings to words is a decoder. Suppose v defines the weights for the decoder, so v[j, k] is the weight for the kth word for the jth unit in the hidden layer.
8.5.2 Recurrent Neural Network
Between the inputs and the outputs for each time is a memory or belief state (page 55), h(t), which represents the information remembered from the previous times. A recurrent neural network represents a belief state transition function (page 56), which specifies how the belief state, h(t), depends on the percept, x(t), and the previous belief state, h(t−1), and a command function (page 57), which specifies how the output, y(t), depends on the belief state, h(t), and the input, x(t). For a basic recurrent neural network, both of these are represented using a linear function followed by an activation function. More sophisticated models use other differentiable functions, such as deep networks, to represent these functions.
One way to think about recurrent neural networks is that the hidden layer at any time represents the agent’s short-term memory at that time. At the next time, the memory is replaced by a combination of the old memory and new information, using the formula of Equation (8.2) (page 355). While parameter values can be designed to remember information from long ago, the vanishing and exploding gradients mean that the long-term dependencies are difficult to learn.
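The memory update of a basic RNN can be sketched as follows; the tanh activation and the hand-picked one-dimensional weights are illustrative assumptions:

```python
import math

def rnn_step(h_prev, x, W, U, b):
    """New memory h(t) = tanh(W h(t-1) + U x(t) + b): a combination of the
    old memory and the new input through a linear function and activation."""
    return [math.tanh(sum(W[i][j] * h_prev[j] for j in range(len(h_prev)))
                      + sum(U[i][k] * x[k] for k in range(len(x)))
                      + b[i])
            for i in range(len(b))]

def run_rnn(xs, W, U, b, h0):
    """The hidden state is the short-term memory, overwritten at each step."""
    h = h0
    for x in xs:
        h = rnn_step(h, x, W, U, b)
    return h

# One-dimensional example with hand-picked weights.
W, U, b = [[0.5]], [[1.0]], [0.0]
h = run_rnn([[1.0], [0.0]], W, U, b, h0=[0.0])
```

After the input falls to 0, the old memory persists only through the factor W, which is one way to see why long-range dependencies fade when gradients are repeatedly multiplied through such steps.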
A long short-term memory (LSTM) network is a special kind of recurrent neural network designed so that the memory is maintained unless replaced by new information.
8.5.4 Attention and Transformers
A way to improve sequential models is to allow the model to pay attention to specific parts of the input. Attention uses a probability distribution over the words in a text or regions of an image to compute an expected embedding for each word or region.
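A minimal sketch of this expected-embedding computation, using dot-product scores; the two-position keys and values are illustrative:

```python
import math

def softmax(scores):
    """Turn scores into a probability distribution."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Probability distribution over positions (softmax of query-key dot
    products), then the expected value embedding under that distribution."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]

# The query matches the first key more strongly, so the output is pulled
# toward the first value.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[10.0], [20.0]])
```

Because the weights form a probability distribution, the output is always a convex combination of the value vectors, here landing between 10 and 20 but closer to 10.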
8.5.5 Large Language Models
8.6.3 Diffusion Models
Diffusion models are effective methods for generative AI, particularly for image generation.
9. Reasoning with Uncertainty
9.1 Probability
The belief of an agent before it observes anything is its prior probability. As it discovers information – typically by observing the environment – it updates its beliefs, giving a posterior probability.
Bayes’ Rule
Suppose an agent has a current belief in proposition h based on evidence k already observed, given by P(h | k), and subsequently observes e. Its new belief in h is P(h | e ∧ k). Bayes’ rule tells us how to update the agent’s belief in hypothesis h as new evidence arrives.
P(e | h) is the likelihood and P(h) is the prior probability of the hypothesis h. Bayes’ rule states that the posterior probability is proportional to the likelihood times the prior.
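The proportionality, with P(e) as the normalizing constant, can be checked numerically; the probabilities below are illustrative:

```python
def posterior(prior, likelihood_h, likelihood_not_h):
    """Bayes' rule: P(h | e) = P(e | h) P(h) / P(e), where
    P(e) = P(e | h) P(h) + P(e | ~h) P(~h)."""
    p_e = likelihood_h * prior + likelihood_not_h * (1 - prior)
    return likelihood_h * prior / p_e

# Illustrative numbers: P(h) = 0.01, P(e | h) = 0.9, P(e | ~h) = 0.1.
p = posterior(0.01, 0.9, 0.1)
```

Even with a likelihood ratio of 9, the posterior stays small (about 0.083) because the prior is small: the posterior is proportional to likelihood times prior.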
9.3 Belief Networks
A belief network is a directed acyclic graph representing conditional dependence among a set of random variables.
Define the parents of random variable Xi, written parents(Xi), to be a minimal set of predecessors of Xi in the total ordering such that the other predecessors of Xi are conditionally independent of Xi given parents(Xi). Thus Xi probabilistically depends on each of its parents, but is independent of its other predecessors. That is, parents(Xi) ⊆ {X1,..., Xi−1} such that P(Xi | X1,..., Xi−1) = P(Xi | parents(Xi)). This conditional independence characterizes a belief network
A belief network, also called a Bayesian network, is an acyclic directed graph (DAG), where the nodes are random variables. There is an arc from each element of parents(Xi) into Xi. Associated with the belief network is a set of conditional probability distributions that specify the conditional probability of each variable given its parents (which includes the prior probabilities of those variables with no parents)
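The network's conditional probabilities define the joint distribution as the product, over the variables, of P(Xi | parents(Xi)). A sketch with a two-node network; the Fire → Alarm example and its numbers are illustrative:

```python
def joint_probability(assignment, parents, cpt):
    """P(X1, ..., Xn) = product over i of P(Xi | parents(Xi)).
    cpt[X] maps (value of X, tuple of parent values) to a probability."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[par] for par in parents[var])
        p *= cpt[var][(value, parent_values)]
    return p

# Tiny network: Fire -> Alarm.
parents = {"Fire": (), "Alarm": ("Fire",)}
cpt = {
    "Fire": {(True, ()): 0.01, (False, ()): 0.99},
    "Alarm": {(True, (True,)): 0.95, (False, (True,)): 0.05,
              (True, (False,)): 0.02, (False, (False,)): 0.98},
}
p = joint_probability({"Fire": True, "Alarm": True}, parents, cpt)
```

Fire has no parents, so its factor is its prior (0.01); the Alarm factor is conditioned on Fire (0.95), giving a joint probability of 0.0095.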
9.4 Probabilistic Inference
9.6 Sequential Probability Models
9.6.1 Markov Chains
A Markov chain is a belief network with random variables in a sequence, where each variable only directly depends on its predecessor in the sequence. Markov chains are used to represent sequences of values, such as the sequence of states in a dynamic system or language model (page 357). Each point in the sequence is called a stage.
9.6.2 Hidden Markov Models
A hidden Markov model (HMM) is an augmentation of a Markov chain to include observations. A hidden Markov model includes the state transition of the Markov chain, and adds to it observations at each time that depend on the state at the time. These observations can be partial in that different states map to the same observation and noisy in that the same state can map to different observations at different times.
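A standard use of an HMM is filtering: maintaining a belief over the hidden state as noisy observations arrive. A sketch of one forward step; the two-state weather/umbrella model and its numbers are illustrative:

```python
def forward_step(belief, transition, emission, observation):
    """One step of HMM filtering: predict with the transition model, weight
    each state by how well it explains the observation, then normalize."""
    states = list(belief)
    predicted = {s2: sum(belief[s1] * transition[s1][s2] for s1 in states)
                 for s2 in states}
    unnorm = {s: predicted[s] * emission[s][observation] for s in states}
    total = sum(unnorm.values())
    return {s: unnorm[s] / total for s in states}

# Illustrative two-state model with noisy umbrella observations.
transition = {"rain": {"rain": 0.7, "dry": 0.3},
              "dry": {"rain": 0.3, "dry": 0.7}}
emission = {"rain": {"umbrella": 0.9, "no_umbrella": 0.1},
            "dry": {"umbrella": 0.2, "no_umbrella": 0.8}}
belief = {"rain": 0.5, "dry": 0.5}
belief = forward_step(belief, transition, emission, "umbrella")
```

The observation is noisy (dry days sometimes show an umbrella too), so one umbrella sighting shifts the belief toward rain without deciding it.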
The assumptions behind an HMM are:
• The state at time t + 1 only directly depends on the state at time t for t ≥ 0, as in the Markov chain.
• The observation at time t only directly depends on the state at time t.
9.6.4 Dynamic Belief Networks
9.7 Stochastic Simulation
10. Learning with Uncertainty
11. Causality
Part IV Planning and Acting with Uncertainty
12. Planning with Uncertainty
Decision Networks
Dynamic Decision Networks
13. Reinforcement Learning
A reinforcement learning (RL) agent acts in an environment, observing its state and receiving rewards. From its experience of a stream of acting then observing the resulting state and reward, it must determine what to do given its goal of maximizing accumulated reward.
13.3 Temporal Differences
Suppose there is a sequence of numerical values, v1, v2, v3, ..., and the aim is to predict the next. A rolling average Ak is maintained, and updated using the temporal difference equation, derived in Section A.1:
Ak = (1 − αk) ∗ Ak−1 + αk ∗ vk = Ak−1 + αk ∗ (vk − Ak−1)   (13.1)
where αk = 1/k. The difference, vk − Ak−1, is called the temporal difference error or TD error.
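With αk = 1/k this update reproduces the exact mean of the values seen so far, which can be checked directly:

```python
def rolling_average(values):
    """Maintain A_k = A_{k-1} + alpha_k * (v_k - A_{k-1}) with alpha_k = 1/k.
    The difference v_k - A_{k-1} is the TD error."""
    a = 0.0
    for k, v in enumerate(values, start=1):
        alpha = 1 / k
        a = a + alpha * (v - a)   # temporal difference update
    return a

avg = rolling_average([2.0, 4.0, 6.0])
```

With a fixed α instead of 1/k, the same formula becomes an exponentially weighted average that tracks a drifting quantity, which is what Q-learning uses.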
13.4.1 Q-learning
Recall (page 559) that Q∗(s, a), where a is an action and s is a state, is the expected value (cumulative discounted reward) of doing a in state s and then following the optimal policy. Q-learning uses temporal differences to estimate the value of Q∗(s, a). In Q-learning, the agent maintains a table Q[S, A], where S is the set of states and A is the set of actions. Q[s, a] represents its current estimate of Q∗(s, a). An experience ⟨s, a, r, s′⟩ provides one data point for the value of Q(s, a). The data point is that the agent received the future value of r + γV(s′), where V(s′) = max_a′ Q[s′, a′]; this is the actual current reward plus the discounted estimated future value. This new data point is called a return. The agent can use the temporal difference equation (13.1) to update its estimate for Q[s, a]:
Q[s, a] := Q[s, a] + α ∗ (r + γ ∗ max_a′ Q[s′, a′] − Q[s, a])
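One such update can be sketched in a few lines; the two-state, two-action setup and the values of α and γ are illustrative:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Temporal-difference update toward the return r + gamma * max_a' Q[s', a']."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = Q[(s, a)] + alpha * (target - Q[(s, a)])

# Two states, two actions, all estimates starting at 0.
actions = ["left", "right"]
Q = {(s, a): 0.0 for s in ["s0", "s1"] for a in actions}

# Experience <s0, right, reward 1.0, s1> moves Q[s0, right] a fraction
# alpha of the way toward the observed return.
q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=actions)
```

Only the entry for the experienced state-action pair changes; repeated experiences move each estimate further toward the observed returns.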
In choosing an action, the agent faces an exploration–exploitation tradeoff:
• It could exploit the knowledge that it has found to get higher rewards by, in state s, doing one of the actions a that maximizes Q[s, a].
• It could explore to build a better estimate of the Q-function, by, for example, selecting an action at random at each time.
13.8 Model-Based RL
14. Multiagent Systems