Computer Architecture 2024 Spring Final Project Part 2

Overview
Tutorial
● Gem5 Introduction
● Environment Setup
Projects
● Part 1 (5%)
  ○ Write a C++ program to analyze the specification of the L1 data cache.
● Part 2 (5%)
  ○ Given the hardware specifications, try to get the best performance for a more complicated program.

Project 2

Description
In this project, we will use a computer system with a two-level cache. Your task is to write a ViT (Vision Transformer) in C++ and optimize it. You can see more details of the system specification below.

System Specifications
● ISA: X86
● CPU: TimingSimpleCPU (no pipeline, CPU stalls on every memory request)
● Caches

  ○ L1 I cache and L1 D cache connect to the same L2 cache
● Memory size: 8192MB

           I cache size  I cache associativity  D cache size  D cache associativity  Policy  Block size
L1 cache   16KB          8                      16KB          4                      LRU     32B
L2 cache   –             –                      1MB           16                     LRU     32B

ViT (Vision Transformer) – Transformer Overview
● A basic transformer block consists of
  ○ Layer Normalization
  ○ MultiHead Self-Attention (MHSA)
  ○ Feed Forward Network (FFN)
  ○ Residual connection (Add)
● You only need to focus on how to implement the functions in the red box
● If you only want to complete the project without understanding the full ViT algorithm, you can skip the sections marked in red

ViT (Vision Transformer) – Image Pre-processing
● Normalize, resize to (300,300,3), and center crop to (224,224,3)

ViT (Vision Transformer) – Patch Encoder
● In this project, we use Conv2D as the Patch Encoder with kernel_size = (16,16), stride = (16,16), and output_channel = 768
● (224,224,3) -> (14,14,16*16*3) -> (196,768)

ViT (Vision Transformer) – Class Token
● Now we have 196 tokens and each token has 768 features
● In order to record global information, we need to concatenate one learnable class token with the 196 tokens
● (196,768) -> (197,768)

ViT (Vision Transformer) – Position Embedding
● Add the learnable position information to the patch embedding
● (197,768) + position_embedding(197,768) -> (197,768)

ViT (Vision Transformer) – Layer Normalization
T: # of tokens
C: embedded dimension
● Normalize each token
● You need to normalize with the formula [formula shown as an image on the slide]

ViT (Vision Transformer) – MultiHead Self Attention (1)
● $W_k, W_q, W_v \in \mathbb{R}^{C \times C}$
● $b_q, b_k, b_v \in \mathbb{R}^{C}$
● $W_o \in \mathbb{R}^{C \times C}$
● $b_o \in \mathbb{R}^{C}$
[Figure: X -> Input Linear Projection ($W_k, W_q, W_v$; $b_q, b_k, b_v$) -> split into heads -> Attention -> merge heads -> Output Linear Projection ($W_o$; $b_o$) -> Y]

ViT (Vision Transformer) – MultiHead Self Attention (2)
Linear Projection and split into heads
T: # of tokens
C: embedded dimension
H: hidden dimension
NH: # of heads
$C = H \times NH$
● Get $Q, K, V \in \mathbb{R}^{T \times (NH \times H)}$ after the input linear projection
  ○ $Q = X W_q^T + b_q$
  ○ $K = X W_k^T + b_k$
  ○ $V = X W_v^T + b_v$
● Split Q, K, V into $Q_1, Q_2, Q_3, \ldots, Q_{NH}$; $K_1, K_2, K_3, \ldots, K_{NH}$; $V_1, V_2, V_3, \ldots, V_{NH}$, each $\in \mathbb{R}^{T \times H}$

ViT (Vision Transformer) – MultiHead Self Attention (2)
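As a rough illustration of the per-head attention on this slide (scores $S_i$, row-wise softmax $P_i$, output $O_i$), the sketch below runs every head over row-major Q, K, V arrays of shape [T*C], where head h occupies columns h*H to (h+1)*H - 1 of each token. The function name `attention`, the separate Q/K/V buffers, and the max subtraction inside the softmax (a common numerical-stability step that the slides do not mention) are assumptions for illustration; the code in ./layer may use a different layout, such as an interleaved qkv [T*3*C] array.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper, not the TA's interface: scaled dot-product attention
// for every head. q, k, v, o are row-major [T*C] with C = NH * H. Head h uses
// columns [h*H, (h+1)*H) of each token, so writing O_h there merges the heads.
void attention(const float* q, const float* k, const float* v, float* o,
               int T, int NH, int H) {
    const int C = NH * H;
    const float scale = 1.0f / std::sqrt(static_cast<float>(H));
    std::vector<float> p(T);  // one row of P_h at a time
    for (int h = 0; h < NH; ++h) {
        const int off = h * H;
        for (int i = 0; i < T; ++i) {
            // S_h[i][j] = (Q_h[i] . K_h[j]) / sqrt(H)
            float row_max = -std::numeric_limits<float>::infinity();
            for (int j = 0; j < T; ++j) {
                float s = 0.0f;
                for (int d = 0; d < H; ++d)
                    s += q[i * C + off + d] * k[j * C + off + d];
                p[j] = s * scale;
                row_max = std::max(row_max, p[j]);
            }
            // Row-wise softmax; the division by sum is applied at the end.
            float sum = 0.0f;
            for (int j = 0; j < T; ++j) {
                p[j] = std::exp(p[j] - row_max);
                sum += p[j];
            }
            // O_h[i] = P_h[i] * V_h
            for (int d = 0; d < H; ++d) {
                float acc = 0.0f;
                for (int j = 0; j < T; ++j)
                    acc += p[j] * v[j * C + off + d];
                o[i * C + off + d] = acc / sum;
            }
        }
    }
}
```

Only one row of $P_i$ is kept at a time, so the full T×T score matrix is never stored; whether that helps on the simulated cache hierarchy would need to be checked in stat.txt.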

● For each head i, compute $S_i = Q_i K_i^T / \sqrt{H} \in \mathbb{R}^{T \times T}$
● $P_i = \mathrm{Softmax}(S_i) \in \mathbb{R}^{T \times T}$; Softmax is a row-wise function
● $O_i = P_i V_i \in \mathbb{R}^{T \times H}$
[Figure: $Q_i$, $K_i$ -> Matrix Multiplication and scale -> Softmax -> Matrix Multiplication with $V_i$ -> $O_i$]

ViT (Vision Transformer) – MultiHead Self Attention (3)
merge heads and Linear Projection
T: # of tokens
C: embedded dimension
H: hidden dimension
NH: # of heads
● $O_i \in \mathbb{R}^{T \times H}$, $O = [O_1, O_2, \ldots, O_{NH}]$
● output $= O W_o^T + b_o$

ViT (Vision Transformer) – Feed Forward Network
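The feed-forward network of this slide is two linear projections with a GeLU in between. The sketch below assumes the same $Y = XW^T + b$ convention as the MHSA projections, row-major arrays, and the exact erf-based GeLU, $0.5\,x\,(1 + \mathrm{erf}(x/\sqrt{2}))$. The GeLU variant, the weight layout, and the names `linear`, `gelu`, and `feedforward` are assumptions for illustration, not the interface of the provided ./layer code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// y[T*OUT] = x[T*IN] * w^T + b, with w stored row-major as [OUT*IN]
// (assumed layout, matching Y = X W^T + b).
void linear(const float* x, const float* w, const float* b, float* y,
            int T, int IN, int OUT) {
    for (int t = 0; t < T; ++t) {
        for (int o = 0; o < OUT; ++o) {
            float acc = b[o];
            for (int i = 0; i < IN; ++i)
                acc += x[t * IN + i] * w[o * IN + i];
            y[t * OUT + o] = acc;
        }
    }
}

// Exact GeLU (assumed variant): 0.5 * x * (1 + erf(x / sqrt(2))).
float gelu(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
}

// input [T*C] -> linear -> [T*OC] -> GeLU -> linear -> output [T*C]
void feedforward(const float* x, const float* w1, const float* b1,
                 const float* w2, const float* b2, float* out,
                 int T, int C, int OC) {
    std::vector<float> hidden(static_cast<std::size_t>(T) * OC);
    linear(x, w1, b1, hidden.data(), T, C, OC);
    for (float& h : hidden) h = gelu(h);
    linear(hidden.data(), w2, b2, out, T, OC, C);
}
```

With the weights stored as [OUT*IN], both operands of the inner loop are read with stride 1, which is a reasonable starting point for the 32B cache blocks before trying further optimizations.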

T: # of tokens
C: embedded dimension
OC: hidden dimension
[Figure: input (T x C) -> Input Linear Projection -> (T x OC) -> GeLU -> Output Linear Projection -> output]

ViT (Vision Transformer) – GeLU
[Figure: GeLU activation]

ViT (Vision Transformer) – Classifier
● Contains a Linear layer to transform 768 features to 200 classes
  ○ (197, 768) -> (197, 200)
● Only refer to the first token (class token)
  ○ (197, 200) -> (1, 200)

ViT (Vision Transformer) – Work Flow
[Figure: work flow. Pipeline: Pre-processing -> Embedder -> Transformer x12 -> Classifier -> Argmax -> "Black footed Albatross". Transformer block: layernorm, MHSA, + (residual), layernorm, FFN, + (residual). MHSA: matmul, attention, matmul. FFN: matmul, gelu, matmul. Other labels: Load_weight, m5_dump_init, m5_dump_stat (around layernorm, MHSA, residual).]
Testbench commands shown in the figure:
● $ make gelu_tb
● $ make matmul_tb
● $ make layernorm_tb
● $ make MHSA_tb
● $ make feedforward_tb
● $ make transformer_tb
● $ run_all.sh

ViT (Vision Transformer) – Shape of array
● layernorm input/output [T*C]: token 1, token 2, ……, token T (C values per token)
● MHSA input/output/o [T*C]
● MHSA qkv [T*3*C]: q token 1, k token 1, v token 1, ……, q token T, k token T, v token T (C values each)
● feedforward input/output [T*C]
● feedforward gelu [T*OC]: token 1, token 2, ……, token T (OC values per token)

Common problem
● Segmentation fault
  ○ Ensure that you are not accessing a nonexistent memory address
  ○ Enter the command $ ulimit -s unlimited

All you have to do is
● Download TA's Gem5 image
  ○ docker pull yenzu/ca_final_part2:2024
● Write the C++ code in the ./layer folder, based on an understanding of the algorithm
  ○ make clean
  ○ make _tb
  ○ ./_tb
● Ensure the ViT will successfully classify the bird
  ○ python3 embedder.py --image_path images/Black_Footed_Albatross_0001_796111.jpg --embedder_path weights/embedder.pth --output_path embedded_image.bin
  ○ g++ -static main.cpp layer/*.cpp -o process
  ○ ./process
  ○ python3 run_model.py --input_path result.bin --output_path torch_pred.bin --model_path weights/model.pth
  ○ python3 classifier.py --prediction_path torch_pred.bin --classifier_path weights/classifier.pth
  ○ After running the above commands, you will get the following top5 prediction. [Figure: top-5 prediction output]
● Evaluate the performance of part of ViT, that is layernorm+MHSA+residual
  ○ Need about 3.5 hours to finish the simulation
  ○ Check stat.txt

Grading Policy
● (50%) Verification
  ○ (10%) matmul_tb
  ○ (10%) layernorm_tb
  ○ (10%) gelu_tb
  ○ (10%) MHSA_tb
  ○ (10%) transformer_tb
● (50%) Performance
  ○ max(sigmoid((27.74 - student latency)/student latency)*70, 50)
● You will get 0 performance points if your design is not verified.

Submission
● Please submit your code on E3 before 23:59 on June 20, 2024.
● Late submission is not allowed.
● Plagiarism is forbidden; otherwise you will get 0 points!
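Referring back to the evaluated part of the ViT (layernorm + MHSA + residual), the two element-wise pieces can be sketched as follows. Layer normalization is applied to each token independently over its C features. The learnable scale `gamma` and shift `beta`, the epsilon value 1e-5, and the function names are assumptions based on the common formulation $y = \gamma \cdot (x - \mu)/\sqrt{\sigma^2 + \epsilon} + \beta$; they are not taken from the slides, so check them against the provided testbench.

```cpp
#include <cmath>

// Normalize each of the T tokens over its C features (assumed formulation).
// x and y are row-major [T*C]; gamma and beta have C entries.
void layernorm(const float* x, const float* gamma, const float* beta,
               float* y, int T, int C, float eps = 1e-5f) {
    for (int t = 0; t < T; ++t) {
        const float* row = x + t * C;
        float mean = 0.0f;
        for (int c = 0; c < C; ++c) mean += row[c];
        mean /= C;
        float var = 0.0f;  // biased variance: divide by C, not C - 1
        for (int c = 0; c < C; ++c) {
            const float d = row[c] - mean;
            var += d * d;
        }
        var /= C;
        const float inv_std = 1.0f / std::sqrt(var + eps);
        for (int c = 0; c < C; ++c)
            y[t * C + c] = (row[c] - mean) * inv_std * gamma[c] + beta[c];
    }
}

// Residual connection: out = a + b, element-wise over n values.
void residual(const float* a, const float* b, float* out, int n) {
    for (int i = 0; i < n; ++i) out[i] = a[i] + b[i];
}
```

Each token row is read sequentially, so with C = 768 floats per token the passes over a row touch consecutive 32B blocks.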