Assignment 6
5/1/2024
10 Points Possible
Unlimited Attempts Allowed
4/22/2024
Details
For this assignment, you will submit a README.md with your answers to the questions below, along with the code you used to
produce your answers (including all boto3 scripts necessary to reproduce your cloud infrastructure, where relevant). You should
commit your Assignment 6 file(s) to your private “a6” GitHub repository (click here (classroom.github.com/a/jXPdPm3s) to
accept the GitHub Classroom invitation to access this repository) and submit a link to your repository here on Canvas (by clicking
the “Submit Assignment” button to make your submission). You must work alone on this assignment. Before submitting your
assignment, please take a look at the tips one of the previous TAs for the course (Jinfei Zhu) compiled for writing a grader-friendly
README file and organizing your assignment GitHub repository (github.com/lsc4ss-a21/…) if you have not already done so.

(6 Points Total) This first prompt builds on the survey submission pipeline you have been working on in Assignments 4 and 5. As
a final step in your survey submission pipeline, you will write a Python function that can be invoked on a survey participant’s
mobile device when they complete a survey to send their survey submission into an SQS queue, which should then trigger the
Lambda function you wrote in Assignments 4 and 5.
Note that each survey submission is initially saved as a JSON file (on the mobile device) when a participant completes a survey
via the mobile app (see example files here). For the purposes of this prompt, you do
not need to worry about the implementation of the mobile app or the creation of these JSON files. Your job is to write a Python
function that will send a string representation of this JSON data (an individual survey) into an AWS SQS queue (your function
will then be incorporated into the mobile app by another researcher). The SQS queue should then trigger your AWS Lambda
function from Assignment 5, which will take this survey submission data and perform necessary processing and storage
operations in the cloud. You should accomplish all of these tasks programmatically (using boto3) to ensure reproducibility of
your architecture. Specifically, you should complete the following tasks:
a. (1 Point) Write a Python function send_survey (which you can assume will be installed with the mobile app and will
automatically be invoked after a survey is saved as a JSON file on the device) that has the following signature:
def send_survey(survey_path, sqs_url):
    '''
    Input:  survey_path (str): path to JSON survey data
                (e.g. './survey.json')
            sqs_url (str): URL for SQS queue
    Output: StatusCode (int): indicating whether the survey
                was successfully sent into the SQS queue (200) or not (400)
    '''
In the function body, you should use boto3 to send the data from a survey (a JSON file on the mobile device, converted into
a string representation) into an AWS SQS queue.
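One possible shape for such a function is sketched below. It is a minimal illustration, not a reference solution: the optional sqs_client keyword is a hypothetical addition to the required signature (it lets a caller or a test pass in its own client), and mapping every failure to 400 is an assumption about how errors should be reported.

```python
import json


def send_survey(survey_path, sqs_url, sqs_client=None):
    """Send one survey (a JSON file) into an SQS queue as a string.

    sqs_client is a hypothetical extra parameter, not part of the
    assignment's signature. Returns 200 on success, 400 otherwise.
    """
    try:
        # Load and re-serialize so malformed JSON is rejected on the device.
        with open(survey_path, "r", encoding="utf-8") as f:
            body = json.dumps(json.load(f))

        if sqs_client is None:
            import boto3  # imported lazily; only needed for a real client
            sqs_client = boto3.client("sqs")

        response = sqs_client.send_message(QueueUrl=sqs_url, MessageBody=body)
        status = response["ResponseMetadata"]["HTTPStatusCode"]
        return 200 if status == 200 else 400
    except Exception:
        # Assumption: file, JSON, credential, and network errors all map to 400.
        return 400
```

Called as send_survey('./survey.json', sqs_url), the sketch builds a default boto3 client and therefore relies on AWS credentials and a region being configured on the device.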
b. (2 Points) Create an SQS queue and configure it to act as a trigger for your Lambda function from Assignment 5 (which will
process your data and write it to storage).
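The queue setup can be scripted along the following lines. This is a rough sketch under stated assumptions: create_queue_trigger, queue_name, and function_name are illustrative names, the sqs and lambda_client keywords exist only so that clients can be supplied from outside, and the batch size of 10 is an arbitrary choice.

```python
def create_queue_trigger(queue_name, function_name, sqs=None, lambda_client=None):
    """Create an SQS queue and register it as an event source for a Lambda.

    All names here are placeholders chosen by the caller.
    Returns (queue_url, queue_arn).
    """
    if sqs is None:
        import boto3  # imported lazily; only needed for real clients
        sqs = boto3.client("sqs")
    if lambda_client is None:
        import boto3
        lambda_client = boto3.client("lambda")

    # Create the queue and keep its URL (send_message needs the URL).
    queue_url = sqs.create_queue(QueueName=queue_name)["QueueUrl"]

    # The event source mapping needs the queue's ARN rather than its URL.
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )
    queue_arn = attrs["Attributes"]["QueueArn"]

    # Ask Lambda to poll the queue and invoke the function with each batch.
    lambda_client.create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        Enabled=True,
        BatchSize=10,  # assumption: any supported batch size would do
    )
    return queue_url, queue_arn
```

With an SQS trigger, the Lambda function receives an event containing a "Records" list, and each record carries the message text in its "body" field, so a handler written for direct invocation may need a small adjustment. The function's execution role also needs permission to receive and delete messages from the queue.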
Note that if you test your full survey submission pipeline using the example JSON files provided above (in a loop,
using time.sleep(10) in between survey submissions, as in Assignment 5), you should see the following keys in your S3
Bucket:
['0001092821120000.json', '0001092921120000.json', '0001093021120300.json',
'0002092821120000.json', '0003092821120001.json', '0004092821120002.json',
'0005092821122000.json']
You should also see the following records if you query your DynamoDB table:
{'q1': Decimal('1'), 'q2': Decimal('1'), 'user_id': '0001',
'q3': Decimal('2'), 'q4': Decimal('2'), 'q5': Decimal('2'),
'num_submission': Decimal('3'),
'freetext': "I lost my car keys this afternoon at lunch, so I'm more stressed than normal"}
{'q1': Decimal('4'), 'q2': Decimal('1'), 'user_id': '0002',
'q3': Decimal('1'), 'q4': Decimal('1'), 'q5': Decimal('3'),
'num_submission': Decimal('1'),
'freetext': "I'm having a great day!"}
{'q1': Decimal('1'), 'q2': Decimal('3'), 'user_id': '0003',
'q3': Decimal('3'), 'q4': Decimal('1'), 'q5': Decimal('4'),
'num_submission': Decimal('1'),
'freetext': 'It was a beautiful, sunny day today.'}
{'q1': Decimal('1'), 'q2': Decimal('1'), 'user_id': '0004',
'q3': Decimal('1'), 'q4': Decimal('1'), 'q5': Decimal('1'),
'num_submission': Decimal('1'),
'freetext': 'I had a very bad day today...'}
{'q1': Decimal('3'), 'q2': Decimal('3'), 'user_id': '0005',
'q3': Decimal('3'), 'q4': Decimal('3'), 'q5': Decimal('3'),
'num_submission': Decimal('1'),
'freetext': "I'm feeling okay, but not spectacular"}
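The check described above can also be scripted. The sketch below is one hedged way to do it: list_bucket_keys and scan_table are hypothetical helper names, the bucket and table names are whatever you created in Assignment 5, and the s3 and dynamodb keywords exist only so that a client or resource can be passed in. EXPECTED_KEYS is copied from the key list above.

```python
EXPECTED_KEYS = [
    "0001092821120000.json", "0001092921120000.json", "0001093021120300.json",
    "0002092821120000.json", "0003092821120001.json", "0004092821120002.json",
    "0005092821122000.json",
]


def list_bucket_keys(bucket_name, s3=None):
    """Return the sorted object keys in a bucket (first page of results only)."""
    if s3 is None:
        import boto3  # imported lazily; only needed for a real client
        s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket_name)
    # "Contents" is absent when the bucket is empty.
    return sorted(obj["Key"] for obj in response.get("Contents", []))


def scan_table(table_name, dynamodb=None):
    """Return the items from a single scan of a DynamoDB table."""
    if dynamodb is None:
        import boto3
        dynamodb = boto3.resource("dynamodb")
    return dynamodb.Table(table_name).scan()["Items"]
```

After the test loop finishes, comparing list_bucket_keys(your_bucket) against EXPECTED_KEYS and printing scan_table(your_table) gives a quick pass/fail check. A single list or scan call is enough for seven files, but larger datasets would require pagination.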
c. (3 Points) Your PI, who is overseeing this project, is worried that if all of the participants in the study (potentially thousands)
submit surveys at the same time in the day, this might cause the system to crash and your lab might lose data (this
happened to your PI when they ran a similar digital survey via on-premise servers in the early 2000s). How would you
reassure your PI that your architecture is scalable and will be able to handle such spikes in demand? Your response should
be at least 200 words and discuss the scalability of each of the cloud services you used in your pipeline in detail.
(4 points) For this prompt, we ask you to declare whether you will complete a Final Project or a Final Exam as your capstone
assignment for the course. You are welcome to meet with course staff and discuss your options and ideas with us before
making your election and submitting your answer to this prompt.
If you wish to complete a Final Project, you should additionally write a ~250-word proposal in your README for this
assignment, detailing your plan for the project (see expectations and sample projects on the Final Exam/Final Project
Assignment page). You should explain why your project
idea helps to solve a social science research problem using large-scale computing methods and outline a schedule for
completing the project by the deadline. If you are working in a group, you should also write down the names of your group
members and describe how you are going to split up the work amongst yourselves.
If you wish to take a Final Exam, you should instead write one question for possible inclusion in the Final Exam and submit it
in your README for this assignment. The better the question you submit, the higher the likelihood you will see the question
(or a closely related one) on the exam. We will additionally post the best questions to the Final Exam page on Canvas so that
you can use them as study material for the exam. Note: YOU WILL NEED TO PROVIDE THE SOLUTION FOR YOUR
QUESTIONS. A good question is one that goes beyond memorization and asks the student to apply a concept in a way that
is similar to what we do in our in-class activities and conceptual questions in assignments (we will not ask implementation
questions that involve writing code from scratch). Specifically, we plan to include questions of the following types (for
additional examples, you can take a look at past questions used on the exam on the Final Exam/Final Project
assignment page):
Applied Conceptual Questions, such as:
You are conducting a large digital experiment, in which you have designed an online music sharing application and
recruited participants to use the platform over the course of a month. During the experiment, you will manipulate features
of the website in order to test your research hypotheses. In order to run the experiment, you need to be able to
collect/record thousands of data points per second; for instance: tracking the songs that participants download, the
treatments that they were exposed to (by you the researcher), as well as all of the things that participants click on. When
the experiment is over, you would like to perform a statistical analysis on a subset of the data to identify experimental
interventions that caused participants to change their clicking/downloading behavior. Ultimately, when your work is
published, you would also like to have your (de-identified) data publicly accessible, so that future scholars can replicate
your statistical analysis.
What databases and/or storage solutions would you use to solve these problems (storing data while you run the
experiment, as well as afterwards) in the AWS cloud ecosystem? Why? How about if you scaled the experiment up by
several orders of magnitude to include millions of participants? Would this change your data storage/management
solution?
Code Interpretation Questions, such as:
Below is a serial version of a Monte Carlo simulation to estimate π that is written in Python. Identify parts of this code
that could be accelerated using a GPU, as well as those that would best be run on a CPU – attempting to accelerate the
estimation of π as much as possible. For each section of code, you should explain why your answers are the best
hardware options for optimal performance (e.g. thinking in terms of some of the key bottlenecks and hardware limitations
for CPUs vs. GPUs).

# NumPy Pi Estimation with Monte Carlo Simulation
import numpy as np
import time

t0 = time.time()
n_runs = 10 ** 8

# Simulate Random Coordinates in Unit Square:
ran = np.random.uniform(low=-1, high=1, size=(2, n_runs))

# Identify Random Coordinates that fall within Unit Circle and count them
result = ran[0] ** 2 + ran[1] ** 2 <= 1
n_in_circle = np.sum(result)

# Estimate Pi
print("Pi Estimate: ", 4 * n_in_circle / n_runs)
print("Time Elapsed: ", time.time() - t0)
Troubleshooting Questions, such as:
You are training a linear regression model to predict the price of an AirBnB listing given a variety of text features derived
from the listing’s description on AirBnB (note that AirBnB publishes this data in CSV format for listings across the world
and the data is updated on a monthly basis).
You have written a machine learning workflow in PySpark that does the following on an AWS EMR cluster composed of 3
m5.xlarge EC2 instances (1 resource manager and 2 core instances), with 10 GB in EBS storage available on each
instance:

1. Cleans the description text data (e.g. drops stop words and punctuation) from all AirBnB listings around the world
   from the past month (prior to the current month).
2. Engineers features based on the clean description data (such as categorical and binary features indicating whether
   the description contains certain types of words).
3. Uses MLlib’s CrossValidator to identify the optimal hyperparameters for your linear regression model given a grid of
   possible values used to tune the model (i.e. a grid search).
4. Trains the regression model using the optimal hyperparameters from (3) and makes predictions on the prices of AirBnB
   listings from the current month.
Having successfully run this workflow on one previous month of data, you want to increase your training data size to
several years' worth of data. As you increase the amount of data entering your pipeline, though, you begin to observe
an unexpected (i.e. nonlinear) decline in performance (in terms of speed), and beyond a certain data size your job will not
complete at all; it keeps running indefinitely.
Describe at least two possible root causes of this slowdown (considering both hardware and software). Why would these
be concerns? Is it possible to remedy them? What would be your solutions?
Some hints for writing good questions:
You shouldn’t make the question needlessly complicated or overly verbose.
Try to be clear about what you’re asking and what you’re looking for.
Try to cover multiple topics from the course – i.e. a question that touches on the memory hierarchy, GPU vs. CPU
Parallelism, and Spark’s execution model, would be better than one that is narrowly relevant to invoking a Lambda function.
Anything we’ve covered in the class is fair game (and you’re welcome to continue submitting relevant questions through
Wednesday of Week 9 related to material we cover after this assignment – you just will not