Assignment 6
5/1/2024
10 Points Possible
Unlimited Attempts Allowed
4/22/2024
Details
For this assignment, you will submit a README.md with your answers to the questions below, along with the code you used to
produce your answers (including all boto3 scripts necessary to reproduce your cloud infrastructure, where relevant). You should
commit your Assignment 6 file(s) to your private “a6” GitHub repository (click here (classroom.github.com/a/jXPdPm3s) to
accept the GitHub Classroom invitation to access this repository) and submit a link to your repository here on Canvas (click
the “Submit Assignment” button to make your submission). You must work alone on this assignment. Before submitting your
assignment, please take a look at the tips one of the previous TAs for the course (Jinfei Zhu) compiled for writing a grader-friendly
README file and organizing your assignment GitHub repository (github.com/lsc4ss-a21/…) if you have not already done so.

  1. (6 Points Total) This first prompt builds on the survey submission pipeline you have been working on in Assignments 4 and 5. As
    a final step in your survey submission pipeline, you will write a Python function that can be invoked on a survey participant’s
    mobile device when they complete a survey to send their survey submission into an SQS queue, which should then trigger the
    Lambda function you wrote in Assignments 4 and 5.
    Note that each survey submission is initially saved as a JSON file (on the mobile device) when a participant completes a survey
    via the mobile app (see example files here). For the purposes of this prompt, you do
    not need to worry about the implementation of the mobile app or the creation of these JSON files. Your job is to write a Python
    function that will send a string representation of this JSON data (an individual survey) into an AWS SQS queue (your function
    will then be incorporated into the mobile app by another researcher). The SQS queue should then trigger your AWS Lambda
    function from Assignment 5, which will take this survey submission data and perform necessary processing and storage
    operations in the cloud. You should accomplish all of these tasks programmatically (using boto3 ) to ensure reproducibility of
    your architecture. Specifically, you should complete the following tasks:
    a. (1 Point) Write a Python function send_survey (which you can assume will be installed with the mobile app and will
    automatically be invoked after a survey is saved as a JSON file on the device) that has the following signature:
    def send_survey(survey_path, sqs_url):
        '''
        Input:  survey_path (str): path to JSON survey data
                    (e.g. './survey.json')
                sqs_url (str): URL for SQS queue
        Output: StatusCode (int): indicating whether the survey
                    was successfully sent into the SQS queue (200) or not (400)
        '''
    In the function body, you should use boto3 to send the data from a survey (a JSON file on the mobile device, converted into
    a string representation) into an AWS SQS queue.
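
One possible shape for such a function is sketched below. It is only an illustration under stated assumptions, not a reference solution: the third argument `sqs_client` is hypothetical (it is not part of the required signature) and exists only so the function can be exercised without AWS credentials. The sketch also assumes that any failure (missing file, malformed JSON, AWS error) should map to 400.

```python
import json


def send_survey(survey_path, sqs_url, sqs_client=None):
    """Send one survey JSON file to SQS; return 200 on success, else 400.

    sqs_client is a hypothetical extra argument (not in the assignment's
    signature) that lets a test pass in a stand-in for the boto3 client.
    """
    try:
        # Round-trip through json so that malformed files are rejected here
        with open(survey_path) as f:
            body = json.dumps(json.load(f))

        if sqs_client is None:
            import boto3  # imported lazily so the sketch loads without AWS set up
            sqs_client = boto3.client("sqs")

        response = sqs_client.send_message(QueueUrl=sqs_url, MessageBody=body)
        status = response["ResponseMetadata"]["HTTPStatusCode"]
        return 200 if status == 200 else 400
    except Exception:
        # Assumption: a missing file, bad JSON, or any AWS error counts as failure
        return 400
```

Called with only the two required arguments, the function builds its own boto3 client, which matches the signature above.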
    b. (2 Points) Create an SQS queue and configure it to act as a trigger for your Lambda function from Assignment 5 (which will
    process your data and write it to storage).
    Note that if you test your full survey submission pipeline using the example JSON files provided above (in a loop,
    using time.sleep(10) in between survey submissions, as in Assignment 5), you should see the following keys in your S3
    Bucket:
    ['0001092821120000.json', '0001092921120000.json', '0001093021120300.json',
    '0002092821120000.json', '0003092821120001.json', '0004092821120002.json',
    '0005092821122000.json']
    You should also see the following records if you query your DynamoDB table:
    {'q1': Decimal('1'), 'q2': Decimal('1'), 'user_id': '0001',
    'q3': Decimal('2'), 'q4': Decimal('2'), 'q5': Decimal('2'), 
    'num_submission': Decimal('3'),
    'freetext': "I lost my car keys this afternoon at lunch, so I'm more stressed than normal"}
    {'q1': Decimal('4'), 'q2': Decimal('1'), 'user_id': '0002',
    'q3': Decimal('1'), 'q4': Decimal('1'), 'q5': Decimal('3'), 
    'num_submission': Decimal('1'),
    'freetext': "I'm having a great day!"}
    {'q1': Decimal('1'), 'q2': Decimal('3'), 'user_id': '0003',
    'q3': Decimal('3'), 'q4': Decimal('1'), 'q5': Decimal('4'), 
    'num_submission': Decimal('1'),
    'freetext': 'It was a beautiful, sunny day today.'}
    {'q1': Decimal('1'), 'q2': Decimal('1'), 'user_id': '0004',
    'q3': Decimal('1'), 'q4': Decimal('1'), 'q5': Decimal('1'),
    'num_submission': Decimal('1'),
    'freetext': 'I had a very bad day today...'}
    {'q1': Decimal('3'), 'q2': Decimal('3'), 'user_id': '0005',
    'q3': Decimal('3'), 'q4': Decimal('3'), 'q5': Decimal('3'),
    'num_submission': Decimal('1'),
    'freetext': "I'm feeling okay, but not spectacular"}
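
The queue-to-Lambda wiring described in (b) could be sketched roughly as follows. The helper name `connect_queue_to_lambda`, the batch size of 10, and the example queue and function names in the trailing comment are all assumptions made for illustration; the boto3 clients are passed in so the helper can be tried against stand-in objects.

```python
def connect_queue_to_lambda(sqs, lambda_client, queue_name, function_name):
    """Create an SQS queue and register it as an event source for a Lambda.

    sqs and lambda_client are expected to be boto3.client("sqs") and
    boto3.client("lambda"); passing them in keeps the sketch testable.
    """
    # Create the queue (default attributes) and keep its URL for senders
    queue_url = sqs.create_queue(QueueName=queue_name)["QueueUrl"]

    # The event source mapping needs the queue's ARN, not its URL
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )
    queue_arn = attrs["Attributes"]["QueueArn"]

    # Ask Lambda to poll the queue and invoke the function with batches
    mapping = lambda_client.create_event_source_mapping(
        EventSourceArn=queue_arn,
        FunctionName=function_name,
        Enabled=True,
        BatchSize=10,  # assumed value; tune for your workload
    )
    return queue_url, mapping["UUID"]


# Example call (hypothetical names, requires AWS credentials):
#   import boto3
#   url, mapping_id = connect_queue_to_lambda(
#       boto3.client("sqs"), boto3.client("lambda"),
#       "a6_survey_queue", "a5_lambda")
```

Two things to keep in mind with this setup: the Lambda function's execution role needs permission to receive and delete messages from the queue, and with an SQS trigger the message text usually arrives under `event["Records"][i]["body"]`, so an existing handler may need to parse that string before processing.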
    c. (3 Points) Your PI, who is overseeing this project, is worried that if all of the participants in the study (potentially thousands)
    submit surveys at the same time in the day, this might cause the system to crash and your lab might lose data (this
    happened to your PI when they ran a similar digital survey via on-premise servers in the early 2000s). How would you
    reassure your PI that your architecture is scalable and will be able to handle such spikes in demand? Your response should
    be at least 200 words and discuss the scalability of each of the cloud services you used in your pipeline in detail.
  2. (4 Points) For this prompt, we ask you to declare whether you will complete a Final Project or a Final Exam as your capstone
    assignment for the course. You are welcome to meet with course staff and discuss your options and ideas with us before
    making your election and submitting your answer to this prompt.
    If you wish to complete a Final Project, you should additionally write a ~250-word proposal in your README for this
    assignment, detailing your plan for the project (see expectations and sample projects on the Final Exam/Final Project
    Assignment page). You should explain why your project
    idea helps to solve a social science research problem using large-scale computing methods and outline a schedule for
    completing the project by the deadline. If you are working in a group, you should also write down the names of your group
    members and describe how you are going to split up the work amongst yourselves.
    If you wish to take a Final Exam, you should instead write one question for possible inclusion in the Final Exam and submit it
    in your README for this assignment. The better the question you submit, the higher the likelihood you will see the question
    (or a closely related one) on the exam. We will additionally post the best questions to the Final Exam page on Canvas so that
    you can use them as study material for the exam. Note: YOU WILL NEED TO PROVIDE THE SOLUTION FOR YOUR
    QUESTIONS. A good question is one that goes beyond memorization and asks the student to apply a concept in a way that
    is similar to what we do in our in-class activities and conceptual questions in assignments (we will not ask implementation
    questions that involve writing code from scratch). Specifically, we plan to include questions of the following types (for
    additional examples, you can take a look at past questions used on the exam on the Final Exam/Final Project
    assignment page):
    Applied Conceptual Questions, such as:
    You are conducting a large digital experiment, in which you have designed an online music sharing application and
    recruited participants to use the platform over the course of a month. During the experiment, you will manipulate features
    of the website in order to test your research hypotheses. In order to run the experiment, you need to be able to
    collect/record thousands of data points per second; for instance: tracking the songs that participants download, the
    treatments that they were exposed to (by you the researcher), as well as all of the things that participants click on. When
    the experiment is over, you would like to perform a statistical analysis on a subset of the data to identify experimental
    interventions that caused participants to change their clicking/downloading behavior. Ultimately, when your work is
    published, you would also like to have your (de-identified) data publicly accessible, so that future scholars can replicate
    your statistical analysis.
    What databases and/or storage solutions would you use to solve these problems (storing data while you run the
    experiment, as well as afterwards) in the AWS cloud ecosystem? Why? How about if you scaled the experiment up by
    several orders of magnitude to include millions of participants? Would this change your data storage/management
    solution?
    Code Interpretation Questions, such as:
    Below is a serial version of a Monte Carlo simulation to estimate π that is written in Python. Identify parts of this code
    that could be accelerated using a GPU, as well as those that would best be run on a CPU – attempting to accelerate the
    estimation of π as much as possible. For each section of code, you should explain why your answers are the best
    hardware options for optimal performance (e.g. thinking in terms of some of the key bottlenecks and hardware limitations
    for CPUs vs. GPUs).

    # NumPy Pi Estimation with Monte Carlo Simulation
    import numpy as np
    import time

    t0 = time.time()
    n_runs = 10 ** 8

    # Simulate Random Coordinates in Unit Square:
    ran = np.random.uniform(low=-1, high=1, size=(2, n_runs))

    # Identify Random Coordinates that fall within Unit Circle and count them
    result = ran[0] ** 2 + ran[1] ** 2 <= 1
    n_in_circle = np.sum(result)

    # Estimate Pi
    print("Pi Estimate: ", 4 * n_in_circle / n_runs)
    print("Time Elapsed: ", time.time() - t0)

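
For readers unfamiliar with this estimator, the factor of 4 in the final line follows from a ratio of areas. The coordinates are drawn uniformly from the square with corners at (±1, ±1), which has area 4, and the unit circle inscribed in it has area π:

```latex
\[
P\left(x^2 + y^2 \le 1\right) = \frac{\pi \cdot 1^2}{2 \cdot 2} = \frac{\pi}{4}
\quad\Longrightarrow\quad
\hat{\pi} = 4 \cdot \frac{n_{\text{in\_circle}}}{n_{\text{runs}}}
\]
```
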
Troubleshooting Questions, such as:
You are training a linear regression model to predict the price of an AirBnB listing given a variety of text features derived
from the listing’s description on AirBnB (note that AirBnB publishes this data in CSV format for listings across the world
and the data is updated on a monthly basis).
You have written a machine learning workflow in PySpark that does the following on an AWS EMR cluster composed of 3
m5.xlarge EC2 instances (1 resource manager and 2 core instances), with 10 GB in EBS storage available on each
instance:

  1. Cleans the description text data (e.g. drops stop words and punctuation) from all AirBnB listings around the world
    from the past month (prior to the current month).
  2. Engineers features based on the clean description data (such as categorical and binary features indicating whether
    the description contains certain types of words).
  3. Uses MLlib’s CrossValidator to identify the optimal hyperparameters for your linear regression model given a grid of
    possible values used to tune the model (i.e. a grid search).
  4. Trains the regression model using the optimal hyperparameters from (3) and makes predictions on the prices of AirBnB
    listings from the current month.
    Having successfully run this workflow on one previous month of data, you want to increase your training data size to
    several years’ worth of data. As you increase the amount of data entering into your pipeline, though, you begin to observe
    unexpected (i.e. nonlinear) diminishing performance (in terms of speed) and beyond a certain data size, your job will not
    complete at all – it keeps running indefinitely.
    Describe at least two possible root causes of this slowdown (considering both hardware and software). Why would these
    be concerns? Is it possible to remedy them? What would be your solutions?
    Some hints for writing good questions:
    You shouldn’t make the question needlessly complicated or overly verbose.
    Try to be clear about what you’re asking and what you’re looking for.
    Try to cover multiple topics from the course – i.e. a question that touches on the memory hierarchy, GPU vs. CPU
    Parallelism, and Spark’s execution model, would be better than one that is narrowly relevant to invoking a Lambda function.
    Anything we’ve covered in the class is fair game (and you’re welcome to continue submitting relevant questions through
    Wednesday of Week 9 related to material we cover after this assignment – you just will not