MET CS 689 Designing and Implementing a Data Warehouse Assignment 3A: ETL - Pyth

MET CS 689 Designing and Implementing a Data Warehouse Assignment 3A: ETL - Python Overview of the Assignment: ETL takes most of the data warehouse developers’ and administrators’ time. This assignment will go through some of the more common ETL processes using Python.Part 1 –Extract, Staging Download the two csv files Ships and CLIWOC15.csv. You just completed the “Extract” phase, it does get more complicated, but we keep it simple here.We will begin by “staging” the data. The first step before “Transformation” is to get the data prepared or what’s referred to as “staged” before we can load it, and that initial step is to load the data into “staging tables” in this case these are going to be staging data frames. The idea of staging tables is to keep the data as close to the source format as possible. We will begin by using Python, to load the two files into two data frames called Ship_df and Trip_df, respectively. Review the data and note that the two files share three columns: ShipName, ShipType, and Nationality. These will be used as “natural composite key” columns to join the two data frames (tables).Pandas Reference/Hints/Notes· Pandas - DataFrame. Reference (w3schools.com) is a great reference to Pandas methods· Review run LoadTitanic.py from assignment 1 on how to create a data frame. from a CSV file. Note that you don’t have to have a separate Load.py, you can paste the load command directly in the notebook, however having separate load.py source files allow you to build modularity.· display(df) method shows all the data in the data frame. – this is very helpful in understanding the data· df.head()method allows you to inspect the header and the first 10 rows – this is very helpful in understanding the data with performance in mind.· df.count() method will show count of records, it’s good to know what was loaded or not.· List(df.columns.values) method shows the attributes· Loading Trips_df might give you an error, understand why you are getting the error, and research how to solve it (you can use the “lazy option”, and not convert the data types- especially for the initial staging table)1. Once you load the two data frames: How many rows and columns are in each data frame?Rows Columns Ship_df Trip_df Show the record count of both Ship_df and Trip_df data frames as well as the count.Python command: Screenshots of the executed command: Part 2 –Creating SCD1 Dimension and Key maintenance 2. Our end goal for this section of the assignment is to create a Ship dimension table ShipDim. In this case we are going to keep it simple and use SCD type1, meaning overwrite if there are any changes, or add if it’s a new record. Recall that SCD type 1 needs to have a unique instance of each record, so let’s check if there are any duplicates in the Ship_dfHints:· There are null values in some attributes, we want all combinations with the null (NaN) values, not all of these techniques below will work correctly- look to figure out which ones will give you the best resultso value_counts() function – this might only help in seeing the count for a single attribute.o groupby() function – make sure to include all three attributes as we want to find a distinct combination of all three.o concatenate attributes for example df.a +"-"+ df.bo df.fillna('') which will help transform. empty values into stringso df.describe(include='all') which will give you statistics of the data frame.· You can apply multiple functions together on a data frame. For example, you can apply both groupby first and value_counts after to get a count of distinct values, however groupby may have issues with the nulls in one of the columns.Show if there are any duplidai 写MET CS 689 Designing and Implementing a Data Warehouse Assignment 3A: ETL - PythonPython cate combination of all three attributes in the Ship_df, this can just show the counts of combinations.Python command: Screenshots of the executed command: 3. You will notice that Ship_df has some duplicates that need to be removed before that data can be used to populate a ShipDim dimension table for it to be in SCD1 format. Use pandas to drop the duplicates and store the result into another data frame. labelled ShipDistinct_df.· Hint: Review the drop_duplicates() function.Show the above operation to create a new ShipDistinct_df data frame. by removing the duplicate records.Python command: Screenshots of the executed command: 4. Show new ShipDistinct_df data frame. record count which has no duplicates, use the commands you used in question 1 and 3 on this new data frame.Python command: Screenshots of the executed command: How many rows are in ShipDistinct_df after duplicates have been dropped? ______Note: the new count should make sense in reviewing groupby count results between the two data frames.5. Now let’s focus on the Trip_df. Trip has some additional ships (ShipName, ShipType, Nationality) that do not currently appear in ShipDistinct_df. These new dimension rows need to be pulled and added to the dimension table. Inspect the column names of the Trips_df and provide screenshot of the columns.Hint: Look into columns.tolist() function, if not all columns display, modify settings to display all columns by using pd.set_option('display.max_columns', None)Python command: Screenshots of the executed command: 6. Our goal is to identify any Ships in Trips_df that is not in ShipDistinct. We can use a LEFT JOIN on these dataframes to determine which values of (ShipName, ShipType, Nationality) are in Trip_df but not in ShipDistinct_df.First create a joined data frame. called ShipTrips_df by using the merge() function on the Trips_df and the ShipDistinct_df; use a left join on (ShipName, ShipType, Nationality) and set the indicator to True.Show the merge command to create the ShipTrips_df data frame.Python command: Screenshots of the executed command: 7. Inspect the resulting ShipsTrips_df data frame. (use the head() function), specifically scroll all the way to the right and note the _merge column that has been added. Let’s determine the unique combinations of _merge column by using the value_counts() function.Show the value_counts() of the _merge column.Python command: Screenshots of the executed command: For each of the resulting values (you should see three) of the results above, very briefly explain what it means – short, bulleted list. · Your explanation goes here…. ·8. Now we can filter out the new records we will need to bring into our Ships from the joined data frame, decide the filter condition based on the results from question 7.Hint: Look into query and filter functions to use on the ShipsTrips_df. The filter function will show the attributes that we want to look at, while the query function will help us filter the resultsProvide the function call for the ShipsTrips_df showing only the columns that we need (ShipName, ShipType, Nationality), as well as the _merge column. The function call should filter (query) _merge column as outlined in the directions above.Python command: Screenshots of the executed command: 9. Your rows count should match the count in question 7. How many new records were found? ____10. Now let’s find the distinct instances of the ShipName, ShipType, and Nationality. Perform. the same operations as outlined in steps 2 through 4 to create a ShipsTrips_Distinct_df (this new data frame. should not have any duplicates)Python command: Screenshots of the executed command: 11. Show the resulting data frame. WX：codehelp