自动化你的R-E.T.L工作流程的实例演示

187 阅读2分钟

你的ETL管道可以有不同的方式,这主要取决于你如何访问你的数据,处理它和使用它。
在这篇文章中,我将处理这样一个场景:从googlesheet中获取数据,在rmarkdown脚本中处理,并在shiny dashboard中使用。

在这种情况下,数据预处理脚本和shiny应用程序脚本都可以访问数据文件夹。因此,发生的情况是,markdown脚本对数据进行预处理,并将整洁的数据输出到data文件夹中,另一端的shiny应用则接收整洁的数据。

有这样一个数据,我一直在工作并做了一些文章。这是一个正在进行的关于工资的调查,数据被收集到一个googlesheet中,提交是自发进行的。可以考虑看看我之前关于这个数据的文章。
在处理了数据并开发了一个用于分析的仪表盘之后,我想到了一个方法,可以用最新的数据提交来更新仪表盘,而不一定要手动运行rmarkdown脚本。

输入cron job!

cron job是一个unix(linux/mac)系统的实用程序,可以用来自动进行重复性工作。我假设你对此有一个概念,如果没有,我想你可能需要考虑从篇文章中学习一些基础知识。

注意:这在linux/mac的设置中是可行的,对于windows可以考虑任务调度器。

考虑到数据的提交都是自动完成的,我们可以设置一个cron job,根据你的使用情况,每小时、每天、每周、每月等运行rmarkdown脚本。
不同的环境下,使用情况会有所不同,例如,我们可能会有这样的情况:数据是按分钟/秒、按日甚至按月流进来的。因此,你要写一个cron job,按照数据提交的频率来更新你的数据。

注意:有一些R包用于调度cron作业 cronR用于Unix和 TaskscheduleR如果你想简单点的话:)也就是说,对我来说,我选择了做脚本。

调度一个Cron作业

运行下面的命令来打开一个crontab

$ crontab -e

一个脚本将被打开,之后在脚本中输入下面一行,并保存。

* * * * * Rscript -e "rmarkdown::render('~/path/to/your/script.Rmd')"

注意:请记得相应地编辑你的rmarkdown脚本的路径。

上述cron job将每分钟运行一次,你可以编辑它,以便在你希望脚本被执行时运行。

从googlesheets4访问数据

有了googlesheets4软件包,我们可以使用一个url从google sheet中获取数据。
你将需要使用gs4_deauth()函数停用认证,这样我们就不会在每次脚本获取数据时被提示输入访问令牌。
记住,我们的目的是将数据导入、处理到可视化/存储的整个过程自动化,因此不需要认证。

gs4_deauth() # this deactivates authentication. We won't be prompted to login using credentials.
salary_data <- read_sheet("https://docs.google.com/spreadsheets/d/1IPS5dBSGtwYVbjsfbaMCYIWnOuRmJcbequohNxCyGVw/edit?resourcekey#gid=1625408792")
salary_data <- data.table::setDT(salary_data)

数据处理

在这一点上,我们在R中得到了我们的数据,我们现在可以根据需要来处理它。
在这一点上,tidyverse库会派上用场。有了它,我们可以根据需要访问用于清理和转换数据的函数,这样我们就可以有一个整洁的数据作为输出,例如可以导入到一个闪亮的应用程序或用于建模。
Repo上的Datacleaning.Rmd文件有所有的清洗和转换步骤。

# change the columns name - observe the naming syntax for headers/columns
 names(salary_data) <- c("Timestamp","age","industry","job_title","job_title_context","annual_salary",
                         "Other_monetary_comp","currency","currency_other",
                         "income_context","country","state","city","professional_experience_years","field_experience_years",
                         "highest_edu_level","gender","race")
                         
                         
# College degree and Some college kinda means the same. Let us just categorize them into one (College degree). 
salary_data$highest_edu_level <- gsub("Some college","College degree",salary_data$highest_edu_level)
# replace the variants of United states with one name.  
salary_data$country <- stri_replace_all_regex(salary_data$country,
                                  pattern=c('US', 'USA', 'usa',"U.S.","us","Usa","United States of America","united states",
                                            "U.S>","U.S.A","U.S.A.","united states of america","Us","The United States",
                                            "United State of America","United Stated","u.s.","UNITED STATES","united States","USA-- Virgin Islands",
                                            "United Statws","U.S","Unites States","U. S.","United Sates","United States of American","Uniited States",
                                            "Worldwide (based in US but short term trips aroudn the world)","United Sates of America",
                                            "Unted States","United Statesp","United Stattes","United Statea","United Statees","UNited States","Uniyed states",
                                            "Uniyes States","United States of Americas","US of A","United States of america","U.SA","United Status","U.s.",
                                            "U.s.a.","USS","Uniteed States","United Stares","Unites states","Unite States","The US","United states of America",
                                            "For the United States government, but posted overseas","UnitedStates","United statew","United Statues",
                                            "Untied States","USA (company is based in a US territory, I work remote)","Unitied States","Unitied States",
                                            "USAB","United Sttes","united stated","United States Of America","Uniter Statez","U. S","USA tomorrow",
                                            "United Stateds","US govt employee overseas, country withheld","Unitef Stated","United STates","USaa",
                                            "uSA","america","United y","uS","USD","United Statss","UsA","United  States","United States is America",
                                            "United states of United States","United StatesD","United States- Puerto Rico","United Statesaa","United Statest",
                                            "United States govt employee overseas, country withheld","United StatesA tomorrow","United StatesAB",
                                            "United StatesA (company is based in a United States territory, I work remote)","United StatesS",
                                            "United Statesa.","RUnited Statessia","United States of A","United StatesA-- Virgin Islands","United StatesA.",
                                            "United State","United states","United StatesA","United Statess","United StatessA","United States",
                                            "United Statessmerica","United StatUnited Statess","Canada and United StatessA","United States"),
                                  replacement="United States",
                                  vectorize=FALSE)
                                  
                                  
# use gsub to remove the A from states. 
salary_data$country <- gsub("United StatesA","United States",salary_data$country)
# do the grouping and count the records by the country column 
percentage.count <- salary_data[,.(totalcount=.N,country)][,.(count=.N,totalcount,country),by=.(country)][,.(percentage=round(count/totalcount*100)),by=.(count,country)] %>% unique()
salary_data_US <- salary_data[country %like% "United States",]
# for column currency replace other with USD where currency_other == USD
# data.table approach to replacing values in a column based on a condition in another column.
salary_data_US <- salary_data_US[currency_other %in% c("USD","American Dollars","US Dollar"), currency := "USD"]
# There are alternative approaches to this, like using case_when with mutate or if_else with mutate. 
# You could explore such options just to have the knowledge of the alternatives.
salary_data_US <- salary_data_US[currency == "USD",]
# if you wanna check if the currency is USD only you could do uncomment and run the following lines of code.  
#unique(salary_data_US$currency)
# convert annual_salary to numerical data type. 
salary_data_US$annual_salary <- gsub(",","",salary_data_US$annual_salary)
salary_data_US$annual_salary <- as.numeric(salary_data_US$annual_salary)
# we have some values which are null. we drop them  
salary_data_US <- salary_data_US[ highest_edu_level !="",]
# remove the NA
salary_data_US <- salary_data_US[!is.na(annual_salary),]
# for the gender variable, let us only pick the male and female groupings 
salary_data_US <- salary_data_US[gender %in% c("Man","Woman")]
salary_data_US$gender <- as.factor(salary_data_US$gender)
# subset the dataset 
salary_data_US <- salary_data_US[,c("age","annual_salary","professional_experience_years","highest_edu_level","gender","industry","job_title")]

储存/版本化你的数据

一旦我们有了整洁的数据格式,我们就可以使用针式包来存储和版本。
你可能想知道为什么我们要对数据进行版本管理,好吧,想象一下这样的场景:你想回滚到数据的某个状态,那么pins包将是你的首选,因为有了它,你可以对数据进行版本管理,从而让你可以根据情况选择获取一个特定的版本。但默认情况下,获取的是最新版本。

# create a board
board <- board_folder("./data",versioned = TRUE)
board %>% pin_write(salary_data_US,"salary_cleaned", type="csv")

使用数据

经过清理的数据可以被一个闪亮的仪表盘获取,存储在数据库中供将来使用,甚至可以进一步用于建模。
在这种情况下,我们有一个访问数据的仪表板。

薪资分析仪表板