Visualization: NBA Shot Data Visualization with RStudio


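This time we use RStudio to visualize NBA shot data.

Visualizing NBA Shot Data

This post started when I came across this image (original page here), which made me want to analyze NBA shot data more closely.

We will look at how to visualize shot data for a single player, a single team, or the whole league. The first problem is how to obtain shot-location data.

There are plenty of NBA datasets online, but nearly all of them are aggregated statistics, and I could not find a ready-made set of raw shot data. After some searching, it turned out that the official NBA website actually provides this data (although it is hidden in the page).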

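Obtaining the Data

Scraping Shot Data from the Official NBA Website

For example, James Harden's shot data for the 2019-20 regular season is on this page.

On that page, click Advanced Filters and select Player = James Harden to get James Harden's data for the season to date.

After selecting SHOT PLOT on the same page, the page shows this chart: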

image.png

Inspecting the page elements reveals code like the following:

<g class="shotplot__shots">
  <!----><g class="shotplot__shot shotplot__miss" ng-repeat="shot in events | filter:{madeflag:false}" ng-attr-transform="translate({{ ::shot.locX }}, {{ ::shot.locY }})" svg-include="#shotplot-miss-template" data-index="0" data-y="124" data-x="221" data-period="1" data-clock="10:59" data-madeflag="false" data-team-id="1610612745" data-team-name="Houston Rockets" data-player-id="201935" data-player-name="James Harden" data-shot-type="3PT Field Goal" transform="translate(124, 221)">
    <title>Harden, James - Q1 10:59</title>
    <line x1="-4" x2="4" y1="-4" y2="4"></line>
    <line x1="4" x2="-4" y1="-4" y2="4"></line>
  </g>
  ...
</g>

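Next, we save the corresponding page element <g class="shotplot__shots"> to a local file named HardenJames_Playoffs_201920.html, and then show how to extract the data we need from it.

Extracting the Data

Below we use R's string-processing functions to extract the data from the page text.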
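
Before the full script, here is a minimal sketch of the idea on a single element: the data-* attributes can be pulled out by name with a regular expression. The sample line is shortened from the snippet shown earlier, and the helper name parse_data_attrs is hypothetical, not part of the original script.

```r
# Hypothetical helper: return the data-* attributes of one element
# as a named character vector.
parse_data_attrs = function(line) {
  # Match every data-<name>="<value>" pair in the line.
  m = regmatches(line, gregexpr('data-[a-z-]+="[^"]*"', line))[[1]]
  keys = sub('=.*$', '', m)        # text before the first "="
  vals = sub('^[^=]*="', '', m)    # drop the key and the opening quote
  vals = sub('"$', '', vals)       # drop the closing quote
  setNames(vals, keys)
}

# Sample line, shortened from the page snippet shown earlier.
line = '<g class="shotplot__shot shotplot__miss" data-index="0" data-y="124" data-x="221" data-period="1" data-clock="10:59" data-madeflag="false" data-player-name="James Harden" data-shot-type="3PT Field Goal" transform="translate(124, 221)">'

attrs = parse_data_attrs(line)
print(attrs[c("data-y", "data-x", "data-madeflag", "data-shot-type")])
```

Because each value is looked up by attribute name, an extra field such as data-index does not shift the other columns.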

library(dplyr)
hd = readLines("./data/HardenJames_Playoffs_201920.html")

# --- find the rows containing data of shot MISS ---
rows = grep(pattern = "shotplot__shot shotplot__miss", hd)
hd_miss = hd[rows]

# find these key fields: data-y, data-x, data-period, data-clock, data-madeflag, data-team-id, data-team-name, data-player-id, data-player-name, data-shot-type
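# regexpr returns exactly one position per line (the first match),
# so both vectors stay aligned with hd_miss
char_from = regexpr("data-y", hd_miss) %>% as.integer()
char_end = regexpr(" transform", hd_miss) %>% as.integer() - 1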
miss_key_fields = substr(hd_miss, char_from, char_end)

# get the key fields:
get_keys = function(x){
  strsplit(x, split = " d") %>% 
          unlist() %>%
          gsub(pattern = "data", replacement = "ata") %>%
          gsub(pattern = "=.*$", replacement = "") %>%
          paste0("d",.)
}

miss_keys = get_keys(miss_key_fields[1])

# get the value fields:
get_values = function(x){
  strsplit(x, split = " d") %>% 
          unlist() %>%
          gsub(pattern = "data", replacement = "ata") %>%
          gsub(pattern = "ata.*=", replacement = "")
}

# test
get_values(miss_key_fields[1])

# one character vector of 10 values per shot
miss_values = lapply(miss_key_fields, get_values)

# flatten and refill row by row: one row per shot, one column per field
miss_values_mat = matrix(unlist(miss_values), ncol = 10, byrow = TRUE)

# using miss_keys as column names
colnames(miss_values_mat) = miss_keys

dim(miss_values_mat)


# --- find the rows containing data of shot MAKE---
rows = grep(pattern = "shotplot__shot shotplot__make", hd)
hd_make = hd[rows]

# find these key fields: data-y, data-x, data-period, data-clock, data-madeflag, data-team-id, data-team-name, data-player-id, data-player-name, data-shot-type
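# regexpr returns exactly one position per line (the first match),
# so both vectors stay aligned with hd_make
char_from = regexpr("data-y", hd_make) %>% as.integer()
char_end = regexpr(" transform", hd_make) %>% as.integer() - 1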
make_key_fields = substr(hd_make, char_from, char_end)

# get the key fields (get_keys is defined above and reused here):
make_keys = get_keys(make_key_fields[1])

# get the value fields with get_values, which is also defined above

# test
get_values(make_key_fields[1])

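# one character vector of 11 values per shot (made shots also carry data-index)
make_values = lapply(make_key_fields, get_values)

# flatten and refill row by row: one row per shot, one column per field
make_values_mat = matrix(unlist(make_values), ncol = 11, byrow = TRUE)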

# using make_keys as column names
colnames(make_values_mat) = make_keys

dim(make_values_mat)

# --- combine miss (10 columns) and make (11 columns) together ---

values_mat = rbind(miss_values_mat, make_values_mat[,-3]) # remove data-index from make_values_mat and then combine

hd_data = as.data.frame(values_mat)

# save as a csv file.
write.csv(hd_data, "./hd_2019_2020_regular_clean.csv", row.names = FALSE, quote = FALSE)
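
The positional drop make_values_mat[,-3] relies on data-index being the third column. A slightly more defensive sketch keeps only the columns that both matrices share and aligns them by name. The two small matrices below are made-up stand-ins for miss_values_mat and make_values_mat, not real shot data.

```r
# Toy stand-ins for miss_values_mat / make_values_mat
# (values are invented for illustration).
miss_mat = matrix(c("124", "221", "false",
                    "-30", "15",  "false"),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(NULL, c("data-y", "data-x", "data-madeflag")))
make_mat = matrix(c("5", "10", "7", "true"),
                  ncol = 4, byrow = TRUE,
                  dimnames = list(NULL, c("data-y", "data-x", "data-index", "data-madeflag")))

# Keep only the columns present in both, in the order used by miss_mat.
common = intersect(colnames(miss_mat), colnames(make_mat))
combined = rbind(miss_mat[, common, drop = FALSE],
                 make_mat[, common, drop = FALSE])
print(combined)
```

With this approach the combine step keeps working even if the extra attribute appears at a different position.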

The cleaned data looks like this:

hd_csv = read.csv("./hd_2019_2020_regular_clean.csv")
hd_csv

image.png
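
One thing to watch when reading the CSV back: read.csv rewrites column names such as data-madeflag to data.madeflag, and it may convert a column of true/false strings to logical values. The sketch below uses a few made-up rows (not Harden's real shots) to normalise the flag so it works either way, and then computes the share of made shots per shot type.

```r
# Made-up rows in the same layout as the saved CSV (illustrative only).
csv_text = 'data-y,data-x,data-madeflag,data-shot-type
"124","221","false","3PT Field Goal"
"-10","30","true","2PT Field Goal"
"200","150","true","3PT Field Goal"
"15","40","false","2PT Field Goal"'

shots = read.csv(text = csv_text)

# Works whether the column was read as logical or as character.
shots$made = tolower(as.character(shots$data.madeflag)) == "true"

# Share of made shots per shot type.
fg_pct = tapply(shots$made, shots$data.shot.type, mean)
print(fg_pct)
```

The same two steps can be applied to hd_csv once the real file has been loaded.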

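Visualizing the Shot Data

Now we need to draw missed shots and made shots with different markers on top of an NBA court, so we need a suitable court background image.

The page source analyzed above shows that the page also provides the matching base image: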

<image transform="translate(-260, -60)" width="520" height="490" xlink:href="/stats/media/img/shotchart-blue.png"></image>

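Consider what information this snippet gives us:

# Link to the half-court base image:  https://www.nba.com/stats/media/img/shotchart-blue.png
# Size of the half-court base image:  width="520" height="490"
# How x,y coordinates map onto the base image: transform="translate(-260, -60)"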
library(png)
nba_png = readPNG("./img/13_nba_shortchart_blue.png")
par(bg = "white", mar = c(0,0,0,0))

plot(0, type='n', main="", xlab="x", ylab="y", asp=1, xlim = c(-260, 260), ylim = c(460,-60))

rasterImage(nba_png,
            xleft = -260 + 10, xright = 260 - 10, # +-10 fine-tunes the image edges
            ybottom = 430 - 10 , ytop = -60 + 10,
            angle = 0)

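# read.csv may turn the true/false strings into logical values,
# so normalise the flag before comparing
made = tolower(as.character(hd_csv$data.madeflag)) == "true"

# in the page snippet, data-y matches the first translate() argument (locX)
# and data-x the second (locY), hence the swapped arguments here
with(hd_csv, 
     points(x = data.y, y = data.x,
            pch = ifelse(made, 1, 4),
            col = ifelse(made, "darkgreen", "darkred"),
            cex = 1.2, lwd = 2
            )
     )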
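
The extents passed to rasterImage follow from the <image> element: translate(-260, -60) places the top-left corner of the image, and width and height give the opposite corner. A small sketch of that arithmetic, where the function name image_extent is hypothetical:

```r
# Hypothetical helper: corners of an image placed by translate(tx, ty)
# with the given width and height, in the page's coordinate system.
image_extent = function(tx, ty, width, height) {
  c(xleft = tx, xright = tx + width, ytop = ty, ybottom = ty + height)
}

ext = image_extent(tx = -260, ty = -60, width = 520, height = 490)
print(ext)
```

This gives x from -260 to 260 and y from -60 to 430. The plotting code then pulls each edge in by 10 units as a manual adjustment, and reverses ylim so that y grows downward as it does in the SVG.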

The final visualization of James Harden's shot data for the 2019-2020 regular season (as of 2020-12-02) is shown below:

image.png