sourcing nba/wnba stats with go
all data is sourced from nba.com with the go-etl program referenced above.
the program's cli lets me run the etl code in different ways from different
scripts. it is used in "build" mode in my database build script
(/scripts/bld.sh in the github repo) to fetch & insert data for every nba &
wnba regular season/post season game since 1970. it is also run in "daily"
mode by a cronjob (which runs /scripts/dly.sh) every day at approximately
midnight to fetch & insert data for the previous day only
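a minimal sketch of how that mode switch might be wired up in a go cli - the flag name, default, and entry-point functions below are assumptions for illustration, not the actual go-etl interface:

    package main

    import (
        "flag"
        "log"
        "time"
    )

    func main() {
        // hypothetical flag; the real go-etl cli may expose its modes differently
        mode := flag.String("mode", "daily", `run mode: "build" or "daily"`)
        flag.Parse()

        switch *mode {
        case "build":
            // backfill every regular season / post season game since 1970
            runBuild(1970)
        case "daily":
            // fetch & insert only the previous day's games
            runDaily(time.Now().AddDate(0, 0, -1))
        default:
            log.Fatalf("unknown mode %q", *mode)
        }
    }

    // placeholders for the real etl entry points
    func runBuild(firstSeason int) {}
    func runDaily(day time.Time)   {}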
/-|-\
the go-etl process takes advantage of go's concurrency features to make
several http requests to nba.com in quick succession. the package then
processes and structures the data to match the postgres database design,
splits the large volume of data into small chunks, and inserts those chunks
concurrently
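a rough sketch of that pattern, assuming hypothetical endpoint urls, a placeholder Row type, and stand-in parseResponse/insertChunk helpers - the real go-etl code will differ:

    package etl

    import (
        "context"
        "database/sql"
        "io"
        "net/http"
        "sync"

        "golang.org/x/sync/errgroup"
    )

    // Row stands in for one processed record headed for an intake table.
    type Row struct{}

    // placeholders for the real response-shaping and insert logic
    func parseResponse(r io.Reader) ([]Row, error)                       { return nil, nil }
    func insertChunk(ctx context.Context, db *sql.DB, chunk []Row) error { return nil }

    // FetchAndInsert makes several requests in quick succession, shapes the
    // responses to match the database design, then inserts the rows in small
    // concurrent chunks. the real requests may need extra headers.
    func FetchAndInsert(ctx context.Context, db *sql.DB, urls []string) error {
        var (
            mu   sync.Mutex
            rows []Row
        )

        // concurrent http requests
        g, gctx := errgroup.WithContext(ctx)
        for _, u := range urls {
            u := u // capture the loop variable for the goroutine (pre-go-1.22 habit)
            g.Go(func() error {
                req, err := http.NewRequestWithContext(gctx, http.MethodGet, u, nil)
                if err != nil {
                    return err
                }
                resp, err := http.DefaultClient.Do(req)
                if err != nil {
                    return err
                }
                defer resp.Body.Close()

                parsed, err := parseResponse(resp.Body)
                if err != nil {
                    return err
                }
                mu.Lock()
                rows = append(rows, parsed...)
                mu.Unlock()
                return nil
            })
        }
        if err := g.Wait(); err != nil {
            return err
        }

        // split the full result set into small chunks and insert them concurrently
        const chunkSize = 500
        ins, ictx := errgroup.WithContext(ctx)
        for start := 0; start < len(rows); start += chunkSize {
            end := start + chunkSize
            if end > len(rows) {
                end = len(rows)
            }
            chunk := rows[start:end]
            ins.Go(func() error { return insertChunk(ictx, db, chunk) })
        }
        return ins.Wait()
    }

errgroup here is just one convenient way to fan out the requests and surface the first error; a plain sync.WaitGroup with an error channel would do the same job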
/-|-\
this site was originally powered by a MariaDB database with data sourced
from nba.com using the nba_api python package. this system worked well, but
i wanted to learn more of the lower-level http concepts that the package
abstracts away, so i decided to rewrite the entire etl process, with my own
http requests, in Go. the nba_api documentation was incredibly helpful in
figuring out this process
legacy python ETL | py-nba-mdb
storing the stats in postgres
all stats on the site are served from a postgres database server running in
a docker container. the database was designed following the data
normalization principles of Codd's third normal form
/-|-\
the database is built by a single shell script - /scripts/bld.sh in the
github repo. the script builds & runs the docker container (configured in
the Dockerfile & compose.yaml files), executes SQL statements (scripts from
/sql in the github repo) to create all schemas, tables, procedures, etc.,
uses the go-etl cli to source & insert nba/wnba data since 1970, and runs
several stored procedures to process & load the inserted data into its
destination tables
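inside the program, the backfill step that the script invokes could look roughly like this - the function names, league labels, and loop structure are assumptions, not the actual implementation:

    package etl

    import (
        "context"
        "database/sql"
        "fmt"
    )

    // loadSeason is a placeholder for fetching & inserting one league/season/stage.
    func loadSeason(ctx context.Context, db *sql.DB, league string, season int, stage string) error {
        return nil
    }

    // runBuild sketches the backfill: every season since 1970, both leagues,
    // regular season and post season. the real code would also skip seasons
    // from before a league existed.
    func runBuild(ctx context.Context, db *sql.DB, firstSeason, lastSeason int) error {
        for season := firstSeason; season <= lastSeason; season++ {
            for _, league := range []string{"nba", "wnba"} {
                for _, stage := range []string{"regular season", "post season"} {
                    if err := loadSeason(ctx, db, league, season, stage); err != nil {
                        return fmt.Errorf("%s %d %s: %w", league, season, stage, err)
                    }
                }
            }
        }
        return nil
    }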
/-|-\
the go-etl program inserts data only into the tables in the intake schema.
each table in this schema is designed to match the structure of the json
response from a specific endpoint on nba.com. this keeps the changes made to
the source data before insertion minimal, which makes errors less likely and
the pipeline more maintainable.
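as an illustration of that mapping, an intake table and its go counterpart might mirror an endpoint's response roughly like this - the endpoint fields and table name here are assumptions, not the real schema:

    package etl

    import (
        "context"
        "database/sql"
    )

    // hypothetical mirror of one endpoint's json response shape - the real
    // intake tables, endpoint fields, and names will differ.
    type boxScoreRow struct {
        GameID   string `json:"gameId"`
        TeamID   int64  `json:"teamId"`
        PlayerID int64  `json:"personId"`
        Minutes  string `json:"minutes"`
        Points   int    `json:"points"`
        Rebounds int    `json:"reboundsTotal"`
        Assists  int    `json:"assists"`
    }

    // insert with minimal reshaping into a matching intake table
    // (intake.box_score is an assumed name)
    func insertBoxScoreRow(ctx context.Context, db *sql.DB, r boxScoreRow) error {
        _, err := db.ExecContext(ctx,
            `INSERT INTO intake.box_score
                 (game_id, team_id, player_id, minutes, points, rebounds, assists)
             VALUES ($1, $2, $3, $4, $5, $6, $7)`,
            r.GameID, r.TeamID, r.PlayerID, r.Minutes, r.Points, r.Rebounds, r.Assists)
        return err
    }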
the jdeko.me/bball api primarily interacts with the database's api schema,
whose tables are designed specifically for quick access to aggregated player
stats. the data in these tables is deleted and reaggregated each night after
new data is inserted into the database
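that nightly refresh could be as small as a couple of statements run from the daily job once the inserts finish - the table and procedure names below are hypothetical:

    package etl

    import (
        "context"
        "database/sql"
    )

    // nightly refresh of the api schema after the daily insert: clear the
    // aggregates, then rebuild them from the newly loaded data. the table and
    // procedure names are made up for illustration.
    func refreshAPISchema(ctx context.Context, db *sql.DB) error {
        if _, err := db.ExecContext(ctx, "TRUNCATE api.player_season_stats"); err != nil {
            return err
        }
        _, err := db.ExecContext(ctx, "CALL api.aggregate_player_season_stats()")
        return err
    }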