## 30 second pitch
## Highlights

- The `make` command will build the Docker containers for Apache Spark and Apache Hadoop, initialize the environment, verify the input data and generate the output report
- Project plan: the timeline may be pulled in by parallelizing some tasks; security, compliance and scope could affect the plan
- Compliance checklist
- The figure below shows the topology of how the data pipeline (`sales_by_states.py`) executes on a single-node development machine
- A mobile-friendly presentation can be viewed here. It has been generated from the R Markdown present inside `docs`
## Prerequisites

- Ubuntu 16.04
- docker 18.09.6 and docker-compose 1.24.0, available and ready to use
- make
- Assumes that the input data is clean

You may need to configure a proxy to pull the Docker images. Do not run in production!
## Start here
The data pipeline executes inside Docker containers on a development machine. The entire pipeline is automated through a self-documented Makefile. Executing the `make` command in the root of this repository will build the Docker containers for Spark and Hadoop, start them, verify the input data and generate the report.

Either execute `make` in the root of the repository, or run the individual commands `make setup start verify report`. Most commands are idempotent.

Explore the other commands using `make help`:
```
$ make help
all           setup start verify report
clean-output  Delete output data
connect       To enter the Spark container
report        Print the output report and save it to a file
setup         Build Docker containers
start         Starts Spark and Hadoop. Jupyter is at localhost:8888
stop          Stop and remove the containers
verify        Check if the input data is skewed
```
The output of `make report` is shown below. It is saved locally as well as in Hadoop.
```
$ make report
AK#2016#8#1#11#123458
AK#2016#8#1##123458
AK#2016#8###123458
AK#2016####123458
AK#####123458
AL#2017#8#1#10#123457
AL#2017#8#1##123457
AL#2017#8###123457
AL#2017####123457
AL#2016#8#1#12#123459
AL#2016#8#1##123459
AL#2016#8###123459
AL#2016####123459
AL#####246916
CA#2016#2#1#9#246912
CA#2016#2#1##246912
CA#2016#2###246912
CA#2016####246912
CA#####246912
OR#2016#2#1#9#123456
OR#2016#2#1##123456
OR#2016#2###123456
OR#2016####123456
OR#####123456
```
The report has been saved to `hdfs://hadoop:9000/output` and locally at `dataout/sales_by-state.txt`.
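The aggregation itself lives in `sales_by_states.py`. As a rough, hypothetical sketch of how a rollup report in this shape can be produced with PySpark (the input path, schema and column names below are assumptions, not the project's actual code):

```python
# Hypothetical sketch of a rollup aggregation in the spirit of sales_by_states.py.
# The input path, schema and column names are assumptions for illustration only.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sales_by_states_sketch").getOrCreate()

sales = spark.read.csv("hdfs://hadoop:9000/input/sales.csv",
                       header=True, inferSchema=True)

dims = ["year", "month", "day", "hour"]
report = (
    sales.rollup("state", *dims)                 # subtotals for every prefix of the grouping columns
         .agg(F.sum("amount").alias("total"))
         .where(F.col("state").isNotNull())      # drop the single grand-total row
         .select(
             F.concat_ws(
                 "#",
                 F.col("state"),
                 # concat_ws silently skips nulls, so coalesce to "" keeps the
                 # empty slots (##) for the rolled-up levels
                 *[F.coalesce(F.col(c).cast("string"), F.lit("")) for c in dims],
                 F.col("total").cast("string"),
             ).alias("line")
         )
)

report.write.mode("overwrite").text("hdfs://hadoop:9000/output")
```

`rollup` emits one row per prefix of the grouping columns, which is what would yield the progressively blanked `##` fields seen in the report above.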
## When finished
Execute `make stop` on your host machine. This stops and removes the containers.
## Areas of improvement
Run `check_input.py` from the host machine:
```
$ time make verify
customers-by-state (count: 4, mean: 1.25, stdev: 0.4330127018922193, max: 2.0, min: 1.0)
('CA', 2)
('AK', 1)
('AL', 1)
('OR', 1)
customers-by-transactions (count: 5, mean: 1.2, stdev: 0.4, max: 2.0, min: 1.0)
('123', 1)
('789', 2)
('456', 1)
('124', 1)
('101112', 1)

real    0m8.836s
user    0m0.041s
sys     0m0.025s
```

Now run the same script directly inside the container:
```
$ make connect
jovyan@jupyter:~/work$ time spark-submit check_input.py 2>/dev/null
customers-by-state (count: 4, mean: 1.25, stdev: 0.4330127018922193, max: 2.0, min: 1.0)
('CA', 2)
('AK', 1)
('AL', 1)
('OR', 1)
customers-by-transactions (count: 5, mean: 1.2, stdev: 0.4, max: 2.0, min: 1.0)
('123', 1)
('789', 2)
('456', 1)
('124', 1)
('101112', 1)

real    0m8.280s
user    0m14.407s
sys     0m1.373s
```

Compare the `real`, `user` and `sys` times: inside the container the `user` time exceeds the `real` time, which indicates that Spark is spreading the work across multiple CPU cores.
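For reference, below is a minimal PySpark sketch of the kind of skew check that produces output in this shape; the input path, file layout and column positions are assumptions for illustration, not the actual contents of `check_input.py`.

```python
# Hypothetical sketch of a skew check in the spirit of check_input.py.
# The input path and column positions are assumptions for illustration only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("check_input_sketch").getOrCreate()
sc = spark.sparkContext

def report_skew(label, pairs):
    """Count records per key, then summarize how even that distribution is."""
    counts = pairs.countByKey()                      # {key: number of records}
    stats = sc.parallelize(list(counts.values())).stats()
    print(label, stats)                              # count / mean / stdev / max / min
    for item in counts.items():
        print(item)

# Column 0 is assumed to hold a transaction id and column 1 a US state code.
rows = sc.textFile("hdfs://hadoop:9000/input/customers.csv") \
         .map(lambda line: line.split(","))

report_skew("customers-by-state", rows.map(lambda row: (row[1], 1)))
report_skew("customers-by-transactions", rows.map(lambda row: (row[0], 1)))
```

The `(count: ..., mean: ..., stdev: ..., max: ..., min: ...)` line matches the format printed by the `StatCounter` that `RDD.stats()` returns.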
### sysstat

Plot the performance metrics as an interactive time series. The chart shows sysstat metrics for 4 executions of the `sales_by_states.py` job. The metrics are reported at the quantile given in the table below.

| metric | unit  | quantile |
|--------|-------|----------|
| cpu    | %busy | 75%      |
| disk   | %util | 99.5%    |
| runq   | unit  | 99%      |
The folder `dataout` contains sysstat metrics for CPU, memory, network, disk and proc after you run the `collect-sar` and `parse-sar` commands inside the Jupyter container. Inspect them with:

```
$ head -n2 dataout/*.dat
```
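As a rough illustration of how the quantiles in the table above could be computed from those files, here is a hedged pandas sketch; the file name `dataout/disk.dat`, the whitespace delimiter and the `%util` column are assumptions about the `parse-sar` output rather than its documented layout.

```python
# Hypothetical sketch: compute one of the quantiles from the table above.
# The file name, delimiter and %util column are assumptions about the
# parse-sar output; adjust to the real layout under dataout/.
import pandas as pd
import matplotlib.pyplot as plt

disk = pd.read_csv("dataout/disk.dat", sep=r"\s+")

# 99.5th percentile of disk utilisation, matching the row in the table
print("disk %util @ 99.5th percentile:", disk["%util"].quantile(0.995))

# A static line chart stands in for the interactive time series mentioned above
disk["%util"].plot(title="disk %util per sample")
plt.xlabel("sample")
plt.ylabel("%util")
plt.savefig("dataout/disk_util.png")
```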