Welcome to Covid19 Visualization Project’s documentation!

Background and task

Dashboards have become a common way of viewing statistics related to the COVID-19 pandemic, which has led to the development of various data sources and programmatic ways of accessing them. However, simply showing data without the underlying context may be misleading, so adding information that helps to understand and interpret the data is critical. We would love to see your ideas for building pipelines that fetch data and relevant contextual information.

Set up a data processing and visualization pipeline for COVID-19 data. You will retrieve the data from a public API (e.g., covidtracking.com), write code to process the data as needed, and provide visualizations of COVID-19 infections over time. The task should:

Allow interactive exploration and interpretation of COVID-19 infections in selected countries (e.g., the US)
Deliver a reproducible pipeline that re-executes automatically
Provide clean and well-documented code

Database

The COVID Tracking Project compiled US COVID-19-related data from 02/2020 until 03/2021. It provides data by day and state in three main areas: testing, hospitalization, and patient outcomes. The data is served through an API that returns a JSON file.
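As a rough illustration of the extraction step, the sketch below downloads the JSON file and keeps only the date and one metric. The endpoint URL, the field names `date` and `positive`, and the helper names `fetch_raw` and `parse_records` are assumptions made for this example; they are not taken from the project code, and the project itself performs this step with PySpark.

```python
import json
from urllib.request import urlopen

# Assumed endpoint; the text only names covidtracking.com as the provider.
API_URL = "https://api.covidtracking.com/v1/us/daily.json"


def fetch_raw(url=API_URL, timeout=30):
    """Download the raw JSON text from the API (requires network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_records(raw_json, metric="positive"):
    """Keep only the date and the chosen metric, sorted oldest first."""
    rows = json.loads(raw_json)
    selected = [
        {"date": int(row["date"]), metric: row[metric]}
        for row in rows
        if row.get("date") is not None and row.get(metric) is not None
    ]
    return sorted(selected, key=lambda r: r["date"])


# Hypothetical two-day payload using the assumed field names.
sample = '[{"date": 20210307, "positive": 120}, {"date": 20210306, "positive": 100}]'
records = parse_records(sample)
```

Keeping the download separate from the parsing means the parsing can be exercised on a small saved payload without network access.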

As a proof of concept, this tool takes the total number of COVID-19 cases by day from that database. This metric is cumulative; therefore, to visualize daily COVID-19 cases, we need to transform the original data by calculating the difference between consecutive days. Some US health bureaus reported cases only on weekdays, while others reported every day without interruption. This explains the drop in the number of cases during weekends. To correct this “noise”, the tool calculates a rolling mean with a window of 7 days. This transformation smooths the signal and corrects the artifact; however, it also makes fast changes in the signal harder to detect.
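The two transformations described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the project's PySpark implementation; the function names and the synthetic series (14 cases on each weekday, none reported on weekends) are made up for the example, and a trailing window is assumed.

```python
def daily_differences(cumulative):
    """Turn a cumulative series into daily counts.

    The first day has no predecessor, so its daily value is None.
    """
    if not cumulative:
        return []
    return [None] + [b - a for a, b in zip(cumulative, cumulative[1:])]


def rolling_mean(values, window=7):
    """Trailing mean over `window` points; None until the window is full."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(sum(chunk) / window)
    return out


# Synthetic cumulative totals: five reporting days, then a silent weekend.
cumulative = [0, 14, 28, 42, 56, 70, 70, 70, 84, 98, 112, 126, 140, 140, 140]
daily = daily_differences(cumulative)
smoothed = rolling_mean(daily, window=7)
```

In this synthetic series the daily values alternate between 14 and 0, while every full 7-day window contains exactly five reporting days, so the smoothed series is flat. The same averaging is what delays the response to a genuine sudden change.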

Solution:

The schema of the solution proposed for this task is represented in the graph below. PySpark is used for the ETL job and Bokeh for generating the interactive visualization. Selected data is extracted from the database API. Then, after a data validation check, several transformations are applied to the data: conversion of dates, calculation of daily differences, and time-series smoothing via a rolling mean. Next, the data is loaded into a Parquet database. The loading process checks for duplicates, so it can be run repeatedly without affecting the database.
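The validation and duplicate-safe loading steps can be illustrated with a minimal sketch. It assumes dates arrive as integers such as 20210307 and uses an in-memory dictionary as a stand-in for the Parquet database; the function names and the validation rules are hypothetical and not taken from the project code.

```python
from datetime import date, datetime


def to_date(yyyymmdd):
    """Convert an integer such as 20210307 into a datetime.date."""
    return datetime.strptime(str(yyyymmdd), "%Y%m%d").date()


def validate(rows, metric="positive"):
    """Drop rows whose date or metric is missing, malformed, or negative."""
    valid = []
    for row in rows:
        try:
            day = to_date(row["date"])
            value = int(row[metric])
        except (KeyError, TypeError, ValueError):
            continue
        if value >= 0:
            valid.append({"date": day, metric: value})
    return valid


def load(store, rows):
    """Insert rows keyed by date; dates already stored are skipped."""
    added = 0
    for row in rows:
        if row["date"] not in store:
            store[row["date"]] = row
            added += 1
    return added


raw = [
    {"date": 20210306, "positive": 100},
    {"date": 20210307, "positive": 120},
    {"date": "bad", "positive": 5},   # malformed date, rejected
    {"date": 20210308},               # missing metric, rejected
]
store = {}                            # stand-in for the Parquet database
first_run = load(store, validate(raw))
second_run = load(store, validate(raw))
has_first_day = date(2021, 3, 6) in store
```

Because rows are keyed by date, running the load a second time with the same input adds nothing, which is the property that lets the pipeline be re-executed safely.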

Finally, the data previously loaded into the Parquet database is used to generate an interactive Bokeh plot in HTML.

In this proof of concept, I used the total number of COVID-19 cases. However, it can be adapted to any of the metrics available in the API.

Project Structure

The structure of the project is inspired by this repository.

root/
 |-- configs/
 |   |-- configs.json
 |-- dependencies/
 |   |-- logging.py
 |   |-- spark.py
 |-- COVID19_project/
 |   |-- __main__.py
 |   |-- extract.py
 |   |-- transform.py
 |   |-- load.py
 |   |-- visualize.py
 |-- tests/
 |   |-- test_data/
 |   |   |-- test_to_date.parquet
 |   |   |-- ...
 |   |-- test_transform.py
 |   |-- data_generator.py
 |-- requirements.txt

The ETL task and the visualization tool are contained in COVID19_project. There, each file contains the methods required for one part of the project: extraction, transformation, loading, or visualization. Parameters can be configured in configs/configs.json and are then used by the transformation methods. Additional modules that support the PySpark session and logging can be found in dependencies. Finally, unit test modules are stored in tests, next to small representative samples of input and output data. Each transformation method has its own test function.
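To illustrate how a configuration file and a per-transformation test might fit together, here is a minimal sketch using only the standard library. The configuration keys (`metric`, `rolling_window`), the `load_config` helper, and the test class are hypothetical; the actual keys in configs/configs.json and the tests in tests/test_transform.py are not shown in this documentation.

```python
import json
import unittest

# Hypothetical contents of configs/configs.json.
CONFIG_TEXT = '{"rolling_window": 7}'


def load_config(text):
    """Parse the JSON configuration and fill in defaults for missing keys."""
    defaults = {"metric": "positive", "rolling_window": 7}
    return {**defaults, **json.loads(text)}


def daily_differences(cumulative):
    """Example transformation: cumulative totals to daily counts."""
    if not cumulative:
        return []
    return [None] + [b - a for a, b in zip(cumulative, cumulative[1:])]


class TestDailyDifferences(unittest.TestCase):
    """One test case per transformation, with small fixed inputs."""

    def test_known_series(self):
        self.assertEqual(daily_differences([5, 8, 8, 12]), [None, 3, 0, 4])

    def test_empty_series(self):
        self.assertEqual(daily_differences([]), [])


config = load_config(CONFIG_TEXT)
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDailyDifferences)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the tests would normally be discovered by a test runner rather than executed inline; the explicit runner is used here only to keep the example self-contained.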

Ideas for further development

Visualization

Extract daily cases per state.
Integrate the per-state data with each state's geographic boundary in a GeoParquet file.
Develop a map visualization of the US with a colormap based on the number of cases.

Database

Implement a Hadoop/Hive database to test performance.

Querying

Allow custom SQL queries to retrieve information from the database.