A startup named Sparkify wants to analyze user activities using their song and user data. The current data is spread among several JSON files, making it hard to query and analyze.
This project aims to create an ETL pipeline to load song and user data to a Postgres database, making it easier to query and analyze data.
Data is currently collected for song and user activities, in two directories:
data/log_data and data/song_data, using JSON files.
{
"num_songs": 1,
"artist_id": "ARGSJW91187B9B1D6B",
"artist_latitude": 35.21962,
"artist_longitude": -80.01955,
"artist_location": "North Carolina",
"artist_name": "JennyAnyKind",
"song_id": "SOQHXMF12AB0182363",
"title": "Young Boy Blues",
"duration": 218.77506,
"year": 0
}{
"artist": "Survivor",
"auth": "Logged In",
"firstName": "Jayden",
"gender": "M",
"itemInSession": 0,
"lastName": "Fox",
"length": 245.36771,
"level": "free",
"location": "New Orleans-Metairie, LA",
"method": "PUT",
"page": "NextSong",
"registration": 1541033612796,
"sessionId": 100,
"song": "Eye Of The Tiger",
"status": 200,
"ts": 1541110994796,
"userAgent": "\"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\"",
"userId": "101"
}Records in log data associated with song plays i.e. records with page set to
NextSong.
| Column | Type | Nullable |
|---|---|---|
| songplay_id | integer | not null |
| start_time | timestamp without time zone | not null |
| user_id | integer | not null |
| level | character varying | not null |
| song_id | character varying(18) | |
| artist_id | character varying(18) | |
| session_id | integer | not null |
| location | character varying | not null |
| user_agent | character varying | not null |
Primary key: songplay_id
Users in the app.
| Column | Type | Nullable |
|---|---|---|
| user_id | integer | not null |
| first_name | character varying | not null |
| last_name | character varying | not null |
| gender | character(1) | not null |
| level | character varying | not null |
Primary key: user_id
Songs in music database.
| Column | Type | Nullable |
|---|---|---|
| song_id | character varying(18) | not null |
| title | character varying | not null |
| artist_id | character varying(18) | not null |
| year | integer | not null |
| duration | double precision | not null |
Primary key: song_id
Artists in music database.
| Column | Type | Nullable |
|---|---|---|
| artist_id | character varying(18) | not null |
| name | character varying | not null |
| location | character varying | not null |
| latitude | double precision | |
| longitude | double precision |
Primary key: artist_id
Timestamps of records in songplays broken down into specific units.
| Column | Type | Nullable |
|---|---|---|
| start_time | timestamp without time zone | not null |
| hour | integer | not null |
| day | integer | not null |
| week | integer | not null |
| month | integer | not null |
| year | integer | not null |
| weekday | integer | not null |
Pre-requisites:
- Python 3
- pipenv
- pyenv (optional)
- PostgreSQL Database
To install project python dependencies, you should run:
pipenv installThe database can be installed locally or ran using Docker, which is the preferred method.
To use docker to run Postgres, you should run:
docker run --net=host --name postgres -e POSTGRES_PASSWORD=your_password -d postgresTo initially access the database, you should run:
psql -h localhost -U postgresYou should run the following commands under psql to setup user access to
Postgres and create the initial sparkifydb database:
CREATE ROLE student WITH ENCRYPTED PASSWORD 'student';
ALTER ROLE student WITH LOGIN;
ALTER ROLE student CREATEDB;
CREATE DATABASE sparkifydb OWNER student;To run the project locally, use pipenv to activate the virtual environment:
pipenv shellAnd run the scripts to create database tables:
./create_tables.pyand populate data into tables:
./etl.pyData can be verified using the provided test.ipynb jupyter notebook:
jupyter notebook