The Relational Data Model consisted of four tables such as Users, Brands, Receipts and RewardsReceiptsItemList.
The queries/answers to the questions are in the file with the title “SQL_Queries.pdf”
The data quality issues and solutions to overcome these issues are discussed in the Jupyter Notebook with the title “FetchRewards.ipynb” Here, the data is converted from JSON to CSV format for the easier execution of data analysis part and for the data analysis and exploration purpose, Python programming language is used.
Some of the data quality issues faced during the exploration of the files (Users, Brands, Receipts):
- Null/NA Values present
- Duplicate records – there are multiple instances where duplicate IDs with respective attributes are duplicated
- Wrong formatting of the data – Date is stored in the 13-digit number instead of standard format of timestamp, Title names are written in the wrong format
Through data cleaning and formatting processes, data quality issues from the data have been removed. For these processes, several python functions, and libraries (Pandas, NumPy) are used.
Also, many records from rewardsReceiptItemList possess barcode as ‘4011’ which is nothing but ‘Item not found’ case. As cost of these products are getting added to the Total Spent, it is not correct to remove these products directly during data cleaning process.
Addition to this, this column has many items barcodes whose values are not matching with the barcodes from the Brands table. So, it is difficult to track them down. Therefore, the join between the tables Brands and RewardsReceiptsItemList is with brand name -> description and not with the barcode.
The email to business people/ Stakeholders is in the file with the title "Email.pdf"