How to Generate fake Relational Data in Python and Getting Insight using SQL in Big Query
A relational table, also known as a database table, is a fundamental structure in a relational database. It is a two-dimensional representation of data organized into rows and columns. Each table consists of a specific set of related data, where each row represents a record or entity, and each column represents a specific attribute or field.
The relational table follows the principles of the relational model, which emphasizes the concept of relationships between tables. These relationships are established through keys, such as primary keys and foreign keys, which link records across multiple tables.
Each row in a relational table contains a unique identifier (primary key) that distinguishes it from other rows, while each column represents a distinct attribute or property of the data. The values in the table cells represent the actual data stored in the database.
Relational tables provide a structured and organized way to store and retrieve data in a relational database management system (RDBMS). They enable efficient data management, data integrity through constraints, and support powerful querying capabilities using SQL (Structured Query Language).
I designed the relational database depicted in the provided image during my exploration and experimentation with the transfer service application. This database comprises three tables: the account table, transaction table, and payment table. Additionally, I utilized the Python faker library to generate mock data, resulting in 100 accounts and 10,000 rows.
#import library
from faker import Faker
import pandas as pd
import random
from random import randint
#Generate Country and Language
fake=Faker(locale='id_ID')
#Geberate account table
def create_account(num_account):
account=pd.DataFrame()
for i in range(0,num_account):
account.loc[i,'id']= str(randint(1245968, 9857483))
account.loc[i,'first_name']=fake.first_name()
account.loc[i,'last_name']=fake.last_name()
account.loc[i,'email']=fake.ascii_free_email()
account.loc[i,'phone_number']=fake.phone_number()
account.loc[i,'is_verified']=fake.random_element(elements=("yes","no"))
return account
#create 100 account
account=create_account(100)
#save data into csv
account.to_csv("account.csv",index=False)
#Generate Transaction Table
def transaction(num_trans):
trans=pd.DataFrame()
for i in range(0,num_trans):
trans.loc[i,'transaction_id']= fake.bothify(text='FT#########')
trans.loc[i,'created_at']=fake.date_time_between(start_date='-2y', end_date='now', tzinfo=None)
trans.loc[i,'recipient_bank']=fake.random_element(elements=("Dana","Gopay","LinkAja","Ovo","Shopeepay","BRI","BNI","BSI","BCA",
"Mandiri","JAGO","Maybank","Permata","Seabank","Muamalat","BJB"))
trans.loc[i,'account_number']=fake.aba()
trans.loc[i,'amount']=fake.random_int(min=10000, max=5000000, step=1000)
trans.loc[i,'unique_code']=fake.random_int(min=50, max=999)
trans.loc[i,'transaction_status']=fake.random_element(elements=("Need Confirmation","Checking","Processed","Success","Failed","Cancelled"))
return trans
#generate 1000 transaction
trans=transaction(10000)
#generate admin fee based on random 0 or 1500
trans['admin_fee']=random.choices([0,1500],k=len(trans))
#generate relational user id in account table and transaction table
trans['user_id']=random.choices(account["id"], k=len(trans))
#save data into csv
trans.to_csv("transaction.csv",index=False)
#Make Payment table
def payment(num_trans):
payment=pd.DataFrame()
for i in range(0,num_trans):
payment.loc[i,'payment_id']= fake.bothify(text='FP#####')
payment.loc[i,'send_at']=fake.date_time_between(start_date='-2y', end_date='now', tzinfo=None)
payment.loc[i,'payment_method']=fake.random_element(elements=("BCA","BNI","BRI","BSI","CIMB","Danamon","Digibank","Mandiri","Muamalat","Permata","Jenius"))
payment.loc[i,'account_number']=fake.aba()
payment.loc[i,'payment_status']=fake.random_element(elements=("Success","Failed","Cancelled"))
return payment
#generate 1000 payment process
pay=payment(10000)
#generate relational transaction id in trans table and payment table
pay['transaction_id']=random.choices(trans["transaction_id"], k=len(pay))
#save data into csv
pay.to_csv("payment.csv",index=False)
Getting Insight Fake Relational Table using Big Query
After creating the relational table, the CSV file is uploaded into Big Query. Then, insights will be sought from the data.
Monthly Transaction
SELECT date_trunc(t.created_at, MONTH) AS Month,
sum(t.amount) as amount_transaction FROM `data-test-391009.flip_data_example.transaction` t
LEFT JOIN `data-test-391009.flip_data_example.account` a on t.user_id = a.id
where t.transaction_status ='Success'
group by 1
order by 1 desc
Transaction Status Detail in 2023
SELECT count (transaction_id) as total_transaction, transaction_status
FROM `data-test-391009.flip_data_example.transaction`
#left join `data-test-391009.flip_data_example.account` a on t.user_id=a.id
where created_at > "2022-12-31"
group by 2
Top 10 user account in 2023
SELECT t.user_id,concat(a.first_name,' ',a.last_name) as name,
count(t.transaction_id) as total FROM `data-test-391009.flip_data_example.transaction` t
LEFT JOIN `data-test-391009.flip_data_example.account` a on t.user_id = a.id
where t.created_at > "2022-12-31"
and t.transaction_status ='Success'
group by 1,2
order by 3 desc
LIMIT 10
Top 10 payment method in 2023
SELECT p.payment_method, count(t.transaction_id) as total
FROM `data-test-391009.flip_data_example.payment` p
LEFT JOIN `data-test-391009.flip_data_example.transaction` t on t.transaction_id = p.transaction_id
where t.created_at > "2022-12-31"
and t.transaction_status ='Success'
group by 1
order by 2 desc
LIMIT 10
Top 10 User Saldo based on Unique Code from Transaction
SELECT t.user_id,concat(a.first_name,' ',a.last_name) as name,
sum(t.unique_code) as total_saldo FROM `data-test-391009.flip_data_example.transaction` t
LEFT JOIN `data-test-391009.flip_data_example.account` a on t.user_id = a.id
where t.transaction_status ='Success'
group by 1,2
order by 3 desc
LIMIT 10