SQL Database maintenance
SQL Server Project

Abstract
This project demonstrates the process of loading and managing data in an Azure SQL database. The data consists of a web-scraped Kaggle dataset featuring English app reviews for shopping applications, including the review text, ratings, and whether developers responded to the reviews.
The workflow involves creating a server and SQL database in Azure, connecting to it using Microsoft SQL Server Management Studio (SSMS), importing the dataset into the database, and verifying the import. Additional tasks include creating views, performing SELECT queries, and summarising, sorting, and filtering the data. Database management activities, such as monitoring logs, activity, and transactions, were also conducted.
This project highlights my proficiency in using SQL and the Azure portal to create and manage databases and servers, establishing connections via SSMS, and employing SQL queries to verify, organise, and prepare data for future analysis.
Initial Data assessment
Source: ShoppingAppReviews Dataset on Kaggle
Outline:
This dataset comprises a substantial volume of app reviews from 12 popular online shopping apps. The data was scraped from the following platforms: Alibaba, Aliexpress, Amazon, Daraz, eBay, Flipkart, Lazada, Meesho, Myntra, Shein, Snapdeal, and Walmart. It includes various fields, such as review scores, thumbs-up counts, and developer replies, with all data stored in CSV format.
While primarily intended for testing SQL database management, this dataset also holds potential for future data analysis. By analysing customer feedback, we can gain valuable insights into the user experience of these shopping apps, which handle millions of transactions each year.
Completeness:
The dataset is generally complete, with no significant issues related to missing data or completeness. However, there is a notable presence of null values in the replycontent
column, as many reviews do not have replies from the developer. Although this results in a high number of null entries, retaining these rows is important for analysis, as the absence of replies can provide valuable context in understanding user engagement.
Accuracy:
As the source of the Kaggle dataset is not explicitly stated, the accuracy of the data cannot be fully verified. Nonetheless, the dataset is still valuable for this personal project, particularly in demonstrating SQL server maintenance and management skills.
Key variables:
- Review Content
- Review Score
- Thumbs up count
- Reply content
- App name
Potential relationships:
- App name
- Review score
- ReviewID
Issues and Limitations: See Completeness and Accuracy
Data workflow
- Microsoft Azure Control Panel: Using the business Azure account linked to the website’s domain and email, an Azure server was created, followed by the establishment of a cloud SQL database. From this point forward, all updates to the database will be uploaded to the Azure cloud.
- Microsoft SQL Server Management Studio: Through Microsoft Entra authentication, SSMS was used to connect to the Azure cloud SQL database. The data import function in SSMS was utilised to import all CSV files, with data types being checked and primary keys and null values either accepted or rejected. A combined table of all the app reviews was created, and summarising/filtering views were generated. See the code below
Analysis conclusions
Since only SQL Server was used, the analysis performed was relatively basic. Views were created to display app reviews that contained replies, and summarisations of the data were made, including the calculation of average review scores, the percentage of reviews that received replies per app, and the total number of reviews per app. The code for these operations is provided below.
Files / Code
Code:
USE [app_reviews]; --Select the database being maintained
-- All files are uploaded to an Azure SQL database, SQL Server Management Studio connects to the online SQL database.
-- Loaded in all tables via
-- Tasks --> Import Flat file --> Follow wizard, set file types/primary key/accept null values
View the top 1000 rows of each table to have a quick eyeball of the data to ensure the uploads have worked.
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Alibaba];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Aliexpress];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Amazon shopping];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Daraz Online Shopping App];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Flipkart];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Flipkart];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Lazada];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Meesho];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Myntra];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Shein];
SELECT TOP (1000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[Snapdeal];
-- If there is data analysis planned in the future, it is useful to combine all the tables as they use the same columns.
-- Combine all the columns in each table into a new table called CombinedReviews
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Alibaba' AS Source
INTO [dbo].[CombinedReviews]
FROM [dbo].[Alibaba]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Aliexpress' AS Source
FROM [dbo].[Aliexpress]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Amazon shopping' AS Source
FROM [dbo].[Amazon shopping]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Daraz Online Shopping App' AS Source
FROM [dbo].[Daraz Online Shopping App]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Flipkart' AS Source
FROM [dbo].[Flipkart]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Lazada' AS Source
FROM [dbo].[Lazada]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Meesho' AS Source
FROM [dbo].[Meesho]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Myntra' AS Source
FROM [dbo].[Myntra]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Shein' AS Source
FROM [dbo].[Shein]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Snapdeal' AS Source
FROM [dbo].[Snapdeal]
UNION ALL
SELECT [reviewId], [content], [score], [thumbsUpCount], [at], [replyContent], [repliedAt], [appName], 'Walmart' AS Source
FROM [dbo].[Walmart];
-- Show bottom 1000 rows based off reviewID to check the data has been combined correctly.
SELECT TOP (10000) [reviewId]
,[content]
,[score]
,[thumbsUpCount]
,[at]
,[replyContent]
,[repliedAt]
,[appName]
FROM [dbo].[CombinedReviews]
ORDER BY [reviewId] DESC;
--Creates a view of all rows that contain a reply from the app team.
CREATE OR ALTER VIEW [dbo].[ReviewsWithReplies] AS
SELECT [reviewId],
[content],
[score],
[thumbsUpCount],
[at],
[replyContent],
[repliedAt],
[appName],
[Source]
FROM [dbo].[CombinedReviews]
WHERE [replyContent] IS NOT NULL;
-- Summarising that data
--Creates a view to summarise the scores of each app, the number of reviews, replies, no replies of all the apps.
CREATE OR ALTER VIEW ReviewSummary AS
SELECT
COUNT(DISTINCT reviewID) AS total_reviews,
COUNT(DISTINCT repliedAt) AS total_replies,
COUNT(CASE WHEN repliedAt IS NULL THEN 1 END) AS Total_no_replies,
AVG(CASE WHEN appName = 'Alibaba' THEN score END) AS Average_rating_Alibaba,
AVG(CASE WHEN appName = 'Aliexpress' THEN score END) AS Average_rating_Aliexpress,
AVG(CASE WHEN appName = 'Daraz Online Shopping App' THEN score END) AS Average_rating_Daraz,
AVG(CASE WHEN appName = 'FlipKart' THEN score END) AS Average_rating_FlipKart,
AVG(CASE WHEN appName = 'Lazada' THEN score END) AS Average_rating_Lazada,
AVG(CASE WHEN appName = 'Meesho' THEN score END) AS Average_rating_Meesho,
AVG(CASE WHEN appName = 'Myntra' THEN score END) AS Average_rating_Myntra,
AVG(CASE WHEN appName = 'Shein' THEN score END) AS Average_rating_Shein,
AVG(CASE WHEN appName = 'Snapdeal' THEN score END) AS Average_rating_Snapdeal,
AVG(CASE WHEN appName = 'Walmart' THEN score END) AS Average_rating_Walmart
FROM CombinedReviews;
--Look at the new view
SELECT * FROM ReviewSummary;
--Selecting with filters
--Selects all reviews that have a score of 5 and are of the Amazon app
Select * FROM CombinedReviews
WHERE score = 5 AND appName = 'Amazon shopping';
--Selects the reviews that have a high thumbs up count
select * FROM CombinedReviews
WHERE thumbsUpCount > 1000 AND appNAME = 'Alibaba';
--Updating data in the CombinedReviews
--Updated a specific review to have a score of 3 instead of 2.
update CombinedReviews
SET score = 3
WHERE reviewId = '00004d25-d588-4331-8bc6-e26634a6013e'
--Adding data to a table
INSERT INTO CombinedReviews (reviewID, content, score, thumbsUpCount, at, replyContent, repliedAt, appName, Source)
VALUES ('3478fnfuehufesf', 'I really dislike the interface of the app', 1, 0, 1927463892, NULL, NULL, 'Alibaba', 'Alibaba');
--Summary aggregation
--Looks at each app, the total reviews, total review replies, average score and the percentage of reviews that are replied to.
--CAST is used to make the count of replies a float so the division works.
select
appName,
COUNT(DISTINCT reviewId) AS total_reviews,
COUNT(DISTINCT repliedAt) AS total_replies,
AVG(score) AS Average_score,
CAST(COUNT(DISTINCT repliedAt) AS FLOAT) / COUNT(DISTINCT reviewId) * 100 AS precentage_replied
FROM CombinedReviews
GROUP BY appName;
-- Database management
-- Check database status
select name, state_desc
from sys.databases;
--Check for corruption
DBCC CHECKDB ('app_reviews');
--Creation of local backups. As we are using an Azure SQL database this will not work as Microsoft automatically backs up the data in the cloud.
--Making a local backup would need to be done in the Azure SQL online service system
BACKUP DATABASE app_reviews
TO DISK = 'C:\Users\tyler\Desktop\DATA\SQL Projects\Database maintainance (App Reviews)'
-- Monitoring performance
--Looks at all queries and transactions within the SQL server to indentify any transactions which are slowing the server down.
SELECT
session_id,
start_time,
status,
total_elapsed_time/1000 AS ElapsedTimeSeconds,
blocking_session_id,
wait_type,
wait_time
FROM sys.dm_exec_requests
WHERE total_elapsed_time > 5000; -- Transactions that take more than 5 seconds.
--Look at any active/open transcations
--Helps find if the server is under heavy load or not and where from
SELECT
transaction_id,
name,
transaction_begin_time,
transaction_type
FROM sys.dm_tran_active_transactions;
--Lastly, check the memory and CPU usage of the server, but this would again be checked in the Azure control panel and not in an SQL query
--Log management
--Logs can become too large, reducing performance of the server or storage problems.
--It is important to look at if there are any long-running transations and how big the logs have become in each database
--Checks each database and the size of the logs
DBCC SQLPERF(logspace);
--Indexes and Statistics
--Indexes are key for data retrieval and searching (e.g. filtering, sorting and joining)
--Creating an index of all the rows in the CombinedReviews column that have a reply.
--For all future queries that use only the reviews with replies, this index will be used to speed up the query.
CREATE INDEX idx_CombinedReviews_not_null
ON CombinedReviews (repliedAt)
Where repliedAt IS NOT NULL
--Check that the created index works
DBCC SHOW_STATISTICS ('CombinedReviews', 'idx_CombinedReviews_not_null');
--Statistics are key for speeding up query optimisation process in terms of data distribution in a table.
--Statistics are automatically created if a column has an index
--But if a query in a column occurs a lot, it is working making statistics for that column.
CREATE STATISTICS stats_repliedAt
ON CombinedReviews (repliedAt);
--Check that the created statistic works
DBCC SHOW_STATISTICS ('CombinedReviews', 'stats_repliedAt');
--Updates all indexes and statistics if the automatic update has not already.
EXEC sp_updatestats;
-- Views all the indexes present in a table
SELECT name AS IndexName
FROM sys.indexes
WHERE object_id = OBJECT_ID('CombinedReviews');
-- Views all the statistics present in a table
SELECT name AS StatisticsName
FROM sys.stats
WHERE object_id = OBJECT_ID('CombinedReviews');
--Checking database size, again, more appropriate to check in the Azure control panel.
EXEC sp_spaceused;