Sitemap
4 min readFeb 1, 2023

MY PORTFOLIO

Hi, my name is Emmanuel Musyoki.

I have been a data analyst for 2 years with experience in Business Intelligence (BI) analysis, Operations Research Analysis, and Market Research Analysis.

PROJECTS

DATA EXPLORATION AND CLEANING USING MYSQL

Data link UNICEF_Data .CSV— HIV early infant diagnosis

Purpose: To create a clear visualization and show the trend in HIV infant diagnosis in Africa between 2011 to 2022

The report will help understand the trend in the spread of HIV from mother to child during birth and also understand the rate at which expectant mothers visit centers to know their HIV status.

I have made it as simple as possible for easier navigation and understanding.

UNICEF CSV Data

Zoom image will be displayed
THE ORIGINAL DATA FROM UNICEF_data

Step 1: Importing the above CSV data into MySQL workbench for cleaning

Database name health_db

Table name hiv_Africa_CSV

--creating a table in MySQL
USE health_db;
CREATE table HIV_infant_diagnosis (Country_ID INT PRIMARY KEY NOT NULL,
type VARCHAR(50),country VARCHAR(50), UNICEF_region VARCHAR(100),
indicator VARCHAR(100), Data_source VARCHAR(30), year INT,value INT,
lower INT, upper INT);

Step 2: Import the information into the created table.


--importing data into MySQL table
LOAD DATA LOCAL INFILE "C:\\Users\\USER\\Downloads\\HIV_Early_Infant_Diagnosis_2022"
INTO TABLE HIV_infant_diagnosis
FIELDS TERMINATED BY ","
LINES TERMINATED BY "\n"
IGNORE 1 LINES
(Country_ID,type,country,
UNICEF_region, indicator,Data_source,
year,value,lower,upper);
Zoom image will be displayed
Data Imported Into MySQL Workbench

DATA CLEANING

Filtering the data to retain African countries only

DELETE FROM hiv_africa_csv
WHERE UNICEF_Region NOT LIKE '%africa%'
order by country;
select * from hiv_africa_csv

The Table now contains information for Africa Countries only

Zoom image will be displayed

filtering to remove unnecessary columns.

-- removing unecessary columns from the whole table
ALTER TABLE hiv_africa_csv
DROP COLUMN ISO3,
DROP COLUMN type,
DROP COLUMN Lower,
DROP COLUMN upper;
select * from hiv_africa_csv;
Zoom image will be displayed
The percentage value per country from 2011 to 2021

Creating a table for every year, from 2010 to 2021 (using subquery)

CREATE TABLE 2010_data(
select * from hiv_africa_csv where year = 2010);

CREATE TABLE 2011_data(
select * from hiv_africa_csv where year = 2011);

CREATE TABLE 2012_data(
select * from hiv_africa_csv where year = 2012);

CREATE TABLE 2013_data(
select * from hiv_africa_csv where year = 2013);

CREATE TABLE 2014_data(
select * from hiv_africa_csv where year = 2014);


CREATE TABLE 2015_data(
select * from hiv_africa_csv where year = 2015);


CREATE TABLE 2016_data(
select * from hiv_africa_csv where year = 2016);


CREATE TABLE 2017_data(
select * from hiv_africa_csv where year = 2017);

CREATE TABLE 2018_data(
select * from hiv_africa_csv where year = 2018);

CREATE TABLE 2019_data(
select * from hiv_africa_csv where year = 2019);

CREATE TABLE 2020_data(
select * from hiv_africa_csv where year = 2020);

CREATE TABLE 2021_data(
select * from hiv_africa_csv where year = 2021);

Sample table for each year

Zoom image will be displayed

NB: I have already removed duplicates using an excel filter function.

Now we have the table data for every year, let's compare the data for two or more years using a Join statement.

--comparing 2012 and 2017 data 
SELECT * from 2012_data
join 2017_data on 2012_data.country = 2017_data.country;
Zoom image will be displayed
a screenshot of the joined table after running the above SQL query.

you can easily compare the percentage value of the two years' data by using joins. In the case above, the percentage value for Burundi in 2012 was 9.5% and in 2017 the value had increased to 18.3%.

In Liberia, the value dropped from 28.5% in 2012 to 9.2% in 2017.

Therefore, one can easily see the trend for two or more years in the same country by joining the tables.

Overview

Suppose you were asked to filter the data to show the results for African countries/ regions only, That is how I approached this problem and as you can see, with just a few lines of code we can retrieve information for every year, country, or compare results for 2 years.

You can also perform more analysis on this data example:

  1. Which country has the maximum percentage rate in a certain year
  2. the average HIV infection rate in each year
  3. Perform statistical analysis to predict the spread of HIV from mother to child during birth
  4. Use the window function to fetch certain information from the table.

No responses yet