Advanced SQL Querying
Advanced SQL Querying: Baseball Database Analysis
This project demonstrates advanced SQL techniques through comprehensive analysis of a baseball database using PostgreSQL. The analysis covers four key domains: school player production, team salary dynamics, player career trajectories, and comparative player statistics.
The complete SQL code for this project is available at GitHub Repository
Analysis Overview
The project analyzes a multi-table baseball database containing schools, players, salaries, and team information. The analysis demonstrates:
- Complex JOIN operations across multiple tables
- Window functions for ranking and cumulative calculations
- Common Table Expressions (CTEs) for query organization
- Date manipulation and temporal analysis
- Aggregation and grouping strategies
Part I: School Analysis
The school analysis examines which educational institutions produced the most professional baseball players across different decades.
Key Queries
Decade-by-Decade School Count
Calculates the number of schools producing players in each decade using integer division for decade grouping:
SELECT (yearid/10) * 10 AS decade,
COUNT(DISTINCT(schoolid)) AS num_schools
FROM schools
GROUP BY decade
ORDER BY decade;
Top Schools by Player Production
Identifies the five schools that produced the most players using LEFT JOIN to preserve school detail information:
SELECT sd.name_full,
COUNT(DISTINCT(s.playerid)) AS num_players
FROM schools s LEFT JOIN school_details sd
ON s.schoolid = sd.schoolid
GROUP BY sd.name_full
ORDER BY num_players DESC
LIMIT 5;
Top 3 Schools Per Decade
Uses CTEs and window functions to rank schools within each decade:
WITH school_counts AS (
SELECT
(s.yearid/10) * 10 AS decade,
sd.name_full AS school_name,
COUNT(distinct(s.playerid)) AS num_players
FROM schools s LEFT JOIN school_details sd
ON s.schoolid = sd.schoolid
GROUP BY decade, sd.name_full
),
ranked AS (
SELECT
decade,
school_name,
num_players,
ROW_NUMBER() OVER (PARTITION BY decade ORDER BY num_players DESC, school_name) AS rn
FROM school_counts
)
SELECT decade, school_name, num_players
FROM ranked
WHERE rn <= 3
ORDER BY decade DESC, rn;
Part II: Salary Analysis
The salary analysis examines team spending patterns over time, including identifying high-spending teams and tracking cumulative expenditures.
Key Queries
Top 20% Spending Teams
Uses NTILE window function to segment teams into quintiles based on average annual spending:
WITH ts AS (
SELECT yearid, teamid, SUM(salary) AS total_spend
FROM salaries
GROUP BY yearid, teamid
),
pct AS (
SELECT teamid,
AVG(total_spend) AS avg_spend,
NTILE(5) OVER(ORDER BY AVG(total_spend) DESC) AS spend_pct
FROM ts
GROUP BY teamid
)
SELECT teamid,
ROUND(avg_spend / 1000000, 1) AS avg_spend_millions
FROM pct
WHERE spend_pct = 1;
Cumulative Spending Over Time
Calculates running totals of team spending using window functions with PARTITION BY:
WITH ts AS (
SELECT yearid, teamid, SUM(salary) AS total_spend
FROM salaries
GROUP BY yearid, teamid
)
SELECT yearid, teamid,
ROUND(SUM(total_spend / 1000000.0)
OVER(ORDER BY teamid, yearid), 1) AS cumulative_sum_millions
FROM ts;
First Year Exceeding $1 Billion
Identifies when each team's cumulative spending first surpassed $1 billion using multiple CTEs and ROW_NUMBER:
WITH ts AS (
SELECT yearid, teamid, SUM(salary) AS total_spend
FROM salaries
GROUP BY yearid, teamid
),
csb AS (
SELECT yearid, teamid,
ROUND(SUM(total_spend / 1000000000.0)
OVER(PARTITION BY teamid ORDER BY yearid), 2) AS cumulative_sum_billions
FROM ts
),
srnk AS (
SELECT teamid, yearid, cumulative_sum_billions,
ROW_NUMBER() OVER(PARTITION BY teamid
ORDER BY cumulative_sum_billions) AS rnk
FROM csb
WHERE cumulative_sum_billions >= 1
)
SELECT teamId, yearid, cumulative_sum_billions
FROM srnk
WHERE rnk = 1
ORDER BY teamid, yearid DESC;
Part III: Player Career Analysis
The career analysis examines player longevity, debut and retirement ages, and career trajectories.
Key Queries
Career Length Calculation
Calculates player ages at debut and retirement using PostgreSQL date functions:
WITH fullbd AS (
SELECT playerid,
MAKE_DATE(birthyear, birthmonth, birthday) AS bday,
namegiven, debut, finalgame
FROM players
)
SELECT namegiven,
EXTRACT(YEAR FROM AGE(debut, bday))::INT AS starting_age,
EXTRACT(YEAR FROM AGE(finalgame, bday))::INT AS ending_age,
EXTRACT(YEAR FROM AGE(finalgame, debut))::INT AS career_length
FROM fullbd
WHERE debut IS NOT NULL
ORDER BY career_length DESC;
Same-Team Career Players
Identifies players who started and ended with the same team after 10+ year careers using self-joins on the salaries table:
SELECT
p.nameGiven,
s.yearID AS starting_year,
s.teamID AS starting_team,
e.yearID AS ending_year,
e.teamID AS ending_team
FROM players p
INNER JOIN salaries s
ON p.playerID = s.playerID
AND EXTRACT(YEAR FROM p.debut) = s.yearID
INNER JOIN salaries e
ON p.playerID = e.playerID
AND EXTRACT(YEAR FROM p.finalGame) = e.yearID
WHERE s.teamID = e.teamID
AND e.yearID - s.yearID > 10;
Part IV: Player Comparison Analysis
The comparison analysis examines player attributes including birthdays, batting preferences, and physical characteristics over time.
Key Queries
Shared Birthdays
Identifies players born on the same date using STRING_AGG for aggregation:
WITH bd AS (
SELECT CAST(CONCAT(birthyear, '-', birthmonth, '-', birthday) AS DATE) AS birthdate,
namegiven
FROM players
WHERE birthyear IS NOT NULL
AND birthmonth IS NOT NULL
AND birthday IS NOT NULL
)
SELECT birthdate,
STRING_AGG(namegiven, ',') AS players
FROM bd
WHERE EXTRACT(YEAR FROM birthdate) BETWEEN 1980 AND 1990
GROUP BY birthdate
ORDER BY birthdate;
Batting Preference Distribution
Calculates percentage of right, left, and switch batters per team using CASE statements:
SELECT
s.teamid,
ROUND((SUM(CASE WHEN p.bats = 'R' THEN 1 ELSE 0 END) * 100.0) / COUNT(*), 1) AS pct_r,
ROUND((SUM(CASE WHEN p.bats = 'L' THEN 1 ELSE 0 END) * 100.0) / COUNT(*), 1) AS pct_l,
ROUND((SUM(CASE WHEN p.bats = 'B' THEN 1 ELSE 0 END) * 100.0) / COUNT(*), 1) AS pct_b
FROM players p
JOIN salaries s ON p.playerid = s.playerid
GROUP BY s.teamid
ORDER BY s.teamid;
Physical Characteristics Over Time
Tracks decade-over-decade changes in average player height and weight using LAG window function:
WITH awh AS (
SELECT (EXTRACT(YEAR FROM debut)::int/10) * 10 AS decade,
AVG(weight) AS avg_weight,
AVG(height) AS avg_height
FROM players
WHERE debut IS NOT NULL
AND weight IS NOT NULL
AND height IS NOT NULL
GROUP BY (EXTRACT(YEAR FROM debut)::int/10) * 10
)
SELECT decade,
ROUND(avg_height - LAG(avg_height) OVER(ORDER BY decade), 4) AS avg_height_diff,
ROUND(avg_weight - LAG(avg_weight) OVER(ORDER BY decade), 4) AS avg_weight_diff
FROM awh;
Technical Implementation
The project demonstrates several advanced PostgreSQL features:
- Window Functions
- Common Table Expressions (CTEs)
- Date Manipulation
- String Aggregation
- Conditional Aggregation
- Self-Joins
Uses ROW_NUMBER, NTILE, and LAG for ranking and comparative analysis across partitioned data sets.
Employs WITH clauses to organize complex queries into logical, readable steps, particularly for multi-stage aggregations.
Leverages PostgreSQL's MAKE_DATE, EXTRACT, and AGE functions for temporal calculations and decade grouping.
Uses STRING_AGG to concatenate multiple values into single result strings for grouped data.
Implements CASE statements within SUM and COUNT functions for category-specific calculations.
Performs multiple joins on the same table with different temporal conditions to compare player career start and end points.
Database Schema
The analysis works with the following tables:
- schools: Player-school associations with years
- school_details: School names and information
- salaries: Player salaries by year and team
- players: Player biographical and career information
Key Techniques
The queries demonstrate:
- Integer division for decade calculation: (yearid/10) * 10
- Multi-level CTEs for step-by-step query construction
- Partitioned window functions for within-group rankings
- Type casting for date construction and calculations
- Running totals with cumulative window aggregations
- Filtering ranked results to extract top N per category
Development Environment
- PostgreSQL
- SQL