Navigating Flight Delays: A SQL and Tableau Analysis of DFW Departures

Introduction
Embark on a data-driven journey exploring flight delays departing from Dallas/Fort Worth International Airport (DFW)!
In 2022, DFW secured its position as the second busiest airport in the United States, trailing behind Atlanta Hartsfield-Jackson. With a remarkable 73.3 million passengers passing through its terminals, DFW plays a pivotal role in shaping the dynamics of air travel nationwide. Through the lens of SQL analysis, this project aims to unravel the intricacies of domestic flight delays from DFW, shedding light on patterns, insights, and potential improvements within this bustling aviation hub.
I used CSVFiddle to run queries and practice my SQL commands. Let's dig into what I discovered!
The Data
This data is available on Kaggle in a CSV format. There are 9 columns and over 500,000 rows in my cleaned dataset. The dataset was last updated in January of 2022. Every row represents a flight path taken and if this flight experienced delays. Additional relevant data includes:
- Latitude of each Airport
- Longitude of each Airport.
As mentioned before, this project presented an opportunity for me to use SQL to analyze flight delays compiled in the is dataset. Being that DFW is my home airport, the main questions I had when analyzing this data were:
1) How many flights in the dataset departed from DFW?
2) What is the percentage someone will experience a delay leaving DFW?
3) Which airline had the highest chance of experiencing a delay when departing DFW?
4) Which day of the week is experiences the least amount of delays?
5) What airport in the dataset have the highest chance of experiencing a delay?
Analysis
Our first question asks us to isolate data from DFW, specifically the flights departing from the airport. Here is the query I used to find this:
This shows us the number of flights and airline that was operating the flight. Here is how the data was returned:
We can see that there are 22, 154 flights departing, with airlines included. I want to expand this summary and answer our second question , What is the percentage someone will experience a delay leaving DFW. Here is the query I used to find that:
As we can see, the chance of someone experiencing a delay leaving DFW was almost 40%. That's over 5500 flights in this dataset that had 1 or more delay! To narrow this data specifically to airlines, I expanded with the following:
This gives us a broader scope of the dataset. Using this data, we can answer our 3rd question: Which airline had the highest chance of experiencing a delay when departing DFW? As the data shows, AA (American Airlines) experiencing the most delays in our dataset, with 12,924 flights in the dataset.
Given that DFW is the largest headquarter, or "hub", for American Airlines, this data is largely biased towards the number of flight operated by AA our of DFW
Our next question left to answer is: Which day of the week is experiences the least amount of delays? To find this, I presented the following query:
As we can see from the returned info, the day of the week is returned as numbers. Let's change that label using the AS command.
Now that we've cleaned this return, we can see that Saturdays receive the least amount of delays in this dataset.
Our final question I wanted to answer was: What airport in the dataset have the highest chance of experiencing a delay? For this query, we have to give SQL a case study. Here was the query I used for this:
This will show us the destination of flights leaving DFW, grouped by Airport destination. I had to use a number of local values in the commands to receive the information we were looking for. Here is how the first few rows returned:
The results of this query yielded results for 137 rows of different airport in the US. Reading this information takes a small amount of knowledge of airport codes. In row 1, we can see the data returned says "IAH". This is the airport code for George Bush International Airport in Houston, TX.
To better visualize this data, I imported the results of my Query in Excel and included the Latitude and Longitude of each airport code. The first rows of this data are as follows:
After importing this data, I was able to create a Tableau visual to utilize the Lat/Long information to detail each airports location, flight number, and color coded on percentage chance that the passenger would experience a delay of some kind.
Conclusion
- In this dataset, we found that more than 22,00 flights departed DFW for other destinations
- The average chance of experience a delay when departing DFW was 40%
- American Airlines experienced the most delays, having more than 5500 flights with recorded delays.
- Saturday experienced the least amount of delays during the week.
- Luis Muñoz Marin International Airport in Puerto Rico, SJU, experiences the highest percentage chance of having a delay: 66%
Thank you!
This project really challenged me and pushed me out of my comfort zone with my SQL knowledge! Please feel free to leave some feedback for me on this project! I'd love to hear from all of you.
Let's connect!
- Zac
Post a comment