AWS Weather API Project

by L. Mark Coty


Why this project?

I wanted to use the knowledge I gained from studying for the AWS DEA-C01 exam to build a pipeline using data pulled from an API. This gave me the opportunity to gain hands-on experience with several aspects of data handling in AWS that I had not worked with before.

GitHub repository here. Run the notebook.


Process:

  • I subscribed to the OpenWeather API.

  • I created a Lambda function to pull the 5-day forecast for my zip code, which is updated every 3 hours.

  • I created an EventBridge rule to trigger the Lambda function every 3 hours.

  • After 24 hours, I had accumulated a set of JSON files in my S3 data bucket.

  • I ran a crawler to catalog the data.

  • I used a Glue job to remove some less interesting or unneeded fields (such as city and country).

  • I used Athena to query the final table and to make a few additional tweaks to the data, such as renaming some fields.

  • I connected QuickSight to Athena and generated several graphs that show the weather forecasts for my area during the relevant period.
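
The ingestion steps above can be sketched roughly as follows. This is a minimal illustration, not the function from the repository: the environment variable names (ZIP_CODE, OPENWEATHER_API_KEY, DATA_BUCKET), the S3 key prefix, and the helper names are all assumptions made for the example. The endpoint is OpenWeather's 5-day / 3-hour forecast API.

```python
import json
import os
import urllib.parse
import urllib.request
from datetime import datetime, timezone

# OpenWeather 5-day / 3-hour forecast endpoint.
FORECAST_URL = "https://api.openweathermap.org/data/2.5/forecast"


def build_forecast_url(zip_code, api_key, country="us", units="imperial"):
    """Build the forecast request URL for one zip code (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"zip": f"{zip_code},{country}", "appid": api_key, "units": units}
    )
    return f"{FORECAST_URL}?{query}"


def build_s3_key(now, prefix="raw/forecast"):
    """Name each object by UTC timestamp so successive runs never overwrite
    each other. The prefix is an assumption, not the author's layout."""
    return f"{prefix}/{now.strftime('%Y-%m-%dT%H-%M-%SZ')}.json"


def lambda_handler(event, context):
    # boto3 is available in the Lambda Python runtime; it is imported here so
    # the helpers above can be exercised locally without it.
    import boto3

    # Assumed configuration: all three values come from environment variables.
    url = build_forecast_url(os.environ["ZIP_CODE"], os.environ["OPENWEATHER_API_KEY"])
    with urllib.request.urlopen(url, timeout=10) as response:
        body = response.read()

    # Store the raw JSON response unchanged; cleaning happens downstream.
    key = build_s3_key(datetime.now(timezone.utc))
    boto3.client("s3").put_object(
        Bucket=os.environ["DATA_BUCKET"],
        Key=key,
        Body=body,
        ContentType="application/json",
    )
    return {"statusCode": 200, "body": json.dumps({"s3_key": key})}
```

An EventBridge rule with the schedule expression rate(3 hours) would invoke a handler like this on the cadence described above. Writing one timestamped object per run keeps every forecast snapshot, which is what lets the crawler see a set of files after 24 hours.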


Below is a diagram of the pipeline:


The Content and Nature of the Data:

The data arrived in the S3 bucket as heavily nested JSON files, which is apparently typical for weather forecasts. After crawling the files, I used a Glue job to create a well-formatted, usable CSV file. One issue was the Rain column: where no rain was predicted, the value was NaN instead of 0, so I changed those values to 0. This led to the following characteristics of the dataset:


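As a rough illustration of the flattening and rain clean-up described above, here is a plain-Python sketch. It is not the Glue job itself. It assumes entries shaped like the items in the "list" array of the OpenWeather forecast response, and it assumes the rain field is simply absent when no rain is forecast, which would then surface as NaN after flattening. The output column names and the sample values are made up for the example.

```python
import csv
import io

# Two illustrative entries shaped like items of the forecast response's "list"
# array. All values are invented for the example, not taken from the dataset.
SAMPLE_ENTRIES = [
    {
        "dt_txt": "2000-01-01 00:00:00",
        "main": {"temp": 80.0, "humidity": 70},
        "weather": [{"description": "light rain"}],
        "clouds": {"all": 90},
        "wind": {"speed": 5.0},
        "pop": 0.6,
        "rain": {"3h": 1.2},
    },
    {
        "dt_txt": "2000-01-01 03:00:00",
        "main": {"temp": 78.0, "humidity": 75},
        "weather": [{"description": "overcast clouds"}],
        "clouds": {"all": 100},
        "wind": {"speed": 4.0},
        "pop": 0.0,
        # No "rain" key: assumed to mean that no rain is forecast.
    },
]


def flatten_entry(entry):
    """Pull the nested fields of one forecast entry into a flat row."""
    return {
        "forecast_time": entry["dt_txt"],
        "temp": entry["main"]["temp"],
        "humidity": entry["main"]["humidity"],
        "clouds": entry["clouds"]["all"],
        "wind_speed": entry["wind"]["speed"],
        "precipitation_probability": entry.get("pop", 0),
        # Default a missing rain amount to 0 instead of leaving a gap (NaN).
        "rain": entry.get("rain", {}).get("3h", 0),
        "sky": entry["weather"][0]["description"],
    }


def to_csv(entries):
    """Flatten every entry and render the rows as CSV text."""
    rows = [flatten_entry(e) for e in entries]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(SAMPLE_ENTRIES))
```

Filling the default at flattening time means every downstream step, including the numeric summaries and plots, sees a complete rain column.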
Next I looked at the boxplots for the numerical columns:

We can see that temps were predicted to stay within a narrow range and that wind and precipitation/rain were not forecasted to be much of a factor. Clouds and humidity seem to be the dominant factors in the forecast for the time period.


Here is a correlation heatmap of the numerical columns:

Observations: The negative correlation between humidity/clouds and temperature makes sense, since humid days tend to be cloudy and cloudy days tend to be cooler. The strong correlation between rain and precipitation_probablity needs no explanation.
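
For readers who want to see what sits behind each heatmap cell, here is a small sketch of a pairwise correlation matrix. It assumes the heatmap shows Pearson correlation coefficients, which is the usual default but is not confirmed by the write-up. The function names and the toy columns are hypothetical, and the numbers are not from the actual dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each pair of column names to its Pearson coefficient."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


if __name__ == "__main__":
    # Toy values chosen so that humidity falls as temperature rises.
    toy = {
        "temp": [70, 75, 80, 85],
        "humidity": [90, 80, 70, 60],
    }
    matrix = correlation_matrix(toy)
    print(matrix["temp"]["humidity"])
```

A coefficient near -1 corresponds to the kind of negative relationship noted above, while a value near +1 corresponds to the rain and precipitation-probability pairing.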


Here is a graph of forecasted temperature over time -- very typical for late August in Atlanta:


And of course, in Atlanta, we must see humidity over time:

The general downward trend probably indicates the forecasted arrival of a "cold" (i.e., not blisteringly hot) front.


Finally, looking at the sky descriptions, we can see clearly that the sun was predicted to be only occasionally present during this period.


Conclusion: The 5-day outlook didn't change much during the period covered, and clouds and rain are definitely dominant.