TDM 10200: Project 9 — Spring 2024
Motivation: Working with pandas can be fun! Learning how to manipulate and clean up data in pandas is a helpful tool to have in your tool belt! Dive into pandas to transform and analyze data, equipping you with key data science skills.
Context: Hopefully, most students are feeling pretty comfortable now, with building functions and using pandas. In this project, we will continue working with pandas. We want to get better at analyzing big datasets and solving problems with data.
Scope: python, pandas
Reading and Resources
Dataset
/anvil/projects/tdm/data/whin/weather.csv
We added five new videos to help you with Project 9. |
You need to use 2 cores for your Jupyter Lab session for Project 9 this week. |
You can use This data is similar to last week’s data, from Project 8, but we are using the csv file instead of the parquet file this week. Please take a look at the
We will assume that you read the data into a data frame called |
Questions
Question 1 (2 points)
-
Use the method
value_counts()
to get the number of records for each station. -
Use the method
groupby()
to get the number of records for each station. -
Explain the difference between the methodologies of using these two methods. We love getting your feedback. Which method do you prefer?
Question 2 (2 points)
-
Find out how many null records exist in
myDF
, within each individual column. (Your answer should specify, for each column, how many null records are in that column.) -
Now count the total number of null values in the entire data frame
myDF
. (In other words, add up the values from all of the counts in part 2a.) -
Drop rows with any null values in
myDF
. Save the resulting cleaned data set into a new DataFrame calledmyDF_cleaned
. -
Just to make sure that you did this properly, check
myDF_cleaned
carefully: Are there any null values remaining inmyDF_cleaned
? (There should not be.) How many rows and columns are inmyDF_cleaned
?
There are a variety of ways to approach this question.
|
Question 3 (2 points)
-
Go back to the original data frame
myDF
. Create a new data frame, by removing all rows frommyDF
in which the columntemperature
is a null value. -
Combine the columns
latitude
andlongitude
into a new column calledlocation
with '_' in between. -
Grouping the data by the
location
column, find the average temperature for each location, and print your results: Each line that you print should have 1 location and 1 average temperature for that location.
|
|
Question 4 (2 points)
-
Wrap your work from Question 3 into a function. This function should take a data frame as a parameter, and should drop records with null value in the data frame’s
temperature
column. The function should also create a newlocation
column, and should calculate the average temperature (grouped by location). The function should return the Series of average temperatures (grouped by location).
Question 5 (2 points)
-
Now considering the column called
wind_gust_speed_mph
(instead of temperature), do the work from questions 3 and 4 again. -
Which location has the largest average value of the
wind_gust_speed_mph
?
Project 09 Assignment Checklist
-
Jupyter Lab notebook with your code, comments and output for the assignment
-
firstname-lastname-project09.ipynb
.
-
-
Python file with code and comments for the assignment
-
firstname-lastname-project09.py
-
-
Submit files through Gradescope
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |