Unit 2.3
Unit 2.3 Hacks
Here are my hacks for lesson 2.3.
Notes
- Pandas is a popular Python library used for data manipulation and analysis.
- Pandas provides a data structure called DataFrame, which is used to store tabular data.
- Data analysis using Pandas involves exploring, cleaning, and processing the data to extract meaningful insights.
- The ability to process data depends on both the user's capabilities and the tools they use, which underscores the importance of learning how to use tools like Pandas for data analysis.
- One common task in data analysis is combining multiple data sets to extract insights from them.
- Another common challenge in data analysis is dealing with incomplete or "dirty" data, which requires cleaning and processing before analysis can be done.
- Pandas provides many functions and methods for working with DataFrames, making it a powerful tool for data analysis in Python.
- When cleaning data, look for: (1) missing data points, (2) invalid data, and (3) inaccurate data.
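To make the notes above concrete, here is a minimal sketch of checking a DataFrame for missing and invalid values. The names and GPA values are made up for illustration and are not from the lesson. Inaccurate data (a value that looks valid but is wrong) usually cannot be caught by a check like this and needs a trusted source to compare against.

```python
import pandas as pd

# Made-up records for illustration only; not from the lesson's data set
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cam", "Dee"],
    "gpa": [3.5, None, 4.7, 2.9],  # None is missing; 4.7 is invalid on a 0.0-4.0 scale
})

# 1. Missing data points: count the empty cells in each column
missing_counts = students.isna().sum()

# 2. Invalid data: GPA values outside the allowed 0.0-4.0 range
invalid_gpa = students[(students["gpa"] < 0.0) | (students["gpa"] > 4.0)]

# Keep only the rows whose GPA is present and within range
clean = students[students["gpa"].between(0.0, 4.0)]

print(missing_counts)
print(invalid_gpa)
print(clean)
```

Here `between` returns False for the missing GPA, so the last filter drops both the missing row and the out-of-range row.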
Answers to questions
- What part of the data set needs to be cleaned?
There are a few areas that need cleaning. The "Year in School" column has inconsistent values: some entries are numbers, while others are text such as "Junior" and "9th Grade". To make this column consistent, the text values need to be converted to numbers. The "Student ID" column has an entry with the value "nil", which is likely a mistake or missing data. This entry needs to be corrected or removed.
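One possible way to do this cleaning is sketched below. The rows, the ID numbers, and the label-to-grade mapping are my own assumptions for illustration; only "Junior", "9th Grade", and "nil" come from the data set described above.

```python
import pandas as pd

# Hypothetical rows that mimic the problems described above
df = pd.DataFrame({
    "Student ID": ["123", "246", "nil", "578"],
    "Year in School": ["12", "Junior", "9th Grade", "10"],
})

# Assumed mapping from text labels to grade numbers
year_labels = {"freshman": 9, "sophomore": 10, "junior": 11, "senior": 12}

def to_grade_number(value):
    """Convert entries like '12', 'Junior', or '9th Grade' to an integer grade."""
    text = str(value).strip().lower()
    if text in year_labels:
        return year_labels[text]
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else None

df["Year in School"] = df["Year in School"].apply(to_grade_number)

# Turn non-numeric IDs such as "nil" into missing values, then drop those rows
df["Student ID"] = pd.to_numeric(df["Student ID"], errors="coerce")
df = df.dropna(subset=["Student ID"]).copy()
df["Student ID"] = df["Student ID"].astype(int)

print(df)
```

Dropping the "nil" row is only one option. If the real ID can be looked up somewhere else, fixing the entry would keep that student's data.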
- From PBL learning, what is a good time to clean data? Hint, remember Garbage in, Garbage out?
A good time to clean data is before analyzing or using it in any meaningful way. Cleaning early matters because "garbage in, garbage out" means that any errors or inaccuracies in the data will propagate through the analysis and can lead to incorrect conclusions or decisions.
2.3 College Board Quiz
I scored 6/6 on this quiz. Here are some of my takeaways:
- Data can be incomplete or inconsistent, and it is important to be aware of potential errors or discrepancies when analyzing data.
- Data from different sources may need to be merged or combined in order to provide a more complete picture, but this can be challenging if there is no unique identifier to match records.
- Data may be organized differently in different contexts, which can make it difficult to process or compare data.
- Additional data may be needed to answer specific questions or analyze trends in data.
Questions
(1) A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.
The researcher has access to a database with the following information about each student.
Last name
First name
Grade level (9, 10, 11, or 12)
Grade point average (on a 0.0 to 4.0 scale)
The researcher also has access to another database with the following information about each student.
First name
Last name
Number of absences from school
Number of late arrivals to school
Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?
Answer choices:
(A) Students who have the same name may be confused with each other.
(B) Students who have the same grade point average may be confused with each other.
(C) Students who have the same grade level may be confused with each other.
(D) Students who have the same number of absences may be confused with each other.
Correct answer: A
Why? The problem caused by the lack of unique ID numbers is that students who have the same name may be confused with each other. This can lead to inaccurate data analysis as the researcher may mistakenly combine data from two different students who share the same name.
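A small sketch of how this problem shows up when the two databases are combined in Pandas. All of the student records here are invented, and the column names are my own assumptions.

```python
import pandas as pd

# Invented records: two different students share the name "Alex Kim"
grades = pd.DataFrame({
    "last_name": ["Kim", "Kim", "Lopez"],
    "first_name": ["Alex", "Alex", "Maria"],
    "grade_level": [9, 11, 10],
    "gpa": [3.8, 2.1, 3.2],
})
attendance = pd.DataFrame({
    "first_name": ["Alex", "Alex", "Maria"],
    "last_name": ["Kim", "Kim", "Lopez"],
    "absences": [1, 14, 3],
    "late_arrivals": [0, 6, 2],
})

# Names are the only shared columns, so they are the only possible join key
combined = grades.merge(attendance, on=["first_name", "last_name"])

print(len(combined))
print(combined)
```

Three students go in, but five rows come out, because each "Alex Kim" grade record is paired with both "Alex Kim" attendance records. With a unique student ID in both tables, the merge could use that column as the key and avoid the mix-up.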
(2) A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?
Answer choices:
(A) A computer program cannot combine data from different files.
(B) Different counties may organize data in different ways.
(C) The number of counties is too large for the program to process.
(D) The total number of rows of data is too large for the program to process.
Correct answer: B
Why? The most likely challenge in creating the program is that different counties may organize data in different ways. This can make it difficult to combine the data and process it accurately.
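As a rough illustration, the sketch below combines two county tables that store the same facts under different column names. The county names, column names, and pollution values are all invented.

```python
import pandas as pd

# Two invented county files that record the same facts in different layouts
county_a = pd.DataFrame({"County": ["Adams"], "PM2.5 (ug/m3)": [8.1]})
county_b = pd.DataFrame({"county_name": ["Baker"], "pm25": [11.4]})

# Rename each layout to one shared set of column names before combining
county_a = county_a.rename(columns={"County": "county", "PM2.5 (ug/m3)": "pm25"})
county_b = county_b.rename(columns={"county_name": "county"})

all_counties = pd.concat([county_a, county_b], ignore_index=True)
print(all_counties)
```

Without the renaming step, `pd.concat` would keep four separate columns and fill the gaps with missing values. With roughly 3,000 counties, writing a mapping for every layout is where most of the work would go.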
(3) A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?
Select two answers.
(A) Users might attempt to use the Web site to search for multiple cities.
(B) Users might enter abbreviations for the names of cities.
(C) Users might misspell the name of the city.
(D) Users might be slow at typing a city name in the text field.
Correct answers: B and C
Why? The likely challenges associated with processing city names that users might provide as input are that users might enter abbreviations for the names of cities, and users might misspell the name of the city. This can make it difficult to accurately retrieve information about the intended city.
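Here is one way a site could try to handle both problems, using Python's standard `difflib` module for fuzzy matching. The city list, the abbreviation table, and the `resolve_city` function are hypothetical and only meant as a sketch.

```python
import difflib

# Hypothetical lookup tables; a real site would need a much larger list
KNOWN_CITIES = ["San Diego", "Los Angeles", "New York", "San Francisco"]
ABBREVIATIONS = {"sd": "San Diego", "la": "Los Angeles", "nyc": "New York", "sf": "San Francisco"}

def resolve_city(user_input):
    """Return the best-matching known city, or None if nothing is close."""
    text = user_input.strip().lower()
    # Abbreviations are looked up directly
    if text in ABBREVIATIONS:
        return ABBREVIATIONS[text]
    # Misspellings are matched against the known names by similarity
    lowered = {city.lower(): city for city in KNOWN_CITIES}
    matches = difflib.get_close_matches(text, list(lowered), n=1, cutoff=0.8)
    return lowered[matches[0]] if matches else None

print(resolve_city("NYC"))
print(resolve_city("San Deigo"))
print(resolve_city("Atlantis"))
```

The `cutoff` of 0.8 is a guess. A lower value catches more typos but also risks matching the wrong city.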
(4) A database of information about shows at a concert venue contains the following information.
Name of artist performing at the show
Date of show
Total dollar amount of all tickets sold
Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?
Answer choices:
(A) Average ticket price
(B) Length of the show in minutes
(C) Start time of the show
(D) Total dollar amount of food and drinks sold during the show
Correct answer: A
Why? The additional piece of information that would be most useful in determining the artist with the greatest attendance during a particular month is the average ticket price. This information can help to determine the number of tickets sold for each artist, which can be used to compare attendance across artists.
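The reasoning is that attendance is roughly the total ticket sales divided by the average ticket price. A quick sketch with invented numbers:

```python
# Invented figures for three shows in one month
shows = [
    {"artist": "Artist A", "total_sales": 50_000.0, "avg_ticket_price": 100.0},
    {"artist": "Artist B", "total_sales": 36_000.0, "avg_ticket_price": 40.0},
    {"artist": "Artist C", "total_sales": 45_000.0, "avg_ticket_price": 75.0},
]

# Estimated attendance = total ticket sales / average ticket price
for show in shows:
    show["attendance"] = show["total_sales"] / show["avg_ticket_price"]

top_show = max(shows, key=lambda show: show["attendance"])
print(top_show["artist"], top_show["attendance"])
```

In this made-up example, Artist A has the highest ticket sales but Artist B has the greatest attendance, which is why sales alone cannot answer the question.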
(5) A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?
Answer choices:
(A) The average number of hours per day that the car is in use
(B) The car’s average speed on a particular day
(C) The distance the car traveled on a particular day
(D) The number of bicycles the car passed on a particular day
Correct answer: D
Why? The number of bicycles the car passed on a particular day can best be determined using only the data and none of the metadata. This can be done by analyzing the images captured by the camera and counting the number of bicycles in each image.
(6) A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.
On average, how long does homework take you each night (in minutes)?
On average, how long do you study for each test (in minutes)?
Do you enjoy the subject material of this class (yes or no)?
Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?
I. Do students who enjoy the subject material tend to spend more time on homework each night than the other students do?
II. Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do?
III. Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?
Answer choices:
(A) I only
(B) III only
(C) I and II
(D) I and III
Correct answer: C
Why? The teacher can answer the questions "Do students who enjoy the subject material tend to spend more time on homework each night than the other students do?" and "Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do?" by analyzing the survey results. However, the teacher cannot answer the question "Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?" as this information was not collected in the survey.
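A sketch of how the teacher could check questions I and II with Pandas. The responses and column names are invented for illustration.

```python
import pandas as pd

# Invented survey responses to the three survey questions above
survey = pd.DataFrame({
    "homework_min": [30, 45, 60, 20, 50],
    "study_min": [90, 60, 40, 120, 45],
    "enjoys_subject": ["yes", "yes", "yes", "no", "no"],
})

# Question I: average homework time for students who do and do not enjoy the subject
homework_by_enjoyment = survey.groupby("enjoys_subject")["homework_min"].mean()

# Question II: a negative correlation means more homework goes with less studying
homework_study_corr = survey["homework_min"].corr(survey["study_min"])

# Question III needs grades, and the survey has no grade column to analyze
has_grades = "grade" in survey.columns

print(homework_by_enjoyment)
print(homework_study_corr)
print(has_grades)
```

Because the survey is anonymous, the teacher also cannot look up each student's grade afterwards, so question III stays unanswerable.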
Implementation of a data set into my PBL project
The code below loads a CSV file from the US government. The file holds information about cars, including make, model, fuel type, and more. I can use it in my project because it lets me show a variety of information about cars. This data set is useful because my PBL project is entirely about cars.
import pandas as pd

# Load the vehicles data set from a local CSV file
fuel_economy_df = pd.read_csv("/home/etran/vscode/fastpages_EthanT/_notebooks/files/vehicles.csv")

# Show the first 100 rows of cars and their data
fuel_economy_df[['make', 'model', 'fuelType', 'year', 'cylinders', 'VClass', 'drive']].head(100)

# Get the total number of rows and columns in the dataset
num_rows, num_cols = fuel_economy_df.shape
print(f"There are {num_rows} rows and {num_cols} columns in the dataset.")

# Get the top 10 makes by number of rows (vehicle entries) in the dataset
top_makes = fuel_economy_df['make'].value_counts().nlargest(10)
print(top_makes)

# Get the number of vehicles by fuel type and cylinder count
fuel_cyl_counts = fuel_economy_df.groupby(['fuelType', 'cylinders']).size()
print(fuel_cyl_counts)