
Unit 2.3 Hacks

Here are my hacks for lesson 2.3

Notes

  • Pandas is a popular Python library used for data manipulation and analysis.
  • Pandas provides a data structure called DataFrame, which is used to store tabular data.
  • Data analysis using Pandas involves exploring, cleaning, and processing the data to extract meaningful insights.
  • The ability to process data depends on both the user's capabilities and the tools they use, which underscores the importance of learning how to use tools like Pandas for data analysis.
  • One common task in data analysis is combining multiple data sets to extract insights from them.
  • Another common challenge in data analysis is dealing with incomplete or "dirty" data, which requires cleaning and processing before analysis can be done.
  • Pandas provides many functions and methods for working with DataFrames, making it a powerful tool for data analysis in Python.
  • When cleaning data, look for (1) missing data points, (2) invalid data, and (3) inaccurate data.
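The cleaning checks in the last bullet can be sketched with Pandas roughly as follows. The small table, its column names, and the 0.0 to 4.0 GPA range are all made up for illustration; they are not from the lesson's data set.

```python
import pandas as pd

# Hypothetical sample data, invented for this sketch
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee"],
    "gpa": [3.5, None, 4.7, 2.9],
})

# Missing data points: count the empty cells in each column
missing_counts = students.isna().sum()

# Invalid data: GPA values outside the assumed 0.0 to 4.0 scale
invalid_gpa = students[(students["gpa"] < 0.0) | (students["gpa"] > 4.0)]

print(missing_counts)
print(invalid_gpa)
```

Inaccurate data (a value that looks plausible but is wrong) usually cannot be caught by a check like this and has to be compared against a trusted source.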

Answers to questions

  • What part of the data set needs to be cleaned?

There are a few areas that need cleaning. The "Year in School" column has inconsistent values: some entries are numerical, while others are text such as "Junior" and "9th Grade". To make this column consistent, the text values need to be converted to numerical values. The "Student ID" column has an entry with the value "nil", which is likely a mistake or missing data, so that entry needs to be corrected or removed.
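Here is a minimal sketch of how those two fixes might look in Pandas. The rows, the ID values, and the label-to-grade mapping are assumptions made up for this example, not the actual lesson data.

```python
import pandas as pd

# Hypothetical rows that mimic the problems described above
df = pd.DataFrame({
    "Student ID": ["123", "246", "nil", "578"],
    "Year in School": ["10", "Junior", "9th Grade", "12"],
})

# Assumed mapping from text labels to grade numbers
year_map = {"Freshman": 9, "Sophomore": 10, "Junior": 11, "Senior": 12, "9th Grade": 9}

def to_grade(value):
    """Return the grade as an int, using the mapping for text labels."""
    text = str(value).strip()
    if text in year_map:
        return year_map[text]
    return int(text)

df["Year in School"] = df["Year in School"].apply(to_grade)

# Treat "nil" as missing, then drop rows without a usable ID
df["Student ID"] = pd.to_numeric(df["Student ID"], errors="coerce")
cleaned = df.dropna(subset=["Student ID"])

print(cleaned)
```

Dropping the row is only one option; if the real ID can be looked up somewhere else, filling it in keeps more data.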

  • From PBL learning, what is a good time to clean data? Hint, remember Garbage in, Garbage out?

A good time to clean data is before analyzing or using it in any meaningful way. Cleaning early matters because of "garbage in, garbage out": any errors or inaccuracies in the data propagate through the analysis and can lead to incorrect conclusions or decisions.

2.3 College Board Quiz

I scored 6/6 on this quiz. Here are some of my takeaways:

  • Data can be incomplete or inconsistent, and it is important to be aware of potential errors or discrepancies when analyzing data.

  • Data from different sources may need to be merged or combined in order to provide a more complete picture, but this can be challenging if there is no unique identifier to match records.

  • Data may be organized differently in different contexts, which can make it difficult to process or compare data.

  • Additional data may be needed to answer specific questions or analyze trends in data.

Questions

(1) A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.

The researcher has access to a database with the following information about each student.

Last name

First name

Grade level (9, 10, 11, or 12)

Grade point average (on a 0.0 to 4.0 scale)

The researcher also has access to another database with the following information about each student.

First name

Last name

Number of absences from school

Number of late arrivals to school

Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?

Answer choices:

(A) Students who have the same name may be confused with each other.

(B) Students who have the same grade point average may be confused with each other.

(C) Students who have the same grade level may be confused with each other.

(D) Students who have the same number of absences may be confused with each other.

Correct answer: A

Why? The problem caused by the lack of unique ID numbers is that students who have the same name may be confused with each other. This can lead to inaccurate data analysis, as the researcher may mistakenly combine data from two different students who share the same name.
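The same problem shows up when merging two tables in Pandas. This is a small sketch with invented records (the names, GPAs, and absence counts are hypothetical) where two different students are both called "Sam Lee".

```python
import pandas as pd

# Hypothetical records: two different students share the name "Sam Lee"
grades = pd.DataFrame({
    "first": ["Sam", "Sam", "Ava"],
    "last": ["Lee", "Lee", "Cruz"],
    "gpa": [3.9, 2.1, 3.4],
})
attendance = pd.DataFrame({
    "first": ["Sam", "Sam", "Ava"],
    "last": ["Lee", "Lee", "Cruz"],
    "absences": [1, 14, 3],
})

# Without a unique ID, the only available join key is the name
merged = grades.merge(attendance, on=["first", "last"])

print(merged)
```

Each "Sam Lee" grade row gets paired with both "Sam Lee" attendance rows, so the merged table has 5 rows for only 3 students, and there is no way to tell which pairings are real.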

(2) A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?

Answer choices:

(A) A computer program cannot combine data from different files.

(B) Different counties may organize data in different ways.

(C) The number of counties is too large for the program to process.

(D) The total number of rows of data is too large for the program to process.

Correct answer: B

Why? The most likely challenge in creating the program is that different counties may organize data in different ways. This can make it difficult to combine the data and process it accurately.
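As a rough illustration, two counties might report the same measurement under different column names. The county names, column names, and values below are all hypothetical; the sketch only shows one way to bring the layouts in line before combining them.

```python
import pandas as pd

# Hypothetical county files that use different column names
county_a = pd.DataFrame({"County": ["Alpha"], "PM2.5 (ug/m3)": [8.1]})
county_b = pd.DataFrame({"county_name": ["Beta"], "pm25": [10.4]})

# Rename each layout to one shared set of column names before combining
county_a = county_a.rename(columns={"County": "county", "PM2.5 (ug/m3)": "pm25"})
county_b = county_b.rename(columns={"county_name": "county"})

combined = pd.concat([county_a, county_b], ignore_index=True)

print(combined)
```

With roughly 3,000 counties, writing and checking these renaming rules would likely be the hard part, not the size of the data.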

(3) A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?

Select two answers.

(A) Users might attempt to use the Web site to search for multiple cities.

(B) Users might enter abbreviations for the names of cities.

(C) Users might misspell the name of the city.

(D) Users might be slow at typing a city name in the text field.

Correct answers: B and C

Why? The likely challenges associated with processing city names that users might provide as input are that users might enter abbreviations for the names of cities, and users might misspell the name of the city. This can make it difficult to accurately retrieve information about the intended city.
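One possible way to handle both problems is a lookup table for abbreviations plus fuzzy matching for misspellings. This sketch uses `difflib.get_close_matches` from the Python standard library; the city list, the abbreviation table, the `resolve_city` helper, and the 0.8 cutoff are all assumptions for illustration.

```python
import difflib

# Hypothetical list of supported cities and known abbreviations
KNOWN_CITIES = ["San Diego", "Los Angeles", "New York", "San Francisco"]
ABBREVIATIONS = {"la": "Los Angeles", "nyc": "New York", "sf": "San Francisco", "sd": "San Diego"}

def resolve_city(user_input):
    """Return the best matching known city, or None if nothing is close."""
    text = user_input.strip().lower()
    # Abbreviations are handled with a direct lookup
    if text in ABBREVIATIONS:
        return ABBREVIATIONS[text]
    # Misspellings are handled by picking the closest known name
    lowered = {city.lower(): city for city in KNOWN_CITIES}
    matches = difflib.get_close_matches(text, list(lowered), n=1, cutoff=0.8)
    return lowered[matches[0]] if matches else None

print(resolve_city("NYC"))
print(resolve_city("San Deigo"))
```

Fuzzy matching can still guess wrong when two city names are similar, so a real site might show the guess and ask the user to confirm it.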

(4) A database of information about shows at a concert venue contains the following information.

Name of artist performing at the show

Date of show

Total dollar amount of all tickets sold

Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?

Answer choices:

(A) Average ticket price

(B) Length of the show in minutes

(C) Start time of the show

(D) Total dollar amount of food and drinks sold during the show

Correct answer: A

Why? The additional piece of information that would be most useful in determining the artist with the greatest attendance during a particular month is the average ticket price. Dividing the total dollar amount of tickets sold by the average ticket price gives the number of tickets sold for each show, which can be used to compare attendance across artists.
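A quick worked example of that division, using made-up numbers (the artists, dollar totals, and prices are hypothetical):

```python
# Hypothetical shows: (artist, total ticket sales in dollars, average ticket price)
shows = [
    ("Artist A", 50000.0, 25.0),
    ("Artist B", 60000.0, 40.0),
]

# Estimated attendance = total ticket sales / average ticket price
attendance = {artist: total / avg_price for artist, total, avg_price in shows}
top_artist = max(attendance, key=attendance.get)

print(attendance)
print(top_artist)
```

Artist B brought in more money, but Artist A had the larger estimated crowd (2,000 versus 1,500), which is why the total dollar amount alone is not enough.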

(5) A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?

Answer choices:

(A) The average number of hours per day that the car is in use

(B) The car’s average speed on a particular day

(C) The distance the car traveled on a particular day

(D) The number of bicycles the car passed on a particular day

Correct answer: D

Why? The number of bicycles the car passed on a particular day can best be determined using only the data and none of the metadata. This can be done by analyzing the images captured by the camera and counting the number of bicycles in each image.

(6) A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.

On average, how long does homework take you each night (in minutes)?

On average, how long do you study for each test (in minutes)?

Do you enjoy the subject material of this class (yes or no)?

Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?

I. Do students who enjoy the subject material tend to spend more time on homework each night than the other students do?

II. Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do?

III. Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?

Answer choices:

(A) I only

(B) III only

(C) I and II

(D) I and III

Correct answer: C

Why? The teacher can answer questions I and II by analyzing the survey results, because both only compare values that the survey collected (homework time, study time, and enjoyment of the subject). The teacher cannot answer question III, because the survey did not collect the students' grades.
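To make this concrete, here is a small sketch of how questions I and II could be checked with Pandas. The responses and column names are invented for the example. There is no grade column, so question III has nothing to work with.

```python
import pandas as pd

# Hypothetical survey responses, invented for this sketch
survey = pd.DataFrame({
    "homework_min": [30, 45, 60, 20, 50],
    "study_min": [90, 60, 40, 120, 45],
    "enjoys_class": ["yes", "no", "yes", "no", "yes"],
})

# Question I: average homework time, grouped by enjoyment
homework_by_enjoyment = survey.groupby("enjoys_class")["homework_min"].mean()

# Question II: correlation between homework time and study time
homework_study_corr = survey["homework_min"].corr(survey["study_min"])

print(homework_by_enjoyment)
print(homework_study_corr)
```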

Implementation of a data set into PBL project

The code below allows me to use a CSV file from the US government. This file holds information about cars, including make, model, type of fuel, and more. It can be implemented into my project, as it allows me to show a variety of information about cars. This data set is useful because the premise of my PBL is entirely information about cars.

import pandas as pd

fuel_economy_df = pd.read_csv("/home/etran/vscode/fastpages_EthanT/_notebooks/files/vehicles.csv")

# Show the first 100 rows of cars and their data
fuel_economy_df[['make', 'model', 'fuelType', 'year', 'cylinders', 'VClass', 'drive']].head(100)

/tmp/ipykernel_1721/1459572364.py:3: DtypeWarning: Columns (70,71,72,73,74,76,79) have mixed types. Specify dtype option on import or set low_memory=False.
  fuel_economy_df = pd.read_csv("/home/etran/vscode/fastpages_EthanT/_notebooks/files/vehicles.csv")

     make         model                 fuelType  year  cylinders  VClass           drive
0    Alfa Romeo   Spider Veloce 2000    Regular   1985  4.0        Two Seaters      Rear-Wheel Drive
1    Ferrari      Testarossa            Regular   1985  12.0       Two Seaters      Rear-Wheel Drive
2    Dodge        Charger               Regular   1985  4.0        Subcompact Cars  Front-Wheel Drive
3    Dodge        B150/B250 Wagon 2WD   Regular   1985  8.0        Vans             Rear-Wheel Drive
4    Subaru       Legacy AWD Turbo      Premium   1993  4.0        Compact Cars     4-Wheel or All-Wheel Drive
...  ...          ...                   ...       ...   ...        ...              ...
95   Pontiac      Grand Prix            Regular   1993  6.0        Midsize Cars     Front-Wheel Drive
96   Pontiac      Grand Prix            Regular   1993  6.0        Midsize Cars     Front-Wheel Drive
97   Pontiac      Grand Prix            Regular   1993  6.0        Midsize Cars     Front-Wheel Drive
98   Pontiac      Grand Prix            Regular   1993  6.0        Midsize Cars     Front-Wheel Drive
99   Rolls-Royce  Brooklands/Brklnds L  Premium   1993  8.0        Midsize Cars     Rear-Wheel Drive

100 rows × 7 columns
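The DtypeWarning in the output above suggests two fixes: specify the `dtype` option or set `low_memory=False`. Here is a minimal sketch of both on a tiny stand-in CSV. The two-row text and the column name "code" are hypothetical, since I have not checked which columns of vehicles.csv are the mixed ones.

```python
import io
import pandas as pd

# Tiny stand-in for vehicles.csv; the column name "code" is hypothetical
csv_text = "make,code\nFord,12\nDodge,A7\n"

# Option 1: state the type of a mixed column explicitly
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

# Option 2: let Pandas read the whole file before inferring types
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(df_typed)
```

Option 1 is more work but documents what each column is supposed to hold, while option 2 is a one-argument change that uses more memory on large files.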

# Get the total number of rows and columns in the dataset
num_rows, num_cols = fuel_economy_df.shape
print(f"There are {num_rows} rows and {num_cols} columns in the dataset.")

There are 46024 rows and 83 columns in the dataset.

# Get the 10 makes with the most vehicle records in the dataset
top_makes = fuel_economy_df['make'].value_counts().nlargest(10)
print(top_makes)

Chevrolet        4336
Ford             3744
GMC              2724
Dodge            2678
Toyota           2330
BMW              2315
Mercedes-Benz    1829
Nissan           1619
Porsche          1390
Volkswagen       1286
Name: make, dtype: int64

# Get the number of vehicles by fuel type and cylinder count
fuel_cyl_counts = fuel_economy_df.groupby(['fuelType', 'cylinders']).size()
print(fuel_cyl_counts)

fuelType                     cylinders
CNG                          4.0             21
                             6.0              3
                             8.0             36
Diesel                       4.0            463
                             5.0             26
                             6.0            290
                             8.0            471
                             10.0             4
Gasoline or E85              4.0            149
                             6.0            438
                             8.0            800
Gasoline or natural gas      4.0              5
                             6.0              2
                             8.0             13
Gasoline or propane          8.0              8
Midgrade                     6.0              9
                             8.0            146
Premium                      2.0             22
                             3.0            108
                             4.0           3649
                             5.0            308
                             6.0           5308
                             8.0           3576
                             10.0           184
                             12.0           653
                             16.0            20
Premium Gas or Electricity   2.0             12
                             3.0              3
                             4.0             12
                             6.0             16
                             8.0             12
Premium and Electricity      3.0             14
                             4.0             87
                             6.0             37
                             8.0             15
Premium or E85               4.0             33
                             6.0             56
                             8.0             16
                             12.0            22
Regular                      2.0             29
                             3.0            260
                             4.0          13427
                             5.0            442
                             6.0           9532
                             8.0           4664
                             10.0             8
                             12.0            40
Regular Gas and Electricity  4.0             73
                             6.0             11
Regular Gas or Electricity   4.0              4
dtype: int64
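A long grouped result like the one above can be easier to scan as a grid, with fuel types as rows and cylinder counts as columns. This sketch uses a small hypothetical sample with the same two column names instead of the real file.

```python
import pandas as pd

# Small hypothetical sample with the same two column names
cars = pd.DataFrame({
    "fuelType": ["Regular", "Regular", "Premium", "Diesel"],
    "cylinders": [4.0, 4.0, 6.0, 4.0],
})

counts = cars.groupby(["fuelType", "cylinders"]).size()

# Move cylinder counts into columns; missing combinations become 0
grid = counts.unstack(fill_value=0)

print(grid)
```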