Exercises

R for Data Science by Wickham & Grolemund

Author

Sungkyun Cho

Published

April 22, 2024

Load packages
# numerical calculation & data frames
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so

# statistics
import statsmodels.api as sm

# pandas options
pd.set_option('mode.copy_on_write', True)  # pandas 2.0
pd.options.display.float_format = '{:.2f}'.format  # pd.reset_option('display.float_format')
pd.options.display.max_rows = 7  # max number of rows to display

# NumPy options
np.set_printoptions(precision = 2, suppress=True)  # suppress scientific notation

# For high resolution display
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

The nycflights13 datasets

Use the four relational tables of nycflights13 that were covered in the Combine section.

  1. Add the location of the origin and destination (i.e. the lat and lon in airports) to flights.

  2. Is there a relationship between the age of a plane and its delays?

  3. What weather conditions make it more likely to see a delay?

  4. From the flights table, select the flights that fall on the 10 days with the largest daily mean arrival delay (arr_delay).

  5. Which destinations (dest) in the flights table have no matching airport information in the airports table?

  6. Filter flights to only show flights with planes that have flown at least 100 flights.

  7. Find the 48 hours (over the course of the whole year) that have the worst (departure) delays. Cross-reference it with the weather data. Can you see any patterns?

    • Use the hour column of flights.
  8. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve learned above.

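    • In other words, the question is whether each plane is operated by only one airline. Check whether any plane is flown by two or more airlines.
    • Then build a table that contains only the planes flown by two or more airlines, together with the full names of those airlines.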
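
Two of the exercises above rest on join patterns that are easy to try on a small example first: finding keys with no match in another table (exercise 5) and counting distinct carriers per plane before attaching airline names (exercise 8). The sketch below is only one possible approach, not a reference solution. The three data frames are tiny hypothetical stand-ins that borrow the column names of flights, airports, and airlines; their rows are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the nycflights13 tables (illustrative rows, not the real data)
flights = pd.DataFrame({
    "tailnum": ["N1", "N1", "N2", "N2", "N3"],
    "carrier": ["AA", "AA", "9E", "EV", "UA"],
    "dest": ["ORD", "LAX", "BQN", "ORD", "SJU"],
})
airports = pd.DataFrame({
    "faa": ["ORD", "LAX"],
    "airport_name": ["Chicago O'Hare", "Los Angeles Intl"],
})
airlines = pd.DataFrame({
    "carrier": ["AA", "9E", "EV", "UA"],
    "name": ["American Airlines Inc.", "Endeavor Air Inc.",
             "ExpressJet Airlines Inc.", "United Air Lines Inc."],
})

# Anti-join pattern: a left merge with indicator=True marks rows of flights
# whose dest has no matching faa code in airports as "left_only".
merged = flights.merge(
    airports, left_on="dest", right_on="faa", how="left", indicator=True
)
missing_dest = sorted(merged.loc[merged["_merge"] == "left_only", "dest"].unique())

# Count distinct carriers per plane, keep planes with more than one carrier,
# then attach the full airline names with a second merge.
n_carriers = flights.groupby("tailnum")["carrier"].nunique()
shared_tailnums = n_carriers[n_carriers > 1].index
shared_planes = (
    flights.loc[flights["tailnum"].isin(shared_tailnums), ["tailnum", "carrier"]]
    .drop_duplicates()
    .merge(airlines, on="carrier", how="left")
)

print(missing_dest)
print(shared_planes)
```

On the toy data, missing_dest holds the two destinations absent from airports, and shared_planes lists the one plane that appears under two carriers. With the real tables, rows with a missing tailnum may need to be dropped before grouping.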