Combine

R for Data Science by Wickham & Grolemund

Author

Sungkyun Cho

Published

May 1, 2024

Load packages
# numerical calculation & data frames
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
import seaborn.objects as so

# statistics
import statsmodels.api as sm

# pandas options
pd.set_option('mode.copy_on_write', True)  # pandas 2.0
pd.options.display.float_format = '{:.2f}'.format  # pd.reset_option('display.float_format')
pd.options.display.max_rows = 7  # max number of rows to display

# NumPy options
np.set_printoptions(precision = 2, suppress=True)  # suppress scientific notation

# For high resolution display
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats("retina")

Relational Data

nycflights13

The five tables flights, airlines, airports, planes, and weather are related to one another.

# Load the nycflights13 dataset
flights = sm.datasets.get_rdataset("flights", "nycflights13").data
flights["time_hour"] = pd.to_datetime(flights["time_hour"], utc=False).dt.tz_convert("America/New_York")  # convert to datetime
flights
        year  month  day  dep_time  sched_dep_time  dep_delay  arr_time  \
0       2013      1    1    517.00             515       2.00    830.00   
1       2013      1    1    533.00             529       4.00    850.00   
2       2013      1    1    542.00             540       2.00    923.00   
...      ...    ...  ...       ...             ...        ...       ...   
336773  2013      9   30       NaN            1210        NaN       NaN   
336774  2013      9   30       NaN            1159        NaN       NaN   
336775  2013      9   30       NaN             840        NaN       NaN   

        sched_arr_time  arr_delay carrier  flight tailnum origin dest  \
0                  819      11.00      UA    1545  N14228    EWR  IAH   
1                  830      20.00      UA    1714  N24211    LGA  IAH   
2                  850      33.00      AA    1141  N619AA    JFK  MIA   
...                ...        ...     ...     ...     ...    ...  ...   
336773            1330        NaN      MQ    3461  N535MQ    LGA  BNA   
336774            1344        NaN      MQ    3572  N511MQ    LGA  CLE   
336775            1020        NaN      MQ    3531  N839MQ    LGA  RDU   

        air_time  distance  hour  minute                 time_hour  
0         227.00      1400     5      15 2013-01-01 05:00:00-05:00  
1         227.00      1416     5      29 2013-01-01 05:00:00-05:00  
2         160.00      1089     5      40 2013-01-01 05:00:00-05:00  
...          ...       ...   ...     ...                       ...  
336773       NaN       764    12      10 2013-09-30 12:00:00-04:00  
336774       NaN       419    11      59 2013-09-30 11:00:00-04:00  
336775       NaN       431     8      40 2013-09-30 08:00:00-04:00  

[336776 rows x 19 columns]
airlines = sm.datasets.get_rdataset("airlines", "nycflights13").data
airlines
   carrier                    name
0       9E       Endeavor Air Inc.
1       AA  American Airlines Inc.
2       AS    Alaska Airlines Inc.
..     ...                     ...
13      VX          Virgin America
14      WN  Southwest Airlines Co.
15      YV      Mesa Airlines Inc.

[16 rows x 2 columns]
airports = sm.datasets.get_rdataset("airports", "nycflights13").data
airports
      faa                           name   lat    lon   alt  tz dst  \
0     04G              Lansdowne Airport 41.13 -80.62  1044  -5   A   
1     06A  Moton Field Municipal Airport 32.46 -85.68   264  -6   A   
2     06C            Schaumburg Regional 41.99 -88.10   801  -6   A   
...   ...                            ...   ...    ...   ...  ..  ..   
1455  ZWI      Wilmington Amtrak Station 39.74 -75.55     0  -5   A   
1456  ZWU       Washington Union Station 38.90 -77.01    76  -5   A   
1457  ZYP                   Penn Station 40.75 -73.99    35  -5   A   

                 tzone  
0     America/New_York  
1      America/Chicago  
2      America/Chicago  
...                ...  
1455  America/New_York  
1456  America/New_York  
1457  America/New_York  

[1458 rows x 8 columns]
planes = sm.datasets.get_rdataset("planes", "nycflights13").data
planes
     tailnum    year                     type                   manufacturer  \
0     N10156 2004.00  Fixed wing multi engine                        EMBRAER   
1     N102UW 1998.00  Fixed wing multi engine               AIRBUS INDUSTRIE   
2     N103US 1999.00  Fixed wing multi engine               AIRBUS INDUSTRIE   
...      ...     ...                      ...                            ...   
3319  N998AT 2002.00  Fixed wing multi engine                         BOEING   
3320  N998DL 1992.00  Fixed wing multi engine  MCDONNELL DOUGLAS CORPORATION   
3321  N999DN 1992.00  Fixed wing multi engine  MCDONNELL DOUGLAS CORPORATION   

          model  engines  seats  speed     engine  
0     EMB-145XR        2     55    NaN  Turbo-fan  
1      A320-214        2    182    NaN  Turbo-fan  
2      A320-214        2    182    NaN  Turbo-fan  
...         ...      ...    ...    ...        ...  
3319    717-200        2    100    NaN  Turbo-fan  
3320      MD-88        2    142    NaN  Turbo-jet  
3321      MD-88        2    142    NaN  Turbo-jet  

[3322 rows x 9 columns]
weather = sm.datasets.get_rdataset("weather", "nycflights13").data
weather["time_hour"] = pd.to_datetime(weather["time_hour"], utc=False).dt.tz_convert("America/New_York")  # convert to datetime
weather
      origin  year  month  day  hour  temp  dewp  humid  wind_dir  wind_speed  \
0        EWR  2013      1    1     1 39.02 26.06  59.37    270.00       10.36   
1        EWR  2013      1    1     2 39.02 26.96  61.63    250.00        8.06   
2        EWR  2013      1    1     3 39.02 28.04  64.43    240.00       11.51   
...      ...   ...    ...  ...   ...   ...   ...    ...       ...         ...   
26112    LGA  2013     12   30    16 32.00 15.08  49.19    340.00       14.96   
26113    LGA  2013     12   30    17 30.92 12.92  46.74    320.00       17.26   
26114    LGA  2013     12   30    18 28.94 10.94  46.41    330.00       18.41   

       wind_gust  precip  pressure  visib                 time_hour  
0            NaN    0.00   1012.00  10.00 2013-01-01 01:00:00-05:00  
1            NaN    0.00   1012.30  10.00 2013-01-01 02:00:00-05:00  
2            NaN    0.00   1012.50  10.00 2013-01-01 03:00:00-05:00  
...          ...     ...       ...    ...                       ...  
26112      23.02    0.00   1019.50  10.00 2013-12-30 16:00:00-05:00  
26113        NaN    0.00   1019.90  10.00 2013-12-30 17:00:00-05:00  
26114        NaN    0.00   1020.90  10.00 2013-12-30 18:00:00-05:00  

[26115 rows x 15 columns]

Keys: the variables used to connect each pair of tables

  • primary key ~ foreign key: relation
    • A primary key: a variable or set of variables that uniquely identifies each observation.
      • tailnum in the planes dataset,
      • faa in the airports dataset
    • A foreign key: a variable or set of variables that corresponds to a primary key in another table.
      • tailnum in the flights dataset, in relation to tailnum in the planes dataset,
      • origin in the flights dataset, in relation to faa in the airports dataset
  • A relation can be 1-to-1 or 1-to-many.
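As a sketch of this relation, a foreign key can be checked against the primary key it refers to with isin(); the toy frames below are hypothetical stand-ins for planes and flights:

```python
import pandas as pd

# hypothetical stand-ins: planes_toy has primary key `tailnum`,
# flights_toy carries `tailnum` as a foreign key
planes_toy = pd.DataFrame({"tailnum": ["N1", "N2", "N3"], "seats": [100, 150, 200]})
flights_toy = pd.DataFrame({"flight": [10, 11, 12], "tailnum": ["N1", "N1", "N9"]})

# foreign-key values with no matching primary key
unmatched = flights_toy.loc[~flights_toy["tailnum"].isin(planes_toy["tailnum"]), "tailnum"]
print(unmatched.tolist())  # ['N9']
```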

How to check a primary key

Is tailnum a primary key for planes?

planes.duplicated(subset="tailnum").sum()
# or
(planes.value_counts("tailnum") > 1).sum()
0

For weather, the combination of the five variables year, month, day, hour, and origin could serve as the primary key, but…

(
    weather[["year", "month", "day", "hour", "origin"]]
    .value_counts()
    .reset_index(name="n")
    .query("n > 1")
)
   year  month  day  hour origin  n
0  2013     11    3     1    EWR  2
1  2013     11    3     1    JFK  2
2  2013     11    3     1    LGA  2

To use this combination as a primary key, the ambiguous records below must be resolved first:

weather[
    weather.duplicated(
        subset=["year", "month", "day", "hour", "origin"], keep=False
    )  # keep=False: mark all duplicated rows as True
]
      origin  year  month  day  hour  temp  dewp  humid  wind_dir  wind_speed  \
7318     EWR  2013     11    3     1 51.98 39.02  61.15    310.00        6.90   
7319     EWR  2013     11    3     1 50.00 39.02  65.80    290.00        5.75   
16023    JFK  2013     11    3     1 53.96 37.94  54.51    320.00        9.21   
16024    JFK  2013     11    3     1 51.98 37.94  58.62    310.00        6.90   
24729    LGA  2013     11    3     1 55.04 39.02  54.67    330.00        9.21   
24730    LGA  2013     11    3     1 53.96 39.92  58.89    310.00        8.06   

       wind_gust  precip  pressure  visib                 time_hour  
7318         NaN    0.00   1009.80  10.00 2013-11-03 01:00:00-04:00  
7319         NaN    0.00   1010.50  10.00 2013-11-03 01:00:00-05:00  
16023        NaN    0.00   1009.80  10.00 2013-11-03 01:00:00-04:00  
16024        NaN    0.00   1010.50  10.00 2013-11-03 01:00:00-05:00  
24729        NaN    0.00   1009.30  10.00 2013-11-03 01:00:00-04:00  
24730        NaN    0.00   1010.20  10.00 2013-11-03 01:00:00-05:00  

Does flights have a primary key?

flights.duplicated(subset=["year", "month", "day", "flight"]).sum()
32610
flights.duplicated(subset=["year", "month", "day", "flight", "carrier"]).sum()
24
flights.duplicated(subset=["year", "month", "day", "flight", "carrier", "sched_dep_time"]).sum()
0

Note, however, that the absence of duplicates does not guarantee a good primary key.
In the case above, it is natural for the combination of (date/time, flight number, carrier) to be unique.
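If no natural key is satisfactory, one common workaround (a sketch with a made-up toy frame, not the dataset above) is to add a surrogate key, e.g. the row number, via reset_index():

```python
import pandas as pd

# toy frame with no reliable natural key
df = pd.DataFrame({"carrier": ["UA", "UA", "AA"], "flight": [1545, 1545, 1141]})

# add a surrogate key: the row number as an explicit `id` column
df_keyed = df.reset_index(names="id")
print(df_keyed["id"].tolist())  # [0, 1, 2]
```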

Combine

merge(): combines two DataFrames by matching the values of the key variables
concat(): joins by aligning the common rows or columns

Merge

Inner join

df_x = pd.DataFrame({"key": [1, 2, 3], "val_x": ["x1", "x2", "x3"]})
df_y = pd.DataFrame({"key": [1, 2, 4], "val_y": ["y1", "y2", "y3"]})
   key val_x
0    1    x1
1    2    x2
2    3    x3
   key val_y
0    1    y1
1    2    y2
2    4    y3

pd.merge(df_x, df_y, on="key")  # "on=" can be omitted when the tables share a common column
   key val_x val_y
0    1    x1    y1
1    2    x2    y2
df_x.merge(df_y, on="key")  # as a method
   key val_x val_y
0    1    x1    y1
1    2    x2    y2
Note

merge() joins with the inner method by default; how="inner" was omitted above, making it equivalent to:

pd.merge(df_x, df_y, on="key", how="inner")

Other methods are "left", "right", and "outer":

  • left keeps all observations in x: the most commonly used join, where x serves as the base table.
  • right keeps all observations in y.
  • outer keeps all observations in x and y.

pd.merge(df_x, df_y, how="left")
   key val_x val_y
0    1    x1    y1
1    2    x2    y2
2    3    x3   NaN
pd.merge(df_x, df_y, how="right")
   key val_x val_y
0    1    x1    y1
1    2    x2    y2
2    4   NaN    y3
pd.merge(df_x, df_y, how="outer")
   key val_x val_y
0    1    x1    y1
1    2    x2    y2
2    3    x3   NaN
3    4   NaN    y3
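When checking joins, merge() also accepts indicator=True, which appends a _merge column recording where each row came from; df_x and df_y are re-created here so the sketch is self-contained:

```python
import pandas as pd

df_x = pd.DataFrame({"key": [1, 2, 3], "val_x": ["x1", "x2", "x3"]})
df_y = pd.DataFrame({"key": [1, 2, 4], "val_y": ["y1", "y2", "y3"]})

# _merge is "both", "left_only", or "right_only" for each row
merged = pd.merge(df_x, df_y, on="key", how="outer", indicator=True)
print(merged["_merge"].tolist())  # ['both', 'both', 'left_only', 'right_only']
```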

Duplicate keys

When only one side has duplicate keys

flights2 = flights[
    ["year", "month", "day", "hour", "origin", "dest", "tailnum", "carrier"]
]
flights2
        year  month  day  hour origin dest tailnum carrier
0       2013      1    1     5    EWR  IAH  N14228      UA
1       2013      1    1     5    LGA  IAH  N24211      UA
2       2013      1    1     5    JFK  MIA  N619AA      AA
...      ...    ...  ...   ...    ...  ...     ...     ...
336773  2013      9   30    12    LGA  BNA  N535MQ      MQ
336774  2013      9   30    11    LGA  CLE  N511MQ      MQ
336775  2013      9   30     8    LGA  RDU  N839MQ      MQ

[336776 rows x 8 columns]
# To add the airlines' full names to flights,
(
    flights2
    .merge(airlines, on="carrier", how="left")
)
        year  month  day  hour origin dest tailnum carrier  \
0       2013      1    1     5    EWR  IAH  N14228      UA   
1       2013      1    1     5    LGA  IAH  N24211      UA   
2       2013      1    1     5    JFK  MIA  N619AA      AA   
...      ...    ...  ...   ...    ...  ...     ...     ...   
336773  2013      9   30    12    LGA  BNA  N535MQ      MQ   
336774  2013      9   30    11    LGA  CLE  N511MQ      MQ   
336775  2013      9   30     8    LGA  RDU  N839MQ      MQ   

                          name  
0        United Air Lines Inc.  
1        United Air Lines Inc.  
2       American Airlines Inc.  
...                        ...  
336773               Envoy Air  
336774               Envoy Air  
336775               Envoy Air  

[336776 rows x 9 columns]

When both sides have duplicate keys: be careful!
The keys can't uniquely identify an observation, so every possible combination of matching rows appears.
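A minimal sketch of that many-to-many case with toy frames: two matching rows on each side produce all four combinations, and the validate= argument can make merge() fail fast instead:

```python
import pandas as pd

dup_x = pd.DataFrame({"key": [1, 1], "val_x": ["x1", "x2"]})
dup_y = pd.DataFrame({"key": [1, 1], "val_y": ["y1", "y2"]})

# 2 x 2 rows sharing the same key -> 4 combinations
m = pd.merge(dup_x, dup_y, on="key")
print(len(m))  # 4

# guard against an accidental many-to-many join
try:
    pd.merge(dup_x, dup_y, on="key", validate="one_to_one")
except pd.errors.MergeError as e:
    print("MergeError:", e)
```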

Defining the key columns

By default, merge() uses all variables that appear in both tables, the so-called natural join.

flights2.merge(weather, how="left").head(3)  # on=["year", "month", "day", "hour", "origin"]
   year  month  day  hour origin dest tailnum carrier  temp  dewp  humid  \
0  2013      1    1     5    EWR  IAH  N14228      UA 39.02 28.04  64.43   
1  2013      1    1     5    LGA  IAH  N24211      UA 39.92 24.98  54.81   
2  2013      1    1     5    JFK  MIA  N619AA      AA 39.02 26.96  61.63   

   wind_dir  wind_speed  wind_gust  precip  pressure  visib  \
0    260.00       12.66        NaN    0.00   1011.90  10.00   
1    250.00       14.96      21.86    0.00   1011.40  10.00   
2    260.00       14.96        NaN    0.00   1012.10  10.00   

                  time_hour  
0 2013-01-01 05:00:00-05:00  
1 2013-01-01 05:00:00-05:00  
2 2013-01-01 05:00:00-05:00  

year in flights and year in planes mean different things.
Duplicated column names are distinguished with suffixes:

# shown separately as year_x and year_y
flights2.merge(planes, on="tailnum", how="left").head(3)
   year_x  month  day  hour origin dest tailnum carrier  year_y  \
0    2013      1    1     5    EWR  IAH  N14228      UA 1999.00   
1    2013      1    1     5    LGA  IAH  N24211      UA 1998.00   
2    2013      1    1     5    JFK  MIA  N619AA      AA 1990.00   

                      type manufacturer    model  engines  seats  speed  \
0  Fixed wing multi engine       BOEING  737-824     2.00 149.00    NaN   
1  Fixed wing multi engine       BOEING  737-824     2.00 149.00    NaN   
2  Fixed wing multi engine       BOEING  757-223     2.00 178.00    NaN   

      engine  
0  Turbo-fan  
1  Turbo-fan  
2  Turbo-fan  

In the airports DataFrame, the airport code is stored in a column named faa.
To get airport information for the destination, match faa with dest;
to get airport information for the origin, match faa with origin.

airports.head(3)
   faa                           name   lat    lon   alt  tz dst  \
0  04G              Lansdowne Airport 41.13 -80.62  1044  -5   A   
1  06A  Moton Field Municipal Airport 32.46 -85.68   264  -6   A   
2  06C            Schaumburg Regional 41.99 -88.10   801  -6   A   

              tzone  
0  America/New_York  
1   America/Chicago  
2   America/Chicago  
flights2.merge(airports, left_on="dest", right_on="faa", how="left").head(5)
   year  month  day  hour origin dest tailnum carrier  faa  \
0  2013      1    1     5    EWR  IAH  N14228      UA  IAH   
1  2013      1    1     5    LGA  IAH  N24211      UA  IAH   
2  2013      1    1     5    JFK  MIA  N619AA      AA  MIA   
3  2013      1    1     5    JFK  BQN  N804JB      B6  NaN   
4  2013      1    1     6    LGA  ATL  N668DN      DL  ATL   

                              name   lat    lon     alt    tz  dst  \
0     George Bush Intercontinental 29.98 -95.34   97.00 -6.00    A   
1     George Bush Intercontinental 29.98 -95.34   97.00 -6.00    A   
2                       Miami Intl 25.79 -80.29    8.00 -5.00    A   
3                              NaN   NaN    NaN     NaN   NaN  NaN   
4  Hartsfield Jackson Atlanta Intl 33.64 -84.43 1026.00 -5.00    A   

              tzone  
0   America/Chicago  
1   America/Chicago  
2  America/New_York  
3               NaN  
4  America/New_York  
flights2.merge(airports, left_on="origin", right_on="faa", how="left").head(5)
   year  month  day  hour origin dest tailnum carrier  faa  \
0  2013      1    1     5    EWR  IAH  N14228      UA  EWR   
1  2013      1    1     5    LGA  IAH  N24211      UA  LGA   
2  2013      1    1     5    JFK  MIA  N619AA      AA  JFK   
3  2013      1    1     5    JFK  BQN  N804JB      B6  JFK   
4  2013      1    1     6    LGA  ATL  N668DN      DL  LGA   

                  name   lat    lon  alt  tz dst             tzone  
0  Newark Liberty Intl 40.69 -74.17   18  -5   A  America/New_York  
1           La Guardia 40.78 -73.87   22  -5   A  America/New_York  
2  John F Kennedy Intl 40.64 -73.78   13  -5   A  America/New_York  
3  John F Kennedy Intl 40.64 -73.78   13  -5   A  America/New_York  
4           La Guardia 40.78 -73.87   22  -5   A  America/New_York  

Concatenate

pd.concat([df1, df2, ...], axis=)

Combines two DataFrames/Series by aligning their row and column indexes.

df1 = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=["a", "b", "c"], columns=["one", "two"]
)
df2 = pd.DataFrame(
    5 + np.arange(4).reshape(2, 2), index=["a", "c"], columns=["three", "four"]
)
   one  two
a    0    1
b    2    3
c    4    5
   three  four
a      5     6
c      7     8
pd.concat([df1, df2], axis=1)
   one  two  three  four
a    0    1   5.00  6.00
b    2    3    NaN   NaN
c    4    5   7.00  8.00
pd.concat([df1, df2])  # default: axis=0
   one  two  three  four
a 0.00 1.00    NaN   NaN
b 2.00 3.00    NaN   NaN
c 4.00 5.00    NaN   NaN
a  NaN  NaN   5.00  6.00
c  NaN  NaN   7.00  8.00
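Two further concat() options worth knowing (a small sketch with toy frames): ignore_index=True discards the original row labels, and keys= tags each source frame in a hierarchical index:

```python
import pandas as pd

a = pd.DataFrame({"one": [0, 2], "two": [1, 3]})
b = pd.DataFrame({"one": [4], "two": [5]})

# discard the original row labels
stacked = pd.concat([a, b], ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2]

# tag each block with its source label in a MultiIndex
labeled = pd.concat([a, b], keys=["a", "b"])
print(labeled.index.tolist())  # [('a', 0), ('a', 1), ('b', 0)]
```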

Filtering

merge() can also be used to filter for the rows you need.

For example, to find the ten most popular destinations in the flights2 table:

top_dest = flights2.value_counts("dest").head(10).reset_index(name="n")
top_dest
   dest      n
0   ORD  17283
1   ATL  17215
2   LAX  16174
..  ...    ...
7   FLL  12055
8   MIA  11728
9   DCA   9705

[10 rows x 2 columns]
flights2.merge(top_dest, on="dest")  # inner join
        year  month  day  hour origin dest tailnum carrier      n
0       2013      1    1     5    JFK  MIA  N619AA      AA  11728
1       2013      1    1     6    LGA  ATL  N668DN      DL  17215
2       2013      1    1     5    EWR  ORD  N39463      UA  17283
...      ...    ...  ...   ...    ...  ...     ...     ...    ...
141142  2013      9   30    20    JFK  MCO  N804JB      B6  14082
141143  2013      9   30    22    JFK  BOS  N565JB      B6  15508
141144  2013      9   30    14    JFK  DCA     NaN      9E   9705

[141145 rows x 9 columns]
Tip

Alternatively, isin() can be used:

flights2[flights2["dest"].isin(top_dest["dest"])]
Note

To merge on index values as keys, use left_index=True and right_index=True instead of left_on and right_on.

The .join() method offers a simple way to merge two DataFrames with no overlapping columns (index-aligned, left merge): df1.join(df2)
For other ways of using the .join() method, see its docstring.
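A minimal sketch of .join(), re-creating the df1/df2 frames from the Concatenate section:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    np.arange(6).reshape(3, 2), index=["a", "b", "c"], columns=["one", "two"]
)
df2 = pd.DataFrame(
    5 + np.arange(4).reshape(2, 2), index=["a", "c"], columns=["three", "four"]
)

# index-aligned left merge; the column names must not overlap
joined = df1.join(df2)
print(joined)
#    one  two  three  four
# a    0    1    5.0   6.0
# b    2    3    NaN   NaN
# c    4    5    7.0   8.0
```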

For the various ways of combining, see the textbook:
8.2 Combining and Merging Datasets in McKinney's Python for Data Analysis

Merge problems:

Merging is an operation that demands great care!

  1. Start by identifying the variables that form the primary key in each table.

    • You should usually do this based on your understanding of the data, not empirically by looking for a combination of variables that give a unique identifier.
    # Verification is essential
    weather[
     weather.duplicated(subset=["year", "month", "day", "hour", "origin"], keep=False)
    ]
    #       origin  year  month  day  hour  temp  dewp  humid  wind_dir  wind_speed  ...
    # 7318     EWR  2013     11    3     1 51.98 39.02  61.15    310.00        6.90   
    # 7319     EWR  2013     11    3     1 50.00 39.02  65.80    290.00        5.75   
    # 16023    JFK  2013     11    3     1 53.96 37.94  54.51    320.00        9.21   
    # 16024    JFK  2013     11    3     1 51.98 37.94  58.62    310.00        6.90   
    # 24729    LGA  2013     11    3     1 55.04 39.02  54.67    330.00        9.21   
    # 24730    LGA  2013     11    3     1 53.96 39.92  58.89    310.00        8.06 
  2. Check that none of the variables in the primary key are missing. If a value is missing then it can’t identify an observation!

  3. Check that your foreign keys match primary keys in another table.

    • It’s common for keys not to match because of data entry errors. Fixing these is often a lot of work.
    • If you do have missing keys, you’ll need to be thoughtful about your use of inner vs. outer joins, carefully considering whether or not you want to drop rows that don’t have a match.
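Step 3 above can be sketched as an anti-join using indicator=True; the toy frames below are hypothetical stand-ins for flights (foreign key dest) and airports (primary key faa):

```python
import pandas as pd

fl = pd.DataFrame({"flight": [1, 2, 3], "dest": ["IAH", "BQN", "MIA"]})
ap = pd.DataFrame({"faa": ["IAH", "MIA"], "name": ["George Bush", "Miami Intl"]})

# left merge with indicator, then keep the rows that found no match
chk = fl.merge(ap, left_on="dest", right_on="faa", how="left", indicator=True)
missing = chk.loc[chk["_merge"] == "left_only", ["flight", "dest"]]
print(missing["dest"].tolist())  # ['BQN']
```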