Views and Copies

R for Data Science by Wickham & Grolemund

Author

Sungkyun Cho

Published

October 4, 2023

Load Packages

# numerical calculation & data frames
import numpy as np
import pandas as pd

What’s new in 2.0.0 (April 3, 2023)

Argument dtype_backend, to return pyarrow-backed or numpy-backed nullable dtypes
Copy-on-Write improvements: link

Enabling copy_on_write as shown below is recommended:

pd.set_option("mode.copy_on_write", True)
# or
pd.options.mode.copy_on_write = True

NumPy

Subsetting a NumPy array can return a view rather than a copy.
This saves time and memory, but it leaves room for confusion.

arr = np.array([1, 2, 3, 4, 5])
arr

array([1, 2, 3, 4, 5])

sub_arr = arr[1:3]
sub_arr # view

array([2, 3])

Modifying the subset also changes the original:

sub_arr[0] = 99

print(arr)
print(sub_arr)

[ 1 99  3  4  5]
[99  3]

Conversely, modifying the “original” arr also changes the subset:

arr[2] = -11

print(arr)
print(sub_arr)

[  1  99 -11   4   5]
[ 99 -11]

In fact, arr and sub_arr reference the same memory: sub_arr is a view onto the data buffer owned by arr.
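The shared memory can be checked directly. The following is a minimal sketch using `ndarray.base` and `np.shares_memory`; the variable names `is_view_of_arr` and `shares` are only illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
sub_arr = arr[1:3]  # basic slice -> view

# A view does not own its data; .base points at the array it was sliced from.
is_view_of_arr = sub_arr.base is arr

# np.shares_memory reports whether the two arrays overlap in memory.
shares = np.shares_memory(arr, sub_arr)

print(is_view_of_arr, shares)  # True True
```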

Note

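Some subsetting operations return a copy instead of a view.

As a rule of thumb, anything other than simple (basic) indexing returns a copy.
That is, a slice such as arr[2:4] returns a view (as does arr[2] on a multidimensional array, where it selects a row; on a 1-D array it returns a scalar), whereas subsetting with an integer array (fancy indexing), e.g. arr[[2, 3]], or with a boolean mask (boolean indexing), e.g. arr[arr > 2], returns a copy.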

arr = np.array([1, 2, 3, 4, 5])

sub_arr = arr[[2, 3]]  # copy
sub_arr[0] = 99

print(arr)
print(sub_arr)

[1 2 3 4 5]
[99  4]

sub_arr = arr[arr > 2]  # copy
sub_arr[0] = 99

print(arr)
print(sub_arr)

[1 2 3 4 5]
[99  4  5]
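The copy behaviour above applies when a fancy or boolean index appears on the right-hand side. When the same index appears on the left-hand side of an assignment, NumPy writes into the original array. A short sketch of the difference (the names `picked`, `after_copy_edit` and `after_assignment` are illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

picked = arr[[2, 3]]  # fancy indexing on the right-hand side: a new array
picked[0] = 99        # does not reach arr
after_copy_edit = arr.tolist()

arr[[2, 3]] = 0       # fancy indexing on the left-hand side: writes into arr
arr[arr > 4] = -1     # boolean mask on the left-hand side: also writes into arr
after_assignment = arr.tolist()

print(after_copy_edit)   # [1, 2, 3, 4, 5]
print(after_assignment)  # [1, 2, 0, 0, -1]
```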

Note

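If the left-hand side of the assignment has no [:], assigning an array computed from a view rebinds the name to a new array (a copy). With [:], the values are written into the view and therefore into the original.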

arr = np.array([1, 2, 3, 4, 5])

sub_arr = arr[1:4]     # view
sub_arr = sub_arr * 2  # copy

print(arr)
print(sub_arr)

[1 2 3 4 5]
[4 6 8]

arr = np.array([1, 2, 3, 4, 5])

sub_arr = arr[1:4]        # view
sub_arr[:] = sub_arr * 2  # view

print(arr)
print(sub_arr)

[1 4 6 8 5]
[4 6 8]
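Augmented assignment behaves like the `[:]` form. As a small sketch, `sub_arr *= 2` operates in place and writes through the view, while `rebound = rebound * 2` creates a new array (the name `rebound` is illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

sub_arr = arr[1:4]  # view
sub_arr *= 2        # in-place multiply: writes through the view into arr
after_inplace = arr.tolist()

rebound = arr[1:4]     # view
rebound = rebound * 2  # new array bound to the same name; arr is untouched
after_rebind = arr.tolist()

print(after_inplace)                   # [1, 4, 6, 8, 5]
print(after_rebind)                    # [1, 4, 6, 8, 5]
print(np.shares_memory(arr, rebound))  # False
```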

To force a copy: sub_arr.copy()
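A minimal sketch of an explicit copy: the slice is copied before it is modified, so the original array stays intact (the name `sub_copy` is illustrative).

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

sub_copy = arr[1:4].copy()  # independent array with its own memory
sub_copy[0] = 99

print(arr.tolist())                     # [1, 2, 3, 4, 5]
print(sub_copy.tolist())                # [99, 3, 4]
print(np.shares_memory(arr, sub_copy))  # False
```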

pandas

Much more complicated…
Whether you get a view or a copy depends on both the data types and how the data was created.

df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=["one", "two"])
df

   one  two
0    0    1
1    2    3
2    4    5
3    6    7

sub_df = df.iloc[1:3]  # view
sub_df

   one  two
1    2    3
2    4    5

df.iloc[1, 1] = 99

print(df)
print(sub_df)

   one  two
0    0    1
1    2   99
2    4    5
3    6    7
   one  two
1    2   99
2    4    5

Note

With copy_on_write = True (v2.0), a copy is made instead, so sub_df keeps its original values:

print(sub_df)
#    one  two
# 1    2    3
# 2    4    5
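The Copy-on-Write behaviour can be reproduced without changing the global setting. This sketch assumes pandas 2.x, where the `mode.copy_on_write` option exists, and enables it only inside a `pd.option_context` block; `parent_value` and `subset_value` are illustrative names.

```python
import numpy as np
import pandas as pd

# Enable Copy-on-Write only for this block (assumes pandas 2.x).
with pd.option_context("mode.copy_on_write", True):
    df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=["one", "two"])
    sub_df = df.iloc[1:3]

    # Writing to df triggers a copy, so sub_df is left untouched.
    df.iloc[1, 1] = 99

    parent_value = int(df.iloc[1, 1])
    subset_value = int(sub_df.iloc[0, 1])

print(parent_value)  # 99
print(subset_value)  # 3
```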

df.iloc[1, 0] = 0.9  # copy; the column's dtype changes from int to float, which triggers a copy

print(df)
print(sub_df)

   one  two
0  0.0    1
1  0.9   99
2  4.0    5
3  6.0    7
   one  two
1    2   99
2    4    5

df.iloc[2, 1] = -99  # view

print(df)
print(sub_df)

   one  two
0  0.0    1
1  0.9   99
2  4.0  -99
3  6.0    7
   one  two
1    2   99
2    4  -99

`SettingWithCopyWarning`

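pandas issues this warning when you try to modify a subsetted DataFrame, but it is not always reliable.
When the warning appears, find the earlier place where the view or copy was created and fix it with .copy().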

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["one", "two", "three"])
df

   one  two  three
0    0    1      2
1    3    4      5
2    6    7      8
3    9   10     11

df_cols = df[["two", "three"]]  # copy
df_cols

   two  three
0    1      2
1    4      5
2    7      8
3   10     11

df.iloc[0, 1] = -55

print(df)
print(df_cols)

   one  two  three
0    0  -55      2
1    3    4      5
2    6    7      8
3    9   10     11
   two  three
0    1      2
1    4      5
2    7      8
3   10     11

Trying to modify the subset produces a warning message!
With copy_on_write set to True, a copy is made and no warning is raised (v2.0).

df_cols.iloc[0, 1] = -99

/var/folders/tv/fwb_421x50z8bj5v37vw680r0000gn/T/ipykernel_1691/2609376290.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cols.iloc[0, 1] = -99

print(df_cols)
print(df)

   two  three
0    1    -99
1    4      5
2    7      8
3   10     11
   one  two  three
0    0  -55      2
1    3    4      5
2    6    7      8
3    9   10     11

Compare with the following:

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["one", "two", "three"])
df

   one  two  three
0    0    1      2
1    3    4      5
2    6    7      8
3    9   10     11

df_cols_3 = df.loc[:, ["two", "three"]]  # copy
df_cols_3.iloc[0, 1] = -99

# No warning

print(df_cols_3)
print(df)

   two  three
0    1    -99
1    4      5
2    7      8
3   10     11
   one  two  three
0    0    1      2
1    3    4      5
2    6    7      8
3    9   10     11

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["one", "two", "three"])
df

   one  two  three
0    0    1      2
1    3    4      5
2    6    7      8
3    9   10     11

df_cols_2 = df.loc[:, "two":"three"]  # view
df_cols_2.iloc[0, 1] = -99

/var/folders/tv/fwb_421x50z8bj5v37vw680r0000gn/T/ipykernel_1691/4245876664.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_cols_2.iloc[0, 1] = -99

print(df_cols_2)
print(df)

   two  three
0    1    -99
1    4      5
2    7      8
3   10     11
   one  two  three
0    0    1    -99
1    3    4      5
2    6    7      8
3    9   10     11

Note

With copy_on_write set to True, a copy is made and no warning is raised (v2.0), so df is unchanged:

print(df)
#    one  two  three
# 0    0    1      2
# 1    3    4      5
# 2    6    7      8
# 3    9   10     11

To force a copy: df_cols.copy()

df_cols_4 = df[["two", "three"]].copy()
df_cols_4.iloc[0, 1] = -99
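As a self-contained sketch of the same pattern, the explicit copy can be modified freely while the parent DataFrame keeps its values (`subset_value` and `parent_value` are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=["one", "two", "three"])

# Explicit copy: df_cols_4 owns its data, independent of df.
df_cols_4 = df[["two", "three"]].copy()
df_cols_4.iloc[0, 1] = -99

subset_value = int(df_cols_4.iloc[0, 1])
parent_value = int(df.loc[0, "three"])

print(subset_value)  # -99
print(parent_value)  # 2
```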

Tip

Unless you create a subset only to analyze it right away, using .copy() is the safe choice.