What is the difference between join and merge in Pandas?
Suppose I have two DataFrames like so:
left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})
I want to merge them, so I try something like this:
pd.merge(left, right, left_on='key1', right_on='key2')
And I'm happy:
  key1  lval key2  rval
0  foo     1  foo     4
1  bar     2  bar     5
But when I try to use the join method instead:
left.join(right, on=['key1', 'key2'])
I get this:
//anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _validate_specification(self)
406 if self.right_index:
407 if not ((len(self.left_on) == self.right.index.nlevels)):
--> 408 raise AssertionError()
409 self.right_on = [None] * n
410 elif self.right_on is not None:
AssertionError:
What am I missing?
I always use join on indices:
import pandas as pd
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]}).set_index('key')
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]}).set_index('key')
left.join(right, lsuffix='_l', rsuffix='_r')
     val_l  val_r
key
foo      1      4
bar      2      5
The same functionality can be had by using merge on the columns:
left = pd.DataFrame({'key': ['foo', 'bar'], 'val': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'val': [4, 5]})
left.merge(right, on=('key'), suffixes=('_l', '_r'))
   key  val_l  val_r
0  foo      1      4
1  bar      2      5
pandas.merge() is the underlying function used for all merge/join behavior.
DataFrames provide the pandas.DataFrame.merge() and pandas.DataFrame.join() methods as a convenient way to access the capabilities of pandas.merge(). For example, df1.merge(right=df2, ...) is equivalent to pandas.merge(left=df1, right=df2, ...).
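As a quick check of that equivalence, here is a minimal sketch. The frames and the column names key, lval and rval are made up for illustration.

```python
import pandas as pd

# Illustrative data; the column names are assumptions for this sketch.
df1 = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
df2 = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# Method form: the calling DataFrame is the left table.
via_method = df1.merge(right=df2, on='key')

# Function form: both tables are passed explicitly.
via_function = pd.merge(left=df1, right=df2, on='key')

same_result = via_method.equals(via_function)
print(same_result)
```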
These are the main differences between df.join() and df.merge():
- Lookup on the right table: df1.join(df2) always joins via the index of df2, but df1.merge(df2) can join to one or more columns of df2 (the default) or to the index of df2 (with right_index=True).
- Lookup on the left table: by default, df1.join(df2) uses the index of df1 and df1.merge(df2) uses column(s) of df1. That can be overridden by specifying df1.join(df2, on=key_or_keys) or df1.merge(df2, left_index=True).
- Left vs. inner join: df1.join(df2) does a left join by default (keeps all rows of df1), but df.merge does an inner join by default (returns only the matching rows of df1 and df2).
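The differences listed above can be seen in a small sketch. The data is invented for illustration, with one left-hand row ('baz') that has no match on the right.

```python
import pandas as pd

# Illustrative data; 'baz' has no partner in df2.
df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz'], 'lval': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# merge: matches on the shared column 'key' and does an inner join
# by default, so the unmatched 'baz' row is dropped.
merged = df1.merge(df2)

# join: looks up df1['key'] in the index of the right table and does
# a left join by default, so 'baz' is kept with a missing rval.
joined = df1.join(df2.set_index('key'), on='key')

print(len(merged), len(joined))
```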
So, the generic approach is to use pandas.merge(df1, df2) or df1.merge(df2). But for a number of common situations (keeping all rows of df1 and joining to an index in df2), you can save some typing by using df1.join(df2) instead.
Some notes on these issues from the documentation at http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging:
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method, with the calling DataFrame being implicitly considered the left object in the join.
The related DataFrame.join method uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.
...
These two function calls are completely equivalent:
left.join(right, on=key_or_keys)
pd.merge(left, right, left_on=key_or_keys, right_index=True, how='left', sort=False)
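The quoted equivalence can be checked with a small sketch. The frames here are illustrative, with 'key' standing in for key_or_keys and one unmatched row ('qux') to exercise how='left'.

```python
import pandas as pd

# Illustrative data: the right table is indexed by the join key.
left = pd.DataFrame({'key': ['foo', 'bar', 'qux'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'rval': [4, 5]}, index=['foo', 'bar'])

# Column of the left table against the index of the right table.
via_join = left.join(right, on='key')

# The same operation spelled out with merge.
via_merge = pd.merge(left, right, left_on='key', right_index=True,
                     how='left', sort=False)

print(via_join.equals(via_merge))
```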
I believe that join() is just a convenience method. Try df1.merge(df2) instead, which allows you to specify left_on and right_on:
In [30]: left.merge(right, left_on="key1", right_on="key2")
Out[30]:
  key1  lval key2  rval
0  foo     1  foo     4
1  bar     2  bar     5
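If you do want join for the frames in the question, one possible approach (a sketch, not the only way) is to move key2 into the index of the right frame first, since join matches against the right-hand index:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key2': ['foo', 'bar'], 'rval': [4, 5]})

# join looks up left['key1'] in the index of the right table,
# so 'key2' has to become that index before joining.
result = left.join(right.set_index('key2'), on='key1')
print(result)
```

Unlike the merge version, key2 does not appear as a column in the result, because it was consumed as the index.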
pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:
merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False)
And:
DataFrame.join is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame. Here is a very basic example: The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus additional arguments instructing it to use the indexes:
result = pd.merge(left, right, left_index=True, right_index=True, how='outer')
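A small sketch of that index-on-index case, using made-up frames whose indexes only partly overlap. The comparison sorts both results first so it does not depend on row order.

```python
import pandas as pd

# Illustrative frames with partly overlapping indexes.
left = pd.DataFrame({'A': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'B': [4, 5, 6]}, index=['b', 'c', 'd'])

# Outer join on the indexes via join ...
via_join = left.join(right, how='outer')

# ... and the same thing via merge with explicit index arguments.
via_merge = pd.merge(left, right, left_index=True, right_index=True,
                     how='outer')

print(via_join.sort_index().equals(via_merge.sort_index()))
```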
One difference is that merge creates a new index, while join keeps the left-side index. This can have big consequences for your later transformations if you wrongly assume that merge leaves your index unchanged.
For example:
import pandas as pd
df1 = pd.DataFrame({'org_index': [101, 102, 103, 104],
'date': [201801, 201801, 201802, 201802],
'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df1
       date  org_index  val
101  201801        101    1
102  201801        102    2
103  201802        103    3
104  201802        104    4
-
df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']}).set_index('date')
df2
       dateval
date
201801       A
201802       B
-
df1.merge(df2, on='date')
     date  org_index  val dateval
0  201801        101    1       A
1  201801        102    2       A
2  201802        103    3       B
3  201802        104    4       B
-
df1.join(df2, on='date')
       date  org_index  val dateval
101  201801        101    1       A
102  201801        102    2       A
103  201802        103    3       B
104  201802        104    4       B
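If you need merge but also want to keep the original index, one possible workaround (a sketch, not the only option) is to expose the index as a column, merge, and then restore it. The column name orig_index below is a hypothetical name chosen for this sketch, and df2 keeps date as a regular column here.

```python
import pandas as pd

df1 = pd.DataFrame({'date': [201801, 201801, 201802, 201802],
                    'val': [1, 2, 3, 4]}, index=[101, 102, 103, 104])
df2 = pd.DataFrame({'date': [201801, 201802], 'dateval': ['A', 'B']})

# A column-on-column merge ignores the original index of df1.
merged = df1.merge(df2, on='date')

# Workaround: turn the index into a column ('orig_index' is a
# hypothetical name), merge, then set it back as the index.
kept = (df1.rename_axis('orig_index').reset_index()
           .merge(df2, on='date')
           .set_index('orig_index'))

print(list(merged.index), list(kept.index))
```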
- join: joins on the index by default. (If both frames share a column name, the default call raises an error because lsuffix and rsuffix are not defined.)
df_1.join(df_2)
- merge: joins on the columns with the same name by default. (If no column names are shared, the default call raises an error.)
df_1.merge(df_2)
- The on parameter has a different meaning in each case:
df_1.merge(df_2, on='column_1')
df_1.join(df_2, on='column_1')  # raises an error (df_2 is matched on its index, not on column_1)
df_1.join(df_2.set_index('column_1'), on='column_1')
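The two default-mode errors mentioned in the list can be reproduced with a short sketch. The frames and the helper raises_value_error are invented for illustration; the sketch relies on both failures surfacing as ValueError (pandas' MergeError is a subclass of it).

```python
import pandas as pd

# Illustrative frames: df_1 and df_2 share 'column_1', df_3 shares nothing.
df_1 = pd.DataFrame({'column_1': ['a', 'b'], 'x': [1, 2]})
df_2 = pd.DataFrame({'column_1': ['a', 'b'], 'y': [3, 4]})
df_3 = pd.DataFrame({'z': [5, 6]})


def raises_value_error(func):
    """Hypothetical helper: True if func() raises ValueError or a subclass."""
    try:
        func()
    except ValueError:
        return True
    return False


# join with a shared column name and no suffixes fails.
join_overlap_fails = raises_value_error(lambda: df_1.join(df_2))

# merge with no common column names fails.
merge_no_common_fails = raises_value_error(lambda: df_1.merge(df_3))

# Supplying suffixes lets the index-based join succeed.
joined = df_1.join(df_2, lsuffix='_l', rsuffix='_r')

print(join_overlap_fails, merge_no_common_fails, list(joined.columns))
```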
To put it in SQL terms: "Pandas merge is to outer/inner join as Pandas join is to natural join". So when you use merge in pandas, you specify which kind of SQL-style join you want, whereas when you use join, you need a matching column label to ensure that it joins.
Reference URL: https://stackoverflow.com/questions/22676081/what-is-the-difference-between-join-and-merge-in-pandas