Apply several summary functions to several variables by group in one call


lottogame 2020. 9. 20. 10:28


I have the following data frame:

x <- read.table(text = "  id1 id2 val1 val2
1   a   x    1    9
2   a   x    2    4
3   a   y    3    5
4   a   y    4    9
5   b   x    1    7
6   b   y    4    4
7   b   x    3    9
8   b   y    2    8", header = TRUE)

I would like to calculate the mean of val1 and val2 grouped by id1 and id2, and at the same time count the number of rows for each id1-id2 combination. I can perform each calculation separately:

# calculate mean
aggregate(. ~ id1 + id2, data = x, FUN = mean)

# count rows
aggregate(. ~ id1 + id2, data = x, FUN = length)

To do both calculations in one call, I tried:

do.call("rbind", aggregate(. ~ id1 + id2, data = x, FUN = function(x) data.frame(m = mean(x), n = length(x))))

However, I get garbled output along with a warning:

#     m   n
# id1 1   2
# id2 1   1
#     1.5 2
#     2   2
#     3.5 2
#     3   2
#     6.5 2
#     8   2
#     7   2
#     6   2
# Warning message:
#   In rbind(id1 = c(1L, 2L, 1L, 2L), id2 = c(1L, 1L, 2L, 2L), val1 = list( :
#   number of columns of result is not a multiple of vector length (arg 1)

I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.

How can I use aggregate or other functions to perform several calculations in one call?

You can do it all in one step and get proper labeling:

> aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )
#   id1 id2 val1.mn val1.n val2.mn val2.n
# 1   a   x     1.5    2.0     6.5    2.0
# 2   b   x     2.0    2.0     8.0    2.0
# 3   a   y     3.5    2.0     7.0    2.0
# 4   b   y     3.0    2.0     6.0    2.0

This produces a data frame with two id columns and two matrix columns:

str( aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) )
'data.frame':   4 obs. of  4 variables:
 $ id1 : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2 : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1: num [1:4, 1:2] 1.5 2 3.5 3 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"
 $ val2: num [1:4, 1:2] 6.5 8 7 6 2 2 2 2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr  "mn" "n"

As pointed out by @lord.garbage below, this can be converted to a dataframe with "simple" columns by using do.call(data.frame, ...)

str( do.call(data.frame, aggregate(. ~ id1+id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) ) ) 
    )
'data.frame':   4 obs. of  6 variables:
 $ id1    : Factor w/ 2 levels "a","b": 1 2 1 2
 $ id2    : Factor w/ 2 levels "x","y": 1 1 2 2
 $ val1.mn: num  1.5 2 3.5 3
 $ val1.n : num  2 2 2 2
 $ val2.mn: num  6.5 8 7 6
 $ val2.n : num  2 2 2 2
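The difference between the two shapes matters when you access the results. Below is a minimal sketch in base R, assuming the same example data (rebuilt here with data.frame() so the block is self-contained; the names res and flat are only illustrative). In the raw aggregate() result, the labels val1.mn and so on are only print labels for the matrix columns; after flattening they become real column names.

```r
# Example data from the question, rebuilt so this block runs on its own
x <- data.frame(
  id1  = rep(c("a", "b"), each = 4),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# Raw result: val1 and val2 are 4 x 2 matrices stored inside the data frame
res <- aggregate(. ~ id1 + id2, data = x,
                 FUN = function(v) c(mn = mean(v), n = length(v)))

ncol(res)            # 4 columns: id1, id2, val1, val2
res$val1[, "mn"]     # index the matrix column to get the group means
is.null(res$val1.mn) # TRUE: "val1.mn" is not a real column here

# Flattened result: every statistic is an ordinary column
flat <- do.call(data.frame, res)
names(flat)
flat$val1.mn
```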

This is the syntax for multiple variables on the LHS:

aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = function(x) c(mn = mean(x), n = length(x) ) )

Given this in the question :

I could use the plyr package, but my data set is quite large and plyr is very slow (almost unusable) when the size of the dataset grows.

Then in data.table (1.9.4+) you could try :

> DT
   id1 id2 val1 val2
1:   a   x    1    9
2:   a   x    2    4
3:   a   y    3    5
4:   a   y    4    9
5:   b   x    1    7
6:   b   y    4    4
7:   b   x    3    9
8:   b   y    2    8

> DT[ , .(mean(val1), mean(val2), .N), by = .(id1, id2)]   # simplest
   id1 id2  V1  V2 N
1:   a   x 1.5 6.5 2
2:   a   y 3.5 7.0 2
3:   b   x 2.0 8.0 2
4:   b   y 3.0 6.0 2

> DT[ , .(val1.m = mean(val1), val2.m = mean(val2), count = .N), by = .(id1, id2)]  # named
   id1 id2 val1.m val2.m count
1:   a   x    1.5    6.5     2
2:   a   y    3.5    7.0     2
3:   b   x    2.0    8.0     2
4:   b   y    3.0    6.0     2

> DT[ , c(lapply(.SD, mean), count = .N), by = .(id1, id2)]   # mean over all columns
   id1 id2 val1 val2 count
1:   a   x  1.5  6.5     2
2:   a   y  3.5  7.0     2
3:   b   x  2.0  8.0     2
4:   b   y  3.0  6.0     2
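The calls above assume that DT already exists as a data.table. Below is a minimal sketch of the setup, assuming DT is simply the question's data frame converted with as.data.table(). The .SDcols argument is an addition here, not part of the calls above: it restricts the mean to the named columns, which helps when the table has other columns that should not be averaged.

```r
library(data.table)

# Example data from the question, rebuilt so this block runs on its own
x <- data.frame(
  id1  = rep(c("a", "b"), each = 4),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# Assumption: DT is the data frame converted to a data.table
DT <- as.data.table(x)

# Mean of the columns listed in .SDcols plus the group size, in one grouped call
res <- DT[, c(lapply(.SD, mean), count = .N),
          by = .(id1, id2),
          .SDcols = c("val1", "val2")]
res
```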

For timings comparing aggregate (used in question and all 3 other answers) to data.table see this benchmark (the agg and agg.x cases).

You could add a count column, aggregate with sum, then scale back to get the mean:

x$count <- 1
agg <- aggregate(. ~ id1 + id2, data = x,FUN = sum)
agg
#   id1 id2 val1 val2 count
# 1   a   x    3   13     2
# 2   b   x    4   16     2
# 3   a   y    7   14     2
# 4   b   y    6   12     2

agg[c("val1", "val2")] <- agg[c("val1", "val2")] / agg$count
agg
#   id1 id2 val1 val2 count
# 1   a   x  1.5  6.5     2
# 2   b   x  2.0  8.0     2
# 3   a   y  3.5  7.0     2
# 4   b   y  3.0  6.0     2

It has the advantage of preserving your column names and creating a single count column.

Perhaps you want to merge?

x.mean <- aggregate(. ~ id1+id2, x, mean)
x.len  <- aggregate(. ~ id1+id2, x, length)

merge(x.mean, x.len, by = c("id1", "id2"))

  id1 id2 val1.x val2.x val1.y val2.y
1   a   x    1.5    6.5      2      2
2   a   y    3.5    7.0      2      2
3   b   x    2.0    8.0      2      2
4   b   y    3.0    6.0      2      2
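The default .x and .y suffixes do not say which statistic a column holds. As a small sketch in base R, the suffixes argument of merge() can label the columns instead (the suffix strings ".mean" and ".n" are chosen here purely for illustration):

```r
# Example data from the question, rebuilt so this block runs on its own
x <- data.frame(
  id1  = rep(c("a", "b"), each = 4),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

x.mean <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = mean)
x.len  <- aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = length)

# suffixes replaces the default ".x" / ".y" on the clashing column names
res <- merge(x.mean, x.len, by = c("id1", "id2"), suffixes = c(".mean", ".n"))
res
```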

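Using the dplyr package, you can achieve this with summarise_all(), which applies the given functions (here mean and length, the latter returning the row count) to each of the non-grouping columns. The functions are passed as a named list because funs() has been deprecated since dplyr 0.8.0:

library(dplyr)

x %>%
  group_by(id1, id2) %>%
  summarise_all(list(mean = mean, n = length))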

which gives:

     id1    id2 val1_mean val2_mean val1_n val2_n
1      a      x       1.5       6.5      2      2
2      a      y       3.5       7.0      2      2
3      b      x       2.0       8.0      2      2
4      b      y       3.0       6.0      2      2

If you don't want to apply the functions to all non-grouping columns, use summarise_at() and either name the columns to include or exclude the unwanted ones with a minus sign:

# inclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(val1, val2), list(mean = mean, n = length))

# exclusion
x %>%
  group_by(id1, id2) %>%
  summarise_at(vars(-val2), list(mean = mean, n = length))
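In dplyr 1.0.0 and later, across() supersedes the scoped verbs such as summarise_all() and summarise_at(). A minimal sketch of the same summary with across(), assuming dplyr >= 1.0.0 is installed; note that the columns come out grouped by variable (val1_mean, val1_n, val2_mean, val2_n) rather than by function:

```r
library(dplyr)

# Example data from the question, rebuilt so this block runs on its own
x <- data.frame(
  id1  = rep(c("a", "b"), each = 4),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1, 2, 3, 4, 1, 4, 3, 2),
  val2 = c(9, 4, 5, 9, 7, 4, 9, 8)
)

# A named list of functions gives columns named <column>_<function name>
res <- x %>%
  group_by(id1, id2) %>%
  summarise(across(c(val1, val2), list(mean = mean, n = length)),
            .groups = "drop")
res
```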

You can also use plyr::each() to combine multiple functions into one:

aggregate(cbind(val1, val2) ~ id1 + id2, data = x, FUN = plyr::each(avg = mean, n = length))

Reference URL: https://stackoverflow.com/questions/12064202/apply-several-summary-functions-on-several-variables-by-group-in-one-call
