A Deep Dive into the Pandas groupby Function

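Preface

I have been learning Pandas recently. Data processing often calls for grouping records by certain fields and analyzing each group, which is what the groupby function is for. This article is a detailed record of how it works.

Pandas version: 1.4.3

The groupby function first splits a DataFrame or Series by the fields of interest, placing rows that share the same value into one group. It then applies a transformation to each group, and finally returns the combined per-group results.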
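
The three steps described above (split, apply, combine) can be spelled out by hand. The following is only a minimal sketch on hypothetical toy data (the `toy` frame and its `key`/`value` columns are not part of this article's dataset), compared against the one-line groupby call.

```python
import pandas as pd

# Hypothetical toy data, separate from the article's main example.
toy = pd.DataFrame({
    'key': ['a', 'b', 'a', 'b'],
    'value': [1, 2, 3, 4],
})

# Split: collect the sub-frame that belongs to each distinct key.
pieces = {key: part for key, part in toy.groupby('key')}

# Apply: reduce each sub-frame to a single number.
sums = {key: part['value'].sum() for key, part in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name='value')

# The same three steps as a single groupby expression.
builtin = toy.groupby('key')['value'].sum()
```

Both `manual` and `builtin` hold one total per key, which is the sense in which groupby "returns the combined per-group results".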

I. Basic Usage

First, set up some sample data for the demonstrations.

import pandas as pd

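# name: banana, spinach, glutinous rice, brown rice, loofah, winter melon, citrus, apple, olive oil
# category: 水果 = fruit, 蔬菜 = vegetables, 米面 = rice and flour, 粮油 = grain and oil
df = pd.DataFrame({
    'name': ['香蕉', '菠菜', '糯米', '糙米', '丝瓜', '冬瓜', '柑橘', '苹果', '橄榄油'],
    'category': ['水果', '蔬菜', '米面', '米面', '蔬菜', '蔬菜', '水果', '水果', '粮油'],
    'price': [3.5, 6, 2.8, 9, 3, 2.5, 3.2, 8, 18],
    'count': [2, 1, 3, 6, 4, 8, 5, 3, 2]
})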

Group by category

grouped = df.groupby('category')
print(type(grouped))
print(grouped)

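The output shows that the type of grouped is DataFrameGroupBy (pandas.core.groupby.generic.DataFrameGroupBy). Printing the object directly gives only its default representation with a memory address, which is not very informative, so here is a helper function that displays the groups (why this works is explained later).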

def view_group(the_pd_group):
    for name, group in the_pd_group:
        print(f'group name: {name}')
        print('-' * 30)
        print(group)
        print('=' * 30, '\n')
view_group(grouped)

Output

group name: 水果
------------------------------
    name  category  price  count
0   香蕉       水果    3.5      2
6   柑橘       水果    3.2      5
7   苹果       水果    8.0      3
============================== 
group name: 米面
------------------------------
    name  category  price  count
2   糯米       米面    2.8      3
3   糙米       米面    9.0      6
============================== 
group name: 粮油
------------------------------
   name    category  price  count
8  橄榄油       粮油   18.0      2
============================== 
group name: 蔬菜
------------------------------
    name  category  price  count
1   菠菜       蔬菜    6.0      1
4   丝瓜       蔬菜    3.0      4
5   冬瓜       蔬菜    2.5      8
==============================

II. Parameters and Source Code

Next, look at the method definitions in the source code. The groupby of DataFrame:

def groupby(
        self,
        by=None,
        axis: Axis = 0,
        level: Level | None = None,
        as_index: bool = True,
        sort: bool = True,
        group_keys: bool = True,
        squeeze: bool | lib.NoDefault = no_default,
        observed: bool = False,
        dropna: bool = True,
    ) -> DataFrameGroupBy:
    pass

The groupby of Series:

def groupby(
        self,
        by=None,
        axis=0,
        level=None,
        as_index: bool = True,
        sort: bool = True,
        group_keys: bool = True,
        squeeze: bool | lib.NoDefault = no_default,
        observed: bool = False,
        dropna: bool = True,
    ) -> SeriesGroupBy:
    pass

The groupby of Series works much like that of DataFrame, so this article uses only DataFrame in its examples.

Parameters

by

Recall the call from the basic usage section:

grouped = df.groupby('category')

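The 'category' passed here is the first parameter, by, which specifies what to group on. According to the official documentation, by can be a mapping, a function, a label, or a list of labels. The call above uses a label, so it can also be written in the following ways.

List of labels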

grouped = df.groupby(['category'])
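
A list of labels becomes more interesting with more than one entry. The sketch below assumes a hypothetical toy frame with an extra `organic` column (not part of this article's data) and shows that each group name is then a tuple with one element per label.

```python
import pandas as pd

# Hypothetical toy data with two columns to group on.
toy = pd.DataFrame({
    'category': ['fruit', 'fruit', 'veg', 'veg'],
    'organic': [True, False, True, True],
    'price': [3.5, 8.0, 6.0, 3.0],
})

# Grouping by two labels: every distinct (category, organic) pair is a group.
grouped = toy.groupby(['category', 'organic'])

# Collect the group names in iteration order (sorted by default).
keys = [name for name, _ in grouped]

# Aggregating yields a result indexed by both labels.
totals = grouped['price'].sum()
```

Here `keys` contains three tuples, because the pair ('veg', False) never occurs in the data.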

mapping

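With a mapping, the keys are the DataFrame's index labels and the values are the group names. Here 水果 (fruit) and 蔬菜 (vegetables) go into the larger group 蔬菜水果 (produce), while 米面 (rice and flour) and 粮油 (grain and oil) go into the larger group 米面粮油 (staples).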

category_dict = {'水果': '蔬菜水果', '蔬菜': '蔬菜水果', '米面': '米面粮油', '粮油': '米面粮油'}
# Build {index label: larger group} for every row
the_map = {idx: category_dict[cat] for idx, cat in df['category'].items()}
grouped = df.groupby(the_map)
view_group(grouped)

The output is as follows

group name: 米面粮油
------------------------------
    name  category  price  count
2   糯米       米面    2.8      3
3   糙米       米面    9.0      6
8  橄榄油      粮油   18.0      2
============================== 

group name: 蔬菜水果
------------------------------
    name  category  price  count
0   香蕉       水果    3.5      2
1   菠菜       蔬菜    6.0      1
4   丝瓜       蔬菜    3.0      4
5   冬瓜       蔬菜    2.5      8
6   柑橘       水果    3.2      5
7   苹果       水果    8.0      3
==============================

function

With a function, the argument passed to it is likewise each label of the DataFrame's index. The output is the same as in the mapping example.

category_dict = {'水果': '蔬菜水果', '蔬菜': '蔬菜水果', '米面': '米面粮油', '粮油': '米面粮油'}

def to_big_category(the_idx):
    # groupby passes index labels (not positions), so look up with .loc
    return category_dict[df.loc[the_idx, 'category']]
grouped = df.groupby(to_big_category)
view_group(grouped)
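
Besides a mapping or a function, by also accepts a Series aligned with the DataFrame's index, which avoids building the index-to-group dictionary by hand. This is a minimal sketch using a shortened, four-row version of the sample data; the variable names `big_category` and `totals` are chosen here for illustration.

```python
import pandas as pd

# A shortened version of the article's sample data.
df = pd.DataFrame({
    'name': ['香蕉', '菠菜', '糯米', '橄榄油'],
    'category': ['水果', '蔬菜', '米面', '粮油'],
    'price': [3.5, 6, 2.8, 18],
})
category_dict = {'水果': '蔬菜水果', '蔬菜': '蔬菜水果',
                 '米面': '米面粮油', '粮油': '米面粮油'}

# Map every row's category to its larger group. The result is a Series
# that shares df's index, so groupby can use its values as group keys.
big_category = df['category'].map(category_dict)

# Group by the derived Series and total the prices per larger group.
totals = df.groupby(big_category)['price'].sum()
```

The grouping is the same as in the mapping and function examples, only expressed as a vectorized lookup.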

axis

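axis specifies which axis to split along when grouping.

0 - equivalent to 'index'; split by rows (the default)

1 - equivalent to 'columns'; split by columns

Here is an example of splitting by columns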

def group_columns(column_name: str):
    if column_name in ['name', 'category']:
        return 'Group 1'
    else:
        return 'Group 2'
# Equivalent: grouped = df.head(3).groupby(group_columns, axis='columns')
grouped = df.head(3).groupby(group_columns, axis=1)
view_group(grouped)

The output is as follows

group name: Group 1
------------------------------
    name  category
0   香蕉       水果
1   菠菜       蔬菜
2   糯米       米面
============================== 

group name: Group 2
------------------------------
   price  count
0    3.5      2
1    6.0      1
2    2.8      3
==============================

This is like cutting the table vertically: the left half becomes Group 1 and the right half becomes Group 2.
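
One caveat for readers on newer releases: pandas 2.1 and later deprecate axis=1 in groupby and suggest grouping the transposed frame instead. The sketch below shows that alternative on hypothetical toy data; it is an illustration under that assumption, not the approach used in this article's 1.4.3 environment.

```python
import pandas as pd

# Hypothetical toy data with the same four column names as the article.
toy = pd.DataFrame({
    'name': ['a', 'b'],
    'category': ['x', 'y'],
    'price': [1.0, 2.0],
    'count': [1, 2],
})

def group_columns(column_name: str) -> str:
    # Assign each column name to one of two column groups.
    return 'Group 1' if column_name in ['name', 'category'] else 'Group 2'

# Transpose so the columns become the index, group along the rows,
# then transpose each piece back to its original orientation.
pieces = {key: part.T for key, part in toy.T.groupby(group_columns)}
```

Each entry of `pieces` is a vertical slice of the table. Note that transposing a frame with mixed column types converts the values to object dtype.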

level

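When the axis is a MultiIndex (hierarchical index), level selects which level or levels to group by. A level can be given as an int position, counting from 0 (so 0 is the first level), or as a level name, and a list of either is also accepted.

Build another set of test data with a MultiIndex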

the_arrays = [['A', 'A', 'A', 'B', 'A', 'A', 'A', 'B', 'A', 'A'],
              ['蔬菜水果', '蔬菜水果', '米面粮油', '休闲食品', '米面粮油', '蔬菜水果', '蔬菜水果', '休闲食品', '蔬菜水果', '米面粮油'],
              ['水果', '蔬菜', '米面', '糖果', '米面', '蔬菜', '蔬菜', '饼干', '水果', '粮油']]
the_index = pd.MultiIndex.from_arrays(arrays=the_arrays, names=['one ', 'two', 'three'])
df_2 = pd.DataFrame(data=[3.5, 6, 2.8, 4, 9, 3, 2.5, 3.2, 8, 18], index=the_index, columns=['price'])
print(df_2)

The output is as follows

                      price
one  two    three       
A    蔬菜水果 水果       3.5
             蔬菜       6.0
     米面粮油 米面       2.8
B    休闲食品 糖果       4.0
A    米面粮油 米面       9.0
     蔬菜水果 蔬菜       3.0
             蔬菜       2.5
B    休闲食品 饼干       3.2
A    蔬菜水果 水果       8.0
     米面粮油 粮油      18.0

1. Group by the third level

grouped = df_2.groupby(level=2)
view_group(grouped)

The output is as follows

group name: 水果
------------------------------
                      price
one  two    three       
A    蔬菜水果 水果       3.5
             水果       8.0
============================== 

group name: 米面
------------------------------
                     price
one  two    three       
A    米面粮油 米面       2.8
             米面       9.0
============================== 

group name: 粮油
------------------------------
                      price
one  two    three       
A    米面粮油 粮油      18.0
============================== 

group name: 糖果
------------------------------
                      price
one  two    three       
B    休闲食品 糖果       4.0
============================== 

group name: 蔬菜
------------------------------
                     price
one  two    three       
A    蔬菜水果 蔬菜       6.0
             蔬菜       3.0
             蔬菜       2.5
============================== 

group name: 饼干
------------------------------
                      price
one  two    three       
B    休闲食品 饼干       3.2
==============================

Six groups in total.

2. Group by the first and second levels

grouped = df_2.groupby(level=[0, 1])
view_group(grouped)

The output is as follows

group name: ('A', '米面粮油')
------------------------------
                      price
one  two    three       
A    米面粮油 米面       2.8
             米面       9.0
             粮油      18.0
============================== 

group name: ('A', '蔬菜水果')
------------------------------
                      price
one  two    three       
A    蔬菜水果 水果       3.5
             蔬菜       6.0
             蔬菜       3.0
             蔬菜       2.5
             水果       8.0
============================== 

group name: ('B', '休闲食品')
------------------------------
                      price
one  two    three       
B    休闲食品 糖果       4.0
             饼干       3.2
==============================

Three groups in total. Note that each group name is now a tuple.
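
A level can also be referred to by its name rather than its position. The following minimal sketch uses a hypothetical two-level index (the names `outer` and `inner` are made up for this example) to show that both spellings select the same level.

```python
import pandas as pd

# Hypothetical two-level index, smaller than the article's df_2.
index = pd.MultiIndex.from_arrays(
    [['A', 'A', 'B', 'B'], ['x', 'y', 'x', 'y']],
    names=['outer', 'inner'],
)
toy = pd.DataFrame({'price': [1.0, 2.0, 3.0, 4.0]}, index=index)

# Select the second level by position ...
by_position = toy.groupby(level=1)['price'].sum()

# ... or by name; both refer to the same level.
by_name = toy.groupby(level='inner')['price'].sum()
```

Level names tend to be more robust than positions when the index layout may change later.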

as_index

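bool, default True. For aggregated output, return an object with the group names as the index.

grouped = df.groupby('category', as_index=True)
# numeric_only=True keeps only price and count; on newer pandas versions
# a plain sum() would also concatenate the strings in the name column
print(grouped.sum(numeric_only=True))

Output with as_index=True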

            price  count
category              
水果         14.7     10
米面         11.8      9
粮油         18.0      2
蔬菜         11.5     13

grouped = df.groupby('category', as_index=False)
print(grouped.sum(numeric_only=True))

Output with as_index=False, which resembles the result of a SQL GROUP BY

    category  price  count
0       水果   14.7     10
1       米面   11.8      9
2       粮油   18.0      2
3       蔬菜   11.5     13
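
Another way to read as_index=False is as a shortcut for aggregating with the default setting and then calling reset_index(). The sketch below checks this on hypothetical toy data; it is an illustration rather than a statement about how pandas implements the option internally.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'category': ['fruit', 'fruit', 'veg'],
    'price': [3.5, 8.0, 6.0],
})

# Group keys stay as an ordinary column.
flat = toy.groupby('category', as_index=False)['price'].sum()

# Group keys become the index first, then are moved back into a column.
also_flat = toy.groupby('category')['price'].sum().reset_index()
```

Both results have the columns category and price and a plain 0-based index.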

sort

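bool, default True. Whether to sort the group names (keys). Turning off automatic sorting can improve performance. Note: sorting the group names does not affect the order of rows within each group.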
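
The effect of sort is easiest to see on keys that first appear out of alphabetical order. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data whose keys first appear in the order b, a.
toy = pd.DataFrame({'key': ['b', 'a', 'b', 'a'], 'value': [1, 2, 3, 4]})

# sort=True (the default): group keys come out sorted.
sorted_keys = list(toy.groupby('key', sort=True)['value'].sum().index)

# sort=False: group keys keep the order of first appearance.
unsorted_keys = list(toy.groupby('key', sort=False)['value'].sum().index)

# Either way, rows inside a group keep their original relative order.
rows_in_b = list(toy.groupby('key', sort=True).get_group('b')['value'])
```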

group_keys

bool, default True

If True, the group keys are added to the index of the result when apply is called.
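
A small sketch of what this means in practice, on hypothetical toy data with a made-up helper `top_row`. The applied function returns only part of each group, which is the case where the difference is visible.

```python
import pandas as pd

# Hypothetical toy data; every group has two rows.
toy = pd.DataFrame({
    'category': ['fruit', 'fruit', 'veg', 'veg'],
    'name': ['banana', 'apple', 'spinach', 'gourd'],
    'price': [3.5, 8.0, 6.0, 3.0],
})

def top_row(group: pd.DataFrame) -> pd.DataFrame:
    # Keep only the most expensive row of the group.
    return group.nlargest(1, 'price')

# group_keys=True: the category is prepended to the result's index.
with_keys = (toy.groupby('category', group_keys=True)[['name', 'price']]
                .apply(top_row))

# group_keys=False: only the original row labels remain.
without_keys = (toy.groupby('category', group_keys=False)[['name', 'price']]
                   .apply(top_row))
```

With keys, the result index is a two-level MultiIndex of (category, original row label); without keys, it is just the original row labels.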

squeeze

Deprecated since version 1.1.0, so it is not covered here.

observed

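bool, default False

Only applies when any of the groupers are Categoricals.

If True, only the observed values of categorical groupers are shown; if False, all values of categorical groupers are shown, including categories that never occur in the data.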
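
The following sketch illustrates observed on a hypothetical categorical column `shirt` that declares three categories but only uses two of them.

```python
import pandas as pd

# Hypothetical toy data: category 'L' is declared but never occurs.
toy = pd.DataFrame({
    'shirt': pd.Categorical(['S', 'M', 'S'], categories=['S', 'M', 'L']),
    'qty': [1, 2, 3],
})

# observed=False: every declared category gets a group, even empty ones.
all_levels = toy.groupby('shirt', observed=False)['qty'].sum()

# observed=True: only categories that actually occur get a group.
seen_only = toy.groupby('shirt', observed=True)['qty'].sum()
```

The empty 'L' group sums to 0 in `all_levels` and is absent from `seen_only`. Passing observed explicitly also keeps the behavior stable across pandas versions.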

dropna

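bool, default True. Added in version 1.1.0.

If True and the group keys contain NA values, the NA values are dropped together with their rows (axis=0) or columns (axis=1).

If False, NA values are also treated as group keys and form a group of their own.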
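
A minimal sketch of both settings, using hypothetical toy data in which half of the keys are missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with NaN among the group keys.
toy = pd.DataFrame({
    'key': ['a', np.nan, 'a', np.nan],
    'value': [1, 2, 3, 4],
})

# dropna=True (the default): rows whose key is NaN are left out.
dropped = toy.groupby('key', dropna=True)['value'].sum()

# dropna=False: NaN becomes a group key of its own.
kept = toy.groupby('key', dropna=False)['value'].sum()

# Total for the NaN group, selected via the index's isna() mask.
nan_total = kept[kept.index.isna()].iloc[0]
```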

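Return Value

The groupby function of DataFrame returns a DataFrameGroupBy, while the groupby function of Series returns a SeriesGroupBy.

The source code shows that both derive from BaseGroupBy: DataFrameGroupBy and SeriesGroupBy each subclass GroupBy, which in turn subclasses BaseGroupBy.

BaseGroupBy declares a grouper attribute of type ops.BaseGrouper but has no __init__ method. Moving on to the GroupBy class, it redeclares the parent's grouper attribute and, in its __init__ method, calls get_grouper from grouper.py. Below is pseudocode extracted from the source.

groupby.py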

class GroupBy(BaseGroupBy[NDFrameT]):
	grouper: ops.BaseGrouper
	
	def __init__(self, ...):
		# ...
		if grouper is None:
			from pandas.core.groupby.grouper import get_grouper
			grouper, exclusions, obj = get_grouper(...)

grouper.py

def get_grouper(...) -> tuple[ops.BaseGrouper, frozenset[Hashable], NDFrameT]:
    # ...
    # create the internals grouper
    grouper = ops.BaseGrouper(
        group_axis, groupings, sort=sort, mutated=mutated, dropna=dropna
    )
    return grouper, frozenset(exclusions), obj

class Grouping:
	"""
	obj : DataFrame or Series
	"""
	def __init__(
        self,
        index: Index,
        grouper=None,
        obj: NDFrame | None = None,
        level=None,
        sort: bool = True,
        observed: bool = False,
        in_axis: bool = False,
        dropna: bool = True,
    ):
    	pass

ops.py

class BaseGrouper:
    """
    This is an internal Grouper class, which actually holds
    the generated groups
    
    ......
    """
    def __init__(self, axis: Index, groupings: Sequence[grouper.Grouping], ...):
    	# ...
    	self._groupings: list[grouper.Grouping] = list(groupings)
    
    @property
    def groupings(self) -> list[grouper.Grouping]:
        return self._groupings

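BaseGrouper holds the information that ultimately defines the groups. Its groupings attribute is a list whose elements are of type grouper.Grouping, with one Grouping per grouping key (for example, one per label passed to by), and the obj attribute of a Grouping refers to the DataFrame or Series being grouped. The per-group sub-DataFrames or sub-Series are produced from this information when the groups are iterated or aggregated.

The first part used a helper function to display the object returned by groupby. Now for why that works: an iterable object implements the __iter__() method, so start by locating that method in BaseGroupBy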

class BaseGroupBy:
	grouper: ops.BaseGrouper
	
	@final
    def __iter__(self) -> Iterator[tuple[Hashable, NDFrameT]]:
        return self.grouper.get_iterator(self._selected_obj, axis=self.axis)

Next, step into the BaseGrouper class

class BaseGrouper:
    def get_iterator(
        self, data: NDFrameT, axis: int = 0
    ) -> Iterator[tuple[Hashable, NDFrameT]]:
        splitter = self._get_splitter(data, axis=axis)
        keys = self.group_keys_seq
        for key, group in zip(keys, splitter):
            yield key, group.__finalize__(data, method="groupby")

Stepping into group.__finalize__() in debug mode confirms that what comes back is indeed a DataFrame object.
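
The same conclusion can be checked without a debugger through the public attributes of the groupby object. This is a minimal sketch on hypothetical toy data; it relies only on ngroups, groups, get_group and plain iteration.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({'key': ['a', 'b', 'a'], 'value': [1, 2, 3]})
grouped = toy.groupby('key')

# Number of groups.
n_groups = grouped.ngroups

# Mapping from each group key to the row labels it contains.
row_labels = {key: list(labels) for key, labels in grouped.groups.items()}

# Fetch a single group as an ordinary DataFrame.
group_a = grouped.get_group('a')

# Iterating yields (key, sub-frame) pairs, as view_group relies on.
kinds = {key: type(part).__name__ for key, part in grouped}
```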

III. The Four Core Functions

With this foundation, the functions applied after groupby become much easier to understand.

agg

Aggregation is the most common operation after groupby and is widely used in data analysis.

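For example, to get the maximum values within each category group, there are three equivalent spellings: calling the method directly, as in grouped.max(), or passing the function name to aggregate or agg, as in grouped.aggregate('max') and grouped.agg('max'). grouped.aggregate and grouped.agg are completely equivalent, because the SelectionMixin class contains the definition agg = aggregate.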
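
A short sketch of the three spellings on hypothetical toy data, checking that they return identical results:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'category': ['fruit', 'fruit', 'veg'],
    'price': [3.5, 8.0, 6.0],
    'count': [2, 3, 1],
})
grouped = toy.groupby('category')

# 1. Call the aggregation method directly.
direct = grouped.max()

# 2. Pass the function name to aggregate.
via_aggregate = grouped.aggregate('max')

# 3. Pass the function name to agg, the alias of aggregate.
via_agg = grouped.agg('max')
```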

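When different columns need different aggregations, however, only aggregate or agg will do. For example, to get the maximum price and the minimum count within each category group, pass a dict such as {'price': 'max', 'count': 'min'}.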
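
A minimal sketch of per-column aggregation on hypothetical toy data. The second call uses named aggregation, an alternative spelling that also lets the output columns be renamed; the names `max_price` and `min_count` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'category': ['fruit', 'fruit', 'veg', 'veg'],
    'price': [3.5, 8.0, 6.0, 3.0],
    'count': [2, 3, 1, 4],
})
grouped = toy.groupby('category')

# Dict form: column name -> aggregation to apply to that column.
by_dict = grouped.agg({'price': 'max', 'count': 'min'})

# Named aggregation: output column = (input column, aggregation).
named = grouped.agg(max_price=('price', 'max'), min_count=('count', 'min'))
```

The dict form keeps the original column names, while named aggregation makes explicit which statistic each output column holds.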

NumPy's aggregation functions can be used as well

import numpy as np
grouped.agg({'price': np.max, 'count': np.min})

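Common aggregation functions are listed below

Function	Description
max	Maximum
mean	Mean
median	Median
min	Minimum
sum	Sum
std	Standard deviation
var	Variance
count	Count of non-NA values

The closest NumPy counterpart of count is np.size, but the two differ when data is missing: count skips NA values, whereas np.size counts every element in the group.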
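
The difference only shows up when a group contains missing values. The sketch below uses hypothetical toy data and the groupby methods count() and size(), where size() counts every row of a group in the same way np.size counts every element.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group 'a'.
toy = pd.DataFrame({
    'key': ['a', 'a', 'b'],
    'value': [1.0, np.nan, 3.0],
})
grouped = toy.groupby('key')['value']

# count() skips NaN, so group 'a' has only one counted value.
non_null = grouped.count()

# size() counts every row in the group, NaN included.
total = grouped.size()
```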

transform

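Suppose a new column price_mean is needed, showing the average price of each row's group.

The transform function does exactly this. It computes a value per group and returns an object that has the same index as the original, filled with the transformed values, so the result can be assigned straight back to the original DataFrame as a new column.

The example code is as follows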

grouped = df.groupby('category', sort=False)
df['price_mean'] = grouped['price'].transform('mean')
print(df)

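In the output, df has a new price_mean column in which every row carries the mean price of its category: 4.9 for 水果, about 3.83 for 蔬菜, 5.9 for 米面, and 18.0 for 粮油.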
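
Because the transformed values line up with the original rows, they can be combined with existing columns directly. A minimal sketch on hypothetical toy data (the column name `diff_from_mean` is made up for this example) computes how far each price is from its group's mean:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'category': ['fruit', 'fruit', 'veg'],
    'price': [3.0, 5.0, 6.0],
})

# One value per original row: the mean price of that row's group.
group_mean = toy.groupby('category')['price'].transform('mean')

# Same index as toy, so element-wise arithmetic lines up row by row.
toy['diff_from_mean'] = toy['price'] - group_mean
```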

apply

Suppose the goal is to get the highest-priced row in each group. apply can do this: it accepts any custom function, which makes complex data operations possible.

from pandas import DataFrame
grouped = df.groupby('category', as_index=False, sort=False)

def get_max_one(the_df: DataFrame):
    sort_df = the_df.sort_values(by='price', ascending=True)
    return sort_df.iloc[-1, :]
max_price_df = grouped.apply(get_max_one)
max_price_df

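The result has one row per category, the most expensive item in each: 苹果 (8.0) for 水果, 菠菜 (6.0) for 蔬菜, 糙米 (9.0) for 米面, and 橄榄油 (18.0) for 粮油.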
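
For this particular task there is also a way that avoids a custom function: find the row label of the maximum price in each group with idxmax, then select those rows. This is an alternative sketch on hypothetical toy data, not the method used above.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'name': ['banana', 'apple', 'spinach', 'gourd'],
    'category': ['fruit', 'fruit', 'veg', 'veg'],
    'price': [3.5, 8.0, 6.0, 3.0],
})

# Row label of the highest price within each category.
top_labels = toy.groupby('category')['price'].idxmax()

# Select those rows from the original frame.
top_rows = toy.loc[top_labels]
```

If several rows tie for the maximum, idxmax returns the first one, so only one row per group is selected.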

filter

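The filter function further screens grouped data. It evaluates the given function once per group, drops every group for which the function returns False, and returns a new DataFrame containing the rows of the remaining groups.

Suppose groups whose average price is below 4 should be excluded. The 蔬菜 (vegetables) category, whose mean price is about 3.83, is filtered out.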

grouped = df.groupby('category', as_index=False, sort=False)
filtered = grouped.filter(lambda sub_df: sub_df['price'].mean() > 4)
print(filtered)

The output keeps the six rows that belong to 水果, 米面, and 粮油; the three 蔬菜 rows are gone.
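
The same selection can be written as a boolean mask built with transform, which makes it clear that filter keeps or drops whole groups. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: fruit averages 5.75, veg averages about 3.83.
toy = pd.DataFrame({
    'category': ['fruit', 'fruit', 'veg', 'veg', 'veg'],
    'price': [3.5, 8.0, 6.0, 3.0, 2.5],
})

# filter: keep every row of each group whose mean price exceeds 4.
kept_by_filter = toy.groupby('category').filter(
    lambda sub_df: sub_df['price'].mean() > 4
)

# Equivalent mask: broadcast each group's mean back to its rows.
mask = toy.groupby('category')['price'].transform('mean') > 4
kept_by_mask = toy[mask]
```

The veg row priced at 6.0 is dropped along with the rest of its group, because the condition is evaluated per group and not per row.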

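IV. Summary

groupby splits the original DataFrame/Series by the grouping fields into several sub-DataFrames/Series, one per group. All subsequent operations (agg, apply, and so on) are therefore operations on these sub-DataFrames/Series. Understanding this is the key to understanding how groupby works in Pandas.

V. References

The official Pandas documentation for pandas.DataFrame.groupby

The official Pandas documentation for pandas.Series.groupby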

Page updated: 2024-04-25