Introduction
Pandas is a widely used data manipulation library in Python that offers extensive capabilities for handling various types of data. One of its notable features is the ability to work with MultiIndexes, also known as hierarchical indexes. In this blog post, we will delve into the concept of MultiIndexes and explore how they can be leveraged to tackle complex, multidimensional datasets.
Understanding MultiIndexes: Analyzing Sports Performance Data
A MultiIndex is a Pandas data structure that allows indexing and accessing data across multiple dimensions or levels. It enables the creation of hierarchical structures for rows and columns, providing a flexible way to organize and analyze data. To illustrate this, let's consider a scenario where you are a personal trainer or coach monitoring the health parameters of your athletes during their sports activities. You want to track various parameters such as heart rate, running pace, and cadence over a specific time interval.
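To make this concrete before diving into the athlete scenario, here is a minimal sketch (labels and values are made up for illustration) of building a MultiIndex directly from tuples:

```python
import pandas as pd

# Two-level index (athlete, parameter) built from tuples; labels are made up
index = pd.MultiIndex.from_tuples(
    [("Bob", "heart_rate"), ("Bob", "cadence"),
     ("Alice", "heart_rate"), ("Alice", "cadence")],
    names=["athlete", "parameter"],
)
s = pd.Series([120.0, 175.0, 110.0, 165.0], index=index)

# Selecting an outer-level label drills down to the inner level
bob = s["Bob"]
```

Selecting `s["Bob"]` returns a Series indexed only by `parameter`, which is exactly the drill-down behavior the rest of this post builds on.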
Synthesizing Health Performance Data
To work with this type of data, let's begin by writing Python code that simulates health performance data, specifically heart rates and running cadences:
```python
from __future__ import annotations

from datetime import datetime, timedelta

import numpy as np
import pandas as pd

start = datetime(2023, 6, 8, 14)
end = start + timedelta(hours=1, minutes=40)
timestamps = pd.date_range(start, end, freq=timedelta(minutes=1), inclusive='left')


def get_heart_rate(begin_hr: int, end_hr: int, break_point: int) -> pd.Series[float]:
    noise = np.random.normal(loc=0.0, scale=3, size=100)
    heart_rate = np.concatenate((np.linspace(begin_hr, end_hr, num=break_point),
                                 [end_hr] * (100 - break_point))) + noise
    return pd.Series(data=heart_rate, index=timestamps)


def get_cadence(mean_cadence: int) -> pd.Series[float]:
    noise = np.random.normal(loc=0.0, scale=1, size=100)
    cadence = pd.Series(data=[mean_cadence] * 100 + noise, index=timestamps)
    cadence[::3] = np.nan
    cadence[1::3] = np.nan
    return cadence.ffill().fillna(mean_cadence)
```
The code snippet provided showcases the generation of synthetic data for heart rate and cadence during a sports activity. It begins by importing the necessary modules such as datetime, numpy, and pandas.
The duration of the sports activity is defined as 100 minutes, and the **pd.date_range** function is utilized to generate a series of timestamps at one-minute intervals to cover this period.
The get_heart_rate function generates synthetic heart rate data, assuming a linear increase in heart rate up to a certain level, followed by a constant level for the remainder of the activity. Gaussian noise is introduced to add variability to the heart rate data, making it more realistic.
Similarly, the get_cadence function generates synthetic cadence data, assuming a relatively constant cadence throughout the activity. Gaussian noise is added to create variability in the cadence values, with the noise values being updated every three minutes instead of every minute, reflecting the stability of cadence compared to heart rates.
With the data generation functions in place, it is now possible to create synthetic data for two athletes, Bob and Alice:
```python
bob_hr = get_heart_rate(begin_hr=110, end_hr=160, break_point=20)
alice_hr = get_heart_rate(begin_hr=90, end_hr=140, break_point=50)
bob_cadence = get_cadence(mean_cadence=175)
alice_cadence = get_cadence(mean_cadence=165)
```
At this point, we have the heart rates and cadences of Bob and Alice. Let's plot them using matplotlib to get some more insight into the data:
```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt

date_formatter = mdates.DateFormatter('%H:%M:%S')  # Customize the date format as needed

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(111)
ax.xaxis.set_major_formatter(date_formatter)
ax.plot(bob_hr, color="red", label="Heart Rate Bob", marker=".")
ax.plot(alice_hr, color="red", label="Heart Rate Alice", marker="v")
ax.grid()
ax.legend()
ax.set_ylabel("Heart Rate [BPM]")
ax.set_xlabel("Time")

ax_cadence = ax.twinx()
ax_cadence.plot(bob_cadence, color="purple",
                label="Cadence Bob", marker=".", alpha=0.5)
ax_cadence.plot(alice_cadence, color="purple",
                label="Cadence Alice", marker="v", alpha=0.5)
ax_cadence.legend()
ax_cadence.set_ylabel("Cadence [SPM]")
ax_cadence.set_ylim(158, 180)
```
Great! The initial analysis of the data provides interesting observations. We can easily distinguish the differences between Bob and Alice in terms of their maximum heart rate and the rate at which it increases. Additionally, Bob's cadence appears to be notably higher than Alice's.
Achieving Scalability with DataFrames
However, as you might have already noticed, the current approach of using separate variables (bob_hr, alice_hr, bob_cadence, and alice_cadence) for each health parameter and athlete is not scalable. In real-world scenarios with a larger number of athletes and health parameters, this approach quickly becomes impractical and cumbersome.
To address this issue, we can leverage the power of pandas by utilizing a pandas DataFrame to represent the data for multiple athletes and health parameters. By organizing the data in a tabular format, we can easily manage and analyze multiple variables simultaneously.
Each row of the DataFrame can correspond to a specific timestamp, and each column can represent a health parameter for a particular athlete. This structure allows for efficient storage and manipulation of multidimensional data.
By using a DataFrame, we can eliminate the need for separate variables and store all the data in a single object. This enhances code clarity, simplifies data handling, and provides a more intuitive representation of the overall dataset.
```python
bob_df = pd.concat([bob_hr.rename("heart_rate"),
                    bob_cadence.rename("cadence")], axis="columns")
```
This is what the Dataframe for Bob’s health data looks like:
| | heart_rate | cadence |
|---|---|---|
| 2023-06-08 14:00:00 | 112.359 | 175 |
| 2023-06-08 14:01:00 | 107.204 | 175 |
| 2023-06-08 14:02:00 | 116.617 | 175.513 |
| 2023-06-08 14:03:00 | 121.151 | 175.513 |
| 2023-06-08 14:04:00 | 123.27 | 175.513 |
| 2023-06-08 14:05:00 | 120.901 | 174.995 |
| 2023-06-08 14:06:00 | 130.24 | 174.995 |
| 2023-06-08 14:07:00 | 131.15 | 174.995 |
| 2023-06-08 14:08:00 | 131.402 | 174.669 |
Introducing Hierarchical DataFrames
The last dataframe looks better already! But now, we still have to create a new dataframe for each athlete. This is where pandas MultiIndex can help. Let's take a look at how we can elegantly merge the data of multiple athletes and health parameters into one dataframe:
```python
from itertools import product

bob_df = bob_hr.to_frame("value")
bob_df["athlete"] = "Bob"
bob_df["parameter"] = "heart_rate"

values = {
    "Bob": {
        "heart_rate": bob_hr,
        "cadence": bob_cadence,
    },
    "Alice": {
        "heart_rate": alice_hr,
        "cadence": alice_cadence,
    },
}

sub_dataframes: list[pd.DataFrame] = []
for athlete, parameter in product(["Bob", "Alice"], ["heart_rate", "cadence"]):
    sub_df = values[athlete][parameter].to_frame("values")
    sub_df["athlete"] = athlete
    sub_df["parameter"] = parameter
    sub_dataframes.append(sub_df)

df = pd.concat(sub_dataframes).set_index(["athlete", "parameter"], append=True)
df.index = df.index.set_names(["timestamps", "athlete", "parameter"])
```
This code processes heart rate and cadence data for athletes, Bob and Alice. It performs the following steps:
- Create a `DataFrame` for Bob's heart rate data and add metadata columns for athlete and parameter.
- Define a dictionary that stores heart rate and cadence data for Bob and Alice.
- Generate combinations of athletes and parameters (Bob/Alice and heart_rate/cadence).
- For each combination, create a sub-dataframe with the corresponding data and metadata columns.
- Concatenate all sub-dataframes into a single dataframe.
- Set the index to include levels for timestamps, athlete, and parameter. This is where the actual `MultiIndex` is created.
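As an aside, the same hierarchical structure can be built more compactly with the `keys` argument of `pd.concat`, which prepends index levels in a single call. This is a sketch using tiny dummy series in place of the generated data above:

```python
import pandas as pd

# Dummy stand-ins for the bob_hr / bob_cadence / alice_hr / alice_cadence series
timestamps = pd.date_range("2023-06-08 14:00", periods=3, freq="min")
bob_hr = pd.Series([110.0, 112.0, 115.0], index=timestamps)
bob_cadence = pd.Series([175.0, 175.0, 176.0], index=timestamps)
alice_hr = pd.Series([90.0, 91.0, 93.0], index=timestamps)
alice_cadence = pd.Series([165.0, 165.0, 166.0], index=timestamps)

# keys= prepends (athlete, parameter) index levels in one call
df = pd.concat(
    [bob_hr, bob_cadence, alice_hr, alice_cadence],
    keys=[("Bob", "heart_rate"), ("Bob", "cadence"),
          ("Alice", "heart_rate"), ("Alice", "cadence")],
    names=["athlete", "parameter", "timestamps"],
).to_frame("values")

# Restore the (timestamps, athlete, parameter) level order used in this post
df = df.reorder_levels(["timestamps", "athlete", "parameter"]).sort_index()
```

The loop-based version above generalizes more easily when metadata columns carry extra information, but for a plain merge this one-liner avoids the intermediate sub-dataframes entirely.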
This is what the hierarchical dataframe df looks like:
| | values |
|---|---|
| (Timestamp('2023-06-08 14:00:00'), 'Bob', 'heart_rate') | 112.359 |
| (Timestamp('2023-06-08 14:01:00'), 'Bob', 'heart_rate') | 107.204 |
| (Timestamp('2023-06-08 14:02:00'), 'Bob', 'heart_rate') | 116.617 |
| (Timestamp('2023-06-08 14:03:00'), 'Bob', 'heart_rate') | 121.151 |
| (Timestamp('2023-06-08 14:04:00'), 'Bob', 'heart_rate') | 123.27 |
| (Timestamp('2023-06-08 14:05:00'), 'Bob', 'heart_rate') | 120.901 |
| (Timestamp('2023-06-08 14:06:00'), 'Bob', 'heart_rate') | 130.24 |
| (Timestamp('2023-06-08 14:07:00'), 'Bob', 'heart_rate') | 131.15 |
| (Timestamp('2023-06-08 14:08:00'), 'Bob', 'heart_rate') | 131.402 |
At this point, we have got ourselves a single dataframe that holds all information for an arbitrary amount of athletes and health parameters. We can now easily use the .xs method to query the hierarchical dataframe:
```python
df.xs("Bob", level="athlete")  # get all health data for Bob
```
| | values |
|---|---|
| (Timestamp('2023-06-08 14:00:00'), 'heart_rate') | 112.359 |
| (Timestamp('2023-06-08 14:01:00'), 'heart_rate') | 107.204 |
| (Timestamp('2023-06-08 14:02:00'), 'heart_rate') | 116.617 |
| (Timestamp('2023-06-08 14:03:00'), 'heart_rate') | 121.151 |
| (Timestamp('2023-06-08 14:04:00'), 'heart_rate') | 123.27 |
```python
df.xs("heart_rate", level="parameter")  # get all heart rates
```
| | values |
|---|---|
| (Timestamp('2023-06-08 14:00:00'), 'Bob') | 112.359 |
| (Timestamp('2023-06-08 14:01:00'), 'Bob') | 107.204 |
| (Timestamp('2023-06-08 14:02:00'), 'Bob') | 116.617 |
| (Timestamp('2023-06-08 14:03:00'), 'Bob') | 121.151 |
| (Timestamp('2023-06-08 14:04:00'), 'Bob') | 123.27 |
```python
df.xs("Bob", level="athlete").xs("heart_rate", level="parameter")  # get heart_rate data for Bob
```
| timestamps | values |
|---|---|
| 2023-06-08 14:00:00 | 112.359 |
| 2023-06-08 14:01:00 | 107.204 |
| 2023-06-08 14:02:00 | 116.617 |
| 2023-06-08 14:03:00 | 121.151 |
| 2023-06-08 14:04:00 | 123.27 |
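For completeness, `.xs` is not the only way to query a MultiIndex: on a sorted index, `.loc` combined with `pd.IndexSlice` expresses the same cross-sections while keeping all index levels in the result. A sketch on a small dummy frame shaped like `df`:

```python
import pandas as pd

# Small dummy frame with the same three index levels as df above
timestamps = pd.date_range("2023-06-08 14:00", periods=2, freq="min")
index = pd.MultiIndex.from_product(
    [timestamps, ["Bob", "Alice"], ["heart_rate", "cadence"]],
    names=["timestamps", "athlete", "parameter"],
)
df = pd.DataFrame({"values": range(len(index))}, index=index).sort_index()

idx = pd.IndexSlice
# All heart-rate rows for Bob, keeping every index level
bob_hr = df.loc[idx[:, "Bob", "heart_rate"], :]
```

Note that label-based slicing on a MultiIndex requires the index to be lexicographically sorted, hence the `sort_index()` call.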
Use Case: Earth Temperature Changes
To demonstrate the power of hierarchical dataframes, let's explore a real-world and complex use case: analyzing the changes in Earth's surface temperatures over the last decades. For this task, we'll utilize a dataset available on Kaggle, which summarizes the Global Surface Temperature Change data distributed by the National Aeronautics and Space Administration Goddard Institute for Space Studies (NASA-GISS).
Inspecting and Transforming the Raw Data
Let's begin by reading and inspecting the data. This step is crucial to gain a better understanding of the dataset's structure and contents before delving into the analysis. Here's how we can accomplish that using pandas:
```python
from pathlib import Path

file_path = Path() / "data" / "Environment_Temperature_change_E_All_Data_NOFLAG.csv"
df = pd.read_csv(file_path, encoding='cp1252')
df.describe()
```
From this initial inspection, it becomes evident that the data is organized in a single dataframe, with separate rows for different months and countries. However, the values for different years are spread across several columns in the dataframe, labeled with the prefix 'Y'. This format makes it challenging to read and visualize the data effectively. To address this issue, we will transform the data into a more structured and hierarchical dataframe format, enabling us to query and visualize the data more conveniently.
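As an alternative worth knowing, pandas' built-in `melt` can unpivot the `Y`-prefixed year columns in one call. The custom transformer below offers more control (validation, month filtering), but for a pure reshape this sketch suffices; the tiny frame mimics the CSV layout described above, with column names taken from the inspection:

```python
import pandas as pd

# Tiny mock of the wide CSV layout: one row per (Area, Months, Element)
wide = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan"],
    "Months": ["January", "January"],
    "Element": ["Temperature change", "Standard Deviation"],
    "Y1961": [0.777, 1.95],
    "Y1962": [0.062, 1.95],
})

# Unpivot the Y-prefixed year columns into (year, value) rows
long = wide.melt(id_vars=["Area", "Months", "Element"],
                 var_name="year", value_name="value")
long["year"] = long["year"].str.lstrip("Y").astype(int)
```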
```python
from dataclasses import dataclass, field
from datetime import date

from pydantic import BaseModel

MONTHS = {
    "January": 1,
    "February": 2,
    "March": 3,
    "April": 4,
    "May": 5,
    "June": 6,
    "July": 7,
    "August": 8,
    "September": 9,
    "October": 10,
    "November": 11,
    "December": 12,
}


class GistempDataElement(BaseModel):
    area: str
    timestamp: date
    value: float


@dataclass
class GistempTransformer:
    temperature_changes: list[GistempDataElement] = field(default_factory=list)
    standard_deviations: list[GistempDataElement] = field(default_factory=list)

    def _process_row(self, row) -> None:
        relevant_elements = ["Temperature change", "Standard Deviation"]
        if ((element := row["Element"]) not in relevant_elements
                or (month := MONTHS.get(row["Months"])) is None):
            return None
        for year, value in row.filter(regex="Y.*").items():
            new_element = GistempDataElement(
                timestamp=date(year=int(year.replace("Y", "")), month=month, day=1),
                area=row["Area"],
                value=value,
            )
            if element == "Temperature change":
                self.temperature_changes.append(new_element)
            else:
                self.standard_deviations.append(new_element)

    @property
    def df(self) -> pd.DataFrame:
        temp_changes_df = pd.DataFrame.from_records(
            [elem.dict() for elem in self.temperature_changes])
        temp_changes = temp_changes_df.set_index(
            ["timestamp", "area"]).rename(columns={"value": "temp_change"})
        std_deviations_df = pd.DataFrame.from_records(
            [elem.dict() for elem in self.standard_deviations])
        std_deviations = std_deviations_df.set_index(
            ["timestamp", "area"]).rename(columns={"value": "std_deviation"})
        return pd.concat([temp_changes, std_deviations], axis="columns")

    def process(self):
        environment_data = (Path() / "data" /
                            "Environment_Temperature_change_E_All_Data_NOFLAG.csv")
        df = pd.read_csv(environment_data, encoding='cp1252')
        df.apply(self._process_row, axis="columns")
```
This code introduces the GistempTransformer class, which demonstrates the processing of temperature data from a CSV file and the creation of a hierarchical DataFrame containing temperature changes and standard deviations.
The GistempTransformer class, defined as a dataclass, includes two lists, temperature_changes and standard_deviations, to store the processed data elements. The _process_row method is responsible for handling each row of the input DataFrame. It checks for relevant elements, such as "Temperature change" and "Standard Deviation," extracts the month from the Months column, and creates instances of the GistempDataElement class. These instances are then appended to the appropriate lists based on the element type.
The df property returns a DataFrame by combining the temperature_changes and standard_deviations lists. This hierarchical DataFrame has a MultiIndex with levels representing the timestamp and area, providing a structured organization of the data.
```python
transformer = GistempTransformer()
transformer.process()
df = transformer.df
```
| | temp_change | std_deviation |
|---|---|---|
| (datetime.date(1961, 1, 1), 'Afghanistan') | 0.777 | 1.95 |
| (datetime.date(1962, 1, 1), 'Afghanistan') | 0.062 | 1.95 |
| (datetime.date(1963, 1, 1), 'Afghanistan') | 2.744 | 1.95 |
| (datetime.date(1964, 1, 1), 'Afghanistan') | -5.232 | 1.95 |
| (datetime.date(1965, 1, 1), 'Afghanistan') | 1.868 | 1.95 |
Analyzing the Climate Data
Now that we have consolidated all the relevant data into a single dataframe, we can proceed with inspecting and visualizing the data. Our focus is on examining the linear regression lines for each area, as they provide insights into the overall trend of temperature changes over the past decades. To facilitate this visualization, we will create a function that plots the temperature changes along with their corresponding regression lines.
```python
def plot_temperature_changes(areas: list[str]) -> None:
    fig = plt.figure(figsize=(12, 6))
    ax1 = fig.add_subplot(211)
    ax2 = fig.add_subplot(212)
    for area in areas:
        df_country = df[df.index.get_level_values("area") == area].reset_index()
        dates = df_country["timestamp"].map(datetime.toordinal)
        gradient, offset = np.polyfit(dates, df_country.temp_change, deg=1)
        ax1.scatter(df_country.timestamp, df_country.temp_change, label=area, s=5)
        ax2.plot(df_country.timestamp, gradient * dates + offset, label=area)
    ax1.grid()
    ax2.grid()
    ax2.legend()
    ax2.set_ylabel("Regression Lines [°C]")
    ax1.set_ylabel("Temperature change [°C]")
```
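The filtering expression inside the loop relies on `get_level_values`, which returns the labels of one index level as an array and therefore supports ordinary boolean masks. A minimal sketch on a dummy frame (names and values are illustrative):

```python
import pandas as pd

# Dummy frame with a (year, area) MultiIndex, mimicking the climate df
index = pd.MultiIndex.from_tuples(
    [(2000, "Europe"), (2001, "Europe"), (2000, "Africa")],
    names=["year", "area"],
)
climate = pd.DataFrame({"temp_change": [0.5, 0.7, 0.3]}, index=index)

# get_level_values yields one level's labels, enabling a boolean mask
europe = climate[climate.index.get_level_values("area") == "Europe"]
```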
In this function, we are using the **get_level_values** method on a pandas MultiIndex to efficiently query the data in our hierarchical Dataframe on different levels. Let's use this function to visualize temperature changes in the different continents:
```python
plot_temperature_changes(
    ["Africa", "Antarctica", "Americas", "Asia", "Europe", "Oceania"])
```
From this plot, we can draw several key conclusions:
- The regression lines for all continents have a positive gradient, indicating a global trend of increasing Earth surface temperatures.
- The regression line for Europe is notably steeper compared to other continents, implying that the temperature increase in Europe has been more pronounced. This finding aligns with observations of accelerated warming in Europe compared to other regions.
- The specific factors contributing to the higher temperature increase in Europe compared to Antarctica are complex and require detailed scientific research. However, one contributing factor may be the influence of ocean currents. Europe is influenced by warm ocean currents, such as the Gulf Stream, which transport heat from the tropics towards the region. These currents play a role in moderating temperatures and can contribute to the relatively higher warming observed in Europe. In contrast, Antarctica is surrounded by cold ocean currents, and its climate is heavily influenced by the Southern Ocean and the Antarctic Circumpolar Current, which act as barriers to the incursion of warmer waters, thereby limiting the warming effect.
Now, let's focus our analysis on Europe itself by examining temperature changes in different regions within Europe. We can achieve this by creating individual plots for each European region:
```python
plot_temperature_changes(
    ["Southern Europe", "Eastern Europe", "Northern Europe", "Western Europe"])
```
From the plotted temperature changes in different regions of Europe, we observe that the overall temperature rises across the European continent are quite similar. While there may be slight variations in the steepness of the regression lines between regions, such as Eastern Europe having a slightly steeper line compared to Southern Europe, no significant differences can be observed among the regions.
The Ten Countries Most and Least Affected by Climate Change
Now, let's shift our focus to identifying the top 10 countries that have experienced the highest average temperature increase since the year 2000. Here's an example of how we can retrieve the list of countries:
```python
df[df.index.get_level_values(level="timestamp") > date(2000, 1, 1)].groupby(
    "area").mean().sort_values(by="temp_change", ascending=False).head(10)
```
| area | temp_change | std_deviation |
|---|---|---|
| Svalbard and Jan Mayen Islands | 2.61541 | 2.48572 |
| Estonia | 1.69048 | nan |
| Kuwait | 1.6825 | 1.12843 |
| Belarus | 1.66113 | nan |
| Finland | 1.65906 | 2.15634 |
| Slovenia | 1.6555 | nan |
| Russian Federation | 1.64507 | nan |
| Bahrain | 1.64209 | 0.937431 |
| Eastern Europe | 1.62868 | 0.970377 |
| Austria | 1.62721 | 1.56392 |
To extract the top 10 countries with the highest average temperature increase since the year 2000, we perform the following steps:
- Filter the dataframe to include only rows after the year 2000 using `df.index.get_level_values(level="timestamp") > date(2000, 1, 1)`.
- Group the data by the 'Area' (country) using `.groupby("area")`.
- Calculate the mean temperature change for each country using `.mean()`.
- Select the top 10 countries with the largest mean temperature change using `.sort_values(by="temp_change", ascending=False).head(10)`.
This result aligns with our previous observations, confirming that Europe experienced the highest rise in temperature compared to other continents.
Continuing with our analysis, let's now explore the ten countries that are least affected by the rise in temperature. We can utilize the same method as before to extract this information. Here's an example of how we can retrieve the list of countries:
```python
df[df.index.get_level_values(level="timestamp") > date(2000, 1, 1)].groupby(
    "area").mean().sort_values(by="temp_change", ascending=True).head(10)
```
| area | temp_change | std_deviation |
|---|---|---|
| Pitcairn Islands | 0.157284 | 0.713095 |
| Marshall Islands | 0.178335 | nan |
| South Georgia and the South Sandwich Islands | 0.252101 | 1.11 |
| Micronesia (Federated States of) | 0.291996 | nan |
| Chile | 0.297607 | 0.534071 |
| Wake Island | 0.306269 | nan |
| Norfolk Island | 0.410659 | 0.594073 |
| Argentina | 0.488159 | 0.91559 |
| Zimbabwe | 0.493519 | 0.764067 |
| Antarctica | 0.527987 | 1.55841 |
We observe that the majority of countries in this list are small, remote islands located in the southern hemisphere. This finding further supports our previous conclusions that southern continents, particularly Antarctica, are less affected by climate change compared to other regions.
Temperature Changes in Summer and Winter
Now, let's delve into more complex queries using the hierarchical dataframe. In this specific use case, our focus is on analyzing temperature changes during winters and summers. For the purpose of this analysis, we define winters as the months of December, January, and February, while summers encompass June, July, and August. By leveraging the power of pandas and the hierarchical dataframe, we can easily visualize the temperature changes during these seasons in Europe. Here's an example code snippet to accomplish that:
```python
all_winters = df[df.index.get_level_values(level="timestamp").map(
    lambda x: x.month in [12, 1, 2])]
all_summers = df[df.index.get_level_values(level="timestamp").map(
    lambda x: x.month in [6, 7, 8])]
winters_europe = all_winters.xs("Europe", level="area").sort_index()
summers_europe = all_summers.xs("Europe", level="area").sort_index()

fig = plt.figure(figsize=(12, 6))
ax = fig.add_subplot(111)
ax.plot(winters_europe.index, winters_europe.temp_change,
        label="Winters", marker="o", markersize=4)
ax.plot(summers_europe.index, summers_europe.temp_change,
        label="Summers", marker='o', markersize=4)
ax.grid()
ax.legend()
ax.set_ylabel("Temperature Change [°C]")
```
From this figure, we can observe that temperature changes during the winters exhibit greater volatility compared to temperature changes during the summers. To quantify this difference, let's calculate the standard deviation of the temperature changes for both seasons:
```python
pd.concat([winters_europe.std().rename("winters"),
           summers_europe.std().rename("summers")], axis="columns")
```
| | winters | summers |
|---|---|---|
| temp_change | 1.82008 | 0.696666 |
Conclusion
In conclusion, mastering MultiIndexes in Pandas provides a powerful tool for handling complex data analysis tasks. By leveraging MultiIndexes, users can efficiently organize and analyze multidimensional datasets in a flexible and intuitive manner. The ability to work with hierarchical structures for rows and columns enhances code clarity, simplifies data handling, and enables simultaneous analysis of multiple variables. Whether it's tracking health parameters of athletes or analyzing Earth's temperature changes over time, understanding and utilizing MultiIndexes in Pandas unlocks the full potential of the library for handling complex data scenarios.
You can find all code included in this post here: github.com/GlennViroux….