如何合并数据在R使用R合并,dplyr,或data.table

R有许多通过公共列连接数据帧的快速、优雅的方法。我想向你们展示其中的三个:

1. 基数R的merge()函数

2. Dplyr的join函数族

3. 数据。表的括号语法


一、获取并导入数据

在这个例子中,我将使用我最喜欢的演示数据集之一——来自美国交通统计局的航班延误时间。如果您想跟随,请访问http://bit.ly/USFlightDelays并下载您选择的时间段的数据,包括航班日期、Reporting_Airline、出发地、目的地和出发时间。还可以获取Reporting_Airline的查找表。

或者,你可以下载这两个数据集,加上我在一个文件中的R代码和一个解释不同类型的数据合并的PowerPoint,在这里:

要用基本R读入文件,我首先解压缩航班延误文件,然后用read.csv()导入航班延误数据和代码查找文件。如果您正在运行该代码,则您下载的延迟文件的名称可能与下面代码中的名称不同。另外,请注意查找文件不寻常的.csv_扩展名。

unzip("673598238_T_ONTIME_REPORTING.zip")
mydf <- read.csv("673598238_T_ONTIME_REPORTING.csv",
sep = ",", quote=""")
mylookup <- read.csv("L_UNIQUE_CARRIERS.csv_",
quote=""", sep = "," )

接下来,我将用head()查看这两个文件:

head(mydf)

FL_DATE OP_UNIQUE_CARRIER ORIGIN DEST DEP_DELAY_NEW X1 2019-08-01 DL ATL DFW 31 NA2 2019-08-01 DL DFW ATL 0 NA3 2019-08-01 DL IAH ATL 40 NA4 2019-08-01 DL PDX SLC 0 NA5 2019-08-01 DL SLC PDX 0 NA6 2019-08-01 DL DTW ATL 10 NA

head(mylookup)

Code Description1 02Q Titan Airways2 04Q Tradewind Aviation3 05Q Comlux Aviation, AG4 06Q Master Top Linhas Aereas Ltd.5 07Q Flair Airlines Ltd.6 09Q Swift Air, LLC d/b/a Eastern Air Lines d/b/a Eastern


二、与底R合并

mydf延迟数据帧只有航空公司信息的代码。我想用mylookup中的航空公司名称添加一列。一种基于R的方法是使用merge()函数,使用基本语法merge(df1, df2)。数据帧1和数据帧2的顺序无关紧要,但无论哪个是第一个都被认为是x,第二个是y。华东CIO大会、华东CIO联盟、CDLC中国数字化灯塔大会、CXO数字化研学之旅、数字化江湖-讲武堂,数字化江湖-大侠传、数字化江湖-论剑、CXO系列管理论坛(陆家嘴CXO管理论坛、宁波东钱湖CXO管理论坛等)、数字化转型网,走进灯塔工厂系列、ECIO大会等

如果你想要连接的列没有相同的名称,你需要告诉归并你想要连接的列:by。X为X数据帧的列名,由。Y表示Y,比如merge(df1, df2, by。x = "df1ColName", by。y = "df2ColName")。

您还可以告诉归并是否需要包含参数all的所有行,包括没有匹配的行,还是只需要匹配的行。X和所有。y。在这种情况下,我想要所有的行从延迟数据;如果查找表中没有航空公司代码,我仍然需要该信息。但我不需要查找表中不在延迟数据中的行(其中有一些已不再飞行的旧航空公司的代码)。因此,所有。x = TRUE但所有。y = FALSE。代码如下:

joined_df <- merge(mydf, mylookup, by.x = "OP_UNIQUE_CARRIER",
by.y = "Code", all.x = TRUE, all.y = FALSE)

新的连接数据帧包括一个名为Description的列,其中包含基于航空公司代码的航空公司名称:

head(joined_df)

OP_UNIQUE_CARRIER FL_DATE ORIGIN DEST DEP_DELAY_NEW X Description1 9E 2019-08-12 JFK SYR 0 NA Endeavor Air Inc.2 9E 2019-08-12 TYS DTW 0 NA Endeavor Air Inc.3 9E 2019-08-12 ORF LGA 0 NA Endeavor Air Inc.4 9E 2019-08-13 IAH MSP 6 NA Endeavor Air Inc.5 9E 2019-08-12 DTW JFK 58 NA Endeavor Air Inc.6 9E 2019-08-12 SYR JFK 0 NA Endeavor Air Inc.


三、与dplyr连接

dplyr包的连接函数使用SQL数据库语法。左连接意味着:包括左边的所有内容(merge()中的x数据帧是什么)和从右边(y)数据帧匹配的所有行。如果联接列有相同的名称,你只需要left_join(x, y).如果它们没有相同的名称,你需要一个by参数,比如left_join(x, y, by = c("df1ColName"= "df2ColName"))。

注意by的语法:它是一个命名向量,左右列名都用引号括起来。

更新:从dplyr 1.1.0版本开始(2023年1月29日在CRAN上),dplyr连接有一个额外的by语法,使用join_by():

left_join(x, y, by = join_by(df1ColName == df2ColName))

新的join_by()帮助函数使用了不带引号的列名和==布尔运算符,包的作者说,这个运算符在R上下文中比在c上下文中更有意义("col1" = "col2"),因为=是为了给变量赋值,而不是测试是否相等。

左连接保留左数据帧中的所有行,只匹配来自右数据帧的行。

下面是使用left_join()导入和合并两个数据集的代码。它首先加载dplyr和readr包,然后用read_csv()读入这两个文件。当使用read_csv()时,我不需要先解压缩文件。

library(dplyr)
library(readr)

mytibble <- read_csv("673598238_T_ONTIME_REPORTING.zip")
mylookup_tibble <- read_csv("L_UNIQUE_CARRIERS.csv_")

joined_tibble <- left_join(mytibble, mylookup_tibble,

by = join_by(OP_UNIQUE_CARRIER == Code))

注意,dplyr的旧by语法没有join_by()仍然有效

joined_tibble <- left_join(mytibble, mylookup_tibble,
by = c("OP_UNIQUE_CARRIER" = "Code"))

Read_csv()创建tibbles,这是一种具有一些额外功能的数据帧类型。Left_join()将两者合并。看一下语法:在这种情况下,顺序很重要。Left_join()意味着包含左边或第一个数据集的所有行,但只包含与第二个数据集匹配的行。并且,因为我需要通过两个不同名称的列来连接,所以我包含了一个by参数。

在dplyr的开发版中,新的连接语法是:

joined_tibble2 <- left_join(mytibble, mylookup_tibble,
by = join_by(OP_UNIQUE_CARRIER == Code))

但是,由于大多数人可能都有CRAN版本,所以在本文的其余部分中,我将使用dplyr最初的命名向量语法,直到join_by()成为CRAN版本的一部分。

我们可以使用dplyr的glimpse()函数查看结果的结构,这是查看数据帧顶部几项的另一种方式:

glimpse(joined_tibble)Observations: 658,461Variables: 7

$ FL_DATE 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01

$ OP_UNIQUE_CARRIER "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL",…

$ ORIGIN "ATL", "DFW", "IAH", "PDX", "SLC", "DTW", "ATL", "MSP", "JF…

$ DEST "DFW", "ATL", "ATL", "SLC", "PDX", "ATL", "DTW", "JFK", "MS

$ DEP_DELAY_NEW 31, 0, 40, 0, 0, 10, 0, 22, 0, 0, 0, 17, 5, 2, 0, 0, 8, 0,

$ X6 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

$ Description "Delta Air Lines Inc.", "Delta Air Lines Inc.", "Delta Air …

这个合并的数据集现在有一个新列,列中包含航空公司的名称。如果您自己运行这段代码的一个版本,您可能会注意到dplyr比基数R快得多。

原文:

R has a number of quick, elegant ways to join data frames by a common column. I’d like to show you three of them:

· base R’s merge() function

· dplyr’s join family of functions

· data.table’s bracket syntax

Get and import the data

For this example I’ll use one of my favorite demo data sets—flight delay times from the U.S. Bureau of Transportation Statistics. If you want to follow along, head to http://bit.ly/USFlightDelays and download data for the time frame of your choice with the columns Flight Date, Reporting_Airline, Origin, Destination, and DepartureDelayMinutes. Also get the lookup table for Reporting_Airline.

Or, you can download these two data sets—plus my R code in a single file and a PowerPoint explaining different types of data merges—here:

To read in the file with base R, I’d first unzip the flight delay file and then import both flight delay data and the code lookup file with read.csv(). If you’re running the code, the delay file you downloaded will likely have a different name than in the code below. Also, note the lookup file’s unusual .csv_ extension.

unzip("673598238_T_ONTIME_REPORTING.zip")
mydf <- read.csv("673598238_T_ONTIME_REPORTING.csv",
sep = ",", quote=""")
mylookup <- read.csv("L_UNIQUE_CARRIERS.csv_",
quote=""", sep = "," )

Next, I’ll take a peek at both files with head():

head(mydf)

FL_DATE OP_UNIQUE_CARRIER ORIGIN DEST DEP_DELAY_NEW X1 2019-08-01 DL ATL DFW 31 NA2 2019-08-01 DL DFW ATL 0 NA3 2019-08-01 DL IAH ATL 40 NA4 2019-08-01 DL PDX SLC 0 NA5 2019-08-01 DL SLC PDX 0 NA6 2019-08-01 DL DTW ATL 10 NA

head(mylookup)

Code Description1 02Q Titan Airways2 04Q Tradewind Aviation3 05Q Comlux Aviation, AG4 06Q Master Top Linhas Aereas Ltd.5 07Q Flair Airlines Ltd.6 09Q Swift Air, LLC d/b/a Eastern Air Lines d/b/a Eastern

Merges with base R

The mydf delay data frame only has airline information by code. I’d like to add a column with the airline names from mylookup. One base R way to do this is with the merge() function, using the basic syntax merge(df1, df2). The order of data frame 1 and data frame 2 doesn't matter, but whichever one is first is considered x and the second one is y.

If the columns you want to join by don’t have the same name, you need to tell merge which columns you want to join by: by.x for the x data frame column name, and by.y for the y one, such as merge(df1, df2, by.x = "df1ColName", by.y = "df2ColName").

You can also tell merge whether you want all rows, including ones without a match, or just rows that match, with the arguments all.x and all.y. In this case, I’d like all the rows from the delay data; if there’s no airline code in the lookup table, I still want the information. But I don’t need rows from the lookup table that aren’t in the delay data (there are some codes for old airlines that don’t fly anymore in there). So, all.x equals TRUE but all.y equals FALSE. Here's the code:

joined_df <- merge(mydf, mylookup, by.x = "OP_UNIQUE_CARRIER",
by.y = "Code", all.x = TRUE, all.y = FALSE)

The new joined data frame includes a column called Description with the name of the airline based on the carrier code:

head(joined_df)

OP_UNIQUE_CARRIER FL_DATE ORIGIN DEST DEP_DELAY_NEW X Description1 9E 2019-08-12 JFK SYR 0 NA Endeavor Air Inc.2 9E 2019-08-12 TYS DTW 0 NA Endeavor Air Inc.3 9E 2019-08-12 ORF LGA 0 NA Endeavor Air Inc.4 9E 2019-08-13 IAH MSP 6 NA Endeavor Air Inc.5 9E 2019-08-12 DTW JFK 58 NA Endeavor Air Inc.6 9E 2019-08-12 SYR JFK 0 NA Endeavor Air Inc.

Joins with dplyr

The dplyr package uses SQL database syntax for its join functions. A left join means: Include everything on the left (what was the x data frame in merge()) and all rows that match from the right (y) data frame. If the join columns have the same name, all you need is left_join(x, y). If they don’t have the same name, you need a by argument, such as left_join(x, y, by = c("df1ColName" = "df2ColName")).

Note the syntax for by: It’s a named vector, with both the left and right column names in quotation marks.

Update: Starting with dplyr version 1.1.0 (on CRAN as of January 29, 2023), dplyr joins have an additional by syntax using join_by():

left_join(x, y, by = join_by(df1ColName == df2ColName))

The new join_by() helper function uses unquoted column names and the == boolean operator, which package authors say makes more sense in an R context than c("col1" = "col2"), since = is meant for assigning a value to a variable, not testing for equality.

A left join keeps all rows in the left data frame and only matching rows from the right data frame.

The code to import and merge both data sets using left_join() is below. It starts by loading the dplyr and readr packages, and then reads in the two files with read_csv(). When using read_csv(), I don’t need to unzip the file first.

library(dplyr)
library(readr)


mytibble <- read_csv("673598238_T_ONTIME_REPORTING.zip")
mylookup_tibble <- read_csv("L_UNIQUE_CARRIERS.csv_")

joined_tibble <- left_join(mytibble, mylookup_tibble,
by = join_by(OP_UNIQUE_CARRIER == Code))

Note that dplyr's older by syntax without join_by() still works


joined_tibble <- left_join(mytibble, mylookup_tibble,
by = c("OP_UNIQUE_CARRIER" = "Code"))

read_csv() creates tibbles, which are a type of data frame with some extra features. left_join() merges the two. Take a look at the syntax: In this case, order matters. left_join() means include all rows on the left, or first, data set, but only rows that match from the second one. And, because I need to join by two differently named columns, I included a by argument.

The new join syntax in the development-only version of dplyr would be:

joined_tibble2 <- left_join(mytibble, mylookup_tibble,
by = join_by(OP_UNIQUE_CARRIER == Code))

Since most people likely have the CRAN version, however, I will use dplyr's original named-vector syntax in the rest of this article, until join_by() becomes part of the CRAN version.

We can look at the structure of the result with dplyr’s glimpse() function, which is another way to see the top few items of a data frame:

glimpse(joined_tibble)Observations: 658,461Variables: 7

$ FL_DATE 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01, 2019-08-01

$ OP_UNIQUE_CARRIER "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL", "DL",…

$ ORIGIN "ATL", "DFW", "IAH", "PDX", "SLC", "DTW", "ATL", "MSP", "JF…

$ DEST "DFW", "ATL", "ATL", "SLC", "PDX", "ATL", "DTW", "JFK", "MS

$ DEP_DELAY_NEW 31, 0, 40, 0, 0, 10, 0, 22, 0, 0, 0, 17, 5, 2, 0, 0, 8, 0,

$ X6 NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

$ Description "Delta Air Lines Inc.", "Delta Air Lines Inc.", "Delta Air …

This joined data set now has a new column with the name of the airline. If you run a version of this code yourself, you’ll probably notice that dplyr is way faster than base R.

CXO联盟(CXO union)是一家聚焦于CIO,CDO,cto,ciso,cfo,coo,chro,cpo,ceo等人群的平台组织,其中在CIO会议领域的领头羊,目前举办了大量的CIO大会、CIO论坛、CIO活动、CIO会议、CIO峰会、CIO会展。如华东CIO会议、华南cio会议、华北cio会议、中国cio会议、西部CIO会议。在这里,你可以参加大量的IT大会、IT行业会议、IT行业论坛、IT行业会展、数字化论坛、数字化转型论坛,在这里你可以认识很多的首席信息官、首席数字官、首席财务官、首席技术官、首席人力资源官、首席运营官、首席执行官、IT总监、财务总监、信息总监、运营总监、采购总监、供应链总监。

数字化转型网(www.szhzxw.cn资讯媒体,是企业数字化转型的必读参考,在这里你可以学习大量的知识,如财务数字化转型、供应链数字化转型、运营数字化转型、生产数字化转型、人力资源数字化转型、市场营销数字化转型。通过关注我们的公众号,你就知道如何实现企业数字化转型?数字化转型如何做?

【CXO联盟部分会员】兴达投资集团CEO、江苏中超投资集团CEO、江苏江润铜业CEO、浙江协和集团CEO、山东垦利石化集团CEO、江苏长电科技股份CEO、重庆钢铁股份CEO、天合光能股份CEO、江苏恒瑞医药股份CEO、奥盛集团CEO、山鹰国际控股股份公司CEO、成都蛟龙投资CEO、宜宾天原集团股份CEO、胜达集团CEO、三宝集团股份CEO、山西晋城钢铁控股集团CEO、河北鑫达钢铁集团CEO、深圳市中金岭南有色金属股份CEO、宁波申洲针织CEO、广西柳工集团CEO、苏州创元投资发展(集团)CEO、邯郸正大制管CEO、人福医药集团股份公司CEO、浙江升华控股集团CEO、河南济源钢铁(集团)CEO、达利食品集团CEO、广西汽车集团CEO、孝义市鹏飞实业CEO、宗申产业集团CEO、天津纺织集团(控股)CEO、晶澳太阳能科技股份CEO、唐山瑞丰钢铁(集团)CEO、重庆万达薄板CEO、唐山三友集团CEO、凌源钢铁集团CEO、宁波博洋控股集团CEO、天津市医药集团CEO、福建省汽车工业集团CEO、山西安泰控股集团CEO、牧原食品股份CEO、上海仪电(集团)CEO、上海胜华电缆(集团)CEO、大亚科技集团CEO、天津恒兴集团CEO、攀枝花钢城集团CEO、桂林力源粮油食品集团CEO、万马联合控股集团CEO、山东鑫海科技股份CEO、江苏上上电缆集团CEO、广西贵港钢铁集团CEO、久立集团股份CEO

展开阅读全文

页面更新:2024-04-20

标签:数据   江苏   航空公司   语法   函数   名称   代码   会议   文件   集团

1 2 3 4 5

上滑加载更多 ↓
推荐阅读:
友情链接:
更多:

本站资料均由网友自行发布提供,仅用于学习交流。如有版权问题,请与我联系,QQ:4156828  

© CopyRight 2008-2024 All Rights Reserved. Powered By bs178.com 闽ICP备11008920号-3
闽公网安备35020302034844号

Top