连接两个数据框,从一个中选择所有列,从另一个中选择一些列

假设我有一个火花数据帧 df1,它有几列(其中包括列 id) ,数据帧 df2有两列 idother

有没有办法复制以下命令:

sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id")

只使用火花功能,如 join()select()等?

我必须在函数中实现这个连接,而且我不希望强制使用 sqlContext 作为函数参数。

273045 次浏览

Here is a solution that does not require a SQL context, but maintains the metadata of a DataFrame.

a = sc.parallelize([['a', 'foo'], ['b', 'hem'], ['c', 'haw']]).toDF(['a_id', 'extra'])
b = sc.parallelize([['p1', 'a'], ['p2', 'b'], ['p3', 'c']]).toDF(["other", "b_id"])


c = a.join(b, a.a_id == b.b_id)

Then, c.show() yields:

+----+-----+-----+----+
|a_id|extra|other|b_id|
+----+-----+-----+----+
|   a|  foo|   p1|   a|
|   b|  hem|   p2|   b|
|   c|  haw|   p3|   c|
+----+-----+-----+----+

Not sure if the most efficient way, but this worked for me:

from pyspark.sql.functions import col


df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'),col('b.other2')])

The trick is in:

[col('a.'+xx) for xx in a.columns] : all columns in a


[col('b.other1'),col('b.other2')] : some columns of b

Asterisk (*) works with alias. Ex:

from pyspark.sql.functions import *


df1 = df1.alias('df1')
df2 = df2.alias('df2')


df1.join(df2, df1.id == df2.id).select('df1.*')

drop duplicate b_id

c = a.join(b, a.a_id == b.b_id).drop(b.b_id)

Without using alias.

df1.join(df2, df1.id == df2.id).select(df1["*"],df2["other"])

I got an error: 'a not found' using the suggested code:

from pyspark.sql.functions import col df1.alias('a').join(df2.alias('b'),col('b.id') == col('a.id')).select([col('a.'+xx) for xx in a.columns] + [col('b.other1'),col('b.other2')])

I changed a.columns to df1.columns and it worked out.

function to drop duplicate columns after joining.

check it

def dropDupeDfCols(df): newcols = [] dupcols = []

for i in range(len(df.columns)):
if df.columns[i] not in newcols:
newcols.append(df.columns[i])
else:
dupcols.append(i)


df = df.toDF(*[str(i) for i in range(len(df.columns))])
for dupcol in dupcols:
df = df.drop(str(dupcol))


return df.toDF(*newcols)

I believe that this would be the easiest and most intuitive way:

final = (df1.alias('df1').join(df2.alias('df2'),
on = df1['id'] == df2['id'],
how = 'inner')
.select('df1.*',
'df2.other')
)

Here is the code snippet that does the inner join and select the columns from both dataframe and alias the same column to different column name.

emp_df  = spark.read.csv('Employees.csv', header =True);
dept_df = spark.read.csv('dept.csv', header =True)




emp_dept_df = emp_df.join(dept_df,'DeptID').select(emp_df['*'], dept_df['Name'].alias('DName'))
emp_df.show()
dept_df.show()
emp_dept_df.show()
Output  for 'emp_df.show()':


+---+---------+------+------+
| ID|     Name|Salary|DeptID|
+---+---------+------+------+
|  1|     John| 20000|     1|
|  2|    Rohit| 15000|     2|
|  3|    Parth| 14600|     3|
|  4|  Rishabh| 20500|     1|
|  5|    Daisy| 34000|     2|
|  6|    Annie| 23000|     1|
|  7| Sushmita| 50000|     3|
|  8| Kaivalya| 20000|     1|
|  9|    Varun| 70000|     3|
| 10|Shambhavi| 21500|     2|
| 11|  Johnson| 25500|     3|
| 12|     Riya| 17000|     2|
| 13|    Krish| 17000|     1|
| 14| Akanksha| 20000|     2|
| 15|   Rutuja| 21000|     3|
+---+---------+------+------+


Output  for 'dept_df.show()':
+------+----------+
|DeptID|      Name|
+------+----------+
|     1|     Sales|
|     2|Accounting|
|     3| Marketing|
+------+----------+


Join Output:
+---+---------+------+------+----------+
| ID|     Name|Salary|DeptID|     DName|
+---+---------+------+------+----------+
|  1|     John| 20000|     1|     Sales|
|  2|    Rohit| 15000|     2|Accounting|
|  3|    Parth| 14600|     3| Marketing|
|  4|  Rishabh| 20500|     1|     Sales|
|  5|    Daisy| 34000|     2|Accounting|
|  6|    Annie| 23000|     1|     Sales|
|  7| Sushmita| 50000|     3| Marketing|
|  8| Kaivalya| 20000|     1|     Sales|
|  9|    Varun| 70000|     3| Marketing|
| 10|Shambhavi| 21500|     2|Accounting|
| 11|  Johnson| 25500|     3| Marketing|
| 12|     Riya| 17000|     2|Accounting|
| 13|    Krish| 17000|     1|     Sales|
| 14| Akanksha| 20000|     2|Accounting|
| 15|   Rutuja| 21000|     3| Marketing|
+---+---------+------+------+----------+

I just dropped the columns I didn't need from df2 and joined:

sliced_df = df2.select(columns_of_interest)
df1.join(sliced_df, on=['id'], how='left')
**id should be in `columns_of_interest` tho
df1.join(df2, ['id']).drop(df2.id)

If you need multiple columns from other pyspark dataframe then you can use this

based on single join condition

x.join(y, x.id == y.id,"left").select(x["*"],y["col1"],y["col2"],y["col3"])

based on multiple join condition

x.join(y, (x.id == y.id) & (x.no == y.no),"left").select(x["*"],y["col1"],y["col2"],y["col3"])