Filter or include dataframe rows based on a list

I am trying to filter a dataframe in Spark using a list. I want to either filter out the records based on the list, or include only those records whose value is in the list. My code below does not work:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])


# define a list of scores
l = [10,18,20]


# exclude records whose score is in list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)


# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

I get the following error: ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.


What the error says is that "df.score in l" cannot be evaluated, because "df.score" gives you a Column and "in" is not defined on that column type. Use "isin()" instead.

The code should be like this:

# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])


# define a list of scores
l = [10,18,20]


# exclude records whose score is in list l
records = df.filter(~df.score.isin(l))
# expected: (0,1), (0,1), (0,2), (1,2)


# include only records with these scores in list l
df.filter(df.score.isin(l))
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)

Note that where() is an alias for filter(), so both are interchangeable.
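
For example, here is a minimal sketch reusing the df and l defined above; both calls return the same rows:

# filter() and where() are interchangeable on a DataFrame
df.filter(df.score.isin(l)).show()
df.where(df.score.isin(l)).show()
# both show (1,10), (1,20), (3,18), (3,18), (3,18)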

I found the join implementation to be significantly faster than where() for large dataframes:

from pyspark.sql import SparkSession

def filter_spark_dataframe_by_list(df, column_name, filter_list):
    """ Returns subset of df where df[column_name] is in filter_list """
    spark = SparkSession.builder.getOrCreate()
    # a single-column DataFrame built from a plain Python list gets the column name "value"
    filter_df = spark.createDataFrame(filter_list, df.schema[column_name].dataType)
    return df.join(filter_df, df[column_name] == filter_df["value"])
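
A minimal usage sketch, reusing the df and score list defined above (the joined result keeps the helper's extra "value" column, which can be dropped afterwards):

# keep only rows whose score appears in the list, via the join-based helper
records = filter_spark_dataframe_by_list(df, "score", [10, 18, 20])
# drop the "value" column that comes from filter_df
records = records.drop("value")
records.show()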

Based on @user3133475's answer, it is also possible to call isin() on F.col(), like this:

import pyspark.sql.functions as F

l = [10,18,20]
df.filter(F.col("score").isin(l))
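
The exclusion case from the question works the same way by negating the condition with ~, for example:

# exclude rows whose score appears in l
df.filter(~F.col("score").isin(l)).show()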