Thursday, January 28, 2021

Spark Code -- How to drop Null values in DataFrame/Dataset

Goal:

This article shares some Scala example codes to explain how to drop Null values in DataFrame/Dataset.

Solution:

DataFrameNaFunctions has methods named "drop" with different signatures to drop NULL values under different scenarios.

Let's create a sample Dataframe firstly as the data source:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType}
  
val simpleData = Seq(Row("Jim","","Green",33333,3000.12,19605466456L,true),
    Row("Tom","A","Smith",44444,4000.45,19886546456L,null),
    Row("Jerry ",null,"Brown",null,5000.67,null,false),
    Row("Henry ","B","Jones",66666,null,20015464564L,true)
   )
  
val simpleSchema = StructType(Array(
    StructField("firstname",StringType,true),
    StructField("middlename",StringType,true),
    StructField("lastname",StringType,true),
    StructField("zipcode", IntegerType, true),
    StructField("salary", DoubleType, true),
    StructField("account", LongType, true),
    StructField("isAlive", BooleanType, true)
  ))
  
val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema)

Data source and its schema look as below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
scala> df.printSchema()
root
 |-- firstname: string (nullable = true)
 |-- middlename: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- zipcode: integer (nullable = true)
 |-- salary: double (nullable = true)
 |-- account: long (nullable = true)
 |-- isAlive: boolean (nullable = true)
  
  
scala> df.show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
|      Tom|         A|   Smith|  44444|4000.45|19886546456|   null|
|   Jerry |      null|   Brown|   null|5000.67|       nullfalse|
|   Henry |         B|   Jones|  66666|   null|20015464564|   true|
+---------+----------+--------+-------+-------+-----------+-------+

1. Drop rows containing NULL in any columns.(version 1)

Here only one row does not have NULL in any columns.

1
2
3
4
5
6
scala> df.na.drop().show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
+---------+----------+--------+-------+-------+-----------+-------+

2. Drop rows containing NULL in any columns. (version 2)

Same as above. This is just another version.

1
2
3
4
5
6
scala> df.na.drop("any").show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
+---------+----------+--------+-------+-------+-----------+-------+

 3. Drop rows containing NULL in all columns.

Here it shows all rows because there is no such all-NULL rows.

1
2
3
4
5
6
7
8
9
scala> df.na.drop("all").show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
|      Tom|         A|   Smith|  44444|4000.45|19886546456|   null|
|   Jerry |      null|   Brown|   null|5000.67|       nullfalse|
|   Henry |         B|   Jones|  66666|   null|20015464564|   true|
+---------+----------+--------+-------+-------+-----------+-------+

 4. Drop rows containing NULL in any of specified column(s).

1
2
3
4
5
6
7
scala> df.na.drop(Seq("salary","account")).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
|      Tom|         A|   Smith|  44444|4000.45|19886546456|   null|
+---------+----------+--------+-------+-------+-----------+-------+

5. Drop rows containing NULL in all of specified column(s).

1
2
3
4
5
6
7
8
9
scala> df.na.drop("all",Seq("salary","account")).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
|      Tom|         A|   Smith|  44444|4000.45|19886546456|   null|
|   Jerry |      null|   Brown|   null|5000.67|       nullfalse|
|   Henry |         B|   Jones|  66666|   null|20015464564|   true|
+---------+----------+--------+-------+-------+-----------+-------+

6. Drop rows containing less than minNonNulls non-null values. 

It means we keep the rows with at least minNonNulls non-null values. 

Here I want to keep the rows with all 7 non-null values.

1
2
3
4
5
6
scala> df.na.drop(7).show()
+---------+----------+--------+-------+-------+-----------+-------+
|firstname|middlename|lastname|zipcode| salary|    account|isAlive|
+---------+----------+--------+-------+-------+-----------+-------+
|      Jim|          |   Green|  33333|3000.12|19605466456|   true|
+---------+----------+--------+-------+-------+-----------+-------+

Note: Here is the Complete Sample Code.

No comments:

Post a Comment

Popular Posts