Goal:
This article shares some Scala example codes to explain how to drop Null values in DataFrame/Dataset.
Solution:
DataFrameNaFunctions has methods named "drop" with different signatures to drop NULL values under different scenarios.
Let's create a sample Dataframe firstly as the data source:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | import org.apache.spark.sql.Row import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType, LongType, BooleanType} val simpleData = Seq(Row( "Jim" , "" , "Green" , 33333 , 3000.12 , 19605466456 L, true ), Row( "Tom" , "A" , "Smith" , 44444 , 4000.45 , 19886546456 L, null ), Row( "Jerry " , null , "Brown" , null , 5000.67 , null , false ), Row( "Henry " , "B" , "Jones" , 66666 , null , 20015464564 L, true ) ) val simpleSchema = StructType(Array( StructField( "firstname" ,StringType, true ), StructField( "middlename" ,StringType, true ), StructField( "lastname" ,StringType, true ), StructField( "zipcode" , IntegerType, true ), StructField( "salary" , DoubleType, true ), StructField( "account" , LongType, true ), StructField( "isAlive" , BooleanType, true ) )) val df = spark.createDataFrame(spark.sparkContext.parallelize(simpleData),simpleSchema) |
Data source and its schema look as below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | scala> df.printSchema() root |-- firstname : string (nullable = true ) |-- middlename : string (nullable = true ) |-- lastname : string (nullable = true ) |-- zipcode : integer (nullable = true ) |-- salary : double (nullable = true ) |-- account : long (nullable = true ) |-- isAlive : boolean (nullable = true ) scala> df.show() +---------+----------+--------+-------+-------+-----------+-------+ |firstname|middlename|lastname|zipcode| salary| account|isAlive| +---------+----------+--------+-------+-------+-----------+-------+ | Jim| | Green| 33333 | 3000.12 | 19605466456 | true | | Tom| A| Smith| 44444 | 4000.45 | 19886546456 | null | | Jerry | null | Brown| null | 5000.67 | null | false | | Henry | B| Jones| 66666 | null | 20015464564 | true | +---------+----------+--------+-------+-------+-----------+-------+ |
1. Drop rows containing NULL in any columns.(version 1)
Here only one row does not have NULL in any columns.
1 2 3 4 5 6 | scala> df.na.drop().show() +---------+----------+--------+-------+-------+-----------+-------+ |firstname|middlename|lastname|zipcode| salary| account|isAlive| +---------+----------+--------+-------+-------+-----------+-------+ | Jim| | Green| 33333 | 3000.12 | 19605466456 | true | +---------+----------+--------+-------+-------+-----------+-------+ |
2. Drop rows containing NULL in any columns. (version 2)
Same as above. This is just another version.
1 2 3 4 5 6 | scala> df.na.drop( "any" ).show() +---------+----------+--------+-------+-------+-----------+-------+ |firstname|middlename|lastname|zipcode| salary| account|isAlive| +---------+----------+--------+-------+-------+-----------+-------+ | Jim| | Green| 33333 | 3000.12 | 19605466456 | true | +---------+----------+--------+-------+-------+-----------+-------+ |
3. Drop rows containing NULL in all columns.
Here it shows all rows because there is no such all-NULL rows.
1 2 3 4 5 6 7 8 9 | scala> df.na.drop( "all" ).show() +---------+----------+--------+-------+-------+-----------+-------+ |firstname|middlename|lastname|zipcode| salary| account|isAlive| +---------+----------+--------+-------+-------+-----------+-------+ | Jim| | Green| 33333 | 3000.12 | 19605466456 | true | | Tom| A| Smith| 44444 | 4000.45 | 19886546456 | null | | Jerry | null | Brown| null | 5000.67 | null | false | | Henry | B| Jones| 66666 | null | 20015464564 | true | +---------+----------+--------+-------+-------+-----------+-------+ |
4. Drop rows containing NULL in any of specified column(s).
1 2 3 4 5 6 7 | scala> df.na.drop(Seq( "salary" , "account" )).show() +---------+----------+--------+-------+-------+-----------+-------+ |firstname|middlename|lastname|zipcode| salary| account|isAlive| +---------+----------+--------+-------+-------+-----------+-------+ | Jim| | Green| 33333 | 3000.12 | 19605466456 | true | | Tom| A| Smith| 44444 | 4000.45 | 19886546456 | null | +---------+----------+--------+-------+-------+-----------+-------+ |
5. Drop rows containing NULL in all of specified column(s).
1 2 3 4 5 6 7 8 9 | scala> df.na.drop( "all" ,Seq( "salary" , "account" )).show() +---------+----------+--------+-------+-------+-----------+-------+ |firstname|middlename|lastname|zipcode| salary| account|isAlive| +---------+----------+--------+-------+-------+-----------+-------+ | Jim| | Green| 33333 | 3000.12 | 19605466456 | true | | Tom| A| Smith| 44444 | 4000.45 | 19886546456 | null | | Jerry | null | Brown| null | 5000.67 | null | false | | Henry | B| Jones| 66666 | null | 20015464564 | true | +---------+----------+--------+-------+-------+-----------+-------+ |
6. Drop rows containing less than minNonNulls non-null values.
It means we keep the rows with at least minNonNulls non-null values.
Here I want to keep the rows with all 7 non-null values.
1 2 3 4 5 6 | scala> df.na.drop( 7 ).show() +---------+----------+--------+-------+-------+-----------+-------+ |firstname|middlename|lastname|zipcode| salary| account|isAlive| +---------+----------+--------+-------+-------+-----------+-------+ | Jim| | Green| 33333 | 3000.12 | 19605466456 | true | +---------+----------+--------+-------+-------+-----------+-------+ |
Note: Here is the Complete Sample Code.
No comments:
Post a Comment