
Answer by Svend for How can I change column types in Spark SQL's DataFrame?


[EDIT, March 2016: thanks for the votes! Though really, this is not the best answer: I think the solutions based on withColumn, withColumnRenamed and cast put forward by msemelman, Martin Senne and others are simpler and cleaner.]
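For reference, a minimal sketch of that cast-based style, using the same string columns as the example below (the target types are my assumption):

import org.apache.spark.sql.types.{DoubleType, IntegerType}

// withColumn with an existing column name replaces that column,
// so each line swaps a string column for a casted copy.
val castedDf = df
  .withColumn("DepDelay",  df("DepDelay").cast(DoubleType))
  .withColumn("DayOfWeek", df("DayOfWeek").cast(IntegerType))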

I think your approach is OK. Recall that a Spark DataFrame is an (immutable) RDD of Rows, so we never really replace a column; we just create a new DataFrame each time with a new schema.

Assuming you have an original df with the following schema:

scala> df.printSchema
root
 |-- Year: string (nullable = true)
 |-- Month: string (nullable = true)
 |-- DayofMonth: string (nullable = true)
 |-- DayOfWeek: string (nullable = true)
 |-- DepDelay: string (nullable = true)
 |-- Distance: string (nullable = true)
 |-- CRSDepTime: string (nullable = true)
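If you want to follow along without the original data, here is one hypothetical way to build a one-row DataFrame with that schema; the values are made up for illustration, and I'm assuming a Spark 1.x shell where sqlContext and its implicits are in scope:

// Invented sample row, only so the snippets below are runnable.
import sqlContext.implicits._

val df = Seq(
  ("1987", "10", "14", "3", "8", "447", "0730")
).toDF("Year", "Month", "DayofMonth", "DayOfWeek",
       "DepDelay", "Distance", "CRSDepTime")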

And some UDFs defined on one or several columns:

import org.apache.spark.sql.functions._

val toInt    = udf[Int, String]( _.toInt)
val toDouble = udf[Double, String]( _.toDouble)
val toHour   = udf((t: String) => "%04d".format(t.toInt).take(2).toInt )
val days_since_nearest_holidays = udf(
  (year: String, month: String, dayOfMonth: String) => year.toInt + 27 + month.toInt - 12
)
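It can help to sanity-check a UDF on its own before chaining them all; for example (column name as above, alias invented):

// Apply one UDF and inspect the result.
df.select(toHour(df("CRSDepTime")).as("hour")).show()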

Changing column types or even building a new DataFrame from another can be written like this:

val featureDf = df
  .withColumn("departureDelay",  toDouble(df("DepDelay")))
  .withColumn("departureHour",   toHour(df("CRSDepTime")))
  .withColumn("dayOfWeek",       toInt(df("DayOfWeek")))
  .withColumn("dayOfMonth",      toInt(df("DayofMonth")))
  .withColumn("month",           toInt(df("Month")))
  .withColumn("distance",        toDouble(df("Distance")))
  .withColumn("nearestHoliday",  days_since_nearest_holidays(
    df("Year"), df("Month"), df("DayofMonth"))
  )
  .select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
          "month", "distance", "nearestHoliday")

which yields:

scala> featureDf.printSchema
root
 |-- departureDelay: double (nullable = true)
 |-- departureHour: integer (nullable = true)
 |-- dayOfWeek: integer (nullable = true)
 |-- dayOfMonth: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- distance: double (nullable = true)
 |-- nearestHoliday: integer (nullable = true)

This is pretty close to your own solution. Keeping the type changes and other transformations as separate udf vals simply makes the code more readable and reusable.

