[EDIT: March 2016: thanks for the votes! Though really, this is not the best answer. I think the solutions based on `withColumn`, `withColumnRenamed` and `cast` put forward by msemelman, Martin Senne and others are simpler and cleaner].
I think your approach is OK. Recall that a Spark `DataFrame` is an (immutable) RDD of Rows, so we're never really replacing a column, just creating a new `DataFrame` each time with a new schema.
Assuming you have an original df with the following schema:
    scala> df.printSchema
    root
     |-- Year: string (nullable = true)
     |-- Month: string (nullable = true)
     |-- DayofMonth: string (nullable = true)
     |-- DayOfWeek: string (nullable = true)
     |-- DepDelay: string (nullable = true)
     |-- Distance: string (nullable = true)
     |-- CRSDepTime: string (nullable = true)
And some UDFs defined on one or several columns:
    import org.apache.spark.sql.functions._

    val toInt    = udf[Int, String](_.toInt)
    val toDouble = udf[Double, String](_.toDouble)
    val toHour   = udf((t: String) => "%04d".format(t.toInt).take(2).toInt)
    val days_since_nearest_holidays = udf(
      (year: String, month: String, dayOfMonth: String) => year.toInt + 27 + month.toInt - 12
    )
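To see what these UDFs compute without starting Spark, here is a minimal sketch of the same logic as plain Scala functions. The object name `UdfLogic` and the sample inputs are made up for illustration; only the function bodies come from the UDFs above.

```scala
// Plain-Scala versions of the functions wrapped by the UDFs above,
// so the logic can be tried without a Spark session.
// `UdfLogic` is a hypothetical name used only for this sketch.
object UdfLogic {
  val toInt: String => Int = _.toInt
  val toDouble: String => Double = _.toDouble

  // Left-pad the time to four digits ("930" becomes "0930"),
  // then keep the first two digits as the hour.
  val toHour: String => Int = t => "%04d".format(t.toInt).take(2).toInt

  def main(args: Array[String]): Unit = {
    println(toHour("930"))    // 9
    println(toHour("1545"))   // 15
    println(toHour("5"))      // 0
    println(toDouble("12.5")) // 12.5
  }
}
```

Note that `toInt` and `toDouble` throw a `NumberFormatException` on non-numeric strings, so a column containing placeholder values would need to be cleaned or guarded first.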
Changing column types or even building a new DataFrame from another can be written like this:
    val featureDf = df
      .withColumn("departureDelay", toDouble(df("DepDelay")))
      .withColumn("departureHour", toHour(df("CRSDepTime")))
      .withColumn("dayOfWeek", toInt(df("DayOfWeek")))
      .withColumn("dayOfMonth", toInt(df("DayofMonth")))
      .withColumn("month", toInt(df("Month")))
      .withColumn("distance", toDouble(df("Distance")))
      .withColumn("nearestHoliday", days_since_nearest_holidays(
        df("Year"), df("Month"), df("DayofMonth")))
      .select("departureDelay", "departureHour", "dayOfWeek", "dayOfMonth",
        "month", "distance", "nearestHoliday")
which yields:
    scala> featureDf.printSchema
    root
     |-- departureDelay: double (nullable = true)
     |-- departureHour: integer (nullable = true)
     |-- dayOfWeek: integer (nullable = true)
     |-- dayOfMonth: integer (nullable = true)
     |-- month: integer (nullable = true)
     |-- distance: double (nullable = true)
     |-- nearestHoliday: integer (nullable = true)
This is pretty close to your own solution. Keeping the type changes and other transformations as separate `udf` `val`s simply makes the code more readable and reusable.
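For the pure type changes, the `withColumn` / `withColumnRenamed` / `cast` approach mentioned in the edit note at the top could look roughly like the sketch below. It assumes Spark 2.x or later (for `SparkSession`); the object name `CastSketch`, the helper `castColumns` and the one-row sample data are hypothetical stand-ins, not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical sketch: change column types with the built-in cast
// instead of a UDF. Only two of the columns are shown.
object CastSketch {
  def castColumns(df: DataFrame): DataFrame =
    df
      .withColumn("DepDelay", col("DepDelay").cast("double"))
      .withColumnRenamed("DepDelay", "departureDelay")
      .withColumn("Distance", col("Distance").cast("double"))
      .withColumnRenamed("Distance", "distance")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for trying this out.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up one-row stand-in for the original df (all columns are strings).
    val df = Seq(("-3.0", "810")).toDF("DepDelay", "Distance")

    val casted = castColumns(df)
    casted.printSchema() // both columns are now of type double
    df.printSchema()     // the original df is unchanged: still strings

    spark.stop()
  }
}
```

UDFs remain useful for transformations that are not plain type conversions, such as `toHour` above.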