Edit: Newest version
Since Spark 2.x you should use the Dataset API instead when using Scala; see the Spark SQL programming guide [1].
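For illustration, here is a minimal Scala sketch of that route (the Car case class, the sample rows, and the column names are assumptions made up for this example, not part of the original question):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.IntegerType

// Hypothetical case class, used only to illustrate the typed Dataset API.
case class Car(year: Int, make: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// An untyped DataFrame where "year" arrives as a string...
val df = Seq(("2012", "Tesla"), ("1997", "Ford")).toDF("year", "make")

// ...cast the column and move to a strongly typed Dataset[Car].
val cars = df.select($"year".cast(IntegerType).as("year"), $"make").as[Car]
cars.printSchema()  // year: integer, make: string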
If you are working with Python, where this is easier, I leave the docs link here as well, since this is a very highly voted question:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html
>>> df.withColumn('age2', df.age + 2).collect()
[Row(age=2, name='Alice', age2=4), Row(age=5, name='Bob', age2=7)]
[1] https://spark.apache.org/docs/latest/sql-programming-guide.html:
In the Scala API, DataFrame is simply a type alias of Dataset[Row]. While, in Java API, users need to use Dataset<Row> to represent a DataFrame.
Edit: Newer version
Since Spark 2.x you can use .withColumn together with cast to change a column's type in place, as in the sketch below.
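A minimal sketch of that, assuming a DataFrame df with a string column "year" (the sample data here is invented to mirror the example in the oldest answer):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("2012", "Tesla"), ("1997", "Ford")).toDF("year", "make")

// Since Spark 2.x, withColumn can overwrite an existing column directly,
// so the temporary-column/drop/rename steps of the old approach are not needed.
val df2 = df.withColumn("year", col("year").cast("int"))
df2.printSchema()  // "year" is now int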
Oldest answer
Since Spark version 1.4 you can apply the cast method with DataType on the column:
import org.apache.spark.sql.types.IntegerType

// Cast into a temporary column, drop the original, then rename it back.
val df2 = df.withColumn("yearTmp", df("year").cast(IntegerType))
  .drop("year")
  .withColumnRenamed("yearTmp", "year")
If you are using SQL expressions you can also do:
val df2 = df.selectExpr("cast(year as int) year", "make", "model", "comment", "blank")
For more info check the docs: http://spark.apache.org/docs/1.6.0/api/scala/#org.apache.spark.sql.DataFrame