Difference between DataFrame, Dataset, and RDD in Spark

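I'm just wondering what the difference is between an RDD and a DataFrame in Apache Spark (as of Spark 2.0.0, DataFrame is merely a type alias for Dataset[Row]).

Can you convert one to the other?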

Current answer

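Most of the answers are correct; I only want to add one point.

In Spark 2.0, the two APIs (DataFrame and Dataset) are unified into a single API.

"Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface."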
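To make the type-alias point concrete, here is a minimal sketch, assuming Spark 2.x or later running in local mode. The object name `AliasSketch` and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object AliasSketch {
  // This compiles only because DataFrame and Dataset[Row] are the same type;
  // no conversion happens here.
  def asDatasetOfRow(df: DataFrame): Dataset[Row] = df

  def main(args: Array[String]): Unit = {
    // Local session, assumed only for this illustration
    val spark = SparkSession.builder().master("local[*]").appName("alias-sketch").getOrCreate()
    import spark.implicits._

    val df: DataFrame = Seq(("Ann", 34L), ("Bob", 28L)).toDF("name", "age")
    val rows: Dataset[Row] = asDatasetOfRow(df)
    rows.show()

    spark.stop()
  }
}
```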

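Datasets are similar to RDDs; however, instead of using Java serialization or Kryo, they use a specialized Encoder to serialize objects for processing or for transmission over the network.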
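As a rough illustration of what an Encoder carries, the sketch below builds one for a hypothetical `Person` case class and compares it with a Kryo-based encoder. The names `EncoderSketch` and `Person` are assumptions made for this example.

```scala
import org.apache.spark.sql.{Encoder, Encoders}

object EncoderSketch {
  // Hypothetical record type, used only for this illustration
  final case class Person(name: String, age: Long)

  // Product encoder: Spark knows the individual fields and their types
  val personEncoder: Encoder[Person] = Encoders.product[Person]

  // Kryo encoder: the whole object is stored as one opaque binary column
  val kryoEncoder: Encoder[Person] = Encoders.kryo[Person]

  def main(args: Array[String]): Unit = {
    println(personEncoder.schema.treeString)
    println(kryoEncoder.schema.treeString)
  }
}
```

Because the product encoder exposes a column per field, Spark can work on individual fields without deserializing whole objects, which is not possible with the single binary column produced by Kryo.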

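Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection-based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that lets you construct a schema and then apply it to an existing RDD. While this method is more verbose, it lets you construct Datasets when the columns and their types are not known until runtime.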
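The two approaches described above can be sketched as follows, assuming Spark 2.x or later in local mode. The `Person` case class, the column names, and the sample data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object RddConversionSketch {
  // Hypothetical record type, used only for this illustration
  final case class Person(name: String, age: Long)

  // Schema built by hand, as one would do when it is only known at runtime
  val schema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = false),
    StructField("age", LongType, nullable = false)
  ))

  // Method 1: reflection. The schema is inferred from the case class.
  def byReflection(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(("Ann", 34L), ("Bob", 28L)))
    rdd.map { case (name, age) => Person(name, age) }.toDS()
  }

  // Method 2: programmatic. An explicit schema is applied to an RDD[Row].
  def bySchema(spark: SparkSession): DataFrame = {
    val rowRdd = spark.sparkContext
      .parallelize(Seq(("Ann", 34L), ("Bob", 28L)))
      .map { case (name, age) => Row(name, age) }
    spark.createDataFrame(rowRdd, schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("rdd-conversion").getOrCreate()
    byReflection(spark).printSchema()
    bySchema(spark).printSchema()
    spark.stop()
  }
}
```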

You can find an answer on RDD-to-DataFrame conversion here:

How to convert an RDD object to a DataFrame in Spark

2016-11-20 13:53:39

Other answers

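Because a DataFrame is weakly typed, developers don't get the benefits of the type system. For example, say you want to read something from SQL and run some aggregation on it: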

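import org.apache.spark.sql.functions.{avg, max}

// Both reads return untyped DataFrames
val people = sqlContext.read.parquet("...")
val department = sqlContext.read.parquet("...")

people.filter("age > 30")
  .join(department, people("deptId") === department("id"))
  // groupBy takes either all Columns or all Strings, not a mix of both
  .groupBy(department("name"), people("gender"))
  .agg(avg(people("salary")), max(people("age")))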

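When you say people("deptId"), you don't get back an Int or a Long; you get back a Column object that you then have to operate on. In a language with a rich type system such as Scala, you end up losing all type safety, which increases the number of runtime errors for mistakes that could have been caught at compile time.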

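By contrast, Dataset[T] is typed. When you do this:

// Requires import sqlContext.implicits._ and org.apache.spark.sql.Dataset
val people: Dataset[People] = sqlContext.read.parquet("...").as[People]

you actually get back a Dataset of People objects, where deptId is an actual integral type rather than a Column, so you can take advantage of the type system.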
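The difference can be sketched with in-memory data instead of a Parquet file. This is a minimal example under stated assumptions: the fields of `People` (`name`, `deptId`, `age`) and the sample rows are hypothetical, and Spark runs in local mode.

```scala
import org.apache.spark.sql.{Column, Dataset, SparkSession}

object TypedAccessSketch {
  // Hypothetical shape of the People record, for illustration only
  final case class People(name: String, deptId: Long, age: Long)

  def sample(spark: SparkSession): Dataset[People] = {
    import spark.implicits._
    Seq(People("Ann", 1L, 34L), People("Bob", 2L, 28L)).toDS()
  }

  // Typed API: p.age and p.deptId are plain Longs, checked by the compiler.
  // A typo such as p.depId would fail at compile time.
  def deptIdsOver30(spark: SparkSession): Array[Long] = {
    import spark.implicits._
    sample(spark).filter(p => p.age > 30).map(p => p.deptId).collect()
  }

  // Untyped API: the same field is only a Column, resolved at runtime.
  def deptIdColumn(spark: SparkSession): Column = {
    val df = sample(spark).toDF()
    df("deptId")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-access").getOrCreate()
    println(deptIdsOver30(spark).mkString(", "))
    println(deptIdColumn(spark))
    spark.stop()
  }
}
```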

As of Spark 2.0, the DataFrame and Dataset APIs are unified, with DataFrame being a type alias for Dataset[Row].

2016-05-18 13:39:42

Conceptually, a DataFrame is an RDD that has a schema. You can think of it as a relational database table, in that each column has a name and a known type. The power of DataFrames comes from the fact that, when you create a DataFrame from a structured dataset (JSON, Parquet, ...), Spark can determine a schema for it: for JSON by making a pass over the data being loaded, and for Parquet by reading the schema stored in the files. Then, when calculating the execution plan, Spark can use the schema to apply substantially better optimizations. Note that DataFrame was called SchemaRDD before Spark 1.3.0.
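A small sketch of schema inference, assuming Spark 2.2 or later (where `spark.read.json` accepts a `Dataset[String]`) in local mode. The JSON records are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SchemaInferenceSketch {
  // Spark scans the JSON records and infers column names and types
  def inferFromJson(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val jsonLines = Seq(
      """{"name": "Ann", "age": 34}""",
      """{"name": "Bob", "age": 28}"""
    ).toDS()
    spark.read.json(jsonLines)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("schema-inference").getOrCreate()
    inferFromJson(spark).printSchema()
    spark.stop()
  }
}
```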

2018-01-07 22:35:19

From a usage perspective, RDD vs DataFrame:

1. RDDs are amazing because they give us the flexibility to deal with almost any kind of data: unstructured, semi-structured, and structured. Data is often not ready to fit into a DataFrame (even JSON), so RDDs can be used to preprocess it until it does. RDDs are the core data abstraction in Spark.
2. Not all transformations that are possible on an RDD are possible on a DataFrame; for example, subtract() is for RDDs while except() is for DataFrames.
3. Since DataFrames are like relational tables, they follow strict rules for set/relational transformations. For example, to union two DataFrames, both must have the same number of columns with matching column data types. Column names can differ. These rules don't apply to RDDs. Here is a good tutorial explaining these facts.
4. There are performance gains when using DataFrames, as others have already explained in depth.
5. Using DataFrames, you don't need to pass arbitrary functions as you do when programming with RDDs.
6. You need a SQLContext/HiveContext to program DataFrames, as they live in the Spark SQL area of the Spark ecosystem, whereas for RDDs you only need a SparkContext/JavaSparkContext, which live in the Spark Core libraries.
7. You can create a DataFrame from an RDD if you can define a schema for it.
8. You can also convert a DataFrame to an RDD and an RDD to a DataFrame.
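Some of these points can be sketched in code, assuming Spark 2.x or later in local mode (where a SparkSession wraps both the SparkContext and the SQL entry point). The column names `x` and `y` and the sample values are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SetOpsSketch {
  // RDD set difference uses subtract()
  def rddSubtract(spark: SparkSession): Seq[Int] = {
    val sc = spark.sparkContext
    sc.parallelize(Seq(1, 2, 3)).subtract(sc.parallelize(Seq(2))).collect().sorted.toSeq
  }

  // DataFrame set difference uses except(); columns are matched by position
  def dfExcept(spark: SparkSession): Seq[Int] = {
    import spark.implicits._
    val left = Seq(1, 2, 3).toDF("x")
    val right = Seq(2).toDF("y")
    left.except(right).as[Int].collect().sorted.toSeq
  }

  // union() also matches by position; the result keeps the left-hand column names
  def unionColumns(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    Seq(1, 2, 3).toDF("x").union(Seq(2).toDF("y")).columns.toSeq
  }

  // DataFrame -> RDD[Row] -> DataFrame round trip
  def roundTrip(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(1, 2, 3).toDF("x")
    df.rdd.map(row => row.getInt(0)).toDF("x")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("set-ops").getOrCreate()
    println(rddSubtract(spark))
    println(dfExcept(spark))
    println(unionColumns(spark))
    roundTrip(spark).show()
    spark.stop()
  }
}
```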

I hope this helps!

2017-08-25 21:10:09

Spark RDD (Resilient Distributed Dataset):

RDD is the core data abstraction API and has been available since the earliest releases of Spark. It is a lower-level API for manipulating a distributed collection of data. The RDD API exposes some extremely useful methods that give very tight control over the underlying physical data structure. An RDD is an immutable (read-only) collection of partitioned data distributed across different machines. RDDs enable in-memory computation on large clusters to speed up big data processing in a fault-tolerant manner. To enable fault tolerance, Spark tracks RDDs in a DAG (Directed Acyclic Graph), which consists of a set of vertices and edges. The vertices represent RDDs and the edges represent the operations to be applied to those RDDs. Transformations defined on an RDD are lazy and execute only when an action is called.
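The lazy behaviour can be observed with an accumulator, as in the sketch below. It assumes Spark 2.x or later in local mode with no task retries; the accumulator name and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object LazyRddSketch {
  // Returns how many times the map function ran before and after the action
  def callsBeforeAndAfter(spark: SparkSession): (Long, Long) = {
    val sc = spark.sparkContext
    val calls = sc.longAccumulator("map-calls")

    // Transformation only: nothing is computed yet
    val doubled = sc.parallelize(Seq(1, 2, 3, 4)).map { x =>
      calls.add(1)
      x * 2
    }
    val before = calls.sum

    // Action: triggers the actual computation
    doubled.collect()
    val after = calls.sum

    (before, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lazy-rdd").getOrCreate()
    println(callsBeforeAndAfter(spark))
    // The lineage Spark records for fault tolerance
    println(spark.sparkContext.parallelize(Seq(1, 2, 3)).map(_ * 2).toDebugString)
    spark.stop()
  }
}
```

Accumulator updates made inside transformations can be applied more than once if tasks are re-executed, so the count here is only a demonstration device, not something to rely on in production code.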

Spark DataFrame

Spark 1.3 introduced the DataFrame API (the Dataset API followed in Spark 1.6). The DataFrame API organizes data into named columns, like a table in a relational database. It enables programmers to define a schema on a distributed collection of data. Each row in a DataFrame is an object of type Row. Like an SQL table, every column in a DataFrame must have the same number of rows. In short, a DataFrame is a lazily evaluated plan that specifies the operations to be performed on the distributed collection of data. A DataFrame is also an immutable collection.

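Spark Dataset:

As an extension to the DataFrame API, Spark 1.6 introduced the Dataset API, which provides a strictly typed, object-oriented programming interface in Spark. A Dataset is an immutable, type-safe collection of distributed data. Like the DataFrame API, the Dataset API uses the Catalyst engine for execution optimization.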

2019-07-02 18:37:51
