map和flatMap之间的区别是什么，以及它们各自的良好用例?

谁能给我解释一下map和flatMap之间的区别，以及它们各自的良好用例是什么?

“flatten the results”是什么意思? 它有什么好处?

当前回答

抽样。Map返回单个数组中的所有元素

抽样。flatMap返回数组数组中的元素

让我们假设在text.txt文件中有文本

Spark is an expressive framework
This text is to understand map and faltMap functions of Spark RDD

使用地图

val text=sc.textFile("text.txt").map(_.split(" ")).collect

输出:

text: **Array[Array[String]]** = Array(Array(Spark, is, an, expressive, framework), Array(This, text, is, to, understand, map, and, faltMap, functions, of, Spark, RDD))

使用flatMap

val text=sc.textFile("text.txt").flatMap(_.split(" ")).collect

输出:

 text: **Array[String]** = Array(Spark, is, an, expressive, framework, This, text, is, to, understand, map, and, faltMap, functions, of, Spark, RDD)

2018-09-12 06:50:09

其他回答

下面是一个不同的例子，作为一个spark-shell会话:

首先是一些数据——两行文本:

val rdd = sc.parallelize(Seq("Roses are red", "Violets are blue"))  // lines

rdd.collect

    res0: Array[String] = Array("Roses are red", "Violets are blue")

现在，map将一个长度为N的RDD转换为另一个长度为N的RDD。

例如，它将两行映射为两行长度:

rdd.map(_.length).collect

    res1: Array[Int] = Array(13, 16)

但是flatMap(松散地说)将长度为N的RDD转换为N个集合的集合，然后将这些集合平展为单个结果RDD。

rdd.flatMap(_.split(" ")).collect

    res2: Array[String] = Array("Roses", "are", "red", "Violets", "are", "blue")

我们每行有多个单词，而且每行有多行，但我们最终得到一个单词输出数组

为了说明这一点，从一个行集合到一个单词集合的flatMapping如下:

["aa bb cc", "", "dd"] => [["aa","bb","cc"],[],["dd"]] => ["aa","bb","cc","dd"]

因此，对于flatMap，输入和输出rdd通常具有不同的大小。

如果我们试图使用map与我们的split函数，我们将以嵌套结构结束(RDD的单词数组，类型为RDD[Array[String]])，因为我们必须对每个输入只有一个结果:

rdd.map(_.split(" ")).collect

    res3: Array[Array[String]] = Array(
                                     Array(Roses, are, red), 
                                     Array(Violets, are, blue)
                                 )

最后，一个有用的特殊情况是映射到一个可能不返回答案的函数，因此返回一个Option。我们可以使用flatMap过滤出返回None的元素，并从返回Some的元素中提取值:

val rdd = sc.parallelize(Seq(1,2,3,4))

def myfn(x: Int): Option[Int] = if (x <= 2) Some(x * 10) else None

rdd.flatMap(myfn).collect

    res3: Array[Int] = Array(10,20)

(注意这里Option的行为很像一个只有一个元素或者没有元素的列表)

2014-03-19 15:21:23

map和flatMap是相似的，从某种意义上说，它们从输入RDD中获取一行并在其上应用一个函数。它们的不同之处在于map中的函数只返回一个元素，而flatMap中的函数可以返回一个元素列表(0或更多)作为迭代器。

同样，flatMap的输出是扁平的。尽管flatMap中的函数返回一个元素列表，但flatMap返回一个RDD，其中以平面方式(而不是列表)包含列表中的所有元素。

2015-09-01 12:44:33

对于所有想要PySpark相关的人:

示例转换:flatMap

>>> a="hello what are you doing"
>>> a.split()

['hello'， 'what'， 'are'， 'you'， 'doing']

>>> b=["hello what are you doing","this is rak"]
>>> b.split()

回溯(最近一次调用): 文件“”，第1行，在 AttributeError: 'list'对象没有属性'split'

>>> rline=sc.parallelize(b)
>>> type(rline)

>>> def fwords(x):
...     return x.split()


>>> rword=rline.map(fwords)
>>> rword.collect()

[[‘你好’,‘什么’,‘是’,‘你’,‘做’],[‘这个’,‘是’,'爱你']]

>>> rwordflat=rline.flatMap(fwords)
>>> rwordflat.collect()

[‘你好’,‘什么’,‘是’,‘你’,‘做’,‘这’,‘是’,‘爱’)

希望能有所帮助。

2017-05-01 16:55:23

map返回相同数量元素的RDD，而flatMap可能不会。

flatMap过滤丢失或不正确数据的示例用例。

map在各种各样的情况下使用，其中输入和输出的元素数量是相同的。

number.csv

Map.py添加add.csv中的所有数字。

from operator import *

def f(row):
  try:
    return float(row)
  except Exception:
    return 0

rdd = sc.textFile('a.csv').map(f)

print(rdd.count())      # 7
print(rdd.reduce(add))  # 15.0

py使用flatMap在添加之前过滤掉缺失的数据。与以前的版本相比，增加的数字更少。

from operator import *

def f(row):
  try:
    return [float(row)]
  except Exception:
    return []

rdd = sc.textFile('a.csv').flatMap(f)

print(rdd.count())      # 5
print(rdd.reduce(add))  # 15.0

2016-02-14 23:20:29

区别可以从下面的pyspark代码示例中看到:

rdd = sc.parallelize([2, 3, 4])
rdd.flatMap(lambda x: range(1, x)).collect()
Output:
[1, 1, 2, 1, 2, 3]


rdd.map(lambda x: range(1, x)).collect()
Output:
[[1], [1, 2], [1, 2, 3]]

2017-11-17 11:00:09

map和flatMap之间的区别是什么，以及它们各自的良好用例?

推荐文章

最新文章

标签