Reproduced:Introducing Apache Spark 2.0 Now generally available on Databricks

Easier: ANSI SQL and Streamlined APIs

One thing we are proud of in Spark is APIs that are simple, intuitive, and expressive. Spark 2.0 continues this tradition, focusing on two areas: (1) standard SQL support and (2) unifying DataFrame/Dataset API.

On the SQL side, we have significantly expanded Spark’s SQL support, with the introduction of a new ANSI SQL parser and subqueriesSpark 2.0 can run all the 99 TPC-DS queries, which require many of the SQL:2003 features. Because SQL has been one of the primary interfaces to Spark, these extended capabilities drastically reduce the effort of porting legacy applications.

On the programmatic API side, we have streamlined Spark’s APIs:

  • Unifying DataFrames and Datasets in Scala/Java: Starting in Spark 2.0, DataFrame is just a type alias for Dataset of Row. Both the typed methods (e.g. mapfiltergroupByKey) and the untyped methods (e.g. selectgroupBy) are available on the Dataset class. Also, this new combined Dataset interface is the abstraction used for Structured Streaming. Since compile-time type-safety is not a feature in Python and R, the concept of Dataset does not apply to these language APIs. Instead, DataFrame remains the primary interface there, and is analogous to the single-node data frame notion in these languages. Get a peek fromthis notebook and this blog for the stories behind these APIs.

  • SparkSession: a new entry point that supersedes SQLContext and HiveContext. For users of the DataFrame API, a common source of confusion for Spark is which “context” to use. Now you can use SparkSession, which subsumes both, as a single entry point, asdemonstrated in this notebook. Note that the old SQLContext and HiveContext classes are still kept for backward compatibility.

  • Simpler, more performant Accumulator API: We have designed a new Accumulator APIthat has a simpler type hierarchy and support specialization for primitive types. The old Accumulator API has been deprecated but retained for backward compatibility

  • DataFrame-based Machine Learning API emerges as the primary ML API: With Spark 2.0, the spark.ml package, with its “pipeline” APIs, will emerge as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API.

  • Machine learning pipeline persistence: Users can now save and load machine learning pipelines and models across all programming languages supported by Spark. See this blog post for more details and this notebook for examples.

  • Distributed algorithms in R: Added support for Generalized Linear Models (GLM), Naive Bayes, Survival Regression, and K-Means in R.

  • User-defined functions (UDFs) in R: Added support for running partition level UDFs (dapply and gapply) and hyper-parameter tuning (lapply).

Faster: Apache Spark as a Compiler

According to our 2015 Spark Survey, 91% of users consider performance as the most important aspect of Apache Spark. As a result, performance optimizations have always been a focus in our Spark development. Before we started planning our contributions to Spark 2.0, we asked ourselves a question: Spark is already pretty fast, but can we push the boundary and make Spark 10X faster?



我们为Spark而自豪的一件事是简单,直观和富有表现力的API。 Spark 2.0继续这一传统,关注两个领域:(1)标准SQL支持和(2)统一DataFrame / Dataset API。

在SQL方面,我们大大扩展了Spark的SQL支持,引入了一个新的ANSI SQL解析器和子查询。 Spark 2.0可以运行所有99个TPC-DS查询,这需要许多SQL:2003特性。因为SQL一直是Spark的主要接口之一,这些扩展功能大大降低了移植旧应用程序的工作量。


在Scala / Java中统一DataFrames和数据集:从Spark 2.0开始,DataFrame只是Row的Dataset的类型别名。类型方法(例如map,filter,groupByKey)和非类型化方法(例如select,groupBy)在Dataset类上可用。此外,这个新的组合数据集接口是用于结构化流的抽象。由于编译时类型安全性不是Python和R中的一个特性,因此Dataset的概念不适用于这些语言API。相反,DataFrame仍然是主接口,并且类似于这些语言中的单节点数据帧概念。从这个笔记本和这个博客了解这些API背后的故事。

SparkSession:一个新的入口点,取代SQLContext和HiveContext。对于DataFrame API的用户,Spark的常见混乱来源是使用哪个“上下文”。现在你可以使用SparkSession,它包含这两个,作为一个单一的入口点,如本笔记本中所示。注意,旧的SQLContext和HiveContext类仍然保持向后兼容性。

更简单,更高性能的累加器API:我们设计了一个新的累加器API,它具有更简单的类型层次结构,并支持基本类型的专门化。旧的Accumulator API已弃用,但为了向后兼容性而保留

基于DataFrame的机器学习API成为主要的ML API:在Spark 2.0中,spark.ml包及其“管道”API将成为主要的机器学习API。虽然原始spark.mllib包被保留,但未来的开发将集中在基于DataFrame的API上。




更快:Apache Spark作为编译器

根据我们的2015年Spark调查,91%的用户认为性能是Apache Spark最重要的方面。因此,性能优化始终是我们Spark开发的重点。在我们开始规划我们对Spark 2.0的贡献之前,我们问自己一个问题:Spark已经相当快,但是我们可以推动边界,使Spark 10X更快吗?