Apache Spark 2.0: An Anthology of Technical Assets
Webinars, videos, blogs, news articles, notebooks, and podcasts to peruse
Older anthologies collected contributions from various authors around a theme, bound as a journal or periodical. Newer anthologies span multiple modes of expression, digitized as an ebook or a blog. Both offer an exposition of their subject matter, whatever their form. In this anthology, I’ve compiled a collection of videos, technical blogs, notebooks, webinars, podcasts, and news articles that focus on Apache Spark 2.0, now generally available. You can try Apache Spark 2.0 in two places: on Databricks, where it is now generally available, or by downloading the release from the Apache Spark website. In the coming weeks, we’ll publish a series of posts on Spark 2.0 and will update this anthology. You might want to bookmark this page!
Spark Summit East Keynote: Apache Spark 2.0
Databricks’ CTO Matei Zaharia thanks the community for its contributions and previews Apache Spark 2.0’s three themes: simplicity, speed, and unification.
Structuring Spark: DataFrames, Datasets, and Streaming
Apache Spark committer and Databricks’ engineer Michael Armbrust sets the stage for why structure, as applied to data, is relevant, and how it affects the design of DataFrame and Dataset APIs and Streaming in Apache Spark 2.0.
A Deep-Dive in Structured Streaming in Apache Spark 2.0
Databricks’ Spark committer Tathagata Das gives a tech talk on how Structured Streaming works under the hood.
Apache Spark 2.0: Easier, Faster & Smarter
Apache Spark committer and Chief Architect at Databricks Reynold Xin and Spark Community Evangelist Jules S. Damji preview Apache Spark 2.0 and showcase salient features in Databricks notebooks running a pre-release of Spark 2.0.
Introducing Apache Spark 2.0 Now Generally Available on Databricks
In this more in-depth companion to the webinar, Matei Zaharia, Reynold Xin, and Michael Armbrust expound on the three thrusts behind Apache Spark 2.0 (speed, simplicity, and structured streaming), with notebooks running on Databricks.
Approximate Algorithms in Apache Spark: HyperLogLog and Quantiles
Databricks’ engineers Tim Hunter, Hossein Falaki, and Joseph Bradley showcase two approximation algorithms, for counting distinct elements and computing quantiles in large datasets, using a pre-release preview of Apache Spark 2.0 on Databricks.
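To make the two techniques concrete, here is a minimal sketch, not taken from the blog post, of what they look like in Spark 2.x Scala code; the column names and error tolerances are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approxCountDistinct

// Runs in spark-shell 2.x or a Databricks Scala notebook; getOrCreate() reuses an existing session.
val spark = SparkSession.builder().appName("approx-sketch").getOrCreate()
import spark.implicits._

// A toy DataFrame standing in for a large dataset.
val df = spark.range(0, 1000000).selectExpr("id", "id % 97 as userId", "rand() as latency")

// Approximate distinct count, backed by HyperLogLog++, with a 5% relative error.
df.select(approxCountDistinct($"userId", 0.05)).show()

// Approximate quantiles: median, 90th, and 99th percentile of latency, within 1% relative error.
val quantiles = df.stat.approxQuantile("latency", Array(0.5, 0.9, 0.99), 0.01)
println(quantiles.mkString(", "))
```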
Apache Spark as a Compiler: Joining a Billion Rows on your Laptop
Apache Spark is already pretty fast, but can we make it 10x faster? Reynold Xin, Sameer Agarwal, and Davies Liu explain how Tungsten’s whole-stage code generation makes it so.
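To see whole-stage code generation at work yourself, here is a small sketch in the same spirit as the post (not its actual benchmark): operators fused into a single generated function are marked with an asterisk in the physical plan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codegen-sketch").getOrCreate()

// Sum a billion integers; Catalyst collapses the range, projection, and aggregate
// into one generated function instead of a chain of iterators.
val query = spark.range(1000L * 1000 * 1000).selectExpr("sum(id)")

// Stages fused by whole-stage code generation appear with a leading '*' in the plan.
query.explain()
query.show()
```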
Efficiently Compiling Efficient Query Plans for Modern Hardware
Adrian Colyer, former CTO of SpringSource, explores influential and important papers in the world of computer science in his blog, The Morning Paper.
Spark With Tungsten Burns Brighter
Paige Roberts of Syncsort opines that Tungsten represents a huge leap forward for Apache Spark, particularly in performance, and explains how it works and why it makes Spark faster.
Structured Streaming Comes to Apache Spark 2.0
O’Reilly’s Chief Data Scientist Ben Lorica sits down with Michael Armbrust and talks about life and Structured Streaming.
What Spark’s Structured Streaming Really Means
Ian Pointer, a contributor for InfoWorld, argues that DataFrames are the best choice for streaming in Spark 2.0 and explains why Structured Streaming makes sense.
Apache Spark 2.0 Preview: Machine Learning Model Persistence
Databricks’ engineer Joseph Bradley shares the benefits of machine learning model persistence in the Spark 2.0 preview and shows how you can save and load ML pipelines across Spark’s multiple language APIs.
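As a rough sketch of the capability being described, assuming a simple pipeline and a scratch path of my own choosing, saving and re-loading a fitted model looks roughly like this:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ml-persistence-sketch").getOrCreate()

// Tiny labeled dataset purely for illustration.
val training = spark.createDataFrame(Seq(
  (1.0, 0.5, 1.2),
  (0.0, 1.5, 0.3),
  (1.0, 0.1, 2.2)
)).toDF("label", "f1", "f2")

val assembler = new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val model = new Pipeline().setStages(Array(assembler, lr)).fit(training)

// Persist the fitted pipeline; the on-disk format is language-neutral,
// so it can be loaded back from Scala, Java, Python, or R.
model.write.overwrite().save("/tmp/spark-lr-pipeline")

// Later, possibly from another language binding, load it back and reuse it.
val restored = PipelineModel.load("/tmp/spark-lr-pipeline")
restored.transform(training).show()
```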
How to Process IoT Data Using Datasets APIs
A Databricks Community Edition notebook showcasing the Apache Spark 2.0 Dataset APIs.
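The flavor of the notebook is typed, Dataset-based processing of device events. Here is a minimal sketch under assumptions of my own: the DeviceReading case class, its fields, and the JSON path are hypothetical, not the notebook’s actual schema.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical schema for illustration only.
case class DeviceReading(device: String, country: String, temperature: Double, humidity: Double)

val spark = SparkSession.builder().appName("iot-dataset-sketch").getOrCreate()
import spark.implicits._

// Read JSON device events as a strongly typed Dataset[DeviceReading].
val readings = spark.read.json("/path/to/iot-events.json").as[DeviceReading]

// Typed, compile-time-checked transformations with plain Scala lambdas...
val hot = readings.filter(r => r.temperature > 30.0)

// ...mixed freely with untyped, column-based aggregations.
hot.groupBy($"country").count().show()
```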
SQL Subqueries in Apache Spark 2.0
Databricks’ engineers Davies Liu and Herman van Hövell provide hands-on examples of scalar and predicate subqueries.
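To illustrate the two kinds of subquery, here is a small sketch with made-up tables (orders and customers, registered as temporary views), not the examples from the post itself:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("subquery-sketch").getOrCreate()
import spark.implicits._

Seq((1, "alice", 120.0), (2, "bob", 80.0), (3, "carol", 200.0))
  .toDF("id", "name", "total").createOrReplaceTempView("orders")
Seq((1, "US"), (3, "DE")).toDF("id", "country").createOrReplaceTempView("customers")

// Scalar subquery: the inner query returns a single value and is used like an expression.
spark.sql(
  """SELECT name, total
    |FROM orders
    |WHERE total > (SELECT avg(total) FROM orders)""".stripMargin).show()

// Predicate subquery: IN (or NOT IN / EXISTS) against another query in the WHERE clause.
spark.sql(
  """SELECT name
    |FROM orders
    |WHERE id IN (SELECT id FROM customers)""".stripMargin).show()
```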
A Tale of Three Apache Spark APIs: RDDs, DataFrames and Datasets
Databricks’ Spark Community Evangelist Jules S. Damji tells the tale of three Spark APIs: when to use them and why.
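As a thumbnail of the trade-offs the post walks through, here is one toy computation, filtering adults out of a small collection, expressed against all three APIs (a sketch of mine, not code from the post):

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("three-apis-sketch").getOrCreate()
import spark.implicits._

val people = Seq(Person("Ann", 34), Person("Ben", 19), Person("Cara", 46))

// RDD: low-level and functional; you spell out *how* to compute, with no Catalyst optimization.
val adultsRdd = spark.sparkContext.parallelize(people).filter(_.age >= 21).map(_.name)

// DataFrame: untyped Rows with a schema; you declare *what* you want and Catalyst optimizes it.
val adultsDf = people.toDF().filter($"age" >= 21).select($"name")

// Dataset: the DataFrame optimizer plus compile-time type safety on the Person fields.
val adultsDs = people.toDS().filter(_.age >= 21).map(_.name)

println(adultsRdd.collect().mkString(", "))
adultsDf.show()
adultsDs.show()
```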
Spark 2.0 – Datasets and case classes
Daniel Pape, an analytics engineer at codecentric, explores and explains the type-safety features of the Dataset API through code examples using Scala case classes.
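The core point, that field names and types are checked by the Scala compiler rather than discovered at runtime, can be seen in a sketch like the following (my own illustrative case class, not one from the article):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative case class, not taken from the codecentric post.
case class Transaction(accountId: Long, amount: Double)

val spark = SparkSession.builder().appName("dataset-typesafety-sketch").getOrCreate()
import spark.implicits._

val ds = Seq(Transaction(1L, 99.5), Transaction(2L, 15.0)).toDS()

// Typed operations: the compiler checks field names and types.
val large = ds.filter(t => t.amount > 50.0)
// ds.filter(t => t.amont > 50.0)      // typo in a field name -> does not compile

// The equivalent untyped DataFrame expression only fails at runtime:
// ds.toDF().filter($"amont" > 50.0)   // typo in a column name -> AnalysisException when run

large.show()
```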
Continuous Applications: Evolving Streaming in Apache Spark 2.0
Databricks’ Co-founder and CTO Matei Zaharia shares his vision for end-to-end streaming applications, called continuous applications, built with the Structured Streaming APIs in Apache Spark 2.0.
Structured Streaming in Apache Spark 2.0: A new high-level API for streaming (https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html)
Messrs Matei Zaharia, Tathagata Das, Reynold Xin, and Michael Armbrust explain the challenges of writing end-to-end streaming applications, called continuous applications, and elaborate on why and how Structured Streaming makes them simpler.
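To give a feel for the API these posts describe, here is the classic streaming word count as a sketch (the socket source and port are just a convenient local stand-in for a real stream): the same DataFrame and Dataset operations used in batch define an incrementally maintained streaming query.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("structured-streaming-sketch").getOrCreate()
import spark.implicits._

// Treat a socket as an unbounded input table (for a local experiment, run `nc -lk 9999` and type lines).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same operations you would write for a batch DataFrame define the streaming computation.
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy($"value")
  .count()

// The engine incrementalizes the query and keeps the running counts up to date.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```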
How to Use SparkSessions in Apache Spark 2.0
Databricks’ Spark Community Evangelist Jules S. Damji explores SparkSession functionality in Spark 2.0.
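In broad strokes, SparkSession is the single entry point in Spark 2.0, subsuming the older SQLContext and HiveContext. Here is a minimal sketch of creating one and touching configuration, catalog, and data access; the app name and config values are arbitrary choices of mine.

```scala
import org.apache.spark.sql.SparkSession

// Build (or reuse) the unified entry point; it wraps a SparkContext underneath.
val spark = SparkSession.builder()
  .appName("sparksession-sketch")
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()

// Runtime configuration, catalog metadata, and data access all hang off the session.
println(spark.conf.get("spark.sql.shuffle.partitions"))
spark.catalog.listTables().show()
spark.range(5).show()

// The underlying SparkContext is still reachable when needed.
println(spark.sparkContext.version)
```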