Course: Data Platforms: Spark to Snowflake
Apache Spark
- [Dr. Berman] Apache Spark, which can sit on top of Hadoop, is also a big data analytics engine or platform. It keeps as much of the data in memory as possible. This means that it's generally faster than Hadoop's disk-based MapReduce, especially for iterative work such as running machine learning algorithms. These are the algorithms for which it was originally designed. Unlike Hadoop, Spark is not tied to a single cluster management system, but attaches to a number of pre-existing ones, including Hadoop's YARN system. Also, unlike Hadoop, Spark does not have its own distributed data store, but once again can attach to a number of existing data stores, including the one supplied in Hadoop. Let's talk about some Spark concepts. In Spark, there's a driver which sends jobs to various executors. The jobs are divided into stages, and the data is partitioned, with one task running per partition. The tasks do not always run in parallel, but they do whenever possible. At the heart of Spark is the resilient…
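To make the driver, executor, partition, and task vocabulary above concrete, here is a toy sketch in plain Python. It is not Spark and does not use Spark's API: the names `partition`, `task`, and `run_job` are hypothetical, and a standard-library thread pool stands in for the executors. It only illustrates the idea of splitting data into partitions and running one task per partition, in parallel whenever a worker is free.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split data into num_partitions chunks (round-robin assignment)."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def task(chunk):
    """One task: the work applied to a single partition (here, a sum of squares)."""
    return sum(x * x for x in chunk)


def run_job(data, num_partitions=4, num_executors=2):
    """Toy 'driver': partition the data, run one task per partition, combine results."""
    partitions = partition(data, num_partitions)
    # With fewer workers than partitions, some tasks wait for a free worker.
    # This mirrors the "in parallel whenever possible" caveat.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_results = list(executors.map(task, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_job(list(range(10))))  # prints 285
```

In this sketch the number of partitions sets how many tasks exist, while the number of workers sets how many of them can run at the same time. Real Spark additionally groups tasks into stages and distributes them across machines, which this single-process model does not attempt.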