Course: Data Platforms: Spark to Snowflake
Apache Hadoop
- [Instructor] Now let's talk about Apache Hadoop. Apache Hadoop is a platform for processing big data. It is also an implementation of the MapReduce programming model. The term Hadoop refers both to the Hadoop implementation of the MapReduce programming model and to the tools in the Hadoop ecosystem. Hadoop distributes work across multiple machines, referred to as nodes, in a cluster. It writes intermediate data to disk as a strategy for dealing with large data that can't all sit in memory. So if we have our input files on disk and we're going to use Hadoop to process them, the first step is to split the files into partitions. We then run various transformations on these partitions in what's known as the map phase. In between each transformation, intermediate files are written to disk, and then the next transformation step reads from these files on disk. Finally, the data is combined to produce the final output. This final combination is referred to as the reduce phase.
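The flow just described can be sketched in plain Python (this is a simplified single-machine illustration of the MapReduce model, not actual Hadoop API code; the function names `map_phase`, `shuffle`, and `reduce_phase` are made up for this sketch). Each partition is mapped to intermediate key-value pairs, the pairs are grouped by key (the step where Hadoop would write intermediate files to disk), and a reduce step combines them into the final output:

```python
# Hypothetical sketch of the MapReduce flow, using word counting
# as the classic example. Not Hadoop code: real Hadoop distributes
# these steps across nodes and spills intermediate data to disk.
from collections import defaultdict

def map_phase(partition):
    # Emit an intermediate (word, 1) pair for each word in the partition
    return [(word, 1) for line in partition for word in line.split()]

def shuffle(pairs):
    # Group intermediate pairs by key; in Hadoop this stage reads and
    # writes the intermediate files on disk
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine the grouped values to produce the final output
    return {key: sum(values) for key, values in groups.items()}

# Input files split into partitions (here, tiny lists of lines)
partitions = [["big data big"], ["data platforms"]]
intermediate = [pair for p in partitions for pair in map_phase(p)]
result = reduce_phase(shuffle(intermediate))
print(result)  # {'big': 2, 'data': 2, 'platforms': 1}
```

In real Hadoop the map tasks run in parallel on different nodes, and the shuffle moves data over the network between the map and reduce stages, but the overall shape of the computation is the same.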