CloseHelpPrint
Kies de Nederlandse taal
Course module: 192110720
192110720
Distributed Data Processing using MapReduce
Course infoSchedule
Course module192110720
Credits (ECTS)5
Course typeCourse
Language of instructionEnglish
Contact persondr.ir. D. Hiemstra
E-mail-
Lecturer(s)
Lecturer
dr. M.M. Fokkinga
Lecturer
dr.ir. D. Hiemstra
Contactperson for the course
dr.ir. D. Hiemstra
Academic year2010
Starting block
1B
Application procedureYou apply via OSIRIS Student
Registration using OSIRISYes
Aims
The course teaches how to carry out large-scale distributed data analysis using MapReduce as the programming abstraction. MapReduce is a programming abstraction that is inspired by the functions 'map' and 'reduce' as found in functional programming language such as Lisp. It was developed at Google as a mechanism to allow large-scale distributed processing of data on data centers consisting of thousands of low-cost machines. MapReduce allows programmers to distribute their programs over many machines without the need to worry about system failures, threads, locks, semaphores, and other concepts from concurrent and distributed programming. Students will learn to specify algorithms using map and reduce steps and to implement these algorithms using Hadoop, an open source implementation of Google's file system and MapReduce. The course will introduce recent attempts to develop high-level languages for simplified relational data processing on top of Hadoop, such as Yahoo's Pig Latin and Microsoft's DryadLINQ. The course consists of lectures and practical assignments. Students will solve lab exercises on a large cluster of machines in order to get hands-on experience and solve real large-scale problems. The lab exercises will be done on the University of Twente PRISMA-2 computer, a data center consisting of 16 dual core systems sponsored by Yahoo Research. Examples of lab exercises are: counting bigrams in large web crawls, inverted index construction, and the computation of Google's PageRank.
Content

Vakbeschrijving
The course teaches how to carry out large-scale distributed data analysis using MapReduce as the programming abstraction. MapReduce is a programming abstraction that is inspired by the functions 'map' and 'reduce' as found in functional programming language such as Lisp. It was developed at Google as a mechanism to allow large-scale distributed processing of data on data centers consisting of thousands of low-cost machines. MapReduce allows programmers to distribute their programs over many machines without the need to worry about system failures, threads, locks, semaphores, and other concepts from concurrent and distributed programming. Students will learn to specify algorithms using map and reduce steps and to implement these algorithms using Hadoop, an open source implementation of Google's file system and MapReduce. The course will introduce recent attempts to develop high-level languages for simplified relational data processing on top of Hadoop, such as Yahoo's Pig Latin and Microsoft's DryadLINQ.

The course consists of lectures and practical assignments. Students will solve lab exercises on a large cluster of machines in order to get hands-on experience and solve real large-scale problems. The lab exercises will be done on the University of Twente PRISMA-2 computer, a data center consisting of 16 dual core systems sponsored by Yahoo Research. Examples of lab exercises are: counting bigrams in large web crawls, inverted index construction, and the computation of Google's PageRank.

 

-            Course objectives:
After successful completion of the course, the student is able to:

-            Disect complex problems in algorithms that use map and reduce steps,

-            Specify these algorithms in a functional language such as Haskell,

-            Implement these algorithms using the Hadoop framework,

-            Specify simplified relational queries using Pig Latin.

Prerequisites:

-            Introductory course in Databases (192110741 or an equivalent).

 

Deelnemende opleiding(en)

Examendoel

Master Computer Science  (MSc CSC)

Master

 

Required materials
-
Recommended materials
Reader
Reader "Distributed Data Processing using MapReduce"
Instructional modes
Lecture

Project

Tutorial

Tests
Assignments

CloseHelpPrint
Kies de Nederlandse taal