After successful completion of the course, the student is able to:
- Design large-scale storage for data-intensive web applications (for instance GMail, Facebook);
- Specify complex problems as MapReduce algorithms using a programming language with functional constructs (for instance Haskell, Python);
- Write complex analytical queries using query languages such as Pig Latin and Sawzall;
- Implement and run solutions using the Hadoop framework.
Big data is a term introduced in the early 2000s to refer to data sets whose size grew beyond the ability of the software tools of that time to process, typically on the order of many terabytes or petabytes for a single data set. Big data sets are encountered by software architects in, for instance, web search and social media; by scientists in, for instance, meteorology and genomics; and by analysts in, for instance, finance and business informatics. The course closely follows developments in managing big data on large clusters of commodity machines, initiated by Google and followed by many other web companies such as Yahoo, Amazon, AOL, Facebook, Hyves, Spotify, and Twitter. Big data gives rise to a redesign of many core computer science concepts: the course discusses file systems (Google FS), programming paradigms (MapReduce), programming languages and query languages (for instance Sawzall and Pig Latin), and NoSQL database paradigms (for instance BigTable and Dynamo) for managing big data.
The course consists of lectures and practical assignments. Students solve real-world, large-scale
problems as lab exercises, and they get the opportunity to access the University of Twente
PRISMA-2 computer, a 32-node data center sponsored by Yahoo Research. Examples of lab exercises
are: counting words in large web crawls, inverted index construction, the computation of Google's
PageRank, analyzing NetFlow logs, and designing a storage layer for the GMail clone HMail.
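To give an impression of the first two lab exercises, the sketch below expresses word counting and inverted index construction as map and reduce functions in plain Python. It is an illustration only, not course material: the helper `map_reduce`, the mapper and reducer names, and the two sample documents are all hypothetical, and the grouping step runs in memory on one machine rather than on a Hadoop cluster.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(records, mapper, reducer):
    """Hypothetical single-machine stand-in for a MapReduce run.

    Applies mapper to every record, groups the emitted (key, value)
    pairs by key, and applies reducer once per key.
    """
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map(mapper, records)):
        groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Word count: emit (word, 1) per occurrence, then sum per word.
def wc_mapper(record):
    _doc_id, text = record
    return [(word, 1) for word in text.lower().split()]


def wc_reducer(_word, counts):
    return sum(counts)


# Inverted index: emit (word, doc_id) once per document, then collect ids.
def index_mapper(record):
    doc_id, text = record
    return [(word, doc_id) for word in set(text.lower().split())]


def index_reducer(_word, doc_ids):
    return sorted(doc_ids)


# Made-up sample input: (document id, document text) pairs.
docs = [
    ("d1", "big data on big clusters"),
    ("d2", "data on commodity machines"),
]

word_counts = map_reduce(docs, wc_mapper, wc_reducer)
inverted_index = map_reduce(docs, index_mapper, index_reducer)
```

With these sample documents, `word_counts["big"]` is 2 and `inverted_index["data"]` lists both document ids. On a real cluster the grouping step is performed by the framework's shuffle phase, so only the mapper and reducer functions would carry over.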
Prerequisites:
Introductory course in Databases (192110741 or an equivalent).