After successful completion of the course, the student is able to:
• Design large-scale storage for industrial, data-intensive applications (for instance GMail, Facebook);
• Design solutions for processing large data streams (for instance the Twitter stream);
• Implement distributed data processing programs using the MapReduce paradigm;
• Implement distributed big data processing programs in Spark on a Hadoop infrastructure;
• Implement complex analytical queries in query languages such as Spark SQL;
• Configure, execute and debug programs over the Hadoop framework.
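The MapReduce paradigm named in the objectives above splits a computation into a map phase that emits key-value pairs, a shuffle that groups pairs by key, and a reduce phase that aggregates each group. A minimal single-machine sketch of the classic word-count example, in plain Python (the function names are illustrative, not part of any Hadoop API):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data on big clusters", "data streams"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # e.g. {'big': 2, 'data': 2, 'on': 1, 'clusters': 1, 'streams': 1}
```

On a real Hadoop cluster the map and reduce functions run in parallel on many machines and the shuffle moves data over the network; the program logic, however, stays this simple.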
Big data is a term introduced in the early 2000s to refer to data sets whose size grew beyond the ability of the software tools of that time to process, typically in the order of many terabytes or petabytes for a single dataset. Big data sets are encountered by software architects in areas such as web search and social media, by scientists in fields such as meteorology and genomics, and by analysts in domains such as finance and business informatics.

The course closely follows developments in managing big data on large clusters of commodity machines, initiated by Google and adopted by many other web companies such as Yahoo, Amazon, Facebook, Spotify, and Twitter. Big data has given rise to a redesign of many core computer science concepts: the course discusses file systems (Google FS), programming paradigms (MapReduce), programming languages and query languages (Spark), 'noSQL' database paradigms (for instance Google's BigTable), and solutions for managing streaming data (for instance Twitter's Storm).

The course consists of lectures and practical assignments. Students solve real-world, large-scale problems as lab exercises, and get the opportunity to access the University of Twente Hadoop computing clusters. Examples of lab exercises are: counting words in large web crawls, inverted index construction, the computation of Google's PageRank, analyzing Twitter streams, and designing a storage layer for GMail clones.
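The PageRank computation mentioned among the lab exercises is, at its core, a power iteration over a link graph: each page repeatedly distributes its current rank over its outgoing links, damped by a teleportation factor. A minimal in-memory sketch in plain Python (graph, damping factor, and iteration count are illustrative; the lab versions run over far larger graphs on the cluster):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with a uniform distribution
    for _ in range(iterations):
        # every page keeps a (1 - damping) / n teleportation share
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing)   # split rank over outgoing links
            for target in outgoing:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Tiny three-page web: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print({p: round(r, 3) for p, r in ranks.items()})
```

Because this toy graph has no dangling pages, the ranks remain a probability distribution summing to one; page C, which collects links from both A and B, ends up with the highest rank.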