Course module: 201200044
Managing Big Data
Course info
Course module: 201200044
Credits (ECTS): 5
Course type: Course
Language of instruction: English
Contact person: dr. D. Bucur
Contact person for the course: dr. D. Bucur
Academic year: 2022
Starting block
Application procedure: You apply via OSIRIS Student
Registration using OSIRIS: Yes
After successful completion of the course, the student is able to:
  • Design large-scale storage for industrial, data-intensive applications (for instance Gmail, Facebook);
  • Design solutions for processing large data streams (for instance the Twitter stream);
  • Implement distributed data processing programs as MapReduce algorithms;
  • Implement distributed big data processing programs in Spark over a Hadoop infrastructure;
  • Implement complex analytical queries in query languages such as Spark SQL;
  • Configure, execute and debug programs over the Hadoop framework.
Big data is a term introduced in the early 2000s to refer to data sets whose size grew beyond the ability of the software tools of that time to process, typically in the order of many terabytes or petabytes for a single data set. Software architects encounter big data sets in, for instance, web search and social media; scientists in meteorology and genomics; and analysts in finance and business informatics.

The course closely follows developments in managing big data on large clusters of commodity machines, initiated by Google and followed by many other web companies such as Yahoo, Amazon, Facebook, Spotify, and Twitter. Big data gives rise to a redesign of many core computer science concepts: the course discusses file systems (Google FS), programming paradigms (MapReduce), programming languages and query languages (Spark), NoSQL database paradigms (for instance Google's BigTable) for managing big data, and solutions for managing streaming data (for instance Twitter's Storm).

The course consists of lectures and practical assignments. Students solve real-world, large-scale problems as lab exercises and get access to the University of Twente Hadoop computing clusters. Examples of lab exercises are: counting words in large web crawls, inverted index construction, the computation of Google's PageRank, analyzing Twitter streams, and designing a storage layer for Gmail clones.
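
To give an impression of the first lab exercise mentioned above, word counting can be sketched in the MapReduce style as below. This is a minimal, single-machine illustration in plain Python and not the course's actual assignment code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and on a Hadoop cluster the grouping step would be carried out by the framework across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every whitespace-separated word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key (done by the framework in a real cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over an iterable of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = ["big data on big clusters", "data streams"]
    print(word_count(docs))
```

Because each call to map_phase depends on a single document and each call to reduce_phase on a single word, both phases can in principle run in parallel on different machines, which is what makes the pattern suitable for large web crawls.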

Exam 50%, Assignment 50%
Assumed previous knowledge
The programming in this course is done in Python, and basic ability with this language is assumed.

Basic BSc-level modules or courses in Programming, Networks and Operating Systems (Unix), and Databases are required (our UT Module 1 Pearls of Computer Science can be sufficient).
Participating study
Master Business Information Technology
Participating study
Master Computer Science
Participating study
Master Internet Science and Technology
Required materials
Recommended materials
Instructional modes
Presence duty: Yes


Written exam, Assignment
