Data Science is an emerging interdisciplinary field at the intersection of computer science, statistics, visualization, and the social sciences. Scientific and economic progress is increasingly powered by our ability to explore big data sets. Data scientists dig for value in data by analyzing, for instance, texts, application usage logs, and sensor data. They are the driving force behind the successful innovation of Internet companies like Google, Twitter, and Yahoo. Job advertisements show an increasing need for data scientists and big data engineers. The need for data scientists and big data analysts is apparent in almost every aspect of our society, including computer science, medicine, physics, and the humanities.
The goal of the Data Science course is to teach several data science skills needed in the various phases of data analysis projects. The course is geared towards self-study in an assignment- and project-driven manner, i.e., it is designed to offer a rich environment for flexible, effective, and efficient self-study with ample guidance and supervision. The course is assessed through a project that takes up about half of the course. Several projects are offered, from which the student can choose. A project consists of a real-world data set and a challenge, i.e., what knowledge can potentially be extracted from the data or what the project owner wants to do with the data. The data science skills are offered as technical topics, of which the student has to choose two. Each project indicates which technical topics provide the skills it requires, so the choice of project and technical topics should be coherent.
Each topic consists of one lecture and a practical for learning the basic skills. The practical and the project are done in pairs. Supervision is provided during practical sessions held twice per week, which are shared by all topics and projects. The project is assessed by the project owner and a topic teacher. The project grade is the grade for the course.
The list of projects and topics will be revised every year. Projects come from a variety of domains: health, logistics, business intelligence, transport, security, social media, etc. The following topics are offered:
• Topic DPV: Data Preparation and Visualization
(Topic teacher: M. van Keulen)
The skills taught for data preparation and data visualization are in essence drawn from technologies developed for Business Intelligence. They are, however, also effective for data science. The topic teaches (a) data warehousing techniques for extracting and transforming data (ETL), (b) modeling data for analytic purposes using the multidimensional modeling approach of OLAP, and (c) data visualization techniques.
• Topic DM: Data Mining
(Topic teacher: M. Poel)
Data mining is about discovering patterns in large data sets, involving methods from artificial intelligence, machine learning, statistics, and database systems. The topic teaches (a) classification, (b) clustering, and (c) association rule mining.
• Topic SEMI: Semi-structured data
(Topic teacher: M. van Keulen)
There exist several data exchange and knowledge representation standards. This topic teaches the most important standards and the skills to manipulate data in these standards: (a) XML and its associated standards SQL/XML, XPath, and XQuery for publishing and manipulation with both relational and XML databases, (b) JSON storage and manipulation in relational databases, and (c) the Semantic Web standard RDF with its associated standard SPARQL for remote querying, also known as "Linked Open Data".
• Topic IENLP: Information Extraction Using Natural Language Processing
(Topic teacher: M.B. Habib)
Most information is available in a form rather unsuitable for processing by computers, namely natural language text. This topic teaches (a) text mining (analysing text directly), (b) rule-based techniques for information extraction, and (c) statistical techniques for information extraction and natural language processing. This topic is preferably done in combination with "Data Mining".
• Topic PDBDQ: Probabilistic Databases and Data Quality
(Topic teacher: M. van Keulen)
Much effort in data preparation is devoted to dealing with data quality problems. Probabilistic database technology has the potential of representing data quality problems as uncertainty in the data, and storing and querying it. The topic teaches the most important skills for (a) using probabilistic database technology, and (b) how to represent several kinds of data quality problems as uncertainty in the data.
• Topic TS: Feature Extraction from Time Series data
(Topic teacher: C. Amrit)
Sensors and other measurements increasingly produce massive amounts of data with space and time dimensions. The analysis of spatio-temporal data has many applications. The topic focuses on key techniques for preparing time series data for analysis, such as peak detection, filtering, Fourier analysis (FFT), dynamic time warping (DTW), and prediction models.
• Topic PM: Process Mining
(Topic teacher: F. Bukhsh)
Process mining aims to improve the understanding and efficiency of business processes by analysing event logs with specialized data-mining algorithms. The topic teaches the most important concepts and skills for applying and understanding process mining: (a) Petri nets: the theoretical foundation of process mining, (b) concepts like event log, causal trace, and the Alpha algorithm, (c) using the ProM tool for process discovery, (d) answering analytical questions for a discovered process, and (e) using the ProM tool for process conformance checking.
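As an illustration of one technique named under Topic TS above, the following is a minimal Python sketch of dynamic time warping (DTW). It is not taken from the course materials: the function name `dtw_distance`, the absolute-difference cost, and the sample series are assumptions chosen for illustration only.

```python
from math import inf


def dtw_distance(a, b):
    """Return the DTW distance between two numeric sequences.

    Illustrative sketch: uses |x - y| as the local cost and the classic
    O(len(a) * len(b)) dynamic program without any window constraint.
    """
    n, m = len(a), len(b)
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its last point
                cost[i][j - 1],      # b advances, a repeats its last point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


if __name__ == "__main__":
    # Hypothetical series: the second repeats one value, so the warp absorbs it.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because DTW allows one series to "stretch" against the other, two series with the same shape but different pacing get a small distance, which a point-by-point comparison would not give.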
The "Data Science" course is explicitly open to students of any master's programme. The following restrictions apply:
• For students of the master's programmes "Industrial Engineering and Management" [IEM], "Health Science" [HS], and "Technical Medicine" [TM], it is compulsory to choose the topics DPV and DM.
• Quarter 1A is for HS-students only and offers only topics DPV and DM and a project specifically geared towards this study.
• Students doing the master course "Probabilistic Programming" cannot also do topic PDBDQ.
Besides the above topics, several more that are envisaged or already under development may be offered. Finally, if one is interested in more than two topics, it is possible to follow the course a second time, choosing two different topics. In this way, one can study 4 topics for 10 ECTS.
For students of Business Administration and Communication Studies, it is compulsory to choose the topics Data Mining (DM) and IENLP (Text Mining).
Assumed previous knowledge
Master Business Information Technology
Master Interaction Technology
Master Industrial Engineering and Management
Master Internet Science and Technology
Master Technical Medicine
Master Biomedical Engineering
Master Business Administration
Master Communication Studies
Required materials: -
Recommended materials