After completing the course, students:
- have knowledge and understanding of various data science skills for preparing different kinds of raw data and for analyzing data,
- are able to properly apply these data science skills in a real-world project,
- are able to make sound methodological decisions in a real-world project: choosing relevant steps and justifying their priority, showing a critical attitude towards data and results, and properly interpreting data science results,
- are able to properly communicate the results of a real-world project both orally and in writing,
- and have the ability and attitude to keep learning data science techniques and methods.
Data Science is an emerging interdisciplinary field that lies at the intersection of computer science, statistics, visualization and the social sciences. Scientific and economic progress is increasingly powered by our capabilities to explore big data sets. Data scientists dig for value in data by analyzing, for instance, texts, application usage logs, and sensor data. They are the driving force behind the successful innovation of Internet companies like Google, Twitter, and Yahoo. Job advertisements show an increasing need for data scientists and big data engineers. This need is apparent in almost every aspect of our society, including computer science, medicine, physics, and the humanities.
The goal of the course Data Science is to teach several data science skills needed in various phases of data analysis projects. The course concept is geared towards self-study in an assignment- and project-driven manner, i.e., it is designed to offer a rich environment for flexible, effective, and efficient self-study with ample guidance and supervision. The course is assessed with a project that takes about half of the course. Several projects are offered from which the student can choose. A project is composed of a real-world data set and a challenge, i.e., what knowledge can potentially be extracted from the data or what the project owner wants to do with the data. The data science skills are offered as technical topics, from which the student has to choose two. The projects indicate which technical topics provide the necessary skills for doing the project, so the choice of project and technical topics should be coherent.
Each topic consists of one lecture and a practical for learning the basic skills. The practical and project are done in pairs. Supervision is provided during practical sessions held twice per week, which are shared across all topics and projects. The project is assessed by the project owner and a topic teacher. The project grade is the grade for the course.
The list of projects and topics will be revised every year. Projects come from a variety of domains: health, logistics, business intelligence, transport, security, social media, etc. The following topics are offered:
• Topic DPV: Data Preparation and Visualization
(Topic teacher: M. van Keulen)
The data preparation and data visualization skills taught are in essence drawn from technologies developed for Business Intelligence. They are, however, also effective for data science. The topic teaches (a) data warehousing techniques for extracting, transforming, and loading data (ETL), (b) modeling data for analytic purposes using the multidimensional modeling approach of OLAP, and (c) data visualization techniques.
• Topic DM: Data Mining
(Topic teachers: K. Groothuis-Oudshoorn, E. Mocanu)
Data mining is about discovering patterns in large data sets involving methods from artificial intelligence, machine learning, statistics, and database systems. The topic teaches (a) classification, (b) clustering, and (c) association rule mining.
• Topic SEMI: Semi-structured data
(Topic teacher: M. van Keulen)
There exist several data exchange and knowledge representation standards. This topic teaches the most important standards and the skills to manipulate data in these standards: (a) XML and its associated standards SQL/XML, XPath and XQuery for publishing and manipulation with both relational and XML databases, (b) JSON storage and manipulation in relational databases, and (c) the Semantic Web standard RDF with its associated standard SPARQL for remote querying, also known as "Linked Open Data".
• Topic IENLP: Information Extraction Using Natural Language Processing
(Topic teacher: N. Bouali)
Most information is available in a form rather unsuitable for processing by computers, namely natural language text. This topic teaches (a) text mining (analysing text directly), (b) rule-based techniques for information extraction, and (c) statistical techniques for information extraction and natural language processing. This topic is preferably done in combination with "Data Mining".
• Topic PDBDQ: Probabilistic Databases and Data Quality
(Topic teacher: M. van Keulen)
Much effort in data preparation is devoted to dealing with data quality problems. Probabilistic database technology has the potential to represent data quality problems as uncertainty in the data, and to store and query that uncertain data. The topic teaches the most important skills for (a) using probabilistic database technology, and (b) representing several kinds of data quality problems as uncertainty in the data.
• Topic TS: Feature Extraction from Time Series data
(Topic teacher: F. Ahmed)
Sensors and other measurements increasingly produce massive amounts of data with space and time dimensions. The analysis of spatio-temporal data has many applications. The topic focuses on key techniques for preparing time series data for analysis, such as peak detection, filtering, Fourier analysis (FFT), dynamic time warping (DTW), and prediction models.
• Topic PM: Process Mining
(Topic teacher: F. Bukhsh)
Process mining aims to improve understanding and efficiency of business processes by analysing event logs with specialized data mining algorithms. The topic teaches the most important concepts and skills for applying and understanding process mining: (a) Petri nets, the theoretical foundation of process mining, (b) concepts like event log, causal trace, and the Alpha algorithm, (c) using the ProM tool for process discovery, (d) answering analytical questions for a discovered process, and (e) using the ProM tool for process conformance checking.
• Topic DINT: Data Integration
(Topic teacher: S. Wang)
An often-needed activity in Data Science is combining the data from two or more independent data sources. Its purpose is usually data enrichment: given a data set with information about certain entities (e.g., patients, users, products, locations, etc.), we would like to add information about these same entities from a different data source. The topic teaches the most important steps in a data integration pipeline (matching, mapping, merging, and evaluation) as well as dealing with frequently occurring complications: no IDs or no reliable IDs, attributes having different names, inconsistencies between the sources, large data sources, etc.
• Topic CV&IC: Computer Vision and Image Classification
(Topic teacher: E. Talavera)
In this topic, we discuss concepts such as computer vision, image analysis, feature extraction, supervised learning, deep learning and image classification. In the assignments, you will learn about some basic tasks in image processing, computer vision and machine learning. The assignments will also introduce you to some helpful tools in the field.
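To give a flavour of the techniques named in the topic list, the following Python sketch computes a dynamic time warping (DTW) distance between two short series, one of the methods mentioned under Topic TS. It is not taken from the course material: the function name `dtw_distance`, the absolute-difference cost, and the example series are assumptions chosen purely for illustration.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences.

    Assumptions (illustrative only): absolute difference as the local cost,
    no warping-window constraint, and non-empty input sequences.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],      # a[i-1] is matched to an already used b[j-1]
                cost[i][j - 1],      # b[j-1] is matched to an already used a[i-1]
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[n][m]


if __name__ == "__main__":
    # The second series is a time-stretched version of the first.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because DTW allows one sample to be matched to several samples of the other series, two series that differ only in local speed end up close to each other, which is why the technique is useful when comparing sensor recordings of unequal length.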
Since some programming experience in R or Python is desirable for most of these topics, we offer an optional topic for students with little or no programming experience:
Topic Zero: R or Python programming
(Topic teachers: N. Bouali and K. Groothuis-Oudshoorn)
This topic introduces basic programming concepts in both R and Python. Students can follow this topic in addition to the two topics they registered for in the course. For both R and Python, the main contents of this topic are: (a) the programming environment, (b) variables and control flow, (c) basic data structures, (d) functions, and (e) libraries for data processing, visualization and manipulation. Note that for the last part, the libraries covered depend on the topic(s) you’re registered for.
Besides the above topics, several more that are envisaged or already under development may be offered. Finally, if one is interested in more than two topics, it is possible to follow the course a second time, choosing two different topics and doing a second, different project. This is called "Data Science Additional Topics" and has its own course code. In this way, one can study 4 topics for 10 ECTS. Register for this course via course code 201400174 "Data Science".
The "Data Science" course is explicitly open for students of any master study. The following restrictions apply:
- For students of the masters "Industrial Engineering and Management" [IEM], "Health Science" [HS], "Communication Science" [COM], and "Business Administration" [BA], it is compulsory to choose the topics DPV and DM.
- For students of specialisation MSS in the master "Technical Medicine" [TM], it is compulsory to choose the topics TS and DM.
- For students of master "Business Information Technology" [BIT], it is compulsory to choose two topics from the list DPV, DM, PM.
- Quarter 1A is meant for HS-students and offers only topics DPV and DM and a project specifically geared towards this study. Other students are welcome given these restrictions.