Publication in the Diário da República: Despacho n.º 13495/2022 - 18/11/2022
10 ECTS; 1st Year, 1st Semester, 30.0 PL + 30.0 TP + 30.0 OT, Code 390913.
Lecturer
- Ricardo Nuno Taborda Campos (1)(2)
(1) Lecturer in charge
(2) Teaching lecturer
Prerequisites
Objectives
This course aims to introduce students to the acquisition, processing, storage, and retrieval of large scale data in support of data science tasks.
By the end of the course, the student should be able to:
1. list the steps involved in a large-scale data science project and describe the functions of each;
2. know the data science toolkit;
3. understand the fundamental concepts of big data;
4. know how to apply methods for data acquisition using Python software packages, APIs, and web scraping;
5. master the process of storage and retrieval of large-scale data;
6. understand and know how to apply the appropriate strategies for large-scale data processing;
7. be familiar with the map-reduce paradigm;
8. know the fundamentals of the most relevant large-scale data processing frameworks;
9. know how to use, program, and process large-scale data using the Spark framework.
Program
1. Introduction to Data Science
- Definition of Large-Scale Data Science
- Skills of a Large-Scale Data Scientist
- Data Science Lifecycle
- The Importance of Data Science in Large-Scale environments
- Challenges and Opportunities in Data Science and Big Data
- Top industries using large-scale data science
- Trending Topics
- Data Repositories
- Data Lakes
- Data Science Meetup Groups
2. Data Science Toolkit
- Git
- GitHub
- Docker
- Python (Anaconda - Jupyter Notebook)
- Google Colab
3. Introduction to Big Data
- Definition of big data
- History
- Characteristics
- Advantages
- Practical applications in large-scale scenarios
- Big data architecture
- Big data frameworks
4. Large-Scale Data Acquisition
- Data Formats (unstructured, structured, semi-structured)
- Data acquisition from files
- Data acquisition using packages
- Data acquisition using APIs
- Data acquisition using Web Scraping
- Web Scraping Principles and Ethics
5. Large-Scale Storage & Retrieval
- NoSQL databases
- Advantages
- NoSQL vs SQL databases
- Types of NoSQL databases
- Open-Source NoSQL databases
6. Large-Scale Data Processing Strategies
- How much data counts as a lot of data?
- Overview of strategies for handling large-scale datasets: compression, databases, chunking, scale-up (expanding resources), scale-out (data parallelism), and big data
- The importance of GPUs in the context of large-scale data science
7. Programming Large-Scale Applications based on the Map-Reduce Paradigm
- Overview of the map-reduce paradigm
- History of map-reduce
- How does it work?
- Advantages
- Frameworks
8. Large-Scale Data Processing Frameworks: Hadoop, Spark, Dask
Hadoop
- What is Hadoop?
- History and Evolution
- Characteristics
- Architecture
- Hadoop Ecosystem
Spark
- What is Spark?
- History and Evolution
- Characteristics
- Architecture
- Spark vs Hadoop Map-Reduce
Dask
- What is Dask?
- Characteristics
- Advantages
- Architecture
- Dask vs PySpark
9. Large-Scale Data Processing with Spark
- Introduction to Core Spark Concepts
- RDDs (Resilient Distributed Datasets)
- Spark Dataframes
- Streaming
Evaluation Methodology
Periodic Evaluation
- P1 - Project I (team work): 40%
- P2 - Project II (team work): 40%
- T - Test: 20%
The final grade for the course is the weighted average of the grades obtained in the evaluation components listed above. A student who obtains a grade of 9.5 points or higher passes the course and is exempt from the Exam.
Final Assessment
- Exam: 100% (computer-based test with only partial access to the contents)
Requirements for admission to assessment and exams:
- Minimum of 70% class attendance during the teaching-learning period (student workers are exempt);
- Minimum score of 6 points in AE, where AE = (P1 × 40%) + (P2 × 40%) + (T × 20%).
Failure to meet any of these requirements (including submitting any of the projects after the deadline) prevents the student from passing the course.
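As an illustration only, the grading rules above can be sketched in Python. The function names, the representation of attendance as a fraction between 0 and 1, and the status labels are assumptions made for this example and are not part of the official regulations; late project submission is not modelled.

```python
def periodic_grade(p1: float, p2: float, test: float) -> float:
    """Weighted average AE = 40% * P1 + 40% * P2 + 20% * T."""
    return 0.4 * p1 + 0.4 * p2 + 0.2 * test


def periodic_status(p1: float, p2: float, test: float,
                    attendance: float, student_worker: bool = False) -> str:
    """Return a hypothetical status label for the periodic evaluation.

    attendance is assumed to be a fraction (0.70 means 70% of classes).
    """
    ae = periodic_grade(p1, p2, test)

    # Student workers are exempt from the 70% attendance requirement.
    meets_attendance = student_worker or attendance >= 0.70
    if not meets_attendance:
        return "not admitted"

    if ae >= 9.5:
        return "approved, exempt from exam"
    if ae >= 6:
        return "admitted to exam"
    return "not admitted"


if __name__ == "__main__":
    # Hypothetical grades, used only to exercise the function.
    print(round(periodic_grade(12, 14, 10), 2))
    print(periodic_status(12, 14, 10, attendance=0.80))
    print(periodic_status(8, 8, 5, attendance=0.90))
```

For example, hypothetical grades of P1 = 12, P2 = 14, and T = 10 give AE = 4.8 + 5.6 + 2.0 = 12.4, which is above the 9.5 threshold.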
Bibliography
- Galar, M. and Triguero, I. (2023). Large-Scale Data Analytics with Python and Spark. UK: Cambridge University Press
- Marr, B. (2022). Data Strategy: How to Profit from a World of Big Data, Analytics and the Internet of Things. USA: Kogan Page
- McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. USA: O'Reilly
- Rioux, J. (2022). Data Analysis with Python and PySpark. USA: Manning
Teaching Method
The syllabus is presented through lectures and demonstrations. Practical cases are analysed and solved in Python notebooks. The knowledge acquired is assessed through the development and presentation of projects and through exams.
Software used in class
Python: Anaconda and Jupyter Notebooks; PySpark