
Publication in the Diário da República: Despacho n.º 13495/2022 - 18/11/2022
10 ECTS; 1st Year, 1st Semester, 30.0 PL + 30.0 TP + 30.0 OT, Code 390913.
Lecturer
- Renato Eduardo Silva Panda (1)(2)
(1) Responsible lecturer
(2) Teaching lecturer
Prerequisites
None
Objectives
This course aims to equip students with practical and theoretical skills in large-scale data processing and analysis. It covers modern data engineering techniques, including data preparation, storage, transformation, exploratory analysis, and visualization, using tools such as Pandas, Dask, Spark, and Streamlit, all applied in a project context.
Program
1. Introduction to Data Science and Big Data
1.1 Basic Concepts, Project Lifecycle, Roles in Data Science
1.2 The 5Vs of Big Data: Volume, Velocity, Variety, Veracity, Value
1.3 Ethics, Privacy, Transparency, and Social Impact
1.4 Reproducibility, Documentation, and Version Control
2. Development Environment
2.1 Jupyter, Python, and VS Code
2.2 Docker, DevContainers, and Reproducible Environments
2.3 Dependency Isolation (pip, Conda)
2.4 Environment Management with requirements.txt
3. Python for Analysis and Visualization
3.1 Review of Python Syntax, Data Structures, and Basic Scripting
3.2 NumPy and Matrix Manipulation
3.3 Pandas for Tabular Analysis
3.4 Visualizing with Matplotlib, Seaborn, and Plotly
4. Data Acquisition and Storage
4.1 Access to Local and Remote Data (CSV, JSON, Parquet)
4.2 REST APIs, Authentication, and Error Handling
4.3 Web Scraping with requests and BeautifulSoup
4.4 Storage in MongoDB (Document Store) and Redis (Key-Value)
5. Data Engineering and EDA (Exploratory Data Analysis)
5.1 Building ETL Pipelines
5.2 Data Cleansing and Transformation
5.3 Initial Exploration and Descriptive Analysis
5.4 Optimization with Efficient Formats (Parquet, Compression)
6. Interactive Dashboards
6.1 Introduction to Streamlit and Dash
6.2 Building Interfaces with Filters, Tables, and Visualizations
6.3 Integration with Pipelines and APIs
6.4 Practical Applications in Course Unit Projects
7. Large-Scale Data Processing
7.1 Introduction to the MapReduce Paradigm
7.2 Architectures: Hadoop, Spark, Dask
7.3 Parallelization and Chunking Strategies
7.4 Operations with RDDs and DataFrames in PySpark
7.5 Introduction to MLlib (Spark) and Distributed Machine Learning
Evaluation Methodology
Assessment is continuous and based on three practical projects:
- Project I (35%): Reorganizing a dataset, creating loaders, and building interactive dashboards
- Project II (35%): Developing an ETL pipeline with exploratory analysis and visualization
- Project III (30%): Large-scale processing with Dask and Spark
All projects are mandatory, as is their defense.
Bibliography
- Marr, B. (2022). Data Strategy: How to Profit from a World of Big Data, Analytics and the Internet of Things. USA: Kogan Page
- McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. USA: O'Reilly
- Rioux, J. (2022). Data Analysis with Python and PySpark. USA: Manning
- Santos, M. and Costa, C. (2019). Big Data Concepts, Warehousing, and Analytics. Lisboa: FCA
- Triguero, I. and Galar, M. (2023). Large-Scale Data Analytics with Python and Spark. UK: Cambridge University Press
Teaching Method
Theoretical and practical classes introduce the concepts alongside implementation examples. Guided lab classes use Jupyter notebooks and Python scripts. Applied projects throughout the semester consolidate this knowledge by putting it into practice.
Software used in class
Python (Anaconda, Jupyter Notebooks), Pandas, NumPy, Matplotlib, Seaborn, Plotly, MongoDB, Redis, Dask, PySpark, Streamlit, Dash, Docker, Git, Visual Studio Code