Engenharia Informática-Internet das Coisas

Big Data Processing

Publication in the Diário da República: Despacho n.º 13495/2022 - 18/11/2022

10 ECTS; 1st Year, 1st Semester, 30.0 PL + 30.0 TP + 30.0 OT, Code 390913.

Lecturer
- Renato Eduardo Silva Panda (1)(2)

(1) Lecturer in charge
(2) Teaching lecturer

Prerequisites
None

Objectives
This course aims to equip students with practical and theoretical skills in large-scale data processing and analysis. Modern data engineering techniques, including data preparation, storage, transformation, exploratory analysis, and visualization, will be covered, using tools such as Pandas, Dask, Spark, and Streamlit, integrated into an applied project context.

Program
1. Introduction to Data Science and Big Data
1.1 Basic Concepts, Project Lifecycle, Roles in Data Science
1.2 The 5Vs of Big Data: Volume, Velocity, Variety, Veracity, Value
1.3 Ethics, Privacy, Transparency, and Social Impact
1.4 Reproducibility, Documentation, and Version Control
2. Development Environment
2.1 Jupyter, Python, and VS Code
2.2 Docker, DevContainers, and Reproducible Environments
2.3 Dependency Isolation (pip, conda)
2.4 Environment Management with requirements.txt
3. Python for Analysis and Visualization
3.1 Review of Python Syntax, Data Structures, and Basic Scripting
3.2 NumPy and Matrix Manipulation
3.3 Pandas for Tabular Analysis
3.4 Visualizing with Matplotlib, Seaborn, and Plotly
4. Data Acquisition and Storage
4.1 Access to Local and Remote Data (CSV, JSON, Parquet)
4.2 REST APIs, authentication, and error handling
4.3 Web scraping with requests and BeautifulSoup
4.4 Storage in MongoDB (document store) and Redis (key-value)
5. Data engineering and EDA (Exploratory Data Analysis)
5.1 Building ETL pipelines
5.2 Data cleansing and transformation
5.3 Initial exploration and descriptive analysis
5.4 Optimization with efficient formats (Parquet, compression)
6. Interactive dashboards
6.1 Introduction to Streamlit and Dash
6.2 Building interfaces with filters, tables, and visualizations
6.3 Integration with pipelines and APIs
6.4 Practical applications in course unit projects
7. Large-scale data processing
7.1 Introduction to the MapReduce paradigm
7.2 Architectures: Hadoop, Spark, Dask
7.3 Parallelization and chunking strategies
7.4 Operations with RDDs and DataFrames in PySpark
7.5 Introduction to MLlib (Spark) and distributed machine learning

Evaluation Methodology
The evaluation of the course unit is continuous and based on the completion of three compulsory practical projects, which allow for the assessment of the progressive application of Data Science and Big Data concepts.

- Project I (35%) introduces the complete Data Science process, including problem definition, data collection through simple web scraping, data processing and preparation, exploratory analysis, and presentation of conclusions in reports and visualizations.
- Project II (35%) deepens these skills, using more advanced data collection and integration techniques, such as scraping multiple sources, text extraction with OCR, and access to APIs, culminating in the production of reports and dashboards.
- Project III (30%) addresses large-scale data processing using tools such as Dask and Spark.

The delivery and defense of all projects are mandatory. Each project requires a minimum grade of 35% to be considered in the final grade.
The final grade is the weighted average of the three components, with a minimum overall grade of 9.5 out of 20 required to pass the course.
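
As an illustration of the arithmetic above, the following is a minimal sketch of how the final grade could be computed. It rests on assumptions not stated in this syllabus: project grades are taken to be on the 0-20 scale, the 35% minimum is read as 7.0 out of 20, and a project below that minimum is assumed to count as 0 in the weighted average. The names `WEIGHTS`, `MIN_PROJECT_GRADE`, `PASS_GRADE`, and `final_grade` are hypothetical and used only for this example.

```python
# Weights taken from the three projects described in the Evaluation Methodology.
WEIGHTS = {"project_1": 0.35, "project_2": 0.35, "project_3": 0.30}

# Assumption: the 35% minimum applies on the 0-20 scale, i.e. 7.0 out of 20.
MIN_PROJECT_GRADE = 7.0

# Minimum overall grade required to pass, as stated in the syllabus.
PASS_GRADE = 9.5


def final_grade(grades):
    """Return (weighted_average, passed) for project grades on a 0-20 scale.

    Assumption: a project graded below the minimum contributes 0 to the
    weighted average. A missing project raises KeyError, since all three
    projects are mandatory.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        grade = grades[name]
        if grade >= MIN_PROJECT_GRADE:
            total += weight * grade
    return total, total >= PASS_GRADE


# Example with made-up grades, for illustration only.
grade, passed = final_grade(
    {"project_1": 14.0, "project_2": 12.0, "project_3": 10.0}
)
print(f"Final grade: {grade:.2f} / 20, passed: {passed}")
```

If the course instead treats a project below the minimum as an automatic failure, the function would return `passed = False` in that case rather than zeroing the component.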

Bibliography
- Marr, B. (2022). Data Strategy: How to Profit from a World of Big Data, Analytics and the Internet of Things. USA: Kogan Page
- McKinney, W. (2017). Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. USA: O'Reilly
- Rioux, J. (2022). Data Analysis with Python and PySpark. USA: Manning
- Santos, M. and Costa, C. (2019). Big Data Concepts, Warehousing, and Analytics. Lisboa: FCA
- Triguero, I. and Galar, M. (2023). Large-Scale Data Analytics with Python and Spark. UK: Cambridge University Press

Teaching Method
Theoretical and practical classes introduce concepts, with implementation examples. Guided lab classes use Jupyter notebooks and Python scripts. Applied projects throughout the semester consolidate this knowledge by putting it into practice.

Software used in class
Python (Anaconda, Jupyter Notebooks), Pandas, NumPy, Matplotlib, Seaborn, Plotly, MongoDB, Redis, Dask, PySpark, Streamlit, Dash, Docker, Git, Visual Studio Code
