Authors

Florian Goth

Jan Phillip Thiele

Jan Linxweiler

Anna-Lena Lamprecht

Maja Toebs

Affiliations

Gesellschaft für Informatik

deRSE

Text to Data (Natural Language Processing)

This module introduces students to text processing and information retrieval.

Contents

  • Introduction to textual data and Natural Language Processing (NLP): key concepts and use cases
  • Text acquisition and preprocessing: tokenisation, stemming/lemmatisation, stopwords, named entity recognition, text cleaning
  • Text representations: bag-of-words, TF-IDF, word embeddings (Word2Vec, GloVe), contextualised embeddings (BERT, RoBERTa, GPT variants)
  • Text mining and information extraction: key terms, topics, sentiment analysis, relation extraction
  • Information Retrieval and Web Search Engines
  • Data-to-insight pipelines: from raw text to structured data (CSV/SQL), feature engineering for text data
  • Modelling and evaluation: classification, regression, sequence models, transformers; evaluation metrics (accuracy, F1, MCC, AUC)
  • Prototyping and reproducibility: experiment tracking, versioning, reproducible pipelines (Docker/Kubernetes, MLflow, DVC)
  • Risk management and ethics: bias, fairness, data protection (GDPR), transparency, interpretability (LIME, SHAP)
  • Deployment and practical applications: API-based models, batch vs real-time processing, monitoring
  • Project work: from a research question to data collection, modelling, evaluation and documentation
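As an illustration of the preprocessing and text-representation topics listed above, the following minimal sketch tokenises a few sentences, removes stopwords, and computes TF-IDF weights using only the Python standard library. The stopword list, the example sentences, and the function names `tokenise` and `tf_idf` are assumptions made for this example; the weighting uses the plain `tf * log(N / df)` variant, whereas libraries typically apply smoothed or normalised versions.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "is", "on", "and", "of"}


def tokenise(text):
    """Lowercase the text, keep runs of ASCII letters, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(N / df), with df = number of documents containing the term
    """
    tokenised = [tokenise(doc) for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: count each term once per document.
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))

    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up example corpus.
docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "Cats and dogs are pets.",
]
vectors = tf_idf(docs)
```

In this toy corpus, "cat" occurs in only one document and therefore receives a higher weight in the first document than "sat", which occurs in two. Stemming or lemmatisation would additionally merge "cat" and "cats"; that step is omitted here.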

Learning outcomes

  • Understand how text data is collected, preprocessed, and transformed into usable formats
  • Be able to select and justify different text representations
  • Apply NLP methods to extract relevant information from text
  • Develop, train and evaluate text-based models with attention to reproducibility and ethics
  • Assess model limitations, interpretability, and risk of errors
  • Build a reproducible, scalable text-data pipeline
  • Communicate results clearly in reports, presentations, and prototype designs
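For the outcome on evaluating text-based models, the sketch below computes three of the metrics named in the contents (accuracy, F1, MCC) for a binary classifier from its confusion counts. The label vectors are made-up example data and the function names are assumptions for this illustration; in practice an established library implementation would normally be used.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp, tn, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(*counts))
print("F1:", f1_score(*counts))
print("MCC:", mcc(*counts))
```

With these labels the accuracy is 0.75, while F1 (about 0.67) and MCC (about 0.47) are lower because the positive class is the minority and one of its three members is missed. This is one reason to report more than accuracy on imbalanced text data.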

Examination method

Project report (X pages) and presentation (15 minutes)

Lecture: Natural Language Processing

SWS: 2, ECTS: 4

Exercise: Natural Language Processing Exercise

SWS: 2, ECTS: 2

Sources & Implementations:

Curricula

Courses

Programs