Bachelor Theses at the Chair of Cognitive Systems (Prof. Dr. Andreas Zell)

Students who want to write a bachelor thesis should have attended at least one lecture of Prof. Zell and passed it with a good or at least satisfactory grade. Alternatively, they may have obtained the relevant background knowledge for the thesis from other, similar lectures.

Open Topics

Event-based vision for a tactile sensor

Mentor: Andreas Ziegler

Email: andreas.ziegler@uni-tuebingen.de

Intelligent interaction with the physical world requires perceptual abilities beyond vision and hearing; rich tactile sensing is essential for autonomous robots to dexterously manipulate unfamiliar objects or safely contact humans. Therefore, robotic manipulators need high-resolution touch sensors that are compact, robust, inexpensive, and efficient. In recent work, our collaborators at MPI presented Minsight [1], a soft vision-based haptic sensor that is a miniaturized and optimized version of the previously published sensor Insight. Minsight has the size and shape of a human fingertip and uses machine learning methods to output high-resolution maps of 3D contact force vectors at 60 Hz.

To capture the high-frequency content of textures, however, an update rate of 60 Hz is not enough. Event-based cameras [2], which are becoming more and more popular, could be a good alternative to the classical, frame-based camera used so far. Event cameras are bio-inspired sensors that asynchronously report timestamped changes in pixel intensity and offer advantages over conventional frame-based cameras in terms of low-latency, low-redundancy sensing and high dynamic range. Hence, event cameras have a large potential for robotics and computer vision.
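Regardless of the downstream model, event streams are commonly handled as tuples of timestamp, pixel location, and polarity. A minimal sketch of accumulating such a batch into a signed event frame (the event values and the sensor resolution below are invented for illustration, not the actual camera interface):

```python
import numpy as np

H, W = 8, 8  # hypothetical sensor resolution

# Each event: (timestamp [s], x, y, polarity in {-1, +1})
events = [
    (0.001, 2, 3, +1),
    (0.002, 2, 3, +1),
    (0.004, 5, 1, -1),
]

# Signed accumulation: brightness increases and decreases cancel out.
frame = np.zeros((H, W), dtype=np.int32)
for t, x, y, p in events:
    frame[y, x] += p

print(frame[3, 2])  # -> 2
print(frame[1, 5])  # -> -1
```

Such accumulated frames are the simplest event representation; the thesis would explore representations that better preserve the fine temporal structure needed for texture analysis.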

In this thesis, the student is tasked with using a new, miniature event-based camera together with Deep Learning to take Minsight to the next level.

The student should be familiar with Computer Vision and Deep Learning, and ideally have already used Deep Learning frameworks like PyTorch in previous projects or coursework.

[1] I. Andrussow, H. Sun, K. J. Kuchenbecker, and G. Martius, “Minsight: A Fingertip-Sized Vision-Based Tactile Sensor for Robotic Manipulation,” Advanced Intelligent Systems, vol. 5, no. 8, p. 2300042, August 2023, inside back cover.

[2] G. Gallego, T. Delbruck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis, D. Scaramuzza, Event-based Vision: A Survey, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 44, no. 1, pp. 154-180, 1 Jan. 2022.

Asynchronous Graph-based Neural Networks for Ball Detection with Event Cameras

Mentor: Andreas Ziegler

Email: andreas.ziegler@uni-tuebingen.de

Event cameras are bio-inspired sensors that asynchronously report timestamped changes in pixel intensity and offer advantages over conventional frame-based cameras in terms of low-latency, low redundancy sensing and high dynamic range. Hence, event cameras have a large potential for robotics and computer vision.

State-of-the-art machine-learning methods for event cameras treat events as dense representations and process them with CNNs. Thus, they fail to maintain the sparsity and asynchronous nature of event data, thereby imposing significant computation and latency constraints. A recent line of work [1]–[5] tackles this issue by modeling events as spatio-temporally evolving graphs that can be efficiently and asynchronously processed using graph neural networks. These works showed impressive reductions in computation.
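The core idea of these works is to treat each event as a graph node in space-time and connect events that lie within a spatio-temporal radius of each other. A small sketch of that graph construction (the events, the time-scaling factor, and the radius are illustrative choices, not the settings of [1]–[5]):

```python
import numpy as np

# Events as rows of (x, y, t); values are invented for illustration.
events = np.array([
    [0.0, 0.0, 0.00],
    [1.0, 0.0, 0.01],
    [0.0, 1.0, 0.02],
    [9.0, 9.0, 0.50],
])

beta = 10.0   # scales time so it is commensurable with pixel coordinates
radius = 2.0  # spatio-temporal connection radius

coords = events.copy()
coords[:, 2] *= beta

# Pairwise distances; an edge (i, j) exists if two events are closer
# than the radius in the scaled (x, y, beta*t) space.
diff = coords[:, None, :] - coords[None, :, :]
dist = np.linalg.norm(diff, axis=-1)
adj = (dist < radius) & ~np.eye(len(events), dtype=bool)
edges = np.argwhere(adj)

print(len(edges))  # -> 6 (the first three events form a clique)
```

When a new event arrives, only its local neighborhood changes, which is what allows the asynchronous, per-event updates these papers exploit.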

The goal of this thesis is to apply these graph-based networks to ball detection with event cameras. Existing graph-based networks were designed for more general object detection tasks [4], [5]. Since we only want to detect balls, in a first step the student will investigate whether a network architecture tailored to our use case could further improve the inference time.

The student should be familiar with “traditional” Computer Vision and Deep Learning. Experience with Python and PyTorch from previous projects would be beneficial.

[1] Y. Li et al., “Graph-based Asynchronous Event Processing for Rapid Object Recognition,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, Oct. 2021, pp. 914–923. doi: 10.1109/ICCV48922.2021.00097.

[2] Y. Deng, H. Chen, H. Liu, and Y. Li, “A Voxel Graph CNN for Object Classification with Event Cameras,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, Jun. 2022, pp. 1162–1171. doi: 10.1109/CVPR52688.2022.00124.

[3] A. Mitrokhin, Z. Hua, C. Fermuller, and Y. Aloimonos, “Learning Visual Motion Segmentation Using Event Surfaces,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, Jun. 2020, pp. 14402–14411. doi: 10.1109/CVPR42600.2020.01442.

[4] S. Schaefer, D. Gehrig, and D. Scaramuzza, “AEGNN: Asynchronous Event-based Graph Neural Networks,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, Jun. 2022, pp. 12361–12371. doi: 10.1109/CVPR52688.2022.01205.

[5] D. Gehrig and D. Scaramuzza, “Pushing the Limits of Asynchronous Graph-based Object Detection with Event Cameras.” arXiv, Nov. 22, 2022. Accessed: Dec. 16, 2022. [Online]. Available: arxiv.org/abs/2211.12324

Ball Detection with event-based asynchronous sparse convolutional networks

Mentor: Andreas Ziegler

Email: andreas.ziegler@uni-tuebingen.de

Event cameras are bio-inspired sensors that asynchronously report timestamped changes in pixel intensity and offer advantages over conventional frame-based cameras in terms of low-latency, low redundancy sensing and high dynamic range. Hence, event cameras have a large potential for robotics and computer vision.

In comparison to image frames from conventional cameras, data from event-based cameras is much sparser in most cases. If this sparsity is taken into account, a deep-learning-based detector can achieve a reduced inference time. The goal of this thesis is to use Asynchronous Sparse Convolutional Layers [1] in a neural network to detect fast-moving table tennis balls in real time.
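To see why sparsity helps, consider a toy event-count frame: a sparse convolution only needs to update output positions whose receptive field contains an active pixel, instead of sliding over the entire frame. A minimal sketch (frame contents and kernel are made up; this is the scatter formulation of the idea, not the actual layers of [1]):

```python
import numpy as np

# Toy event-count frame: only two active pixels, the rest is zero.
frame = np.zeros((6, 6), dtype=np.float32)
frame[1, 1] = 1.0
frame[4, 3] = 2.0

sparsity = 1.0 - np.count_nonzero(frame) / frame.size
print(f"{sparsity:.2f}")  # -> 0.94

kernel = np.ones((3, 3), dtype=np.float32)

# Scatter each active pixel's contribution into the output; untouched
# positions never have to be visited, unlike in a dense convolution.
active = np.argwhere(frame != 0)
out = np.zeros_like(frame)
for y, x in active:
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            oy, ox = y + dy, x + dx
            if 0 <= oy < 6 and 0 <= ox < 6:
                out[oy, ox] += frame[y, x] * kernel[1 - dy, 1 - dx]

print(out[4, 4])  # -> 2.0
```

The asynchronous layers of [1] extend this by propagating only the *changes* caused by newly arrived events through the network.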

The student should be familiar with “traditional” Computer Vision, Machine Learning/Deep Learning, and Python. Prior experience with PyTorch would be beneficial.

[1] N. Messikommer, D. Gehrig, A. Loquercio, and D. Scaramuzza, “Event-based Asynchronous Sparse Convolutional Networks,” European Conference on Computer Vision. (ECCV) 2020. [Online]. Available: arxiv.org/abs/2003.09148

Multi-object tracking via event-based motion segmentation with event cameras

Mentor: Andreas Ziegler

Email: andreas.ziegler@uni-tuebingen.de

Event cameras are bio-inspired sensors that asynchronously report timestamped changes in pixel intensity and offer advantages over conventional frame-based cameras in terms of low-latency, low redundancy sensing and high dynamic range. Hence, event cameras have a large potential for robotics and computer vision.

Since event cameras report changes of intensity per pixel, their output resembles an image gradient in which mainly edges and corners are present. The contrast maximization framework (CMax) [1] exploits this fact by optimizing the sharpness of accumulated events to solve computer vision tasks like the estimation of motion, depth, or optical flow. Most recent works on event-based (multi-)object segmentation [2]–[4] apply this CMax framework. The common scheme is to jointly assign events to an object and fit a motion model that best explains the data.

The goal of this thesis is to develop a real-time capable (multi-)object tracking pipeline based on multi-object segmentation. After becoming familiar with the recent literature, the student should choose a suitable multi-object segmentation approach and adapt it to our use case, namely a table tennis setup. Afterwards, different object tracking approaches should be developed, evaluated, and compared against each other.

The student should be familiar with “traditional” Computer Vision. Experience with C++ and/or optimization from previous projects or coursework would be beneficial.

[1] G. Gallego, M. Gehrig, and D. Scaramuzza, “Focus Is All You Need: Loss Functions for Event-Based Vision,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, Jun. 2019, pp. 12272–12281. doi: 10.1109/CVPR.2019.01256.

[2] X. Lu, Y. Zhou, and S. Shen, “Event-based Motion Segmentation by Cascaded Two-Level Multi-Model Fitting.” arXiv, Nov. 05, 2021. Accessed: Jan. 05, 2023. [Online]. Available: arxiv.org/abs/2111.03483

[3] T. Stoffregen, G. Gallego, T. Drummond, L. Kleeman, and D. Scaramuzza, “Event-Based Motion Segmentation by Motion Compensation,” arXiv:1904.01293 [cs], Aug. 2019. Accessed: Jun. 14, 2021. [Online]. Available: arxiv.org/abs/1904.01293

[4] Y. Zhou, G. Gallego, X. Lu, S. Liu, and S. Shen, “Event-based Motion Segmentation with Spatio-Temporal Graph Cuts,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1–13, 2021, doi: 10.1109/TNNLS.2021.3124580.

Multi-modal Robot Manipulation combining Gaze and Speech

Mentor: Yuzhi Lai

Email: yuzhi.lai@uni-tuebingen.de

Effective Human-Robot Interaction (HRI) is crucial for enhancing accessibility and usability in real-world robotics applications. However, existing solutions often rely on gestures or language commands, making interaction inefficient and ambiguous, particularly for users with physical impairments. Gaze, as a natural interaction modality, has great potential in HRI for individuals with severe physical limitations. By integrating gaze with natural language understanding, a robot arm can infer the user's intention, reducing ambiguity and enhancing interaction efficiency. This multi-modal approach is particularly beneficial for individuals with severe motor impairments, as it enables hands-free, natural communication without requiring complex gestures or precise verbal articulation.

The student should begin by reading previous related work, including our own publication and the official Aria Research Kit tutorials, to become familiar with the system architecture and capabilities of the glasses. They should then study literature on multimodal human-robot interaction, robotic grasp detection, and large language model prompt engineering. The core tasks include improving the alignment between language and gaze for more accurate intent recognition and integrating a state-of-the-art grasp detection algorithm, such as AnyGrasp, into the robotic system to enable generalizable object manipulation. A critical part of the project is replacing the prior cloud-based LLM module with a local LLaMA model for privacy-preserving inference. The student is expected to implement at least one existing algorithm for grasp point prediction and to propose two prompts, for gaze-language fusion and for robot action generation. The project will conclude with real-world experiments to evaluate system performance and usability.

Multi-modal Robot Manipulation: A Comparison between Large Language Models and Traditional Natural Language Processing

Mentor: Yuzhi Lai

Email: yuzhi.lai@uni-tuebingen.de

Effective Human-Robot Interaction (HRI) is crucial for enhancing accessibility and usability in real-world robotics applications. However, existing solutions often rely on gestures or language commands, making interaction inefficient and ambiguous. Gaze, as a natural interaction modality, has great potential in HRI. By integrating gaze and natural language processing (NLP), we can create a multi-modal interaction framework that enables hands-free, efficient, and intuitive HRI.

This project builds on our existing gaze-language fusion framework to develop an efficient and lightweight multi-modal interaction system for robot manipulation. Unlike previous methods that rely on large-scale language models (LLMs) for command interpretation, this project will focus on traditional NLP approaches that offer faster response times, lower memory consumption, and better suitability for edge-device deployment.
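As a toy illustration of the traditional-NLP route, a keyword-based parser can map a spoken instruction to a robot action and a target object with negligible latency. The vocabulary, command, and action names below are invented examples; the actual system would use proper intent classification and entity extraction:

```python
# Hypothetical action/object vocabularies for a tabletop setup.
ACTIONS = {"pick": "grasp", "grab": "grasp", "put": "place", "move": "place"}
OBJECTS = {"cup", "bottle", "apple"}

def parse(command):
    """Return the first recognized action and object in the command."""
    words = command.lower().split()
    action = next((ACTIONS[w] for w in words if w in ACTIONS), None)
    obj = next((w for w in words if w in OBJECTS), None)
    return action, obj

print(parse("Please pick up the red cup"))  # -> ('grasp', 'cup')
```

The comparison with an LLM-based interpreter would then quantify how much accuracy such a lightweight pipeline gives up in exchange for its speed and small memory footprint.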

To evaluate the effectiveness of this approach, we will conduct a detailed comparison between our traditional NLP-based system and an LLM-based system across key performance metrics such as processing speed, accuracy in interpreting commands, and overall interaction efficiency. This analysis will help determine whether lightweight NLP models can match or exceed the performance of large language models while significantly improving real-time responsiveness and computational efficiency.

Spiking neural network for event-based ball detection

Mentor: Andreas Ziegler

Email: andreas.ziegler@uni-tuebingen.de

Event cameras are bio-inspired sensors that differ from conventional frame cameras: Instead of capturing images at a fixed rate, they asynchronously measure per-pixel brightness changes and output a stream of events that encode the time, location, and sign of the brightness changes. Event cameras offer attractive properties compared to traditional cameras: high temporal resolution (on the order of μs), very high dynamic range (140 dB vs. 60 dB), low power consumption, and high pixel bandwidth (on the order of kHz), resulting in reduced motion blur. Hence, event cameras have a large potential for robotics and computer vision.

So far, most learning approaches applied to event data convert a batch of events into a tensor and then process it with conventional CNNs. While such approaches achieve state-of-the-art performance, they do not make use of the asynchronous nature of the event data. Spiking Neural Networks (SNNs), on the other hand, are bio-inspired networks that can process the output of event-based cameras directly. SNNs process information conveyed as temporal spikes rather than numeric values. This makes SNNs an ideal counterpart for event-based cameras.
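The basic unit of an SNN is the leaky integrate-and-fire (LIF) neuron: incoming spikes charge a membrane potential that leaks over time, and the neuron emits a spike when the potential crosses a threshold. A minimal discrete-time sketch (leak factor, threshold, and input spike currents are illustrative values):

```python
def lif(spikes_in, beta=0.9, threshold=1.0):
    """Discrete-time leaky integrate-and-fire neuron."""
    v = 0.0
    spikes_out = []
    for s in spikes_in:
        v = beta * v + s       # leak, then integrate the incoming current
        if v >= threshold:     # fire and reset when threshold is crossed
            spikes_out.append(1)
            v = 0.0
        else:
            spikes_out.append(0)
    return spikes_out

# A quick burst drives the neuron over threshold; isolated inputs
# leak away without producing an output spike.
print(lif([0.6, 0.6, 0.0, 0.0, 0.6]))  # -> [0, 1, 0, 0, 0]
```

Because events arrive as timestamped spikes anyway, they can drive such neurons directly, without first being binned into dense tensors.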

The goal of this thesis is to investigate and evaluate how an SNN can be used together with our event-based cameras to detect and track table tennis balls. The Cognitive Systems group has a table tennis robot system in which the developed ball tracker can be used and compared to other methods.

Requirements: Familiar with "traditional" Computer Vision, Deep Learning, Python

Web-based Leaderboard and Comparative Analysis Tool

Mentor: Rafia Rahim

Email: rafia.rahim@uni-tuebingen.de

Implement a leaderboard-style web platform that lists stereo matching methods sorted by runtime and accuracy. Each entry would present detailed runtime statistics, inference speed (FPS), and accuracy metrics on a uniform hardware environment. Users can select two or more methods for side-by-side comparisons, revealing detailed runtime breakdowns through expandable panels, graphs, and visualizations. Include intuitive search/filter functionality to narrow down methods based on criteria (runtime range, accuracy thresholds, GPU/CPU usage).
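The core of such a platform is a sort/filter pipeline over per-method benchmark records. A small sketch of that logic (the entries, field names like `runtime_ms` and `epe`, and thresholds are made-up placeholders; the real platform would load results measured on the uniform hardware environment):

```python
# Hypothetical benchmark records for three stereo matching methods.
methods = [
    {"name": "MethodA", "runtime_ms": 45.0, "epe": 0.52, "device": "GPU"},
    {"name": "MethodB", "runtime_ms": 120.0, "epe": 0.41, "device": "GPU"},
    {"name": "MethodC", "runtime_ms": 900.0, "epe": 0.39, "device": "CPU"},
]

def leaderboard(entries, max_runtime=None, device=None, key="epe"):
    """Filter by runtime range / device, then rank by the given metric."""
    rows = [e for e in entries
            if (max_runtime is None or e["runtime_ms"] <= max_runtime)
            and (device is None or e["device"] == device)]
    return sorted(rows, key=lambda e: e[key])

# GPU methods under 200 ms, ranked by accuracy (end-point error).
top = leaderboard(methods, max_runtime=200.0, device="GPU")
print([e["name"] for e in top])  # -> ['MethodB', 'MethodA']
```

The web layer would expose this as the search/filter UI, with the expandable panels and graphs rendered from the same records.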
 

Requirements: good programming skills, deep learning knowledge.

Performance and Efficiency of Hybrid Stereo Depth Models: A Comparative Analysis with Quantization

Mentor: Rafia Rahim

Email: rafia.rahim@uni-tuebingen.de

 

This thesis evaluates modern stereo depth estimation models (e.g., MonSter, StereoAnyWhere, FoundationStereo, DEFOM Stereo) that leverage monocular or foundation model priors. It presents a comparative analysis of their zero-shot performance and robustness, alongside an investigation into the accuracy-efficiency trade-offs introduced by model quantization.
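The quantization side of the trade-off can be illustrated with symmetric per-tensor int8 quantization of a weight matrix: memory drops by 4x versus float32, at the cost of a bounded rounding error. A minimal sketch on random weights (the tensor shape and distribution are arbitrary; real post-training quantization of the stereo models would use a framework's calibrated schemes):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(256, 256)).astype(np.float32)

# Symmetric quantization: one scale per tensor, int8 codes in [-127, 127].
scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_deq = w_int8.astype(np.float32) * scale

mem_ratio = w_int8.nbytes / w.nbytes  # int8 vs float32 storage
err = np.abs(w - w_deq).max()         # rounding error, at most scale / 2

print(mem_ratio)         # -> 0.25
print(err <= scale / 2)  # -> True
```

The thesis would measure how this per-weight error propagates into disparity accuracy and runtime for the candidate models.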
 

Requirements: good programming skills, deep learning knowledge.
 

Local deployment of DeepSeek R1

Mentor: Dominik Hildebrand

Email: Dominik.Hildebrand@uni-tuebingen.de

Large Language Models (LLMs) such as “ChatGPT” can quickly turn huge amounts of text into clear and helpful responses, for example when you need to draft an email, translate a paragraph, or get a quick summary. Thus, they are becoming a larger and larger part of our everyday lives by making everyday tasks faster and easier.

However, LLMs - as their name suggests - are indeed large, with parameter counts ranging from 1 billion (B) through 56B all the way up to 671B. As such, running inference with these models is expensive. For instance, the 671B model (called “DeepSeek R1”) requires (without optimization, as a lower bound) ~1.3 TB of (GPU) memory, which means 16 H100 GPUs just to load it (market price as of April 2025: ~30,000€ / unit). Thus, these models are usually run using cloud-based solutions, where your query is sent to and processed by a server cluster.
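The memory figure follows directly from the parameter count, considering only the weights at 16-bit precision (activations and the KV-cache come on top):

```python
params = 671e9        # DeepSeek R1 parameter count
bytes_per_param = 2   # fp16/bf16 weights

total_tb = params * bytes_per_param / 1e12
print(total_tb)  # -> 1.342
```

At 80 GB per H100, those ~1.34 TB are what fills up the 16 GPUs quoted above.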

This entails a number of issues for the user, such as potentially high latency, no way to query the model offline, and privacy concerns regarding both your and others' data. For instance, using ChatGPT to summarize your chat messages means you are giving away not just your own data but also that of the other participants.

To address this, model compression is an active area of research that aims to lower the resource requirements of models by “shrinking” them. Using such methods (mainly a subset called “quantization”), Unsloth shrank the R1 model enough to fit it onto a single consumer-grade GPU (RTX 4090), which could allow running the model locally.

The goal of this thesis is to replicate the deployment described in the Unsloth article.

Specifically, the student should

  1. Follow the steps outlined here to run DeepSeek-R1 on a cluster of 4 x A5000 GPUs
  2. Benchmark inference speed for different hardware settings (e.g., using only 1 of the 4 GPUs)
  3. Create a web-based interface that allows chatting with the model
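The benchmarking step (2.) boils down to timing token generation. A sketch of the measurement loop, where `generate` is a hypothetical stand-in stub; in the thesis, the call would go to the locally deployed, quantized model instead:

```python
import time

def generate(prompt, max_new_tokens):
    """Stub standing in for the deployed model: pretend 1 ms per token."""
    time.sleep(0.001 * max_new_tokens)
    return ["<tok>"] * max_new_tokens

n_tokens = 64
start = time.perf_counter()
tokens = generate("Hello", max_new_tokens=n_tokens)
elapsed = time.perf_counter() - start

tps = len(tokens) / elapsed  # decode throughput in tokens/s
print(f"{tps:.0f} tokens/s")
```

In practice one would average over several prompts and discard a warm-up run before comparing hardware settings.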

Necessary Background:

  • You can work independently
  • You can follow basic instructions such as those found under “Contact Details” 

Recommended Background:

  • Web-programming
  • Familiar with C++
  • Basic understanding of the transformer architecture (i.e. attention mechanism, auto-regressive decoding, kv-cache, …)

Contact Details:

  • Please contact me only via e-mail
  • Attach your Transcript of Records (feel free to hide your grades, I only want to see which lectures you have attended)
  • I try to get back to you within a week. If I don't, please contact me again (ideally just resend your original mail). If you don't, I'll assume you are no longer interested.

Edge deployment of a diverse set of LLMs

Mentor: Dominik Hildebrand

Email: Dominik.Hildebrand@uni-tuebingen.de

Large Language Models (LLMs) such as “ChatGPT” can quickly turn huge amounts of text into clear and helpful responses, for example when you need to draft an email, translate a paragraph, or get a quick summary. Thus, they are becoming a larger and larger part of our everyday lives by making everyday tasks faster and easier.

However, LLMs - as their name suggests - are indeed large, with parameter counts ranging from 1 billion (B) through 56B all the way up to 671B. As such, running inference with these models is expensive. For instance, the 671B model (called “DeepSeek R1”) requires (without optimization, as a lower bound) ~1.3 TB of (GPU) memory, which means 16 H100 GPUs just to load it (market price as of April 2025: ~30,000€ / unit). Thus, these models are usually run using cloud-based solutions, where your query is sent to and processed by a server cluster.

This entails a number of issues for the user, such as potentially high latency, no way to query the model offline, and privacy concerns regarding both your and others' data. For instance, using ChatGPT to summarize your chat messages means you are giving away not just your own data but also that of the other participants.

To address this, model compression is an active area of research that aims to lower the resource requirements of models by “shrinking” them. Ideally, this allows running those models locally, even in resource-constrained settings (on so-called “edge devices” like a smartphone). However, the effectiveness of such methods should be verified empirically by deploying the models on actual edge devices.

The goal of this thesis is to facilitate the deployment of various LLMs on an edge device, namely the Nvidia Orin AGX Development Kit.

Specifically, the student should

  1. Setup a working environment on the edge device
  2. Use said environment to run a selection of LLMs (e.g., Llama-3.2-1B, Llama-3.2-3B, Mistral-7B, …)
  3. Benchmark inference speed
  4. (Optional:) Apply various compression techniques to shrink the models deployed in (2.)

Necessary Background:

  • You can work independently
  • You can follow basic instructions such as those found under “Contact Details” 

Recommended Background:

  • You have used a package manager like Anaconda before

Ideal Background:

  • You have some experience using the transformers library
  • You know what CUDA is
  • Basic understanding of the transformer architecture (i.e. attention mechanism, auto-regressive decoding, kv-cache, …)

Contact Details:

  • Please contact me only via e-mail
  • Attach your Transcript of Records (feel free to hide your grades, I only want to see which lectures you have attended)
  • I try to get back to you within a week. If I don't, please contact me again (ideally just resend your original mail). If you don't, I'll assume you are no longer interested.

Design and Implementation of a Motor-Propeller Test Bench for Tethered Drone Applications

Mentor: Max Beffert

Email: Max.Beffert@uni-tuebingen.de

One way to work around the limited flight time of drones is to power them from the ground through a tether. A critical factor in optimizing their performance is selecting the right tether cable, which depends on the power consumption and thrust characteristics of the motor-propeller system. Currently, we are using the specs provided by the motor manufacturer, but they are incomplete and not always trustworthy.

In this thesis, the student will design and build a test bench to experimentally measure thrust, power consumption, and optionally RPM and torque of several motor-propeller combinations. The setup should integrate sensors (load cells, current sensors, etc.) and a data acquisition system to log and analyze performance under varying conditions.
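On the data-acquisition side, the core computation is simple: electrical input power is voltage times current, and dividing the measured thrust by it gives an efficiency figure worth logging alongside the raw samples. A sketch of that logging step (the sample values and CSV layout are invented for illustration):

```python
import csv
import io

samples = [
    # (time [s], voltage [V], current [A], thrust [N])
    (0.0, 22.2, 1.5, 3.1),
    (0.1, 22.1, 8.0, 9.8),
    (0.2, 22.0, 15.2, 14.6),
]

buf = io.StringIO()  # stand-in for a log file on disk
writer = csv.writer(buf)
writer.writerow(["t", "power_w", "thrust_n", "efficiency_n_per_w"])
for t, v, i, thrust in samples:
    p = v * i  # electrical input power P = U * I
    writer.writerow([t, round(p, 1), thrust, round(thrust / p, 3)])

print(buf.getvalue().splitlines()[2])  # the row for t = 0.1
```

Sweeping throttle while logging such rows yields the thrust/power curves needed for tether cable selection.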

Requirements:

  • Good knowledge of electronics (sensors, microcontrollers) and mechanics
  • Basic Python experience
  • Optional: Familiarity with CAD (e.g., Fusion 360) for designing the test bench frame

Pedestrian Detection for a Tethered Drone in Agricultural Environments

Mentor: Max Beffert

Email: Max.Beffert@uni-tuebingen.de

We use a tethered multirotor drone to fly over an autonomous agricultural ground vehicle. The purpose of this is to detect pedestrians in the vicinity of the ground vehicle. Currently the system streams live video from an onboard global shutter camera to the ground. The goal of this thesis is to train a model in order to detect pedestrians.

This project involves:

  1. Data Collection: Gathering and annotating video datasets of pedestrians under varying conditions.
  2. Model Training: Implementing and fine-tuning a lightweight object detection model (e.g., YOLO)
  3. Evaluation: Testing the model’s accuracy and robustness
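For the evaluation step, detection accuracy is typically scored by matching predicted boxes to ground-truth boxes via intersection-over-union (IoU), the criterion underlying metrics such as mAP. A minimal sketch with axis-aligned boxes given as (x1, y1, x2, y2); the example coordinates are arbitrary:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping by half: intersection 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # -> 0.3333...
```

A prediction is usually counted as a true positive when its IoU with a ground-truth pedestrian exceeds a threshold such as 0.5.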

The outcome will be a proof-of-concept detector that can later be deployed on the drone or ground vehicle.

Requirements:

  • Experience with Python, computer vision and deep learning frameworks (PyTorch)
