CVPR 2024 (the Conference on Computer Vision and Pattern Recognition) showcased some of the most outstanding research in computer vision. As a preeminent venue for new work across AI, machine learning, and deep learning, it continues to lead the field.
This year, CVPR received 11,532 submissions and accepted 2,719 papers (an acceptance rate of roughly 23.6%), a considerable increase over last year's 9,155 submissions and 2,359 acceptances.
CVPR, a leading-edge expo, also provides a platform for networking, tutorials, and workshops, annually attracting over 10,000 scientists and engineers. As in previous years, it featured research papers from major tech companies, including Meta and Google.
Here are some of the top papers presented by Meta.
PlatoNeRF: 3D Reconstruction in Plato’s Cave via Single-View Two-Bounce Lidar
PlatoNeRF is an innovative method for reconstructing 3D scenes from a single view using two-bounce lidar data. By combining neural radiance fields (NeRF) with time-of-flight data from a single-photon lidar system, it reconstructs both visible and occluded geometry with enhanced robustness to ambient light and low-albedo backgrounds.
This method outperforms existing single-view 3D reconstruction techniques by utilising pulsed laser measurements to train NeRF, ensuring accurate reconstructions without hallucination. As single-photon lidars become more common, PlatoNeRF offers a promising, physically accurate alternative for 3D reconstruction, especially for occluded areas.
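To get a feel for the measurement the method builds on, the sketch below computes the arrival time of a two-bounce light path (laser to illuminated spot, to a second surface point, to the sensor). It is a toy illustration with made-up positions and function names, not code from the paper.

```python
import numpy as np

C = 299_792_458.0  # speed of light, m/s


def two_bounce_tof(laser_pos, first_bounce, second_bounce, sensor_pos):
    """Total time of flight for a two-bounce light path:
    laser -> first bounce point -> second bounce point -> sensor."""
    path = (np.linalg.norm(first_bounce - laser_pos)
            + np.linalg.norm(second_bounce - first_bounce)
            + np.linalg.norm(sensor_pos - second_bounce))
    return path / C


# Toy example: a measured arrival time constrains where geometry can lie,
# since any candidate second-bounce point must reproduce it.
laser = np.array([0.0, 0.0, 0.0])
sensor = np.array([0.05, 0.0, 0.0])   # roughly co-located single-photon sensor
p1 = np.array([0.0, 0.0, 2.0])        # illuminated spot on a wall
p2 = np.array([1.0, 0.5, 1.5])        # candidate second-bounce point
print(f"arrival time: {two_bounce_tof(laser, p1, p2, sensor) * 1e9:.2f} ns")
```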
Read the full paper here.
Relightable Gaussian Codec Avatars
Meta researchers developed Relightable Gaussian Codec Avatars, which create high-fidelity, relightable head avatars capable of generating novel expressions.
The method uses a 3D Gaussian geometry model to capture fine details and a learnable radiance transfer appearance model for diverse materials, enabling realistic real-time relighting even under complex lighting.
This approach outperforms existing methods and has been demonstrated in real time on a consumer VR headset. By combining advanced geometry and appearance models, it achieves exceptional visual quality and realism suitable for real-time applications like virtual reality, though further research is needed to address scalability, accessibility, and ethical considerations.
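As a rough illustration of why radiance transfer makes relighting cheap, the snippet below relights a set of Gaussian primitives under a new environment with a single tensor contraction. The array shapes, coefficient counts, and names are assumptions chosen for illustration, not the paper's model.

```python
import torch

# Each Gaussian primitive carries learned transfer coefficients; relighting under
# a new environment reduces to a dot product with that environment's spherical
# harmonic (SH) lighting coefficients.
num_gaussians, num_sh = 10_000, 9                  # 9 SH coefficients (bands 0-2), an assumed choice
transfer = torch.randn(num_gaussians, num_sh, 3)   # learned per-Gaussian, per-RGB transfer
env_sh = torch.randn(num_sh)                       # SH coefficients of the environment light

# Because transport is linear in the lighting, relighting is a single contraction.
diffuse_rgb = torch.einsum('gsc,s->gc', transfer, env_sh)   # (num_gaussians, 3)
print(diffuse_rgb.shape)
```

This linearity is what lets the avatar be relit under arbitrary environments in real time rather than re-running an expensive simulation per frame.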
Read the full paper here.
Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild
The Nymeria dataset, the world’s largest of its kind, contains 300 hours of human motion data from 264 participants across 50 locations, captured using multimodal egocentric devices.
It includes 1200 recordings, 260 million body poses, 201.2 million images, 11.7 billion IMU samples, and 10.8 million gaze points, all synchronised within a single metric 3D coordinate system.
The dataset features comprehensive language descriptions of human motion, totalling 310.5K sentences and 8.64 million words. It supports research tasks like motion tracking, synthesis, and understanding, with baseline results for models such as MotionGPT and TM2T.
Collected under strict privacy guidelines, the Nymeria dataset significantly advances egocentric motion understanding and supports breakthroughs in related research areas.
Read the full paper here.
URHand: Universal Relightable Hands
URHand is a universal relightable hand model trained on multi-view images of hands captured in a light stage across hundreds of identities.
Its key innovation is a spatially varying linear lighting model that preserves light transport linearity, enabling efficient single-stage training and adaptation to continuous illuminations without costly processes.
Combining physically-based rendering with data-driven modelling, URHand generalises across various conditions and can be quickly personalised using a phone scan. It outperforms existing methods in quality, producing realistic renderings with detailed geometry and accurate shading.
URHand is suitable for applications in gaming, social telepresence, and augmenting training data for hand pose estimation tasks, representing a significant advancement in scalable, high-fidelity hand modelling.
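The light-stage setup leans on the linearity of light transport, the property URHand's lighting model is designed to preserve. The sketch below illustrates the general idea: an image under any target illumination is a weighted sum of one-light-at-a-time captures. This is a generic illustration with made-up shapes, not URHand's pipeline.

```python
import numpy as np

# One-light-at-a-time (OLAT) captures: one image per individually lit light-stage LED.
H, W, num_lights = 128, 128, 50
rng = np.random.default_rng(0)
olat = rng.random((num_lights, H, W, 3), dtype=np.float32)

# Any target illumination is a weighting of those same lights, so the relit image
# is just the corresponding weighted sum of OLAT images.
weights = rng.random(num_lights, dtype=np.float32)
relit = np.tensordot(weights, olat, axes=1)   # (H, W, 3)
print(relit.shape)
```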
Read the full paper here.
HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces
HybridNeRF enhances the speed of neural radiance fields (NeRFs) by blending surface and volumetric rendering methods. While traditional NeRFs are slow due to intensive per-ray sampling in volume rendering, HybridNeRF optimises by predominantly rendering objects as surfaces.
It requires fewer samples and reserves volumetric modelling for complex areas such as semi-opaque or thin structures.
Adaptive “surfaceness” parameters govern this hybrid approach, which reduces error rates by 15-30% compared to current benchmarks and achieves real-time frame rates of over 36 FPS at 2K × 2K resolution.
Evaluated on datasets including Eyeful Tower and ScanNet++, HybridNeRF delivers state-of-the-art quality and real-time performance through innovations like spatially adaptive surfaceness, distance-adjusted Eikonal loss, and hardware acceleration techniques, advancing neural rendering for immersive applications.
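To make the “surfaceness” idea concrete, here is a toy sketch of how a per-ray sample budget might shrink as a region becomes more surface-like. The function, sample limits, and example values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Rays hitting hard surfaces get few samples (surface-like rendering); rays through
# fuzzy or thin regions get many (volume-like rendering).
def samples_per_ray(surfaceness, min_samples=2, max_samples=64):
    """Interpolate the per-ray sample budget from a surfaceness value in [0, 1]."""
    surfaceness = np.clip(surfaceness, 0.0, 1.0)
    return np.round(max_samples - surfaceness * (max_samples - min_samples)).astype(int)


surfaceness = np.array([0.98, 0.90, 0.30, 0.05])   # e.g. wall, floor, foliage, glass edge
print(samples_per_ray(surfaceness))                # -> [ 3  8 45 61]
```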
Read the full paper here.
RoHM: Robust Human Motion Reconstruction via Diffusion
The paper ‘RoHM: Robust Human Motion Reconstruction via Diffusion’ introduces a method for reconstructing 3D human motion from monocular RGB(-D) videos, focusing on noise and occlusion challenges.
RoHM uses diffusion models to denoise and fill motion data iteratively, improving upon traditional methods like direct neural network regression or data-driven priors with optimisation.
It divides the task into global trajectory reconstruction and local motion prediction, managed separately with a novel conditioning module and iterative inference scheme.
RoHM outperforms existing methods in accuracy and realism across various tasks, with faster test-time performance. Future work aims to enhance real-time capability and incorporate facial expressions and hand poses.
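For readers unfamiliar with diffusion-based reconstruction, the sketch below shows the shape of a generic conditional denoising loop over a motion sequence. The dummy denoiser, noise schedule, and tensor sizes are placeholders, not RoHM's actual networks or conditioning scheme.

```python
import torch

T, frames, dof = 50, 120, 66                 # diffusion steps, sequence length, pose dims
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)


def denoiser(x_t, t, observed, mask):
    """Placeholder for a learned network that predicts noise, conditioned on the
    noisy/partial observations (here it simply returns zeros)."""
    return torch.zeros_like(x_t)


observed = torch.randn(frames, dof)               # noisy, partially occluded motion estimate
mask = (torch.rand(frames, dof) > 0.3).float()    # 1 = observed, 0 = occluded

x = torch.randn(frames, dof)                      # start from pure noise
for t in reversed(range(T)):
    eps = denoiser(x, t, observed, mask)
    # Standard DDPM mean update from the predicted noise.
    x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    if t > 0:
        x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
print(x.shape)   # reconstructed motion: (frames, dof)
```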
Read the full paper here.
Learning to Localise Objects Improves Spatial Reasoning in Visual-LLMs
LocVLM is a novel approach to enhance spatial reasoning and localisation awareness in visual language models (V-LLMs) such as BLIP-2 and LLaVA. The method utilises image-space coordinate-based instruction fine-tuning objectives to inject spatial awareness, treating location and language as a single modality.
This approach improves VQA performance across image and video domains, reduces object hallucination, enhances contextual object descriptions, and boosts spatial reasoning abilities.
The researchers evaluate their model on 14 datasets across five vision-language tasks, introducing three new localisation-based instruction fine-tuning objectives and developing pseudo-data generation techniques.
Overall, LocVLM presents a unified framework for improving spatial awareness in V-LLMs, leading to enhanced performance in various vision-language tasks.
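To illustrate the “location as language” idea, the snippet below serialises a bounding box into normalised-coordinate text that could be dropped into an instruction-tuning example. The prompt wording, formatting, and function names are hypothetical, not LocVLM's exact templates.

```python
# Normalised image coordinates are serialised directly into the text prompt and
# response used for instruction fine-tuning, so the model handles location and
# language as a single modality.
def box_to_text(box, image_w, image_h, precision=2):
    """Serialise a pixel-space box (x1, y1, x2, y2) as normalised coordinate text."""
    x1, y1, x2, y2 = box
    coords = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return "[" + ", ".join(f"{c:.{precision}f}" for c in coords) + "]"


def make_grounding_example(category, box, image_w, image_h):
    return {
        "instruction": f"Where is the {category} in the image?",
        "response": f"The {category} is located at {box_to_text(box, image_w, image_h)}.",
    }


print(make_grounding_example("dog", (120, 250, 380, 600), 640, 960))
```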
Read the full paper here.