
Scene Semantic Reconstruction from Egocentric RGB-D-Thermal Videos
Rachel Luo, Ozan Sener, and Silvio Savarese
Stanford University
{rsluo, osener, ssilvio}@stanford.edu


Figure 1: We propose a method for egocentric SLAM to gain geometric and semantic understanding of complex scenes where humans manipulate and interact with objects. The input of our system is an RGB-D-Thermal sensory stream from the perspective of an operator (i.e., an egocentric view) (left panel). The desired output is a 3D reconstruction of the scene where the location and pose of the observer are detected (center panel), as well as a 3D semantic segmentation in terms of the elements that involve human interaction (for instance: left hand, right hand, object that the operator interacts with, remainder of the environment) (right panel).

Abstract

In this paper we focus on the problem of inferring geometric and semantic properties of a complex scene where humans interact with objects from egocentric views.

Unlike most previous work, our goal is to leverage a multi-modal sensory stream composed of RGB, depth, and thermal (RGB-D-T) signals and use this data stream as an input to a new framework for joint 6 DOF camera localization, 3D reconstruction, and semantic segmentation. As our extensive experimental evaluation shows, the combination of different sensing modalities allows us to achieve greater robustness in situations where both the observer and the objects in the scene move rapidly (a challenging situation for traditional methods for semantic reconstruction). Moreover, we contribute a new dataset that includes a large number of egocentric RGB-D-T videos of humans performing daily real-world activities, as well as a new demonstration hardware platform for acquiring such a dataset.

1. Introduction

Consider a typical robotics scenario, in which a robot must understand a scene that includes humans, objects, and some human-object interactions. These robots will need to detect, track, and predict human motions [34, 42, 8], and relate them to the objects in the environment.

For example, a kitchen robot helping a chef should understand which object the chef is reaching for on a very crowded and highly occluded kitchen table in order to deduce and bring back the missing next ingredient from the refrigerator. This type of high-level reasoning requires the ability to infer rich semantics and geometry associated with both the humans (e.g., the position and pose of the chef's hands) and the objects (e.g., a pan, a cabinet, etc.) in the scene. We seek to solve this problem by introducing: i) a new egocentric dataset that integrates various sensing modalities including RGB, depth, and thermal; and ii) a framework that allows us to jointly extract critical semantic and geometric properties such as 3D location and pose. The egocentric nature of the videos in our dataset leads to a clear definition of semantics: each point can be labeled as part of either the human, the object in interaction, or the environment. This definition of semantics describes human-object interactions, and we define semantics and geometry as they relate to such interactions for the remainder of this paper. We include three critical modalities - RGB, depth, and thermal - in our raw data.
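For concreteness, the interaction-centric label set described above can be represented per reconstructed 3D point roughly as follows. This is a sketch of our own, not the authors' code; the class and field names are assumptions, and the split of "human" into left and right hands follows Figure 1.

# Sketch (our own illustration) of the per-point interaction-centric labels.
from dataclasses import dataclass
from enum import Enum, auto
import numpy as np

class InteractionLabel(Enum):
    LEFT_HAND = auto()      # part of the operator's left hand
    RIGHT_HAND = auto()     # part of the operator's right hand
    OBJECT = auto()         # the object currently being manipulated
    ENVIRONMENT = auto()    # everything else (static scene)

@dataclass
class LabeledPoint:
    xyz: np.ndarray          # 3D position in the world frame, shape (3,)
    label: InteractionLabel  # semantic class of this point

# Example: a point on the plate the operator is holding.
p = LabeledPoint(xyz=np.array([0.12, -0.05, 0.60]), label=InteractionLabel.OBJECT)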

We design an affordable hardware setup to capture all of these modalities by mounting a structured-light camera (RGB-D) and a mobile thermal camera (RGB-Thermal) to a chest harness (Figure 2). We then develop a system to calibrate and time-synchronize these sensors. The resulting data is a 2D stream of RGB, depth, and thermal information. However, although a stream of RGB, depth, and thermal images includes the necessary semantic and geometric information, it is not structured in a way that is useful for scene understanding. We propose a new framework that can jointly infer the semantics (human vs. object vs. static environment) and the geometry (camera poses and 3D reconstructions) of the elements that are involved in an interaction (e.g., a hand, a plate that the hand is holding). The problem of jointly inferring semantics and scene geometry can be considered a form of a semantic SLAM problem whereby all three modalities (RGB, depth, and thermal) are considered in conjunction to increase robustness and accuracy.
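The calibration and time-synchronization step is only described at a high level here. As a hedged illustration of the synchronization part, nearest-timestamp matching between the two independently clocked streams might look like the sketch below; the function and variable names and the 15 ms tolerance are our own assumptions, not details from the paper.

# Sketch of nearest-timestamp matching between two camera streams.
# Assumption: each stream is a list of (timestamp_seconds, frame) pairs,
# sorted by timestamp and already expressed in a common clock.
import bisect

def synchronize(rgbd_stream, thermal_stream, tol=0.015):
    """Pair every RGB-D frame with the closest thermal frame within `tol` seconds."""
    thermal_times = [t for t, _ in thermal_stream]
    pairs = []
    for t_rgbd, rgbd_frame in rgbd_stream:
        i = bisect.bisect_left(thermal_times, t_rgbd)
        # Candidates: the thermal frames just before and just after t_rgbd.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(thermal_times)]
        if not candidates:
            continue
        j = min(candidates, key=lambda k: abs(thermal_times[k] - t_rgbd))
        if abs(thermal_times[j] - t_rgbd) <= tol:
            pairs.append((rgbd_frame, thermal_stream[j][1]))
    return pairs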

Moreover, our framework can naturally handle both static scene elements (e.g., a desk) and moving objects (e.g., a hand), unlike most SLAM or SFM methods, which assume that the entire 3D scene is static. In summary, our contributions are: i) Designing an affordable, multi-modal data acquisition system for better scene understanding. ii) Sharing a large dataset of RGB-depth-thermal videos of egocentric scenes in which humans interact with the environment. Annotations (bounding boxes and labels) of the elements related to an interaction (e.g., a hand or a plate) are also provided. iii) An egocentric SLAM algorithm that can combine the three data modalities (RGB, depth, and thermal) and can handle both static and moving objects. We envision our proposed real-time SLAM architecture as a critical tool for modeling or discovering affordances or functional properties of objects from complex egocentric videos for robotics or co-robotics applications. The rest of this paper is organized as follows. In Section 2, we discuss some related work.

In Section 3, we describe the data acquisition system that we built as well as characteristics of the collected dataset. In Section 4, we explain the problem of solving SLAM for our egocentric videos to obtain semantic and geometric information. In Section 5, we show some of the outputs from our framework. Finally, Section 6 concludes the paper.

2. Related Work

Egocentric Scenes: A few previous works have studied egocentric scenes. For example, [39], [24], [15], and [22] look at first-person pose and activity recognition. [19] creates object-driven summaries for egocentric videos. However, none of these include large-scale publicly available datasets, and none of them include a thermal modality.

SLAM: SLAM is the problem of constructing a map of an unknown environment while tracking the location of a moving camera within it. Although there are visual odometry approaches [14, 23], explicit models of the map typically increase the accuracy of ego-motion estimation as well.

Thus, SLAM has become an increasingly popular area of research, especially for robotics or virtual reality applications, even when only the ego-motion is needed. Several early papers propose methods for monocular SLAM [4, 11]. More recently, ORB-SLAM proposes a sparse, feature-based monocular SLAM system [26]. LSD-SLAM is a dense, direct monocular SLAM algorithm for large-scale environments [7], and DSO is a sparse, direct visual odometry formulation [6]. Several stereo SLAM methods also exist for RGB-D settings, for example the dense visual method DVO-SLAM [16]. Kintinuous is another dense visual SLAM system that can produce large-scale reconstructions [46]. ElasticFusion is a dense visual SLAM system for room-scale environments [47]. ORB-SLAM2 extends ORB-SLAM for monocular, stereo, and RGB-D cameras [27]. KinectFusion can map indoor scenes in variable lighting conditions [29], and BundleFusion estimates globally optimized poses [3].

All of the above algorithms are designed for static scenes from a more global perspective. Nevertheless, although most SLAM systems assume a static environment, a few methods have been developed with dynamic objects in mind. [41] builds a system that allows a user to rotate an object by hand and see a continuously updated model. [48] presents a structured light technique that can generate a scan of a moving object by projecting a stripe pattern. More recently, DynamicFusion builds a dense SLAM system for reconstructing non-rigidly deforming scenes in real time with RGB-D information [28]. However, these methods reconstruct only single objects rather than entire scenes, and none consider the egocentric perspective.

Interactions: One plausible application of scene understanding with human-object interactions is for robotics. Numerous attempts have been made over the past several decades to better understand human-object interactions. Gibson coined the term affordance as early as 1977 to describe the action possibilities latent in the environment [9].

Donald Norman later appropriated the term to refer to only the action possibilities that are perceivable by an individual [30]. More recently, [18] and [17] learn human activities by using object affordances. [35] teaches a robot about the world by having it physically interact with objects. [25] predicts long-term sequential movements caused by applying external forces to an object, which requires reasoning about the physics of the scene. Other works also examine human-object interactions as they relate to hands or grasps. For instance, [40] and [2] both explore grasp classification in an attempt to understand hand-object manipulation. [12] studies the hands to discover a taxonomy of grasp types using egocentric videos. Although we do not attack this problem directly, the output of our framework can be useful for learning about human-object interactions.

Hand Tracking: Hand tracking is another area of interest for human-object interactions, and it is another potential application of our framework.

[21] and [20] perform pixel-wise hand detection for egocentric videos by using a dataset of labeled hands, and by posing the hand detection problem as a model recommendation task, respectively. [45] and [38] perform depth-based hand pose estimation from a third-person and an egocentric perspective. [44] simultaneously tracks both hands manipulating an object as well as the object pose using RGB-D videos. The closest to our work is [38]; however, it considers only images, lacks thermal information, and experiments only at small scale.

3. Dataset

In order to test our approach for obtaining a semantic reconstruction of complex egocentric scenes, we designed a multi-modal data acquisition system combining an RGB-D camera with a mobile thermal camera. We then used this setup to collect a large dataset of aligned multi-modal videos, and annotated semantically relevant information in these videos. In this section, we explain our process in detail and discuss the characteristics of the collected data. In the following subsections, we describe the hardware that we used, our method for geometrically calibrating our data, our annotations, the collected data, and potential future applications of this dataset.

Hardware: Our data acquisition system includes two mobile cameras: one RGB-D (an Intel RealSense SR300 [36]) and one thermal (a Flir One Android [33]).
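The geometric calibration between the two cameras is detailed later in the paper. Purely as a generic sketch (our own, not the authors' method), aligning the depth and thermal views typically amounts to a rigid transform followed by a pinhole projection; the intrinsics K_d, K_t and the extrinsics R, t below are placeholder assumptions.

# Sketch: project a point seen by the depth camera into the thermal image.
# K_d, K_t are 3x3 pinhole intrinsics; R (3x3) and t (3,) map depth-camera
# coordinates into the thermal-camera frame. All values are placeholders.
import numpy as np

def depth_pixel_to_thermal(u, v, z, K_d, K_t, R, t):
    """Back-project depth pixel (u, v) with depth z (meters), then project
    the resulting 3D point into the thermal image as (u_t, v_t)."""
    # Back-project to a 3D point in the depth camera frame.
    x = (u - K_d[0, 2]) * z / K_d[0, 0]
    y = (v - K_d[1, 2]) * z / K_d[1, 1]
    p_depth = np.array([x, y, z])
    # Rigid transform into the thermal camera frame.
    p_thermal = R @ p_depth + t
    # Pinhole projection into the thermal image plane.
    u_t = K_t[0, 0] * p_thermal[0] / p_thermal[2] + K_t[0, 2]
    v_t = K_t[1, 1] * p_thermal[1] / p_thermal[2] + K_t[1, 2]
    return u_t, v_t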

