The space that can be explored quickly from a fixed viewpoint without locomotion is known as the vista space. In indoor environments, single rooms and parts of rooms fit this definition. The vista space plays an important role in agent-agent interaction because it is the immediate environment in which the interaction takes place. Collaborative interaction of the partners in and with the environment requires that both partners know where they are, which spatial structures they are talking about, and which scene elements they are going to manipulate. This thesis focuses on the analysis of a robot's vista space. Mechanisms for extracting relevant spatial information are developed that enable the robot to recognize which place it is in, to detect the scene elements the human partner is talking about, and to segment the scene structures the human is changing. These abilities are addressed by the proposed holistic, aligned, and articulated modeling approaches. For smooth human-robot interaction, the computed models should be aligned with the partner's representations. Therefore, the design of the computational models combines psychological results from studies on human scene perception with basic physical properties of the perceived scene and of the perception itself. The holistic modeling categorizes room percepts based on the observed 3D spatial layout. Room layouts have room-type-specific features, and fMRI studies have shown that some of the human brain areas active in scene recognition are sensitive to the 3D geometry of a room. With the aligned modeling, the robot is able to extract the hierarchical scene representation underlying a scene description given by a human tutor. Furthermore, it is able to ground the inferred scene elements in its own visual perception of the scene. This modeling follows the assumption that cognition and language schematize the world in the same way. 
This is visible in the fact that a scene depiction mainly consists of relations between an object and its supporting structure or between objects located on the same supporting structure. Lastly, the articulated modeling equips the robot with a methodology for extracting articulated scene parts and for fast background learning under the short and disturbed observation conditions typical of human-robot interaction scenarios. Articulated scene parts are detected without a prior model by observing the scene changes caused by their manipulation. Change detection and background learning are closely coupled because change is defined phenomenologically as variation of structure. This means that change detection involves comparing the currently visible structures with a representation in memory. In range sensing, this comparison can conveniently be implemented as a subtraction of the two representations. Together, the three modeling approaches enable the robot to enrich its visual perception of the surrounding environment, the vista space, with semantic information about meaningful spatial structures that is useful for further interaction with the environment and the human partner.
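The coupling of change detection and background learning through subtraction of range representations can be illustrated with a minimal sketch. It is not the method developed in this thesis: the function names `detect_change` and `update_background`, the threshold of 0.05 m, the learning rate, and the running-average update are all assumptions chosen for illustration. Range images are represented as nested lists of depth values in metres, with `None` marking a missing measurement.

```python
from typing import List, Optional

Depth = Optional[float]          # range reading in metres; None = no measurement
RangeImage = List[List[Depth]]


def detect_change(background: RangeImage, current: RangeImage,
                  threshold: float = 0.05) -> List[List[bool]]:
    """Mark pixels whose depth differs from the remembered background.

    The comparison is a per-pixel subtraction; the threshold (hypothetical
    value) absorbs sensor noise. Pixels without a valid reading in either
    image are treated as unchanged.
    """
    mask = []
    for bg_row, cur_row in zip(background, current):
        row = []
        for bg, cur in zip(bg_row, cur_row):
            if bg is None or cur is None:
                row.append(False)
            else:
                row.append(abs(cur - bg) > threshold)
        mask.append(row)
    return mask


def update_background(background: RangeImage, current: RangeImage,
                      mask: List[List[bool]], rate: float = 0.5) -> RangeImage:
    """Blend unchanged pixels into the background with a running average.

    Changed pixels keep their old background value, so a manipulated scene
    part does not leak into the memory representation. Pixels seen for the
    first time are adopted directly.
    """
    updated = []
    for bg_row, cur_row, m_row in zip(background, current, mask):
        row = []
        for bg, cur, changed in zip(bg_row, cur_row, m_row):
            if cur is None or changed:
                row.append(bg)
            elif bg is None:
                row.append(cur)
            else:
                row.append((1.0 - rate) * bg + rate * cur)
        updated.append(row)
    return updated


if __name__ == "__main__":
    background = [[2.0, 2.0], [2.0, None]]
    current = [[2.0, 1.5], [2.02, 1.0]]
    mask = detect_change(background, current)
    print(mask)
    print(update_background(background, current, mask))
```

In this sketch the mask produced by the subtraction drives the background update, which is one simple way to express the coupling described above: what counts as change depends on what is in memory, and what enters memory depends on what was not flagged as change.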