Researchers at the University of Maryland Institute for Advanced Computer Studies (UMIACS) partnered with a scientist at the National Information Communications Technology Research Centre of Excellence in Australia (NICTA) to develop robotic systems that are able to teach themselves. Specifically, these robots are able to learn the intricate grasping and manipulation movements required for cooking by watching online cooking videos. The key breakthrough is that the robots can “think” for themselves, determining the best combination of observed motions that will allow them to efficiently accomplish a given task.
Robot Learning Manipulation Action Plans by “Watching” Unconstrained Videos from the World Wide Web
Yezhou Yang, University of Maryland
Yi Li, NICTA, Australia
Cornelia Fermüller, University of Maryland
Yiannis Aloimonos, University of Maryland
In order to advance action generation and creation in robots beyond simple learned schemas, we need computational tools that allow us to automatically interpret and represent human actions. This paper presents a system that learns manipulation action plans by processing unconstrained videos from the World Wide Web. Its goal is to robustly decompose longer actions observed in video into sequences of atomic actions, in order to acquire knowledge for robots. The lower level of the system consists of two convolutional neural network (CNN) based recognition modules, one for classifying the hand grasp type and the other for object recognition. The higher level is a probabilistic manipulation action grammar based parsing module that aims at generating visual sentences for robot manipulation. Experiments conducted on a publicly available unconstrained video dataset show that the system is able to learn manipulation actions by “watching” unconstrained videos with high accuracy.
The ability to learn actions from human demonstrations is one of the major challenges for the development of intelligent systems. Manipulation actions are particularly challenging, as there is large variation in the way they can be performed and there are many occlusions. Our ultimate goal is to build a self-learning robot that is able to enrich its knowledge about fine-grained manipulation actions by “watching” demo videos. In this work we explicitly model actions that involve different kinds of grasping, and aim at generating a sequence of atomic commands by processing unconstrained videos from the World Wide Web (WWW).

The robotics community has been studying perception and control problems of grasping for decades (Shimoga 1996). Recently, several learning-based systems were reported that infer contact points or how to grasp an object from its appearance (Saxena, Driemeyer, and Ng 2008; Lenz, Lee, and Saxena 2014). However, the desired grasping type can differ for the same target object when it is used for different action goals. Traditionally, data about the grasp has been acquired using motion capture gloves or hand trackers, such as the model-based tracker of (Oikonomidis, Kyriazis, and Argyros 2011). The acquisition of grasp information from video (without 3D information) is still considered very difficult because of the large variation in appearance and the occlusions of the hand by objects during manipulation.

Our premise is that manipulation actions are represented at multiple levels of abstraction. At the lower levels the symbolic quantities are grounded in perception, while at the higher level a grammatical structure represents symbolic information (objects, grasping types, actions). Building on recent developments in deep neural networks, our system integrates a CNN-based object recognition module and a CNN-based grasping type recognition module. The latter recognizes the subject’s grasping type directly from image patches.

The grasp type is an essential component in the characterization of manipulation actions. From the viewpoint of processing videos alone, the grasp carries information about the action itself, and it can be used for prediction or as a feature for recognition. It also carries information about the beginning and end of action segments, so it can be used to segment videos in time. If the action is to be performed by a robot, knowledge about how to grasp the object is necessary so the robot can arrange its effectors. For example, consider a humanoid with one parallel gripper and one vacuum gripper. When a power grasp is desired, the robot should select the vacuum gripper for a stable grasp, but when a precision grasp is desired, the parallel gripper is a better choice. Thus, knowing the grasping type provides information for the robot to plan the configuration of its effectors, or even which type of effector to use.

In order to perform a manipulation action, the robot also needs to learn what tool to grasp and on what object to perform the action. Our system applies CNN-based recognition modules to recognize the objects and tools in the video. Then, given the beliefs about the tool and object (from the output of the recognition), our system predicts the most likely action using language, by mining a large corpus with a technique similar to (Yang et al. 2011).
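To make the lower-level perception module concrete, the following is a minimal sketch of a grasp-type classifier that operates on image patches around the detected hand and outputs a belief distribution over grasp classes, written here in PyTorch. The class name, patch size, number of grasp classes, and network architecture are illustrative assumptions and do not reproduce the authors' actual implementation; the essential point is only that the module emits soft beliefs that the higher-level grammar parser can consume.

```python
# Hypothetical sketch of a CNN grasp-type classifier over hand patches.
# Architecture, names, and the 6-class assumption are illustrative only.
import torch
import torch.nn as nn

NUM_GRASP_TYPES = 6  # assumed number of grasp classes

class GraspTypeCNN(nn.Module):
    def __init__(self, num_classes=NUM_GRASP_TYPES):
        super().__init__()
        # Small convolutional trunk over fixed-size hand patches (64x64 RGB assumed).
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, patch):
        # patch: (batch, 3, 64, 64) image crop around the detected hand.
        logits = self.classifier(self.features(patch))
        # Return a belief distribution over grasp types; the higher-level
        # grammar parser consumes these beliefs as soft evidence.
        return torch.softmax(logits, dim=1)

# Usage: beliefs = GraspTypeCNN()(torch.randn(1, 3, 64, 64))
```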
Putting everything together, the output of the lower-level visual perception system takes the form (LeftHand GraspType1 Object1 Action RightHand GraspType2 Object2). We will refer to this septet of quantities as a visual sentence.

At the higher level of representation, we generate a symbolic command sequence. (Yang et al. 2014) proposed a context-free grammar and related operations to parse manipulation actions. However, their system only processed RGBD data from a controlled lab environment, and they did not consider the grasping type in the grammar. This work extends (Yang et al. 2014) by modeling manipulation actions with a probabilistic variant of the context-free grammar, and by explicitly modeling the grasping type. Using as input the belief distributions from the CNN-based visual perception system, a Viterbi probabilistic parser represents actions in the form of a hierarchical and recursive tree structure. This structure innately encodes the order of atomic actions in a sequence, and forms the basic unit of our knowledge representation. By reverse parsing it, our system generates a sequence of atomic commands in predicate form, i.e., as Action(Subject, Patient), together with the temporal information necessary to guide the robot. This information can then be used to control the robot effectors (Argall et al. 2009).

Our contributions are twofold. (1) A convolutional neural network (CNN) based method has been adopted to achieve state-of-the-art performance in grasping type classification and object recognition on unconstrained video data; (2) a system for learning information about human manipulation actions has been developed that links lower-level visual perception and higher-level semantic structures through a probabilistic manipulation action grammar.
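As an illustration of the higher-level parsing module described above, the following toy sketch parses one visual sentence with a probabilistic context-free grammar and a Viterbi parser, here using NLTK's ViterbiParser. The grammar rules, probabilities, and terminal symbols below are invented for illustration and do not reproduce the paper's actual manipulation action grammar.

```python
# Toy sketch: Viterbi parsing of a "visual sentence" with a PCFG.
# The grammar below is an illustrative stand-in, not the paper's grammar.
from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
    S  -> HP AP                [1.0]
    HP -> H G O                [1.0]
    AP -> A HP                 [1.0]
    H  -> 'LeftHand'           [0.5] | 'RightHand'      [0.5]
    G  -> 'PowerGrasp'         [0.6] | 'PrecisionGrasp' [0.4]
    O  -> 'Cucumber'           [0.5] | 'Knife'          [0.5]
    A  -> 'Cut'                [1.0]
""")

# One visual sentence produced by the lower-level perception modules:
# (LeftHand GraspType1 Object1 Action RightHand GraspType2 Object2)
tokens = ['LeftHand', 'PrecisionGrasp', 'Cucumber',
          'Cut', 'RightHand', 'PowerGrasp', 'Knife']

parser = ViterbiParser(grammar)
for tree in parser.parse(tokens):
    tree.pretty_print()  # hierarchical, recursive tree structure of the action
    # Traversing the tree bottom-up would yield atomic commands in predicate
    # form, e.g. Grasp_PowerGrasp(RightHand, Knife) and Cut(Knife, Cucumber).
```

In a full pipeline, the terminal probabilities would come from the belief distributions output by the CNN-based perception modules rather than being fixed in the grammar as in this sketch.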