Adaptive control of a robotic arm is one of the greatest challenges that scientists have been tackling since the advent of Deep Learning [1; 2; 3]. The difficulty of the task, be it reaching or grasping, lies in the fact that, theoretically, an infinite number of solutions exist for any given situation. Indeed, if grasping an object in front of us can be achieved with, let’s say, a certain combination of shoulder, elbow, wrist and finger angles, the exact same result would be obtained by simply raising the elbow a bit, lowering the shoulder slightly, rotating the wrist further or closing another finger instead. In other words, there isn’t one optimal solution but a wide range of solutions. However, the space of solutions is seldom continuous, especially when natural constraints are imposed on each of the arm’s joints so that certain movements are impossible. As a result, the average of two perfectly acceptable solutions may just as well lead to a failure.
Thus, Deep Learning methods often rely on the ability of an AI controller to find its own way of solving the problem, that is, to develop a strategy that consistently favors one solution over all others and that avoids discontinuities as much as possible. In other words, deep learning models make it possible to find an optimal set of continuous solutions. Interestingly, such a system will develop a more natural behavior, that is, one that resembles that of living beings, than traditional AI methods would, because all movements obey that same strategy.
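The remark above about averaging two valid solutions can be illustrated with a toy example. The sketch below assumes a planar two-link arm with unit link lengths (a hypothetical stand-in, not the 6-axis robot used in this project) and computes the two joint configurations, elbow up and elbow down, that reach the same target.

```python
import math

LINK_1, LINK_2 = 1.0, 1.0  # hypothetical link lengths of a toy planar arm


def forward(shoulder, elbow):
    """Hand position reached by the given joint angles (radians)."""
    x = LINK_1 * math.cos(shoulder) + LINK_2 * math.cos(shoulder + elbow)
    y = LINK_1 * math.sin(shoulder) + LINK_2 * math.sin(shoulder + elbow)
    return x, y


def inverse(x, y, elbow_sign):
    """One of the two joint solutions reaching (x, y); elbow_sign is +1 or -1."""
    cos_elbow = (x * x + y * y - LINK_1 ** 2 - LINK_2 ** 2) / (2 * LINK_1 * LINK_2)
    elbow = elbow_sign * math.acos(max(-1.0, min(1.0, cos_elbow)))
    shoulder = math.atan2(y, x) - math.atan2(
        LINK_2 * math.sin(elbow), LINK_1 + LINK_2 * math.cos(elbow)
    )
    return shoulder, elbow


def miss(solution, target):
    """Distance between the hand and the target for a joint solution."""
    hand_x, hand_y = forward(*solution)
    return math.hypot(hand_x - target[0], hand_y - target[1])


target = (1.2, 0.6)
elbow_up = inverse(*target, +1)
elbow_down = inverse(*target, -1)
# Naive blend of the two valid solutions, joint by joint.
average = tuple((a + b) / 2 for a, b in zip(elbow_up, elbow_down))

print(miss(elbow_up, target), miss(elbow_down, target), miss(average, target))
```

Both solutions land on the target, whereas their average straightens the arm and misses it by roughly 0.66 length units. A controller that blends valid solutions would fail here, which is why a consistent strategy matters.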
Two of the most recent attempts to apply Deep Learning to the adaptive control of a robotic arm in grasping tasks were led by Pinto and Gupta in 2015 [4], and by Levine, Pastor, Krizhevsky, and Quillen in 2016 [5]. In both experiments, a series of objects is arbitrarily placed on a table and the robot’s task is to pick them up one by one. In both, too, the space of grasping positions is inherently two-dimensional, as only the x axis, the y axis, and rotations around the z axis (yaw) are controlled.
In collaboration with the AI research team at YASKAWA, a top international leader in industrial robotics, we have developed at XCompass an innovative method for real-time control of a robotic arm in a bulk grasping task in which, notably, the space of grasping positions is fully three-dimensional (the x, y and z axes are controlled, as well as yaw, pitch and roll rotations). The robot arm was a 6-axis GP7 with grippers mounted in place of a tool. A 2D camera was mounted on the robot just above the grippers. The YASKAWA team oversaw real-time data exchange between their robot hardware controller, the camera, and our in-house AI platform. They also conducted data collection on up to 6 robots at a time, contributed to the training of the Deep Learning models, and ultimately prepared and ran the demo (see Video 1). The system was showcased during the iREX International Robot Exhibition held in Tokyo in November 2017:
Video 1. Adaptive control of a robotic arm for grasping behavior.
Demo by YASKAWA at iREX 2017
XCompass’ team designed the AI platform and the architecture of the AI controller, then led and carried out the corresponding software development. Our implementation for the real-time demo involved two main AI components, called Engine 1 and Engine 2. The first engine was responsible for selecting and tracking a graspable point on a target object, while the second engine generated the corresponding motor commands for the robot arm to execute. Their specific operation unfolds as follows:
Engine 1 essentially consists of the Faster R-CNN algorithm [6]. Fundamentally, this method can spot and identify objects in any given image (see Figure 1). The model has two outputs: typically, one generates candidate bounding boxes around objects in the image, while the other analyzes the content of each bounding box to decide the category of the object (e.g., a person, a car, etc.).
Starting from a model pretrained for ImageNet classification [7], we kept the weights of the convolutional layers and reinitialized only the fully connected layers before training the whole model again on our new dataset.
Figure 1. Faster R-CNN architecture (on the left) and
an example of object classification in an image (on the right)
In our case, the bounding boxes, instead of surrounding whole objects, were focused on graspable points. We trained the model on different categories of objects, each presenting several graspable points to choose from. Thus, the task of our Faster R-CNN was not limited to identifying the objects; it also had to detect the most graspable points on these objects (see Figure 2).
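How a single target is picked from the detector’s output is not detailed here. As a minimal sketch, assuming that the "most graspable" point is simply the center of the highest-scoring box, the selection could look as follows. The Detection type, the select_grasp_point function and the 0.5 score threshold are hypothetical names and values introduced for illustration only.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Detection(NamedTuple):
    """Hypothetical container for one bounding box returned by a detector."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    label: str
    score: float


def select_grasp_point(
    detections: Sequence[Detection], min_score: float = 0.5
) -> Optional[Tuple[float, float]]:
    """Return the image-space center of the best-scoring box, or None."""
    # Assumption: boxes below the threshold are not considered graspable.
    candidates = [d for d in detections if d.score >= min_score]
    if not candidates:
        return None
    best = max(candidates, key=lambda d: d.score)
    return ((best.x_min + best.x_max) / 2, (best.y_min + best.y_max) / 2)


boxes = [
    Detection(0, 0, 10, 10, "pipe", 0.6),
    Detection(20, 20, 40, 60, "pipe", 0.9),
    Detection(5, 5, 6, 6, "eye-bolt", 0.2),
]
print(select_grasp_point(boxes))
```

The returned (x, y) pair plays the role of the grasping point coordinates in the image that are later passed on to Engine 2.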
In contrast to previous research by Pinto & Gupta, 2015, we couldn’t just rely on the output of the Faster R-CNN to generate a target position for the grippers, for in our case, objects weren’t lying on the table but instead were in bulk. To ensure that our robot would follow the exact same grasping point from the beginning until the end of a grasping attempt, we added a tracking functionality to Engine 1.
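The tracking method itself is not specified in this article. As one plausible stand-in, the sketch below uses nearest-neighbor association: among the graspable points detected in the new frame, it keeps the one closest to the point followed so far, and holds the previous position when nothing is close enough. The function name and the 40-pixel max_jump limit are assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def track_grasp_point(
    previous: Point, candidates: Sequence[Point], max_jump: float = 40.0
) -> Point:
    """Follow the same grasp point from one frame to the next."""
    best, best_dist = None, max_jump
    for point in candidates:
        dist = math.dist(previous, point)
        # Keep the closest candidate that lies within max_jump pixels.
        if dist <= best_dist:
            best, best_dist = point, dist
    # Assumption: with no plausible match, hold the last known position.
    return best if best is not None else previous


followed = (100.0, 100.0)
for frame_points in [[(104.0, 97.0), (300.0, 50.0)], [(290.0, 55.0)], []]:
    followed = track_grasp_point(followed, frame_points)
    print(followed)
```

A gate such as max_jump keeps the tracker from switching to a neighboring object in the bulk when the followed point is briefly missed by the detector.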
Figure 2. Faster R-CNN’s identification of the most graspable points in images
of pipes (left), y-mark (center), and eye-bolt (right),
either at the beginning of the sequence (top row) or at the end (bottom row)
Engine 2 operates adaptive control of the robotic arm. To solve the task at hand, we have designed a dedicated deep neural network architecture, which can be understood as a series of interacting modules (see Figure 3). The Goalinit module receives as inputs the image captured by the camera at the beginning of a grasping sequence (Imginit) as well as the grasping point that was selected by Engine 1 in this image (Dotinit: x, y coordinates in the image). Consequently, activations in this module are only updated once per sequence, at its very beginning. The Goalcurr module operates in a similar manner but at every step of the sequence, including the first. Thus, its inputs are the current image (Imgcurr) and the current grasping point (Dotcurr). Goal modules do not have specific targets from which to calculate a loss; instead, they are trained by distal signals backpropagated from all other modules (the Actors). Ultimately, they learn to convey the goal of the action to the Actors by integrating the target grasping positions and the camera images in a useful manner.
Figure 3. Engine 2 architecture. Distal training operates in the Goal modules where
information about the target object is extracted (initial and current steps, respectively).
The Actor, which generates the motor commands, is split into separate modules (X, Y, Z axes,
Rx, Ry, Rz rotations) and cascaded for better flexibility and faster learning.
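The distal training of the Goal modules can be illustrated on a deliberately tiny case. The sketch below assumes one scalar "goal" weight feeding one scalar "actor" weight, which is nothing like the actual network. Only the actor’s output has a target, and the goal weight is updated through the chain rule from the actor’s error.

```python
def train_distal(samples, steps=200, lr=0.05):
    """Train a scalar goal module and a scalar actor from the actor's loss only."""
    w_goal, w_actor = 0.5, 0.5  # hypothetical initial weights
    losses = []
    for _ in range(steps):
        total = 0.0
        for x, target in samples:
            goal = w_goal * x          # goal module: has no target of its own
            command = w_actor * goal   # actor: the only output with a target
            error = command - target
            total += error * error
            # Chain rule: the actor's error reaches w_goal through w_actor.
            grad_actor = 2 * error * goal
            grad_goal = 2 * error * w_actor * x
            w_actor -= lr * grad_actor
            w_goal -= lr * grad_goal
        losses.append(total / len(samples))
    return w_goal, w_actor, losses


# Toy data: the desired command is twice the input.
samples = [(1.0, 2.0), (2.0, 4.0), (-1.0, -2.0)]
w_goal, w_actor, losses = train_distal(samples)
print(losses[0], losses[-1], w_goal * w_actor)
```

The goal weight never sees a target, yet the product of the two weights converges toward 2, so the goal module ends up encoding whatever helps the actor reduce its own error.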
Actor modules receive this goal information along with the current position of the robot (Poscurr). Each Actor separately computes one degree of freedom of the command (X, Y and Z axes, and Rx, Ry and Rz rotations, respectively). This strong modularity of our architecture is rooted in the fundamental finding that sensorimotor control is better served by multiple internal models dividing up experience or context than by a single, more complex internal model [8]. This notably allows greater flexibility, and we also found that it speeds up learning. In addition, we feed the Actors’ outputs from one module to another (the output of Actor Rx is an input to the other Actors, and so on), overall generating a cascading architecture. Because of this cascade, the modules get deeper and deeper (the deepest layers being in Actor Z), so that each subsequent module gets an opportunity to correct the mistakes of the previous modules. The training dataset was first obtained using previously available recognition technology from YASKAWA, then using our AI engines directly once they reached significant performance.
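The data flow of the cascade can be sketched in a few lines. This is a structural illustration only: each Actor is replaced by a hypothetical single linear unit with random weights. The order of the modules is also an assumption, since the text only states that Actor Rx feeds the other Actors and that Actor Z is the deepest.

```python
import random

# Assumed cascade order; only "Rx first" and "Z last" come from the text.
CASCADE_ORDER = ["Rx", "Ry", "Rz", "X", "Y", "Z"]


def make_actor(n_inputs, rng):
    """Hypothetical stand-in for an Actor module: one linear unit."""
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]

    def actor(inputs):
        return sum(w * v for w, v in zip(weights, inputs))

    return actor


def build_cascade(n_goal, n_pos, seed=0):
    """Create one actor per degree of freedom, each wider than the last."""
    rng = random.Random(seed)
    # Actor k sees the goal features, the robot position and k earlier outputs.
    return [make_actor(n_goal + n_pos + k, rng) for k in range(len(CASCADE_ORDER))]


def run_cascade(actors, goal_features, pos_curr):
    """Compute the 6-DoF command, feeding each output to the later actors."""
    command = {}
    for name, actor in zip(CASCADE_ORDER, actors):
        inputs = list(goal_features) + list(pos_curr) + list(command.values())
        command[name] = actor(inputs)
    return command


actors = build_cascade(n_goal=4, n_pos=6)
command = run_cascade(actors, [0.1, 0.2, 0.3, 0.4], [0.0] * 6)
print(command)
```

Because each module receives the outputs already computed upstream, a later module can compensate for an earlier one, which is the correction opportunity described above.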
In collaboration with YASKAWA’s research team, XCompass has developed a new method for the adaptive control of a robotic arm. Its new features include, among others:
- 2D camera mounted directly on the robot (cheaper than 3D camera)
- Selection and tracking of the target object from bulk
- 3D (6-axis) control of the robotic arm
- Fast training of the actor model thanks to a cascading architecture
Engine 1 is unique in that it merges Deep Learning (grasp point identification) with traditional AI (tracking). As for Engine 2, its modularity and its cascading architecture have proven reliable in the design of an AI controller that can generate robotic commands in real time. By contrast, neither Pinto & Gupta (2015) nor Levine et al. (2016) implemented a deep learning model as the actor: the former simply ignored the problem, and the latter relied on a classical method of random sampling. Hence, we believe that our approach stands out for its boldness and for the challenges it allowed us to overcome.
Further details on the project can be found at https://www.yaskawa.co.jp/newsrelease/technology/35697. Should you be interested in knowing more about our work at XCompass, please contact us at firstname.lastname@example.org. Please also consider joining our team if you are eager to participate in similar projects (https://www.wantedly.com/projects/103913).
Author: Antoine Pasquali
Core development team at XCompass: Thomas Wilmotte, Gaku Nemoto, Kenji Aoki, Yuichi Sasaki and Antoine Pasquali
[1] Jordan, M. I., & Rumelhart, D. E. (1992). Forward Models: Supervised Learning with a Distal Teacher. Cognitive Science, 16, 307–354. doi:10.1207/s15516709cog1603_1
[2] Kawato, M., Furukawa, K., & Suzuki, R. (1987). A hierarchical neural network model for control and learning of voluntary movement. Biological Cybernetics, 57, 169–185. doi:10.1007/BF00364149
[3] Miller, W. T. (1987). Sensor-based control of robotic manipulators using a general learning algorithm. IEEE Journal of Robotics and Automation, 3, 157–165.
[4] Pinto, L., & Gupta, A. (2015). Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours. arXiv:1509.06825v1
[5] Levine, S., Pastor, P., Krizhevsky, A., & Quillen, D. (2016). Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection. arXiv:1603.02199v4
[6] Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497v3
[7] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2014). ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575v3
[8] Wolpert, D. M., & Kawato, M. (1998). Multiple paired forward and inverse models for motor control. Neural Networks, 11(7-8), 1317–1329.