Many robotic applications rely on robotic arms or hands to handle different types of objects. Estimating the pose of such hand-held objects is an important yet challenging task in robotics, computer vision, and augmented reality (AR). A promising direction is to utilize multi-modal data, such as color (RGB) and depth (D) images. With the increasing availability of 3D sensors, many machine learning approaches have emerged that leverage such RGB-D data.