Detection of human activities along with the associated context is of key importance for various application areas, including assisted living and well-being. To predict a user’s context in the daily-life situation a system needs to learn from multimodal data that are often imbalanced, and noisy with missing values. The model is likely to encounter missing sensors in real-life conditions as well (such as a user not wearing a smartwatch) and it fails to infer the context if any of the modalities used for training are missing. We propose a method based on an adversarial autoencoder for handling missing sensory features and synthesizing realistic samples. We empirically demonstrate the capability of our method in comparison with classical approaches for filling in missing values on a large-scale activity recognition dataset collected in-the-wild. We develop a fully-connected classification network by extending an encoder and systematically evaluate its multi-label classification performance when several modalities are missing. Furthermore, we propose a multi-stream temporal convolutional network to address the problem of multi-label behavioral context recognition. The presented architecture can be extended to include similar sensors for performance improvements and handles missing modalities through multi-task learning without any manual feature engineering on highly imbalanced and sparsely labeled datasets.