Starting from the 1M-context language model, we train on mixed-format multimodal data: images, videos, and text interleaved in diverse orderings (text-image, image-text, video-text, text-video, etc.), all under a single autoregressive next-token prediction objective. In effect, the model learns any-to-any prediction across modalities.
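To make the training recipe concrete, here is a minimal sketch of how mixed-modality pairs can be flattened into one token stream for autoregressive prediction. The token IDs, marker names, and `pack_sequence` helper are illustrative assumptions, not the actual implementation; the point is only that every ordering (text-image, image-text, ...) reduces to the same shift-by-one next-token objective.

```python
# Sketch: flatten (modality, tokens) segments into one stream for
# autoregressive training. IDs and marker layout are assumptions.

BOS, EOS = 0, 1
MARKERS = {"text": 2, "image": 3, "video": 4}  # modality boundary tokens

def pack_sequence(segments):
    """Flatten (modality, token_list) pairs into one token stream."""
    seq = [BOS]
    for modality, tokens in segments:
        seq.append(MARKERS[modality])
        seq.extend(tokens)
    seq.append(EOS)
    return seq

def ar_targets(seq):
    """Standard shift-by-one: each position predicts the next token."""
    return seq[:-1], seq[1:]

# A text-image pair and an image-text pair use the same objective;
# only the segment order differs.
ti = pack_sequence([("text", [10, 11]), ("image", [50, 51, 52])])
it = pack_sequence([("image", [50, 51, 52]), ("text", [10, 11])])
inputs, targets = ar_targets(ti)
```

Because every format is serialized into one sequence, no per-direction loss or architecture change is needed: the ordering of segments alone determines which modality conditions on which.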