Achieving an robust and universal semantic representation for action description remains the key RUSA4D challenge in natural language understanding. Current approaches often struggle to capture the nuance of human actions, leading to inaccurate representations. To address this challenge, we propose a novel framework that leverages multimodal learni