In railway engineering, monitoring the health condition of rail track structures is crucial for preventing abnormal vibration of the wheel–rail system. To address the low efficiency of traditional nondestructive testing methods, this work investigates the feasibility of a computer-vision-aided approach to health condition monitoring of track structures based on vibration signals. The proposed method eliminates tedious and complicated data pre-processing, including signal mapping and noise reduction, and achieves a robust signal description by using numerous redundant features. First, the method converts the raw wheel–rail vibration signals directly into two-dimensional grayscale images and extracts image features using the FAST-Unoriented-SIFT algorithm. Subsequently, a Visual Bag-of-Words (VBoW) model is established from the image features, with the optimal parameters selected by fourfold cross-validation that considers both recognition accuracy and stability. Finally, the Euclidean distances between the word frequency vectors of the testing set and the codebook vectors of the training set are compared to recognize the health condition of the track structures. For the three health conditions of track structures analyzed in this paper, the overall recognition rate reached 96.7%. The results demonstrate that the proposed method achieves higher recognition accuracy and lower bias on strongly time-varying and random vibration signals, and that it shows promise for early-stage structural defect detection. © 2022, The Author(s).
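As an illustration of the pipeline summarized above, the following minimal Python sketch covers three of its steps: mapping a raw signal to a grayscale image, building a word frequency vector against a codebook, and recognizing a condition by Euclidean distance. It is not the authors' implementation. The function names, the min-max scaling to 8-bit gray levels, the row-major reshaping, and the class labels are all assumptions made for illustration. FAST-Unoriented-SIFT feature extraction and codebook learning are omitted, so descriptors and codebook entries are supplied directly as toy values.

```python
import math


def signal_to_grayscale(signal, width):
    """Map a 1-D signal to rows of 8-bit gray levels.

    Assumption: min-max scaling and row-major reshaping; samples that do
    not fill a complete row are dropped.
    """
    lo, hi = min(signal), max(signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    pixels = [round(255 * (s - lo) / span) for s in signal]
    rows = len(pixels) // width
    return [pixels[r * width:(r + 1) * width] for r in range(rows)]


def word_frequencies(descriptors, codebook):
    """Normalized histogram of the nearest visual word for each descriptor."""
    counts = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda k: math.dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts) or 1  # an image with no descriptors gives all zeros
    return [c / total for c in counts]


def recognize(freq, reference_vectors):
    """Return the label whose reference vector is closest to freq."""
    return min(reference_vectors,
               key=lambda label: math.dist(freq, reference_vectors[label]))


# Toy usage with hypothetical values (not data from the paper).
image = signal_to_grayscale([0, 1, 2, 3, 4, 5], width=3)
codebook = [(0, 0), (10, 10)]
freq = word_frequencies([(1, 0), (0, 1), (9, 10), (11, 9)], codebook)
references = {
    "condition_A": [0.8, 0.2],
    "condition_B": [0.5, 0.5],
    "condition_C": [0.1, 0.9],
}
label = recognize(freq, references)
```

In a fuller version, the descriptors would come from a keypoint detector and descriptor applied to the grayscale image, and the codebook from clustering the training descriptors; both choices affect the vocabulary size that the cross-validation step is meant to tune.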