UnconGen - Train / Val / Test Split #4

Description

@sofiane87

Hi, I have a question about this section of dataprovider_pypots.py, specifically this part:

    # --- 3. Split Data and Time Info ---
    idx_train, idx_val, idx_test = make_split_indices(ori_data.shape[0], train_ratio, val_ratio, test_ratio)
    
    train_set_X, train_set_time = ori_data[idx_train], time_info[idx_train]
    val_set_X, val_set_time = ori_data[idx_val], time_info[idx_val]
    test_set_X, test_set_time = ori_data[idx_test], time_info[idx_test]    

    # --- 4. Apply Sliding Window to both Features and Time ---
    train_X = sliding_window(train_set_X, seq_len, stride)
    val_X = sliding_window(val_set_X, seq_len, stride)
    test_X = sliding_window(test_set_X, seq_len, stride)
    
    time_info_train = sliding_window(train_set_time, seq_len, stride)
    time_info_val = sliding_window(val_set_time, seq_len, stride)
    time_info_test = sliding_window(test_set_time, seq_len, stride)

My understanding is that the sequence is shuffled and then randomly split into train / validation / test sets before any windowing is done.
However, wouldn't this mean the resulting sliding windows no longer match the real sequences, especially considering the following points:

  1. Steps are no longer in chronological order.
  2. Sequences now contain gaps, since the next "step" of a train sequence can randomly end up in the validation or test set.
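
To make the two points concrete, here is a minimal sketch of what I think happens. It assumes `make_split_indices` permutes the row indices before cutting them, which I have not verified; `shuffled_split_indices` and this `sliding_window` are hypothetical stand-ins I wrote for illustration, not the repository's functions.

```python
import numpy as np


def sliding_window(data, seq_len, stride):
    # Hypothetical stand-in for the repository's helper: stacks windows of
    # seq_len consecutive rows, stepping by stride.
    starts = range(0, data.shape[0] - seq_len + 1, stride)
    return np.stack([data[s:s + seq_len] for s in starts])


def shuffled_split_indices(n, train_ratio, val_ratio, seed=0):
    # ASSUMPTION: mimics a split that permutes row indices before cutting.
    # The real make_split_indices may behave differently.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    return perm[:n_train], perm[n_train:n_train + n_val], perm[n_train + n_val:]


n_steps, seq_len, stride = 100, 8, 1
time_info = np.arange(n_steps)  # one timestamp per row, in original order

idx_train, idx_val, idx_test = shuffled_split_indices(n_steps, 0.7, 0.15)
time_windows = sliding_window(time_info[idx_train], seq_len, stride)

# A window matches the real sequence only if its timestamps increase by 1.
is_contiguous = np.all(np.diff(time_windows, axis=1) == 1, axis=1)
print(f"{is_contiguous.sum()} of {len(time_windows)} train windows are contiguous")
```

If the split really is shuffled, the timestamps inside each window are out of order (point 1) and skip the steps that landed in the other sets (point 2).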

Can you confirm whether my understanding is correct and, if so, whether and how these concerns are addressed by the modelling?
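
For comparison, one ordering that would avoid both points is to cut the series into contiguous blocks first and then window each block separately. This is only a sketch of that alternative under my own assumptions, not a claim about what the repository intends; `contiguous_split_indices` is a hypothetical helper and `sliding_window` is again a stand-in.

```python
import numpy as np


def sliding_window(data, seq_len, stride):
    # Hypothetical stand-in: stacks windows of seq_len consecutive rows.
    starts = range(0, data.shape[0] - seq_len + 1, stride)
    return np.stack([data[s:s + seq_len] for s in starts])


def contiguous_split_indices(n, train_ratio, val_ratio):
    # Hypothetical helper: cuts the index range into three ordered blocks
    # (train first, then validation, then test) without shuffling.
    n_train = int(n * train_ratio)
    n_val = int(n * val_ratio)
    idx = np.arange(n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


n_steps, seq_len, stride = 100, 8, 1
time_info = np.arange(n_steps)  # one timestamp per row, in original order

splits = dict(zip(("train", "val", "test"),
                  contiguous_split_indices(n_steps, 0.7, 0.15)))

# Window each block on its own so no window crosses a split boundary.
time_windows = {name: sliding_window(time_info[idx], seq_len, stride)
                for name, idx in splits.items()}

for name, windows in time_windows.items():
    contiguous = np.all(np.diff(windows, axis=1) == 1)
    print(f"{name}: {len(windows)} windows, all contiguous: {contiguous}")
```

With this ordering every window covers consecutive steps, and no timestamp appears in more than one set.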

Thanks for all your work, it is very helpful!