Enhancing data augmentation with advanced autoencoders & transformer-based tools for tabular data

What Is Data Augmentation?

Data augmentation is a technique used to enhance the size and quality of training datasets in machine learning by artificially generating new data points. This process involves applying various transformations to existing data, such as rotation, scaling, cropping, flipping, and adding noise, to create new examples that help models generalize better to unseen data. By increasing the diversity of the training data, data augmentation can improve the robustness and performance of machine learning models, particularly in scenarios where acquiring large volumes of labeled data is challenging or costly. This technique is widely used in computer vision, natural language processing, and speech recognition, among other fields, to mitigate overfitting and enhance model accuracy.

Figure 1: Data augmentation applied to an image. The original image is shown alongside augmented versions created through transformations such as rotation, shifting, zooming, and flipping.
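
To make the image-level transformations above concrete, here is a minimal augmentation pipeline built with torchvision; the specific transforms and parameter values are illustrative choices, not prescriptions from this project:

```python
import torchvision.transforms as T

# Each epoch the model sees a randomly rotated, cropped, flipped,
# color-jittered variant of every training image
augment = T.Compose([
    T.RandomRotation(degrees=15),
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
    T.RandomHorizontalFlip(p=0.5),
    T.ColorJitter(brightness=0.2, contrast=0.2),
])
# augmented = augment(pil_image)  # applied on the fly during training
```

Because the transforms are sampled randomly at every pass, the model effectively trains on a much larger and more varied dataset than the one stored on disk.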

In the context of generative AI, data augmentation takes on a more sophisticated role, leveraging advanced algorithms to create entirely new and realistic data samples. Generative Adversarial Networks (GANs) and Autoencoders (AEs) are commonly used to generate high-quality synthetic data that can be indistinguishable from real data. This type of augmentation is particularly valuable in scenarios where data privacy is a concern or where rare events are underrepresented in the dataset. For example, in medical imaging, generative AI can produce synthetic images of rare diseases to train diagnostic models more effectively. By utilizing generative models for data augmentation, researchers can significantly expand and diversify their training datasets, leading to improved model performance and broader applicability across various domains.

Figure 2: Original and generated images from VQ-VAE [1]

Key Techniques for Data Augmentation in s-X-AIPI

In the context of the s-X-AIPI project, one of the critical challenges in the Asphalt Use Case (UC) lies in the scarcity of available laboratory test data. These datasets, though spanning extended time periods, are limited in volume, making it difficult to effectively train AI models and procedures aimed at analyzing asphalt conditions. This scarcity can lead to imbalanced datasets, reduced model robustness, and suboptimal predictive performance.

To overcome this limitation, our data augmentation efforts focus on enhancing the available data with advanced generative AI techniques: models from the Autoencoder family, particularly Variational Autoencoders (VAEs) and Denoising Autoencoders (DAEs), and the transformer-based REaLTabFormer.

Variational Autoencoders (VAEs) for Data Augmentation

  • Latent Space Representation: VAEs work by learning a probabilistic mapping from the input data to a lower-dimensional latent space and then decoding from this space to generate new data samples. For tabular data, this means learning the underlying distribution of the features. By sampling from the latent space, the VAE can generate new synthetic data points that closely follow the patterns and dependencies observed in the original data.

  • Generating New Data: In the Asphalt UC, generating additional data points helps in scenarios where labeled data is scarce or imbalanced. For instance, a VAE can be trained on a dataset representing various asphalt conditions and then used to generate synthetic samples that augment underrepresented conditions in the training set; a minimal sketch follows this list.
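
The sketch below shows a minimal VAE for continuous tabular features in PyTorch. Layer sizes, the latent dimension, and the use of mean-squared error for reconstruction are illustrative assumptions, not the exact architecture used in the project:

```python
import torch
import torch.nn as nn

class TabularVAE(nn.Module):
    """Minimal VAE for standardized, continuous tabular features."""

    def __init__(self, n_features: int, latent_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.fc_mu = nn.Linear(hidden, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(hidden, latent_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_features)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z while keeping gradients flowing
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar


def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction error plus KL divergence to the standard normal prior
    recon = nn.functional.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl


def generate(model: TabularVAE, n_samples: int, latent_dim: int = 8):
    # New synthetic rows come from decoding samples drawn from the prior
    with torch.no_grad():
        z = torch.randn(n_samples, latent_dim)
        return model.decoder(z)
```

Decoding samples drawn from the prior yields arbitrarily many new rows; oversampling latent regions associated with underrepresented asphalt conditions is one way to rebalance the training set.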

Denoising Autoencoders (DAEs) for Robust Data Generation

  • Noise Removal and Data Reconstruction: DAEs are trained to reconstruct the original input from a corrupted version, effectively learning to "denoise" the data. In the context of tabular data, this can mean training the model to handle missing or noisy data. By corrupting parts of the data and training the DAE to recover the original, the model learns a robust representation of the underlying data structure.

  • Data Augmentation via Corruption: To augment data, new samples can be generated by adding controlled noise to the input data and then using the DAE to reconstruct it. This process introduces variability that helps train more robust models. For instance, slightly perturbing the features of a tabular dataset representing different asphalt conditions and reconstructing the perturbed rows yields a richer dataset; see the sketch after this list.
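
A minimal DAE sketch in PyTorch follows, assuming standardized numeric features and Gaussian corruption; the network sizes and noise level are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TabularDAE(nn.Module):
    """Minimal denoising autoencoder for standardized tabular features."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, n_features)

    def forward(self, x):
        return self.decoder(self.encoder(x))


def train_step(model, optimizer, x, noise_std=0.1):
    # Corrupt the input with Gaussian noise, then learn to recover the original
    x_noisy = x + noise_std * torch.randn_like(x)
    loss = nn.functional.mse_loss(model(x_noisy), x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def augment(model, x, noise_std=0.1):
    # Augmentation: perturb real rows, then denoise them back onto the
    # learned data manifold to obtain plausible new variants
    with torch.no_grad():
        return model(x + noise_std * torch.randn_like(x))
```

The noise level noise_std controls how far augmented rows may drift from the originals; it is a tuning assumption here, not a value taken from the project.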

How REaLTabFormer Works for Tabular Data Augmentation

  • Transformer Architecture: REaLTabFormer uses a GPT-2-like transformer architecture to model non-relational tabular data. For relational data, it extends the approach by incorporating an encoder-decoder framework. The encoder processes the "parent" table (often the primary table in relational datasets), while the decoder generates synthetic data for the "child" table, preserving the relationships between them.

  • Data Privacy and Regularization: To ensure privacy, REaLTabFormer incorporates several techniques to prevent the model from memorizing and reproducing exact records from the training data. These include statistical bootstrapping and a data-copying metric that assesses overfitting risk, combined with techniques such as target masking for regularization. A minimal usage sketch follows this list.
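
For reference, here is a single-table usage sketch based on the realtabformer package's published interface; the CSV file name is a placeholder, not a file from the project:

```python
# pip install realtabformer
import pandas as pd
from realtabformer import REaLTabFormer

# Placeholder file name; any single (non-relational) table works here
df = pd.read_csv("asphalt_lab_tests.csv")

# "tabular" selects the GPT-2-style sequence model for a single table
rtf_model = REaLTabFormer(model_type="tabular")
rtf_model.fit(df)

# Draw synthetic rows from the fitted model
synthetic_df = rtf_model.sample(n_samples=1000)
```

The library also exposes a relational mode, matching the encoder-decoder description above, in which a fitted parent-table model conditions the generation of child-table rows.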

The quality and suitability of the augmented datasets generated by the three methods (VAE, DAE, and REaLTabFormer) were assessed using downstream model evaluation metrics, including the R² comparison shown in Figure 3. Among them, the dataset generated by the VAE method proved the most effective, significantly improving the model's accuracy and predictive performance.

Figure 3: Comparison of R² values for each variable with and without the augmented dataset. The histogram highlights the improvement in predictive performance across variables when using the augmented data, demonstrating the effectiveness of the augmentation process.
