Zacharias Voulgaris, Chief Science Officer, Data Science Partnership · Jan 6, 2021

Data Synthetics without A.I. and Why This Adds Value to You as an Individual

Data Synthetics is a term I coined for the framework and processes involved in synthesizing data (instead of just analyzing it). In my view, it's one of the most significant developments in data science today and one of the many applications of A.I.: specialized systems generate new data from a given dataset while maintaining the properties of the original. But isn't there an abundance of data out there? Well, yes, but we could always use some more. The rationale is much like that of a fiction writer, who often prefers to create her own characters for a novel or a short story even though there are plenty of real-world people she could copy into her text. So, if you don't want to be part of someone else's work of fiction (especially if it gets published and read by many other people), you may want to keep your personally identifiable information (PII) from roaming free in the world. Some of that information you may be unable to change (e.g., health-related PII, aka PHI), so protecting it is of paramount importance.

Data synthetics can do this for you by creating new data that is very similar to the existing data, thereby putting an unbridgeable gap between your PII and the data that a predictive model, for example, actually uses. Because of this similarity, the model's predictions can still be relevant to you, since the general underlying pattern (aka the signal in the data) remains the same.

Plenty of brilliant A.I. professionals, scientists and engineers alike, have delved into this problem and come up with mathematically elegant solutions. One such solution is the Variational AutoEncoder (VAE), a kind of artificial neural network (ANN) that aims to figure out the underlying distributions of the data and create new data based on them. These distributions are a mathematical model of the signal; not the only one and probably not the best one either, but good enough for something basic. The problem with VAEs (and other A.I. systems) is that they need sufficiently large datasets to figure out this signal and reproduce it in new data. Additionally, building a VAE isn't simple unless you understand the technology and the not-so-trivial math involved.
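To make the "estimate the distributions, then sample from them" idea concrete, here is a minimal Python sketch of its simplest possible form. It is neither a VAE nor the ROOF framework discussed in this post: it simply assumes the data is roughly multivariate Gaussian, estimates a mean vector and covariance matrix, and draws new rows from that fitted distribution. The function names (`fit_gaussian`, `cholesky`, `synthesize`) and the toy dataset are made up for illustration.

```python
import math
import random


def fit_gaussian(rows):
    """Estimate the mean vector and sample covariance matrix of the rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


def cholesky(a):
    """Return lower-triangular L such that L * L^T equals a.

    Assumes a is symmetric and positive definite.
    """
    d = len(a)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(a[i][i] - s)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L


def synthesize(rows, n_new, seed=0):
    """Draw n_new rows from a Gaussian fitted to the original rows.

    Assumption: a single Gaussian describes the data well enough.
    """
    rng = random.Random(seed)
    mean, cov = fit_gaussian(rows)
    L = cholesky(cov)
    d = len(mean)
    out = []
    for _ in range(n_new):
        # Independent standard normal draws, then correlate them via L.
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        out.append([mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                    for i in range(d)])
    return out


# Toy "original" data: two correlated continuous columns (hypothetical).
data_rng = random.Random(42)
original = []
for _ in range(500):
    x = data_rng.gauss(10.0, 2.0)
    original.append([x, 3.0 * x + data_rng.gauss(0.0, 1.0)])

synthetic = synthesize(original, n_new=2000, seed=1)
orig_mean, orig_cov = fit_gaussian(original)
synth_mean, synth_cov = fit_gaussian(synthetic)
print("original mean: ", orig_mean)
print("synthetic mean:", synth_mean)
```

The synthetic rows are fresh draws rather than copies of the original records, yet they follow the same means and the same correlation between the two columns. A single Gaussian is a crude model of the signal, of course; real data with skew, multiple clusters, or non-linear relationships would need something richer.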

What if there was a way to develop synthetic data without utilizing A.I.? What if all you needed was the math you learned in school and a few other things built on that math, elegant but not overly sophisticated? Well, that's what I've done recently, with enough success to consider the result usable and useful. This framework, which I call ROOF (hence the picture at the top) and developed in Julia 1.5, is light on computational resources and can be applied to any kind of continuous data (there is also a version for ordinal data, though I imagine that's not something you care about that much). If you are in this line of work or know someone who is, feel free to reach out to me. Cheers!