Data Challenges in Semi-Structured Data
In the old days of data warehouses, enterprises had to be careful about what went into their warehouses, knowing they’d have to pay dearly for every byte of data they stored. As cloud technologies drove the storage costs dramatically down, these enterprises could loosen the restrictions on what data (and how much of it) they stored in a data warehouse, opting to do more of the cleaning and transformation using SQL, inside the warehouse, rather than using procedural code to transform the data on its way into the warehouse.
JSON contains readable tags and an implicit organizational structure, but it’s a far cry from the rows and columns and rules of structured data. The structure is not fixed, and the files can contain an arbitrary depth of nesting. Flattening this data on its way into the data warehouse can result in lower resolution data and the loss of any detail that doesn’t conform to a predefined relational schema. Conversely, having semi-structured data like JSON in the warehouse creates a semantics challenge for anyone trying to curate the data for the downstream consumers.
To complicate matters further, the challenges of working with semi-structured data only grow as the volume, velocity, variety, and volatility of the data increase.