As the maker of Warp 10, the most advanced Time Series Platform, we are regularly in contact with people who have been storing time series data for quite some time. The diversity of technologies that have been employed would surprise you. Among those technologies we often encounter the Parquet file format, which is usually chosen because it is a columnar format and is therefore considered performant for time series data. When we ask further questions to better understand the schema adopted, how those files are used, and what storage and access performance they allow, the answers often show that the way the Parquet format works is not very well understood. This post aims at giving you a better understanding of the internals of the Parquet format and of its suitability for time series data.

In 2010, at the VLDB Conference, people from Google presented a paper titled Dremel: Interactive Analysis of Web-Scale Datasets, which described an internal tool named Dremel and the way it organized data. In the summer of 2012, Julien Le Dem, who was then working at Twitter, tweeted that he had found an error in the Dremel paper. The reason was that he had started working on an open source implementation of the nested record format used by Dremel for use in Hadoop, a project that would a little later become known as Parquet.

The novelty described in the Dremel paper was a clever way of enabling complex records to be stored in a columnar format. The idea behind columnar formats is that by grouping data in columns instead of rows, similar values end up together and lead to better compression, and at retrieval time only the needed data is read when not all columns are needed. Systems such as HBase had long made use of the columnar format paradigm. But while the columnar format was well suited for table-like data, it was not adapted to nested data structures, which would often end up in a single column, annihilating the benefit of the columnar format when only specific fields of such a structure are needed. This is the issue that the Dremel paper addressed and solved with a clever mechanism for describing how fields are nested and repeated. We let you dive deep into section 4 of the original Dremel paper to learn the intricacies of how records are split into columns.
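
To make the columnar idea discussed in this post a little more concrete, here is a minimal Python sketch. It is not how Parquet or Dremel actually encode data: the sample readings, the field names (`ts`, `value`, `meta.sensor`, `meta.unit`) and the helpers `flatten`, `to_columns`, `encode_columns` and `read_column` are all hypothetical, invented for illustration. It only handles nested, non-repeated fields by giving each leaf its own column; the repeated fields that the Dremel paper deals with in section 4 are deliberately left out.

```python
import json
import zlib

# Hypothetical time series readings with a nested "meta" structure
# (invented data, for illustration only).
rows = [
    {
        "ts": 1600000000 + i,
        "value": 20.0 + (i % 5) * 0.5,
        "meta": {"sensor": "sensor-%d" % (i % 3), "unit": "C"},
    }
    for i in range(1000)
]


def flatten(record, prefix=""):
    """Turn a nested record into a flat dict keyed by dotted leaf paths."""
    flat = {}
    for key, value in record.items():
        path = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def to_columns(records):
    """Group values by leaf path: one list of values per column."""
    columns = {}
    for record in records:
        for path, value in flatten(record).items():
            columns.setdefault(path, []).append(value)
    return columns


def encode_columns(columns):
    """Serialize and compress every column independently of the others."""
    return {
        path: zlib.compress(json.dumps(values).encode("utf-8"))
        for path, values in columns.items()
    }


def read_column(blobs, path):
    """Decode a single column without touching the other columns."""
    return json.loads(zlib.decompress(blobs[path]).decode("utf-8"))


blobs = encode_columns(to_columns(rows))

# Only the "meta.sensor" column is decompressed and parsed here.
sensors = read_column(blobs, "meta.sensor")

# Compare the compressed size of the row layout with the per-column layout.
row_size = len(zlib.compress(json.dumps(rows).encode("utf-8")))
column_size = sum(len(blob) for blob in blobs.values())
print("row layout (compressed bytes):", row_size)
print("column layout (compressed bytes):", column_size)
```

Because `meta` is split into `meta.sensor` and `meta.unit`, reading the sensor names does not require decoding the unit, the timestamps or the values. Had `meta` been stored as one opaque column, every read of a single subfield would have had to decode the whole structure, which is the limitation described above.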