
ParquetMetadata

Description

Special format for reading Parquet file metadata (https://parquet.apache.org/docs/file-format/metadata/). It always outputs one row with the following structure:

  • num_columns - the number of columns
  • num_rows - the total number of rows
  • num_row_groups - the total number of row groups
  • format_version - Parquet format version, always 1.0 or 2.6
  • total_uncompressed_size - total uncompressed size of the data in bytes, calculated as the sum of total_byte_size from all row groups
  • total_compressed_size - total compressed size of the data in bytes, calculated as the sum of total_compressed_size from all row groups
  • columns - the list of column metadata with the following structure:
    • name - column name
    • path - column path (differs from name for nested columns)
    • max_definition_level - maximum definition level
    • max_repetition_level - maximum repetition level
    • physical_type - column physical type
    • logical_type - column logical type
    • compression - compression used for this column
    • total_uncompressed_size - total uncompressed size of the column in bytes, calculated as the sum of total_uncompressed_size of the column from all row groups
    • total_compressed_size - total compressed size of the column in bytes, calculated as the sum of total_compressed_size of the column from all row groups
    • space_saved - percent of space saved by compression, calculated as (1 - total_compressed_size/total_uncompressed_size)
    • encodings - the list of encodings used for this column
  • row_groups - the list of row group metadata with the following structure:
    • num_columns - the number of columns in the row group
    • num_rows - the number of rows in the row group
    • total_uncompressed_size - total uncompressed size of the row group in bytes
    • total_compressed_size - total compressed size of the row group in bytes
    • columns - the list of column chunk metadata with the following structure:
      • name - column name
      • path - column path
      • total_compressed_size - total compressed size of the column chunk in bytes
      • total_uncompressed_size - total uncompressed size of the column chunk in bytes
      • have_statistics - boolean flag that indicates whether the column chunk metadata contains column statistics
      • statistics - column chunk statistics (all fields are NULL if have_statistics = false) with the following structure:
        • num_values - the number of non-null values in the column chunk
        • null_count - the number of NULL values in the column chunk
        • distinct_count - the number of distinct values in the column chunk
        • min - the minimum value of the column chunk
        • max - the maximum value of the column chunk
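
The per-column totals and space_saved described in the list above are simple aggregates over the column chunks of every row group. The following is a minimal Python sketch of that arithmetic, not ClickHouse's implementation: the row_groups data and the column_totals helper are hypothetical, and space_saved is kept as the plain ratio from the formula above (how ClickHouse formats the final value is not assumed here).

```python
from collections import defaultdict

# Hypothetical metadata shaped like the row_groups field described above.
row_groups = [
    {"num_rows": 1000, "columns": [
        {"name": "id", "total_compressed_size": 1000, "total_uncompressed_size": 4000},
        {"name": "msg", "total_compressed_size": 600, "total_uncompressed_size": 1000},
    ]},
    {"num_rows": 500, "columns": [
        {"name": "id", "total_compressed_size": 500, "total_uncompressed_size": 2000},
        {"name": "msg", "total_compressed_size": 400, "total_uncompressed_size": 1000},
    ]},
]


def column_totals(row_groups):
    """Sum column chunk sizes across all row groups, then derive space_saved."""
    totals = defaultdict(
        lambda: {"total_compressed_size": 0, "total_uncompressed_size": 0}
    )
    for row_group in row_groups:
        for chunk in row_group["columns"]:
            entry = totals[chunk["name"]]
            entry["total_compressed_size"] += chunk["total_compressed_size"]
            entry["total_uncompressed_size"] += chunk["total_uncompressed_size"]

    result = {}
    for name, entry in totals.items():
        uncompressed = entry["total_uncompressed_size"]
        # Ratio from the list above: 1 - compressed / uncompressed.
        # Guard against empty columns to avoid division by zero.
        saved = 1 - entry["total_compressed_size"] / uncompressed if uncompressed else 0.0
        result[name] = {**entry, "space_saved": saved}
    return result


totals = column_totals(row_groups)
# File-level num_rows is the sum of num_rows over all row groups.
num_rows = sum(row_group["num_rows"] for row_group in row_groups)
```

With this sample data, the id column shrinks from 6000 to 1500 bytes, so its ratio is 0.75; a negative ratio would mean the compressed data is larger than the uncompressed data.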

Example Usage

Example: