Luke Posey, Product Manager
In this article we explore many of the different data file formats, both common and obscure. We'll look at the history of each file type, their pros and cons, and how they are used today. We'll also sprinkle in some data file types and other related concepts among the list of file formats.
CSV
CSV (Comma-Separated Values) is a simple, text-based file format used for storing tabular data. Each line in a CSV file represents a row of data, with individual values separated by commas. It is the most common spreadsheet file type for data practitioners.
Commonly used for: data exchange between applications, spreadsheets, and databases. CSVs are common for sharing data between users of spreadsheets and databases.
Origin: Comma-separated values date to the early days of computing in the 60s and 70s. IBM developed it for their Fortran language to facilitate data exchange and the first standardization attempt came with RFC 4180. [1]
Pros:
- Wide support
- Easy to create and easily human readable
- Relatively compact file sizes, can be compressed well
Cons:
- Not the most efficient format for large datasets
Parquet
Parquet is a column-oriented data file format that provides efficient data compression and encoding schemes, making it a common big data file format for high-performance data processing.
Commonly used for: large datasets in data warehouses, real-time data processing, and big data situations.
Origin: Parquet was created by a group of Twitter and Cloudera engineers looking to improve the file performance in big data situations and is now an Apache project.
Pros:
- Efficient data compression and encoding schemes
- Supports a wide range of data types
- Schema stored in metadata
- Columnar format allows for efficient data processing (add/remove columns easier)
Cons:
- Not (yet) as widely supported as CSV
JSON
JSON (JavaScript Object Notation) is a lightweight data file format that's easy for both humans and machines to work with. It's based on JavaScript but works with many programming languages, making it ideal for data exchange. JSON is simple to read, write, parse, and generate.
Commonly used for: API responses, configuration files, and data interchange between web applications and servers.
Origin: JSON was created by Douglas Crockford in 2001. The first standardization attempt came in 2005 with RFC 4627. [2]
Pros:
- Human-readable and writable
- Easy for machines to parse and generate
- Widely supported and understood
- Flexible data types (strings, numbers, arrays, objects)
Cons:
- No standard schema definition
- Security concerns in untrusted services
ORC
ORC (Optimized Row Columnar) is a columnar based data format that provides efficient data compression and encoding schemes. Where Parquet is optimized for high performance read situations that exist in many analytics scenarios, ORC is a big data file type optimized for high performance write.
Commonly used for: large datasets in data warehouses and big data processing systems.
Origin: ORC was a collaboration of Facebook and Hortonworks and is now an Apache project.
Pros:
- Efficient data compression and encoding schemes
- Great for write performance
- Supports ACID transactions
Cons:
- Not as optimized for read performance as parquet
- Not as widely supported
Avro
Avro is a compact, fast, binary data format that is easy for humans to read and write. Unlike Parquet, Avro is row-based.
Commonly used for: data interchange between systems, commonly in Hadoop, Spark, and Kafka.
Origin: Avro was created by Doug Cutting as an efficient way to store and transmit data with Hadoop; it is now an Apache project. Commonly used in Hadoop, Spark, and Kafka.
Pros:
- Compact and fast binary format
- Supports schema
- Wide support for data types
Cons:
- Not as widely supported of a data format as CSV and JSON
XML
XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.
Commonly used for: configuration files, data interchange.
Origin: XML was created by the World Wide Web Consortium (W3C) in the late 1990s. [3]
Pros:
- Human-readable and writable
- Supports complex data types
- Supports schema definition
Cons:
- Not so compact and subsequently not great for large datasets or data storage
- Not as widely supported as CSV and JSON, less popular than it used to be
YAML
YAML (YAML Ain't Markup Language) is a human-readable serialization format that is easy to read and write, optimized for serialization. As a consequence, it has become very common for configuration files.
Commonly used for: configuration files
Origin: YAML was created in 2001 as a human-readable data-serialization standard.
Pros:
- Human-readable and writable
- Supports complex data types
- Supports schema definition
Cons:
- Not good for data storage (not really a con, just not a use-case; it's like saying the con of a fridge is that it isn't great at freezing things)
TOML
TOML (Tom's Obvious, Minimal Language) is a data file format primarily used for configuration files. Its natural comparison is YAML.
Commonly used for: configuration files
Origin: TOML was created by Tom Preston-Werner.
Pros:
- Has some extra features over YAML, like support for comments and date-time types
Cons:
- More niche
Pros/cons in this category are pretty subjective. [7]
NDJSON
NDJSON (Newline-Delimited JSON) is a format that represents JSON objects in a single line, with each object separated by a newline character. Useful when you know you need to stream multiple JSON objects.
Commonly used for: data interchange, log files, big exports of multiple JSON objects.
TSV
TSV (Tab-Separated Values) is a simple text format for storing data in a tabular structure. Just like CSV, but instead uses tabs as delimiters instead of commas.
Commonly used for: data exchange between databases and spreadsheet applications, especially when the data might contain commas. (This section is called "commonly used for" but TSV is not commonly used. More like "where it's used.")
Origin: Created at the University of Minnesota, 1993. [4]
Pros:
- Good over CSVs when a field includes commas; don't have to escape the delimiter
- Human readable: easy to read and write
Cons:
- Less common/less supported than CSV
Binary
Binary files contain data stored in binary format, which means the data is represented as a sequence of bytes rather than human-readable text. These files are typically used for storing non-text data or for efficient storage and processing of large amounts of data.
Commonly used for: Executables, compiled code, images, audio files, video files, and other types of non-text data.
Origin: Binary files have been used since the early days of computing.
Pros:
- Efficient data storage depending on the data type (some are very inefficient if stored in binary format)
- Supports complex data types
- Supports efficient data compression and encryption
Cons:
- Inefficient data storage depending on the data type
Hierarchial Data Format (HDF)
HDF (Hierarchical Data Format) is a type of data file format designed for storing and managing large and complex data sets. It's extremely efficient for data retrieval when you typically read n number of consecutive rows, but not so efficient if you need to do random sampling, specific queries, etc.
Commonly used for: common in various scientific and commercial use-cases. Seeming to phase out/lose popularity.
Origin: HDF5 was created by the National Center for Supercomputing Applications (NCSA).
Pros:
- Fast and efficient data storage and retrieval for large datasets
- Long time Python support and continuous development has driven a lot of adoption
Cons:
- Slow in certain scenarios with specific sampling requirements
Feather
Feather is a binary file format designed to be fast and portable. Ships as part of Arrow and is often compared to Parquet. Answer to what is Arrow, Arrow vs Parquet, etc?
Commonly used for: storing Arrow tables and data frames. [6]
Origin: Feather was created as part of the Arrow project.
Pros:
- Fast, portable
- Integrated deeply with Arrow
Cons:
- Much less support than Parquet
- Optimized for read/write but not storage size (though compression was added at some point as a feature)
Pickle
Pickle is a commonly used Python file format for serialized Python objects.
Commonly used for: carrying on a Python program's state between sessions, caching results, etc.
Origin: not sure.
Pros/cons:
- This is the default option if you want to save Python objects.
Delta
(Worth noting that Delta is a table format, not a data file format. The difference is that a table format is an abstraction over a set of files, optimizing for the organization of the files upon which the abstraction sits. Same story with Iceberg, which is another table format mentioned after Delta.)
Delta is a table format built on Parquet that adds some extra features, like support for ACID transactions.
Commonly used for: similar use-cases to Parquet but prioritized in some data lakes like Databricks.
Origin: Was created at Databricks for use in Delta Lakes.
Pros:
- Efficient data storage for large and complex data sets (like Parquet)
- ACID transactions where Parquet lacks this feature
Cons:
- Extra maintenance overhead
Iceberg
Iceberg bills itself as the table format for "huge analytic tables." [5] You can imagine Iceberg in a similar vein to Delta, as a table format.
Commonly used for: analytics engines like Spark, Trino, Flink, Presto, Hive and Impala to be capable of working with very large tables.
Origin: Iceberg was developed at Netflix to solve Hadoop limitations. Iceberg is now an Apache project.
Pros:
- Extremely high performance
- ACID transactions
- Cheap schema changes
Cons:
- Extra maintenance overhead
Protocol Buffers (protobuf)
Protocol Buffers (protobuf) is a data file format for serializing data that is often compared to JSON. Used over JSON when you want serialization and care less about human readability.
Commonly used for: very common data format for serializing data
Origin: Developed at Google in 2001.
Pros vs JSON:
- Smaller size
- Type safety
- Fast serialization and deserialization
Cons vs JSON:
- Not human readable
- Schema obligatory
- Less widely supported
- If you need to read data passed directly in browser
Flatbuffers
Flatbuffers is a serialization format like Protocol Buffers.
Commonly used for: serialization where you want to deserialize without having to deserialize the entire object. Arrow also uses Flatbuffers for serialization. [8]
Origin: Flatbuffers was created at Google for game development.
Binary JSON (BSON)
Just as the name says, BSON is a binary-encoded serialization of JSON data.
Commonly used for: cases where you need to serialize JSON, largest use is MongoDB as its creator, doesn't seem to have caught on too much elsewhere.
Origin: BSON was created by MongoDB in 2009.
Sources
[1] https://datatracker.ietf.org/doc/html/rfc4180
[2] https://datatracker.ietf.org/doc/html/rfc4627
[3] https://www.itwriting.com/xmlintro.php
[4] https://en.wikipedia.org/wiki/Tab-separated_values
[5] https://iceberg.apache.org/
[6] https://arrow.apache.org/docs/python/feather.html
[7] https://news.ycombinator.com/item?id=17523304
[8] https://news.ycombinator.com/item?id=34415858
Please feel free to contact us if you think there are any common file types we've missed.