Common data file formats explained

Jan 15, 2025


Luke Posey, Product Manager

@QuadraticHQ

In this article we explore a range of data file formats, both common and obscure. For each one we'll look at its history, its pros and cons, and how it's used today, with a few related concepts sprinkled in along the way.

CSV

CSV (Comma-Separated Values) is a simple, text-based file format used for storing tabular data. Each line in a CSV file represents a row of data, with individual values separated by commas. It is the most common spreadsheet file type for data practitioners.

Commonly used for: data exchange between applications, spreadsheets, and databases, and for sharing data between users of those tools.

Origin: Comma-separated values date to the early days of computing in the 60s and 70s; IBM's Fortran compilers supported comma-separated input to facilitate data exchange. The first standardization attempt came in 2005 with RFC 4180. [1]

Pros:

  • Wide support
  • Easy to create and easily human readable
  • Relatively compact file sizes, can be compressed well

Cons:

  • Not the most efficient format for large datasets
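For example, Python's built-in csv module can write and read CSV data in a few lines (the table contents here are made up):

```python
# Write and read a small table with Python's built-in csv module.
import csv
import io

rows = [["name", "price"], ["widget", "9.99"], ["gadget", "24.50"]]

# Write to an in-memory buffer (a file path works the same way).
buf = io.StringIO()
csv.writer(buf).writerows(rows)

# Read it back. Note that every value comes back as a string:
# CSV carries no type information.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
print(parsed[1])  # ['widget', '9.99']
```

The lack of types is worth remembering: numbers, dates, and booleans all arrive as strings and must be converted by the consumer.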

Parquet

Parquet is a column-oriented data file format that provides efficient data compression and encoding schemes, making it a common big data file format for high-performance data processing.

Commonly used for: large datasets in data warehouses, real-time data processing, and big data situations.

Origin: Parquet was created by a group of Twitter and Cloudera engineers looking to improve the file performance in big data situations and is now an Apache project.

Pros:

  • Efficient data compression and encoding schemes
  • Supports a wide range of data types
  • Schema stored in metadata
  • Columnar format allows for efficient data processing (add/remove columns easier)

Cons:

  • Not (yet) as widely supported as CSV

JSON

JSON (JavaScript Object Notation) is a lightweight data file format that's easy for both humans and machines to work with. It's based on JavaScript but works with many programming languages, making it ideal for data exchange. JSON is simple to read, write, parse, and generate.

Commonly used for: API responses, configuration files, and data interchange between web applications and servers.

Origin: JSON was created by Douglas Crockford in 2001. The first standardization attempt came in 2005 with RFC 4627. [2]

Pros:

  • Human-readable and writable
  • Easy for machines to parse and generate
  • Widely supported and understood
  • Flexible data types (strings, numbers, arrays, objects)

Cons:

  • No standard schema definition
  • Security concerns when parsing untrusted input
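A minimal round-trip with Python's standard json module (the record is a made-up example):

```python
# Serialize and parse JSON with Python's built-in json module.
import json

record = {"id": 7, "tags": ["a", "b"], "active": True}

text = json.dumps(record)   # Python object -> JSON string
parsed = json.loads(text)   # JSON string -> Python object
print(parsed["tags"])       # ['a', 'b']
```

Unlike CSV, the basic types (numbers, booleans, arrays, nested objects) survive the round trip intact.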

ORC

ORC (Optimized Row Columnar) is a columnar data format that provides efficient data compression and encoding schemes. Where Parquet is optimized for the high-performance reads common in analytics scenarios, ORC is a big data file type optimized for high-performance writes.

Commonly used for: large datasets in data warehouses and big data processing systems.

Origin: ORC was a collaboration of Facebook and Hortonworks and is now an Apache project.

Pros:

  • Efficient data compression and encoding schemes
  • Great for write performance
  • Supports ACID transactions

Cons:

  • Not as optimized for read performance as Parquet
  • Not as widely supported

Avro

Avro is a compact, fast, binary data format whose schemas are defined in human-readable JSON. Unlike Parquet, Avro is row-based.

Commonly used for: data interchange between systems, commonly in Hadoop, Spark, and Kafka.

Origin: Avro was created by Doug Cutting as an efficient way to store and transmit data with Hadoop; it is now an Apache project.

Pros:

  • Compact and fast binary format
  • Supports schema
  • Wide support for data types

Cons:

  • Not as widely supported as CSV or JSON
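Avro schemas are themselves written in JSON. A minimal, hypothetical schema for a user record might look like this:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "email", "type": "string"}
  ]
}
```

The schema is stored alongside the data, which is what lets Avro files be read by consumers that have never seen the producer's code.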

XML

XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.

Commonly used for: configuration files, data interchange.

Origin: XML was created by the World Wide Web Consortium (W3C) in the late 1990s. [3]

Pros:

  • Human-readable and writable
  • Supports complex data types
  • Supports schema definition

Cons:

  • Verbose, and consequently not great for large datasets or bulk data storage
  • Not as widely supported as CSV and JSON, and less popular than it used to be
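A small sketch using Python's standard xml.etree module (the document is a made-up example):

```python
# Parse a small XML document with the standard library.
import xml.etree.ElementTree as ET

xml_text = "<catalog><book id='1'><title>Dune</title></book></catalog>"
root = ET.fromstring(xml_text)

# Navigate the tree with simple XPath-style queries.
title = root.find("./book/title").text
book_id = root.find("book").get("id")
print(title, book_id)  # Dune 1
```

The tag-per-value structure is also a good illustration of the verbosity con above: the markup often outweighs the data.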

YAML

YAML (YAML Ain't Markup Language) is a human-readable data-serialization format that is easy to read and write. As a consequence, it has become very common for configuration files.

Commonly used for: configuration files

Origin: YAML was created in 2001 as a human-readable data-serialization standard.

Pros:

  • Human-readable and writable
  • Supports complex data types
  • Supports schema definition

Cons:

  • Not good for data storage (not really a con, just not a use-case; it's like saying the con of a fridge is that it isn't great at freezing things)
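A hypothetical application config in YAML, showing nested mappings, lists, and comments:

```yaml
# All keys and values here are made-up examples.
server:
  host: 0.0.0.0
  port: 8080
features:
  - auth
  - metrics
debug: false  # comments are allowed
```

Indentation carries the structure, which is what makes YAML pleasant to read and occasionally painful to debug.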

TOML

TOML (Tom's Obvious, Minimal Language) is a data file format primarily used for configuration files. Its natural comparison is YAML.

Commonly used for: configuration files

Origin: TOML was created by Tom Preston-Werner in 2013.

Pros:

  • A simpler, less ambiguous spec than YAML, with native date-time types

Cons:

  • More niche

Pros/cons in this category are pretty subjective. [7]

NDJSON

NDJSON (Newline-Delimited JSON) is a format that stores one JSON object per line, with records separated by newline characters. It's useful when you need to stream multiple JSON objects.

Commonly used for: data interchange, log files, big exports of multiple JSON objects.
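A minimal sketch of producing and consuming NDJSON with the standard json module (the events are made up):

```python
# Stream records as NDJSON: one JSON object per line.
import io
import json

events = [{"event": "login", "user": 1}, {"event": "click", "user": 1}]
ndjson = "\n".join(json.dumps(e) for e in events)

# A consumer can parse line by line, without loading the whole payload.
parsed = [json.loads(line) for line in io.StringIO(ndjson)]
print(len(parsed))  # 2
```

This line-at-a-time property is the whole point: a plain JSON array would have to be parsed in full before yielding a single record.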

TSV

TSV (Tab-Separated Values) is a simple text format for storing data in a tabular structure. It's just like CSV, but uses tabs as delimiters instead of commas.

Commonly used for: data exchange between databases and spreadsheet applications, especially when the data might contain commas. (Though "commonly" is generous here; "where it's used" might be more accurate.)

Origin: Created at the University of Minnesota, 1993. [4]

Pros:

  • Better than CSV when fields include commas; no need to escape the delimiter
  • Human readable: easy to read and write

Cons:

  • Less common/less supported than CSV
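Python's csv module handles TSV too; a sketch with a made-up row containing a comma:

```python
# The csv module reads TSV as well: just set the delimiter to a tab.
import csv
import io

tsv_text = "name\tquote\nAda\tHello, world\n"
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
print(rows[1])  # ['Ada', 'Hello, world']
```

The embedded comma passes through untouched, which is exactly the escaping headache TSV avoids.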

Binary

Binary files contain data stored in binary format, which means the data is represented as a sequence of bytes rather than human-readable text. These files are typically used for storing non-text data or for efficient storage and processing of large amounts of data.

Commonly used for: Executables, compiled code, images, audio files, video files, and other types of non-text data.

Origin: Binary files have been used since the early days of computing.

Pros:

  • Efficient storage for data that maps naturally to bytes (images, audio, compiled code)
  • Supports complex data types
  • Supports efficient data compression and encryption

Cons:

  • Inefficient for data that doesn't map naturally to binary, and not human-readable without knowledge of the format
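A tiny sketch of raw binary packing with Python's standard struct module (the field layout is hypothetical):

```python
# Pack values into raw bytes with the struct module.
import struct

# "<ihh" = little-endian: one 4-byte int, two 2-byte shorts (8 bytes total).
packed = struct.pack("<ihh", 1024, 3, -7)
print(len(packed))  # 8

# Reading the bytes back requires knowing the exact layout.
unpacked = struct.unpack("<ihh", packed)
print(unpacked)  # (1024, 3, -7)
```

That last point illustrates the main con: without the format string, those 8 bytes are meaningless.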

Hierarchical Data Format (HDF)

HDF (Hierarchical Data Format) is a data file format designed for storing and managing large and complex data sets. It's extremely efficient for data retrieval when you typically read large blocks of consecutive rows, but not as efficient if you need random sampling, specific queries, etc.

Commonly used for: various scientific and commercial use-cases, though it seems to be losing popularity.

Origin: HDF5 was created by the National Center for Supercomputing Applications (NCSA).

Pros:

  • Fast and efficient data storage and retrieval for large datasets
  • Long-time Python support and continuous development have driven a lot of adoption

Cons:

  • Slow in certain scenarios with specific sampling requirements

Feather

Feather is a binary file format designed to be fast and portable. It ships as part of Apache Arrow and is often compared to Parquet.

Commonly used for: storing Arrow tables and data frames. [6]

Origin: Feather was created as part of the Arrow project.

Pros:

  • Fast, portable
  • Integrated deeply with Arrow

Cons:

  • Much less support than Parquet
  • Optimized for read/write but not storage size (though compression was added at some point as a feature)

Pickle

Pickle is a commonly used Python file format for serialized Python objects.

Commonly used for: carrying on a Python program's state between sessions, caching results, etc.

Origin: pickle has shipped with Python's standard library since the language's early releases.

Pros:

  • The default, built-in way to save arbitrary Python objects

Cons:

  • Unsafe to load from untrusted sources (unpickling can execute arbitrary code)
  • Python-specific, so not portable across languages
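A minimal round-trip with the standard pickle module (the object is a made-up example):

```python
# Round-trip an arbitrary Python object with pickle.
import pickle

state = {"step": 42, "history": [(1, "a"), (2, "b")]}

blob = pickle.dumps(state)     # bytes, not human-readable
restored = pickle.loads(blob)  # only unpickle data you trust!
print(restored["step"])        # 42
```

Tuples, nested containers, and most user-defined classes survive the round trip, which is what makes pickle the easy choice for saving program state.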

Delta

(Worth noting that Delta is a table format, not a data file format. The difference is that a table format is an abstraction over a set of files, optimizing for the organization of the files upon which the abstraction sits. Same story with Iceberg, which is another table format mentioned after Delta.)

Delta is a table format built on Parquet that adds some extra features, like support for ACID transactions.

Commonly used for: similar use-cases to Parquet, but prioritized in lakehouse platforms like Databricks.

Origin: Delta was created at Databricks for use in Delta Lake.

Pros:

  • Efficient data storage for large and complex data sets (like Parquet)
  • ACID transactions where Parquet lacks this feature

Cons:

  • Extra maintenance overhead

Iceberg

Iceberg bills itself as the table format for "huge analytic tables." [5] You can imagine Iceberg in a similar vein to Delta, as a table format.

Commonly used for: analytics engines like Spark, Trino, Flink, Presto, Hive and Impala to be capable of working with very large tables.

Origin: Iceberg was developed at Netflix to solve Hadoop limitations. Iceberg is now an Apache project.

Pros:

  • Extremely high performance
  • ACID transactions
  • Cheap schema changes

Cons:

  • Extra maintenance overhead

Protocol Buffers (protobuf)

Protocol Buffers (protobuf) is a format for serializing structured data that is often compared to JSON. It's used over JSON when you want compact binary serialization and care less about human readability.

Commonly used for: serializing structured data, especially in RPC systems such as gRPC.

Origin: Developed at Google in 2001.

Pros vs JSON:

  • Smaller size
  • Type safety
  • Fast serialization and deserialization

Cons vs JSON:

  • Not human readable
  • Schema obligatory
  • Less widely supported
  • Can't be read directly in the browser without decoding
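Protobuf messages are defined in .proto schema files, which protoc compiles into generated code for your target language. A hypothetical message definition:

```proto
// A made-up message definition for illustration.
syntax = "proto3";

message User {
  int64 id = 1;
  string email = 2;
  repeated string tags = 3;
}
```

The numeric field tags, not the field names, are what go on the wire, which is part of why the encoding is so compact.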

Flatbuffers

Flatbuffers is a serialization format like Protocol Buffers.

Commonly used for: serialization where you want to access fields without deserializing the entire object. Arrow also uses Flatbuffers for serialization. [8]

Origin: Flatbuffers was created at Google for game development.

Binary JSON (BSON)

Just as the name says, BSON is a binary-encoded serialization of JSON data.

Commonly used for: cases where you need to serialize JSON; its largest user is MongoDB, its creator, and it doesn't seem to have caught on much elsewhere.

Origin: BSON was created by MongoDB in 2009.

Sources

[1] https://datatracker.ietf.org/doc/html/rfc4180

[2] https://datatracker.ietf.org/doc/html/rfc4627

[3] https://www.itwriting.com/xmlintro.php

[4] https://en.wikipedia.org/wiki/Tab-separated_values

[5] https://iceberg.apache.org/

[6] https://arrow.apache.org/docs/python/feather.html

[7] https://news.ycombinator.com/item?id=17523304

[8] https://news.ycombinator.com/item?id=34415858

Please feel free to contact us if you think there are any common file types we've missed.
