File Formats
CSV, TSV, JSON, and Avro are traditional row-based file formats. Parquet and ORC are columnar file formats.
SequenceFile
Sequence files were introduced in Hadoop. They are flat files consisting of binary key-value pairs, and they act as a container for small files: their main use is to combine two or more smaller files into a single sequence file. The binary format is splittable. When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to use for a given record.
There are three types of sequence files:
- Uncompressed key/value records.
- Record compressed key/value records -- only 'values' are compressed here
- Block compressed key/value records -- both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable.
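The difference between record compression and block compression can be illustrated with a small sketch. This is not the real SequenceFile on-disk layout (which also has headers, length prefixes and sync markers); it only assumes some hypothetical sample records and uses `zlib` to compare compressing each value alone against compressing batched keys and values.

```python
import zlib

# Hypothetical sample records: (key, value) pairs as bytes.
records = [
    (f"file-{i:04d}.log".encode(), b"status=OK latency=12ms " * 8)
    for i in range(100)
]

def record_compressed(records):
    """Record compression: each value is compressed alone, keys stay as-is."""
    return [(key, zlib.compress(value)) for key, value in records]

def block_compressed(records, block_size=25):
    """Block compression: keys and values of a block are collected
    separately and each collection is compressed as a whole."""
    blocks = []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        keys = b"\n".join(key for key, _ in chunk)
        values = b"\n".join(value for _, value in chunk)
        blocks.append((zlib.compress(keys), zlib.compress(values)))
    return blocks

def total_size(pairs):
    """Total number of bytes across all (first, second) byte pairs."""
    return sum(len(first) + len(second) for first, second in pairs)

uncompressed_size = total_size(records)
record_size = total_size(record_compressed(records))
block_size_total = total_size(block_compressed(records))
print(uncompressed_size, record_size, block_size_total)
```

On repetitive data like this, block compression tends to come out smallest because the compressor sees many similar keys and values at once.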
RCFile
- RCFILE stands for Record Columnar File, another binary file format, which offers a high compression rate on top of the rows.
- RCFILE is used when we want to perform operations on multiple rows at a time.
- RCFILEs are flat files consisting of binary key/value pairs and share many similarities with SEQUENCEFILE. RCFILE stores the columns of a table as records in a columnar manner: it first partitions rows horizontally into row splits, and then partitions each row split vertically by column. RCFILE stores the metadata of a row split as the key part of a record and all the data of the row split as the value part. This means that RCFILE encourages column-oriented storage rather than row-oriented storage.
- Column-oriented storage is very useful for analytics. It is easy to perform analytics when we have a column-oriented storage type.
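The horizontal-then-vertical partitioning described above can be sketched in a few lines. This is a toy in-memory model, not the actual RCFILE byte layout; the sample rows, the `rows_per_group` parameter and the `num_rows` metadata field are assumptions made for illustration.

```python
def to_row_groups(rows, rows_per_group):
    """Horizontally partition rows into fixed-size row splits."""
    return [rows[i:i + rows_per_group] for i in range(0, len(rows), rows_per_group)]

def to_columnar(row_group):
    """Vertically partition one row split: one list per column."""
    return [list(column) for column in zip(*row_group)]

# Hypothetical table with three columns: id, name, amount.
rows = [
    (1, "alice", 30.5),
    (2, "bob", 12.0),
    (3, "carol", 99.9),
    (4, "dave", 45.2),
    (5, "erin", 7.3),
]

layout = []
for group in to_row_groups(rows, rows_per_group=2):
    metadata = {"num_rows": len(group)}            # stands in for the key part
    layout.append((metadata, to_columnar(group)))  # column data is the value part

def read_column(layout, column_index):
    """Read a single column across all row splits without touching the others."""
    values = []
    for _metadata, columns in layout:
        values.extend(columns[column_index])
    return values

print(read_column(layout, 1))
```

Reading one column only visits that column's list in each row split, which is why this layout suits analytic queries that touch few columns.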
ORCFile
- ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75% (e.g., a 100 GB file becomes 25 GB). As a result, the speed of data processing also increases. ORC shows better performance than the Text, Sequence and RC file formats.
- An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing the data.
Choosing File Formats
- If your data is delimited by a separator character (for example, commas or tabs) then you can use TEXTFILE format.
- If your data is in small files whose size is less than the block size then you can use SEQUENCEFILE format.
- If you want to perform analytics on your data and you want to store your data efficiently for that then you can use RCFILE format.
- If you want to store your data in an optimized way which lessens your storage and increases your performance then you can use ORCFILE format.
https://acadgild.com/blog/apache-hive-file-formats
Amazon Ion
Amazon Ion is a richly-typed, self-describing, hierarchical data serialization format offering interchangeable binary and text representations. The text format (a superset of JSON) is easy to read and author, supporting rapid prototyping. The binary representation is efficient to store, transmit, and skip-scan parse. The rich type system provides unambiguous semantics for long-term preservation of data which can survive multiple generations of software evolution.
Ion was built to address rapid development, decoupling, and efficiency challenges faced every day while engineering large-scale, service-oriented architectures. It has been addressing these challenges within Amazon for nearly a decade, and we believe others will benefit as well.
The Ion text format is a superset of JSON; thus, any valid JSON document is also a valid Ion document.
http://amzn.github.io/ion-docs
http://amzn.github.io/ion-docs/docs/spec.html
File Format Benchmarks - Avro, JSON, ORC, Parquet
Avro
- Cross-language file format for Hadoop
- Schema evolution was primary goal
- Schema segregated from data (unlike Protobuf and Thrift)
- Row major format
JSON
- Serialization format for HTTP & Javascript
- Text-format with many parsers
- Schema completely integrated with data
- Row major format
- Compression applied on top
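To make "schema completely integrated with data" concrete, the sketch below compares JSON lines, where every row repeats its field names, with a layout that stores the field names once and keeps only values per row. The column names are hypothetical, and the second layout is only an illustration of separating schema from data, not the Avro encoding.

```python
import json

# Hypothetical rows; the field names are assumptions for illustration.
rows = [
    {"vendor_id": 1, "passenger_count": 2, "fare_amount": 12.5},
    {"vendor_id": 2, "passenger_count": 1, "fare_amount": 7.0},
    {"vendor_id": 1, "passenger_count": 4, "fare_amount": 31.25},
]

# Schema integrated with the data: every row repeats the field names.
integrated = "\n".join(json.dumps(row) for row in rows)

# Schema segregated from the data: names stored once, rows hold values only.
schema = list(rows[0].keys())
segregated = json.dumps(
    {"schema": schema, "rows": [[row[field] for field in schema] for row in rows]}
)

def restore(payload):
    """Rebuild the original row dictionaries from the segregated layout."""
    doc = json.loads(payload)
    return [dict(zip(doc["schema"], values)) for values in doc["rows"]]

print(len(integrated), len(segregated))
```

The repeated field names are a large part of why uncompressed JSON is bulky, and why general-purpose compression applied on top helps it so much.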
ORC
- Originally part of Hive to replace RCFile
- Now top-level project
- Schema segregated into footer
- Column major format with stripes
- Rich type model, stored top-down
- Integrated compression, indexes, & stats
Parquet
- Design based on Google's Dremel paper
- Schema segregated into footer
- Column major format with stripes (called row groups in Parquet)
- Simpler type-model with logical types
- All data pushed to leaves of the tree
- Integrated compression and indexes

Schema Evolution
8 Compatibility Types
- Backward: New versions work with old data.
- Forward: Old versions work with new data.
- Backward Transitive: Consumers using any version can process data from any previous version.
- Forward Transitive: Consumers using any version can process data from any newer version.
- Full (or Full Transitive): Achieves both backward and forward compatibility (often implies transitive).
- None: No compatibility is guaranteed.
- Backward Compatible, but not Full: You can read older data, but not all changes from the new version are guaranteed to be backward compatible.
- Forward Compatible, but not Full: You can process newer data, but not all changes from the new version are guaranteed to be forward compatible.
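Backward and forward compatibility can be illustrated with a minimal sketch, assuming a toy schema model where each field maps to an optional default value. The schemas, field names and the `read_with_schema` helper are hypothetical and do not correspond to any particular schema registry or serialization library.

```python
# Hypothetical schemas: field name -> default value (None means "no default").
SCHEMA_V1 = {"id": None, "name": None}
SCHEMA_V2 = {"id": None, "name": None, "email": "unknown"}  # adds a field with a default
SCHEMA_V3 = {"id": None, "name": None, "phone": None}       # adds a field without a default

def read_with_schema(record, reader_schema):
    """Project a record onto the reader's schema.

    Fields unknown to the reader are dropped. Fields missing from the record
    are filled from the reader's default, or raise if there is no default.
    """
    result = {}
    for field, default in reader_schema.items():
        if field in record:
            result[field] = record[field]
        elif default is not None:
            result[field] = default
        else:
            raise ValueError(f"missing required field: {field}")
    return result

old_record = {"id": 1, "name": "alice"}                             # written with V1
new_record = {"id": 2, "name": "bob", "email": "bob@example.com"}   # written with V2

# Backward: the new version (V2) reads old data.
backward = read_with_schema(old_record, SCHEMA_V2)
# Forward: the old version (V1) reads new data.
forward = read_with_schema(new_record, SCHEMA_V1)
print(backward, forward)
```

In this model, adding a field with a default keeps both directions working, while a new field without a default (as in `SCHEMA_V3`) breaks reading old data.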
DataSets
- NYC Taxi Data
  - 18 columns with no null values
  - Doubles, integers, decimal & strings
  - 2 months of data - 22.7 million rows
- Github Logs
  - 704 columns with a lot of structure & nulls
  - 1/2 month of data - 10.5 million rows
  - Schema is huge (12k)
- Sales
  - 55 columns with lots of nulls
  - A little structure
  - Timestamps, strings, longs, booleans, list & struct
  - 23 million rows
Compression
- ORC and Parquet use RLE & Dictionaries
- All the formats have general compression
  - ZLIB (GZip) - tight compression, slower
  - Snappy - Some compression, faster
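Run-length encoding (RLE) and dictionary encoding are simple to sketch. The version below is a simplified illustration on a hypothetical low-cardinality string column; the encodings actually used inside ORC and Parquet are more elaborate (bit packing, several RLE variants) and are not reproduced here.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encoding: collapse runs of equal values into (value, count)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]

def dictionary_encode(values):
    """Dictionary encoding: store each distinct value once, plus integer codes."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes

def dictionary_decode(dictionary, codes):
    """Look each code up in the dictionary to recover the original values."""
    return [dictionary[code] for code in codes]

# Hypothetical column with few distinct values.
payment_types = ["cash", "cash", "cash", "card", "card", "cash"]

runs = rle_encode(payment_types)
dictionary, codes = dictionary_encode(payment_types)
print(runs)               # [('cash', 3), ('card', 2), ('cash', 1)]
print(dictionary, codes)  # ['cash', 'card'] [0, 0, 0, 1, 1, 0]
```

The two techniques combine well: dictionary codes for a low-cardinality column are small integers, and runs of the same code can then be run-length encoded.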

Taxi size analysis
- Don't use JSON
- Use either Snappy or Zlib compression
- Avro's small compression window hurts
- Parquet Zlib is smaller than ORC
- Group the column sizes by type


Sales size analysis
- ORC did better than expected
- String columns have small cardinality
- Lots of timestamp columns
- No doubles

Github Size Analysis
- Surprising win for JSON and Avro
- Worst when uncompressed
- Best with zlib
- Many partially shared strings
- ORC and Parquet don't compress across columns