Working with Parquet files in TiDB Lightning
Introduction
The Parquet file format is a columnar data storage format maintained by the Apache Software Foundation. Parquet is widely used in the Hadoop ecosystem and supports more efficient storage, compression, and lookup than CSV and other similar flat-file formats. Many tools can export data in Parquet format; for example, Amazon Aurora snapshots are exported to S3 as Parquet files. TiDB Lightning natively supports importing data from Parquet files. This document describes how to work with Parquet files and how to import them into a TiDB cluster using TiDB Lightning.
parquet-cli
The Parquet GitHub repository includes a useful tool called parquet-cli that can be used to inspect and create Parquet files: https://github.com/apache/parquet-mr/tree/master/parquet-cli.
The parquet-cli tool is written in Java, and it can be a little difficult to build and run if you are not comfortable using Maven to build Java projects. To get parquet-cli, you can either install Maven using your OS package manager and follow the steps in the GitHub repository to build parquet-cli yourself (build from a tag rather than from master), or set up Homebrew and run brew install parquet-cli.
These steps have been tested on Ubuntu 20.04.3 LTS (Focal Fossa):
sudo apt install openjdk-8-jdk-headless maven thrift-compiler
git clone https://github.com/apache/parquet-mr.git
cd parquet-mr/
git checkout apache-parquet-1.12.2
mvn clean install -DskipTests
cd parquet-cli/
mvn dependency:copy-dependencies
This tutorial will assume that you can run parquet-cli by executing "parquet". If you built parquet-cli from GitHub, you should create a shell function to simplify execution of the tool:
parquet() {
    dir=/home/ubuntu
    java -cp "$dir/parquet-mr/parquet-cli/target/*:$dir/parquet-mr/parquet-cli/target/dependency/*" \
        org.apache.parquet.cli.Main "$@"
}
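If you want the parquet function to be available in new shells, you can append its definition to your shell profile (assuming you use bash):
$ declare -f parquet >> ~/.bashrc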
Make sure you can execute parquet-cli by executing parquet help.
Inspecting a Parquet file
If you already have a Parquet file, you can use parquet schema to view information about the columns in your file.
$ parquet schema $HOME/data/test.t1.parquet
{
  "type" : "record",
  "name" : "data",
  "fields" : [ {
    "name" : "id",
    "type" : [ "null", "long" ],
    "doc" : "Type inferred from '1'",
    "default" : null
  }, {
    "name" : "v",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from '4dff4ea340f0a823f15d3f4f01ab62eae0e5da579ccb851f8d'",
    "default" : null
  } ]
}
The Parquet file I used for my testing is one that I created from a CSV dump of an existing Parquet file exported to S3 by Amazon Aurora. Why? I wanted to make sure that TiDB Lightning could handle both the Parquet file exported by Amazon Aurora and one that I created myself.
Amazon Aurora snapshot export
When you create a snapshot of an Amazon Aurora instance, the snapshot is stored in an internal format that is not directly accessible or readable by an end user. If you want to reuse the snapshot outside of Aurora, you must first export the snapshot to S3. This operation can take a considerable amount of time. Once the export is finished, you can view the files in your S3 bucket.
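As a rough sketch of the export step, you can start an export task with the AWS CLI. The account ID, ARNs, IAM role, and KMS key below are placeholders that you would replace with your own values:
$ aws rds start-export-task \
    --export-task-identifier snap-1 \
    --source-arn arn:aws:rds:us-west-2:123456789012:cluster-snapshot:snap-1 \
    --s3-bucket-name kolbe-test-1627426353 \
    --iam-role-arn arn:aws:iam::123456789012:role/rds-s3-export-role \
    --kms-key-id arn:aws:kms:us-west-2:123456789012:key/example-key-id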
$ aws s3 ls --recursive s3://kolbe-test-1627426353/snap-1/
2022-01-10 14:48:03 565 snap-1/export_info_snap-1.json
2022-01-10 14:48:03 681 snap-1/export_tables_info_snap-1_from_1_to_1.json
2022-01-10 14:46:23 0 snap-1/test/test.t1/_SUCCESS
2022-01-10 14:46:21 80278529 snap-1/test/test.t1/part-00000-71a9fc31-2794-4b59-811a-86b580d1fb50-c000.gz.parquet
Copy all the files in that subdirectory (also known as a "prefix" in S3) to a local directory:
$ aws s3 cp --recursive s3://kolbe-test-1627426353/snap-1 aurora-snapshot/
You'll see one or more files with the suffix .gz.parquet in test/test.t1/:
$ find aurora-snapshot/ -type f
aurora-snapshot/export_info_snap-1.json
aurora-snapshot/export_tables_info_snap-1_from_1_to_1.json
aurora-snapshot/test/test.t1/_SUCCESS
aurora-snapshot/test/test.t1/part-00000-71a9fc31-2794-4b59-811a-86b580d1fb50-c000.gz.parquet
$ parquet schema aurora-snapshot/test/test.t1/part-00000-71a9fc31-2794-4b59-811a-86b580d1fb50-c000.gz.parquet
{
  "type" : "record",
  "name" : "spark_schema",
  "fields" : [ {
    "name" : "id",
    "type" : [ "null", "long" ],
    "default" : null
  }, {
    "name" : "v",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
Table schema
Note that the schema information in the Parquet file doesn't include enough detail to actually reconstruct the MySQL/TiDB table schema. The JSON files included in the exported snapshot contain enough information to reconstruct the data types of the columns in the table, but not the indexes.
If you're working with an export of an existing MySQL/Aurora database, you should try to retrieve the schema from the source instance and use that to create the tables in the downstream/target TiDB cluster.
The easiest way to get the schema of the upstream cluster is to use Dumpling's --no-data option to dump only the schema. Because Aurora doesn't support FLUSH TABLES WITH READ LOCK, we also pass --consistency none.
$ tiup dumpling --host database-1-instance-1.clv3ykk55yww.us-west-2.rds.amazonaws.com -P 3306 -u admin -p mypassword --consistency none --no-data --output aurora-schema
$ find aurora-schema
aurora-schema
aurora-schema/metadata
aurora-schema/test-schema-create.sql
aurora-schema/test.t1-schema.sql
$ cat aurora-schema/test.t1-schema.sql
/*!40101 SET NAMES binary*/;
CREATE TABLE `t1` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`v` varchar(512) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1310694 DEFAULT CHARSET=latin1;
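As an aside, if you prefer not to use TiDB Lightning to create the schema (the approach taken later in this document), you can also apply the dumped schema files directly with the MySQL client. A minimal sketch, assuming TiDB is listening on 127.0.0.1:4000 with user root and an empty password (as in the Lightning configuration shown later):
$ mysql -h 127.0.0.1 -P 4000 -u root < aurora-schema/test-schema-create.sql
$ mysql -h 127.0.0.1 -P 4000 -u root test < aurora-schema/test.t1-schema.sql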
Creating your own Parquet file
The parquet-cli tool can be used to create a Parquet file from a CSV file. To demonstrate, I will convert the Parquet file from the Aurora snapshot export into a CSV file and then convert that CSV file back to Parquet. If you have some other CSV data, you can create a Parquet file from that instead.
Convert Parquet to CSV
The parquet-cli cat subcommand outputs the contents of the file as JSON.
$ parquet cat aurora-snapshot/test/test.t1/part-00000-71a9fc31-2794-4b59-811a-86b580d1fb50-c000.gz.parquet > data.json
$ head data.json
{"id": 1, "v": "4dff4ea340f0a823f15d3f4f01ab62eae0e5da579ccb851f8db9dfe84c58b2b37b89903a740e1ee172da793a6e79d560e5f7f9bd058a12a280433ed6fa46510a"}
{"id": 2, "v": "40b244112641dd78dd4f93b6c9190dd46e0099194d5a44257b7efad6ef9ff4683da1eda0244448cb343aa688f5d3efd7314dafe580ac0bcbf115aeca9e8dc114"}
{"id": 3, "v": "3bafbf08882a2d10133093a1b8433f50563b93c14acd05b79028eb1d12799027241450980651994501423a66c276ae26c43b739bc65c4e16b10c3af6c202aebb"}
{"id": 4, "v": "a321d8b405e3ef2604959847b36d171eebebc4a8941dc70a4784935a4fca5d5813de84dfa049f06549aa61b20848c1633ce81b675286ea8fb53db240d831c568"}
{"id": 6, "v": "3c9ad55147a7144f6067327c3b82ea70e7c5426add9ceea4d07dc2902239bf9e049b88625eb65d014a7718f79354608cab0921782c643f0208983fffa3582e40"}
{"id": 7, "v": "f05210c5b4263f0ec4c3995bdab458d81d3953f354a9109520f159db1e8800bcd45b97c56dce90a1fc27ab03e0b8a9af8673747023c406299374116d6f966981"}
{"id": 8, "v": "bc23b8b01772d2dd67efb8fe1a5e6bd0f44b97c36101be6cc09f253b53e68d67a22e4643068dfd1341980134ea57570acf65e306e4d96cef4d560384894c88a4"}
{"id": 9, "v": "0dc526d8c4fa04084f4b2a6433f4cd14664b93df9fb8a9e00b77ba890b83704d24944c93caa692b51085bb476f81852c27e793600f137ae3929018cd4c8f1a45"}
{"id": 13, "v": "413f2ba78c7ed4ccefbe0cc4f51d3eb5cb15f13fec999de4884be925076746663aa5d34476a3df4a8729fd8eea01defa4f3f66e99bf943f4d84382d64bbbfa9e"}
{"id": 14, "v": "5f3a799ba20c20a225f75d4fe2acab79912dfcd2f2b333bf062b37acbb6463388c344430d5ba1e9fd318d3ed8263074e999e2b2e811bc51c5e2dfea4e2f32e58"}
We can use the jq tool to turn this JSON file into a CSV file.
$ jq -r 'map(.) | @csv ' < data.json > data.csv
$ head data.csv
1,"4dff4ea340f0a823f15d3f4f01ab62eae0e5da579ccb851f8db9dfe84c58b2b37b89903a740e1ee172da793a6e79d560e5f7f9bd058a12a280433ed6fa46510a"
2,"40b244112641dd78dd4f93b6c9190dd46e0099194d5a44257b7efad6ef9ff4683da1eda0244448cb343aa688f5d3efd7314dafe580ac0bcbf115aeca9e8dc114"
3,"3bafbf08882a2d10133093a1b8433f50563b93c14acd05b79028eb1d12799027241450980651994501423a66c276ae26c43b739bc65c4e16b10c3af6c202aebb"
4,"a321d8b405e3ef2604959847b36d171eebebc4a8941dc70a4784935a4fca5d5813de84dfa049f06549aa61b20848c1633ce81b675286ea8fb53db240d831c568"
6,"3c9ad55147a7144f6067327c3b82ea70e7c5426add9ceea4d07dc2902239bf9e049b88625eb65d014a7718f79354608cab0921782c643f0208983fffa3582e40"
7,"f05210c5b4263f0ec4c3995bdab458d81d3953f354a9109520f159db1e8800bcd45b97c56dce90a1fc27ab03e0b8a9af8673747023c406299374116d6f966981"
8,"bc23b8b01772d2dd67efb8fe1a5e6bd0f44b97c36101be6cc09f253b53e68d67a22e4643068dfd1341980134ea57570acf65e306e4d96cef4d560384894c88a4"
9,"0dc526d8c4fa04084f4b2a6433f4cd14664b93df9fb8a9e00b77ba890b83704d24944c93caa692b51085bb476f81852c27e793600f137ae3929018cd4c8f1a45"
13,"413f2ba78c7ed4ccefbe0cc4f51d3eb5cb15f13fec999de4884be925076746663aa5d34476a3df4a8729fd8eea01defa4f3f66e99bf943f4d84382d64bbbfa9e"
14,"5f3a799ba20c20a225f75d4fe2acab79912dfcd2f2b333bf062b37acbb6463388c344430d5ba1e9fd318d3ed8263074e999e2b2e811bc51c5e2dfea4e2f32e58"
Convert CSV to Parquet
We can create a Parquet file from a CSV file using the parquet-cli convert-csv subcommand. The CSV file we created in the previous section does not include a header, so we must give the --no-header option to parquet-cli.
$ parquet convert-csv data.csv -o data.parquet --no-header
We can confirm the schema of the data:
$ parquet schema data.parquet
{
  "type" : "record",
  "name" : "data",
  "fields" : [ {
    "name" : "field_0",
    "type" : [ "null", "long" ],
    "doc" : "Type inferred from '1'",
    "default" : null
  }, {
    "name" : "field_1",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from '4dff4ea340f0a823f15d3f4f01ab62eae0e5da579ccb851f8d'",
    "default" : null
  } ]
}
Note that because we did not include a header, the field names in the Parquet file are "field_0" and "field_1" rather than the column names of the table. This will cause problems for TiDB Lightning later (see https://github.com/pingcap/tidb/issues/31630). You should add a header to the file and regenerate the Parquet data.
$ rm data.parquet data.csv
$ echo '"id","v"' > data.csv
$ jq -r 'map(.) | @csv ' < data.json >> data.csv
$ head data.csv
"id","v"
1,"4dff4ea340f0a823f15d3f4f01ab62eae0e5da579ccb851f8db9dfe84c58b2b37b89903a740e1ee172da793a6e79d560e5f7f9bd058a12a280433ed6fa46510a"
2,"40b244112641dd78dd4f93b6c9190dd46e0099194d5a44257b7efad6ef9ff4683da1eda0244448cb343aa688f5d3efd7314dafe580ac0bcbf115aeca9e8dc114"
3,"3bafbf08882a2d10133093a1b8433f50563b93c14acd05b79028eb1d12799027241450980651994501423a66c276ae26c43b739bc65c4e16b10c3af6c202aebb"
4,"a321d8b405e3ef2604959847b36d171eebebc4a8941dc70a4784935a4fca5d5813de84dfa049f06549aa61b20848c1633ce81b675286ea8fb53db240d831c568"
6,"3c9ad55147a7144f6067327c3b82ea70e7c5426add9ceea4d07dc2902239bf9e049b88625eb65d014a7718f79354608cab0921782c643f0208983fffa3582e40"
7,"f05210c5b4263f0ec4c3995bdab458d81d3953f354a9109520f159db1e8800bcd45b97c56dce90a1fc27ab03e0b8a9af8673747023c406299374116d6f966981"
8,"bc23b8b01772d2dd67efb8fe1a5e6bd0f44b97c36101be6cc09f253b53e68d67a22e4643068dfd1341980134ea57570acf65e306e4d96cef4d560384894c88a4"
9,"0dc526d8c4fa04084f4b2a6433f4cd14664b93df9fb8a9e00b77ba890b83704d24944c93caa692b51085bb476f81852c27e793600f137ae3929018cd4c8f1a45"
13,"413f2ba78c7ed4ccefbe0cc4f51d3eb5cb15f13fec999de4884be925076746663aa5d34476a3df4a8729fd8eea01defa4f3f66e99bf943f4d84382d64bbbfa9e"
$ parquet convert-csv data.csv -o data.parquet
$ parquet schema data.parquet
{
  "type" : "record",
  "name" : "data",
  "fields" : [ {
    "name" : "id",
    "type" : [ "null", "long" ],
    "doc" : "Type inferred from '1'",
    "default" : null
  }, {
    "name" : "v",
    "type" : [ "null", "string" ],
    "doc" : "Type inferred from '4dff4ea340f0a823f15d3f4f01ab62eae0e5da579ccb851f8d'",
    "default" : null
  } ]
}
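You can also spot-check the contents of the regenerated file with the cat subcommand used earlier, piping through head to limit the output:
$ parquet cat data.parquet | head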
TiDB Lightning
TiDB Lightning expects to find the schema and data together under the same directory and with the same filename structure. Because our schema comes from Dumpling and our data comes from an Aurora snapshot export, the directory structures and filename conventions are very different. To handle that, we will complete the load in two steps: first import the schema, then import the data.
Here's the configuration file to use for loading the schema:
$ cat lightning-schema.toml
[mydumper]
data-source-dir = "/home/ubuntu/aurora-schema/"
[tidb]
host = '127.0.0.1'
port = 4000
user = 'root'
password = ''
pd-addr = '127.0.0.1:2379'
[tikv-importer]
backend = "tidb"
And here's the configuration file for loading the data:
$ cat lightning-data.toml
[mydumper]
data-source-dir = "/home/ubuntu/aurora-snapshot/"
no-schema = true
[tidb]
host = '127.0.0.1'
port = 4000
user = 'root'
password = ''
pd-addr = '127.0.0.1:2379'
[tikv-importer]
backend = "local"
sorted-kv-dir = "/home/ubuntu/sort"
[[mydumper.files]]
# Uses single quoted strings to avoid escaping.
pattern = '(?i)^(?:[^\/]*\/)*([a-z0-9_]+)\.([a-z0-9_]+)\/(?:[^\/]*\/)*(?:[a-z0-9\-_.]+\.(parquet))$'
schema = '$1'
table = '$2'
type = '$3'
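For example, a data file such as test/test.t1/part-00000-71a9fc31-2794-4b59-811a-86b580d1fb50-c000.gz.parquet matches this pattern with $1 = test (the schema), $2 = t1 (the table), and $3 = parquet (the file type), so TiDB Lightning knows to load it into the test.t1 table created in the first step.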
Now we can simply run tidb-lightning twice, once to load the schema and once to load the data:
$ tiup tidb-lightning -config lightning-schema.toml
...
$ tiup tidb-lightning -config lightning-data.toml
...
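Once the data load finishes, you can verify the import from a MySQL client connected to TiDB, for example:
$ mysql -h 127.0.0.1 -P 4000 -u root -e 'SELECT COUNT(*) FROM test.t1'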