When extracting hive data to mysql, the parallelism parameter setting is invalid. #6830

luckyliush · 2024-05-10T09:36:28Z

Search before asking

I had searched in the issues and found no similar issues.

What happened

The amount of data is 15 million rows. The speed of extracting hive data to mysql can only reach 13,000 records per second, and it takes nearly 30 minutes to complete the extraction.
But I set the seatunnel parallelism parameter to 10. After the data extraction task started, I used the show processlist command on the MySQL client and found that there was only one insert thread.
Then I tested extracting hive data to the console. The same amount of data took less than 1 minute.

SeaTunnel Version

2.3.1 and 2.3.5

SeaTunnel Config

env {
  execution.parallelism = 10
  job.mode = "BATCH"
}

source {
 Hive {
table_name = ""
metastore_uri = "${metastore_uri}"
 result_table_name = "Table_test"
hdfs_site_path = ""
hive_site_path = ""
    }
}



transform {
  sql {
    source_table_name="Table_test"
query = "select xxx from Table_test"
result_table_name = "Table_test2"
}
}


sink {
    Jdbc {

url = "${url}"
driver = "${driver}"
user = "${user}"
password = "${password}"
database = "${mysql_db}"
batch_size = 20000
table = ""
query = "insert into table_name (xxx) values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
    }
}

Running Command

I use the seatunnel module on dolphinscheduler to run in cluster mode.
I also submitted from the command line using: ./bin/seatunnel.sh --config ./config/test.config

Error Exception

No exception occurred, just did not meet expectations

Zeta or Flink or Spark Version

Zeta

Java or Scala Version

1.8

Screenshots

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

liunaijie · 2024-05-11T03:50:39Z

try to add rewriteBatchedStatements=true parameter to your jdbc url

luckyliush · 2024-05-11T05:48:22Z

try to add rewriteBatchedStatements=true parameter to your jdbc url

Thank you, but this parameter has been added before，did not meet expectations

liunaijie · 2024-05-11T05:59:49Z

try to add rewriteBatchedStatements=true parameter to your jdbc url

Thank you, but this parameter has been added before，did not meet expectations

plugin_name = jdbc
user = xxxxx
url = "jdbc:mysql://xxxxxx/xxxxxxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
enable_upsert = true
generate_sink_sql = true
database = db_name
table = table_name
primary_keys = [xxx,xxx]

try with this config, it will auto generate insert sql, i use this config, the write speed is good

luckyliush · 2024-05-11T06:51:36Z

try to add rewriteBatchedStatements=true parameter to your jdbc url

Thank you, but this parameter has been added before，did not meet expectations
plugin_name = jdbc
user = xxxxx
url = "jdbc:mysql://xxxxxx/xxxxxxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
enable_upsert = true
generate_sink_sql = true
database = db_name
table = table_name
primary_keys = [xxx,xxx]
try with this config, it will auto generate insert sql, i use this config, the write speed is good

`env {
execution.parallelism = 10
job.mode = "BATCH"
}

source {
Hive {
table_name = ""
metastore_uri = ""
result_table_name = "Table_test"
hdfs_site_path = "/home/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml"
hive_site_path = "/home/hadoop/hive-2.3.9/conf/hive-site.xml"
}
}

transform {
sql {
source_table_name="Table_test"
query = "select xxx,xxx from Table_test"
result_table_name = "Table_test2"
}
}

sink {
Jdbc {

url = "jdbc:mysql://xxx:3306/xxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
driver = "com.mysql.cj.jdbc.Driver"
user = "root"
enable_upsert = true
generate_sink_sql = true
password = "xxx"
database = "xxx"
primary_keys = [xxx,xxx,xxx]
table = "xxx"
}
}`

The version I am using is 2.3.1, and the configuration is as shown above, but the extraction speed is the same as before and has not improved. Is there something wrong with my configuration?
Approximately 12,000 pieces of data can be extracted per second.
I passed the primary key parameters based on the granular fields of the hive table, but these fields are not set as primary keys in the mysql table. Does this have any impact?

luckyliush · 2024-05-13T05:47:49Z

try to add rewriteBatchedStatements=true parameter to your jdbc url

Thank you, but this parameter has been added before，did not meet expectations
plugin_name = jdbc
user = xxxxx
url = "jdbc:mysql://xxxxxx/xxxxxxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
enable_upsert = true
generate_sink_sql = true
database = db_name
table = table_name
primary_keys = [xxx,xxx]
try with this config, it will auto generate insert sql, i use this config, the write speed is good

Hello, now in the seatunnel-2.3.5 version, using the same configuration, the parallelism parameter will not take effect.
But after adding the parameter read_limit.rows_per_second=10000 to seatunnel-2.3.5, the parallelism parameter will take effect and the extraction speed will be significantly improved. Do you know the reason?

luckyliush added the bug label May 10, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When extracting hive data to mysql, the parallelism parameter setting is invalid. #6830

When extracting hive data to mysql, the parallelism parameter setting is invalid. #6830

luckyliush commented May 10, 2024

liunaijie commented May 11, 2024

luckyliush commented May 11, 2024 •

edited

liunaijie commented May 11, 2024

luckyliush commented May 11, 2024 •

edited

luckyliush commented May 13, 2024

When extracting hive data to mysql, the parallelism parameter setting is invalid. #6830

When extracting hive data to mysql, the parallelism parameter setting is invalid. #6830

Comments

luckyliush commented May 10, 2024

Search before asking

What happened

SeaTunnel Version

SeaTunnel Config

Running Command

Error Exception

Zeta or Flink or Spark Version

Java or Scala Version

Screenshots

Are you willing to submit PR?

Code of Conduct

liunaijie commented May 11, 2024

luckyliush commented May 11, 2024 • edited

liunaijie commented May 11, 2024

luckyliush commented May 11, 2024 • edited

luckyliush commented May 13, 2024

luckyliush commented May 11, 2024 •

edited

luckyliush commented May 11, 2024 •

edited