Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When extracting hive data to mysql, the parallelism parameter setting is invalid. #6830

Open
2 of 3 tasks
luckyliush opened this issue May 10, 2024 · 5 comments
Open
2 of 3 tasks
Labels

Comments

@luckyliush
Copy link

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

The amount of data is 15 million rows. The speed of extracting hive data to mysql can only reach 13,000 records per second, and it takes nearly 30 minutes to complete the extraction.
But I set the seatunnel parallelism parameter to 10. After the data extraction task started, I used the show processlist command on the MySQL client and found that there was only one insert thread.
Then I tested extracting hive data to the console. The same amount of data took less than 1 minute.

SeaTunnel Version

2.3.1 and 2.3.5

SeaTunnel Config

env {
  execution.parallelism = 10
  job.mode = "BATCH"
}

source {
 Hive {
table_name = ""
metastore_uri = "${metastore_uri}"
 result_table_name = "Table_test"
hdfs_site_path = ""
hive_site_path = ""
    }
}



transform {
  sql {
    source_table_name="Table_test"
query = "select xxx from Table_test"
result_table_name = "Table_test2"
}
}


sink {
    Jdbc {

url = "${url}"
driver = "${driver}"
user = "${user}"
password = "${password}"
database = "${mysql_db}"
batch_size = 20000
table = ""
query = "insert into table_name (xxx) values(?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)"
    }
}

Running Command

I use the seatunnel module on dolphinscheduler to run in cluster mode.
I also submitted from the command line using: ./bin/seatunnel.sh --config ./config/test.config

Error Exception

No exception occurred, just did not meet expectations

Zeta or Flink or Spark Version

Zeta

Java or Scala Version

1.8

Screenshots

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@luckyliush luckyliush added the bug label May 10, 2024
@liunaijie
Copy link
Contributor

try to add rewriteBatchedStatements=true parameter to your jdbc url

@luckyliush
Copy link
Author

luckyliush commented May 11, 2024

try to add rewriteBatchedStatements=true parameter to your jdbc url

Thank you, but this parameter has been added before,did not meet expectations

@liunaijie
Copy link
Contributor

try to add rewriteBatchedStatements=true parameter to your jdbc url

Thank you, but this parameter has been added before,did not meet expectations

plugin_name = jdbc
user = xxxxx
url = "jdbc:mysql://xxxxxx/xxxxxxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
enable_upsert = true
generate_sink_sql = true
database = db_name
table = table_name
primary_keys = [xxx,xxx]

try with this config, it will auto generate insert sql, i use this config, the write speed is good

@luckyliush
Copy link
Author

luckyliush commented May 11, 2024

try to add rewriteBatchedStatements=true parameter to your jdbc url

Thank you, but this parameter has been added before,did not meet expectations

plugin_name = jdbc
user = xxxxx
url = "jdbc:mysql://xxxxxx/xxxxxxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
enable_upsert = true
generate_sink_sql = true
database = db_name
table = table_name
primary_keys = [xxx,xxx]

try with this config, it will auto generate insert sql, i use this config, the write speed is good

`env {
execution.parallelism = 10
job.mode = "BATCH"
}

source {
Hive {
table_name = ""
metastore_uri = ""
result_table_name = "Table_test"
hdfs_site_path = "/home/hadoop/hadoop-3.2.2/etc/hadoop/hdfs-site.xml"
hive_site_path = "/home/hadoop/hive-2.3.9/conf/hive-site.xml"
}
}

transform {
sql {
source_table_name="Table_test"
query = "select xxx,xxx from Table_test"
result_table_name = "Table_test2"
}
}

sink {
Jdbc {

url = "jdbc:mysql://xxx:3306/xxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
driver = "com.mysql.cj.jdbc.Driver"
user = "root"
enable_upsert = true
generate_sink_sql = true
password = "xxx"
database = "xxx"
primary_keys = [xxx,xxx,xxx]
table = "xxx"
}
}`

The version I am using is 2.3.1, and the configuration is as shown above, but the extraction speed is the same as before and has not improved. Is there something wrong with my configuration?
Approximately 12,000 pieces of data can be extracted per second.
I passed the primary key parameters based on the granular fields of the hive table, but these fields are not set as primary keys in the mysql table. Does this have any impact?

@luckyliush
Copy link
Author

try to add rewriteBatchedStatements=true parameter to your jdbc url

Thank you, but this parameter has been added before,did not meet expectations

plugin_name = jdbc
user = xxxxx
url = "jdbc:mysql://xxxxxx/xxxxxxx?allowMultiQueries=true&useUnicode=true&characterEncoding=UTF-8&serverTimezone=Asia/Shanghai&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&rewriteBatchedStatements=true&useSSL=false"
enable_upsert = true
generate_sink_sql = true
database = db_name
table = table_name
primary_keys = [xxx,xxx]

try with this config, it will auto generate insert sql, i use this config, the write speed is good

Hello, now in the seatunnel-2.3.5 version, using the same configuration, the parallelism parameter will not take effect.
But after adding the parameter read_limit.rows_per_second=10000 to seatunnel-2.3.5, the parallelism parameter will take effect and the extraction speed will be significantly improved. Do you know the reason?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants