Cannot restore table border lines from a PDF file #279

Open
ericosmic opened this issue Apr 9, 2024 · 1 comment


ericosmic commented Apr 9, 2024

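While converting a PDF I ran into two problems:
1. The partially visible border segments of a hidden (borderless) table in the PDF are not restored in the docx file. For example, the red line in the sample is one border line of a table.
2. Text paragraphs do not get a first-line indent.

The sample is shown below: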
image
zf1.pdf

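ericosmic changed the title from "Donpage line" to "Don't recovery line in page" Apr 9, 2024
ericosmic changed the title from "Don't recovery line in page" to "Cannot restore straight lines from a PDF file" (无法复原pdf文件中的直线) Apr 9, 2024
ericosmic changed the title from "Cannot restore straight lines from a PDF file" to "Cannot restore table border lines from a PDF file" (无法复原pdf文件中表格的框线) Apr 9, 2024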
@zhangdanfenggg

image

Same problem here: horizontal lines are not converted when converting to docx, and the following errors are reported as well:

[INFO] Start to convert D:/Download/aab.pdf
[INFO] [1/4] Opening document...
[INFO] [2/4] Analyzing document...
[WARNING] Ignore Line "𝑘𝐿\udc40" due to overlap
[WARNING] Ignore Line "𝑘" due to overlap
[INFO] [3/4] Parsing pages...
[INFO] (1/18) Page 1
[INFO] (2/18) Page 2
[INFO] (3/18) Page 3
[INFO] (4/18) Page 4
[INFO] (5/18) Page 5
[INFO] (6/18) Page 6
[INFO] (7/18) Page 7
[INFO] (8/18) Page 8
[INFO] (9/18) Page 9
[INFO] (10/18) Page 10
[INFO] (11/18) Page 11
[INFO] (12/18) Page 12
[INFO] (13/18) Page 13
[INFO] (14/18) Page 14
[ERROR] Ignore page 14 due to parsing page error: 'utf-8' codec can't encode character '\udc54' in position 0: surrogates not allowed
[INFO] (15/18) Page 15
[ERROR] Ignore page 15 due to parsing page error: 'utf-8' codec can't encode character '\udc59' in position 0: surrogates not allowed
[INFO] (16/18) Page 16
[INFO] (17/18) Page 17
[INFO] (18/18) Page 18
[INFO] [4/4] Creating pages...
[INFO] (1/16) Page 1
[INFO] (2/16) Page 2
[INFO] (3/16) Page 3
[INFO] (4/16) Page 4
[INFO] (5/16) Page 5
[INFO] (6/16) Page 6
[ERROR] Ignore page 6 due to making page error: 'utf-8' codec can't encode character '\udc40' in position 2: surrogates not allowed
[INFO] (7/16) Page 7
[INFO] (8/16) Page 8
[INFO] (9/16) Page 9
[INFO] (10/16) Page 10
[INFO] (11/16) Page 11
[INFO] (12/16) Page 12
[INFO] (13/16) Page 13
[INFO] (14/16) Page 16
[INFO] (15/16) Page 17
[INFO] (16/16) Page 18
[INFO] Terminated in 1.70s.
File Converted Successfully

[aab.pdf](https://github.com/ArtifexSoftware/pdf2docx/files/15048562/aab.pdf)
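
For context on the "surrogates not allowed" errors in the log: Python raises this whenever a string containing a code point in the surrogate range (U+D800 to U+DFFF), such as `'\udc54'`, is encoded as UTF-8. The sketch below only reproduces that behaviour in plain Python and shows one possible way to sanitise such text. The helper names (`has_lone_surrogate`, `strip_lone_surrogates`, `utf8_error_reason`) are hypothetical and not part of pdf2docx, and this is not a confirmed fix for the issue. It assumes that replacing the offending characters with U+FFFD is acceptable.

```python
import re

# Any code point in the UTF-16 surrogate range (U+D800..U+DFFF).
# Such code points cannot be encoded as UTF-8 by Python's strict codec.
_SURROGATES = re.compile("[\ud800-\udfff]")


def has_lone_surrogate(text: str) -> bool:
    """Return True if text contains a surrogate code point."""
    return _SURROGATES.search(text) is not None


def strip_lone_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Replace every surrogate code point (hypothetical helper, not pdf2docx API)."""
    return _SURROGATES.sub(replacement, text)


def utf8_error_reason(text: str):
    """Return the UnicodeEncodeError reason, or None if text encodes cleanly."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError as exc:
        return exc.reason
    return None


# The text from the first warning in the log: two valid math letters
# followed by a stray low surrogate.
sample = "\U0001d458\U0001d43f\udc40"
print(utf8_error_reason(sample))                          # reason reported by the codec
print(utf8_error_reason(strip_lone_surrogates(sample)))   # None once sanitised
```

Whether the stray surrogates come from the PDF's font encoding or from text extraction is not established in this thread, so the sketch only illustrates the Python-level failure, not its origin.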
