BUG: ImageExtraction not extracting all the images in pdf #162
Comments
Please attach the input PDF.
@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still not correct. The complete test code is below:

def test_pdf_with_borb(self):
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()
    file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])
    # check whether we have read a Document
    assert doc is not None
    images = []
    page_num = int(doc.get_document_info().get_number_of_pages())
    print(f"page num: {page_num}")
    # list the XObject names declared in each page's resources
    for page in range(0, page_num):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))
    # collect the images that ImageExtraction actually returned
    for page, content in image_l.get_images().items():
        images += content
        print(f"image page: {page}")
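The test above prints two things: the XObject names declared per page, and the pages for which images were returned. A minimal sketch of how the two could be compared is below. It is plain Python and does not call borb; `find_pages_with_missing_images` and the sample data are hypothetical names made up for illustration. Note that not every XObject is an image (form XObjects also live in that dictionary), so the XObject count is only an upper bound on the number of images to expect.

```python
from typing import Dict, List, Sequence


def find_pages_with_missing_images(
    xobject_names: Dict[int, Sequence[str]],
    extracted_counts: Dict[int, int],
) -> List[int]:
    """Return pages where fewer images were extracted than XObjects are declared."""
    missing = []
    for page, names in sorted(xobject_names.items()):
        # A page absent from extracted_counts is treated as "0 images extracted".
        if extracted_counts.get(page, 0) < len(names):
            missing.append(page)
    return missing


# Hypothetical data: XObject names per page, and number of images extracted per page.
xobject_names = {0: ["Im1", "Im2"], 1: ["Im3"], 2: []}
extracted_counts = {0: 1, 1: 1}

print(find_pages_with_missing_images(xobject_names, extracted_counts))
```

In the test above, the first dictionary would be filled from the `["Resources"]["XObject"]` loop and the second from the lengths of the lists returned by `image_l.get_images()`. Pages that show up in the result are the ones worth inspecting by hand.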
I checked the images in your PDF.
What can I do to extract these images correctly? Could you give me any advice? Thanks a lot.
You would have to implement your own version of an Essentially you need to:
I also encountered this problem. My PDF contains some pictures in PNG format, and I found that they cannot be extracted. The steps are as follows:
I should say that I am still learning the code, so this may not be the best solution.
Describe the bug
ImageExtraction does not extract all the images in the PDF.
To Reproduce
Expected behaviour
The ImageExtraction listener should return all the images.
Screenshots
Desktop (please complete the following information):
Additional context
Add any other context about the problem here.