extract_text jumbles text order #132

Open
1 of 3 tasks
gpilgrim2670 opened this issue Feb 28, 2021 · 0 comments


Please specify whether your issue is about:

  • a possible bug
  • a question about package functionality
  • a suggested code or documentation change, improvement to the code, or feature request

Put your code here:

## rJava loads successfully
# install.packages("rJava")
library("rJava")

## load package
library("tabulizer")

## code goes here
file <- "https://cdn.swimswam.com/wp-content/uploads/2019/03/D3.NCAA-2013.pdf" # source pdf
raw <- extract_text(file) # text from file is read in, but order is jumbled

# to make this clearer, split `raw` into lines on the newline character
raw_list <- as.list(unlist(strsplit(raw, "\n")))
raw_results <- sapply(raw_list, toString)
raw_results[10]
# [1] "Williams SR1 4:47.16Wilson, Caroline"
# same order of text as in raw, just a smaller piece to make viewing easier

# expected output (whitespace may differ)
# can be checked against the first column of the file at the link above
# [1] "1 Wilson, Caroline SR Williams 4:47.16"
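As a quick sanity check on the example above, the following sketch confirms that the reported line and the expected line contain the same characters and differ only in order, which is consistent with the text being read in completely but jumbled. `same_chars` is a hypothetical helper written for this illustration; it is not part of tabulizer, and the two strings are copied from the output shown above.

```r
# Hypothetical helper (not part of tabulizer): returns TRUE when two strings
# contain exactly the same non-whitespace characters, regardless of order.
same_chars <- function(a, b) {
  # drop all whitespace, split into single characters, and sort them
  chars <- function(x) sort(strsplit(gsub("\\s+", "", x), "")[[1]])
  identical(chars(a), chars(b))
}

observed <- "Williams SR1 4:47.16Wilson, Caroline"   # raw_results[10] as reported
expected <- "1 Wilson, Caroline SR Williams 4:47.16" # order shown in the PDF

same_chars(observed, expected)
# [1] TRUE
identical(observed, expected)
# [1] FALSE
```

This only checks one line. It does not say anything about why the order differs.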

session info for your system

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] SwimmeR_0.7.2 tabulizer_0.2.2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.2 pdftools_2.3.1 remotes_2.2.0 prettyunits_1.1.1 tools_4.0.2
[8] testthat_2.3.2 digest_0.6.27 packrat_0.5.0 pkgbuild_1.1.0 pkgload_1.1.0 memoise_1.1.0 lifecycle_0.2.0
[15] tibble_3.0.4 pkgconfig_2.0.3 png_0.1-7 rlang_0.4.8 cli_2.1.0 rstudioapi_0.11 xfun_0.18
[22] rJava_0.9-13 httr_1.4.2 xml2_1.3.2 knitr_1.30 stringr_1.4.0 roxygen2_7.1.1 withr_2.3.0
[29] dplyr_1.0.2 hms_0.5.3 askpass_1.1 fs_1.5.0 generics_0.0.2 desc_1.2.0 vctrs_0.3.4
[36] devtools_2.3.2 rprojroot_1.3-2 tidyselect_1.1.0 glue_1.4.2 qpdf_1.1 R6_2.5.0 processx_3.4.4
[43] fansi_0.4.1 sessioninfo_1.1.1 readr_1.4.0 purrr_0.3.4 callr_3.5.1 magrittr_1.5 usethis_1.6.3
[50] tabulizerjars_1.0.1 backports_1.1.10 ps_1.4.0 ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 stringi_1.5.3
[57] crayon_1.3.4

