Extracting text from a PDF

2014-04-25 Linux

The pdftotext command extracts text from a PDF.

$ pdftotext hoge.pdf

Run as is, this writes the text to hoge.txt. If you give another file name after the PDF file name, the text is written to that file instead.
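As a minimal sketch of the two forms of the command, the function below wraps both calls. The name extract_text is hypothetical and not part of pdftotext. The `-` form for standard output is taken from the pdftotext man page, not from this post.

```shell
# Hypothetical helper (not part of pdftotext): extract text from a PDF.
#   extract_text hoge.pdf          -> pdftotext picks the name (hoge.txt)
#   extract_text hoge.pdf out.txt  -> text is written to out.txt
#   extract_text hoge.pdf -        -> text goes to standard output (per the man page)
extract_text() {
    pdf=$1
    # Fail early with a clear message if the input file is missing.
    if [ ! -f "$pdf" ]; then
        echo "extract_text: no such file: $pdf" >&2
        return 1
    fi
    if [ $# -ge 2 ]; then
        # An explicit output name was given; pass it through unchanged.
        pdftotext "$pdf" "$2"
    else
        # No output name; let pdftotext derive it from the PDF name.
        pdftotext "$pdf"
    fi
}
```

Checking for the input file first keeps the error message short and predictable when a file name is mistyped.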

With the -layout option, the output text preserves the original layout.

To write hoge.pdf to hogehoge.txt while preserving the layout, run:

$ pdftotext -layout hoge.pdf hogehoge.txt
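The same command can be applied to many files at once. The following is only a sketch under assumptions: the function name convert_all_with_layout is hypothetical, and it assumes each text file should sit next to its PDF with a .txt extension.

```shell
# Hypothetical helper: run "pdftotext -layout" on every *.pdf in a directory.
# Each text file is written next to its PDF, with .pdf replaced by .txt.
convert_all_with_layout() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # When the directory has no PDFs the pattern stays unexpanded; skip it.
        [ -e "$pdf" ] || continue
        txt="${pdf%.pdf}.txt"
        # Report a failed conversion but keep going with the remaining files.
        pdftotext -layout "$pdf" "$txt" || echo "failed: $pdf" >&2
    done
}
```

For example, `convert_all_with_layout ~/docs` would convert the PDFs directly inside ~/docs (a placeholder path); subdirectories are not visited.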

For reference, here is the output of the -h option.

pdftotext version 0.18.4
Copyright 2005-2011 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2004 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -bbox             : output bounding box for each word and page size to html.  Sets -htmlmeta
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
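The -f and -l options listed above can be combined with -layout to extract only part of a document. This is a rough sketch, not a definitive script: the function name extract_pages and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper: extract pages FIRST..LAST of a PDF, keeping the layout.
#   extract_pages hoge.pdf 2 5 pages2-5.txt
extract_pages() {
    if [ $# -ne 4 ]; then
        echo "usage: extract_pages <PDF-file> <first> <last> <text-file>" >&2
        return 2
    fi
    pdf=$1 first=$2 last=$3 out=$4
    # Reject a reversed range before calling pdftotext.
    if [ "$first" -gt "$last" ]; then
        echo "extract_pages: first page must not exceed last page" >&2
        return 2
    fi
    # -f and -l select the page range; -layout keeps the physical layout.
    pdftotext -f "$first" -l "$last" -layout "$pdf" "$out"
}
```

Passing the same number for both pages, as in `extract_pages hoge.pdf 3 3 page3.txt`, extracts a single page.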
