EPUBファイルから画像を抽出する

電子書籍フォーマットとして広く使われている EPUB ファイルから、連番で画像を抽出する方法です。

ツール作りました! pip でインストールできます。https://github.com/ytyng/epub-extract-jpeg

EPUB ファイルの概要

EPUB ファイルとは、平たく言えば ZIP圧縮された XHTML です。コミックで一般的に使われる形式では、1ページが1つの XHTML ファイルになっており、その中に 1 つの img タグがあり、画像ファイルにリンクされています。

そのため、手順としては

EPUB ファイルを解凍
構成情報の XML ファイルを解析し、ページ画像の URL (パス) を取得
ページ画像を連番で改名コピー(移動)

となります。

1. EPUBファイルを解凍

unzip で一発です。

$ mkdir /tmp/epub-extract
$ unzip sample.epub -d /tmp/epub-extract

2. 構成情報の XML ファイルを解析し、ページ画像の URL (パス) を取得

まず、展開後のディレクトリにある META-INF/container.xml を開きます。ここに、rootfile というタグがあるので、その full-path 属性を見ます。 full-path の XML ファイルが、各ページの目次のようなものになります。

full-path が示す XML ファイルで、manifest タグの中に item タグが複数あります。これらは、EPUB 中の XHTML から使われているファイルです。

3. ページ画像を連番で改名コピー(移動)

この、item タグの中はおそらくページ順になっているので、このファイルをスクリプトで連番で改名コピーしながら収集すれば、ページ画像を抽出できます。

本来であれば、直接画像のパスを読むのではなく、ページの XHTML ファイルを開き、そこからリンクされている画像を収集していくのが正しいのですが、ページの XHTML の順と画像の item タグの順が一致しないケースは稀だと思いますので(EPUB 作成者が意図的に XHTML ファイルと画像ファイルの順番を一致させなかった場合などは、ページ数が正しく取得できません)、item タグの順で処理して基本的には問題無いでしょう。

これで、EPUB ファイルから画像ファイルを抽出する方法は終わりです。最後に、Python スクリプトにした例を掲載しておきます。

from __future__ import print_function, unicode_literals

import os
import time
import sys
import subprocess
import shutil
from xml.etree import ElementTree

TEMP_DIR = '/tmp/epub-extract-{}'.format(int(time.time()))


def procedure(file_path):
    if not os.path.exists(file_path):
        print("{} is not exist.".format(file_path), file=sys.stderr)
        return

    output_dir, ext = os.path.splitext(file_path)

    if ext != '.epub':
        print("{} is not epub.".format(file_path), file=sys.stderr)
        return

    if os.path.exists(output_dir):
        print("{} is already exists.".format(output_dir), file=sys.stderr)
        return

    os.mkdir(TEMP_DIR)

    subprocess.Popen(
        ('unzip', file_path, "-d", TEMP_DIR),
        stdout=subprocess.PIPE, stderr=subprocess.PIPE).communicate()

    os.mkdir(output_dir)

    container_xml_path = os.path.join(TEMP_DIR, 'META-INF', 'container.xml')
    etree = ElementTree.parse(container_xml_path)
    rootfile_node = etree.find(
        ".//{urn:oasis:names:tc:opendocument:xmlns:container}rootfile")
    content_opf_path = rootfile_node.attrib['full-path']

    content_xml_path = os.path.join(TEMP_DIR, content_opf_path)
    etree = ElementTree.parse(content_xml_path)
    manifest = etree.find('.//{http://www.idpf.org/2007/opf}manifest')
    items = manifest.findall('.//{http://www.idpf.org/2007/opf}item')

    image_paths = []
    for item in items:
        if item.attrib['media-type'] == 'image/jpeg':
            image_paths.append(item.attrib['href'])

    root_dir = os.path.dirname(content_xml_path)

    for i, image_path in enumerate(image_paths, start=1):
        destination_image_name = '{:03d}.jpg'.format(i)
        source_image_path = os.path.join(root_dir, image_path)
        destination_image_path = os.path.join(
            output_dir, destination_image_name)
        shutil.move(source_image_path, destination_image_path)
        print('{} -> {}'.format(image_path, destination_image_name))

    shutil.rmtree(TEMP_DIR)


def main():
    for arg in sys.argv[1:]:
        procedure(arg)


if __name__ == '__main__':
    main()

追記: Github に上げて、pip でインストールできるようにしました。
https://github.com/ytyng/epub-extract-jpeg

TORICO 技術開発ブログ

EPUBファイルから画像を抽出する

2016.02.26 19:17 (9年前) 投稿者: 四柳剛カテゴリ: EPUB , Python

EPUB ファイルの概要

1. EPUBファイルを解凍

2. 構成情報の XML ファイルを解析し、ページ画像の URL (パス) を取得

3. ページ画像を連番で改名コピー(移動)

コメント

TORICOのブログ

TORICOの他のブログ投稿

EPUBファイルから画像を抽出する

2016.02.26 19:17 (9年前) 投稿者: 四柳剛 カテゴリ: EPUB , Python

EPUB ファイルの概要

1. EPUBファイルを解凍

2. 構成情報の XML ファイルを解析し、ページ画像の URL (パス) を取得

3. ページ画像を連番で改名コピー(移動)

コメント

TORICOのブログ

TORICOの他のブログ投稿

2016.02.26 19:17 (9年前) 投稿者: 四柳剛カテゴリ: EPUB , Python