Importing Bookmark Files for Scanned PDFs

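I downloaded a scanned e-book from the FreeMdict Forum, and the poster mentioned that a bookmark file can be used for lookups, an idea that had not occurred to me before. I tried importing the bookmark file with pdftk-java.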

brew install pdftk-java

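The bookmark file has the following format: each line is a headword, a tab, and a page number, so headwords on the same page share the same number.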

@	1
a or an	1
a-/an-	2
-a	2
a fortiori	3
à la	3
à la carte	4
a posteriori	4
a priori	4
abacus	4
abbreviations	4
abide and abode	6
-ability	6
ablative	6
able and able to	7
-able/-ible	7
abled	8
abolition or abolishment	8
Aboriginal and Aborigine	8
about, about to, and not about to	9
about face or about turn	9
abridgement or abridgment	9
abscissa	9
absent	10
absolute	10
abstract nouns	11
academia, academe and academy	11
accents and diacritics	12
acceptance or acceptation	13
accessory or accessary	13
accidentally or accidently	13
acclaim	13
accommodation, accomodation and accommodations	14
accompanist or accompanyist	14
accusative	14
ACE	14
-acious/-aceous	14
acknowledgement or acknowledgment	14
acro-	15
acronyms	15
active verbs	16
acuity or acuteness	16
acute accents	16
ad or advert	16
AD or A.D.	17
ad hoc, ad-hoc and adhoc	17
ad hominem	18
ad infinitum	18
ad lib, ad-lib or adlib	18
ad personam	18
ad rem	18
adage	18
adaptation or adaption	18
adapter or adaptor	19
addendum	19
addition or additive	19
addresses	19
adherence or adhesion	19
adieu	19
adjacent, adjoining and adjunct	20

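In practice I ran into a few problems.

An imported bookmark file can carry only one bookmark per page, so headwords that share a page number have to be merged into a single entry first.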

# Replace 'input.txt' with the name of your input file
input_file_name = 'input.txt'

# Replace 'output.txt' with the name of your output file
output_file_name = 'output.txt'

def merge_lines_with_same_number(lines):
    """Group titles by page number, keeping the order in which pages first appear."""
    merged_lines = {}
    for line in lines:
        parts = line.split('\t')
        # Only lines of the form "title<TAB>number" are used; others are skipped
        if len(parts) == 2:
            text, number = parts
            number = number.strip()
            text = text.strip()
            merged_lines.setdefault(number, []).append(text)

    return merged_lines

def write_merged_lines_to_file(merged_lines, output_file):
    for number, texts in merged_lines.items():
        merged_text = ' | '.join(texts)
        output_file.write(f"{merged_text}\t{number}\n")

# Read and write as UTF-8 explicitly: titles such as "à la" are not ASCII
with open(input_file_name, 'r', encoding='utf-8') as input_file:
    lines = input_file.readlines()
    merged_lines = merge_lines_with_same_number(lines)

with open(output_file_name, 'w', encoding='utf-8') as output_file:
    write_merged_lines_to_file(merged_lines, output_file)

print(f"Merged the titles that share a page number, joined them with ' | ', and saved the result to '{output_file_name}'")

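The exported content needs to be processed into the following format. Nothing from PageMediaBegin downward is required, but the header information is. This is the format I got by adding one bookmark in PDF Expert and then exporting.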

InfoBegin
InfoKey: ModDate
InfoValue: D:20231110091307+08&apos;00&apos;
InfoBegin
InfoKey: CreationDate
InfoValue: D:20231107093455Z
InfoBegin
InfoKey: Producer
InfoValue: macOS Version 13.6.1 (Build 22G313) Quartz PDFContext, AppendMode 1.1
PdfID0: 27ee54f1394bde6950015ebab5958d48
PdfID1: a0546a75c44ec4b30b12873737127f5b
NumberOfPages: 793
BookmarkBegin
BookmarkTitle: My Bookmarks
BookmarkLevel: 1
BookmarkPageNumber: 0
BookmarkBegin
BookmarkTitle: also
BookmarkLevel: 2
BookmarkPageNumber: 54
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 3,500 5,325.673
PageMediaDimensions: 3,500 5,325.673
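
The dump above has the shape of what pdftk's `dump_data` operation prints, and `update_info` is the operation that reads such a file back in. Below is a minimal Python sketch of that round trip, assuming `pdftk` is on the PATH after the `brew install` step. The helper names and the file names in the usage note are placeholders, and the `_utf8` variants are chosen here so that non-ASCII titles are written as plain text.

```python
import subprocess

def dump_cmd(pdf_path, info_path):
    """Command that writes the PDF's metadata and bookmarks to a text file."""
    return ["pdftk", pdf_path, "dump_data_utf8", "output", info_path]

def update_cmd(pdf_path, info_path, out_path):
    """Command that writes a new PDF using the edited metadata file."""
    return ["pdftk", pdf_path, "update_info_utf8", info_path, "output", out_path]

def run(cmd):
    # check=True raises CalledProcessError if pdftk exits with an error
    subprocess.run(cmd, check=True)
```

A typical session would call `run(dump_cmd("book.pdf", "meta.txt"))`, edit `meta.txt`, and then call `run(update_cmd("book.pdf", "meta.txt", "book_bookmarked.pdf"))`.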

The bookmark format above differs from the one provided on the forum, so the forum file has to be converted into this format.

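# Replace 'input.txt' with the name of your input file
input_file_name = 'input.txt'

# Replace 'output.txt' with the name of your output file
output_file_name = 'output.txt'

def convert_to_bookmark(line, page_number):
    # Keep only the title: drop a trailing "<TAB>number" if the line has one
    title = line.rsplit('\t', 1)[0].strip()
    # Format as a Bookmark record
    return f"BookmarkBegin\nBookmarkTitle: {title}\nBookmarkLevel: 2\nBookmarkPageNumber: {page_number}\n"

with open(input_file_name, 'r', encoding='utf-8') as input_file:
    lines = input_file.readlines()
    # The page number is the 1-based position of the line in the file,
    # not the number written on the line
    bookmarks = [convert_to_bookmark(line, i+1) for i, line in enumerate(lines)]

with open(output_file_name, 'w', encoding='utf-8') as output_file:
    output_file.writelines(bookmarks)

print(f"Converted each line to Bookmark format and saved the result to '{output_file_name}'")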
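
Since the header has to stay in the file, the generated bookmark records need to be combined with the exported metadata rather than used on their own. Here is a minimal sketch of that step, assuming the export is plain text shaped like the sample above. `splice_bookmarks` is a hypothetical helper, and placing the records just before the first `PageMediaBegin` line simply mirrors where they sit in that sample.

```python
def splice_bookmarks(info_text, bookmark_text):
    """Return info_text with its Bookmark* lines replaced by bookmark_text."""
    # Drop the existing bookmark records (BookmarkBegin, BookmarkTitle, ...)
    kept = [line for line in info_text.splitlines()
            if not line.startswith("Bookmark")]
    # Insert before the first PageMedia record, or at the end if there is none
    insert_at = next(
        (i for i, line in enumerate(kept) if line.startswith("PageMediaBegin")),
        len(kept),
    )
    new_lines = bookmark_text.splitlines()
    return "\n".join(kept[:insert_at] + new_lines + kept[insert_at:]) + "\n"
```

The header lines (`InfoBegin`, `PdfID0`, `NumberOfPages` and so on) pass through untouched, so only the bookmark section changes.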

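The imported bookmarks did not land on the corresponding pages. At this point I began to realize that these are not what I had understood bookmarks to be: what pdftk-java adds is presumably the Outline. The misaligned entries therefore have to be corrected and the front-matter pages filled in at the top, so that each Outline entry corresponds to exactly one page.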

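Cover	1
Cover (black and white)	2
Publication information	3
Table of contents	4
Publisher's note	5
Translator's preface	6
Translator's preface	7
Author's preface	8
Author's preface	9
Author's preface	10
Summary, conventions and lookup method	11
Summary, conventions and lookup method	12
Blank page	13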
@ | a or an	14
a-/an- | -a	15
a fortiori | à la	16
à la carte | a posteriori | a priori | abacus | abbreviations	17
abbreviations       18
abide and abode | -ability | ablative	19
able and able to | -able/-ible	20
abled | abolition or abolishment | Aboriginal and Aborigine	21
about, about to, and not about to | about face or about turn | abridgement or abridgment | abscissa	22
absent | absolute	23
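
The aligned list above follows a visible pattern: book page 1 becomes PDF page 14, so every entry is shifted by the 13 front-matter pages, and a page with no headword of its own (PDF page 18) repeats the last headword of the previous page. Below is a rough sketch of that alignment, assuming the merged file has been read into a dict that maps book page to title. `align_pages`, its parameters, and the carry-over rule are inferred from the sample and are not a documented procedure.

```python
def align_pages(merged, offset, front_matter):
    """Build one (title, pdf_page) row per PDF page.

    merged:       {book_page: title}, e.g. {1: "@ | a or an"}
    offset:       number of front-matter pages before book page 1
    front_matter: titles for PDF pages 1..offset
    """
    if len(front_matter) != offset:
        raise ValueError("front_matter must contain exactly `offset` titles")
    rows = [(title, i + 1) for i, title in enumerate(front_matter)]
    previous = None
    for book_page in range(1, max(merged) + 1):
        title = merged.get(book_page)
        if title is None:
            if previous is None:
                raise ValueError(f"no title to carry over for book page {book_page}")
            # A page without its own headword continues the previous entry
            title = previous.split(" | ")[-1]
        rows.append((title, book_page + offset))
        previous = title
    return rows
```

Writing each row back out as `title<TAB>pdf_page` gives a file with exactly one line per PDF page, which is what the conversion script's line-position numbering relies on.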

The end result is that the contents of the scanned book can now be looked up very conveniently.

Source: Create bookmarks for your PDF with pdftk | Opensource.com

Update, 2023-11-13

It turns out that all entries can simply be set as level-1 bookmarks, and the starting position can be set as well.
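
One way to read this, offered as an assumption and not as the exact procedure used: every record gets `BookmarkLevel: 1`, and the starting position is handled by adding a fixed offset to the book's page number. The function name `level_one_bookmarks` and the `offset` parameter are made up for illustration. The value 13 in the usage note comes from the aligned list above, where book page 1 is PDF page 14.

```python
def level_one_bookmarks(lines, offset=0):
    """Turn 'title<TAB>book_page' lines into level-1 bookmark records."""
    records = []
    for line in lines:
        title, sep, page = line.rstrip("\n").rpartition("\t")
        if not sep or not page.strip().isdigit():
            continue  # skip lines that are not 'title<TAB>number'
        records.append(
            "BookmarkBegin\n"
            f"BookmarkTitle: {title.strip()}\n"
            "BookmarkLevel: 1\n"
            f"BookmarkPageNumber: {int(page) + offset}\n"
        )
    return "".join(records)
```

Calling `level_one_bookmarks(lines, offset=13)` on the merged file would place the entry for book page 1 on PDF page 14.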
