从 FreeMdict Forum 下载的扫描电子书,发帖人提到了书签文件可以用来查找,这个想法是我之前没有想到的。尝试用 pdftk-java 来导入书签文件。
brew install pdftk-java
书签文件如下的格式,相同页码的单词数字相同。
@ 1
a or an 1
a-/an- 2
-a 2
a fortiori 3
à la 3
à la carte 4
a posteriori 4
a priori 4
abacus 4
abbreviations 4
abide and abode 6
-ability 6
ablative 6
able and able to 7
-able/-ible 7
abled 8
abolition or abolishment 8
Aboriginal and Aborigine 8
about, about to, and not about to 9
about face or about turn 9
abridgement or abridgment 9
abscissa 9
absent 10
absolute 10
abstract nouns 11
academia, academe and academy 11
accents and diacritics 12
acceptance or acceptation 13
accessory or accessary 13
accidentally or accidently 13
acclaim 13
accommodation, accomodation and accommodations 14
accompanist or accompanyist 14
accusative 14
ACE 14
-acious/-aceous 14
acknowledgement or acknowledgment 14
acro- 15
acronyms 15
active verbs 16
acuity or acuteness 16
acute accents 16
ad or advert 16
AD or A.D. 17
ad hoc, ad-hoc and adhoc 17
ad hominem 18
ad infinitum 18
ad lib, ad-lib or adlib 18
ad personam 18
ad rem 18
adage 18
adaptation or adaption 18
adapter or adaptor 19
addendum 19
addition or additive 19
addresses 19
adherence or adhesion 19
adieu 19
adjacent, adjoining and adjunct 20
在实际处理中发现几个问题
- 导入的书签一页只能对应一个书签
# 请将 'input.txt' 替换为你的输入文件名
input_file_name = 'input.txt'
# 请将 'output.txt' 替换为你的输出文件名
output_file_name = 'output.txt'
def merge_lines_with_same_number(lines):
merged_lines = {}
for line in lines:
parts = line.split('\t')
if len(parts) == 2:
text, number = parts
number = number.strip()
text = text.strip()
if number in merged_lines:
merged_lines[number].append(text)
else:
merged_lines[number] = [text]
return merged_lines
def write_merged_lines_to_file(merged_lines, output_file):
for number, texts in merged_lines.items():
merged_text = ' | '.join(texts)
output_file.write(f"{merged_text}\t{number}\n")
with open(input_file_name, 'r') as input_file:
lines = input_file.readlines()
merged_lines = merge_lines_with_same_number(lines)
with open(output_file_name, 'w') as output_file:
write_merged_lines_to_file(merged_lines, output_file)
print(f"已将每行后面数字相同的英文内容合并,并用'|'分隔,并保存为'{output_file_name}'")
- 导出的内容需要处理成以下格式,其中
PageMediaBegin
往下的内容都不是必须的,头部的信息是必须的,是我在 PDF Expert 中添加了一个书签后导出的格式。InfoBegin InfoKey: ModDate InfoValue: D:20231110091307+08'00' InfoBegin InfoKey: CreationDate InfoValue: D:20231107093455Z InfoBegin InfoKey: Producer InfoValue: macOS Version 13.6.1 (Build 22G313) Quartz PDFContext, AppendMode 1.1 PdfID0: 27ee54f1394bde6950015ebab5958d48 PdfID1: a0546a75c44ec4b30b12873737127f5b NumberOfPages: 793 BookmarkBegin BookmarkTitle: My Bookmarks BookmarkLevel: 1 BookmarkPageNumber: 0 BookmarkBegin BookmarkTitle: also BookmarkLevel: 2 BookmarkPageNumber: 54 PageMediaBegin PageMediaNumber: 1 PageMediaRotation: 0 PageMediaRect: 0 0 3,500 5,325.673 PageMediaDimensions: 3,500 5,325.673
- 上面书签的格式和论坛提供的书签不一致,需要处理成这样的格式。
# 请将 'input.txt' 替换为你的输入文件名 input_file_name = 'input.txt' # 请将 'output.txt' 替换为你的输出文件名 output_file_name = 'output.txt' def convert_to_bookmark(line, page_number): # 格式化为Bookmark文本 return f"BookmarkBegin\nBookmarkTitle: {line.strip()}\nBookmarkLevel: 2\nBookmarkPageNumber: {page_number}\n" with open(input_file_name, 'r') as input_file: lines = input_file.readlines() bookmarks = [convert_to_bookmark(line, i+1) for i, line in enumerate(lines)] with open(output_file_name, 'w') as output_file: output_file.writelines(bookmarks) print(f"已将每一行转换为Bookmark格式,并保存为'{output_file_name}'")
- 书签导入并不会按照对应的页码,这里我开始意识到和我理解的 bookmarks 不是一个东西,
pdftk-java
添加的应该是Ouline
。因此需要将错位的条目校对好以及填充好头部,这样才会让 Outline 与页面一一对应。最终处理后的结果是,可以非常方便地检索扫描件当中的内容了。封面 1 封面黑白 2 出版信息 3 目录 4 出版说明 5 编译者序 6 编译者序 7 著者前言 8 著者前言 9 著者前言 10 内容提要、体例及查询方法 11 内容提要、体例及查询方法 12 空白页 13 @ | a or an 14 a-/an- | -a 15 a fortiori | à la 16 à la carte | a posteriori | a priori | abacus | abbreviations 17 abbreviations 18 abide and abode | -ability | ablative 19 able and able to | -able/-ible 20 abled | abolition or abolishment | Aboriginal and Aborigine 21 about, about to, and not about to | about face or about turn | abridgement or abridgment | abscissa 22 absent | absolute 23
Source: Create bookmarks for your PDF with pdftk | Opensource.com
2023-11-13 更新
原来是可以都设置成一级 bookmark,同时对于开始的位置作设置。