PDF Utils
The module ricecooker.utils.pdf contains helper functions for manipulating PDFs.
PDF splitter
When importing long source PDFs such as books (100+ pages), it is better for the Kolibri user experience to split them into multiple shorter PDF content nodes.
The PDFParser class in ricecooker.utils.pdf is a wrapper around the PyPDF2
library that allows us to split long PDF documents into individual chapters,
based on either the information available in the PDF’s table of contents, or user-defined page ranges.
Split into chapters
Here is how to split a PDF document located at pdf_path, which can be either
a local path or a URL:
from ricecooker.utils.pdf import PDFParser

# pdf_path can also be a URL, e.g. 'https://somesite.org/some/remote/doc.pdf'
pdf_path = '/some/local/doc.pdf'
with PDFParser(pdf_path) as pdfparser:
    chapters = pdfparser.split_chapters()
The output chapters is a list of dictionaries with title and path keys:
[
    {'title': 'First chapter', 'path': 'downloads/doc/First-chapter.pdf'},
    {'title': 'Second chapter', 'path': 'downloads/doc/Second-chapter.pdf'},
    ...
]
Use this information to create an individual DocumentNode for each PDF and store
them in a TopicNode that corresponds to the book:
from ricecooker.classes import nodes, files

book_node = nodes.TopicNode(title='Book title', description='Book description')
for chapter in chapters:
    chapter_node = nodes.DocumentNode(
        title=chapter['title'],
        files=[files.DocumentFile(chapter['path'])],
        ...
    )
    book_node.add_child(chapter_node)
By default, the split PDFs are saved in the directory ./downloads. You can customize
where the files are saved by passing the optional argument directory when initializing
the PDFParser class, e.g., PDFParser(pdf_path, directory='somedircustomdir').
The split_chapters method uses the internal get_toc method to obtain a list
of page ranges for each chapter. Use pdfparser.get_toc() to inspect the PDF’s
table of contents. The table of contents data returned by the get_toc method
has the following format:
[
    {'title': 'First chapter', 'page_start': 0, 'page_end': 10},
    {'title': 'Second chapter', 'page_start': 10, 'page_end': 20},
    ...
]
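Before splitting, it can help to sanity-check the table of contents data. The sketch below is not part of ricecooker: check_page_ranges is a hypothetical helper that assumes the format shown above and treats page_end as exclusive (an assumption, based on adjacent chapters in the example sharing a boundary page).

```python
def check_page_ranges(toc):
    """Return a list of problems found in get_toc()-style data.

    Assumes each entry has 'title', 'page_start', and 'page_end' keys,
    and that page_end is exclusive (an assumption, not confirmed behavior).
    """
    problems = []
    previous_end = None
    for entry in toc:
        start, end = entry['page_start'], entry['page_end']
        if end <= start:
            problems.append('%s: empty or reversed range' % entry['title'])
        if previous_end is not None and start < previous_end:
            problems.append('%s: overlaps previous chapter' % entry['title'])
        previous_end = end
    return problems


toc = [
    {'title': 'First chapter', 'page_start': 0, 'page_end': 10},
    {'title': 'Second chapter', 'page_start': 10, 'page_end': 20},
]
print(check_page_ranges(toc))  # prints []
```

An empty list means no empty, reversed, or overlapping ranges were found under these assumptions.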
If the page ranges automatically detected from the PDF’s table of contents are
not suitable for the document you’re processing, or if the PDF document does not
contain table of contents information, you can manually create the title and
page range data and pass it as the jsondata argument to the split_chapters method:
page_ranges = pdfparser.get_toc()
# possibly modify/customize page_ranges, or load from a manually created file
chapters = pdfparser.split_chapters(jsondata=page_ranges)
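When a PDF has no usable table of contents, the page ranges can be derived from a hand-written list of chapter start pages. This is a minimal sketch, assuming zero-based page numbers and that each chapter ends where the next one begins. The helper ranges_from_starts and its arguments are hypothetical names, not part of ricecooker.

```python
import json


def ranges_from_starts(chapter_starts, total_pages):
    """Build title/page-range dicts from (title, page_start) pairs.

    chapter_starts must be sorted by page_start; the last chapter
    runs to total_pages.
    """
    page_ranges = []
    for index, (title, page_start) in enumerate(chapter_starts):
        if index + 1 < len(chapter_starts):
            # A chapter ends where the next one starts
            page_end = chapter_starts[index + 1][1]
        else:
            page_end = total_pages
        page_ranges.append(
            {'title': title, 'page_start': page_start, 'page_end': page_end}
        )
    return page_ranges


page_ranges = ranges_from_starts(
    [('First chapter', 0), ('Second chapter', 10)], total_pages=20
)
# The data is JSON-compatible, so it can be saved and edited by hand
serialized = json.dumps(page_ranges, indent=2)
```

The resulting list has the same shape as the get_toc() output shown earlier, so it can be passed as the jsondata argument in the same way.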
Split into chapters and subchapters
By default, the get_toc method detects only the top-level document structure,
which might not be sufficient to split the document into useful chunks.
You can pass the optional argument subchapters=True to the get_toc() method
to obtain a two-level hierarchy of chapters and subchapters from the PDF’s TOC.
For example, if the table of contents of a textbook PDF has the following structure:
Intro
Part I
    Subchapter 1
    Subchapter 2
Part II
    Subchapter 21
    Subchapter 22
Conclusion
then calling pdfparser.get_toc(subchapters=True) will return the following
chapter-subchapter tree structure:
[
    {'title': 'Part I', 'page_start': 0, 'page_end': 10,
     'children': [
        {'title': 'Subchapter 1', 'page_start': 0, 'page_end': 5},
        {'title': 'Subchapter 2', 'page_start': 5, 'page_end': 10}
     ]},
    {'title': 'Part II', 'page_start': 10, 'page_end': 20,
     'children': [
        {'title': 'Subchapter 21', 'page_start': 10, 'page_end': 15},
        {'title': 'Subchapter 22', 'page_start': 15, 'page_end': 20}
     ]},
    {'title': 'Conclusion', 'page_start': 20, 'page_end': 25}
]
Use the split_subchapters method to process this tree structure and obtain a
tree of titles and paths:
[
    {'title': 'Part I',
     'children': [
        {'title': 'Subchapter 1', 'path': '/tmp/0-0-Subchapter-1.pdf'},
        {'title': 'Subchapter 2', 'path': '/tmp/0-1-Subchapter-2.pdf'},
     ]},
    {'title': 'Part II',
     'children': [
        {'title': 'Subchapter 21', 'path': '/tmp/1-0-Subchapter-21.pdf'},
        {'title': 'Subchapter 22', 'path': '/tmp/1-1-Subchapter-22.pdf'},
     ]},
    {'title': 'Conclusion', 'path': '/tmp/2-Conclusion.pdf'}
]
You’ll need to create a TopicNode for each chapter that has children and
create a DocumentNode for each of the children of that chapter.
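The traversal described above can be sketched without any ricecooker imports. The function iter_node_specs below is a hypothetical helper (not part of ricecooker) that walks a tree shaped like the split_subchapters output and reports which kind of node each entry would become. It assumes that chapters without a children key carry their own path.

```python
def iter_node_specs(chapters):
    """Yield (kind, title, path, parent_title) tuples for a chapter tree.

    kind is 'topic' for chapters that have children and 'document'
    for everything that points at a PDF file.
    """
    for chapter in chapters:
        children = chapter.get('children')
        if children:
            # A chapter with children becomes a topic holding its subchapters
            yield ('topic', chapter['title'], None, None)
            for child in children:
                yield ('document', child['title'], child['path'], chapter['title'])
        else:
            # A chapter without children is a single document
            yield ('document', chapter['title'], chapter['path'], None)


tree = [
    {'title': 'Part I',
     'children': [
        {'title': 'Subchapter 1', 'path': '/tmp/0-0-Subchapter-1.pdf'},
        {'title': 'Subchapter 2', 'path': '/tmp/0-1-Subchapter-2.pdf'},
     ]},
    {'title': 'Conclusion', 'path': '/tmp/2-Conclusion.pdf'},
]
specs = list(iter_node_specs(tree))
```

In a real chef script, each 'topic' entry would correspond to a nodes.TopicNode and each 'document' entry to a nodes.DocumentNode with a files.DocumentFile, added to its parent with add_child as in the single-level example.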
Accessibility notes
Do not use PDFParser for tagged PDFs because splitting and processing loses
the accessibility features of the original PDF document.