PDF Utils
The module ricecooker.utils.pdf contains helper functions for manipulating PDFs.
PDF splitter
When importing source PDFs like books that are very long documents (100+ pages), it is better for the Kolibri user experience to split them into multiple shorter PDF content nodes.
The PDFParser class in ricecooker.utils.pdf is a wrapper around the PyPDF2 library that allows us to split long PDF documents into individual chapters, based on either the information available in the PDF’s table of contents or user-defined page ranges.
Split into chapters
Here is how to split a PDF document located at pdf_path, which can be either a local path or a URL:
from ricecooker.utils.pdf import PDFParser
pdf_path = '/some/local/doc.pdf'  # or a URL such as 'https://somesite.org/some/remote/doc.pdf'
with PDFParser(pdf_path) as pdfparser:
    chapters = pdfparser.split_chapters()
The output chapters is a list of dictionaries with title and path keys:
[
{'title':'First chapter', 'path':'downloads/doc/First-chapter.pdf'},
{'title':'Second chapter', 'path':'downloads/doc/Second-chapter.pdf'},
...
]
Use this information to create an individual DocumentNode for each PDF and store them in a TopicNode that corresponds to the book:
from ricecooker.classes import nodes, files
book_node = nodes.TopicNode(title='Book title', description='Book description')
for chapter in chapters:
    chapter_node = nodes.DocumentNode(
        title=chapter['title'],
        files=[files.DocumentFile(chapter['path'])],
        ...
    )
    book_node.add_child(chapter_node)
By default, the split PDFs are saved in the directory ./downloads. You can customize where the files are saved by passing the optional argument directory when initializing the PDFParser class, e.g., PDFParser(pdf_path, directory='somedircustomdir').
The split_chapters method uses the internal get_toc method to obtain a list of page ranges for each chapter. Use pdfparser.get_toc() to inspect the PDF’s table of contents. The table of contents data returned by the get_toc method has the following format:
[
{'title': 'First chapter', 'page_start': 0, 'page_end': 10},
{'title': 'Second chapter', 'page_start': 10, 'page_end': 20},
...
]
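Before deciding whether the detected ranges are suitable, it can help to see how long each chapter is. The following is a minimal sketch that works on plain data in the format above; the helper name long_chapters and the max_pages threshold are hypothetical, and the page count assumes that page_end is exclusive, as the sample data suggests.

```python
def long_chapters(toc, max_pages=50):
    """Return (title, page_count) pairs for TOC entries longer than max_pages."""
    flagged = []
    for entry in toc:
        # Assumes page_end is exclusive, so the length is a simple difference.
        page_count = entry["page_end"] - entry["page_start"]
        if page_count > max_pages:
            flagged.append((entry["title"], page_count))
    return flagged


# Sample data in the same shape as the get_toc() output shown above.
toc = [
    {"title": "First chapter", "page_start": 0, "page_end": 10},
    {"title": "Second chapter", "page_start": 10, "page_end": 120},
]
flagged = long_chapters(toc)
```

Chapters that show up in flagged are candidates for manual page ranges or for splitting into subchapters.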
If the page ranges automatically detected from the PDF’s table of contents are not suitable for the document you’re processing, or if the PDF document does not contain table of contents information, you can manually create the title and page range data and pass it as the jsondata argument to the split_chapters() method:
page_ranges = pdfparser.get_toc()
# possibly modify/customize page_ranges, or load from a manually created file
chapters = pdfparser.split_chapters(jsondata=page_ranges)
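When the PDF has no usable table of contents, the page range data can be written by hand from a list of chapter titles and starting pages. The sketch below is one possible way to do that; build_page_ranges is a hypothetical helper, not part of ricecooker, and it assumes zero-based page numbers with an exclusive page_end, which is what the sample get_toc output suggests.

```python
def build_page_ranges(chapter_starts, total_pages):
    """Build get_toc()-style dicts from (title, first_page) pairs.

    Assumes zero-based pages and an exclusive page_end, so each chapter
    ends on the page where the next one starts.
    """
    ordered = sorted(chapter_starts, key=lambda item: item[1])
    page_ranges = []
    for index, (title, page_start) in enumerate(ordered):
        # The last chapter runs to the end of the document.
        if index + 1 < len(ordered):
            page_end = ordered[index + 1][1]
        else:
            page_end = total_pages
        if not 0 <= page_start < page_end <= total_pages:
            raise ValueError(f"Invalid page range for {title!r}")
        page_ranges.append(
            {"title": title, "page_start": page_start, "page_end": page_end}
        )
    return page_ranges


page_ranges = build_page_ranges(
    [("First chapter", 0), ("Second chapter", 10)], total_pages=20
)
```

The resulting page_ranges list has the same shape as the get_toc output, so it can be passed as the jsondata argument to split_chapters.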
Split into chapters and subchapters
By default, the get_toc method detects only the top-level document structure, which might not be sufficient to split the document into useful chunks. You can pass the optional argument subchapters=True to the get_toc() method to obtain a two-level hierarchy of chapters and subchapters from the PDF’s TOC.
For example, if the table of contents of a textbook PDF has the following structure:
Intro
Part I
    Subchapter 1
    Subchapter 2
Part II
    Subchapter 21
    Subchapter 22
Conclusion
then calling pdfparser.get_toc(subchapters=True) will return the following chapter-subchapter tree structure:
[
    { 'title': 'Part I', 'page_start': 0, 'page_end': 10,
      'children': [
          {'title': 'Subchapter 1', 'page_start': 0, 'page_end': 5},
          {'title': 'Subchapter 2', 'page_start': 5, 'page_end': 10}
      ]},
    { 'title': 'Part II', 'page_start': 10, 'page_end': 20,
      'children': [
          {'title': 'Subchapter 21', 'page_start': 10, 'page_end': 15},
          {'title': 'Subchapter 22', 'page_start': 15, 'page_end': 20}
      ]},
    { 'title': 'Conclusion', 'page_start': 20, 'page_end': 25 }
]
Use the split_subchapters method to process this tree structure and obtain the tree of titles and paths:
[
    { 'title': 'Part I',
      'children': [
          {'title': 'Subchapter 1', 'path': '/tmp/0-0-Subchapter-1.pdf'},
          {'title': 'Subchapter 2', 'path': '/tmp/0-1-Subchapter-2.pdf'},
      ]},
    { 'title': 'Part II',
      'children': [
          {'title': 'Subchapter 21', 'path': '/tmp/1-0-Subchapter-21.pdf'},
          {'title': 'Subchapter 22', 'path': '/tmp/1-1-Subchapter-22.pdf'},
      ]},
    { 'title': 'Conclusion', 'path': '/tmp/2-Conclusion.pdf'}
]
You’ll need to create a TopicNode for each chapter that has children and create a DocumentNode for each of the children of that chapter.
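The traversal described above can be sketched without touching ricecooker at all. In the example below, plain dictionaries stand in for the real node classes: a 'topic' entry marks where a TopicNode would be created and a 'document' entry marks where a DocumentNode would be created. The function name plan_nodes and the 'kind' key are hypothetical and used only for illustration.

```python
def plan_nodes(chapters):
    """Map a title/path tree to a plan of topic and document entries.

    Chapters with children become 'topic' entries (stand-ins for TopicNode);
    chapters without children and all subchapters become 'document' entries
    (stand-ins for DocumentNode).
    """
    plan = []
    for chapter in chapters:
        if chapter.get("children"):
            documents = [
                {"kind": "document", "title": child["title"], "path": child["path"]}
                for child in chapter["children"]
            ]
            plan.append(
                {"kind": "topic", "title": chapter["title"], "children": documents}
            )
        else:
            # A chapter with no subchapters is a single PDF of its own.
            plan.append(
                {"kind": "document", "title": chapter["title"], "path": chapter["path"]}
            )
    return plan


# Sample data in the same shape as the title/path tree shown above.
split_tree = [
    {"title": "Part I", "children": [
        {"title": "Subchapter 1", "path": "/tmp/0-0-Subchapter-1.pdf"},
        {"title": "Subchapter 2", "path": "/tmp/0-1-Subchapter-2.pdf"},
    ]},
    {"title": "Conclusion", "path": "/tmp/2-Conclusion.pdf"},
]
plan = plan_nodes(split_tree)
```

In a real chef script, each 'topic' entry would be replaced by a nodes.TopicNode and each 'document' entry by a nodes.DocumentNode with a files.DocumentFile, following the pattern used for single-level chapters.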
Accessibility notes
Do not use PDFParser for tagged PDFs because splitting and processing loses the accessibility features of the original PDF document.