Commit c19a6f46 authored by Bernard Stephan

Removal of the README-en file, which is completely out of date

parent cca6bbd7
# Python version of pdf2blocks
## What does it do?
The aim of this code is to retrieve, as well as possible,
the structure (title, subtitle, …) of a PDF document from the output
of different PDF text extractors.

For now, it uses *pdftotext* and *pdftohtml*, two tools included
in the poppler library, which is derived from Xpdf.

**pdftotext** can return an XML document containing the text
of the PDF document, divided into pages, flows and blocks. This gives
a reasonably good structure of the document, particularly for detecting
long paragraphs. On the other hand, it makes title levels hard
to recognize, because it gives no information about the font sizes,
styles and colours used for the text.

**pdftohtml** can also return an XML document, which records,
for each piece of text, the font used and its colour. It also
gives its position in the page (and the page number, of course).

*pdf2blocks* prints its result on standard output using markdown
syntax, to make it easy to convert into other formats. The code could
also be adapted to produce another output format directly.

The next chapter explains how *pdf2blocks* retrieves the results
of *pdftotext* and *pdftohtml*, then what processing it applies to them.
## The algorithm of pdf2blocks
### pdftotext
The program runs the following command line:

    pdftotext -bbox-layout -eol unix /path/to/file.pdf

The result of this command is stored in a list, poorly named ***blocks***;
in the code, this name has become the term for "the result of pdftotext".
Each element of this list is a *Python dictionary* with the
following structure:
- **page:** The number of the page containing the block.
- **flow:** The XML output of *pdftotext* does not identify flows,
  so this is a counter incremented every time a "flow" XML tag
  is reached, which makes it possible to identify the blocks
  belonging to the same flow. This value is currently not used
  in the processing.
- **x_min**, **x_max**, **y_min** and **y_max**: The coordinates
  of the block within its page.
- **h_min** and **h_max**: The smallest and largest height of the
  lines contained in this block. These values are computed by
  pdf2blocks; they are not part of the *pdftotext* output.
- **nb_cars** and **nb_words**: The number of characters and
  the number of words in this block. *(nb_cars should have been*
  *named nb_chars; it's a mistake due to the French word "caractère").*
  These values are also computed.
- **flags:** A 16-bit value, initialized to 0, that stores the result
  of certain processing steps.
- **lines:** A list of the lines composing the block.
  Its elements are also dictionaries, with the following structure:
    - **text:** The text of this line. This line of text
      is not given by *pdftotext*, which returns the text word by word.
      It is built by joining the words of the line with a space
      character, except when the first word is a single character
      taller than the second word, in which case no space is added after it.
      This handles the decorative style in which the first character
      of a paragraph is oversized.
    - **height:** The value of ```yMax - yMin``` (from the line coordinates
      returned by *pdftotext*).
    - **nb_words**, **nb_cars** and **flags**: The same as
      in the block structure, but with information related to the line.
    - **words:** Another list of dictionaries, with a very
      simple structure:
        - **height:** the same as in the *lines* structure,
        - **text:** the word itself, which is how *pdftotext* actually
          returns the text it extracts.
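
As an illustration, the structure above can be built along the following lines. This is only a sketch, not the code of pdf2blocks: `SAMPLE` is a simplified, hand-written stand-in for the *pdftotext* output (the real output carries more markup), the helper names `height`, `join_words` and `parse_blocks` are hypothetical, and counting `nb_cars` as the length of the joined line text is an assumption.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for `pdftotext -bbox-layout` output
# (assumption: the real output has more markup around these elements).
SAMPLE = """
<doc>
  <page width="595" height="842">
    <flow>
      <block xMin="50" yMin="100" xMax="300" yMax="130">
        <line xMin="50" yMin="100" xMax="300" yMax="112">
          <word xMin="50" yMin="100" xMax="80" yMax="112">Hello</word>
          <word xMin="85" yMin="100" xMax="120" yMax="112">world</word>
        </line>
        <line xMin="50" yMin="115" xMax="200" yMax="130">
          <word xMin="50" yMin="115" xMax="60" yMax="130">L</word>
          <word xMin="60" yMin="120" xMax="100" yMax="130">orem</word>
        </line>
      </block>
    </flow>
  </page>
</doc>
"""


def height(elem):
    """Return yMax - yMin for an element carrying bounding-box attributes."""
    return float(elem.get("yMax")) - float(elem.get("yMin"))


def join_words(words):
    """Join words with spaces, except after an oversized one-character first word."""
    if (len(words) >= 2 and len(words[0]["text"]) == 1
            and words[0]["height"] > words[1]["height"]):
        return words[0]["text"] + " ".join(w["text"] for w in words[1:])
    return " ".join(w["text"] for w in words)


def parse_blocks(xml_text):
    """Build the list of block dictionaries described above."""
    blocks = []
    flow_no = 0  # incremented at every <flow> tag, as described for "flow"
    root = ET.fromstring(xml_text.strip())
    for page_no, page in enumerate(root.iter("page"), start=1):
        for flow in page.iter("flow"):
            flow_no += 1
            for block in flow.iter("block"):
                lines = []
                for line in block.iter("line"):
                    words = [{"text": w.text or "", "height": height(w)}
                             for w in line.iter("word")]
                    text = join_words(words)
                    lines.append({
                        "text": text,
                        "height": height(line),
                        "nb_words": len(words),
                        "nb_cars": len(text),  # assumption: length of joined text
                        "flags": 0,
                        "words": words,
                    })
                heights = [l["height"] for l in lines]
                blocks.append({
                    "page": page_no,
                    "flow": flow_no,
                    "x_min": float(block.get("xMin")),
                    "x_max": float(block.get("xMax")),
                    "y_min": float(block.get("yMin")),
                    "y_max": float(block.get("yMax")),
                    "h_min": min(heights, default=0.0),
                    "h_max": max(heights, default=0.0),
                    "nb_cars": sum(l["nb_cars"] for l in lines),
                    "nb_words": sum(l["nb_words"] for l in lines),
                    "flags": 0,
                    "lines": lines,
                })
    return blocks


blocks = parse_blocks(SAMPLE)
for line in blocks[0]["lines"]:
    print(line["height"], repr(line["text"]))
```

In the second sample line, the one-character word "L" is taller than "orem", so the two are joined without a space.
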
### pdftohtml
After *pdftotext*, the program calls *pdftohtml* with the
following command line:

    pdftohtml -xml -i -stdout /path/to/file.pdf

The result of this command is also an XML document, containing two
important pieces of information:
- the fonts used in the PDF document,
- the text contained in the document and the font used for it.
The text is returned grouped into same-font segments. For example,
the line

> The 2<sup>nd</sup> big bug.

will be returned in three parts:
- The 2
- nd
- big bug.

The same happens with the line

> The ***yellow*** butterfly.

because another style (slanted, bold, …) of the same font is also
considered to be another font.
#### Fonts returned by *pdftohtml*
The fonts returned by *pdftohtml* are stored in a list called
**fontspec**, after the XML tag name, and often referred to simply
as **fonts**. The elements of this list are also dictionaries,
with the following structure:
- **id:** A unique number identifying the font. This number is used
  to associate a text segment with its font.
- **size:** The font size in "pixels" (px).
- **family:** The font's name, including font styles if any, separated
  by commas. Examples:
    - "ABCDEE+Calibri"
    - "ABCDEE+Calibri,Bold"
    - "ABCDEE+Calibri,BoldItalic"
    - "ABCDEE+Symbol"
- **color:** The text's colour, in HTML format. Examples:
    - "#000000"
    - "#b366b3"
#### Text returned by *pdftohtml*
Because each segment of text returned by *pdftohtml* is, most of the time,
an entire line of text, the list of segments was soon called
**lines**, which is also a poor name. Its elements have the following
structure:
- **text:** The text.
- **font:** The id of the text's font (see the *fontspec* definition above).
- **page:** The page number.
- **top**, **left**, **width** and **height**: These attributes
  are attached to the *text* XML tags to locate the text within the page.
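
The two lists can be filled in roughly as follows. This is a sketch under assumptions: `SAMPLE` is a reduced, hand-written imitation of the XML returned by *pdftohtml* (the attribute names are the ones listed above, the rest of the real output is omitted), and `parse_pdftohtml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Reduced, hand-written imitation of `pdftohtml -xml` output (assumption).
SAMPLE = """
<pdf2xml>
  <page number="1" top="0" left="0" height="1188" width="918">
    <fontspec id="0" size="12" family="ABCDEE+Calibri" color="#000000"/>
    <fontspec id="1" size="12" family="ABCDEE+Calibri,Bold" color="#000000"/>
    <text top="100" left="50" width="30" height="14" font="0">The </text>
    <text top="100" left="80" width="40" height="14" font="1"><b>yellow</b></text>
    <text top="100" left="120" width="60" height="14" font="0"> butterfly.</text>
  </page>
</pdf2xml>
"""


def parse_pdftohtml(xml_text):
    """Return (fontspec, lines) with the dictionary structures described above."""
    root = ET.fromstring(xml_text.strip())
    fontspec, lines = [], []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for font in page.iter("fontspec"):
            fontspec.append({
                "id": int(font.get("id")),
                "size": int(font.get("size")),
                "family": font.get("family"),
                "color": font.get("color"),
            })
        for text in page.iter("text"):
            lines.append({
                # itertext() also collects text nested in tags such as <b>
                "text": "".join(text.itertext()),
                "font": int(text.get("font")),
                "page": page_no,
                "top": int(text.get("top")),
                "left": int(text.get("left")),
                "width": int(text.get("width")),
                "height": int(text.get("height")),
            })
    return fontspec, lines


fontspec, lines = parse_pdftohtml(SAMPLE)
for segment in lines:
    print(segment["font"], repr(segment["text"]))
```

The sample reproduces the "yellow butterfly" case: one visual line comes back as three segments because the bold word uses a different font id.
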
### Treatments
At this stage, we have the three lists defined above: **blocks**,
**fontspec** and **lines**.

Most of the processing is applied to the **blocks** list, which contains
the base structure. The aim is to identify the role of the blocks
composing this structure. This sometimes relies on the information
stored in the two other lists, but the result mostly applies
to the **blocks** list.

Since **blocks** is the result of *pdftotext*, whose goal
is to produce plain text that reads from top to bottom (the "Bulletin de
santé du végétal" documents are written in French), we assume that
the XML output is also in reading order,
and nothing is done to re-order the blocks. We also
observed that the order of the blocks obtained this way is the same as in
the plain-text output of *pdftotext*.

In multi-column documents, the order of the blocks is sometimes wrong;
we attribute this to the limits of the *pdftotext* algorithm.
We have no better algorithm to suggest, so the ordering is left
unchanged for now.
#### Choose a default font size
The most used font size (in number of characters) is taken
as the default font size.
Mostly for historical reasons (this is how the first tests were done),
it is computed from the rounded height of the words in the **blocks**
structure, which seems to be nearly the same as the font size.
Computing it from **lines** and **fontspec** would be more accurate,
but this should not change the final result.

Choosing a default font size makes it possible to state hypotheses such as
*smaller fonts are used for footnotes, comments or credits and*
*can be ignored*, or *bigger fonts are certainly titles*, …

To make the different processing steps easier to test, **blocks** are
flagged as follows:
- `if blocks[i]['h_max'] < default_font_size: blocks[i]['flags'] |= SMALL_FONT`
- `if blocks[i]['h_min'] > default_font_size: blocks[i]['flags'] |= BIG_FONT`
- `if blocks[i]['lines'][j]['height'] < default_font_size: blocks[i]['lines'][j]['flags'] |= SMALL_FONT`
- `if blocks[i]['lines'][j]['height'] > default_font_size: blocks[i]['lines'][j]['flags'] |= BIG_FONT`
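
A minimal sketch of this step, assuming the **blocks** structure described earlier. The bit values given to `SMALL_FONT` and `BIG_FONT`, the function names and the sample data are made up for the illustration.

```python
from collections import Counter

# Hypothetical bit values; the real ones are defined in pdf2blocks.
SMALL_FONT = 0x0001
BIG_FONT = 0x0002


def default_font_size(blocks):
    """Return the rounded word height that covers the most characters."""
    counts = Counter()
    for block in blocks:
        for line in block["lines"]:
            for word in line["words"]:
                counts[round(word["height"])] += len(word["text"])
    return counts.most_common(1)[0][0]


def flag_font_sizes(blocks, default):
    """Set SMALL_FONT / BIG_FONT on blocks and on their lines."""
    for block in blocks:
        if block["h_max"] < default:
            block["flags"] |= SMALL_FONT
        if block["h_min"] > default:
            block["flags"] |= BIG_FONT
        for line in block["lines"]:
            if line["height"] < default:
                line["flags"] |= SMALL_FONT
            if line["height"] > default:
                line["flags"] |= BIG_FONT


def make_block(height, words):
    """Build a one-line block for the demonstration (hypothetical helper)."""
    line = {"height": height, "flags": 0,
            "words": [{"text": w, "height": h} for w, h in words]}
    return {"h_min": height, "h_max": height, "flags": 0, "lines": [line]}


blocks = [
    make_block(20.0, [("TITLE", 20.0)]),
    make_block(10.0, [("Some", 10.2), ("ordinary", 9.8), ("paragraph", 10.0)]),
    make_block(7.0, [("note", 7.0)]),
]
default = default_font_size(blocks)
flag_font_sizes(blocks, default)
print(default, [b["flags"] for b in blocks])
```

Heights are rounded before counting, so 9.8 and 10.2 both count towards size 10.
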
#### Detect page bottom
We call "page bottom" the page numbering, the few words
identifying the document, such as "BSV n°17 du 15 juin 2019",
and whatever else is written identically at the bottom of every page.
Footnotes are not considered part of the page bottom.

The algorithm tests whether the last lines of all pages are equal,
comparing only letters ([a-zA-Z]) so that page numbers are ignored.
If so, it tests the line before, and so on.
As for font sizes, the corresponding lines are flagged with
a BOTTOM_PAGE flag.
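
The comparison can be sketched as follows. This assumes the lines have already been grouped per page, in reading order, which is a simplification of the **blocks** structure; the value of `BOTTOM_PAGE` and the function names are hypothetical.

```python
import re

BOTTOM_PAGE = 0x0004  # hypothetical bit value


def letters_only(text):
    """Keep only [a-zA-Z], so that page numbers do not affect the comparison."""
    return re.sub(r"[^a-zA-Z]", "", text)


def flag_page_bottom(pages):
    """Flag trailing lines shared by all pages; return how many were flagged per page.

    `pages` is a list of pages, each page being a list of line dictionaries
    in reading order (assumed layout for this sketch).
    """
    if len(pages) < 2:
        return 0  # nothing to compare against
    depth = 0
    while all(len(page) > depth for page in pages):
        keys = {letters_only(page[-1 - depth]["text"]) for page in pages}
        if len(keys) != 1:
            break  # the lines differ: the page bottom ends here
        for page in pages:
            page[-1 - depth]["flags"] |= BOTTOM_PAGE
        depth += 1
    return depth


def make_page(*texts):
    """Build a page from raw strings (hypothetical helper)."""
    return [{"text": t, "flags": 0} for t in texts]


pages = [
    make_page("Intro text", "BSV n°17 du 15 juin 2019", "Page 1"),
    make_page("More text here", "BSV n°17 du 15 juin 2019", "Page 2"),
]
depth = flag_page_bottom(pages)
print(depth, [[l["flags"] for l in page] for page in pages])
```
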
#### Assign a fontspec *id* to block's lines
This algorithm uses the **lines** list and the lines contained
in the **blocks** list **(we'll call them *“b-lines”*)**.
For each b-line, it computes the
[Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance)
to every line of the same page in the **lines** list.
The line with the best score (smallest distance) is considered
to be the matching line, and the b-line is assigned
its **fontspec** id.

This step is necessary because of the way *pdftohtml* segments the
lines it returns. There are also some differences between the
lines recognized by *pdftotext* and by *pdftohtml* in a multi-line
document.

Note that this algorithm is not optimized at all. For now, it has
been coded to test the approach, and the first results look
reasonable, but it can certainly be made faster.
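
The matching can be sketched with a plain dynamic-programming Levenshtein distance. Like the original, this sketch is not optimized; the name of the key that receives the id on a b-line (`font`) and the function names are assumptions.

```python
def levenshtein(a, b):
    """Classic row-by-row dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def assign_fonts(blocks, lines):
    """Give each b-line the font id of the closest pdftohtml line on its page."""
    for block in blocks:
        candidates = [l for l in lines if l["page"] == block["page"]]
        for b_line in block["lines"]:
            if not candidates:
                b_line["font"] = None  # no pdftohtml text on this page
                continue
            best = min(candidates,
                       key=lambda l: levenshtein(b_line["text"], l["text"]))
            b_line["font"] = best["font"]


blocks = [{"page": 1, "lines": [{"text": "OÏDIUM"},
                                {"text": "Situation actuelle"}]}]
lines = [
    {"text": "OÏDIUM", "font": 3, "page": 1},
    {"text": "Situation actuelle", "font": 4, "page": 1},
    {"text": "OÏDIUM", "font": 9, "page": 2},  # other page: never a candidate
]
assign_fonts(blocks, lines)
print([b_line["font"] for b_line in blocks[0]["lines"]])
```
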
#### Group fonts
Font styles (bold, italic, underlined, …) are assumed to be used
only to make some text more visible (or noticeable), so all the styles
applied to the same font name, with the same size and the same colour,
are considered to be the same font.
Then, to reduce the number of fonts composing the document,
all *b-lines* having the same font, size and colour, regardless of
the style, are assigned a single font id, stored
in a *short_font* attribute.

> **Note** that some blocks identified by *pdftotext* have a first
> line in bold, in the same font, which makes it a title
> for the next paragraph(s). For now, nothing is done
> to detect this, and those subtitles are not identified as
> subtitles. This is a possible improvement.
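
A sketch of the grouping, assuming that styles always follow the font name after a comma, as in the **family** examples given earlier, and that each b-line already carries a `font` id (a hypothetical key name). The function names are also hypothetical.

```python
def base_family(family):
    """Drop the style part: 'ABCDEE+Calibri,Bold' -> 'ABCDEE+Calibri'."""
    return family.split(",")[0]


def assign_short_fonts(blocks, fontspec):
    """Store in 'short_font' one id per (family, size, colour) combination."""
    by_id = {font["id"]: font for font in fontspec}
    short_ids = {}
    for block in blocks:
        for b_line in block["lines"]:
            font = by_id[b_line["font"]]
            key = (base_family(font["family"]), font["size"], font["color"])
            # A key seen for the first time gets the next free number.
            b_line["short_font"] = short_ids.setdefault(key, len(short_ids))
    return short_ids


fontspec = [
    {"id": 0, "size": 12, "family": "ABCDEE+Calibri", "color": "#000000"},
    {"id": 1, "size": 12, "family": "ABCDEE+Calibri,Bold", "color": "#000000"},
    {"id": 2, "size": 16, "family": "ABCDEE+Calibri", "color": "#000000"},
]
blocks = [{"lines": [{"font": 0}, {"font": 1}, {"font": 2}]}]
short_ids = assign_short_fonts(blocks, fontspec)
print([b_line["short_font"] for b_line in blocks[0]["lines"]])
```

The regular and bold 12 px lines end up with the same *short_font*, while the 16 px line gets its own.
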
#### Guess document structure
This is done using the 'short_font' attribute of the *b-lines*.
Reducing the number of fonts matters here. The ideal
situation would be one font for titles, another for subtitles,
…, and another for standard paragraphs. That is what the previous
steps try to get close to.

By default (this can be changed), lines carrying
a SMALL_FONT or a BOTTOM_PAGE flag are also removed.

The processing then goes through the following steps:
1. Characters are counted for each font, to find the most used
   font *id*.
1. During this count, fonts that are never used for lines longer than
   two characters are assumed to serve for bullets or other
   style effects. Those fonts are removed and the corresponding *b-lines*
   are assigned the font of the next line.
1. A list of successive font *ids* (**t**) is created, together with
   the number of consecutive lines (**n**) using each font. Example:

       [font id] Line
       --------------
       [3] OÏDIUM
       [5] Eléments de biologie
       [4] Situation actuelle
       [2] Des contaminations sur grappes sont observées (…)
       [2] avec symptômes d’oïdium (…)
       [2] la plupart des parcelles sont indemnes.
       [4] Analyse de risque
       [2] La sensibilité de la vigne est (…)
       [2] Le risque de nouvelles contaminations est très faible(…)
       [2] Faites le point sur l’état sanitaire de vos parcelles.

   This gives the following lists:

   > - **t** : 3 5 4 2 4 2 *(font ids)*
   > - **n** : 1 1 1 3 1 3 *(number of lines)*

1. For each font in **t**, the highest number of consecutive lines is
   stored under the attribute *maxl* in a dictionary called **f**.
   In the example above, **f** would have the following content:

       f : { '2' : { maxl:3 }, '3' : { maxl:1 },
             '4' : { maxl:1 }, '5' : { maxl:1 } }

1. Then, considering that a title has no more than TITLE_MAX_LINES lines
   (currently two), we define:

       f['id']['isnt_title'] = (f['id']['maxl'] > TITLE_MAX_LINES)

   As an exception, the font of the last element of the **t** list is
   never considered to be a title font (a document doesn't end with a title).
   This also guarantees that at least one font is identified as
   "certainly not a title".
1. A two-dimensional table, **b**, is defined so that

       b[i][j] = number of transitions from font i to font j

   In the example above, **b** would contain:

   | j \ i | 2 | 3 | 4 | 5 | *comments* |
   | :----: |:-:|:-:|:-:|:-:|:----------- |
   | **2** | | | 2 | |← It goes from font 4 to font 2 twice |
   | **3** | | | | | |
   | **4** | 1 | | | 1 |← It goes once from 2 to 4 and once from 5 to 4|
   | **5** | | 1 | | |← …and one transition from 3 to 5. |

   *(blank cells are zeroes, left empty for readability)*
1. In **f**, a new attribute called *'deep'* (instead of depth, sorry)
   is created and initialized so that:

       f['id']['deep'] = 0 if f['id']['isnt_title'] is true.

   (Otherwise, it is set to *None*, to mark that the depth
   of the font is not defined yet.)
1. Then, using the **b** table, we adjust the *deep* value of any font
   whose successor's depth is known, according to the following rules:
   1. If there is only one transition, the two fonts are considered
      to have the same depth.
   1. Otherwise, the *deep* value is increased (if it is not already set).

   This is repeated until the process no longer changes any value. At this point,
   all fonts should have a *deep* value, because the *deep* of the last line's
   font is set to 0: every other font precedes the last line, so all fonts
   end up being considered.

At the end, the *deep* values are reversed from [0..max_deep] to [max_deep..0].
In the example above, we get:

> | font | 2 | 3 | 4 | 5 |
> | :------: |:-:|:-:|:-:|:-:|
> | **deep** | 1 | 0 | 0 | 0 |

but if the text that follows in the same document is:

    [font id] Line
    --------------
    [3] TORDEUSES
    [5] Eléments de biologie
    [4] Situation actuelle
    [2] Cochylis : 0 à 10 captures (…)

then it becomes:

> | font | 2 | 3 | 4 | 5 |
> | :------: |:-:|:-:|:-:|:-:|
> | **deep** | 3 | 0 | 2 | 1 |

because the succession of titles occurs more than once.
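
The construction of **t**, **n**, **f** and **b** can be sketched on the first example as follows. This is only an illustration: the function names are hypothetical, font ids are plain integers here rather than the string keys shown above, and the propagation of the *deep* values is left out.

```python
from collections import defaultdict
from itertools import groupby

TITLE_MAX_LINES = 2


def font_runs(font_ids):
    """Return t (successive font ids) and n (length of each run)."""
    t, n = [], []
    for font, group in groupby(font_ids):
        t.append(font)
        n.append(len(list(group)))
    return t, n


def font_table(t, n):
    """Build f with 'maxl' and 'isnt_title' for every font."""
    f = {}
    for font, count in zip(t, n):
        entry = f.setdefault(font, {"maxl": 0})
        entry["maxl"] = max(entry["maxl"], count)
    for entry in f.values():
        entry["isnt_title"] = entry["maxl"] > TITLE_MAX_LINES
    if t:
        # Exception: the font of the last run is never a title font.
        f[t[-1]]["isnt_title"] = True
    return f


def transitions(t):
    """b[i][j] = number of transitions from font i to font j."""
    b = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(t, t[1:]):
        b[src][dst] += 1
    return b


# Font ids of the ten lines of the first example, in reading order.
font_ids = [3, 5, 4, 2, 2, 2, 4, 2, 2, 2]
t, n = font_runs(font_ids)
f = font_table(t, n)
b = transitions(t)
print(t)
print(n)
print(f)
print({src: dict(dsts) for src, dsts in b.items()})
```

The printed lists match the **t** and **n** values given for the example, and the transition counts match the table for **b**.
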
#### *TODO*
A few small things have changed:
- Only fonts that were found together on the same line are now unified.
- Better handling of bullets.

I think I have forgotten some. I need to tidy up the code a little
and, at the same time, rework the write-up of the algorithm.
## Source files
*(To be updated)*
### pdf2blocks.py
The main program. It calls different functions written in the `p2b_*.py` files.
### p2b_config.py
Has to be edited to adjust parameters for execution of pdf2blocks.py.
#### CMD_PDFTOTEXT and CMD_PDFTOHTML
Should contain the full path and name of the *pdftotext* and *pdftohtml* binaries.
### p2b_file.py
Contains functions for reading pdf files.
### p2b_utils.py
Contains some utilities, mostly used for debugging purposes.