0% found this document useful (0 votes)

47 views

Reader-LM: Efficient HTML To Markdown Conversion With AI

Learn how Reader-LM, an open-source Small Language Model, is transforming the way we convert HTMLs into Markdowns. With multilingual support for use in different countries and the ability to handle long documents of up to 256K tokens, it outperforms much larger models in quantitative evaluations.

Uploaded by

My Social

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

47 views

Reader-LM: Efficient HTML To Markdown Conversion With AI

Uploaded by

My Social

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

To read more such articles, please visit our blog https://socialviews81.blogspot.

com/

Reader-LM: Efficient HTML to Markdown Conversion with AI

Introduction

Markdown is a language that is used for formatting content. Users able

to format text using plain text which later shall be converted to HTML
format. A well formatted use of Markdown files is important in order to
ensure that the files are easy to read and well organized. It makes the
handling of content much easier especially where it is being shared
across different groups and teams or when the same content is required
to be posted on different social media platforms. There are now several
ways of converting HTML to Markdown including HTML2Markdown,
Turndown, and even online tools.

Some of the main issues are complex HTML structure, problem in format
preservation and noise in HTML. Reader-LM has been developed to flux
these problems by applying AI to enhance and full auto the conversion.
This means that through AI, enhancements have been made to be able
to create models such as Reader-LM, which can easily convert HTML to
Markdown as it comprehends and parses the content better.

To read more such articles, please visit our blog https://socialviews81.blogspot.com/

Who Developed Reader-LM?

Reader-LM is built by Jina AI — the company whose mission is to

democratize Artificial Intelligence and make them open for everyone
through Open-Source and Open-Science. The model was based on Jina
Reader and contributed by different AI researcher and developers. The
goal for Reader-LM was to build a fast and cheap tool that takes such
raw, noisy HTML and converts it into clean Markdown. The primary
purpose of this model is to make the process of converting the content
simpler and at the same time enhancing the quality of the converted
content.

What is Reader-LM?

Reader-LM is a suite of small language models for converting HTMLs

into Markdowns. These models are developed to recognize the structure
of HTML tables and generate neat and well-formatted Markdowns.

Model Variants

● Reader-LM 0.5B: A new release of better optimized, less powerful

version intended for simple tasks.
● Reader-LM 1.5B: A version with larger size that allows for
additional features focused to parse more complicated structure of
HTML tags.

This means that these variants are tailored to suit the different needs of
users, 0.5B model has efficiency at the center. while 1.5B model is more
powerful and has higher processing capabilities than the other one.

To read more such articles, please visit our blog https://socialviews81.blogspot.com/

source - https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/

Key Features of Reader-LM

● Multilingual Support: It has provision for multilingual support and

this makes it ideal for use in different countries.
● Long-Context Handling: Effective in handling long documents of
up to 256K tokens of context length ; particularly HTML
documents.
● Efficient Performance: Originally intended for optimization on
edge devices with less than 1 billion parameters.
● Selective Copying: Concentrate on the transfer of selected HTML
content to Markdown without losing much of the information.

Capabilities/Use Cases of Reader-LM

● Content Conversion: Translates raw HTML of web pages and

cleans it to Markdown format for documentation and content
management.
● Data Cleaning: Eliminates certain unwanted components such as
headers, footers, and sidebars giving out a cleaner input.
● Real-World Examples: Other than documentation, blogging, and
content management system where clean Markdown is desirable,
Reader-LM also has other real time utilization. For instance, it can

To read more such articles, please visit our blog https://socialviews81.blogspot.com/

be applied to build clean feed readers by parsing the raw HTML

from various sources and translating them to structured Markdown
which are easier to summarize and to identify topics. Due to its
information extraction and structuring features, it can be applied in
enhancing the quality of web for the visually impaired, developing
individualized feeds and constructing content feeds, and extracting
data for market research.

How Reader-LM Works

This is unlike most other reader-LM that uses a specific method to

transform raw HTML to clean Markdown. Thus, instead of conventional
approaches such as headless Chrome, Readability, regex, and
Turndown library, Reader-LM makes use of a small language model
(SLM) in this regard. This SLM is especially designed to learn how to
work with the data input in the HTML format and output the Markdown
format without the need for extensive use of rules that define the
conversions. The following figure graphically illustrates this transition
from a complex linear model that incorporates several stages, to the
efficient model of SLM.

source - https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/

Architecture/Design and Workflow

To read more such articles, please visit our blog https://socialviews81.blogspot.com/

This SLM has been the key to Reader-LM’s architectural design for
dealing with the challenges of converting HTML to Markdown. The
HTML to markdown translator is trained on a huge training corpus of
HTML and Markdown samples which helped the model learn the full
features of HTML, Markdown and their interactions. Whenever a new
HTML input is passed to Reader-LM, it moves from left to right and
computes the most likely Markdown tokens according to the training set
as well as the input HTML. This way, Reader-LM is able to retain the
layout and content of the HTML whilst providing the reader with clean,
properly formatted Markdown.

Uniqueness in Training Strategy

The training strategy adopted for Reader-LM is very important for it to be

effective. This model in particular goes through a two-stage training
process, namely on the ‘short-and-simple’ HTML as well as on the
‘long-and-hard’ HTML. It also helps the model to first learn basic
concepts of HTML to Markdown then slowly it is trained with real world
and long HTML documents. Further, the developers have used some
strategies towards the difficulties in the degeneration and the training
when the inputs are long such as contrastive search, repetition stop
criteria and chunk-wise model forwarding. Combined with the selective
copying and long-context policies, these strategies make for a high
efficacy of Reader’s LM to convert HTML to Markdown.

Performance Evaluation of Reader-LM

To assess the performance of Reader-LM, the developers benchmarked

it against Large Language Models such as GPT-4 and Gemini-1.5,
measured by using the four metrics; Recycle Option for Ubiquitous
Generation and Evaluation of reference summaries, TER and WER. The
ROUGE-L evaluation computes the number of overlapping tokens which
provides a measure of the model’s performance in capturing the content.

To read more such articles, please visit our blog https://socialviews81.blogspot.com/

TER, intended to assess hallucination, quantifies the rates of generated

Markdown tokens which are unique to the generated output but were not
present in the original HTML. WER which is often used in tasks such as
OCR targets the word sequence and then gives a breakdown of
insertions, deletions and substitution in a detailed manner in order to
compare the output Markdown to the actual Markdown that is expected.

source - https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/

Reader-LM, particularly the 1.5B model, offered very promising

outcomes, with the highest score, 0.72 of ROUGE-L, as well as the
lowest WER which was 1.87 and TER 0.19, which proves that the 1.5B
model outperforms much larger ones in its aim to accurately translate
HTML into Markdown with the lowest levels of errors and hallucinations
can be considered.

source - https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/

In addition, there was a qualitative analysis that received a visual

analysis of Markdown-language outcoming from 22 HTML sources that

To read more such articles, please visit our blog https://socialviews81.blogspot.com/

represent diverse language and website types. This evaluation

considered four key dimensions: The first four skills include header
extraction, main content extraction, rich structure preservation, and
Markdown syntax usage, all rated from 1 to 5. The study highlighted
Reader-LM-1.5B achieves high awareness in structure preservation and
Markdown standard syntax while comparing with it's competitors . It
also always can not outperform the Jina Reader API , but it was
comparable to bigger models, like Gemini 1.5 Pro.

How to access and Use Reader-LM

Reader-LM is now released to Hugging Face where it is possible to

download the latest 0.5B and 1.5B parameter models. For reading the
inputs locally using Reader-LM, transformers need to be installed and
then the steps listed on the Hugging Face model page of the selected
version have to be followed. For the followers, who would rather use an
easily understandable approach, there is a link to the Colab notebook to
play with the model. Reader-LM is open-source and licensed under the
CC BY-NC 4. 0 license. One has to reach out to Jina AI for commercial
access.

Limitations and Future Work

Reader-LM is proved to be effective in practice yet it can experience

difficulties while dealing with highly nested html structures or the
information which contains a lot of noise. Future research could focus on
enhancing the capacity for handling such cases of patient management.
Also, it is multilingual to a certain extent, but there is a possibility for
development in this direction.

To read more such articles, please visit our blog https://socialviews81.blogspot.com/

Conclusion

Reader-LM is a considerable improvement in the process of converting

HTML to Markdown in comparison with methods that primarily rely on
simple pattern matching and heuristics. Hence, Reader-LM that
leverages SLMs will offer a more efficient and arguably more accurate
solution. By this advancement it becomes easier both in the usage of
web content as well as the creation and management of the content
hence bringing an improvement in the organization of the environment in
the internet.

Source
Jina AI website: https://jina.ai/
reader lm post: https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/
Hugging Face reader-lm 0.5b: https://huggingface.co/jinaai/reader-lm-0.5b
Hugging Face reader-lm 1.5b: https://huggingface.co/jinaai/reader-lm-1.5b
google Colab : https://colab.research.google.com/drive/1wXWyj5hOxEHY6WeHbOwEzYAC0WB1I5uA#scrollTo=lHBHjlwgQesA

Disclaimer - This article is intended purely for informational purposes. It is not sponsored or endorsed by any company or organization, nor does it serve as an
advertisement or promotion for any product or service. All information presented is based on publicly available resources and is subject to change. Readers are
encouraged to conduct their own research and due diligence.