AI Ants

On the Road to Artificial General Intelligence


Document Understanding (Layout Analysis / Table Detection / Table Structure Recognition): Methods and Datasets

1、PubLayNet

PubLayNet is a large dataset of document images whose layouts are annotated with both bounding boxes and polygonal segmentations.

https://github.com/ibm-aur-nlp/PubLayNet
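To make the "bounding boxes and polygonal segmentations" concrete, here is a minimal sketch of grouping layout regions by page. It assumes the annotations follow the MS COCO object-detection JSON layout (`images`, `categories`, `annotations`); the sample record, the file name, and the helper `regions_by_page` are made up for illustration and are not real PubLayNet data or tooling.

```python
from collections import defaultdict

# Tiny hand-made sample in COCO style (illustrative only, not real data).
sample = {
    "images": [{"id": 1, "file_name": "page_0001.jpg", "width": 600, "height": 800}],
    "categories": [{"id": 1, "name": "text"}, {"id": 4, "name": "table"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1,
         "bbox": [50, 60, 500, 120],  # [x, y, width, height]
         "segmentation": [[50, 60, 550, 60, 550, 180, 50, 180]]},
        {"id": 11, "image_id": 1, "category_id": 4,
         "bbox": [50, 200, 500, 300],
         "segmentation": [[50, 200, 550, 200, 550, 500, 50, 500]]},
    ],
}

def regions_by_page(coco):
    """Return {file_name: [(category_name, (x1, y1, x2, y2)), ...]}."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    files = {im["id"]: im["file_name"] for im in coco["images"]}
    pages = defaultdict(list)
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]
        # Convert [x, y, width, height] to corner coordinates.
        box = (x, y, x + w, y + h)
        pages[files[ann["image_id"]]].append((names[ann["category_id"]], box))
    return dict(pages)

pages = regions_by_page(sample)
for label, box in pages["page_0001.jpg"]:
    print(label, box)
```

The polygon in `segmentation` is left untouched here; a real loader would typically hand it to a mask utility rather than parse it by hand.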


2、PubTabNet

PubTabNet is a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML representation of the tables.

https://github.com/ibm-aur-nlp/PubTabNet

Paper: Image-based table recognition: data, model, and evaluation (ECCV 2020)

2-1、Tree-Edit-Distance-based Similarity (TEDS)

Evaluation metric for table recognition. This metric measures both the structure similarity and the cell content similarity between the prediction and the ground truth. The score is normalized between 0 and 1, where 1 means perfect matching.

https://github.com/ibm-aur-nlp/PubTabNet/tree/master/src
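As a rough illustration of how the score is formed, TEDS is defined as 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|), where Ta and Tb are the HTML trees of the two tables and |T| is the number of nodes. The sketch below is a simplified stand-in, not the reference implementation linked above: it uses a plain memoized ordered-tree edit distance with unit costs and ignores cell text, whereas the reference code also scores differences in cell content. The names `node`, `size`, `forest_dist`, and `teds` are introduced here for illustration only.

```python
from functools import lru_cache

def node(label, *children):
    """A tree is (label, tuple_of_child_trees); a forest is a tuple of trees."""
    return (label, tuple(children))

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_dist(f, g):
    """Ordered forest edit distance with unit insert/delete/rename costs."""
    if not f:
        return size(g)
    if not g:
        return size(f)
    (v_label, v_kids), (w_label, w_kids) = f[-1], g[-1]
    # Delete the rightmost root of f; its children are promoted.
    delete = forest_dist(f[:-1] + v_kids, g) + 1
    # Insert the rightmost root of g.
    insert = forest_dist(f, g[:-1] + w_kids) + 1
    # Match the two rightmost roots, then solve children and the rest.
    rename = (forest_dist(v_kids, w_kids)
              + forest_dist(f[:-1], g[:-1])
              + (0 if v_label == w_label else 1))
    return min(delete, insert, rename)

def teds(pred, true):
    """Simplified TEDS: 1 - edit distance / size of the larger tree."""
    n = max(size((pred,)), size((true,)))
    return 1.0 - forest_dist((pred,), (true,)) / n

true = node("table", node("tr", node("td"), node("td")))
pred = node("table", node("tr", node("td")))  # one cell missing

print(teds(true, true))  # 1.0
print(teds(pred, true))  # 0.75
```

The memoized recursion is fine for toy trees; for real tables a dedicated tree-edit-distance algorithm is a better fit.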

GitHub organization: https://github.com/ibm-aur-nlp

2-2、dataset-tools

Java command-line tools for comparing results to ground truth for table location and structure detection as used in the ICDAR 2013 Table Competition.

https://github.com/tamirhassan/dataset-tools

3、DocBank

DocBank is a new large-scale dataset constructed using a weak supervision approach. It enables models to integrate both textual and layout information for downstream tasks. The current DocBank dataset contains 500K document pages in total: 400K for training, 50K for validation, and 50K for testing.

https://github.com/doc-analysis/DocBank

The authors provide a dataset loader named DocBankLoader, which can also convert DocBank to the format used by object detection models.

LayoutLM is an effective pre-training method for text and layout, and achieves the SOTA result on DocBank.

4、TableBank

https://github.com/doc-analysis/TableBank

The authors have released an official train/val/test split and retrained both the table detection and table structure recognition models using Detectron2 and OpenNMT. The benchmark results, the model zoo, and the TableBank download link have been updated.

Task Definition

Table Detection

Table detection aims to locate tables in a document using bounding boxes. Given a document page in image format, the task is to generate bounding boxes that mark the locations of the tables on that page.
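One common way to judge whether a predicted table box matches a ground-truth box is intersection over union (IoU). The text above does not name a matching criterion, so treat the following as a generic sketch under that assumption; the function name `iou` and the corner-coordinate box format `(x1, y1, x2, y2)` are choices made here for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
print(iou(predicted, ground_truth))
```

A detection is then usually counted as correct when the IoU exceeds a chosen threshold; the threshold itself is a benchmark-specific setting.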

Table Structure Recognition

Table structure recognition aims to identify the row and column layout of tables, especially in non-digital document formats such as scanned images. Given a table in image format, the task is to generate an HTML tag sequence that represents the arrangement of rows and columns as well as the type of each table cell.
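To show what such a target sequence can look like, here is a small sketch that turns a grid of cell strings into a flat tag sequence. The tag vocabulary (`<cell_y>` for a cell with content, `<cell_n>` for an empty cell) and the helper `grid_to_tags` are hypothetical and chosen only for illustration; the actual tokens used by TableBank should be taken from its repository.

```python
def grid_to_tags(grid):
    """Flatten a table grid into a tag sequence.

    grid: list of rows, each row a list of cell strings ('' means empty).
    The tag names below are hypothetical, not TableBank's vocabulary.
    """
    tags = ["<table>"]
    for row in grid:
        tags.append("<tr>")
        for cell in row:
            # The cell "type" here is simply: has content or not.
            tags.append("<cell_y>" if cell.strip() else "<cell_n>")
        tags.append("</tr>")
    tags.append("</table>")
    return tags

grid = [["Year", ""], ["2020", "42"]]
print(" ".join(grid_to_tags(grid)))
```

A structure recognition model is trained to emit a sequence of this kind from the table image alone, so the cell text never appears in the target.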

GitHub organization: https://github.com/doc-analysis

5、DocumentLayoutAnalysis

https://github.com/BobLd/DocumentLayoutAnalysis

A collection of resources on document parsing (including tables).

5-1、SciTSR

https://github.com/Academic-Hammer/SciTSR

Table structure recognition dataset from the paper "Complicated Table Structure Recognition".

SciTSR is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF format and their corresponding structure labels obtained from LaTeX source files.

5-2、DocParser

https://github.com/DS3Lab/DocParser/

Hierarchical Structure Parsing of Document Renderings

6、TIES-2.0

https://github.com/shahrukhqasim/TIES-2.0

Table Information Extraction System

Paper: Rethinking Table Recognition using Graph Neural Networks (ICDAR 2019)

6-1、TIES_DataGeneration (data generation for TIES-2.0)

https://github.com/hassan-mahmood/TIES_DataGeneration

7、CascadeTabNet

https://github.com/DevashishPrasad/CascadeTabNet

CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents

The paper was presented as an oral at the CVPR 2020 Workshop on Text and Documents in the Deep Learning Era.


8、GFTE

https://github.com/Irene323/GFTE

GFTE: Graph-based Financial Table Extraction

9、GANs for tabular data

GANs are well known for their success in realistic image generation. However, they can also be applied to tabular data generation.

https://github.com/Diyago/GAN-for-tabular-data

https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342


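Author: 杨苏辉

Published: January 7, 2021, 10:43

Last updated: January 7, 2021, 15:36

Original link: https://yangsuhui.github.io/p/3238.html

License: Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0). Please retain the original link and author when reposting.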
