BCE-Arabic Project

The project introduces a large database of Arabic-document images and their annotations, called BCE-Arabic. BCE stands for team members from Boston University, Cairo University, and Electronics Research Institute. It is the first large dataset that provides a representative variety of document content, including text and non-text elements, for document layout analysis (DLA) research for the Arabic language.

BCE-Arabic is intended to be available as a training and performance-evaluation benchmark for development of machine learning systems that analyze Arabic documents with normal and complex layouts.

The work was partially funded by:

National Science Foundation, grant 1337866 (to M.B.)
Cairo Initiative Scholarship Program (to R.E.)

Project milestones:

Phase 1: Launching the pilot version, BCE-Arabic v1 (with some limitations)
Phase 2: Massive acquisition of new samples (large variety)
Phase 3: Crowdsourcing for annotation
Phase 4: Constructing a searchable sample database for user customization

BCE-Arabic is an ongoing project. Its first stage is complete: over 1,800 images of Arabic book pages were collected, with significant variability in layout and page content. The pages were annotated using the Aletheia® annotation tool from PRImA Research. More details here

Download BCE-V1

Cite this dataset version
"Saad, R.S., Elanwar, R.I., Kader, N.S., Mashali, S. and Betke, M., 2016, June. BCE-Arabic-v1 dataset: Towards interpreting Arabic document images for people with visual impairments. In Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments (p. 25-32). ACM."

Visualization Demo

The second stage of the BCE-Arabic project extends this work: it adds a substantial amount of data to BCE-Arabic and considers crowdsourcing for ground-truthing (IN PROGRESS).

The new samples were collected from the Arabic book collections of BU, MIT, and Harvard, covering numerous Arabic publishers. The available layouts were scanned under fair use in grayscale at 200 and 300 dpi and stored in raster-image PDF format.

So far we have more than 9,000 pages from 788 books, in addition to more than 300 book covers. Further dataset releases will follow, together with their annotations.

ECDP (Ensemble-based Classification of Document Patches) analyzes the physical (geometric) layout of a document and classifies image regions as containing text or graphics. The classification step takes the majority decision of an ensemble of support vector machines to make a binary text or non-text decision for each small image patch. ECDP was trained and tested using the first version of BCE-Arabic. It achieved an average patch classification accuracy of 97.3% and an average F1-score of 95.26% for text patches. The system succeeded in extracting text zones in both paragraphs and text-embedded graphics, even when the text was rotated by 90 degrees.
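The majority decision described above can be illustrated with a minimal sketch. This is not the ECDP source code: the names `majority_vote` and `make_threshold_classifier` are hypothetical, and simple threshold rules stand in for the trained support vector machines. With an odd number of binary classifiers, the vote cannot tie.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A patch classifier maps a patch feature vector to "text" or "non-text".
PatchClassifier = Callable[[Sequence[float]], str]


def majority_vote(classifiers: List[PatchClassifier],
                  features: Sequence[float]) -> str:
    """Return the label chosen by most classifiers in the ensemble."""
    votes = Counter(clf(features) for clf in classifiers)
    return votes.most_common(1)[0][0]


def make_threshold_classifier(index: int, threshold: float) -> PatchClassifier:
    """Hypothetical stand-in for one trained SVM: thresholds a single feature."""
    def classify(features: Sequence[float]) -> str:
        return "text" if features[index] > threshold else "non-text"
    return classify


# Three toy classifiers (assumption: a real ensemble would hold trained SVMs).
ensemble = [
    make_threshold_classifier(0, 0.5),
    make_threshold_classifier(1, 0.3),
    make_threshold_classifier(2, 0.7),
]

label = majority_vote(ensemble, [0.6, 0.4, 0.1])  # two of three vote "text"
```

Keeping each classifier behind a plain callable means the voting logic does not depend on how the individual members were trained.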

Download training set
Download test set
Download source code
Instructions to benchmark with ECDP

ECDP maintains a high level of performance when applied to images whose content differs from the training examples. In benchmarks, it outperformed both a classical layout analysis method (RLSA, the run-length smearing algorithm) and the preprocessing stage of a commercial OCR package (RDI CleverPage). It also performs well on most documents from the datasets of other state-of-the-art researchers.
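For readers unfamiliar with the RLSA baseline mentioned above, the following is a generic textbook-style sketch of run-length smearing on a binary image (1 = ink, 0 = background). It is not the implementation used in the benchmark. The function names and gap parameters are illustrative assumptions, only background runs enclosed by ink on both sides are filled, and the final smoothing pass of the classical method is omitted.

```python
from typing import List

Row = List[int]
Image = List[Row]


def smear_row(row: Row, max_gap: int) -> Row:
    """Fill background runs of length <= max_gap that lie between ink pixels."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_ink is not None and 0 < i - last_ink - 1 <= max_gap:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa(image: Image, h_gap: int, v_gap: int) -> Image:
    """Smear horizontally and vertically, then AND the two results."""
    horizontal = [smear_row(r, h_gap) for r in image]
    columns = [list(c) for c in zip(*image)]          # transpose
    vertical_t = [smear_row(c, v_gap) for c in columns]
    vertical = [list(r) for r in zip(*vertical_t)]    # transpose back
    return [[h & v for h, v in zip(hr, vr)]
            for hr, vr in zip(horizontal, vertical)]


smeared = smear_row([1, 0, 0, 1, 0, 0, 0, 1, 0], max_gap=2)
```

In the example, the two-pixel gap is filled while the three-pixel gap and the trailing background pixel are left alone, which is how smearing merges nearby characters into blocks without joining distant regions.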

Cite ECDP system
R. Elanwar, W. Qin, M. Betke, Making scanned Arabic documents machine-accessible using an ensemble of SVM classifiers, submitted to the International Journal on Document Analysis and Recognition in January 2017 (under review)

LABA (Layout Analysis of scanned pages of Books in Arabic) analyzes the logical layout of a document and classifies image regions as title, caption, page number, paragraph, image, or noise. The classification step uses a voting mechanism over 5 support vector machines, each trained to detect a specific class based on structural features of connected components (CC). LABA was trained and tested using the first version of BCE-Arabic. The results obtained for different classes are shown below. The system outperformed a re-implementation of the LUNET system by Hadjar and Ingold [1] on the BCE-Arabic dataset.

[1] Hadjar K, Ingold R. Logical labeling of Arabic newspapers using artificial neural nets. Proceedings of the Eighth International Conference on Document Analysis and Recognition, 2005. IEEE, 2005: 426-430.
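One plausible way to combine per-class detectors such as those LABA uses is sketched below. The page does not say which classes the five SVMs cover, how their outputs are scored, or how undecided regions are handled, so everything here is an assumption: the `vote` function, the toy scoring rules, the feature names, and the fallback label are all hypothetical, and three toy detectors stand in for the five trained SVMs.

```python
from typing import Callable, Dict, Mapping

# A detector returns a confidence score; a positive score means "my class".
Detector = Callable[[Mapping[str, float]], float]


def vote(detectors: Dict[str, Detector],
         features: Mapping[str, float],
         fallback: str = "noise") -> str:
    """Pick the class whose detector is most confident, else the fallback."""
    scores = {label: detect(features) for label, detect in detectors.items()}
    best_label = max(scores, key=scores.get)
    return best_label if scores[best_label] > 0 else fallback


# Toy detectors over hypothetical connected-component features.
detectors: Dict[str, Detector] = {
    "title": lambda f: f["height"] - 40.0,
    "page number": lambda f: 10.0 - f["width"],
    "paragraph": lambda f: f["line_count"] - 2.0,
}

region = {"height": 60.0, "width": 300.0, "line_count": 1.0}
label = vote(detectors, region)  # only the "title" detector scores above zero
```

Returning a fallback label when no detector fires is one way to reconcile a set of class-specific detectors with a label set that has one more class than there are detectors.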

Coming soon:
Download source code
Instructions to benchmark with LABA

Boston University

Margrit Betke, Professor
Project Sponsor and adviser

Wenda Qin
Software development (BCE-v2)

Electronics Research Institute (Egypt)

Randa Elanwar, Ph.D.
Project Planning and management

Rana Saad
BCE-v1 collection and annotation

Samia Mashali, Professor
Consultancy (BCE-v1)

Cairo University (Egypt)

Naemat Abdelkader, Professor
Consultancy (BCE-v1)

The project team welcomes collaboration of all types:

Software development
Research assistance
High-performance computing expertise
Dataset collection
Joint funding
Others

* If interested, kindly select the type of collaboration and fill in the online form.
