The project introduces a large database of Arabic document images and their annotations, called BCE-Arabic. BCE stands for the institutions of the team members: Boston University, Cairo University, and the Electronics Research Institute.
It is the first large dataset that provides a representative variety of document content, including text and non-text elements, for document layout analysis (DLA) research for the Arabic language.
BCE-Arabic is intended to serve as a training and performance-evaluation benchmark for the development of machine learning systems that analyze Arabic documents with normal and complex layouts.
The work was partially funded by:
National Science Foundation, grant 1337866 (to M.B.)
Cairo Initiative Scholarship Program (to R.E.)
Project milestones:
Phase 1: Launching the pilot version, BCE-Arabic v1 (with some limitations)
Phase 2: Massive acquisition of new samples (large variety)
Phase 3: Crowdsourcing for annotation
Phase 4: Constructing a searchable sample database for user customization
BCE-Arabic is an ongoing project. Its first stage has been completed with the collection of over 1,800 images of Arabic book pages that vary significantly in layout and page content. The pages were annotated using the Aletheia® annotation tool from PRImA Research.
Cite this dataset version
"Saad, R.S., Elanwar, R.I., Kader, N.S., Mashali, S. and Betke, M., 2016, June. BCE-Arabic-v1 dataset: Towards interpreting Arabic document images for people with visual impairments. In Proceedings of the 9th ACM International Conference on PErvasive Technologies Related to Assistive Environments (p. 25-32). ACM."
The second stage of the BCE-Arabic project extends this work: it adds a substantial amount of data to BCE-Arabic and considers crowdsourcing for ground-truthing (in progress).
The samples were collected from the Arabic book collections of BU, MIT, and Harvard, which cover numerous Arabic publishers, via fair-use scanning of the available layouts at 200 and 300 dpi resolution (grayscale). They are stored in raster-image PDF format.
So far we have more than 9,000 pages from 788 books, in addition to more than 300 book covers. Further dataset releases, with their annotations, will follow.
ECDP (Ensemble-based Classification of Document Patches) analyzes the physical (geometric) layout of the document and classifies image regions as containing text or graphics. The classification step uses the majority decision of an ensemble of support vector machines to make a binary decision about the text or non-text class of small image patches. ECDP was trained and tested using the first version of BCE-Arabic. It achieved an average patch classification accuracy of 97.3% and an average F1-score of 95.26% for text patches. The system succeeded in extracting text zones in both paragraphs and text-embedded graphics, even if the text was rotated by 90 degrees.
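The majority decision and the two reported metrics can be illustrated with a minimal sketch. It is not the ECDP implementation: the votes below are hypothetical stand-ins for the outputs of trained support vector machines, the ensemble size of three is an assumption, and the function names are chosen only for this example.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(votes: Sequence[int]) -> int:
    """Return the label most ensemble members chose (1 = text, 0 = non-text)."""
    # Assumption: an odd number of voters, so a binary vote can never tie.
    return Counter(votes).most_common(1)[0][0]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of patches whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical votes: one row per image patch, one column per ensemble member.
patch_votes: List[List[int]] = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 1],
    [1, 0, 0],
]
ground_truth = [1, 0, 1, 1]

predictions = [majority_vote(v) for v in patch_votes]
print("predictions:", predictions)
print("accuracy:", accuracy(ground_truth, predictions))
print("F1 (text):", f1_score(ground_truth, predictions))
```

In this toy data, a single dissenting classifier is outvoted on the first two patches, while the last patch is misclassified because two of the three members vote non-text.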
ECDP maintains a high level of performance when applied to images whose content differs from the training examples. It outperformed both a classical layout analysis method (RLSA: run-length smearing algorithm) and the preprocessing stage of a commercial OCR package (RDI CleverPage). It also performs well on most documents from the datasets of other state-of-the-art researchers.
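For readers unfamiliar with the RLSA baseline, the following is a simplified sketch of run-length smearing on a binary image (1 = ink, 0 = background). It is not the configuration used in the benchmark: the thresholds are arbitrary, only gaps bounded by ink on both sides are filled, and the final horizontal smoothing pass of the classical algorithm is omitted.

```python
from typing import List, Sequence

Image = List[List[int]]


def smear_row(row: Sequence[int], threshold: int) -> List[int]:
    """Fill background runs of length <= threshold that lie between two ink pixels."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen
    for i, px in enumerate(row):
        if px == 1:
            if last_ink is not None and 0 < i - last_ink - 1 <= threshold:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def rlsa(image: Image, h_threshold: int, v_threshold: int) -> Image:
    """Smear horizontally and vertically, then AND the two results."""
    horizontal = [smear_row(r, h_threshold) for r in image]
    # Smear columns by transposing, smearing rows, and transposing back.
    smeared_cols = [smear_row(c, v_threshold) for c in zip(*image)]
    vertical = [list(r) for r in zip(*smeared_cols)]
    return [
        [h & v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


# A short gap (2 pixels) is bridged, a long gap (4 pixels) is preserved.
line = [1, 0, 0, 1, 0, 0, 0, 0, 1]
print(smear_row(line, threshold=2))

# A one-pixel hole enclosed by ink is closed in both directions.
block = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
print(rlsa(block, h_threshold=1, v_threshold=1))
```

With suitable thresholds, characters and words merge into solid blocks whose shape can then be used to separate text from non-text regions.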
Cite ECDP system
R. Elanwar, W. Qin, M. Betke, Making scanned Arabic documents machine-accessible using an ensemble of SVM classifiers, submitted to the International Journal on Document Analysis and Recognition, January 2017 (under review)
LABA (Layout Analysis of scanned pages of Books in Arabic) analyzes the logical layout of the document and classifies image text regions as title, caption, page number, paragraph, image, or noise. The classification step uses a voting mechanism over five support vector machines, each trained to detect a specific class based on structural features of connected components (CCs). LABA was trained and tested using the first version of BCE-Arabic. The results obtained for different classes are shown below. The system outperformed a re-implementation of the LUNET system by Hadjar and Ingold [1] on the BCE-Arabic dataset.
[1] Hadjar K, Ingold R. Logical labeling of Arabic newspapers using artificial neural nets. Proceedings of the Eighth International Conference on Document Analysis and Recognition, 2005. IEEE, 2005: 426-430.
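One way to combine per-class detectors, such as those in the LABA description above, is to let each detector score a region and pick the class with the highest positive score. The sketch below assumes this scheme, which may differ from LABA's actual voting rule. The detectors are hypothetical hand-written rules standing in for trained support vector machines, and the feature names (rel_width, rel_height, n_lines) are invented for illustration.

```python
from typing import Callable, Dict, Optional

Features = Dict[str, float]
# A detector returns a signed score; a value > 0 means "this region is my class".
Detector = Callable[[Features], float]


def vote(detectors: Dict[str, Detector], features: Features) -> Optional[str]:
    """Return the class whose detector is most confident, or None if all reject."""
    scores = {label: detect(features) for label, detect in detectors.items()}
    best = max(scores, key=lambda label: scores[label])
    return best if scores[best] > 0 else None


# Hypothetical stand-ins for trained per-class SVMs (only three shown).
detectors: Dict[str, Detector] = {
    "page number": lambda f: 0.1 - f["rel_width"],   # very narrow region
    "title": lambda f: f["rel_height"] - 0.5,        # unusually tall glyphs
    "paragraph": lambda f: f["n_lines"] - 2.5,       # several text lines
}

narrow_region = {"rel_width": 0.05, "rel_height": 0.2, "n_lines": 1}
wide_region = {"rel_width": 0.8, "rel_height": 0.3, "n_lines": 6}
unclear_region = {"rel_width": 0.5, "rel_height": 0.2, "n_lines": 1}

print(vote(detectors, narrow_region))
print(vote(detectors, wide_region))
print(vote(detectors, unclear_region))
```

Returning None when every detector rejects a region leaves the choice of a fallback class to the caller, since the source does not state how such regions are handled.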
Coming soon:
Download source code
Instructions to benchmark with LABA
Boston University
Margrit Betke, Professor
Project Sponsor and adviser