Top "Text-extraction" questions

Text extraction is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents (text).

How to extract text from reasonably sane HTML?

My question is sort of like this question but I have more constraints: I know the documents are reasonably sane …

c# html d text-extraction
Extract filename with extension from filepath string

I am looking to get the filename from the end of a filepath string, say $text = "bob/hello/myfile.zip"; …

php substring filenames filepath text-extraction
How to detect Text Area from image?

I want to detect the text area in an image as a preprocessing step for the Tesseract OCR engine; the engine works well …

c++ image-processing tesseract text-extraction
How to extract values from HTML using RegEx?

Given the following HTML: <p><span class="xn-location">OAK RIDGE, N.J.</span>, <…

regex html-content-extraction text-extraction
How do I extract lines from a file using their line number on unix?

Using sed or similar how would you extract lines from a file? If I wanted lines 1, 5, 1010, 20503 from a file, how …

unix sed awk line-numbers text-extraction
PDFminer: extract text with its font information

I found this question, but it uses the command line, and I do not want to call a Python script in …

python text-extraction pdfminer
Is there a way to get all text from the rendered page with JS?

Is there an (unobtrusive, to the user) way to get all the text in a page with JavaScript? I could …

javascript text text-extraction
How to extract Heading tags in PHP from a string?

From a string that contains a lot of HTML, how can I extract all the text from <h1>&…

php text-extraction domparser
Extracting text from PDF with Poppler (C++)

I'm trying to get my way through Poppler and its (lack of) documentation. What I want to do is a …

c++ pdf text-extraction poppler
How to install textract in python3

sudo python3 -m pip install textract; sudo apt-get install textract; pip install textract; sudo apt-get install swig. I want to …

python-3.5 text-extraction