I am trying to find the text position in PDF page?
What I have tried is to get the text in the PDF page by PDF Text Extractor using simple text extraction strategy. I am looping each word to check if my word exists. split the words using:
var Words = pdftextextractor.Split(new char[] { ' ', '\n' });
What I wasn't able to do is to find the text position. The problem is I wasn't able to find the location of the text. All I need to find is the y co-ordinates of the word in the PDF file.
I was able to manipulate it with my previous version for Itext5. I don't know if you are looking for C# but that is what the below code is written in.
using iText.Kernel.Geom;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Data;
using iText.Kernel.Pdf.Canvas.Parser.Listener;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class TextLocationStrategy : LocationTextExtractionStrategy
{
private List<textChunk> objectResult = new List<textChunk>();
public override void EventOccurred(IEventData data, EventType type)
{
if (!type.Equals(EventType.RENDER_TEXT))
return;
TextRenderInfo renderInfo = (TextRenderInfo)data;
string curFont = renderInfo.GetFont().GetFontProgram().ToString();
float curFontSize = renderInfo.GetFontSize();
IList<TextRenderInfo> text = renderInfo.GetCharacterRenderInfos();
foreach (TextRenderInfo t in text)
{
string letter = t.GetText();
Vector letterStart = t.GetBaseline().GetStartPoint();
Vector letterEnd = t.GetAscentLine().GetEndPoint();
Rectangle letterRect = new Rectangle(letterStart.Get(0), letterStart.Get(1), letterEnd.Get(0) - letterStart.Get(0), letterEnd.Get(1) - letterStart.Get(1));
if (letter != " " && !letter.Contains(' '))
{
textChunk chunk = new textChunk();
chunk.text = letter;
chunk.rect = letterRect;
chunk.fontFamily = curFont;
chunk.fontSize = curFontSize;
chunk.spaceWidth = t.GetSingleSpaceWidth() / 2f;
objectResult.Add(chunk);
}
}
}
}
public class textChunk
{
public string text { get; set; }
public Rectangle rect { get; set; }
public string fontFamily { get; set; }
public int fontSize { get; set; }
public float spaceWidth { get; set; }
}
I also get down to each individual character because it works better for my process. You can manipulate the names, and of course the objects, but I created the textchunk to hold what I wanted, rather than have a bunch of renderInfo objects.
You can implement this by adding a few lines to grab the data from your pdf.
PdfDocument reader = new PdfDocument(new PdfReader(filepath));
FilteredEventListener listener = new FilteredEventListener();
var strat = listener.AttachEventListener(new TextExtractionStrat());
PdfCanvasProcessor processor = new PdfCanvasProcessor(listener);
processor.ProcessPageContent(reader.GetPage(1));
Once you are this far, you can pull the objectResult from the strat by making it public or creating a method within your class to grab the objectResult and do something with it.