Parsing file and regexp

o***@algosyn.com

2010-02-13 13:34:45 UTC

Hello,

I need to extract information from some text files, and I want to do it
with Perl!

The file I need to parse has the following layout:

keywordA word1, word2, word3;

Here we can have some free text
...
...

keywordB word4,
word5, word6, word7, word8,
word9, word10;

KeywordA
word1, word2;

...

I want to extract all the "keywords" with their associated words.
For example, with this file, I would like to have:
keywordA: (word1, word2, word3)
keywordB: (word4, word5, word6, word7, word8, word9, word10)
keywordA: (word1, word2)

Is it possible to do this with a regular expression?
Or should I write a small parser?

I have tried pattern matching with the 's' and also with the 'm'
option, but without good results.

Thanks for your help!

Olivier

Uri Guttman

2010-02-13 19:38:46 UTC

osc> keywordA word1, word2, word3;

osc> Here we can have some free text
osc> ...
osc> ...

osc> keywordB word4,
osc> word5, word6, word7, word8,
osc> word9, word10;

osc> KeywordA
osc> word1, word2;

osc> ...

how do you know when a keyword section begins or ends? how large is this
file? could free text have keywords? i see a ; to end a word list but
that isn't enough to properly parse this if you have 'free text'.

osc> I want to extract all the "keywords" with their associated words.
osc> For example, with this file, I would like to have:
osc> keywordA: (word1, word2, word3)
osc> keywordB: (word4, word5, word6, word7, word8, word9, word10)
osc> keywordA: (word1, word2)

osc> Is it possible to do this with regular expression ?
osc> Or should I write a small parser ?

yes and yes.

osc> I have tried pattern matching with the 's' and also with the 'm'
osc> option,
osc> but with no good result ...

please show your code. there is no way to help otherwise. s/// is not a
pattern matcher but a substitution operator. it uses regexes and can be
used to parse things.

uri

--
Uri Guttman ------ ***@stemsystems.com -------- http://www.sysarch.com --
----- Perl Code Review , Architecture, Development, Training, Support ------
--------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------

Olivier Scalbert

2010-02-14 09:46:36 UTC

Post by Uri Guttman
how do you know when a keyword section begins or ends? how large is this
file? could free text have keywords? i see a ; to end a word list but
that isn't enough to properly parse this if you have 'free text'.
osc> Is it possible to do this with regular expression ?
osc> Or should I write a small parser ?
yes and yes.
osc> I have tried pattern matching with the 's' and also with the 'm'
osc> option,
osc> but with no good result ...
please show your code. there is no way to help otherwise. s/// is not a
pattern matcher but a substitution operator. it uses regexes and can be
used to parse things.
uri

Hi Uri,

Sorry, the code is at my office!

The free text cannot contain keywords, and keywords start at the
beginning of a line. The list of words is terminated by a ";".

For the pattern matching I used the s option (m/pattern/s), so that
. also matches the \n characters.
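
The rules stated above (keywords at the start of a line, list ended by ";", no keywords in the free text) can be sketched roughly as follows. This is only an illustration under those assumptions: the sub name extract_sections and the explicit list of known keywords are hypothetical, not something given in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [ keyword, \@words ] pairs,
# in the order the sections appear in the text.
sub extract_sections {
    my ( $text, @keywords ) = @_;

    # Alternation of the known keywords (assumed to be supplied by the caller),
    # longest first so a longer keyword is tried before its own prefix.
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @keywords;

    my @sections;

    # /m lets ^ match at the start of every line. [^;]* already crosses
    # newlines, so /s is not required for this pattern.
    while ( $text =~ m{ ^ ($alt) \b \s* ([^;]*) ; }gmx ) {
        my ( $keyword, $list ) = ( $1, $2 );
        my @words = grep { length } split /[\s,]+/, $list;
        push @sections, [ $keyword, \@words ];
    }
    return @sections;
}

my $sample = <<'END';
keywordA word1, word2, word3;

Here we can have some free text

keywordB word4,
word5, word6;

keywordA
word1, word2;
END

for my $section ( extract_sections( $sample, qw(keywordA keywordB) ) ) {
    my ( $keyword, $words ) = @$section;
    print "$keyword: (", join( ', ', @$words ), ")\n";
}
```

Because free-text lines also begin with a word, the sketch relies on the keyword list to tell sections from free text. Without such a list, a pattern like ^(\w+) would treat the first word of a free-text line as a keyword.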

Olivier

Olivier Scalbert

2010-02-18 12:14:05 UTC

Post by Uri Guttman
please show your code. there is no way to help otherwise. s/// is not a
pattern matcher but a substitution operator. it uses regexes and can be
used to parse things.
uri

Here it is ...

$ cat test.txt
keyword1 word1, word2
word3;
blabla

blabla

keyword2
word4, word5,
word6, word7, word8,
word9;

bla bla
bla bla

keyword1
word10, word11;

$ cat parse.pl
use warnings;

open FILE, "< test.txt" or die "Could not open $!";
$/ = undef;
$source = <FILE>;
close(FILE);

if ($source =~ m/keyword1\s*(\w*)(,\w*)*/s) {
print("Match !\n");
print("$1\n");
print("$2\n");
}

$ perl parse.pl
Match !
word1
,

Here I would like to have 2 matches:
word1, word2
word3;
and word10, word11;
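
One possible reading of the output above: in (,\w*)* the comma is followed by a space, so \w* matches nothing and $2 ends up holding only ",". A repeated capture group also keeps just its last repetition, and without /g only the first keyword1 is found. A minimal sketch, assuming the test.txt contents shown above, captures the whole list up to the ";" and uses /g in list context:

```perl
use strict;
use warnings;

# Inline copy of test.txt so the sketch is self-contained.
my $source = <<'END';
keyword1 word1, word2
word3;
blabla

blabla

keyword2
word4, word5,
word6, word7, word8,
word9;

bla bla
bla bla

keyword1
word10, word11;
END

# In list context with /g, the match returns one captured string per match:
# everything between "keyword1" and the next ";".
my @lists = $source =~ m/^keyword1\b\s*([^;]*);/mg;

for my $list (@lists) {
    # Split on any run of commas and whitespace (including newlines).
    my @words = split /[\s,]+/, $list;
    print join( ', ', @words ), "\n";
}
```

The \b keeps keyword1 from also matching a longer name such as keyword10, and /m is what allows ^ to match at the start of each line rather than only at the start of the string.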

Thanks for your help!

Olivier

o***@algosyn.com

2010-02-19 07:58:06 UTC

(Sorry, I am having problems with my ISP, so I am reposting!)


Shawn H Corey

2010-02-19 19:43:31 UTC

Post by Olivier Scalbert
$ cat test.txt
keyword1 word1, word2
word3;
blabla
blabla
keyword2
word4, word5,
word6, word7, word8,
word9;
bla bla
bla bla
keyword1
word10, word11;

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

# Make Data::Dumper pretty
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent = 1;

# Set maximum depth for Data::Dumper, zero means unlimited
$Data::Dumper::Maxdepth = 0;

my $file = shift @ARGV;

my $source;
open my $source_fh, '<', $file or die "could not open $file: $!\n";
{
local $/;
$source = <$source_fh>;
}
close $source_fh;

my %keywords;
my @captured = $source =~ m{ ( keyword\d+ ) ( [^;]+ ) \; }gmsx;
while( @captured ){
my $keyword = shift @captured;
my $words = shift @captured;
$words =~ s{ \A \s+ }{}msx;
$words =~ s{ \s+ \z }{}msx;
my @words = split m{ \s* \, \s* }msx, $words;
push @{ $keywords{$keyword} }, @words;
}

print 'keywords: ', Dumper \%keywords;

__END__
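
Two observations on the script above, offered as a hedged variant rather than a correction. Pushing into %keywords merges every occurrence of a keyword into one list, whereas the original request showed each occurrence separately. Splitting only on commas also leaves "word2\nword3" as a single element when a line break stands in for the comma, as in the sample test.txt. A sketch that stores one array reference per occurrence and splits on whitespace as well, with the sample inlined and %occurrences as an illustrative name:

```perl
use strict;
use warnings;

# Shortened inline sample in the same shape as test.txt.
my $source = "keyword1 word1, word2\nword3;\nblabla\n"
           . "keyword2\nword4, word5;\nbla bla\n"
           . "keyword1\nword10, word11;\n";

my %occurrences;
while ( $source =~ m{ ^ ( keyword\d+ ) \b ( [^;]+ ) ; }gmx ) {
    my ( $keyword, $list ) = ( $1, $2 );

    # Split on commas and any whitespace; drop the empty leading field.
    my @words = grep { length } split /[\s,]+/, $list;

    # One array reference per occurrence instead of one flat list.
    push @{ $occurrences{$keyword} }, \@words;
}

for my $keyword ( sort keys %occurrences ) {
    for my $words ( @{ $occurrences{$keyword} } ) {
        print "$keyword: (", join( ', ', @$words ), ")\n";
    }
}
```

The hash still groups by keyword, so the order of sections across different keywords is not preserved. Keeping file order would call for a plain array of pairs instead.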
--
Just my 0.00000002 million dollars worth,
Shawn

Programming is as much about organization and communication
as it is about coding.

I like Perl; it's the only language where you can bless your
thingy.

Eliminate software piracy: use only FLOSS.