Configuring Hadoop and Spark on macOS

First, prepare the macOS environment.

The Java, Scala, and Python installations are skipped here; we start from Hadoop and Spark.

Installing Hadoop

The simplest way to install Hadoop is via Homebrew:

brew install hadoop

Locate the installation directory

After installation, locate the Hadoop configuration directory:
cd /usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop

Edit core-site.xml

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/Cellar/hadoop/2.7.3/libexec/tmp</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>

Edit hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/Cellar/hadoop/2.7.3/libexec/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/Cellar/hadoop/2.7.3/libexec/tmp/dfs/data</value>
  </property>
</configuration>

Add environment variables (e.g., in ~/.bash_profile)

# Hadoop environment configuration
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.3/libexec
export PATH=$PATH:${HADOOP_HOME}/bin

Format HDFS

cd /usr/local/Cellar/hadoop/2.7.3/bin  
./hdfs namenode -format

Start Hadoop

cd /usr/local/Cellar/hadoop/2.7.3/sbin  
./start-all.sh

Run jps in the terminal to see the Java processes:

1206 DataNode  
1114 NameNode
1323 SecondaryNameNode
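All three HDFS daemons should be present before moving on. As a small sketch, the same check can be scripted; the jps output below is hardcoded for illustration, and in practice you would capture the real output with subprocess as shown in the comment:

```python
# Sketch: verify the expected HDFS daemons appear in `jps` output.
# The sample output is hardcoded for illustration; in practice use:
#   import subprocess
#   jps_output = subprocess.check_output(["jps"]).decode()
jps_output = """1206 DataNode
1114 NameNode
1323 SecondaryNameNode"""

# The second column of each line is the process name.
running = {line.split()[1] for line in jps_output.splitlines() if line.strip()}
expected = {"NameNode", "DataNode", "SecondaryNameNode"}
missing = expected - running
print("missing daemons:", sorted(missing))  # → missing daemons: []
```

If any daemon is missing, recheck the XML configuration and the format step above.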

Installing Spark

Spark is also installed with brew:

brew install apache-spark

Locate the installation directory

Find the Spark configuration directory:

cd /usr/local/Cellar/apache-spark/2.1.0/libexec/conf

Edit spark-env.sh

cp spark-env.sh.template spark-env.sh
vi spark-env.sh
export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home

Add environment variables

export SPARK_HOME=/usr/local/Cellar/apache-spark/2.1.0/libexec  
export PATH=$PATH:${SPARK_HOME}/bin

Start Spark

cd /usr/local/Cellar/apache-spark/2.1.0/libexec/sbin
./start-all.sh

Check the processes

jps

6052 Worker
6022 Master
6728 Jps
5546 NameNode
5739 SecondaryNameNode
5947 NodeManager
5630 DataNode
5855 ResourceManager

Configuring PyCharm for Spark development

Open PyCharm (my Python version is 2.7).
Create a new project (xxxx) and add a new file: a simple word count.

from pyspark import SparkContext

logFile = "/Users/admin/Desktop/BackUp"
sc = SparkContext("local", "Simple App")
logData = sc.textFile(logFile).cache()

numAs = logData.filter(lambda s: 'a' in s).count()
numBs = logData.filter(lambda s: 'b' in s).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
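The counting logic itself can be sanity-checked without Spark by running the same filters over a small in-memory list; the sample lines below are hypothetical:

```python
# Plain-Python sanity check of the same filter/count logic.
# The sample lines are hypothetical stand-ins for the log file's contents.
lines = ["hadoop and spark", "brew install", "jps", "namenode"]

num_as = sum(1 for s in lines if 'a' in s)  # lines containing 'a'
num_bs = sum(1 for s in lines if 'b' in s)  # lines containing 'b'

print("Lines with a: %i, lines with b: %i" % (num_as, num_bs))
# → Lines with a: 3, lines with b: 1
```

This mirrors what the two `filter(...).count()` calls compute, just without a SparkContext.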

Press F4 to open the run configuration for the current script, then fill in the Environment Variables field:

PYTHONPATH    /usr/local/Cellar/apache-spark/2.1.0/libexec/python
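As an alternative to the IDE setting, the same effect can be approximated inside the script itself by prepending Spark's python directory to sys.path before importing pyspark. This is a sketch: the path assumes the Homebrew install used above, so adjust the version to match yours:

```python
import os
import sys

# Sketch: make pyspark importable without configuring the IDE.
# The path assumes the Homebrew layout used in this guide; adjust as needed.
spark_home = "/usr/local/Cellar/apache-spark/2.1.0/libexec"
sys.path.insert(0, os.path.join(spark_home, "python"))

# After this, `from pyspark import SparkContext` should resolve,
# provided the directory actually exists on this machine.
print(sys.path[0])  # → /usr/local/Cellar/apache-spark/2.1.0/libexec/python
```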

At this point, the environment setup is complete.